From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: MIPS-UNIX-context switch
Date: 11 Jul 1995 19:38:02 GMT

In article <3teqdm$hk6@data.interserv.net>, levine@amarex.com writes:
|> Organization: Amarex Technology - High Speed Glitch Division
|>
|> I have a question which I would like to direct only to those who have
|> MIPS 3000 knowledge.  Given a MIPS 3000 chip, an I-cache, a D-cache
|> and main mem.  If this configuration were made into a UNIX-based
|> machine, where would the majority of time be spent for every context
|> switch?  e.g. saving regs, clearing TLB...

a) Not saving regs: figure that you save:
   33 integer registers [R1-R31 + HI + LO],
   some number of CP0 registers, let's say 7,
   and you might or might not arrange to save 32 32-bit FP registers,
   depending on how your OS wants to work.
   Assuming the interrupt sequence is in the I-cache, and a good memory
   system, saving 40 registers = 40 cycles; @ 40 MHz, = 1 microsecond.
   Real systems would likely be slower, so guess a couple of
   microseconds.  Restoring another register set is likely to take
   cache misses, so it costs a few microseconds more.

b) You don't need to clear the TLB, since there are Address-Space IDs,
   such that you only need to flush the TLB every time you see >64
   distinct processes.  You would normally reload a handful of TLB
   entries, then let other missed entries fault in.  Base cost: a few
   microseconds.  Most OS's use the trickery of the MIPS TLB
   direct-mapped region to avoid TLB misses for kernel code.  Caches
   are physically-tagged, so you get whatever sharing is really there.

c) In UNIX, most of the time goes to UNIXy scheduling & overhead, and
   executing unpredictable code paths and accessing state data likely
   to be cache misses.  Register save/restore is likely a factor only
   in very tight embedded control systems.

-john mashey    DISCLAIMER:
UUCP: mash@sgi.com   DDD: 415-390-3090   FAX: 415-967-8496
USPS: Silicon Graphics 6L-005, 2011 N. Shoreline Blvd,
      Mountain View, CA 94039-7311
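[To make the register inventory in (a) concrete, here is a minimal C
sketch of a context-save area sized to those counts; the struct name,
field names, and choice of CP0 registers are illustrative assumptions,
not any actual kernel's layout.]

    /* Hypothetical R3000-class context-save area, sized per the post:
     * 31 GPRs (R1-R31) + HI + LO = 33 integer words, ~7 CP0 words,
     * and optionally 32 32-bit FP registers. */
    #include <stdint.h>
    #include <stdio.h>

    struct mips_context {  /* illustrative layout, not a real kernel's */
        uint32_t gpr[31];  /* R1..R31; R0 is hardwired zero, never saved */
        uint32_t hi, lo;   /* multiply/divide result registers */
        uint32_t cp0[7];   /* e.g. SR, EPC, and other system-control state */
        uint32_t fpr[32];  /* FP registers, saved only if FP is in use */
    };

    int main(void) {
        /* One store per register: the 33 + 7 = 40 integer/CP0 saves are
         * the ~40 cycles (~1 microsecond at 40 MHz) counted above. */
        printf("context-save area: %zu bytes\n",
               sizeof(struct mips_context));
        return 0;
    }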
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Stack vs GPR and Multi-threading (was Re: A Series Compilers)
Date: 12 Jul 1995 17:44:10 GMT

In article <1995Jul12.143336.21769@il.us.swissbank.com>,
gerryg@il.us.swissbank.com (Gerald Gleason) writes:
|> If I'm interpreting what you are saying correctly, it is that in terms of
|> total system performance, register save/restore is a much smaller
|> opportunity than the latency associated with bad locality in various
|> forms.  A multi-threaded processor might be able to fill in most of what
|> would be idle time waiting for cache misses doing useful work on another
|> thread.  The issue of multi-threading is somewhat orthogonal to GPR vs

Yes, and there is a reasonable separate thread running on multi-threaded
CPUs, including contributions from people who have built, or are
building, them.

But for sure, I think that much of the worry people have about register
save/restore exists because it's simpler to worry about than, for
example, all these latency and probabilistic arguments; i.e., it's the
equivalent of the "coffee-fund paradox":
a) If the coffee-pot fund is running low, a committee will debate long
   and hard about the solution thereof.
b) But then the committee must vote on a $10B appropriation.  Little
   debate: how many people really grasp $10B? :-)

This is not to say register save/restore time is unimportant ... but
every time I've done the cycle-by-cycle counts on a real implementation,
running a general-purpose OS, I got convinced I should worry about other
things more.

-john mashey    DISCLAIMER:
UUCP: mash@sgi.com   DDD: 415-390-3090   FAX: 415-967-8496
USPS: Silicon Graphics 6L-005, 2011 N. Shoreline Blvd,
      Mountain View, CA 94039-7311
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 7 Nov 1997 18:31:04 GMT

In article <63vhbo$hmk$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk
(Nick Maclaren) writes:
|> Yes, it is.  But even with hardware reloads, a TLB miss is often
|> much more expensive than a cache miss (sometimes 5-10 times more).
|> With software reloads, they are death on wheels :-(

Since "death on wheels" is difficult to evaluate, but clearly conveys
the thought that this is a bad idea, let us observe: software-reloaded
TLBs are widely used; in fact, many of the microprocessor types commonly
used to run large programs on large datasets "happen" to do this,
specifically:
- PA-RISC & MIPS, from 1986 onward
- DEC Alphas, 1992-
- Sun UltraSPARCs, 1995-

Consider the kind of code used to start this example: FORTRAN code with
big floating-point arrays, an area of interest to RISC chips.  Of the 5
major RISC micro families, 4 have chosen to use software-reloaded TLBs,
with IBM being the main exception.

Now, when we published info about MIPS RISC in 1986, most people
(outside of HP & MIPS) thought software-reloaded TLBs were crazy ... but
from 1986 through 2000, I count 6 *new* micro architectures used in
systems where large memories & TLBs might be relevant: [PA-RISC, MIPS,
SPARC, IBM POWER/PPC, Alpha, IA64], and of those 6, the current
implementations of 4 use software-reloaded TLBs, 1 doesn't, and one
(IA64) remains to be seen.

There are many reasons of flexibility and debuggability to have a
software TLB, some of which were covered in the 1986 COMPCON paper
"Operating System Support on a RISC" (DeMoney, Moore, Mashey).

It should be no surprise that designers of chips study TLB-miss
overhead, and try to allocate resources appropriately.  In modern
systems:

1) A TLB miss, in software, may actually take *less* time than a cache
   miss.  Why is that?
   TLB miss:
     a) Miss
     b) Trap
     c) Refill TLB, making one or more memory references, which *may*
        well hit in the off-chip data cache.
     d) Return
   Cache miss:
     a) Miss
     b) Schedule cache miss to memory, which can be a very long time in
        some ccNUMA systems, but is easily 300-600 ns in many SMPs.
        With clock cycles in the 2-5 ns range, that's 60-300 clocks,
        and with 2-4-issue superscalar chips, that's 120-1200
        instruction slots.

Now, of course, there are also TLB misses that take longer than cache
misses, but in fact, whether a refill is done by a trap to software, or
by a hardware engine, the time is:
        T = C + N * M
where:
        C = ~constant overhead time
        M = time for a cache miss
        N = number of cache misses caused by doing TLB processing

If there are a lot of TLB misses, the TLB-miss code ends up living in
the on-chip L1 I-cache.  If the PTE structures for hardware or software
versions are the same, there will be about the same number of accesses
to memory.  In some cases historically, the complexity of TLB
table-walks in memory has demanded that PTEs *not* be cacheable, hence
giving up the ability to use the cache as a backing store for the TLB
... which is trivial and straightforward to accomplish in a
software-controlled TLB.

TLBs are famous for the weird bugs and odd cases in many early micros,
which is why OS people were often the ones who preferred
software-controlled ones as less troublesome.

For the long term, one can either make TLBs larger (more entries), or
allow entries to have multiple sizes ... and the industry seems to be
tending towards the latter; the R4000, in 1992, went this way, because
we couldn't figure out how to keep up with 4X/3 years in memory sizes,
for on-chip data structures that had to be fast.

Bottom line: Nick's characterization of software TLBs as "death on
wheels", in general, flies in the face of increasing use of this
technique by very experienced CPU designers.

--
-john mashey    DISCLAIMER:
EMAIL: mash@sgi.com   DDD: 650-933-3090   FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd,
      Mountain View, CA 94043-1389
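[The T = C + N*M identity above is easy to turn into rough numbers.
Here is a small C sketch; the 50 ns trap overhead is an assumed figure
for illustration only, and the miss time is just the middle of the SMP
range quoted in the post, not a measurement of any machine.]

    /* Sketch of the TLB-refill cost model T = C + N*M from the post.
     * All constants are illustrative assumptions. */
    #include <stdio.h>

    /* C = fixed trap/sequencing overhead, N = cache misses taken while
     * fetching PTEs, M = time for one cache miss to memory. */
    static double refill_ns(double C, int N, double M) {
        return C + N * M;
    }

    int main(void) {
        double M = 450.0;  /* mid of the 300-600 ns SMP miss range */
        double C = 50.0;   /* assumed software-trap overhead */

        /* PTEs found in cache (N=0): only the trap overhead remains,
         * so the software refill can undercut an ordinary miss. */
        printf("sw refill, PTEs cached: %.0f ns\n", refill_ns(C, 0, M));
        /* One PTE fetch goes to memory: hardware or software, the
         * miss term dominates. */
        printf("refill, one PTE miss:   %.0f ns\n", refill_ns(C, 1, M));
        printf("plain data-cache miss:  %.0f ns\n", M);
        return 0;
    }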
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 7 Nov 1997 21:50:52 GMT

In article <63vrmv$nd5$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk
(Nick Maclaren) writes:
|> >In modern systems:
-------^^^^^^
|> >1) A TLBmiss, in software, may actually take *less* time than actually
|> >doing a cache miss.  Why is that?
|>
|> Well, on the machines that I have tried (and they DO include some of the
|> ones you mentioned), a TLB miss is usually significantly more expensive.
|> The factor of 5-10 times was based both on measurement, as well as some
|> figures given by current hardware designers for their chips.

It would be helpful to quote some of these, since (date of system) is
fairly important in this discussion, given that, in 1986, we had 8 MHz
(125 ns) single-issue CPUs and DRAM with raw read times of ~120 ns,
while we now have 2-4-issue CPUs in the 200-500 MHz (5-2 ns) range, raw
DRAMs of ~60 ns, and total cache-miss times in the 200-500 ns range for
SMPs and local nodes, 400-1000 ns for low-latency ccNUMAs, and maybe
200-3000 ns for higher-latency ones.

|> >Now, of course, there are also TLBmisses that take longer than cache
|> >misses, but in fact, whether a refill is done by a trap to software,
|> >or by a hardware engine, the time is:
|> >    T = C + N * M
|> >    C = ~constant overhead time
|> >    M = time for cache miss
|> >    N = number of cache misses caused by doing TLB processing
|>
|> I think that you are being misleading - in fact, I am certain.  In many
|> or most architectures, handling a miss in software involves a context
|> switch.  Not a full context switch, to be sure, but the CPU has to move
|> from DAT-on in user mode to DAT-off in kernel mode.  This means that the
|> constant overhead is potentially a great deal larger than for the
|> hardware solution.

You may be certain, but you are incorrect.  There is no context switch
(as most people use the term, i.e., from one user task to another user
task).  I don't recall exactly what the Alpha & UltraSPARC folks do, but
they're not idiots, so presumably they do something similar to what HP &
MIPS have done for a long time: there is a special, low-overhead trap to
the OS, and it has nothing to do with turning DAT on & off.  HP provided
some special registers to make this faster; MIPS used the "hack" of
telling user code that there were 2 registers it could expect to be
trashed at any time, so the kernel doesn't even have to save/restore
those registers; there were enough registers to get away with this.
Various chunks of hardware are added to make extractions or virtual
references (in Alpha's case, anyway) faster, where the issue is a series
of dependent operations that are trivial to do in hardware, leaving the
sequencing and control in software.

Note: some of the beliefs here come from a long discussion in a
Cupertino bar with PA-RISC architects, of the form "why did you do this?
we did that..."  There was some head-slapping when I said we hadn't had
to do special registers, although I had tried to get 3 rather than 2 for
the kernel, but the compiler people wouldn't give me the third one.

I wrote the original MIPS version of such code in early 1986, Steve
Stone tuned it up, and we identified various simple hardware that could
help.  I have the original code somewhere, but couldn't find it; here
was Steve's version as of April 1, 1985:

From scs Mon Apr  1 16:02:19 1985
From: scs (Steve Stone)
Subject: user TLB miss.

I have been trying to reduce the number of instructions involved in
resolving a user tlbmiss.  The best that I can do (with some hardware
changes assumed) is around 15 cycles (assuming no cache misses).  The
following is a first cut at the problem.

The following hardware features are assumed:
- There are separate UTLBMISS/KTLBMISS cause bits.
- The EPC is predecremented by hardware if the branch delay bit is set
  in the SR.  I know this is difficult to implement.  One possible way
  around this is to separate out UTLBMISS in a branch delay slot from
  other UTLBMISSes.
- At the time of a UTLBMISS, the TLBENHI register is set up correctly
  (the TLBPID is or'd in and the VPN is correct).
- There are two registers usable by the kernel only.  The state of
  these registers is never saved and can only be trusted while
  interrupts are disabled (called RT1 and RT2).

Here is the exception handler code:

	/*
	 * Grab the cause bits.  User tlbmiss should be handled quickly
	 * if possible (i.e. the only cause for the exception).
	 */
	mfcause	RT1
	sub	RT1,CAUSE_UTLBMISS
	bne	RT1,r0,exc_noutlbmiss
	/*
	 * - Grab the VPN/TLBPID register from CP0.
	 * - Isolate the VPN in the low order bits * 4.
	 * - Add in the USERPTBASE constant (in kseg3).  The high order
	 *   bit of the VPN will have been set in the TLBENHI.  This
	 *   should be taken into consideration when choosing the
	 *   USERPTBASE location.
	 */
	mfc0	RT1,TLBENHI
	lsr	RT1,TLBPIDSZ-2
	and	RT1,~3
	la	RT2,USERPTBASE
	add	RT1,RT2
	/*
	 * We now have a pointer to the TLB entry.  Grab it.  A fault
	 * may occur here.  If so, the KTLBMISS handler will have to
	 * be smart enough to reset RT1 to be the original PTE pointer
	 * and reset the c0 registers so the following code will work.
	 */
	lw	RT1,0(RT1)
	/*
	 * If the PTE is invalid, handle the long way.
	 */
	and	RT2,TLB_V,RT1
	beq	RT2,r0,exc_upteinval
	mtc0	RT1,TLBENLO
	c0	TLBWRITE
	nop
	rfe
	nop

|> You are effectively saying that this case has been optimised so much
|> that it is no longer significant.  That is most interesting.

Actually, I didn't say that.  It is sometimes significant for certain
programs.  However, truly big programs often want big pages anyway, so
once you figure out how to do that in the general case, you are better
off than shaving a few cycles off something with a terrible TLB-miss
rate.  This problem gets studied every time, and the general approach is
to give the TLB some more resource, but worry a lot more about cache
misses, which are way more frequent for most codes:

a) If a program has a low TLB-miss rate:
   a1) if the cache-miss rate is low, all is well.
   a2) if the cache-miss rate is high, then that's the problem.
b) If the program has a high TLB-miss rate:
   b1) If the cache-miss rate is high, you're down to DRAM speed, and
       either you have a problem for a vector machine, or you need to
       be doing cache-blocking anyway.
   b2) If the cache-miss rate is low, then the TLB is actually the
       bottleneck.

Many designers have never been able to find enough (b2) programs to
justify huge amounts of hardware to help the TLB.  Note, of course,
that IBM RS/6000s have a fairly different philosophy in various ways.

|> >TLBs are famous for the weird bugs and odd cases in many early micros,
|> >which is why OS people were often the ones who preferred
|> >software-controlled ones as less troublesome.
|>
|> Don't you really mean that the bugs are easier to fix, and hence less
|> embarrassing :-)

Not exactly.  What I meant was that almost any OS person involved in the
design of the first round of RISC chips had had experience with early
micros, and with running into weird-case bugs late in the development
cycle, with complex hardware logic that took full chip spins to fix.
It isn't a question of embarrassment, it's a question of whether or not
you can ship a product.  The following has been known to happen, when
designing new systems with brand new micros:
(a) System comes up.
(b) Debug it, looks good.
(c) Get a bunch of systems ready, be running QA.
(d) Fix a bug in the C compiler.
(e) Some instruction moves 2 bytes, crosses a page boundary, regression
    tests start breaking 4 weeks before shipment; it takes 2 weeks to
    figure out exactly what is happening.
(f) Then you realize that the odd case could potentially happen with
    any user-compiled program, and it is a bug in the microcode, and
    it's going to be 3 months before it gets fixed ... and you're dead.
The MIPS utlbmiss code has often been diddled to work around some odd
hardware error, so that you can get beyond the first ones to see what
else there is.

|> >Bottom line: Nick's characterization of software TLBs as "death on
|> >wheels", in general, flies in the face of increasing use of this
|> >technique by very experienced CPU designers.
|>
|> I accept your correction!  I stand by my point that TLB misses are
|> generally "death on wheels", but it is very likely that I have been
|> using software implementations that I thought were hardware :-)
|>
|> I also take your point that TLB misses are becoming less expensive as
|> time goes on, in a way that cache misses are not.  But I don't believe
|> that the turnover point has yet arrived!

Hmmm.  I thought your point was that "death on wheels" was equivalent
to "software-reloaded TLBs are a bad idea and should be done away with."
Was that a misinterpretation?

I'm not sure what "turnover point" means.  A CPU designer has to provide
a set of facilities, which for systems-type chips includes cache + MMU,
with various tradeoffs.  All that's been happening is that countless
studies have convinced many designers that they can avoid a bunch of
complex microcode, or worse, a lot of random logic with touchy special
cases, in favor of a low-overhead trap to a small piece of code; and
that if it takes a few more cycles to do the logic, it takes less die
space, is more flexible, and the times are increasingly dominated by
the time to fetch PTEs from memory anyway.

--
-john mashey    DISCLAIMER:
EMAIL: mash@sgi.com   DDD: 650-933-3090   FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd,
      Mountain View, CA 94043-1389
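[For readers who don't read R2000-era assembly, here is a hedged C
rendering of the fast path in Steve's handler above; the stub functions
(tlb_enhi, tlb_write, long_path), the table size, and the constants are
illustrative stand-ins for the CP0 operations, not real kernel
interfaces.]

    /* C rendering of the UTLBMISS fast path above; all names are
     * hypothetical stand-ins.  The real handler is the ~15
     * instructions of assembly shown in the post. */
    #include <stdint.h>

    #define TLBPIDSZ 6      /* bits of address-space ID in TLBENHI */
    #define TLB_V    0x1u   /* "valid" bit in a PTE (illustrative) */

    static uint32_t user_page_table[1024]; /* stands in for USERPTBASE */

    /* Stubs standing in for CP0 access and the slow path. */
    static uint32_t tlb_enhi(void) { return 7u << TLBPIDSZ; }
    static void tlb_write(uint32_t pte) { (void)pte; /* mtc0+TLBWRITE */ }
    static void long_path(void) { /* invalid PTE: full page fault */ }

    static void utlbmiss(void)
    {
        /* TLBENHI holds VPN<<TLBPIDSZ | TLBPID; the lsr/and pair in
         * the assembly computes VPN*4 as a byte offset, which is the
         * same word index used here. */
        uint32_t idx = tlb_enhi() >> TLBPIDSZ;
        uint32_t pte = user_page_table[idx]; /* this load may itself
                                                miss (KTLBMISS case) */
        if (!(pte & TLB_V)) {   /* invalid: handle the long way */
            long_path();
            return;
        }
        tlb_write(pte);         /* refill the entry, then rfe */
    }

    int main(void) { utlbmiss(); return 0; }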
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 9 Nov 1997 06:31:38 GMT

In article <641ecg$ot1$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk
(Nick Maclaren) writes:
|> Also, I am talking about the TOTAL effect on application speed, and not
|> just the raw cost of processing the problem.  The problem with TLB misses
|> (and, generally, anything that needs a trap) is that the indirect costs
|> are often larger than the direct ones.  Things like conflict for the
|> first-level cache, interference with coprocessors and so on.

|> Well, in MY book, that is a partial context switch, and the TLB refilling
|> is being done by a hybrid hardware/software solution!  But I accept your
|> point that the TLB miss handler 'context' is both minimal and permanently
|> available.
...
|> >Hmmm.  I thought your point was that "death on wheels" was equivalent
|> >to "software-reloaded TLBs are a bad idea and should be done away with."
|> >Was that a misinterpretation?
|>
|> Yes and no.  It IS what I meant, but we were clearly talking about
|> different things!  I have no problem with the solutions that you have
|> described, but I would call them hybrid solutions.

"But `glory' doesn't mean `a nice knock-down argument,'" Alice objected.
"When *I* use a word," Humpty Dumpty said, in a rather scornful tone,
"it means just what I choose it to mean - neither more nor less."

Occasionally, discussion threads get going where people attempt to
modify "standard" terminology, resulting in massive confusion.  Usually
I stop reading the thread at that point.  *I* use the terms "context
switch", "trap", "hardware TLB", and "software-reloaded TLB" (or just
"software TLB") the same way as other people do who actually design
chips and OS's for a living, and I propose to people reading this
newsgroup that more things will make sense if they do the same, that is:

1) A context switch switches state from one process/task to another.
   Maybe someone uses the term "partial context switch" to mean "trap";
   I'll admit I've never heard it.

2) A *trap* directs the program flow to a kernel address, which:
   - Takes action and returns very quickly, as in a normal MIPS
     UTLBMISS trap.
   - Takes action and returns more slowly, as in a 2-level UTLBMISS,
     or some system calls.
   - Takes action that eventually turns into a context switch, as in a
     system call that causes a real I/O & a reschedule to another
     process, or a UTLBMISS that is discovered to actually be a page
     fault.

3) A "hardware TLB" usually means a TLB which, if the desired entry is
   not present, performs a table-walk, or other appropriate mechanism,
   to reload the TLB entry from memory, without causing a trap for
   normal refills.  Such mechanisms were used in the 360/67, 370...,
   VAX, and most early micro TLBs.  Depending on the implemen...