From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: MIPS-UNIX-context switch
Date: 11 Jul 1995 19:38:02 GMT

In article <3teqdm$hk6@data.interserv.net>, levine@amarex.com writes:
|> Organization: Amarex Technology - High Speed Glitch Division
|>
|> I have a question which I would like to direct only to those who have
|> MIPS 3000 knowledge.  Given a MIPS 3000 chip, an I-cache, a D-cache
|> and main mem.  If this configuration were made into a UNIX-based
|> machine, where would the majority of time be spent for every context
|> switch?  e.g. saving regs, clearing TLB...

a) Not saving regs: figure that you save:
   33 integer registers [R1-R31 + HI + LO],
   some number of CP0 registers, let's say 7,
   and you might or might not arrange to save 32 32-bit FP registers,
   depending on how your OS wants to work.
   Assuming the interrupt sequence is in the I-cache, and a good memory
   system, saving 40 registers = 40 cycles; @ 40 MHz, = 1 microsecond.
   Real systems would likely be slower, so guess a couple of
   microseconds.  Restoring another register set is likely to take
   cache misses, so it costs a few microseconds more.

b) You don't need to clear the TLB, since there are Address-Space IDs,
   such that you only need to flush the TLB every time you see >64
   distinct processes.  You would normally reload a handful of TLB
   entries, then let other missed entries fault in.  Base cost: a few
   microseconds.  Most OS's use the trickery of the MIPS TLB
   direct-mapped region to avoid TLB misses for kernel code.  Caches
   are physically-tagged, so you get whatever sharing is really there.

c) In UNIX, most of the time goes to UNIXy scheduling & overhead, and
   executing unpredictable code paths and accessing state data likely
   to be cache misses.  Register save/restore is likely a factor only
   in very tight embedded control systems.

-john mashey    DISCLAIMER:
UUCP: mash@sgi.com   DDD: 415-390-3090   FAX: 415-967-8496
USPS: Silicon Graphics 6L-005, 2011 N. Shoreline Blvd,
      Mountain View, CA 94039-7311
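[To make the register inventory in (a) concrete, here is a minimal C
sketch of a context-save area sized to those counts; the struct name,
field names, and choice of CP0 registers are illustrative assumptions,
not any actual kernel's layout.]

    /* Hypothetical R3000-class context-save area, sized per the post:
     * 31 GPRs (R1-R31) + HI + LO = 33 integer words, ~7 CP0 words,
     * and optionally 32 32-bit FP registers. */
    #include <stdint.h>
    #include <stdio.h>

    struct mips_context {  /* illustrative layout, not a real kernel's */
        uint32_t gpr[31];  /* R1..R31; R0 is hardwired zero, never saved */
        uint32_t hi, lo;   /* multiply/divide result registers */
        uint32_t cp0[7];   /* e.g. SR, EPC, and other system-control state */
        uint32_t fpr[32];  /* FP registers, saved only if FP is in use */
    };

    int main(void) {
        /* One store per register: the 33 + 7 = 40 integer/CP0 saves are
         * the ~40 cycles (~1 microsecond at 40 MHz) counted above. */
        printf("context-save area: %zu bytes\n",
               sizeof(struct mips_context));
        return 0;
    }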
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Stack vs GPR and Multi-threading (was Re: A Series Compilers)
Date: 12 Jul 1995 17:44:10 GMT

In article <1995Jul12.143336.21769@il.us.swissbank.com>,
gerryg@il.us.swissbank.com (Gerald Gleason) writes:
|> If I'm interpreting what you are saying correctly, it is that in terms of
|> total system performance, register save/restore is a much smaller
|> opportunity than the latency associated with bad locality in various
|> forms.  A multi-threaded processor might be able to fill in most of what
|> would be idle time waiting for cache misses doing useful work on another
|> thread.  The issue of multi-threading is somewhat orthogonal to GPR vs

Yes, and there is a reasonable separate thread running on multi-threaded
CPUs, including contributions from people who have built, or are
building, them.

But for sure, I think that much of the worry people have about register
save/restore exists because it's simpler to worry about than, for
example, all these latency and probabilistic arguments; i.e., it's the
equivalent of the "coffee-fund paradox":
a) If the coffee-pot fund is running low, a committee will debate long
   and hard about the solution thereof.
b) But then the committee must vote on a $10B appropriation.  Little
   debate: how many people really grasp $10B? :-)

This is not to say register save/restore time is unimportant ... but
every time I've done the cycle-by-cycle counts on a real implementation,
running a general-purpose OS, I got convinced I should worry about other
things more.

-john mashey    DISCLAIMER:
UUCP: mash@sgi.com   DDD: 415-390-3090   FAX: 415-967-8496
USPS: Silicon Graphics 6L-005, 2011 N. Shoreline Blvd,
      Mountain View, CA 94039-7311
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 7 Nov 1997 18:31:04 GMT

In article <63vhbo$hmk$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk
(Nick Maclaren) writes:
|> Yes, it is.  But even with hardware reloads, a TLB miss is often
|> much more expensive than a cache miss (sometimes 5-10 times more).
|> With software reloads, they are death on wheels :-(

Since "death on wheels" is difficult to evaluate, but clearly conveys
the thought that this is a bad idea, let us observe: software-reloaded
TLBs are widely used; in fact, many of the microprocessor types commonly
used to run large programs on large datasets "happen" to do this,
specifically:
- PA-RISC & MIPS, from 1986 onward
- DEC Alphas, 1992-
- Sun UltraSPARCs, 1995-

Consider the kind of code used to start this example: FORTRAN code with
big floating-point arrays, an area of interest to RISC chips.  Of the 5
major RISC micro families, 4 have chosen to use software-reloaded TLBs,
with IBM being the main exception.

Now, when we published info about MIPS RISC in 1986, most people
(outside of HP & MIPS) thought software-reloaded TLBs were crazy ... but
from 1986 through 2000, I count 6 *new* micro architectures used in
systems where large memories & TLBs might be relevant: [PA-RISC, MIPS,
SPARC, IBM POWER/PPC, Alpha, IA64], and of those 6, the current
implementations of 4 use software-reloaded TLBs, 1 doesn't, and one
(IA64) remains to be seen.

There are many reasons of flexibility and debuggability to have a
software TLB, some of which were covered in the 1986 COMPCON paper
"Operating System Support on a RISC" (DeMoney, Moore, Mashey).

It should be no surprise that designers of chips study TLB-miss
overhead, and try to allocate resources appropriately.  In modern
systems:

1) A TLB miss, in software, may actually take *less* time than a cache
   miss.  Why is that?
   TLB miss:
     a) Miss
     b) Trap
     c) Refill TLB, making one or more memory references, which *may*
        well hit in the off-chip data cache.
     d) Return
   Cache miss:
     a) Miss
     b) Schedule cache miss to memory, which can be a very long time in
        some ccNUMA systems, but is easily 300-600 ns in many SMPs.
        With clock cycles in the 2-5 ns range, that's 60-300 clocks,
        and with 2-4-issue superscalar chips, that's 120-1200
        instruction slots.

Now, of course, there are also TLB misses that take longer than cache
misses, but in fact, whether a refill is done by a trap to software, or
by a hardware engine, the time is:
        T = C + N * M
where:
        C = ~constant overhead time
        M = time for a cache miss
        N = number of cache misses caused by doing TLB processing

If there are a lot of TLB misses, the TLB-miss code ends up living in
the on-chip L1 I-cache.  If the PTE structures for hardware or software
versions are the same, there will be about the same number of accesses
to memory.  In some cases historically, the complexity of TLB
table-walks in memory has demanded that PTEs *not* be cacheable, hence
giving up the ability to use the cache as a backing store for the TLB
... which is trivial and straightforward to accomplish in a
software-controlled TLB.

TLBs are famous for the weird bugs and odd cases in many early micros,
which is why OS people were often the ones who preferred
software-controlled ones as less troublesome.

For the long term, one can either make TLBs larger (more entries), or
allow entries to have multiple sizes ... and the industry seems to be
tending towards the latter; the R4000, in 1992, went this way, because
we couldn't figure out how to keep up with 4X/3 years in memory sizes,
for on-chip data structures that had to be fast.

Bottom line: Nick's characterization of software TLBs as "death on
wheels", in general, flies in the face of increasing use of this
technique by very experienced CPU designers.

--
-john mashey    DISCLAIMER:
EMAIL: mash@sgi.com   DDD: 650-933-3090   FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd,
      Mountain View, CA 94043-1389
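[The T = C + N*M identity above is easy to turn into rough numbers.
Here is a small C sketch; the 50 ns trap overhead is an assumed figure
for illustration only, and the miss time is just the middle of the SMP
range quoted in the post, not a measurement of any machine.]

    /* Sketch of the TLB-refill cost model T = C + N*M from the post.
     * All constants are illustrative assumptions. */
    #include <stdio.h>

    /* C = fixed trap/sequencing overhead, N = cache misses taken while
     * fetching PTEs, M = time for one cache miss to memory. */
    static double refill_ns(double C, int N, double M) {
        return C + N * M;
    }

    int main(void) {
        double M = 450.0;  /* mid of the 300-600 ns SMP miss range */
        double C = 50.0;   /* assumed software-trap overhead */

        /* PTEs found in cache (N=0): only the trap overhead remains,
         * so the software refill can undercut an ordinary miss. */
        printf("sw refill, PTEs cached: %.0f ns\n", refill_ns(C, 0, M));
        /* One PTE fetch goes to memory: hardware or software, the
         * miss term dominates. */
        printf("refill, one PTE miss:   %.0f ns\n", refill_ns(C, 1, M));
        printf("plain data-cache miss:  %.0f ns\n", M);
        return 0;
    }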
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 7 Nov 1997 21:50:52 GMT

In article <63vrmv$nd5$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk
(Nick Maclaren) writes:
|> >In modern systems:
-------^^^^^^
|> >1) A TLBmiss, in software, may actually take *less* time than actually
|> >doing a cache miss.  Why is that?
|>
|> Well, on the machines that I have tried (and they DO include some of the
|> ones you mentioned), a TLB miss is usually significantly more expensive.
|> The factor of 5-10 times was based both on measurement, as well as some
|> figures given by current hardware designers for their chips.

It would be helpful to quote some of these, since (date of system) is
fairly important in this discussion, given that, in 1986, we had 8 MHz
(125 ns) single-issue CPUs and DRAM with raw read times of ~120 ns,
while we now have 2-4-issue CPUs in the 200-500 MHz (5-2 ns) range, raw
DRAMs of ~60 ns, and total cache-miss times in the 200-500 ns range for
SMPs and local nodes, 400-1000 ns for low-latency ccNUMAs, and maybe
200-3000 ns for higher-latency ones.

|> >Now, of course, there are also TLBmisses that take longer than cache
|> >misses, but in fact, whether a refill is done by a trap to software,
|> >or by a hardware engine, the time is:
|> >    T = C + N * M
|> >    C = ~constant overhead time
|> >    M = time for cache miss
|> >    N = number of cache misses caused by doing TLB processing
|>
|> I think that you are being misleading - in fact, I am certain.  In many
|> or most architectures, handling a miss in software involves a context
|> switch.  Not a full context switch, to be sure, but the CPU has to move
|> from DAT-on in user mode to DAT-off in kernel mode.  This means that the
|> constant overhead is potentially a great deal larger than for the
|> hardware solution.

You may be certain, but you are incorrect.  There is no context switch
(as most people use the term, i.e., from one user task to another user
task).  I don't recall exactly what the Alpha & UltraSPARC folks do, but
they're not idiots, so presumably they do something similar to what HP &
MIPS have done for a long time: there is a special, low-overhead trap to
the OS, and it has nothing to do with turning DAT on & off.  HP provided
some special registers to make this faster; MIPS used the "hack" of
telling user code that there were 2 registers it could expect to be
trashed at any time, so the kernel doesn't even have to save/restore
those registers; there were enough registers to get away with this.
Various chunks of hardware are added to make extractions or virtual
references (in Alpha's case, anyway) faster, where the issue is a series
of dependent operations that are trivial to do in hardware, leaving the
sequencing and control in software.

Note: some of the beliefs here come from a long discussion in a
Cupertino bar with PA-RISC architects, of the form "why did you do this?
we did that..."  There was some head-slapping when I said we hadn't had
to do special registers, although I had tried to get 3 rather than 2 for
the kernel, but the compiler people wouldn't give me the third one.

I wrote the original MIPS version of such code in early 1986, Steve
Stone tuned it up, and we identified various simple hardware that could
help.  I have the original code somewhere, but couldn't find it; here
was Steve's version as of April 1, 1985:

From scs Mon Apr  1 16:02:19 1985
From: scs (Steve Stone)
Subject: user TLB miss.

I have been trying to reduce the number of instructions involved in
resolving a user tlbmiss.  The best that I can do (with some hardware
changes assumed) is around 15 cycles (assuming no cache misses).  The
following is a first cut at the problem.

The following hardware features are assumed:
- There are separate UTLBMISS/KTLBMISS cause bits.
- The EPC is predecremented by hardware if the branch delay bit is set
  in the SR.  I know this is difficult to implement.  One possible way
  around this is to separate out UTLBMISS in a branch delay slot from
  other UTLBMISSes.
- At the time of a UTLBMISS, the TLBENHI register is set up correctly
  (the TLBPID is or'd in and the VPN is correct).
- There are two registers usable by the kernel only.  The state of
  these registers is never saved and can only be trusted while
  interrupts are disabled (called RT1 and RT2).

Here is the exception handler code:

	/*
	 * Grab the cause bits.  User tlbmiss should be handled quickly
	 * if possible (i.e. the only cause for the exception).
	 */
	mfcause	RT1
	sub	RT1,CAUSE_UTLBMISS
	bne	RT1,r0,exc_noutlbmiss
	/*
	 * - Grab the VPN/TLBPID register from CP0.
	 * - Isolate the VPN in the low order bits * 4.
	 * - Add in the USERPTBASE constant (in kseg3).  The high order
	 *   bit of the VPN will have been set in the TLBENHI.  This
	 *   should be taken into consideration when choosing the
	 *   USERPTBASE location.
	 */
	mfc0	RT1,TLBENHI
	lsr	RT1,TLBPIDSZ-2
	and	RT1,~3
	la	RT2,USERPTBASE
	add	RT1,RT2
	/*
	 * We now have a pointer to the TLB entry.  Grab it.  A fault
	 * may occur here.  If so, the KTLBMISS handler will have to
	 * be smart enough to reset RT1 to be the original PTE pointer
	 * and reset the c0 registers so the following code will work.
	 */
	lw	RT1,0(RT1)
	/*
	 * If the PTE is invalid, handle the long way.
	 */
	and	RT2,TLB_V,RT1
	beq	RT2,r0,exc_upteinval
	mtc0	RT1,TLBENLO
	c0	TLBWRITE
	nop
	rfe
	nop

|> You are effectively saying that this case has been optimised so much
|> that it is no longer significant.  That is most interesting.

Actually, I didn't say that.  It is sometimes significant for certain
programs.  However, truly big programs often want big pages anyway, so
once you figure out how to do that in the general case, you are better
off than shaving a few cycles off something with a terrible TLB-miss
rate.  This problem gets studied every time, and the general approach is
to give the TLB some more resource, but worry a lot more about cache
misses, which are way more frequent for most codes:

a) If a program has a low TLB-miss rate:
   a1) if the cache-miss rate is low, all is well.
   a2) if the cache-miss rate is high, then that's the problem.
b) If the program has a high TLB-miss rate:
   b1) If the cache-miss rate is high, you're down to DRAM speed, and
       either you have a problem for a vector machine, or you need to
       be doing cache-blocking anyway.
   b2) If the cache-miss rate is low, then the TLB is actually the
       bottleneck.

Many designers have never been able to find enough (b2) programs to
justify huge amounts of hardware to help the TLB.  Note, of course,
that IBM RS/6000s have a fairly different philosophy in various ways.

|> >TLBs are famous for the weird bugs and odd cases in many early micros,
|> >which is why OS people were often the ones who preferred
|> >software-controlled ones as less troublesome.
|>
|> Don't you really mean that the bugs are easier to fix, and hence less
|> embarrassing :-)

Not exactly.  What I meant was that almost any OS person involved in the
design of the first round of RISC chips had had experience with early
micros, and with running into weird-case bugs late in the development
cycle, with complex hardware logic that took full chip spins to fix.
It isn't a question of embarrassment, it's a question of whether or not
you can ship a product.  The following has been known to happen, when
designing new systems with brand new micros:
(a) System comes up.
(b) Debug it, looks good.
(c) Get a bunch of systems ready, be running QA.
(d) Fix a bug in the C compiler.
(e) Some instruction moves 2 bytes, crosses a page boundary, regression
    tests start breaking 4 weeks before shipment; it takes 2 weeks to
    figure out exactly what is happening.
(f) Then you realize that the odd case could potentially happen with
    any user-compiled program, and it is a bug in the microcode, and
    it's going to be 3 months before it gets fixed ... and you're dead.
The MIPS utlbmiss code has often been diddled to work around some odd
hardware error, so that you can get beyond the first ones to see what
else there is.

|> >Bottom line: Nick's characterization of software TLBs as "death on
|> >wheels", in general, flies in the face of increasing use of this
|> >technique by very experienced CPU designers.
|>
|> I accept your correction!  I stand by my point that TLB misses are
|> generally "death on wheels", but it is very likely that I have been
|> using software implementations that I thought were hardware :-)
|>
|> I also take your point that TLB misses are becoming less expensive as
|> time goes on, in a way that cache misses are not.  But I don't believe
|> that the turnover point has yet arrived!

Hmmm.  I thought your point was that "death on wheels" was equivalent
to "software-reloaded TLBs are a bad idea and should be done away with."
Was that a misinterpretation?

I'm not sure what "turnover point" means.  A CPU designer has to provide
a set of facilities, which for systems-type chips includes cache + MMU,
with various tradeoffs.  All that's been happening is that countless
studies have convinced many designers that they can avoid a bunch of
complex microcode, or worse, a lot of random logic with touchy special
cases, in favor of a low-overhead trap to a small piece of code; and
that if it takes a few more cycles to do the logic, it takes less die
space, is more flexible, and the times are increasingly dominated by
the time to fetch PTEs from memory anyway.

--
-john mashey    DISCLAIMER:
EMAIL: mash@sgi.com   DDD: 650-933-3090   FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd,
      Mountain View, CA 94043-1389
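[For readers who don't read R2000-era assembly, here is a hedged C
rendering of the fast path in Steve's handler above; the stub functions
(tlb_enhi, tlb_write, long_path), the table size, and the constants are
illustrative stand-ins for the CP0 operations, not real kernel
interfaces.]

    /* C rendering of the UTLBMISS fast path above; all names are
     * hypothetical stand-ins.  The real handler is the ~15
     * instructions of assembly shown in the post. */
    #include <stdint.h>

    #define TLBPIDSZ 6      /* bits of address-space ID in TLBENHI */
    #define TLB_V    0x1u   /* "valid" bit in a PTE (illustrative) */

    static uint32_t user_page_table[1024]; /* stands in for USERPTBASE */

    /* Stubs standing in for CP0 access and the slow path. */
    static uint32_t tlb_enhi(void) { return 7u << TLBPIDSZ; }
    static void tlb_write(uint32_t pte) { (void)pte; /* mtc0+TLBWRITE */ }
    static void long_path(void) { /* invalid PTE: full page fault */ }

    static void utlbmiss(void)
    {
        /* TLBENHI holds VPN<<TLBPIDSZ | TLBPID; the lsr/and pair in
         * the assembly computes VPN*4 as a byte offset, which is the
         * same word index used here. */
        uint32_t idx = tlb_enhi() >> TLBPIDSZ;
        uint32_t pte = user_page_table[idx]; /* this load may itself
                                                miss (KTLBMISS case) */
        if (!(pte & TLB_V)) {   /* invalid: handle the long way */
            long_path();
            return;
        }
        tlb_write(pte);         /* refill the entry, then rfe */
    }

    int main(void) { utlbmiss(); return 0; }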
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 9 Nov 1997 06:31:38 GMT

In article <641ecg$ot1$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk
(Nick Maclaren) writes:
|> Also, I am talking about the TOTAL effect on application speed, and not
|> just the raw cost of processing the problem.  The problem with TLB misses
|> (and, generally, anything that needs a trap) is that the indirect costs
|> are often larger than the direct ones.  Things like conflict for the
|> first-level cache, interference with coprocessors and so on.

|> Well, in MY book, that is a partial context switch, and the TLB refilling
|> is being done by a hybrid hardware/software solution!  But I accept your
|> point that the TLB miss handler 'context' is both minimal and permanently
|> available.
...
|> >Hmmm.  I thought your point was that "death on wheels" was equivalent
|> >to "software-reloaded TLBs are a bad idea and should be done away with."
|> >Was that a misinterpretation?
|>
|> Yes and no.  It IS what I meant, but we were clearly talking about
|> different things!  I have no problem with the solutions that you have
|> described, but I would call them hybrid solutions.

"But `glory' doesn't mean `a nice knock-down argument,'" Alice objected.
"When *I* use a word," Humpty Dumpty said, in a rather scornful tone,
"it means just what I choose it to mean - neither more nor less."

Occasionally, discussion threads get going where people attempt to
modify "standard" terminology, resulting in massive confusion.  Usually
I stop reading the thread at that point.  *I* use the terms "context
switch", "trap", "hardware TLB", and "software-reloaded TLB" (or just
"software TLB") the same way as other people do who actually design
chips and OS's for a living, and I propose to people reading this
newsgroup that more things will make sense if they do the same, that is:

1) A context switch switches state from one process/task to another.
   Maybe someone uses the term "partial context switch" to mean "trap";
   I'll admit I've never heard it.

2) A *trap* directs the program flow to a kernel address, which:
   - Takes action and returns very quickly, as in a normal MIPS
     UTLBMISS trap.
   - Takes action and returns more slowly, as in a 2-level UTLBMISS,
     or some system calls.
   - Takes action that eventually turns into a context switch, as in a
     system call that causes a real I/O & a reschedule to another
     process, or a UTLBMISS that is discovered to actually be a page
     fault.

3) A "hardware TLB" usually means a TLB which, if the desired entry is
   not present, performs a table-walk, or other appropriate mechanism,
   to reload the TLB entry from memory, without causing a trap for
   normal refills.  Such mechanisms were used in the 360/67, 370...,
   VAX, and most early micro TLBs.  Depending on the implemen...