vsta: ChatSMP


= SMP Support for VSTa =

Although VSTa has been designed to use spinlocks, the implementation does not support 2 or more processors on the i386 architecture. Warren Toomey is keen on finishing off the SMP implementation, and his (and others') notes are below.

== Nov 21 2003 ==

'''Warren:''' It's getting to the end of the academic semester here, so I'm going to have to put the SMP work down for a bit. It's probably also a good time to pass it to my student Gabriel, who has suffered because I keep stealing all of his things to do.

I have written up a status report of the work so far, and placed a copy of the source code on [http://minnie.tuhs.org/VSTa/SMP/index.html this web site] in the hope that some other eyes might look at it and discover some of the bugs that are plaguing me at the moment. Any suggestions are most welcome!

== Nov 18 2003 ==

'''Warren:''' I haven't found the cause of the DISCONNECT error yet, but I think I have found a race condition in free_proc(). Specifically, if 2 or more threads try to free_proc() concurrently, we'll have errors in hash_delete() and maybe also in close_ports(). My suggested replacement is below. Andy, could you eyeball this for me?

   * Andy Valencia 11/19/03

Hmmm, so you think/are seeing multiple threads thinking they're "last"?  If this is the case, a lot can go wrong out in do_exit() even before
you get here.  Note that process exit tends to be the stickiest part of most SMP kernel implementations.

{{{
static void
free_proc(struct proc *p)
{
        /*
         * Unhash our PID, but check the return from
         * hash_delete(). If the PID has already been
         * deleted, someone else has run (or is running)
         * free_proc()
         */
        p_sema(&pid_sema, PRIHI);
        if (hash_delete(pid_hash, p->p_pid)==1) {
                v_sema(&pid_sema);
                return;
        }

        /*
         * Delete us from the "allproc" list
         */
        p->p_allnext->p_allprev = p->p_allprev;
        p->p_allprev->p_allnext = p->p_allnext;
        if (allprocs == p) {
                ASSERT_DEBUG(p->p_allnext != p, "free_proc: empty");
                allprocs = p->p_allnext;
        }
        v_sema(&pid_sema);

        /*
         * Close both server and client open ports
         */
        close_ports(p->p_ports, PROCPORTS);
        close_portrefs(p->p_open, PROCOPENS);
        if (p->p_prefs) {
                ASSERT_DEBUG(hash_size(p->p_prefs) == 0,
                        "free_proc: p_prefs not empty");
                hash_dealloc(p->p_prefs);
#ifdef DEBUG
                p->p_prefs = 0;
#endif
        }
}}}

and everything else after that is unchanged.

Andy, also, it appears that servers are trying to remove a sender from their hash table more often than they are adding, i.e. there are more dead_client()s than new_client() + dup_client()s. I'm trying to determine where M_CONNECT, M_DUP and M_DISCONNECT are sent. Am I right in assuming that only msg_connect() in the kernel sends M_CONNECT, only dup_port() in the kernel sends M_DUP, and only shut_client() in the kernel sends M_DISCONNECT?

== Nov 17 2003 ==

'''Warren:''' I didn't do much over the weekend. I commented out some deadlock
assertions in mach/mutex.c; now that we have 2 CPUs, spinlock can take the value of 1 at any time :-)

Andy asked below about the cost of determining a CPU's identity. It's really only a 32-bit read from the local APIC followed by a bit shift. Right now I have it coded as a C function, but it could become either an in-line C function or even an assembler macro. I'll worry about that once we have most of the SMP race bugs out of the road.
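As a rough illustration of the read-and-shift described above, here is a minimal sketch. It assumes the local APIC is already mapped somewhere in kernel virtual memory (passed in as `apic_base`, a hypothetical parameter) and that the ID sits in the top byte of the register at offset 0x20, as on the xAPIC. `cpu_id()` is an illustrative name, not the function in the VSTa tree.

```c
#include <stdint.h>

#define APIC_ID_OFFSET	0x20u	/* Byte offset of the local APIC ID register */
#define APIC_ID_SHIFT	24	/* The APIC ID occupies the top byte */

/*
 * cpu_id()
 *	Return the local APIC ID of the CPU executing this code
 *
 * apic_base is a hypothetical pointer to wherever the kernel has
 * mapped the local APIC registers.
 */
static inline uint32_t
cpu_id(const volatile uint32_t *apic_base)
{
	/* One 32-bit read, one shift */
	return apic_base[APIC_ID_OFFSET / sizeof(uint32_t)] >> APIC_ID_SHIFT;
}
```

Because the body is a single load and shift, marking it `static inline` (or turning it into an assembler macro) should keep the cost to a few cycles plus the latency of the APIC read.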

Servers are getting the wrong sender id from the kernel, which is why they end up dereferencing a NULL pointer. I have to investigate the message passing code, especially msgcon.c

== Nov 13 2003 Part Two ==

'''Warren:''' At last I'm finally getting somewhere! I've been able to boot to the login prompt with 2 CPUs, login and do a gls -lrt. The reason processes are getting page faults is that for some reason a DISCONNECT message is being sent to them. They then do a filehash operation to determine the matching struct file pointer. However, this is NULL, and then they go off and dereference the NULL pointer.

For now, I've recompiled cons, env and dos to return immediately in dead_client() if they receive a null file pointer. However, I need to investigate exactly why they are getting a NULL pointer in the first place.

== Nov 13 2003 ==

'''Warren:''' I thought I had a solution to the private pages problem as noted below. From the Pentium Pro onwards, you can mark pages as ''global'' in the MMU. Once a global page is set, the MMU entry won't be altered on a context switch. This seemed to be the way of keeping a private page per CPU, regardless of what thread it was running.

It turns out that VSTa's use of segments makes the global page idea useless. We can mark a page to be locked into the MMU. However, VSTa uses segments for user processes. This means that a process has a page at 0x400000 as well as the kernel. If I lock 0x400000 in as a global page, then when a thread tries to access this user-space memory, it gets the kernel global page and dies.

I think it's time to put aside the idea of private pages.

== Nov 12 2003 ==

'''Warren:''' After the success of the 6th Nov, I have to report that I believe private pages are not going to work in VSTa. In fact, the success of 6th Nov was partially due to me having removed private pages. I have been trying to add them back in over the last 6 days, and I'm pretty sure that they will not work.

Why private pages? I wanted each CPU to have at least one private page of data, where it could store curthread, the percpu struct, its own idle stack etc., and which it could access without being slowed down by a) determining its own CPU identity and b) having to index into an array to get at this private data.

The idea of private pages was to give each CPU its own page map. The kernel parts of the page maps would be identical except for one 4K page which would hold the private data.

However, I forgot to consider multiple threads within a process. In VSTa, each process has its own page map too. The kernel half is bcopy()'d from the CPU's page map. The user half is set up as required. Now, threads in a process share this common page map.

So, when 2 CPUs are running threads from the same process, they must set up the same address space, and so must use the same page map. On the Pentium, this means that both CPUs set up the same cr3 register. Unfortunately, with the same page map, it becomes impossible to set up private pages, one per CPU.

I can't see a solution that will allow private pages to be kept. We can't give each thread its own page map, as we end up with memory inconsistency problems between the threads' vas, which are now different. We could store and reload half of the Level 1 page table during resume(), but now we need to keep hundreds or thousands of pte_t entries in each hatvas structure.

I'd appreciate any ideas at this point!

   * Andy Valencia

Hi, this is a really interesting result.  I see now that I had unconsciously carried along the baggage of my days being a kernel engineer
at Sequent (a nice SMP family of boxes).  But they never had kernel threads!

I'm tending to agree with your observation.  All we really need is an efficient way to index per-CPU context; if the page map trick
won't work then hopefully there's an easily used CPU register (some sort of APIC ID perhaps?) which can index to the needed data
structures for per-CPU state.  Hopefully it'll be fast--single digit CPU cycles--so that the scheduler and
process dispatch doesn't get mired in complex calculations.
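One way to picture the "index per-CPU context by a CPU register" idea is a plain array of per-CPU structs indexed by APIC ID. The sketch below is only an illustration under stated assumptions: `NCPU`, `cpu_table` and `percpu_for()` are hypothetical names, the struct is a cut-down stand-in for VSTa's struct percpu (only pc_locks is taken from these notes), and APIC IDs are assumed to be small and dense.

```c
#include <stddef.h>
#include <stdint.h>

#define NCPU	4	/* Assumed maximum CPU count for this sketch */

/* Cut-down stand-in for struct percpu; field layout is illustrative */
struct percpu {
	uint32_t pc_id;		/* APIC ID of this CPU (hypothetical field) */
	void *pc_thread;	/* Thread running here (hypothetical field) */
	uint32_t pc_locks;	/* Count of spinlocks held */
};

/* One slot per CPU, globally addressable by every processor */
static struct percpu cpu_table[NCPU];

/*
 * percpu_for()
 *	Map an APIC ID onto its per-CPU slot, or NULL if out of range
 */
static struct percpu *
percpu_for(uint32_t apic_id)
{
	if (apic_id >= NCPU) {
		return NULL;
	}
	return &cpu_table[apic_id];
}
```

The lookup is one bounds check and one scaled add, so the cost is dominated by however the APIC ID itself is obtained.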

== Nov 06 2003, Part 2 ==

After solving the TSS problem below, I can now happily report that I can boot VSTa with 2 CPUs [on Bochs], and at least some of the time I have reached the login prompt before the system crashes. Mostly though, one of the boot processes dies first! But it means that both CPUs are successfully entering the kernel, scheduling processes and (mostly) not interfering with each other. This is excellent news.

== Nov 06 2003 ==

'''Warren:''' I've had many struggles and only a few successes this week. Right now I can run VSTa with the Boot CPU only, or spin the Boot CPU and run VSTa with the second CPU only, but not both.

What is happening right now is that CPU1 is falling into hat_addtrans() from vas_fault(), doing an alloc_page() and then bzero()ing the page. Now the values from alloc_page() seem fine, and the bzero() seems to work.

However, when bzero() returns, it does not load the old EIP back into the CPU on the RET instruction. Instead, it seems to load the current stack pointer (or a nearby value) into the EIP. This causes the CPU to execute whatever is on the stack, and we quickly get a page fault.

''Update:'' Ah, I think both CPUs are sharing the same TSS task gate, and so obtain the same stack pointer when entering the kernel. This is obviously going to cause stack corruption. I'll add code to create separate TSSs now.

== Nov 01 2003 ==

'''Warren:''' The last entry about successful delivery of interprocessor interrupts was a bit premature, but it's now working fine. When the second processor now disables interrupts, the IPIs are also disabled. I'm now diagnosing problems in the existing VSTa spinlock code, where there are some uniprocessor assumptions :-)

What I'm trying to do also is keep the existing code and #ifdef SMP the new SMP code. But in mach/mutex.?, this might get too ugly. If this occurs, I'll ask Andy for some style advice.

We're at the point where the second CPU is ready to enter swtch().

== Oct 26 2003 ==

'''Warren:''' Quick update. Bochs is broken w.r.t delivering external interrupts to all CPUs. While I work at fixing the bug [aaargh, the code's so ugly..], I've implemented interprocessor interrupts (IPIs) in VSTa, but I had to define T_IPI 48
and wire it up to look like an interrupt not a trap. Now the boot processor can IPI the second processor, which prints a debug message. For now I'll use this to distribute hardclock() events, but in the long run we need symmetric interrupts.
And right now, with 2 CPUs running, I'm hitting kernel deadlocks. Stage 2 has arrived :-)
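For readers unfamiliar with how an IPI is issued, here is a hedged sketch of sending a fixed-delivery IPI through the local APIC's Interrupt Command Register. The register offsets (0x300 and 0x310) and the destination field position are standard xAPIC layout. `send_ipi()` and `apic_base` are hypothetical names, and this is not the code in the VSTa tree.

```c
#include <stdint.h>

#define APIC_ICR_LOW	0x300u		/* Interrupt Command Register, bits 0-31 */
#define APIC_ICR_HIGH	0x310u		/* Interrupt Command Register, bits 32-63 */
#define ICR_LEVEL_ASSERT (1u << 14)	/* Level must be "assert" for a fixed IPI */
#define T_IPI		48		/* Vector chosen in the entry above */

/*
 * send_ipi()
 *	Send a fixed-delivery IPI to one CPU, addressed by physical APIC ID
 *
 * apic_base is a hypothetical pointer to the mapped local APIC.
 */
static void
send_ipi(volatile uint32_t *apic_base, uint32_t dest_apic_id, uint32_t vector)
{
	/* The destination APIC ID lives in the top byte of the high word */
	apic_base[APIC_ICR_HIGH / sizeof(uint32_t)] = dest_apic_id << 24;

	/* Writing the low word is what actually sends the interrupt */
	apic_base[APIC_ICR_LOW / sizeof(uint32_t)] =
		ICR_LEVEL_ASSERT | (vector & 0xffu);
}
```

A boot CPU forwarding a clock tick would call something like `send_ipi(apic, 1, T_IPI)` from its own hardclock() path. Real code would probably also poll the delivery-status bit before reusing the ICR.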

== Oct 22 2003 ==

'''Warren:''' I've quickly modified <mach/vm.h> and mach/trap.c to build separate TSS entries in the GDT, one for each CPU. I've got the second processor ltr()ing the correct entry. Yay, we can now cross the kernel-mode/user-mode barrier presented by retuser()! This means we are reaching the end of Stage 1 (dealing with the hardware issues) and we now move to the more interesting Stage 2 (finding race conditions). However, the interrupt problem, and probably separate idle stacks, need to be dealt with before we can definitely say that the second CPU is really entering the VSTa kernel and running processes.

Quick update: each CPU now has its own idle stack, which is now located within
the private percpu page. So now we can focus on fixing interrupts.

'''HELP''' Does anybody know if the intr_mask variable in mach/isr.c is shareable amongst all CPUs, or should there be a separate intr_mask for each CPU? And if anybody has any idea why the boot CPU is getting interrupts but the other one isn't, I'd love to hear from them!!!

   * Andy: interrupt handling MUST be symmetric across CPU's.
   * Andy: It's been too long since I read the APIC spec, but basically that's the beast you need to configure WRT your interrupt distribution.

== Oct 21 2003 ==

'''Warren:''' Both Gabriel and I are going to be busy this week with academic work, so progress will slow down somewhat. I've been reading up on Intel task selectors. It looks like each CPU has to have its own TSS entry in the GDT which can point at the tss struct that the global pointer tss points to.

At present there is only a single entry in the global GDT pointed to by gdt, i.e. the TSS entry at index GDT_BOOT32. I'm thinking of moving what is now GDT_BOOT32 from position 3 down to the end of the GDT, and writing a macro something like this:

#define GDT_TSS(cpuid) ((5 + (cpuid)) << 3)

where cpuid= 0, 1, 2, 3 etc. setup_gdt() will then loop from 0 up to ncpu-1 initialising all the TSS entries, and each CPU will ltr(GDT_TSS(n)), where n is specific to each CPU.
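To make the selector arithmetic concrete, here is a small sketch of the macro and the per-CPU loop. The base slot of 5 is taken from the macro above; `GDT_TSS_FIRST` and `fill_tss_selectors()` are hypothetical names, and the real setup_gdt() would also have to fill in the descriptors themselves, which is not shown.

```c
#include <stdint.h>

#define GDT_TSS_FIRST	5	/* First TSS slot, as in the macro above */

/* Selector = GDT index << 3 (TI and RPL bits left as zero) */
#define GDT_TSS(cpuid)	((GDT_TSS_FIRST + (cpuid)) << 3)

/*
 * fill_tss_selectors()
 *	Record the selector that each CPU would hand to ltr()
 */
static void
fill_tss_selectors(uint16_t *sel, unsigned int ncpu)
{
	unsigned int n;

	for (n = 0; n < ncpu; ++n) {
		sel[n] = (uint16_t)GDT_TSS(n);
	}
}
```

Parenthesising `cpuid` inside the macro matters as soon as a caller passes an expression rather than a plain variable.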

Andy, I've just noticed in <mach/vm.h> that there is no Entry 4 in the GDT; Entry 3 is GDT_BOOT32 and Entry 5 is GDT_UDATA. Is there any reason for the missing #4?


== Oct 20 2003 ==

'''Warren:''' While I work on the user-space problem with the second processor, Gabriel and I determined that the second CPU isn't receiving any interrupts yet, so he is going to look at the I/O APIC and anything else that is preventing interrupts.

== Oct 19 2003 ==

'''Warren:''' Status report. I've implemented alloc_private_page() in mach/vm.c which works. To do this, all CPUs need their own L1 page table and a specific L2 page table each, which covers the virtual addresses 4 Meg to 8 Meg (on the Pentium). The boot processor now grabs an extra page frame during init_machdep() to do this. When the second processor boots, it calls alloc_page() twice to get its own L1PT and private L2PT, and constructs suitable entries for the L1PT. Entries 0, 3 and up are identical to the boot processor's. Entry 2 is recursive, and entry 1 points at the private L2PT.
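The L1PT layout described above can be sketched as follows. This is an illustration only: `build_ap_l1pt()` is a hypothetical name, the flag names are assumed, and the physical addresses are passed in as parameters rather than obtained from alloc_page().

```c
#include <stdint.h>
#include <string.h>

#define NPTE	1024		/* Entries in an i386 page table */
#define PT_V	0x001u		/* Present bit (name assumed) */
#define PT_W	0x002u		/* Writable bit (name assumed) */

typedef uint32_t pte_t;

/*
 * build_ap_l1pt()
 *	Build a secondary CPU's Level 1 page table from the boot CPU's
 *
 * l1pt_phys and priv_l2pt_phys are the page-aligned physical
 * addresses of the new L1PT and of this CPU's private L2PT.
 */
static void
build_ap_l1pt(pte_t *l1pt, const pte_t *boot_l1pt,
	uint32_t l1pt_phys, uint32_t priv_l2pt_phys)
{
	/* Entries 0, 3 and up: shared with the boot processor */
	memcpy(l1pt, boot_l1pt, NPTE * sizeof(pte_t));

	/* Entry 1 (4M to 8M): this CPU's private L2 page table */
	l1pt[1] = priv_l2pt_phys | PT_V | PT_W;

	/* Entry 2: recursive, points back at this L1PT */
	l1pt[2] = l1pt_phys | PT_V | PT_W;
}
```

Copying the whole table first and then overwriting entries 1 and 2 keeps the shared entries in step with the boot processor at the moment of the copy. Later changes to the boot L1PT would still need propagating.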

With this working, I have also written a new routine in init_machdep() called init_percpu(), which does an alloc_private_page() and uses this private page for the struct percpu object that cpu points at. All processors call this. Right now, the boot processor boots, sets its own page tables in init_machdep(), does init_page(), then init_percpu(), and runs VSTa. The other processor builds its own page tables, calls init_percpu(), watches the boot processor set upyet==1 and then halts. I'm nearly ready to turn interrupts on in the AP and choose a ready thread to execute!

However, I've hit a chicken and egg problem. cpu is now a global variable which points at the percpu struct. Each CPU should have its own percpu struct. However, to create the percpu struct we have to incr cpu->pc_locks as part of alloc_page(). Of course, it doesn't exist yet. Or, worse still, cpu has a value (because it was set up by the boot processor), but this points at an invalid frame just as the second processor tries to allocate a page. And once the boot processor sets cpu, we can't touch it as the boot processor is running VSTa!

Maybe the solution is for the boot processor to get a signal back from the second processor (i.e. each other processor) as to when it has done the init_percpu(), via a shared global variable.
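A shared-variable handshake of this kind might look like the sketch below. It uses C11 atomics purely for clarity (the 2003 kernel would have used its own primitives), and `percpu_done`, `ap_percpu_ready()` and `boot_wait_percpu()` are hypothetical names.

```c
#include <stdatomic.h>

/* Number of secondary CPUs that have finished init_percpu() */
static atomic_uint percpu_done;

/*
 * ap_percpu_ready()
 *	Called by each secondary CPU once its percpu struct is usable
 */
static void
ap_percpu_ready(void)
{
	atomic_fetch_add_explicit(&percpu_done, 1, memory_order_release);
}

/*
 * boot_wait_percpu()
 *	Called by the boot CPU: spin until nap secondary CPUs check in
 */
static void
boot_wait_percpu(unsigned int nap)
{
	while (atomic_load_explicit(&percpu_done,
			memory_order_acquire) < nap) {
		/* Busy-wait; real code might add a pause hint here */
	}
}
```

The release/acquire pairing means that once the boot CPU sees the count reach `nap`, it also sees everything each secondary CPU wrote before checking in.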

'''Further update''': The boot CPU and the others now sync each other to obtain pages when there is no cpu struct. I've borrowed enough code from init_trap() to set the IDT and I can now turn on interrupts in the second processor while the first one runs VSTa. But when I get the second processor to call swtch(), it falls into never-never land when it irets at the end of resume(). That seems to suggest that I don't have any user-space mapped at that point. That's something to investigate :-)

== Oct 15 2003 ==

'''Warren:''' In an off-Wiki e-mail conversation, Andy has tentatively agreed that separate page tables might be the way to go, as long as there is a clean API for using them. I've created a proposed set of functions and put them up on the Wiki [[http://www.vsta.org:8080/Chat/Kernel/SMP/PrivateAPI here]]. Comments please :-)

== Oct 14 2003 ==

'''Warren:''' Status report: I've redone the mapping of the local APIC into kernel virtual memory using the correct VSTa routines, so it's now mapped into the Utility area. I've also removed the pc_next concept from struct percpu, and made cpu into a pointer. I now allocate a 4K percpu page and point cpu at this page, and the system boots with this change. And a few minutes ago, I wrote the code to get the other processors to allocate their own page, and remap it to where the boot processor put the percpu page. So now we have separate percpu structs for each processor. However, to do this, I had to split the initialisation of the percpu struct into a separate function, init_percpu(), and run this after init_page(), as the vmap isn't set up until after init_page().

Further status report: I think I was being naive with my ideas on paging. I now suspect that each processor is going to need its own Level 1 page table and its own set of Level 2 page tables. There is no other way of supporting mappings of 2 physical page frames to the same virtual kernel page by 2 processors. We're also going to need separate maps for when we are running user processes.

That probably means, in the second processor, I need to allocate and copy the L1 and L2 page tables from the boot processor, point my own cr3 at them, and then remap the percpu page. Comments and criticisms most welcome here!

== Oct 13 2003 ==

'''Warren:''' I've brought some of my questions to Andy's responses up here so there's no confusion. Andy wrote about the cpu struct:

   * Andy: Actually, you *keep* the cpu variable, put it at a 4k-aligned address, and map this address uniquely on each CPU.
   * Warren: That's a cool idea, I didn't think of that :-) However, this means that only one percpu struct will be visible to each processor. That seems to imply that we need to alter preempt(), as we can't keep a circular list of percpu structs anymore. Will this be a problem? Can we get rid of the pc_next field as well if there is only one visible percpu struct?

      * Andy: the parts that are seen only by the owning CPU live in "cpu".  It's still quite possible to have a per-cpu data structure globally addressable; just have a pointer to it in the private CPU struct.

Also, as each processor needs its own idle stack, and the percpu struct is < 4K in size, can I put the idle stack space into the percpu struct, so that different mappings will give each CPU its own percpu struct and idle stack?

   * Andy: this sounds good, although a little care needs to be taken if a > ~4k idle stack is ever needed (I can't think why offhand).

Finally, I went looking for a way to map a known physical page frame (i.e the local APIC) to a spare kernel page. alloc_vmap() isn't right, as it allocates a spare page frame and I can't give it a fixed physical base address. Andy, can you also comment on the purpose (or lack thereof) for the unused Entry 1 in the L1PT, i.e 4M to 8M? Is it left unused for a purpose?

   * Andy: the way the debugger maps pages should work for this.  Get the virtual page, then set its mapping.  I'd give you the actual routine except I don't have source code handy at the moment!



== Oct 12 2003 ==

'''Warren''': I'm now pondering what next to do. The first thing is to introduce multiple percpu structs, i.e. a struct percpu cpulist[]. I'd also like to have multiple idle stacks, and insert a pointer to the correct idle stack in the percpu struct. The next problem is that the old cpu variable has to go; in fact, even a global pointer called cpu isn't going to work as every processor will use it. I think what has to be done is for the processor to determine its identity whenever we enter the kernel (syscall, interrupt, exception), and then somehow code the use of cpu to be relative to this identity. Again, we can't store this in a global variable :-)

   

I'm also still unsure of the consequences of letting all interrupts through to multiple processors. Won't this cause each processor to try to deliver a message? We will still need the clock ticks to get through so as to allow pre-emption, but I need some interrupt enlightenment.

   * The SMP PIC design of the x86 works fine for arbitrating the interrupt workload across the CPU's.
   * As you note, each CPU still needs to get its own clock tick--there is allowance for this, I believe.
   * If not, whichever CPU gets the interrupt will need to send an inter-CPU interrupt to distribute it.  But I seem to recall this isn't necessary.

== Oct 11 2003 ==

'''Warren''' I fixed the problems with the initial bootstrap code on the second CPU. I wasn't setting the GDT and IDT up with the values used by the 1st CPU. I also found that the assembler can't generate far jumps properly and set CS, so I'm now pushing CS and the jump address, and using {{{lret}}} to do what I need.

The second CPU can now set up GDT & IDT, switch to protected mode, far jump to a location above 1M, set up CS, DS etc. to point to the correct selectors, and call a C function. Finally, some progress!

   * Andy: The lret approach sounds fine.  It's exciting to hear that pretty soon you'll be into actual OS SMP code (rather than just setup)!
   *  Please remember to remove your memory mapping hack soon.  It'll bite you and waste a bunch of time if you let it float around indefinitely.
   * Warren: Ok, shall do!

== Oct 10 2003 ==

'''Warren''' Ok, we've got the APIC mapped into virtual memory. I decided to manually steal the top 4K page just below the 4M virtual address; this means that nearly all of the bottom 4M of RAM is mapped 1:1, but that the top 4K page maps the APIC. It's kludgey but we can fix it later. With the mapping, we can start the second CPU in the SMP system.
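The "steal the top page below 4M" kludge amounts to overwriting one entry in the Level 2 page table that covers 0 to 4M. A minimal sketch, assuming standard i386 PTE flag bits and using hypothetical names (`map_apic()`, `L2IDX`, `APIC_VADDR`, and the flag macros):

```c
#include <stdint.h>

#define NBPG		4096u			/* Bytes per page */
#define APIC_PHYS	0xfee00000u		/* Local APIC physical base */
#define APIC_VADDR	(0x400000u - NBPG)	/* Top page below 4M */

#define PT_V		0x001u	/* Present (name assumed) */
#define PT_W		0x002u	/* Writable (name assumed) */
#define PT_PCD		0x010u	/* Page-level cache disable (name assumed) */

typedef uint32_t pte_t;

/* Index of a virtual address within its Level 2 page table */
#define L2IDX(va)	(((va) >> 12) & 0x3ffu)

/*
 * map_apic()
 *	Point the last page of the first 4M at the local APIC
 *
 * l2pt0 is the Level 2 page table covering virtual 0 to 4M.
 */
static void
map_apic(pte_t *l2pt0)
{
	/* Device registers should not be cached, hence PT_PCD */
	l2pt0[L2IDX(APIC_VADDR)] = APIC_PHYS | PT_V | PT_W | PT_PCD;
}
```

After changing a live mapping like this, the old TLB entry for that page would also need flushing before the APIC registers are read through it.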

The next problem needs someone who can drive GNU as, as I need to work out the correct syntax. I've enabled protected mode on the CPU, and now I need to do a long jump to flush the CPU pipeline. I also want to jump to a location in the kernel, i.e up above the 1Meg address. The code so far is:

{{{
.code16
. . .
movl  %eax, %cr0
orb   $al,  1
movl  %cr0, %eax
.code32
ljmp  *_trampoline_pm
}}}

where _trampoline_pm is the address 0x1006ca. However, I'm seeing this instruction: {{{jmp DS:001006ca}}}, and then the CPU crashes. The values for CS and DS are: 

{{{
cs:s=0x1100, dl=0x1000ffff, dh=0x9b01, valid=1
ds:s=0x0, dl=0xffff, dh=0x9300, valid=7
}}}

but the orb instruction just above this worked, and it used DS. I suspect either a) I should be doing {{{jmp CS:001006ca}}}, or b) the jump is out of the range of DS, and I need to do something to it.

If anybody has some ideas, like how I write the gas syntax to use CS: not DS: in the jump, I'd very much appreciate it! Maybe an off-Twiki conversation by e-mail would be more appropriate; my e-mail address is wkt AT tuhs DOT org.

== Oct 09 2003 ==

'''Warren:''' Gabriel and I are trying to read from the local APIC which has a physical address of 0xfee00000. However, we are doing this after init_machdep() runs. I have slowly read through init_machdep(), and would like to confirm that the resulting physical RAM and the resulting virtual address map looks like
[[http://minnie.tuhs.org/Vstabook/Figs/init_virt_mem.gif this]] and
[[http://minnie.tuhs.org/Vstabook/Figs/phys_mem_layout.gif this]].

If we want to map the APIC onto a virtual page of its own, where would be a good place to do this? Obviously, somewhere between 4K and 2G virtual. Would the top of the 2G kernel memory be a good place? Finally, during init_machdep(), where in physical RAM is the stack that the CPU is using?

   * You should use alloc_vmap() (see how init_debug() does it) to get a virtual page.  The way dbg_utl is used should show you how to map an arbitrary physical address.

== Sep 22 2003 ==

'''Warren:''' Status report. I have a student Gabriel who is working on the SMP coding. We have a real 2-CPU box to play on as well as Bochs. Gabriel has access to the Chat areas here, so he can also ask questions.

Big questions: what is going to be the right way of setting up a second kernel stack for the second CPU? Also, we need to run some real-mode code to bootstrap the second CPU. It's going to have to be in low memory, under 1 Meg. Again, what's the right way to force this? Can we force the linker to place code at a certain position?

   * Andy Valencia 9/22/03

   The model should probably be that the first CPU will bring the entire system up, and then subsequent CPU's can assume appropriately initialized
   memory free lists and what-not, thus can use MALLOC.  a.out does not support non-contiguous memory segments (a feature, IMHO) so you should embed
   hand-written assembly using PIC and then copy it down to low memory.

== Jul 26 2003 ==

'''Gavin:''' I have VSTa installed on a removable drive that can be plugged into various machines. I booted it on a 2x1GHz PIII, and recompiled the kernel. It's almost certainly only using a single CPU, but the good news is that it didn't crash and burn!

'''Warren''': Good to know, Gavin. I've booted SMP Minix (see below) in the Bochs simulator with 2 CPUs, and that worked, so I'll see how that was done and try to do the same for VSTa.

== Jul 24 2003 ==

'''Warren:''' So far I've added a new file called ''mp_machdep.c'' to src/os/mach and I've added code to detect the Intel MP table in the BIOS. I'm reading through the Intel MP specification, the [http://www.freebsd.org/smp/ FreeBSD SMP] code and also the code for [http://webepcc.unex.es/~jalvarez/minixsmp/ Minix SMP].

I'm still learning how VSTa works, and my biggest weakness at present is the virtual memory mapping. So some of these questions fall into this area.

Questions:
   * Will each CPU need its own kernel stack, and will it also need its own idle stack?
   * Will we need to propagate page invalidations from one kernel to the others?
   * In my last set of questions, I asked about interrupt propagation. Maybe I need to be a bit clearer. Should all CPUs get all interrupts, or should we attempt to mask out those which will be useless, e.g. mask out disk interrupts for a CPU which didn't issue a disk command? Or should we let interrupts go to all CPUs, and let them race to find and wake up the correct thread?
   * In vall_sema(), the XXX comment indicates that the code will be a race on SMP systems. Is it good enough to p_lock(&s->s_lock) before the while loop and release the spinlock afterwards?

== Jul 21 2003 ==

'''Warren:''' I noticed that the p_lock() primitives should not really be called spinlocks, because they do not spin. I've modified them to use the i386 ''xchg'' instruction and a C loop to properly spin while waiting to set the spinlock. I've recompiled the kernel and it boots fine.
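For comparison, a spin-on-exchange lock can be sketched in portable C11 as below. On i386 compilers typically implement the atomic exchange with the ''xchg'' instruction. The names `spin_t`, `spin_try()`, `spin_lock()` and `spin_unlock()` are hypothetical and this is not the modified p_lock() code itself.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
	atomic_uint l_lock;	/* 0 = free, 1 = held */
} spin_t;

/*
 * spin_try()
 *	One exchange attempt; true if we took the lock
 */
static bool
spin_try(spin_t *l)
{
	return atomic_exchange_explicit(&l->l_lock, 1,
		memory_order_acquire) == 0;
}

/*
 * spin_lock()
 *	Loop until the exchange finds the lock free
 */
static void
spin_lock(spin_t *l)
{
	while (!spin_try(l)) {
		/* Wait with plain reads so we don't hammer the bus */
		while (atomic_load_explicit(&l->l_lock,
				memory_order_relaxed)) {
			;
		}
	}
}

/*
 * spin_unlock()
 *	Release the lock with a simple store
 */
static void
spin_unlock(spin_t *l)
{
	atomic_store_explicit(&l->l_lock, 0, memory_order_release);
}
```

A kernel version would normally also disable interrupts around the critical section, otherwise an interrupt handler that wants the same lock can deadlock against the code it interrupted.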

Questions:
   * We obviously want all CPUs to share all (most?) of the kernel data structures, but each CPU will have its own percpu struct. However, I assume that each CPU has its own MMU and cache, and these will need configuring at boot time. Where in os/kern/main.c or os/mach/init.c is the best place to a) detect each CPU and b) configure it, in such a way that the shared kernel data is not initialised twice.
      * Somewhere in or around init_machdep(), I'd assume
   * Booting: we don't want to run /etc/rc twice, and we don't want to run the server modules twice. Do we need to tell each CPU if it is the boot CPU, and to only let one CPU do these things?
      * Usually you have only one CPU come up, and then enable the others
   * Interrupt handling: I don't know enough of i386 SMP nor the VSTa design yet. If all CPUs have interrupts enabled, will they all get interrupted by the one interrupt from an I/O device? If so, does this mean that we are going to have to mask out certain interrupts on certain CPUs? Or is there a better approach?
      * The interrupt controller knows how to do distribution for you.


= Private Page API for VSTa Kernel =

Here is a proposal for two new functions for the VSTa kernel, which will provide a processor with the ability to have `private' kernel virtual pages that are not visible to other processors. The pre-requisites to use the routines are:

1. The processor has created its own page tables in such a way that the `root' kernel pages and the kernel utility pages are globally visible to all CPUs, but there is a region of virtual kernel memory set aside for private pages.

2. init_page() has been called by the boot processor, so that the vmap has been initialised.

On the Pentium platform, the second entry in the Level 1 Page table must point at a different Level 2 page table for each processor. This allows private pages to be mapped in to the kernel between the virtual addresses 4M to 8M.

void *alloc_private_pages(void *vaddr, uint npg)

Allocate npg pages of contiguous virtual memory to a processor. The memory will be mapped privately into the processor's kernel address space, and it will not be visible to other processors. The memory will be marked as kernel memory (i.e. it cannot be paged out). If vaddr is non-zero, the base of the memory allocation must begin at this base address. If vaddr is zero, then the function can choose an arbitrary virtual address to place the allocation. The function returns the virtual address of the base of the allocated memory, or zero if the allocation could not be performed (e.g. not enough free pages, vaddr was not page-aligned in the private page region, or if there are pages already mapped within the desired area).

* Andy: Under what conditions do you foresee the need for something to choose an address?  Are you going to bind "cpu" to a vaddr in this space, and use this API to pluck that vaddr out of the pool during startup?
* Warren: yes, I envisage the boot processor doing cpu=(struct percpu *)alloc_private_pages(0,1), and the other processors doing alloc_private_pages(cpu,1).

* Andy: I'm not sure this'll work unless the semantics are that the needed private physical pages are allocated on all processors immediately.
* Warren: I'm not sure why. If each CPU has its own page table, then it can allocate private pages without having to co-ordinate with other CPUs, except the locking required to obtain the page frames.
* Andy: I really think things will be cleaner if the boot processor can just call this to set up the per-CPU data structures, and there are no additional actions required of the other processors.  All data structures need to exist on all CPU's, at the same address and with the same size.  I suggest making it impossible to violate this rule, rather than provide an API whose use must be done carefully to preserve this rule.

* Andy: With this API set, you can't create aliases (two vaddrs for one physical location).  This might be a feature, since some processors (like MIPS) have a problem with aliasing in the general case anyway.  Is it intentional that you didn't define an alloc_private_page()?
* Warren: I wanted it to be general, as per alloc_pages(). I could see potential problems where a processor wanted a run of N contiguous pages, and got halfway through allocating 1 page at a time and hit an error.

void free_private_pages(void *vaddr, uint npg)

Unmap and release the npg number of contiguous pages mapped at private address vaddr. The function returns no result, so it is silent about possible errors (e.g. there are not enough contiguous pages starting at vaddr). The freed pages are returned to the free page list.

* Andy: Do you really need this?  I'd try to avoid it, since you'll be faced with TLB coherency issues.
* Warren: no, I don't foresee a need for this, but I thought I should at least suggest it.

Both of the above two functions will be implemented by calling existing kernel memory management functions (e.g. alloc_pages(), free_pages()). I don't know if we need the second function; perhaps free_pages() will already do what we need.

* Andy: I don't see how you can implement these using alloc_pages().  You're going to need your own resource map and techniques to enumerate the needed slots in the L2 PTE's of each CPU?
* Warren: I need alloc_pages() to manage the page frame resource. You are right, I'm going to need to manage the private mappings. I was hoping, at least on the Pentium, to inspect the contents of the (per-cpu) L2 page table to determine if a mapping was possible and also to choose the address when vaddr is 0.
* Andy: Use machine independent data structures (rmap in this case).

'''Warren:''' Here is the actual routine I implemented, which is working fine.

{{{
/*
 * alloc_private_page()
 *      Allow a processor to allocate a page which is private.
 *      The page will appear in the range 4M to 8M.
 */
void *
alloc_private_page(void *vaddr)
{
        pte_t pt;
        uint pg;

        /*
         * If vaddr is present, check that it is within the
         * private address range. Return error if the vaddr
         * already has a mapping.
         */
        if (vaddr) {
                ASSERT((vaddr >= PERCPULOW) && (vaddr < PERCPUHIGH),
                        "alloc_private_page: bad vaddr");
                pt = kern_findtrans(vaddr);
                if (pt != 0) {
                        return((void *)0);
                }
        } else {
                /*
                 * Walk the entries from PERCPULOW to PERCPUHIGH to
                 * find an empty vaddr. Return error if none spare.
                 */
                for (vaddr = PERCPULOW; vaddr < PERCPUHIGH; vaddr += NBPG) {
                        if (kern_findtrans(vaddr) == 0) {
                                break;
                        }
                }
                if (vaddr >= PERCPUHIGH) {
                        return((void *)0);
                }
        }

        /*
         * Obtain an unused page frame. Make it a kernel page.
         * Map it into the desired address.
         */
        pg = alloc_page();
        core[pg].c_flags |= C_SYS;
        kern_addtrans(vaddr, pg);
        parprintf("Private page 0x%x mapped at address 0x%x\n", pg, vaddr);
        return(vaddr);
}
}}}

parprintf() is a debug routine I use to print stuff on the parallel port, which comes out as a separate file in Bochs. It will go away.