The last post introduced the mechanisms that xv6 uses for scheduling and context
switches. User processes can transfer control to kernel code with system calls,
potentially switching into the scheduler with sleep()
or exit()
to find
another process to run. But there are many other system calls besides those two
Kernel code can also be invoked during hardware interrupts or software
exceptions; these three together are collectively referred to as traps.
We'll go over traps now to understand them more generally. First, about the terminology: depending on the source, interrupts might mean hardware interrupts specifically or any trap generally; similarly, exceptions might mean errors arising from the code, or traps in general. It's super frustrating because it makes it really hard to know what's meant by a word like "interrupt" or "exception" in whatever specification or source you happen to be reading. So I'm gonna try my best to save you that kind of pain in this post by sticking to "interrupt" for the hardware interrupts only, "exception" for software errors, and "trap" for those two combined with system calls.
Imagine if, after every single time some user code carried out a division, the processor stopped, context switched into the kernel, and asked the kernel to check if there was a division by zero and handle it if necessary. Or every time a hardware interrupt happened, the kernel had to start polling all the devices to figure out which one just yelled. No. Just no. Running kernel code for all this would be way too slow.
So it's the processor that will have to detect traps and decide how to handle them. But what exactly it should do for a specific trap depends on all kinds of of particulars about that OS, e.g. a disk saying it's done reading from a file might require updating some file system data or storing the disk data in a specific buffer or something. That's too much responsibility for the processor.
Okay, so the kernel will set up a bunch of handler functions for every possible type of trap. Then it tells the hardware, "Okay, so if you get a disk interrupt, here are my instructions to handle that. For timer interrupts, use these instructions. If a process tries to access an invalid page, do this..." From then on, the processor can handle the traps without further input from the kernel by looking up the interrupt number in a big table to get the trap handler function that the kernel set up, then just running it.
In the x86 architecture, that table is called the interrupt descriptor table
or IDT. I know, I'm sorry, I promised I'd say "trap" for the general case, but
the x86 specs give it the official name of IDT even though it handles all the
traps. Sigh. It has 256 entries (so that's the maximum number of distinct traps
we can define); each one specifies a segment descriptor (ugh segmentation again,
you know what that means: opaque code) and an instruction pointer (%eip
) that
tell the processor where it can find the corresponding trap handler
function.
xv6 won't use all 256 entries; it'll mostly use trap numbers 0-31 (software exceptions), 32-63 (hardware interrupts), and 64 (system calls), all defined in traps.h. But we do have to stick all 256 in the IDT anway, so we're the unlucky fools who get to write 256 functions' worth of assembly code by hand. Nah, just kidding: xv6 uses a script in a high-level language to do that for us and spit out the entries into an assembly file.
Unfortunately for us, that high-level language is Perl. Sigh. Perl is infamous as a "write-only" language, so I guess instead we're just the unlucky fools who get to try reading Perl.
Okay, I'm not gonna assume you know Perl, and either way I really don't wanna go over every single line of this file. The syntax is similar enough to C's (except that somehow they managed to make it even worse than C), so you can read it on your own if you want.
Now, no script will be able to generate 256 completely unique assembly functions with enough detail to handle each trap correctly, so each function in the script has to be pretty generic. They're all gonna call the same assembly helper function, which will call a C function where we can more comfortably code up how to handle each interrupt.
The gist of this Perl script is that it prints a bunch of stuff using a for loop
with 256 iterations. The xv6
Makefile
will run it from the command line with ./vectors.pl > vectors.S
so that the
output gets saved in an assembly file, which will then get assembled together
with all the other kernel code in OBJS
.
The resulting assembly file will look like this:
.globl alltraps
.globl vector0
vector0:
pushl $0
pushl $0
jmp alltraps
.globl vector1
vector1:
pushl $0
pushl $1
jmp alltraps
.globl vector2
vector2:
pushl $0
pushl $2
jmp alltraps
# ...
Except that a handful of entries (8, 10 through 14, and 17) will skip one line (I'll explain why below):
# ...
.globl vector8
vector8:
pushl $8
jmp alltraps
# ...
Then at the end, it defines an array vectors
with each of those entries above:
# ...
.data
.globl vectors
vectors:
.long vector0
.long vector1
.long vector2
# ...
Okay, so those are all the handler functions; the vectors
array holds a
pointer to each one. They're all more or less the same: most of them push zero
onto the stack, then all they push a trap number to indicate which trap
just happened, and then they jump to a point in the code called alltraps
;
that's the assembly helper function I mentioned earlier.
A handful of the entries don't push zero on the stack: these are trap numbers
8 (a double fault, which happens when the processor encounters an error while
handling another trap), 10 (an invalid task state segment), 11 (segment
not present), 12 (a stack exception), 13 (a general protection fault), 14 (a
page fault), and 17 (an alignment check). These are special because the
processor will actually push an error code on the stack before calling into the
corresponding handler function in vectors
. It doesn't push any error codes on
the stack for the others, so we just push 0 ourselves to make them all match up.
The processor needs to run the trap handler in kernel mode, which means we have
to save some state for the process that's currently running so we can return to
it later (similar to the struct context
we saw before), then set things up to
run in kernel mode. The alltraps
routine does just that.
Remember how we said the IDT holds segment selectors for %cs
and %ss
, plus
and instruction pointer %eip
? (I know we haven't seen the code to create the
IDT and store the entries of vectors
in it yet; we'll get to that below.) The
processor will start using those segments (and save the old ones) before running
the trap handler function. Each trap handler function in vectors
above pushed
an error code (or 0) followed by a trap number. Now we have to push all the
other segment selectors on the stack one at a time, then push all the general-
purpose registers at once with the x86 instruction pushal
.
.globl alltraps
alltraps:
pushl %ds
pushl %es
pushl %fs
pushl %gs
pushal
# ...
Cool, all the registers are saved now. So now we'll set up the %ds
and %es
registers for kernel mode (%cs
and %ss
were already done by the processor).
# ...
movw $(SEG_KDATA<<3), %ax
movw %ax, %ds
movw %ax, %es
# ...
Now we're ready to call the C function trap()
that's gonna do most of the
work. That function expects a single argument: a pointer to the process's saved
register contents. Well, we just pushed them all on the stack, so we can just
use %esp
as that pointer.
# ...
pushl %esp
call trap
# ...
That function will return back here when it's done, so let's ignore the return value by moving the stack pointer just above it (essentially popping it off the stack).
# ...
addl $4, %esp
# ...
We've talked about this function before; when we create a new process, it starts
executing in forkret()
, which then returns into trapret()
. More generally,
any call to trap()
will return here as well.
This function just restores everything back to where it was before, popping
stored registers off the stack in reverse order. We can skip the trap number and
error code; we won't need them anymore. Then we use the iret
or "interrupt
return" (though you should read that as "trap return") instruction to close out,
return to user mode, and start executing the process's instructions again.
.globl trapret
trapret:
popal
popl %gs
popl %fs
popl %es
popl %ds
addl $0x8, %esp # skip the trap number and error code
iret
Okay, on to the main part of the code! We have to do two things here: stick the
trap handler functions in vectors
into an IDT, and figure out what to do with
each interrupt type.
At the top, we've got four global variables. The IDT is represented as an array
of struct gatedesc
s, defined in
mmu.h. It's worth
taking a look at because it uses an obscure C feature (bit fields); we'll do that
in the next section.
Then we declare the vectors
array of trap handler (with an extern
keyword,
since it's defined in an assembly file), a global counter ticks
that tracks
the number of timer interrupts so far (basically a rough timer), and a lock to
use with ticks
.
struct gatedesc idt[256];
extern uint vectors[];
struct spinlock tickslock;
uint ticks;
// ...
This section will get deep into the weeds, so feel free to skip it if you're having a nice day and don't want to spoil it by reading about a bunch of C standards.
So far, we've used bit flags with regular integers by manually doing some bit
arithmetic to set one bit at a time. For example, the flags for page table and
page directory entries are defined as powers of 2 (e.g., PTE_P
is 0x1, PTE_W
is 0x2, PTE_U
is 0x4, etc.) so that we can set a specific bit using a bitwise-
OR like pte |= PTE_U
or test whether it's set with a bitwise-AND like
pte & PTE_P
.
But sometimes that can get annoying and hard to keep track of; wouldn't it be nice if we could just have variables that represent a single bit? Or two bits, or any number of bits we want?
The trouble is that most computer architectures don't work with a single bit at
a time; they operate on bytes, words (2 bytes), long/double words (4 bytes), or
quad words (8 bytes), so it would be nontrivial to compile a line of C like
a = 1
if a
is a nonstandard size.
In fact, accessing variables that aren't aligned to a standard size (4 bytes on
x86 or 8 bytes on x86_64) is much slower than when they are aligned. Compilers
often optimize code to correct for this by padding struct
s so that they'll
line up along those standard sizes. For example, one like
struct nopadding {
int n;
};
is probably left the same on x86, but one like this:
struct padding {
char a;
int n;
char b;
};
is probably converted by the compiler into this:
struct padding {
char a;
char pad0[3];
int n;
char b;
char pad1[3];
};
WARNING: We're entering the dark arts of C's unspecified and implementation- defined behavior here. Note that these are different from undefined behavior: undefined behavior means you did something BAD BAD BAD like dereferencing a null pointer, freeing a memory region twice, using a variable after freeing it, accessing an out-of-bounds index in a buffer, or overflowing a signed data type. Implementation-defined and unspecified behavior aren't as dangerous as undefined behavior is, but they can cause portability issues.
The C standard is a huge document with a bunch of legalese rules about what makes C, well, C. People who write C compilers need to know exactly how C code should behave under all kinds of different circumstances, so the C standard spells most of it out. But there are some parts it intentionally leaves out.
Implementation-defined behavior means the C standard doesn't set any fixed requirements about how a compiler should handle some behavior or feature; the developers of a C compiler get to decide how to write that part of the code with total freedom. One example is the number of bits in a byte; we've been assuming it's 8, but there are some (dumb) architectures where it's different.
Unspecified behavior, on the other hand, means that the C Standard provides some specific options, and compiler developers have to choose from those options for each instance of the behavior in the code they're compiling (that means, don't assume it's always gonna be the same, even with the same compiler).
Structure padding is implementation-defined, and there are often implementation-
defined ways to modify it or disable it altogether (i.e., to pack the struct
instead of padding it), usually with stuff like __attribute__
s or #pragma
directives for the preprocessor.
Wait weren't we gonna talk about bit manipulation? Why are we talking about
struct
s? Well, C does have a workaround to make bit manipulation a little
easier by avoiding that slightly-annoying bit arithmetic you have to do to set
or clear flags in an int
or unsigned int
: it's called a bit field, and it
takes advantage of struct
padding.
You can specify the number of bits that a field of a struct
should occupy by
adding a colon and a size after the field name:
struct bitfield_example {
unsigned char a : 1;
unsigned char b : 7;
};
This way, you can set the single-bit flag a
with simple variable assignments
like var.a = 1
, and the compiler will figure out any necessary magic similar
to structure padding to make that happen. Awesome, right? So why haven't we been
using it all the time instead of all that opaque bit arithmetic with arcane
operators like <<
, >>
, |
, and &
?
Well, there are some big downsides to bit fields. First, the C standard sets
some strict rules on their use to make sure that compilers can figure out how to
handle them. Bit fields are only allowed inside of structures. You're not
allowed to create arrays of bit fields or pointers to bit fields. Functions
aren't allowed to return a bit field. You're not allowed to get the address of a
bit field with the &
operator. You can only operate on a single bit field at a
time in any statement; that means you can't set one bit field to equal another,
and you can't compare the values of two bit fields.
Second, they're extremely implementation-defined. Each implementation (read: compiler + architecture combo) determines what data types and sizes are allowed to be used in bit fields. The data types you can use might have different signedness rules from the usual ones for signed and unsigned types. How they're laid out, ordered, and padded in memory can differ. In short: the low-level details are a total black box that you can probably only figure out by reading deep into the compiler's specifications.
Now imagine trying to do something that requires specific protocols like sending data over a network, and you come across a bit field. Lolwut. Who knows what you'd have to do. Bit fields make it impossible to port your code.
BUT! Bit arithmetic is annoying, so let's use bit fields anyway!
Okay, so back to struct gatedesc
. IDT entries have to contain a 16-bit code
segment selector (%cs
), 16 low bits and 16 high bits for an offset in that
segment, the number of arguments for the handler function, a type, a system/
application flag, a descriptor privilege level (0 for kernel, 3 for user), and a
"present" flag. And x86 is very particular about how it's all laid out, so we
have to set up struct gatedesc
in the exact right order.
struct gatedesc {
uint off_15_0 : 16;
uint cs : 16;
uint args : 5;
uint rsv1 : 3;
uint type : 4;
uint s : 1;
uint dpl : 2;
uint p : 1;
uint off_31_16 : 16;
};
Well, okay, that's it for now.
This function loads all the assembly trap handler functions in vectors
into
the IDT. The SETGATE()
macro in
mmu.h will organize
each entry correctly. We said before that the IDT needs a code segment selector,
an instruction pointer (from vectors
), and a privilege level (0 for kernel
mode), so we'll stick those in.
void tvinit(void)
{
for (int i = 0; i < 256; i++) {
SETGATE(idt[i], 0, SEG_KCODE << 3, vectors[i], 0);
}
// ...
}
We're basically done now, but there's one last hiccup: user code needs to be able to generate system calls, but we just set all the privilege levels so only the kernel and processor can generate traps. So we'll fix the entry for system calls as a special case.
void tvinit(void)
{
// ...
SETGATE(idt[T_SYSCALL], 1, SEG_KCODE << 3, vectors[T_SYSCALL], DPL_USER);
// ...
}
Oh and while we're at it, let's just go ahead and initialize the lock for the tick counter.
void tvinit(void)
{
// ...
initlock(&tickslock, "time");
}
The last function stored all the trap vectors in the IDT, so now we need to tell
the processor where to find the IDT. There's a special assembly instruction for
that in x86 called lidt
.
void idtinit(void)
{
lidt(idt, sizeof(idt));
}
This last function is the one that gets called by the assembly code in alltraps
;
it's responsible for figuring out what to do based on the trap number we pushed
on the stack before. Heads up: it's gonna do that by calling a bunch of other
functions, many of which we haven't seen yet. I'll just give a quick summary
when we come across them, and we'll get to them later on.
The only argument is a pointer to a struct trapframe
. Wait, hang on. Up above
in the assembly code, the argument we pushed on the stack was %esp
, the stack
pointer, not a pointer to any struct trapframe
. What's up with that? Did we
pass the wrong kind of argument in?
Let's check out the definition for struct trapframe
, found in
x86.h. It's got a
bunch of fields, starting off with the general purpose registers (those are the
fields from %edi
to %eax
). Then it has four segment registers (fields %gs
through %ds
), plus some unused padding bits in between them to round the 16-
bit segment registers up to 32 bits. The next two fields are a trap number and
an error code.
All that should sound familiar. Take another look at
trapasm.S: so
far, those are the exact same things we pushed on the stack! The other fields
are what the processor pushed on the stack before calling the handler function
in the IDT. So basically, we're never gonna construct a struct trapframe
in C
code; we already constructed it manually in assembly. It just describes
everything that's already on the stack by the time this trap()
function gets
called. In that sense, the %esp
we pushed as an argument really is a pointer
to a struct trapframe
. It's a clever way to read values off the stack.
So we said we're gonna check the trap number and decide which kernel function to
call based on that, right? Let's start by checking if the trap number indicates
this is a system call (trap number 64, or T_SYSCALL
).
void trap(struct trapframe *tf)
{
if (tf->trapno == T_SYSCALL) {
// ...
}
// ...
}
Well how should we handle system calls? xv6 will have several, and we don't even
know what they all are yet. So let's procrastinate again and just call some
other function syscall()
to handle the work of figuring out which system call
to execute. Now we'll store the pointer to the struct trapframe
in that
process's struct proc
, obtained with a call to myproc()
. Also, processes
need to be killed once they're done, or if they cause an exception; that happens
by setting a killed
flag in the struct proc
. So we'll check for that before
and after carrying out the system call and close the process out with exit()
if it's due to be killed.
void trap(struct trapframe *tf)
{
if (tf->trapno == T_SYSCALL) {
if (myproc()->killed) {
exit();
}
myproc()->tf = tf;
syscall();
if (myproc()->killed) {
exit();
}
return;
}
// ...
}
Okay, now we have all the other trap numbers to think about. We could do them
with a ton of if
statements, but that would be a pain; we'll use a switch
statement instead. If you haven't seen switch
statements, they replace big
if-else
blocks with cases instead. The cases can only be indexed by integers,
and you have to stick a break
statement at the end or else you'll fall through
to the next case and execute the code found there as well. (To be honest, I
don't see a reason why the system call case wasn't just included in this same
switch statement; if you see a reason for that, let me know.)
void trap(struct trapframe *tf)
{
// ...
switch (tf->trapno) {
// cases go here
}
// ...
}
First up is the trap number for timer interrupts; the main function of timer
interrupts is to schedule a new process, but that will come further down in this
function. For now, we'll just increment the ticks
counter then call wakeup()
,
which checks if any processes went to sleep until the next tick; it'll switch to
running any process it finds. There's one detail to deal with here: the system
may have multiple processors, each with their own timer and interrupts. We want
to use the ticks
counter as a rough timer, but we don't know whether all the
CPU timers will be synchronized, so we'll only update ticks
using the first
CPU to avoid those issues.
If you read the post on interrupt controllers then you'll be familiar with
lapiceoi()
; if you didn't (or you forgot), it just tells the local interrupt
controller that we've read and acknowledged the current interrupt so it can
clear it and get ready for more interrupts.
void trap(struct trapframe *tf)
{
// ...
switch (tf->trapno) {
case T_IRQ0 + IRQ_TIMER:
if (cpuid() == 0) {
acquire(&tickslock);
ticks++;
wakeup(&ticks);
release(&tickslock);
}
lapiceoi();
break;
// ...
}
// ...
}
Later on, we'll see some interrupt handler functions for various devices:
ideintr()
handles disk interrupts, kbdintr()
for key presses and releases,
and uartintr()
for serial port data. We'll direct the corresponding interrupts
to those functions, then acknowledge and clear them with lapiceoi()
. Also,
devices occasionally generate spurious interrupts due to hardware malfunctions;
we'll either ignore them (if they're coming from the Bochs emulator) or print a
message about it to the console.
void trap(struct trapframe *tf)
{
// ...
switch (tf->trapno) {
// ...
case T_IRQ0 + IRQ_IDE: // disk interrupt
ideintr();
lapiceoi();
break;
case T_IRQ0 + IRQ_IDE + 1: // spurious Bochs disk interrupt
break;
case T_IRQ0 + IRQ_KBD: // keyboard interrupt
kbdintr();
lapiceoi();
break;
case T_IRQ + 7: // spurious interrupt-no break, FALL THROUGH
case T_IRQ + IRQ_SPURIOUS: // spurious interrupt
cprintf("cpu%d: spurious interrupt at %x:%x\n",
cpuid(), tf->cs, tf->eip);
lapiceoi();
break;
// ...
}
// ...
}
Okay, so now we've dealt with system calls and hardware interrupts, so any other
trap must be a software exception. switch
statements allow a catch-all case
with default
, so we'll use that to catch the rest of the trap numbers. Now,
this may have come from a kernel error or a misbehaving user process. We can
check with myproc()
, which returns a null pointer if we were running kernel
code or a pointer to a struct proc
if we were in user space, or by checking
the current privilege level in the code segment selector. Depending on the
source, we'll print out an appropriate error message and either panic (if in the
kernel) or mark the process so it gets killed soon.
void trap(struct trapframe *tf)
{
// ...
switch (tf->trapno) {
// ...
default:
if (myproc() == 0 || (tf->cs & 3) == 0) {
// Kernel code exception
cprintf("unexpected trap %d from cpu %d eip %x (cr2=0x%x)\n",
tf->trapno, cpuid(), tf->eip, rcr2());
panic("trap");
}
// User process exception
cprintf("pid %d %s: trap %d err %d on cpu %d "
"eip 0x%x addr 0x%x--kill proc\n",
myproc()->pid, myproc()->name, tf->trapno,
tf->err, cpuid(), tf->eip, rcr2());
myproc()->killed = 1;
}
// ...
}
The reason we don't kill it immediately is because the process might be executing some kernel code right now; for example, system calls allow other interrupts and exceptions to occur while they're being handled. Killing it now might corrupt whatever it's doing. So instead we just give it the kiss of death for now and come back to finish the job later.
So next up, we'll check if this trap was generated by a user process that's due
to be killed, and that process is running in ring 3. If so, we finally do
the deed with exit()
; otherwise if it's running in ring 0, it'll live for now
and get killed the next time it generates a trap instead.
void trap(struct trapframe *tf)
{
// ...
if (myproc() && myproc()->killed && (tf->cs & 3) == DPL_USER) {
exit();
}
// ...
}
Up above, the only thing a timer interrupt did was increment ticks
. But we
know a really important function of timer interrupts is to force a process to
let go of the CPU and let someone else run. It's time to do that. We'll check if
the process's state is RUNNING
and the trap was a timer interrupt; if so, we
call yield()
to let another process get scheduled on this CPU.
void trap(struct trapframe *tf)
{
// ...
if (myproc() && myproc()->state == RUNNING &&
tf->trapno == T_IRQ0 + IRQ_TIMER) {
yield();
}
// ...
}
Now we have one last check: a process that yielded, then got picked up again later might have been marked as killed in the meantime, so if it was, we need to finish it off now. So we do the exact same check as above again, and then we're done.
void trap(struct trapframe *tf)
{
// ...
if (myproc() && myproc()->killed && (tf->cs & 3) == DPL_USER) {
exit();
}
}
Note that this function will return into trapret
in the assembly code, which
will then send it back to user mode.
Let's take a moment to assess how much of xv6 we've already covered. Remember, the xv6 kernel has four main functions: (1) finishing the boot process that the boot loader started, (2) virtualizing resources in order to isolate processes from each other, (3) scheduling processes to run, and (4) interfacing between user processes and hardware devices. Let's take that as a checklist and go through those items now.
We've already seen some of the initialization routines that get run on boot in
main()
; most of the code there sets up virtual memory and all the hardware
devices. We still have a few more devices to talk about: the keyboard, serial
port, console, and disk; each of those has its own boot function that we'll need
to go over in order to wrap up point (1).
On the other hand, we're already done with (2) and (3): we spent a lot of time going over virtual memory and paging, and the last post on scheduling showed us how xv6 virtualizes the CPU as well as it runs processes.
The code we saw in this post was our introduction to point (4). Traps are the primary mechanism for user processes to communicate with the hardware; the kernel coordinates that communication by setting up trap handler functions. The code we've seen here basically acts like an usher, directing traps to the right trap handler function depending on its type.
When a trap occurs (x86 instruction int
), the processor will stop executing
code, find the IDT, and looks up the entry for that trap number. The script that
xv6 uses to generate the IDT entries just makes them all point to the same
function alltraps()
, which saves all the process's registers, switches into
kernel mode, and calls trap()
. Then that function uses the trap number to
figure out how the kernel wants it to respond to this particular trap. So any
hardware interrupt, software exception, or user system call will get funneled
into the functions here before getting dispatched to some other appropriate
kernel code that will know what to do with it.
We haven't finished point (4) yet, though: we have to actually see what each of
those trap handler functions does. But we did see some of them: for example, we
saw that a software exception either kills the process that caused it or panics
if it occurred in kernel code. That already takes care of one of the three types
of traps, so we're left with hardware interrupts and system calls. All the
system calls got redirected to a syscall()
function which we haven't seen yet.
We have seen how some of the hardware interrupts are dealt with: a timer
interrupt increments a ticks
counter (if it's on CPU 0), then calls yield()
to force a process to give up the CPU until the next scheduling round. Spurious
interrupts either get ignored or print a message to the console. But we've
procrastinated some of the others: disk interrupts call an ideintr()
function
to handle them, keyboard interrupts call kdbintr()
, and serial port interrupts
call uartintr()
, none of which we've gone over.
So in order to wrap up the xv6 kernel, we still have to understand how system calls are routed in general, as well as how devices are initialized at boot and how the kernel responds to specific system calls that require use of those devices. The general system call routing mechanism is up next.