lazy stack: inline assembly to pre-fault stack
This is the first of eight patches that implement optional support for the
so-called "lazy stack" feature. The lazy stack is explained in detail in issue #143;
it saves a substantial amount of memory when an application spawns many pthreads
with large stacks, by letting each stack grow dynamically as needed instead of
being pre-populated ahead of time.

The crux of this solution, like the previous versions, is based on the observation
that the OSv memory fault handler requires that both interrupts and preemption
be enabled when a fault is triggered. Therefore, if the stack is dynamically mapped,
we need to make sure that stack page faults NEVER happen in the relatively
few places of kernel code that execute with either interrupts or preemption
disabled. We satisfy this requirement by "pre-faulting" the stack - reading a byte
one page (4096 bytes) below the stack pointer - just before preemption or interrupts
are disabled. The problem is complicated by the fact that kernel code A
that disables preemption or interrupts may nest by calling another kernel
function B that also disables preemption or interrupts, in which case function
B should NOT try to pre-fault the stack; otherwise the fault handler will abort
because the constraint stated above would be violated. In short, we cannot "blindly"
or unconditionally pre-fault the stack in every place before interrupts or
preemption are disabled.
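
To illustrate the pattern, here is a sketch of such a call site (the call site itself
is hypothetical; ensure_next_stack_page() is the helper this patch adds below):

    void some_kernel_path()   // hypothetical code running on an application thread's stack
    {
        // Touch the page below the stack pointer while interrupts and preemption
        // are still enabled, so that any stack page fault is taken here.
        arch::ensure_next_stack_page();
        sched::preempt_disable();
        // ... code that must not page-fault on the stack and stays within
        // the pre-faulted page ...
        sched::preempt_enable();
    }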

Some of the previous solutions modified both arch::irq_disable() and
sched::preempt_disable() to check whether both preemption and interrupts
are enabled, and only then read a byte at the -4096 offset. Unfortunately,
this makes it more costly than envisioned by Nadav Har'El - instead of a single
instruction reading from memory, the compiler needs 4-5 instructions to check
whether preemption and interrupts are enabled and perform the relevant jump.
To make it worse, the implementation of arch::irq_enabled() is pretty expensive,
at least on x64, where it uses the stack via pushfq. To avoid that, the previous
solutions would add a new thread-local counter and pack the irq-disabling counter
together with the preemption one. But even with this optimization I found that
this approach degrades performance quite substantially. For example, the memory
allocation logic disables preemption in quite a few places (see core/mempool.cc),
and the corresponding test - misc-free-perf.cc - showed performance - the number
of malloc()/free() calls executed per second - degrade on average by 15-20%.
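
For reference, the rejected approach looked roughly like this (a simplified,
hypothetical sketch - not code from any of these patches):

    inline void preempt_disable()
    {
    #if CONF_lazy_stack
        // Checking both conditions costs several instructions plus a branch on
        // every call, and irq_enabled() itself is not cheap on x64.
        if (sched::preemptable() && arch::irq_enabled()) {
            char i;
            asm volatile("movb -4096(%%rsp), %0" : "=r"(i)); // the pre-fault read
        }
    #endif
        // ... original preempt_disable() logic (bump the per-thread counter) ...
    }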

So this latest version, implemented by this and the next 7 patches, takes a different
approach. Instead of putting conditional pre-faulting of the stack in both
irq_disable() and preempt_disable(), we analyze the OSv code to find all places
where irq_disable() and/or preempt_disable() is called directly (or sometimes
indirectly) and pre-fault the stack there only if necessary. This is obviously
far more laborious and more prone to human error (we can miss some places),
but it is far more performant (no noticeable performance degradation observed)
compared to the earlier versions described in the paragraph above.

As we analyze all call sites, we need to make some observations to help us
decide what exactly to do in each case:
- do nothing
- blindly pre-fault the stack (single instruction)
- conditionally pre-fault the stack (hopefully in very few places)

Rule 1: Do nothing if the call site in question ALWAYS executes in a kernel thread.

Rule 2: Do nothing if the call site executes on a populated stack - this includes the
        above, but also code executing on an interrupt, exception or syscall stack.

Rule 3: Do nothing if the call site executes when we know that either interrupts
        or preemption are already disabled. Good examples are an interrupt handler
        or code within WITH_LOCK(irq_lock) or WITH_LOCK(preemption_lock) blocks.

Rule 4: Pre-fault unconditionally if we know that BOTH preemption and interrupts
        are enabled. In most cases this can only be deduced by analysing where
        the particular function is called. In general, any such function called by
        user code, for example through libc, satisfies the condition. But sometimes
        it is tricky, because the kernel might be calling libc functions, such as malloc().

Rule 5: Otherwise, pre-fault the stack conditionally, determining dynamically at
        runtime: only if sched::preemptable() and irq::enabled() (see the sketch below).
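
To summarize rules 4 and 5, the two pre-faulting treatments look roughly like this
(a sketch only - the call sites are hypothetical; ensure_next_stack_page() is the
helper added by this patch):

    // Rule 4: both preemption and interrupts are known to be enabled here,
    // so pre-fault "blindly" - a single memory read.
    arch::ensure_next_stack_page();
    sched::preempt_disable();

    // Rule 5: cannot tell statically, so check dynamically first.
    if (sched::preemptable() && arch::irq_enabled()) {
        arch::ensure_next_stack_page();
    }
    arch::irq_disable();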

One general rule is that any potential stack page fault happens on an application
thread's stack when some kernel code gets executed down the call stack.

In general, we identify the call sites in the following categories:
- direct calls to arch::irq_disable() and arch::irq_disable_notrace() (tracepoints)
- direct calls to sched::preempt_disable()
- code using WITH_LOCK() with instance of irq_lock_type or irq_save_lock_type
- code using WITH_LOCK(preempt_lock)
- code using WITH_LOCK(osv::rcu_read_lock)

The above locations can be found with a simple grep, but also with an IDE like
CLion from JetBrains, which can more efficiently find all direct and, more
importantly, indirect usages of the call sites identified above.
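
For example, a search along these lines (illustrative only - the exact patterns
may need tweaking) finds the direct call sites:

    grep -rn --include='*.cc' --include='*.hh' \
        -e 'irq_disable' -e 'preempt_disable' \
        -e 'WITH_LOCK(irq_lock' -e 'WITH_LOCK(preempt_lock' -e 'rcu_read_lock' .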

So this patch lays the groundwork by defining the inline assembly to pre-fault
the stack where necessary and introduces two build parameters -
CONF_lazy_stack and CONF_lazy_stack_invariant - that are disabled by default.
The first one is used in all places to enable the lazy stack logic, and the second
one is used to add code with some related invariants that will help us reason
about the code and decide whether we should do nothing, pre-fault the stack
"blindly", or pre-fault it conditionally.
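
The flags become the CONF_lazy_stack and CONF_lazy_stack_invariant preprocessor
macros (see the Makefile change below), so the later patches can guard call sites
roughly like this (a hypothetical sketch of what such a call site may look like):

    #if CONF_lazy_stack_invariant
        assert(sched::preemptable() && arch::irq_enabled());
    #endif
    #if CONF_lazy_stack
        arch::ensure_next_stack_page();
    #endif
        arch::irq_disable();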

The remaining 7 patches mostly add the pre-fault code in the relevant places,
but also annotate the code with some invariants using assert().

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
wkozaczuk committed Oct 16, 2022
1 parent c83247e commit f5684d9
Showing 4 changed files with 32 additions and 1 deletion.
3 changes: 2 additions & 1 deletion Makefile
@@ -371,7 +371,8 @@ $(out)/bsd/%.o: INCLUDES += -isystem bsd/
 # for machine/
 $(out)/bsd/%.o: INCLUDES += -isystem bsd/$(arch)
 
-configuration-defines = conf-preempt conf-debug_memory conf-logger_debug conf-debug_elf
+configuration-defines = conf-preempt conf-debug_memory conf-logger_debug conf-debug_elf \
+conf-lazy_stack conf-lazy_stack_invariant
 
 configuration = $(foreach cf,$(configuration-defines), \
 	-D$(cf:conf-%=CONF_%)=$($(cf)))
14 changes: 14 additions & 0 deletions arch/aarch64/arch.hh
@@ -20,6 +20,20 @@ namespace arch {
 #define INSTR_SIZE_MIN 4
 #define ELF_IMAGE_START (OSV_KERNEL_VM_BASE + 0x10000)
 
+#if CONF_lazy_stack
+inline void ensure_next_stack_page() {
+    u64 i, offset = -4096;
+    asm volatile("ldr %0, [sp, %1]" : "=r"(i) : "r"(offset));
+}
+
+inline void ensure_next_two_stack_pages() {
+    u64 i, offset = -4096;
+    asm volatile("ldr %0, [sp, %1]" : "=r"(i) : "r"(offset));
+    offset = -8192;
+    asm volatile("ldr %0, [sp, %1]" : "=r"(i) : "r"(offset));
+}
+#endif
+
 inline void irq_disable()
 {
     processor::irq_disable();
13 changes: 13 additions & 0 deletions arch/x64/arch.hh
@@ -20,6 +20,19 @@ namespace arch {
 #define INSTR_SIZE_MIN 1
 #define ELF_IMAGE_START OSV_KERNEL_BASE
 
+#if CONF_lazy_stack
+inline void ensure_next_stack_page() {
+    char i;
+    asm volatile("movb -4096(%%rsp), %0" : "=r"(i));
+}
+
+inline void ensure_next_two_stack_pages() {
+    char i;
+    asm volatile("movb -4096(%%rsp), %0" : "=r"(i));
+    asm volatile("movb -8192(%%rsp), %0" : "=r"(i));
+}
+#endif
+
 inline void irq_disable()
 {
     processor::cli();
3 changes: 3 additions & 0 deletions conf/base.mk
@@ -13,3 +13,6 @@ conf-DEBUG_BUILD=0
 conf-debug_elf=0
 conf_hide_symbols=0
 conf_linker_extra_options=
+
+conf-lazy_stack=0
+conf-lazy_stack_invariant=0