Integration with sanitizers #19

mkow · 2021-01-21T00:40:29Z

Description of the problem

It would be good to enable some sanitizers in CI (e.g. AddressSanitizer from Clang) to allow for better and more reliable detection of low-level bugs.

Current plan:

Clang support (Clang support graphene#1794; done, but not tested in CI yet).
Enable Clang's UBSan in CI (Prepare for UBSan and Clang graphene#2657, [WIP] UBSan graphene#2653).
Enable Clang's ASan in CI for heap allocations (First version of AddressSanitizer #32).
Enable Clang's ASan in CI for global variables (ASan: Enable for globals #251).
Enable Clang's ASan in CI for stack allocations
- Enable stack sanitization (Enable stack sanitization #240)
- ~~(if feasible) Detect also use-after-return~~ - skipped, see comment

The tricky part is that we're running without stdlib and manage memory on our own (especially on SGX), so most likely we'll need some hacks to get ASan working.

Previous work: @yamahata and @stefanberger ran GCC's UBSan (see gramineproject/graphene#724, gramineproject/graphene#871, gramineproject/graphene#2089, gramineproject/graphene#2094, gramineproject/graphene#2097, gramineproject/graphene#2101, gramineproject/graphene#2102, gramineproject/graphene#2103). It's much less powerful than ASan, but still something! I think we can try to add it to CI as a first step, to have at least something.

stefanberger · 2021-01-21T00:54:27Z

I tried to enable ASan as well but I ended up having to disable it in all functions that I ran it and ended up giving up...

mkow · 2021-01-21T01:40:35Z

Yeah, this definitely will require substantial efforts, but also give us quite huge gains.

pwmarcz · 2021-11-23T14:27:29Z

Bugs found by UBSan (including earlier efforts by @stefanberger):

Alignment issues (c18469c, ee41364, c7ac847, dde7712, eb3c58f, 11dbcaf)
Pointer overflow (d35f016), arithmetic on null pointer (419009c)
Signed integer overflow (fb9582f)
Standard library functions invoking UB (ca469e4)
Buffer overflow in a test case (7b6b1a9)

Bugs found by ASan:

Use-after-free in linked list traversal (b8c121c)
Off-by-one in stack allocation without a null terminator (d2645b4)
Use after scope of a stack-allocated buffer in /proc (9570347)
Buffer overflow when generating /proc file content (5b84296)
Buffer overflow in IPC code calculating wrong message size (6ec28c5)
Bug/design flaw in protected files memory mapping (Unmapping protected files does not work properly #231)

pwmarcz · 2021-12-03T10:34:54Z

`use-after-return`

I investigated implementing use-after-return in AddressSanitizer, and my conclusion right now is it's not worth doing. Here is a dump of what I found.

How it works

(Assumes some knowledge about how AddressSanitizer works, see the comments in asan.h).

This mode is for detecting access to local variables where the function has already returned (such as char foo[10]; return &foo;). The straightforward solution would be to poison such variables, as we do with everything else. However, the stack memory will soon be reused.

Instead, LLVM allocates an object called fake stack frame, and stores local variables there. On function return, this stack frame is marked as retired, and memory occupied by it is poisoned (but the memory is left alone for some time before reusing, the same as heap allocations).

The detailed implementation is more complex:

There's a set of callbacks, __asan_stack_mallocN(size) (where N is called "class id" and is between 0 and 10). The size is the size of memory requested, but the whole fake frame is expected to have size 1 << (6 + N) (i.e. class IDs correspond to sizes 64, 128, 256 etc).
For large frames, the corresponding callbacks __asan_stack_freeN(ptr, size) are supposed to mark the frame as retired and poison the memory.
However, for small frames, the ASan-instrumented code poisons the memory directly, and expects to have a flag pointer stored at the very end of frame. It writes 0 through that pointer, marking the frame as retired. Something like:

size_t frame_size = 1 << (6 + N);
uint8_t* flag_ptr = *(uint8_t**)(frame_addr + frame_size - sizeof(uint8_t*));
*flag_ptr = 0;

See asan_fake_stack.h and asan_fake_stack.cpp in LLVM for details.

Handling no-return

In addition, we need to detect situations where we never return, but a frame should be retired anyway (longjmp, throw, etc.) I haven't investigated too deeply, but it looks like LLVM's solution is:

A fake stack is kept for each thread, with a pointer in thread-local storage.
Under some circumstances (e.g. after a noreturn function), we perform a GC (garbage collection) run on the fake stack.
The GC run goes through all frames, and checks if their corresponding "real stack" values are lower than for the current stack frame. If so, these frames are marked as retired (since even though we didn't explicitly return from them, we abandoned them).
(Presumably this handles switching stacks e.g. during signal handling somehow - when we handle a signal on an alternate stack, we don't want to garbage-collect the normal stack...)

What would it take to implement in Gramine?

I think we would need:

Means of allocating/deallocating memory. The current ASan library is compiled separately for PAL/LibOS. However, PAL and LibOS functions often run on the same stack. So the handlers called from PAL would need to free memory allocated by LibOS handlers, and vice versa.
- Note that we cannot easily e.g. use PAL's allocator, or in fact any PAL callbacks, for this. Most functions would call __asan_stack_malloc, including PAL's implementation of malloc and memory mapping. Our malloc takes a global lock, so it would deadlock.
- Maybe ASan should manage its own range of address space?
Thread-local context: AddressSanitizer would need to remember the fake stack for a given thread.
- That probably means a single pointer in TCB.
Clean up when abandoning stack: probably instead of the scheme outlined above, we can call a custom function like asan_clean_current_stack.
SGX memory: The fake stack for in-enclave code needs to reside in enclave memory. That's because various parts of Gramine assume that local variables are created in-enclave, and verify that (and SGX instructions such as EREPORT or EGETKEY also require in-enclave pointers).
- This means we cannot keep the same memory area for e.g. PAL and URTS (outer PAL). While they don't share the same stack, that could make implementation more difficult.
- This is different than ASan shadow memory, which is allocated outside enclave.

Experiment

I implemented a very quick-and-dirty version of this:

Memory allocator that just returns new memory from a huge, preallocated chunk of address space, and never frees the memory
gramine-direct only (SGX cannot use the same approach because the chunk is outside enclave)
No noreturn handling: most "interesting" functions return anyway

It found no issues on our regression tests, LTP, and when running some examples (bash, busybox, redis).

It's possible that noreturn handling was more important, or my implementation was flawed somehow. But overall, it looks underwhelming.

Conclusion

Based on the above, I think this is NOT worth the complexity right now. Even if we managed to solve the problems I mentioned, the end result would be easy to break and hard to debug.

There's the implementation problem of managing memory for ASan, which I don't know how to solve easily. The requirement for SGX memory also complicates things.
The simple experiment found no bugs in gramine-direct.
Overall, I think the integration would be too "invasive":
- Taking stack allocations out of stack might break assumptions that our code might hold.
- If ASan handlers call back into rest of Gramine for non-trivial things, more things can go wrong.
It's discouraging that this part of interface between libasan and instrumented code is so ugly. It might also be hard to get right.

dimakuv · 2021-12-03T14:04:13Z

Thanks for the great analysis of use-after-return! I agree that this is too ugly, intrusive and complicated. Let's not do it. I'm especially skeptical about abandoned/switched stacks, this will probably require a lot of ugliness.

Could you mark this checkbox as done & strike-through or something in the very first message? Or @mkow should do it, I guess.

boryspoplawski · 2021-12-03T15:13:01Z

gramine-direct only (SGX cannot use the same approach because the chunk is outside enclave)

I guess we could add this only to debug builds (so that it can be insecure) and then have such memory outside of the enclave in PAL SGX. This way we could uniformly handle this and not worry about in-enclave memory constraints. The minus would be it's constrained to debug builds, but you don't want to run asan on production anyway.

But that's just for the future in case we ever come back to this idea.

pwmarcz · 2021-12-03T16:06:02Z

gramine-direct only (SGX cannot use the same approach because the chunk is outside enclave)

I guess we could add this only to debug builds (so that it can be insecure) and then have such memory outside of the enclave in PAL SGX. This way we could uniformly handle this and not worry about in-enclave memory constraints. The minus would be it's constrained to debug builds, but you don't want to run asan on production anyway.

Security is not the problem, the shadow memory for ASan is already outside of enclave. The problem is that I did not get this approach to work at all, because some places depend on local variables being in-enclave. If they're not, then some PAL functions error out on such variables (because they accept only in-enclave pointers), SGX instructions such as EREPORT and EGETKEY fail... and I haven't checked further but possibly more surprises lie ahead. I'd rather not have to circumvent/fix all of these places.

Since the quick-and-dirty approach I'm talking about also had a fatal flaw of, well, never freeing the memory, probably it would be better to keep a memory pool inside the enclave. But... even if we got this integration to work, it sounds annoying to maintain and easy to break.

mkow · 2021-12-03T23:55:26Z

Thanks for the detailed summary, Paweł!
It seems that we're done with ASan for now, so we can finally resolve this issue 🎉

pwmarcz self-assigned this Sep 1, 2021

mkow transferred this issue from gramineproject/graphene Sep 9, 2021

mkow added feature request P: 1 labels Sep 9, 2021

mkow added this to the release v1.0 milestone Sep 9, 2021

mkow mentioned this issue Sep 9, 2021

Some ASAN fixes #18

Merged

pwmarcz mentioned this issue Sep 14, 2021

First version of AddressSanitizer #32

Merged

mkow mentioned this issue Sep 15, 2021

Security mitigations and hardening #54

Closed

4 tasks

pwmarcz mentioned this issue Nov 22, 2021

Enable stack sanitization #240

Merged

mkow closed this as completed Dec 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integration with sanitizers #19

Integration with sanitizers #19

mkow commented Jan 21, 2021 •

edited by pwmarcz

Loading

stefanberger commented Jan 21, 2021

mkow commented Jan 21, 2021

pwmarcz commented Nov 23, 2021

pwmarcz commented Dec 3, 2021

dimakuv commented Dec 3, 2021

boryspoplawski commented Dec 3, 2021

pwmarcz commented Dec 3, 2021

mkow commented Dec 3, 2021

Integration with sanitizers #19

Integration with sanitizers #19

Comments

mkow commented Jan 21, 2021 • edited by pwmarcz Loading

Description of the problem

stefanberger commented Jan 21, 2021

mkow commented Jan 21, 2021

pwmarcz commented Nov 23, 2021

pwmarcz commented Dec 3, 2021

use-after-return

How it works

Handling no-return

What would it take to implement in Gramine?

Experiment

Conclusion

dimakuv commented Dec 3, 2021

boryspoplawski commented Dec 3, 2021

pwmarcz commented Dec 3, 2021

mkow commented Dec 3, 2021

mkow commented Jan 21, 2021 •

edited by pwmarcz

Loading

`use-after-return`