Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integration with sanitizers #19

Closed
7 tasks done
mkow opened this issue Jan 21, 2021 · 8 comments
Closed
7 tasks done

Integration with sanitizers #19

mkow opened this issue Jan 21, 2021 · 8 comments

Comments

@mkow
Copy link
Member

mkow commented Jan 21, 2021

Description of the problem

It would be good to enable some sanitizers in CI (e.g. AddressSanitizer from Clang) to allow for better and more reliable detection of low-level bugs.

Current plan:

The tricky part is that we're running without stdlib and manage memory on our own (especially on SGX), so most likely we'll need some hacks to get ASan working.

Previous work: @yamahata and @stefanberger ran GCC's UBSan (see gramineproject/graphene#724, gramineproject/graphene#871, gramineproject/graphene#2089, gramineproject/graphene#2094, gramineproject/graphene#2097, gramineproject/graphene#2101, gramineproject/graphene#2102, gramineproject/graphene#2103). It's much less powerful than ASan, but still something! I think we can try to add it to CI as a first step, to have at least something.

@stefanberger
Copy link
Contributor

I tried to enable ASan as well but I ended up having to disable it in all functions that I ran it and ended up giving up...

@mkow
Copy link
Member Author

mkow commented Jan 21, 2021

Yeah, this definitely will require substantial efforts, but also give us quite huge gains.

@pwmarcz pwmarcz self-assigned this Sep 1, 2021
@mkow mkow transferred this issue from gramineproject/graphene Sep 9, 2021
@mkow mkow added this to the release v1.0 milestone Sep 9, 2021
@mkow mkow mentioned this issue Sep 9, 2021
@pwmarcz
Copy link
Contributor

pwmarcz commented Nov 23, 2021

Bugs found by UBSan (including earlier efforts by @stefanberger):

Bugs found by ASan:

@pwmarcz
Copy link
Contributor

pwmarcz commented Dec 3, 2021

use-after-return

I investigated implementing use-after-return in AddressSanitizer, and my conclusion right now is it's not worth doing. Here is a dump of what I found.

How it works

(Assumes some knowledge about how AddressSanitizer works, see the comments in asan.h).

This mode is for detecting access to local variables where the function has already returned (such as char foo[10]; return &foo;). The straightforward solution would be to poison such variables, as we do with everything else. However, the stack memory will soon be reused.

Instead, LLVM allocates an object called fake stack frame, and stores local variables there. On function return, this stack frame is marked as retired, and memory occupied by it is poisoned (but the memory is left alone for some time before reusing, the same as heap allocations).

The detailed implementation is more complex:

  • There's a set of callbacks, __asan_stack_mallocN(size) (where N is called "class id" and is between 0 and 10). The size is the size of memory requested, but the whole fake frame is expected to have size 1 << (6 + N) (i.e. class IDs correspond to sizes 64, 128, 256 etc).
  • For large frames, the corresponding callbacks __asan_stack_freeN(ptr, size) are supposed to mark the frame as retired and poison the memory.
  • However, for small frames, the ASan-instrumented code poisons the memory directly, and expects to have a flag pointer stored at the very end of frame. It writes 0 through that pointer, marking the frame as retired. Something like:
size_t frame_size = 1 << (6 + N);
uint8_t* flag_ptr = *(uint8_t**)(frame_addr + frame_size - sizeof(uint8_t*));
*flag_ptr = 0;

See asan_fake_stack.h and asan_fake_stack.cpp in LLVM for details.

Handling no-return

In addition, we need to detect situations where we never return, but a frame should be retired anyway (longjmp, throw, etc.) I haven't investigated too deeply, but it looks like LLVM's solution is:

  • A fake stack is kept for each thread, with a pointer in thread-local storage.
  • Under some circumstances (e.g. after a noreturn function), we perform a GC (garbage collection) run on the fake stack.
  • The GC run goes through all frames, and checks if their corresponding "real stack" values are lower than for the current stack frame. If so, these frames are marked as retired (since even though we didn't explicitly return from them, we abandoned them).
  • (Presumably this handles switching stacks e.g. during signal handling somehow - when we handle a signal on an alternate stack, we don't want to garbage-collect the normal stack...)

What would it take to implement in Gramine?

I think we would need:

  • Means of allocating/deallocating memory. The current ASan library is compiled separately for PAL/LibOS. However, PAL and LibOS functions often run on the same stack. So the handlers called from PAL would need to free memory allocated by LibOS handlers, and vice versa.
    • Note that we cannot easily e.g. use PAL's allocator, or in fact any PAL callbacks, for this. Most functions would call __asan_stack_malloc, including PAL's implementation of malloc and memory mapping. Our malloc takes a global lock, so it would deadlock.
    • Maybe ASan should manage its own range of address space?
  • Thread-local context: AddressSanitizer would need to remember the fake stack for a given thread.
    • That probably means a single pointer in TCB.
  • Clean up when abandoning stack: probably instead of the scheme outlined above, we can call a custom function like asan_clean_current_stack.
  • SGX memory: The fake stack for in-enclave code needs to reside in enclave memory. That's because various parts of Gramine assume that local variables are created in-enclave, and verify that (and SGX instructions such as EREPORT or EGETKEY also require in-enclave pointers).
    • This means we cannot keep the same memory area for e.g. PAL and URTS (outer PAL). While they don't share the same stack, that could make implementation more difficult.
    • This is different than ASan shadow memory, which is allocated outside enclave.

Experiment

I implemented a very quick-and-dirty version of this:

  • Memory allocator that just returns new memory from a huge, preallocated chunk of address space, and never frees the memory
  • gramine-direct only (SGX cannot use the same approach because the chunk is outside enclave)
  • No noreturn handling: most "interesting" functions return anyway

It found no issues on our regression tests, LTP, and when running some examples (bash, busybox, redis).

It's possible that noreturn handling was more important, or my implementation was flawed somehow. But overall, it looks underwhelming.

Conclusion

Based on the above, I think this is NOT worth the complexity right now. Even if we managed to solve the problems I mentioned, the end result would be easy to break and hard to debug.

  • There's the implementation problem of managing memory for ASan, which I don't know how to solve easily. The requirement for SGX memory also complicates things.
  • The simple experiment found no bugs in gramine-direct.
  • Overall, I think the integration would be too "invasive":
    • Taking stack allocations out of stack might break assumptions that our code might hold.
    • If ASan handlers call back into rest of Gramine for non-trivial things, more things can go wrong.
  • It's discouraging that this part of interface between libasan and instrumented code is so ugly. It might also be hard to get right.

@dimakuv
Copy link

dimakuv commented Dec 3, 2021

Thanks for the great analysis of use-after-return! I agree that this is too ugly, intrusive and complicated. Let's not do it. I'm especially skeptical about abandoned/switched stacks, this will probably require a lot of ugliness.

Could you mark this checkbox as done & strike-through or something in the very first message? Or @mkow should do it, I guess.

@boryspoplawski
Copy link
Contributor

gramine-direct only (SGX cannot use the same approach because the chunk is outside enclave)

I guess we could add this only to debug builds (so that it can be insecure) and then have such memory outside of the enclave in PAL SGX. This way we could uniformly handle this and not worry about in-enclave memory constraints. The minus would be it's constrained to debug builds, but you don't want to run asan on production anyway.

But that's just for the future in case we ever come back to this idea.

@pwmarcz
Copy link
Contributor

pwmarcz commented Dec 3, 2021

gramine-direct only (SGX cannot use the same approach because the chunk is outside enclave)

I guess we could add this only to debug builds (so that it can be insecure) and then have such memory outside of the enclave in PAL SGX. This way we could uniformly handle this and not worry about in-enclave memory constraints. The minus would be it's constrained to debug builds, but you don't want to run asan on production anyway.

Security is not the problem, the shadow memory for ASan is already outside of enclave. The problem is that I did not get this approach to work at all, because some places depend on local variables being in-enclave. If they're not, then some PAL functions error out on such variables (because they accept only in-enclave pointers), SGX instructions such as EREPORT and EGETKEY fail... and I haven't checked further but possibly more surprises lie ahead. I'd rather not have to circumvent/fix all of these places.

Since the quick-and-dirty approach I'm talking about also had a fatal flaw of, well, never freeing the memory, probably it would be better to keep a memory pool inside the enclave. But... even if we got this integration to work, it sounds annoying to maintain and easy to break.

@mkow
Copy link
Member Author

mkow commented Dec 3, 2021

Thanks for the detailed summary, Paweł!
It seems that we're done with ASan for now, so we can finally resolve this issue 🎉

@mkow mkow closed this as completed Dec 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants