Stress-ng test suite causing memory fault and heap corruption failure in Graphene #2419
Comments
Thanks @jinengandhi-intel for the detailed bug report! This is easy to reproduce. I spent a couple of hours debugging this, and the culprit is always the same: our IPC communication code at the LibOS layer. In particular, all that stuff in The immediate bug is something like this: we have lists of objects and operate on them using The most interesting part here is that So yes, there is some memory corruption happening somewhere inside At this point I decided to stop, because this is the code that @boryspoplawski is currently fixing. Borys, if you want a test for your IPC rewrite, this
Some quick comments on this The program spawns several processes. Each process has a bunch of threads (around 6; I don't understand why this number). Each thread does the same thing -- e.g., repeatedly opens the file Specifying Specifying Note that
Btw, putting watchpoints on @pwmarcz Do you know about this behavior of GDB? Do you have an idea why it behaves this way? Do you know of any workarounds?
Not sure if this is exactly the same thing, but sometimes GDB outright freezes for me with some applications that use fork (even simple regression tests). I haven't been able to debug it yet.
@dimakuv @jinengandhi-intel However, the mknod issue still exists and procfs throws a different error. Procfs:
mknod:
Interesting, thanks for testing! @pwmarcz You'll want to take a look at this.
I wasn't able to reproduce the procfs issue, but I fixed a crash in I was able to reproduce the mknod one and I'm working on a fix.
@pwmarcz @mkow I don't see the Internal memory fault with the mknod test, but I see it 3/5 times with the procfs test. Logs below:
Initially I couldn't reproduce the procfs crash, but I found out that it appears only on Ubuntu 18.04 (stress-ng 0.09.25), not on Ubuntu 20.04 (stress-ng 0.11.07). The exact line in stress-ng causing the crash is here: stress-procfs.c:129 It looks like we're calling Some later commits (ColinIanKing/stress-ng@b6c62a3, ColinIanKing/stress-ng@5a598e5) replace this global pointer with a global array, and it looks like stress-ng 0.11.07 uses locks correctly. In conclusion, it looks like a bug in stress-ng, as the path changes while Graphene is processing it. It's not good that Graphene crashes on it, though: I guess we could mitigate the impact of such bugs by making sure that we read user data only once. Or at least copy the filename at the beginning of the path lookup function? CC @dimakuv, I remember you investigating exactly this issue before. What do you think?
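To make the application-side pattern concrete, here is a minimal sketch of a shared path held in a fixed array and guarded by a lock, with readers taking a private copy before using it. This is not the actual stress-ng code: the names `g_shared_path`, `g_path_lock`, `shared_path_set`, `shared_path_get` and the size `SKETCH_PATH_MAX` are all hypothetical and only illustrate the idea described in the comment above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PATH_MAX 4096 /* hypothetical limit, chosen for this sketch only */

/* Hypothetical shared state: a fixed array instead of a pointer that another
 * thread could redirect while a reader is still using it. */
static char g_shared_path[SKETCH_PATH_MAX];
static pthread_mutex_t g_path_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writer side: publish a new path under the lock.
 * Returns 0 on success, -1 if the path does not fit. */
static int shared_path_set(const char* path) {
    size_t len = strlen(path);
    if (len >= sizeof(g_shared_path))
        return -1;
    pthread_mutex_lock(&g_path_lock);
    memcpy(g_shared_path, path, len + 1); /* include the terminating NUL */
    pthread_mutex_unlock(&g_path_lock);
    return 0;
}

/* Reader side: take a private copy under the lock. The caller then passes the
 * private copy (not the shared array) to open() and similar calls.
 * Returns 0 on success, -1 if `out` is too small. */
static int shared_path_get(char* out, size_t out_size) {
    assert(out != NULL);
    int ret = -1;
    pthread_mutex_lock(&g_path_lock);
    size_t len = strlen(g_shared_path);
    if (len < out_size) {
        memcpy(out, g_shared_path, len + 1);
        ret = 0;
    }
    pthread_mutex_unlock(&g_path_lock);
    return ret;
}
```

With this shape, a later `shared_path_set()` from another thread cannot change the string that a reader has already handed to a syscall, because the syscall only ever sees the reader's private buffer.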
@pwmarcz I did only superficial analysis of this, but here's my dump: The assert fires here:
I am looking at the code of Looking at stress-ng changes, it indeed looks like there was a bug in the locking scheme (
I think we should copy the filename at the beginning of each syscall-entry function that uses Though I would highly prefer the explicit copying in each syscall-entry func.
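A rough sketch of what "copy the filename at the beginning" could look like is below. It is not Graphene code: `snapshot_user_path` and `SKETCH_PATH_MAX` are hypothetical names, and the sketch assumes the user pointer was already checked for readability. The point it illustrates is that every byte of the user string is read exactly once, so later lookup code works on a buffer the application cannot modify.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PATH_MAX 4096 /* hypothetical limit, chosen for this sketch only */

/* Hypothetical helper: snapshot a user-supplied path into a private buffer.
 * The copy is done in a single pass (no separate strlen followed by memcpy),
 * so each user byte is read once and the result is always NUL-terminated.
 * Returns 0 on success, -EFAULT for a NULL pointer, -ENAMETOOLONG if the
 * string does not fit into `buf`. */
static int snapshot_user_path(const char* user_path, char* buf, size_t buf_size) {
    assert(buf != NULL && buf_size > 0);
    if (!user_path)
        return -EFAULT;

    for (size_t i = 0; i < buf_size; i++) {
        /* volatile read: the application may be rewriting this memory */
        char c = ((const volatile char*)user_path)[i];
        buf[i] = c;
        if (c == '\0')
            return 0;
    }

    buf[buf_size - 1] = '\0'; /* keep the buffer a valid C string */
    return -ENAMETOOLONG;
}
```

A syscall-entry function would call such a helper once on a local buffer of `SKETCH_PATH_MAX` bytes and pass only that buffer to the path lookup code.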
I guess we do not do that currently for performance reasons, as we do not really have to do that (if the app is not buggy). Linux has to do that for security reasons (but we do not consider user app vs Graphene to be a security boundary).
@boryspoplawski True, but sometimes we do run into this kind of buggy app. Plus, the performance drop should be negligible. So it sounds like a reasonable thing to do.
I agree with Borys, Linux has to do this, but that's because its security model is opposite to ours.
We shouldn't try to patch around such bugs in apps. It's just additional complexity for which I don't see a good enough justification.
Ok, I agree that it doesn't make sense to "hide" bugs in the user application. On the other hand, I don't think it has any detrimental effect on performance or complexity. But I agree, if there is no good justification to add something to Graphene, then we shouldn't add it. And here we don't have a good justification. So I guess we should just recommend using stress-ng 0.11.07 or newer.
I think there is a justification: the bug causes Graphene to behave in internally inconsistent ways (break assertions, etc.), this is confusing to us as Graphene developers, and we waste time trying to find a bug in Graphene[1]. That could be avoided if we copied the path, and could depend on it not changing afterwards. We already usually check if user memory is readable, and return [1] @dimakuv suspected that "somewhere in pseudo/proc FS code, we have temporary substitute like this
This also leads to an interesting question: if an application (perhaps inadvertently) relies on a side-effect of the Linux implementation, do we emulate this side-effect in Graphene? Or do we declare this application buggy/not conformant?
I guess that depends on the complexity and impact of such quirks. Currently we do support some, and I would say we should, as we emulate Linux (Graphene is not just some UNIX system), but if the implementation were hard and the particular quirk weird, I would say we could ignore it. As for adding path copying to open-like syscalls: I do not directly oppose it (because at some level I agree with @pwmarcz that this could save us - Graphene devs - some debugging), but I would definitely not make this a rule (trying to circumvent app bugs inside Graphene).
Actually, shouldn't this particular bug also fail sporadically on Linux? And in general, this whole class of bugs. If the path changes in the meantime, then it can also start changing during the kernel copy, and the kernel would get a partially updated version?
One of the commit messages mentions unexplained errors on some systems, so maybe it did fail on Linux. But kernel copy should take much less time, so it would rarely fail anyway. Graphene lookup takes locks and performs I/O while traversing the path, so there is more time for the string to change. If we add the proposed fix to Graphene and run the old version of
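The "partially updated version" case from the two comments above can be simulated deterministically. The sketch below is purely illustrative: `copy_with_interleaving` and `rewrite_to_proc_bb` are hypothetical names, and the hook stands in for another thread rewriting the source buffer part-way through a byte-by-byte copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: stands in for a concurrent writer thread. */
typedef void (*mutate_fn)(char* src);

/* Copy `n` bytes one at a time and invoke `hook` just before byte `hook_at`
 * is copied, simulating a writer that runs in the middle of the copy. */
static void copy_with_interleaving(char* dst, char* src, size_t n,
                                   size_t hook_at, mutate_fn hook) {
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        if (i == hook_at && hook)
            hook(src);
        dst[i] = src[i];
    }
}

/* Simulated writer: replaces the whole path, including the NUL terminator. */
static void rewrite_to_proc_bb(char* src) {
    memcpy(src, "/proc/bb", 9);
}
```

Starting from a source buffer holding `/sys/aaa` and firing the hook before byte 4, the first four bytes of the destination come from the old string and the rest from the new one, so the copy is a path that no thread ever wrote. A single copy therefore keeps the consumer internally consistent, but it does not make the racy application correct.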
So, at least to me, it seems that:
So, overall I'm rather opposed to implementing this, mostly because of the first point.
Ok, looks like the consensus is that bugs in the app are allowed to crash/trigger asserts in Graphene. Then the solution for this issue is to ask users to use a newer stress-ng version. And there is nothing to do in Graphene. Should we close this issue then, @jinengandhi-intel?
Sure, we will upgrade the systems and try the newer version in the coming sprint and raise a separate issue if we still see similar errors. For now, closing the issue.
Description of the problem
While trying to enable the different stressors with Graphene, I am seeing several failures from Graphene: memory faults, heap corruption, and others.
Failure 1: Testing procfs.
Run 1:
$ graphene-direct stress-ng --procfs 8 --timeout 40s
error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!
error: Forwarding host environment variables to the app is enabled. Graphene will continue application execution, but this configuration must not be used in production!
stress-ng: info: [1] dispatching hogs: 8 procfs
stress-ng: error: [1] glob on regex "/sys/devices/system/cpu/cpu0/cache/index[0-9]*" failed: 1
stress-ng: info: [1] cache allocate: using built-in defaults as unable to determine cache details
[P15695:T2:stress-ng] error: Internal memory fault at 0x100000013 (IP = +0x2d234, VMID = 15695, TID = 2)
[P15697:i1:stress-ng] error: IPC worker: unexpected event (4) on exit handle
[P15711:i1:stress-ng] error: Internal memory fault at 0x100000014 (IP = +0x39e67, VMID = 15711, TID = 0)
[P15718:i1:stress-ng] error: Internal memory fault at 0x100000014 (IP = +0x39e67, VMID = 15718, TID = 0)
[P15725:i1:stress-ng] error: Internal memory fault at 0x100000014 (IP = +0x39e67, VMID = 15725, TID = 0)
[P15732:i1:stress-ng] error: Internal memory fault at 0x100000014 (IP = +0x39e67, VMID = 15732, TID = 0)
[P15739:i1:stress-ng] error: Internal memory fault at 0x100000014 (IP = +0x39e67, VMID = 15739, TID = 0)
stress-ng: info: [1] unsuccessful run completed in 40.06s
Run 2:
$ graphene-direct stress-ng --procfs 8 --timeout 40s
error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!
error: Forwarding host environment variables to the app is enabled. Graphene will continue application execution, but this configuration must not be used in production!
stress-ng: info: [1] dispatching hogs: 8 procfs
stress-ng: error: [1] glob on regex "/sys/devices/system/cpu/cpu0/cache/index[0-9]*" failed: 1
stress-ng: info: [1] cache allocate: using built-in defaults as unable to determine cache details
[P16061:i1:stress-ng] error: Internal memory fault at 0x100000012 (IP = +0x39e67, VMID = 16061, TID = 0)
[P16059:T2:stress-ng] error: Internal memory fault at 0x100000014 (IP = +0x2d234, VMID = 16059, TID = 2)
[P16075:i1:stress-ng] error: Internal memory fault at 0x100000014 (IP = +0x39e67, VMID = 16075, TID = 0)
[P16089:i1:stress-ng] error: Internal memory fault at 0x100000014 (IP = +0x39e67, VMID = 16089, TID = 0)
error: *** Unexpected memory fault occurred inside PAL (PID = 16093, TID = 16102, RIP = +0x00004c30)! ***
error: *** Unexpected memory fault occurred inside PAL (PID = 16103, TID = 16109, RIP = +0x00004c30)! ***
[P16068:i1:stress-ng] error: Internal memory fault at 0x100000016 (IP = +0x39e67, VMID = 16068, TID = 0)
[P16082:T6:stress-ng] error: Failed to send IPC msg to 16075: -32
stress-ng: info: [1] unsuccessful run completed in 40.08s
Failure 2: Testing getdents
Run 1:
$ graphene-direct stress-ng --getdent 8 --timeout 60s
error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!
error: Forwarding host environment variables to the app is enabled. Graphene will continue application execution, but this configuration must not be used in production!
stress-ng: info: [1] dispatching hogs: 8 getdent
stress-ng: error: [1] glob on regex "/sys/devices/system/cpu/cpu0/cache/index[0-9]*" failed: 1
stress-ng: info: [1] cache allocate: using built-in defaults as unable to determine cache details
[P14981:T6:stress-ng] assert failed ../LibOS/shim/include/../../../common/include/slabmgr.h:400 *m == SLAB_CANARY_STRING
[P14990:i1:stress-ng] error: Internal memory fault at 0x100000017 (IP = +0x39e67, VMID = 14990, TID = 0)
[P14970:T2:stress-ng] assert failed ../LibOS/shim/include/../../../common/include/slabmgr.h:400 *m == SLAB_CANARY_STRING
[P14972:T3:stress-ng] Heap corruption detected: invalid heap level 8
stress-ng: info: [1] unsuccessful run completed in 60.12s (1 min, 0.12 secs)
Run 2:
$ graphene-direct stress-ng --getdent 8 --timeout 60s
error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!
error: Forwarding host environment variables to the app is enabled. Graphene will continue application execution, but this configuration must not be used in production!
stress-ng: info: [1] dispatching hogs: 8 getdent
stress-ng: error: [1] glob on regex "/sys/devices/system/cpu/cpu0/cache/index[0-9]*" failed: 1
stress-ng: info: [1] cache allocate: using built-in defaults as unable to determine cache details
[P15426:T6:stress-ng] assert failed ../LibOS/shim/include/../../../common/include/slabmgr.h:400 *m == SLAB_CANARY_STRING
[P15429:T7:stress-ng] assert failed ../LibOS/shim/include/../../../common/include/slabmgr.h:400 *m == SLAB_CANARY_STRING
[P15432:T8:stress-ng] assert failed ../LibOS/shim/include/../../../common/include/slabmgr.h:400 *m == SLAB_CANARY_STRING
[P15435:T9:stress-ng] assert failed ../LibOS/shim/include/../../../common/include/slabmgr.h:400 *m == SLAB_CANARY_STRING
stress-ng: info: [1] unsuccessful run completed in 60.09s (1 min, 0.09 secs)
Failure 3: Testing mknod
$ graphene-direct stress-ng --mknod 8 --timeout 40s
error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!
error: Forwarding host environment variables to the app is enabled. Graphene will continue application execution, but this configuration must not be used in production!
stress-ng: info: [1] dispatching hogs: 8 mknod
stress-ng: error: [1] glob on regex "/sys/devices/system/cpu/cpu0/cache/index[0-9]*" failed: 1
stress-ng: info: [1] cache allocate: using built-in defaults as unable to determine cache details
stress-ng: fail: [2] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [2] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [2] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [2] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [2] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
info: 5 failures reached, aborting stress process
stress-ng: fail: [3] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [3] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [3] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [3] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
[P26573:T3:stress-ng] error: Internal memory fault at 0xc0000005b (IP = +0x1bdd9, VMID = 26573, TID = 3)
stress-ng: fail: [4] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [4] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [4] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [4] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
[P26577:T4:stress-ng] error: Internal memory fault at 0x400000053 (IP = +0x1bdd9, VMID = 26577, TID = 4)
stress-ng: fail: [5] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [5] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [5] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [5] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [5] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
info: 5 failures reached, aborting stress process
stress-ng: fail: [6] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [6] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [6] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [6] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [6] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
info: 5 failures reached, aborting stress process
stress-ng: fail: [7] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [7] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [7] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [7] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [7] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
info: 5 failures reached, aborting stress process
stress-ng: fail: [8] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [8] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
[P26591:T8:stress-ng] error: Internal memory fault at 0x400000053 (IP = +0x1bdd9, VMID = 26591, TID = 8)
stress-ng: fail: [9] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [9] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [9] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
stress-ng: fail: [9] stress-ng-mknod: mknod failed, errno=22 (Invalid argument)
[P26594:T9:stress-ng] error: Internal memory fault at 0xc0000005b (IP = +0x1bdd9, VMID = 26594, TID = 9)
Failure 4: Testing rename.
$ graphene-direct stress-ng --rename 8 --timeout 40s
error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!
error: Forwarding host environment variables to the app is enabled. Graphene will continue application execution, but this configuration must not be used in production!
stress-ng: info: [1] dispatching hogs: 8 rename
stress-ng: error: [1] glob on regex "/sys/devices/system/cpu/cpu0/cache/index[0-9]*" failed: 1
stress-ng: info: [1] cache allocate: using built-in defaults as unable to determine cache details
error: *** Unexpected memory fault occurred inside PAL (PID = 16073, TID = 16073, RIP = +0x00005a4d)! ***
error: *** Unexpected memory fault occurred inside PAL (PID = 16053, TID = 16053, RIP = +0x00005a4d)! ***
error: *** Unexpected memory fault occurred inside PAL (PID = 16061, TID = 16061, RIP = +0x00005a4d)! ***
stress-ng: info: [1] unsuccessful run completed in 28.54s
Steps to reproduce
Installation of the tool is simple, just run the following command: apt install stress-ng
Manifest files are attached to the issue.
stress-ng.manifest.template.txt
stress-ng.manifest.txt
Expected results
Actual results