JS tracer fails on too large stacks in returnvalue #23669

Closed
holiman opened this issue Oct 1, 2021 · 19 comments

holiman commented Oct 1, 2021

There appears to be a regression concerning alpine builds, which causes crashes like #23660.
This seems to happen only when running via docker, and it can be reproduced as shown below. This was once fixed by #22857; no idea why we're seeing it now.

Run it:

docker run ethereum/client-go --nodiscover --maxpeers 0 --dev 

Then attach a console:


docker exec -it beautiful_banach  /bin/sh

/ # geth attach /tmp/geth.ipc 
Welcome to the Geth JavaScript console!

instance: Geth/v1.10.10-unstable-f2491c5e/linux-amd64/go1.17.1
coinbase: 0xb2337989567fba76b5e50d8244c29aa38f33b2e3
at block: 0 (Thu Jan 01 1970 00:00:00 GMT+0000 (UTC))
 datadir: 
 modules: admin:1.0 clique:1.0 debug:1.0 eth:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0

To exit, press ctrl-d or type exit
> eth.sendTransaction({from:eth.accounts[0], to:eth.accounts[0], value:1 })
"0x1f1f731660e34a9e71bc8d2fe809334529ea8f6ce997d7fcdd34180c669d5c4f"

> tracer = "{step: function() {}, fault: function() {}, result: function() { var o={}; var x=o; for (var i=0; i<1000; i++){o.foo={}; o=o.foo; } return x; }}"
"{step: function() {}, fault: function() {}, result: function() { var o={}; var x=o; for (var i=0; i<1000; i++){o.foo={}; o=o.foo; } return x; }}"
> debug.traceTransaction("0x1f1f731660e34a9e71bc8d2fe809334529ea8f6ce997d7fcdd34180c669d5c4f", {"tracer": tracer})
Error: EOF
	at web3.js:6357:37(47)
	at web3.js:5091:62(37)
	at <eval>:1:23(7)

^ process killed.


holiman commented Oct 1, 2021

tx = eth.sendTransaction({from:eth.accounts[0], to:eth.accounts[0], value: 1})


for(i = 0; i < 1000; i++){
        console.log("testing ", i)
        tracer = "{step: function() {}, fault: function() {}, result: function() { var o={}; var x=o; for (var i=0; i<"+i+"; i++){o.foo={}; o=o.foo; } return x; }}"
        debug.traceTransaction(tx, {"tracer": tracer})

}

It gets killed here:

testing  428
testing  429
GoError: Error while compiling or running script: Error: EOF at web3.js:6357:37(47)
	at native
	at <eval>:1:12(1)


holiman commented Oct 1, 2021

Backtrace courtesy of @gballet:

#0  0x00007ffff7fb30b0 in printf_core (f=f@entry=0x0, 
    fmt=fmt@entry=0x18d9a3e "%p", ap=ap@entry=0x7fffccce70c0, 
    nl_arg=nl_arg@entry=0x7fffccce7150, nl_type=nl_type@entry=0x7fffccce70d8)
    at src/stdio/vfprintf.c:431
#1  0x00007ffff7fb3a3c in vfprintf (f=f@entry=0x7fffccce7248, 
    fmt=fmt@entry=0x18d9a3e "%p", ap=<optimized out>)
    at src/stdio/vfprintf.c:668
#2  0x00007ffff7fb5bbe in vsnprintf (s=s@entry=0x7fffccce7370 "", 
    n=n@entry=256, fmt=fmt@entry=0x18d9a3e "%p", ap=ap@entry=0x7fffccce7358)
    at src/stdio/vsnprintf.c:54
#3  0x00000000012a538f in duk__try_push_vsprintf (ap=0x7fffccce7358, 
    fmt=0x18d9a3e "%p", sz=256, buf=0x7fffccce7370, thr=0x7fffccce3070)
    at duk_api_stack.c:4765
#4  duk_push_vsprintf (thr=0x7fffccce3070, fmt=0x18d9a3e "%p", 
    ap=0x7fffccce74c0) at duk_api_stack.c:4817
#5  0x00000000012a5504 in duk_push_sprintf (thr=thr@entry=0x7fffccce3070, 
    fmt=fmt@entry=0x18d9a3e "%p") at duk_api_stack.c:4849
#6  0x00000000012befb4 in duk__enc_objarr_entry (js_ctx=0x7fffccd07480, 
    entry_top=0x7fffccce75f4) at duk_bi_json.c:1764
#7  0x00000000012c801e in duk__enc_object (js_ctx=0x7fffccd07480)
    at duk_bi_json.c:1833
#8  0x00000000012c7a9a in duk__enc_value (js_ctx=0x7fffccd07480, 
    idx_holder=1291) at duk_bi_json.c:2203


holiman commented Oct 1, 2021

This same bug can be reproduced all the way back to 1.9.1. I checked 1.8.1 too; that one crashes even earlier.


gballet commented Oct 1, 2021

This problem is related to the musl libc library as much as to duktape. I am not 100% sure what goes wrong just yet, but at the low level, duktape passes some invalid input to musl's vsnprintf function when trying to print a pointer.

The segfault is triggered in the prologue of musl's printf_core. The function moves further down the stack, and it seems to go too far, crossing a page boundary.

   0x00007ffff7fb309a <+0>:	push   %r15
   0x00007ffff7fb309c <+2>:	push   %r14
   0x00007ffff7fb309e <+4>:	push   %r13
   0x00007ffff7fb30a0 <+6>:	push   %r12
   0x00007ffff7fb30a2 <+8>:	push   %rbp
   0x00007ffff7fb30a3 <+9>:	mov    %rdi,%rbp
   0x00007ffff7fb30a6 <+12>:	push   %rbx
   0x00007ffff7fb30a7 <+13>:	xor    %ebx,%ebx
   0x00007ffff7fb30a9 <+15>:	sub    $0xa8,%rsp
   0x00007ffff7fb30b0 <+22>:	mov    %rdx,0x10(%rsp)

Before subtracting 0xa8, rsp is at 0x7fffd0ded070, and before executing the faulty mov %rdx,0x10(%rsp) instruction, rsp is at 0x7fffd0decfc0. The segfault is completely normal, since that new location is write- (and read-) protected, unlike the stack pages the thread was created with:

7fffd0deb000-7fffd0ded000 ---p 00000000 00:00 0 
7fffd0ded000-7fffd108e000 rw-p 00000000 00:00 0

So this is what happens at the very lowest level. Root-cause analysis still needs to be done to understand how this came about. The program is clearly running out of stack, as printf_core is trying to allocate 168 bytes on a stack that it doesn't have. This problem doesn't occur on systems running a more standard libc (I ran it on FreeBSD, and the problem doesn't occur there). The first conclusion I can draw is that the system needs more stack. I'll see if I can enforce that for duktape.


gballet commented Oct 1, 2021

After increasing the thread stack size when linking in duktape, the behavior is as expected:

testing  997
testing  998
testing  999
{
  foo: {
    foo: {
      foo: {
        foo: {...}
      }
    }
  }
}

So that seems to be the problem: musl doesn't offer much thread stack space (see here for a nice comparison table). Because other distributions have a higher default stack size, duktape usually hits its limit of 1000 allowed recursions before it runs out of stack. Not so with alpine linux, which offers a smaller stack size by default.
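
For reference, here's a quick cgo sketch (illustrative only, assuming a POSIX libc; not code from this repo) that prints the libc's default pthread stack size, handy for comparing musl against glibc:

package main

/*
#include <pthread.h>

// Report the stack size a newly created thread would get by default.
static size_t default_stack_size(void) {
	pthread_attr_t attr;
	size_t size = 0;
	pthread_attr_init(&attr); // fills attr with the libc's defaults
	pthread_attr_getstacksize(&attr, &size);
	pthread_attr_destroy(&attr);
	return size;
}
*/
import "C"

import "fmt"

func main() {
	fmt.Printf("default pthread stack size: %d KiB\n", C.default_stack_size()/1024)
}

On a glibc-based image this should report the distribution's multi-MiB default; on alpine it should report musl's much smaller one.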


gballet commented Oct 1, 2021

@holiman Looks like Go won't let me hardcode the linker flags to increase the stack size on alpine, so we have two courses of action to avoid this issue:

  1. Reduce the maximum number of allowed recursions in duktape, possibly through a fork; or
  2. Change build/ci.go to build geth with an increased stack size on alpine linux, as done in PR #23676 (build: increase C thread stack size on linux); a sketch of that approach follows below.
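
A sketch of what option 2 amounts to (illustrative only, not the literal #23676 diff; the 8 MiB value is an assumption): pass -z stack-size to the external linker, which sets the PT_GNU_STACK program header that musl consults for its default thread stack size.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"
)

func main() {
	args := []string{"build"}
	if runtime.GOOS == "linux" {
		// Request 8 MiB thread stacks via PT_GNU_STACK instead of musl's
		// tiny default, so duktape hits its own recursion limit before
		// the OS stack runs out. 0x800000 is illustrative.
		args = append(args, "-ldflags", "-extldflags -Wl,-z,stack-size=0x800000")
	}
	args = append(args, "./cmd/geth")
	cmd := exec.Command("go", args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "build failed:", err)
		os.Exit(1)
	}
}

Afterwards, readelf -lW ./geth should show a GNU_STACK segment with the requested MemSiz.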


holiman commented Oct 1, 2021

Heh, that's pretty spot on: the table says 128KiB per thread. Well, we hit the error after roughly 430 nestings, and each nesting uses a 256B stack buffer. 256 * 430 / 1024 = 107.5 KiB, so taking some overhead into account, it looks like we're hitting exactly that limit.
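
The same back-of-the-envelope estimate in code form:

package main

import "fmt"

func main() {
	const bufPerLevel = 256 // bytes of vsnprintf scratch per nesting level
	const levels = 430      // roughly where the crash shows up
	fmt.Printf("~%.1f KiB of the 128 KiB musl default\n",
		float64(bufPerLevel*levels)/1024) // ~107.5 KiB
}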


gballet commented Oct 1, 2021

429 to be precise


holiman commented Oct 1, 2021

With that in mind, btw, it would be good if we could run the CI testsuite on alpine too.


holiman commented Oct 1, 2021

Another alternative to 1) would be to fork duktape so it doesn't 'cheat' by using a stack buffer (https://github.com/svaarala/duktape/blob/master/src-input/duk_api_stack.c#L4768), or to make the buffer smaller than 256 bytes (https://github.com/svaarala/duktape/blob/master/src-input/duk_api_stack.c#L4737). But that would decrease performance.


holiman commented Oct 1, 2021

It would also be good to produce a minimal PoC and submit it upstream to duktape; maybe they wouldn't mind figuring something out on their end.


gballet commented Oct 1, 2021

It would also be good to produce a minimal PoC and submit it upstream to duktape; maybe they wouldn't mind figuring something out on their end.

I thought about that, but Go doesn't let you set external ld flags via a #cgo directive. It has to be built externally 😠


holiman commented Oct 2, 2021

I thought about that, but

I meant maybe just an alpine docker image with a minimal duktape interpreter that just loads a piece of code and crashes. Not sure which part you're saying is problematic?
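
Something along these lines, say (a sketch assuming the gopkg.in/olebedev/go-duktape.v3 binding geth already uses; drop it into an alpine golang image and run it):

package main

import (
	"fmt"

	duktape "gopkg.in/olebedev/go-duktape.v3"
)

func main() {
	ctx := duktape.New()
	defer ctx.DestroyHeap()

	// JSON-encode a deeply nested object inside duktape; this exercises the
	// same duk_bi_json.c recursion that shows up in the backtrace above.
	err := ctx.PevalString(`
		var o = {}, x = o;
		for (var i = 0; i < 1000; i++) { o.foo = {}; o = o.foo; }
		JSON.stringify(x);
	`)
	if err != nil {
		fmt.Println("eval error:", err)
		return
	}
	fmt.Println("encoded length:", len(ctx.SafeToString(-1)))
}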


gballet commented Oct 3, 2021

I thought you meant: provide a fix using the cgo directive, which is not possible. Just reproducing it isn't problematic.


max-mapper commented Oct 6, 2021

Hope this is the right thread. Since upgrading to v1.10.9 in our testing env, we have been seeing errors tracing certain txs.

curl -X POST -H "Content-Type: application/json" --data '{"method":"debug_traceBlockByHash","params":["0xb846bc49fcbead2fb6deec3a3283506bd0edc83c54c53e069f57c0163a40ebe9",{"tracer":"callTracer","timeout":null}],"id":15293,"jsonrpc":"2.0"}' http://localhost:8545

This returns a number of txs with the error "error": "TypeError: cannot read property 'calls' of undefined in server-side tracer function 'fault'". We are compiling and executing geth inside a docker container based on amazonlinux, which has a default 8MB thread stack size. We confirmed this by running the following program inside our container:

pthread size program
#include <stdio.h>
#include <pthread.h>

/* Build with: cc -pthread stacksize.c */
int main() {
   size_t stacksize;
   pthread_attr_t a;
   pthread_attr_init(&a); /* attr must be initialized before querying it */
   pthread_attr_getstacksize(&a, &stacksize);
   printf("Current stack size = %zu\n", stacksize); /* %zu for size_t */
   pthread_attr_destroy(&a);
   return 0;
}

This returned 4195888. We have also been running this patch on duktape for around 9 months (https://github.com/olebedev/go-duktape/pull/58/files), but we switched back to gopkg.in/olebedev/go-duktape.v3 v3.0.0-20200619000410-60c24ae608a6 just to confirm, and still got the above errors. We also tried #23676. There have been 6 transactions so far that v1.10.8 and below have been unable to trace, so we are eager to get the new tracing performance fixes into prod :)

Untraceable txs

0x430874b252f837eecef4eefd51f94019d31b0cb2fa8cba78ebc98f33c660ade2
0xa76cdf87291f4f38b9ff4100b24c9070aa4c3548fa358d468a7fe495e9754d75
0x66bf740ad2a56c9d1d57e1bc5b0a019414bb25aa73a25e9223f92dff76eab56d
0x86cad91250a6c96a7af1f32a23ba329b9cca85cccd8f982847a9e731920e8fc3
0x709622690357809f821592ffab5326404b17ac6a258e5479d525d5c4b517d5fe
0xd22027b361a97d68fa8ec9166a42d73fa490da05945557ae68ffa9c2e35df9d1


s1na commented Oct 6, 2021

@maxogden the error you're referring to is fixed on master (PR #23667), but hasn't been released yet.


s1na commented Oct 6, 2021

I made this into a unit test in https://github.com/s1na/go-ethereum/tree/test/tracer-deep-nested, but the crash wasn't happening for me in an alpine docker. We tried with @gballet's setup and at first the test was passing (i.e. no segfault). Then he tried compiling the test (i.e. go test -c ./eth/tracers) and running the binary, which caused the segfault. I still don't understand what makes running the tracer as part of an RPC call (or console) different from doing it in a test.
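
One thing that might help narrow it down (a debugging sketch of my own, not code from the branch): print the stack size of the OS thread the cgo call actually lands on. The process's main thread gets an RLIMIT_STACK-sized stack even on musl, while pthread-created threads get the small default, so which thread the tracer happens to run on could matter. pthread_getattr_np is a nonportable extension, but both glibc and musl provide it:

package main

/*
#define _GNU_SOURCE
#include <pthread.h>

// Report the stack size of the calling thread.
static size_t current_stack_size(void) {
	pthread_attr_t attr;
	size_t size = 0;
	pthread_getattr_np(pthread_self(), &attr);
	pthread_attr_getstacksize(&attr, &size);
	pthread_attr_destroy(&attr);
	return size;
}
*/
import "C"

import (
	"fmt"
	"runtime"
)

func main() {
	runtime.LockOSThread() // pin to one OS thread, like a long-running cgo call
	fmt.Printf("current thread stack: %d KiB\n", C.current_stack_size()/1024)
}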


jwasinger commented Oct 6, 2021

Can also confirm that the test passes on my machine (host machine running ubuntu 20.04 fwiw). Tried running it on golang:1.14-alpine and golang:alpine images.


holiman commented Oct 20, 2021

Closed by #23676

@holiman holiman closed this as completed Oct 20, 2021