Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unhandled trie error in Trie.Prove with experimental pbss #26293

Closed
yorickdowne opened this issue Dec 3, 2022 · 24 comments
Closed

Unhandled trie error in Trie.Prove with experimental pbss #26293

yorickdowne opened this issue Dec 3, 2022 · 24 comments
Assignees

Comments

@yorickdowne
Copy link

yorickdowne commented Dec 3, 2022

System information

Geth version: 1.11.0-unstable
CL client & version: Teku v22.11.0
Commit hash : eb57884596df203c40c3144ed51542bb7e9ef3e8, PR #26274

Observation

On a fresh sync on mainnet with --trie.path-based=true, Geth is following chain head after sync and prints errors about missing trie node 746d39f09b4c5c1b26ce016fcc3468c88eddecb18b0897b6e031d49925b8d1ab

This is a test system. It's understood pbss is experimental and not production-ready.

Logs

Logs when this error first occurred. I don't have debug logs, alas. This is not quite 2 hours after initial sync finished.

execution_1                  | INFO [12-03|04:11:24.563] Imported new potential chain segment     number=16,101,815 hash=10f4c6..2c31c7 blocks=1       txs=119         mgas=15.294  elapsed=245.149ms    mgasps=62.387  dirty=388.29MiB
execution_1                  | INFO [12-03|04:11:24.582] Chain head was updated                   number=16,101,815 hash=10f4c6..2c31c7 root=97b340..1bf986 elapsed=4.236644ms
execution_1                  | INFO [12-03|04:11:32.553] Generating state snapshot                root=99d3be..61f40b in=6f898e..db63e9 at=8dc388..9b5775 accounts=83,119,824           slots=375,464,680           storage=30.58GiB   dangling=0 elapsed=1h49m15.223s eta=2h21m30.298s
execution_1                  | INFO [12-03|04:11:37.649] Aborting state snapshot generation       root=99d3be..61f40b in=6f898e..db63e9 at=d513d8..396ca5 accounts=83,119,824           slots=375,746,220           storage=30.60GiB   dangling=0 elapsed=1h49m20.320s eta=2h21m36.9s
execution_1                  | INFO [12-03|04:11:37.653] Resuming state snapshot generation       root=e1dd39..66e8e8 in=6f898e..db63e9 at=d513d8..396ca5 accounts=83,119,824           slots=375,746,220           storage=30.60GiB   dangling=0 elapsed=1h49m20.324s eta=2h21m36.905s
execution_1                  | INFO [12-03|04:11:37.670] Imported new potential chain segment     number=16,101,816 hash=ee0821..317480 blocks=1       txs=89          mgas=8.874   elapsed=217.279ms    mgasps=40.842  dirty=387.90MiB
execution_1                  | INFO [12-03|04:11:37.681] Chain head was updated                   number=16,101,816 hash=ee0821..317480 root=9ad43a..a7f54f elapsed=3.060608ms
execution_1                  | INFO [12-03|04:11:45.655] Generating state snapshot                root=e1dd39..66e8e8 at=6fa9cc..35d51c accounts=83,213,530           slots=376,135,293           storage=30.63GiB   dangling=0 elapsed=1h49m28.325s eta=2h21m30.262s
execution_1                  | INFO [12-03|04:11:50.062] Aborting state snapshot generation       root=e1dd39..66e8e8 at=6fc8d1..c9e879 accounts=83,303,042           slots=376,301,884           storage=30.65GiB   dangling=0 elapsed=1h49m32.732s eta=2h21m19.625s
execution_1                  | INFO [12-03|04:11:50.064] Resuming state snapshot generation       root=c11999..0ebd8f at=6fc8d1..c9e879 accounts=83,303,042           slots=376,301,884           storage=30.65GiB   dangling=0 elapsed=1h49m32.734s eta=2h21m19.628s
execution_1                  | INFO [12-03|04:11:50.078] Imported new potential chain segment     number=16,101,817 hash=130697..77ceba blocks=1       txs=133         mgas=14.488  elapsed=263.507ms    mgasps=54.982  dirty=388.42MiB
execution_1                  | INFO [12-03|04:11:50.100] Chain head was updated                   number=16,101,817 hash=130697..77ceba root=5b344f..72a765 elapsed=9.124248ms
execution_1                  | INFO [12-03|04:11:58.066] Generating state snapshot                root=c11999..0ebd8f in=6ff11f..0be9b4 at=2898a1..eb153c accounts=83,420,643           slots=376,686,491           storage=30.68GiB   dangling=0 elapsed=1h49m40.736s eta=2h21m8.755s
execution_1                  | INFO [12-03|04:12:00.457] Aborting state snapshot generation       root=c11999..0ebd8f in=6ff621..e6b9af at=77f358..94748f accounts=83,435,273           slots=376,953,690           storage=30.70GiB   dangling=0 elapsed=1h49m43.127s eta=2h21m9.202s
execution_1                  | INFO [12-03|04:12:00.459] Resuming state snapshot generation       root=a747ed..9abd12 in=6ff621..e6b9af at=77f358..94748f accounts=83,435,273           slots=376,953,690           storage=30.70GiB   dangling=0 elapsed=1h49m43.129s eta=2h21m9.205s
execution_1                  | INFO [12-03|04:12:00.474] Imported new potential chain segment     number=16,101,818 hash=9c5f0e..9f1f48 blocks=1       txs=114         mgas=11.670  elapsed=206.845ms    mgasps=56.420  dirty=389.31MiB
execution_1                  | INFO [12-03|04:12:00.496] Chain head was updated                   number=16,101,818 hash=9c5f0e..9f1f48 root=dd18cf..f513e5 elapsed=3.635276ms
execution_1                  | ERROR[12-03|04:12:05.076] Unhandled trie error in Trie.Prove       err="missing trie node 746d39f09b4c5c1b26ce016fcc3468c88eddecb18b0897b6e031d49925b8d1ab (owner 7006375ac2fae9614f6258206c1e86e5112c551b246dd6efb4783aa4b1b62d75) (path 000b) unexpected node 57c165b4d08f768f5714db8af1b8b1c2ec10bffd839cec024d9ba9a94b578e3c!=746d39f09b4c5c1b26ce016fcc3468c88eddecb18b0897b6e031d49925b8d1ab(7006375ac2fae9614f6258206c1e86e5112c551b246dd6efb4783aa4b1b62d75 [0 11])"
execution_1                  | INFO [12-03|04:12:12.423] Resuming state snapshot generation       root=c2e408..494bc8 in=6ff621..e6b9af at=77f358..94748f accounts=83,481,616           slots=377,281,232           storage=30.73GiB   dangling=0 elapsed=1h49m55.093s eta=2h21m24.596s
execution_1                  | INFO [12-03|04:12:12.443] Imported new potential chain segment     number=16,101,819 hash=67c747..e62623 blocks=1       txs=67          mgas=24.353  elapsed=272.196ms    mgasps=89.469  dirty=391.03MiB
execution_1                  | INFO [12-03|04:12:12.467] Chain head was updated                   number=16,101,819 hash=67c747..e62623 root=6d4a4c..484398 elapsed=3.452427ms
execution_1                  | ERROR[12-03|04:12:16.551] Unhandled trie error in Trie.Prove       err="missing trie node 746d39f09b4c5c1b26ce016fcc3468c88eddecb18b0897b6e031d49925b8d1ab (owner 7006375ac2fae9614f6258206c1e86e5112c551b246dd6efb4783aa4b1b62d75) (path 000b) unexpected node 57c165b4d08f768f5714db8af1b8b1c2ec10bffd839cec024d9ba9a94b578e3c!=746d39f09b4c5c1b26ce016fcc3468c88eddecb18b0897b6e031d49925b8d1ab(7006375ac2fae9614f6258206c1e86e5112c551b246dd6efb4783aa4b1b62d75 [0 11])"
execution_1                  | INFO [12-03|04:12:25.172] Resuming state snapshot generation       root=9346eb..939cb9 in=6ff621..e6b9af at=77f358..94748f accounts=83,527,959           slots=377,608,774           storage=30.75GiB   dangling=0 elapsed=1h50m7.842s  eta=2h21m40.998s
execution_1                  | INFO [12-03|04:12:25.190] Imported new potential chain segment     number=16,101,820 hash=1a700a..3ebe53 blocks=1       txs=134         mgas=21.304  elapsed=322.445ms    mgasps=66.071  dirty=391.97MiB
execution_1                  | INFO [12-03|04:12:25.227] Chain head was updated                   number=16,101,820 hash=1a700a..3ebe53 root=ce8a5c..6fbd90 elapsed=16.55591ms
execution_1                  | ERROR[12-03|04:12:29.434] Unhandled trie error in Trie.Prove       err="missing trie node 746d39f09b4c5c1b26ce016fcc3468c88eddecb18b0897b6e031d49925b8d1ab (owner 7006375ac2fae9614f6258206c1e86e5112c551b246dd6efb4783aa4b1b62d75) (path 000b) unexpected node 57c165b4d08f768f5714db8af1b8b1c2ec10bffd839cec024d9ba9a94b578e3c!=746d39f09b4c5c1b26ce016fcc3468c88eddecb18b0897b6e031d49925b8d1ab(7006375ac2fae9614f6258206c1e86e5112c551b246dd6efb4783aa4b1b62d75 [0 11])"
execution_1                  | INFO [12-03|04:12:36.382] Resuming state snapshot generation       root=3deb32..45e55a in=6ff621..e6b9af at=77f358..94748f accounts=83,574,302           slots=377,936,316           storage=30.78GiB   dangling=0 elapsed=1h50m19.052s eta=2h21m55.42s
execution_1                  | INFO [12-03|04:12:36.394] Imported new potential chain segment     number=16,101,821 hash=85ef21..e5c311 blocks=1       txs=127         mgas=11.942  elapsed=316.151ms    mgasps=37.772  dirty=392.35MiB
execution_1                  | INFO [12-03|04:12:36.407] Chain head was updated                   number=16,101,821 hash=85ef21..e5c311 root=685b26..07bacd elapsed=4.217528ms
execution_1                  | ERROR[12-03|04:12:40.454] Unhandled trie error in Trie.Prove       err="missing trie node 746d39f09b4c5c1b26ce016fcc3468c88eddecb18b0897b6e031d49925b8d1ab (owner 7006375ac2fae9614f6258206c1e86e5112c551b246dd6efb4783aa4b1b62d75) (path 000b) unexpected node 57c165b4d08f768f5714db8af1b8b1c2ec10bffd839cec024d9ba9a94b578e3c!=746d39f09b4c5c1b26ce016fcc3468c88eddecb18b0897b6e031d49925b8d1ab(7006375ac2fae9614f6258206c1e86e5112c551b246dd6efb4783aa4b1b62d75 [0 11])"
execution_1                  | INFO [12-03|04:12:48.300] Resuming state snapshot generation       root=3cc9fa..26b48b in=6ff621..e6b9af at=77f358..94748f accounts=83,620,645           slots=378,263,858           storage=30.80GiB   dangling=0 elapsed=1h50m30.970s eta=2h22m10.752s
execution_1                  | INFO [12-03|04:12:48.319] Imported new potential chain segment     number=16,101,822 hash=031a1c..942e9e blocks=1       txs=119         mgas=10.348  elapsed=203.691ms    mgasps=50.801  dirty=391.93MiB
execution_1                  | INFO [12-03|04:12:48.343] Chain head was updated                   number=16,101,822 hash=031a1c..942e9e root=8dd65b..f59000 elapsed=6.472359ms
@yorickdowne
Copy link
Author

Starting a fresh sync this time with debug logs, in hopes of catching this error.

@holiman
Copy link
Contributor

holiman commented Dec 3, 2022

Aside from --trie.path-based=true, what other switches are you using?
Also: thanks for trying it out, we appreciate it!
Users wiling to try out the bleeding edge help a lot.

@yorickdowne
Copy link
Author

      - geth
      - --http
      - --http.addr
      - 0.0.0.0
      - --http.vhosts=*
      - --http.corsdomain=*
      - --http.api
      - web3,eth,net
      - --datadir
      - /var/lib/goethereum
      - --port
      - ${EL_P2P_PORT:-30303}
      - --http.port
      - ${EL_RPC_PORT:-8545}
      - --ws
      - --ws.addr
      - 0.0.0.0
      - --ws.port
      - ${EL_WS_PORT:-8546}
      - --ws.api
      - web3,eth,net
      - --${NETWORK}
      - --metrics
      - --metrics.expensive
      - --pprof
      - --pprof.addr
      - 0.0.0.0
      - --authrpc.jwtsecret
      - /var/lib/goethereum/ee-secret/jwtsecret
      - --authrpc.addr
      - 0.0.0.0
      - --authrpc.port
      - ${EE_PORT:-8551}
      - --authrpc.vhosts=*
      - --maxpeers
      - ${EL_MAX_PEER_COUNT:-50}

@yorickdowne
Copy link
Author

Now that I'm watching for it, it doesn't occur again. Of course.

Is this more likely to be an issue that just needs a specific state change to happen on chain to occur, and so it's just a matter of waiting; or more likely to be an issue that could be "forced" to reoccur by just syncing from scratch a bunch of times?

@rjl493456442 rjl493456442 self-assigned this Dec 6, 2022
@rjl493456442
Copy link
Member

It feels like some trie nodes are corrupted. Maybe you can run geth snapshot traverse-state --trie.path-based=true --datadir <your-dir>, it should detect the stuff.

@yorickdowne
Copy link
Author

That particular DB is no longer available, I started over with a fresh DB and debug logs. To reproduce this, is it more likely that just waiting will, or that syncing from scratch repeatedly until issue is encountered will?

@rjl493456442
Copy link
Member

rjl493456442 commented Dec 6, 2022

I don't know, I think you can just try to play around with it and save the db if it happens again.
I tested it manually for a couple of times and nothing wrong happened.

@rjl493456442
Copy link
Member

Btw, I just did another round of snap sync and it can generate snapshot correctly.
Might be some corner cases or something irrelevant. But anyway, would be nice
to have something to repro.

@yorickdowne
Copy link
Author

Unable to repro so far. I'll keep trying.

@sagivo
Copy link

sagivo commented Dec 12, 2022

for what it's worth, I also have these errors, tried to clean the DB multiple times but after the DB gets to ~500GB it start giving me these messages and never fully syncs.

@rjl493456442
Copy link
Member

rjl493456442 commented Dec 12, 2022

Are you also using this experimental PBSS branch?

What do you mean "tried to clean the DB multiple times"?

it start giving me these messages and never fully syncs.

Can you also share some logs?

@yorickdowne
Copy link
Author

I've synced two of these pretty much once a day for the last week, and I cannot reproduce this now. If it's still in the code, it's some real corner case.

@rjl493456442
Copy link
Member

@yorickdowne thanks for your time! I also fixed some issues in state history construction. But I guess it's irrelevant with your original issue, which should only occur when SetHead happens.

Since you are heavily testing/playing around it, I will let you know if I have the next stable version.

@rjl493456442
Copy link
Member

@yorickdowne You may want to try out the latest version in #26274

Fixes some issues in trie history construction. Please sync from scratch if you want to have a try by any chance.

@yorickdowne
Copy link
Author

@rjl493456442 Thanks! Will give it a whirl.

@yorickdowne
Copy link
Author

It appears to have gotten "stuck" during sync. Moved the pivot and stopped making progress after:

geth-pbss-execution-1  | WARN [01-03|21:56:23.720] Pivot became stale, moving               old=16,327,894 new=16,328,419
geth-pbss-execution-1  | INFO [01-03|21:56:24.273] Imported new block receipts              count=731  elapsed=551.017ms     number=16,328,418 hash=0818c4..40e64d age=2h13m49s   size=102.58MiB
geth-pbss-execution-1  | INFO [01-03|21:56:24.273] Syncing: state download in progress      synced=10.68% state=26.87GiB  accounts=20,817,870@4.20GiB  slots=106,228,115@22.06GiB codes=113,631@617.67MiB eta=31h30m7.323s

It's now 16 hours later - no state download progress and no further receipts, just another

geth-pbss-execution-1  | INFO [01-04|13:47:05.794] Syncing: state download in progress      synced=10.68% state=26.87GiB  accounts=20,992,323@4.20GiB  slots=106,228,115@22.06GiB codes=113,631@617.67MiB eta=163h58m45.699s

I have this DB in that state - what can I run for you?

@rjl493456442
Copy link
Member

Well.

Can you geth attach /path/to/geth.ipc --exec "debug.stacks()" > stacktrace.txt?
Honestly, I don't think it's that relevant with my change, probably some issues in snap-sync.
We can dump out the stacks and to see if program is blocked somehow.

@yorickdowne
Copy link
Author

It's the other way around it turns out, exec has to be before the target: geth attach --exec "debug.stacks()" /var/lib/goethereum/geth.ipc >stacktrace.txt

Attached.

stacktrace.txt

@rjl493456442
Copy link
Member

Thanks for the info. I don't find anything relevant with PBSS. Perhaps it's relevant with your network environment? Or the setup with consensus client? Does it eventually recover and finish the sync?