blob.GetAll fails from a certain height #3185

Closed
christopherbrumm opened this issue Feb 14, 2024 · 19 comments

Labels: bug (Something isn't working), external (Issues created by non node team members)

@christopherbrumm

Celestia Node version

Semantic version: v0.12.4 Commit: 8e5a717 Build Date: Fri Feb 9 11:20:30 UTC 2024 Golang version: go1.21.1

OS

System version: amd64/linux

Install tools

Followed the step-by-step tutorial in the docs to install a Celestia DA full node for Mainnet Beta

Others

Running on AWS m5.4xlarge:

  • 16 vCPU
  • 64GB RAM

Steps to reproduce it

  • Setup celestia DA full node
  • Start syncing process
  • Write a script to query blob.GetAll for each height, e.g. using the curl command below (a loop sketch follows this list)
    curl -X POST -H "Content-Type: application/json" -H "Authorization: Bearer $CELESTIA_NODE_AUTH_TOKEN" -d '{ "id": 1, "jsonrpc": "2.0", "method": "blob.GetAll", "params": [ $HEIGHT, ["AAAAAAAAAAAAAAAAAAAAAAAAAIZiad33fbxA7Z0="] ] }' 127.0.0.1:26658
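
A minimal loop sketch for such a script (assumes a bash shell; the curl command, auth token variable, and namespace are taken from the step above, while the height range and the 30-second timeout are arbitrary choices):

#!/usr/bin/env bash
# Sweep a range of heights and print each blob.GetAll response.
# --max-time keeps a stalled request from hanging the loop indefinitely.
for HEIGHT in $(seq 740700 740800); do
  echo "--- height $HEIGHT"
  curl -s --max-time 30 -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $CELESTIA_NODE_AUTH_TOKEN" \
    -d "{ \"id\": 1, \"jsonrpc\": \"2.0\", \"method\": \"blob.GetAll\", \"params\": [ $HEIGHT, [\"AAAAAAAAAAAAAAAAAAAAAAAAAIZiad33fbxA7Z0=\"] ] }" \
    127.0.0.1:26658
  echo
done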

Expected result

I expect every request to return a response containing all blobs for the given namespace at that height.

Actual result

Instead, requests stall starting from a certain height (observed from heights 740731 and 740740), although requesting all blobs for a lower height works fine. The node doesn't log any errors during the requests, and CPU and RAM limits are not exceeded. A node restart ends in the same issue, even after resetting the store with celestia full unsafe-reset-store.
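
For reference, the reset-and-restart sequence looks roughly like the sketch below (the unsafe-reset-store subcommand is quoted from the report above; running the node via a systemd service named celestia-full is an assumption about this particular setup):

# Stop the node, wipe its store, then start it again to re-sync from scratch.
sudo systemctl stop celestia-full
celestia full unsafe-reset-store
sudo systemctl start celestia-full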

Relevant log output

Couldn't find any possibly related node logs

Notes

No response

@christopherbrumm added the bug (Something isn't working) label on Feb 14, 2024
@github-actions bot added the external (Issues created by non node team members) label on Feb 14, 2024
@renaynay
Member

Hey @christopherbrumm

A node restart ends in the same issue

Does this mean you restarted the node and tried the script again and it still did not work?

although resetting the store with celestia full unsafe-reset-store

Did you reset the store and sync from scratch and then it worked?

@vgonkivs
Member

Might be related to #2915

@walldiss
Member

Could you please run your node with metrics flags pointing to our otel collector? We would also need your peer ID (shown at node start) to pick out your node's metrics.

--metrics.endpoint otel.celestia.observer
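
A start invocation with metrics reporting might look like the following sketch (only --metrics.endpoint otel.celestia.observer comes from this comment; the --metrics flag and the --core.ip placeholder are assumptions based on a typical full-node setup):

# Start the full node with metrics reporting to the Celestia otel collector.
celestia full start \
  --core.ip <consensus-rpc-host> \
  --metrics \
  --metrics.endpoint otel.celestia.observer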

@christopherbrumm
Author

christopherbrumm commented Feb 14, 2024

Did you reset the store and sync from scratch and then it worked?

I tried the following things:

  • Restarted the node without resetting the store
  • Restarted the node using other --core.ip endpoint
  • Restarted the node after resetting the store
  • Up-scaled the instance the node is running on

In all cases it ended in the same issue.

Today I found out that restarting the node or the whole server fixes the issue only temporarily, as it just raises that certain height: I couldn't query blob.GetAll on my instance for any height >= 740740; after restarting, I can't query it for any height >= 740817.

@christopherbrumm
Author

Could you please run your node with metrics flags pointing to our otel collector? We would also need your peer ID (shown at node start) to pick out your node's metrics.

Got CANONICAL_PEER_STATUS: peer=12D3KooWR4Xkp1rM92xs5vzXfYxVBg6Pvknoeb5JqKJS3yvTgBYv - is this the correct peer ID @walldiss?

@renaynay
Member

Today I found out that restarting the node or the whole server fixes the issue only temporarily, as it just raises that certain height: I couldn't query blob.GetAll on my instance for any height >= 740740; after restarting, I can't query it for any height >= 740817.

Seems related to celestiaorg/go-header#159

@vgonkivs
Member

It is not related to the issue in the syncer: if the requested height > storeHeight, the node will return an error, and syncer.Head can't get stuck.

I've described the possible root cause of this issue in #2915. TL;DR: for some reason, ipld can get stuck during share fetching.

@christopherbrumm
Author

christopherbrumm commented Feb 28, 2024

I have reproduced the issue and can confirm your thesis, @vgonkivs, because the error occurs when querying the same blob more than once. I've attached my node's logs from the reproduction:

celestia-node.txt

In another terminal, I executed the following commands (every time I aborted a curl call with ^C, the response had already been stalling for more than 30s):

Wed Feb 28 14:51:51 UTC 2024

ubuntu@ip-172-31-1-21:~$ curl -X POST -H "Content-Type: application/json" -H "Authorization: Bearer $CELESTIA_NODE_AUTH_TOKEN" -d '{ "id": 1, "jsonrpc": "2.0", "method": "blob.GetAll", "params": [ 745000, ["AAAAAAAAAAAAAAAAAAAAAAAAAIZiad33fbxA7Z0="] ] }' 127.0.0.1:26658
{"jsonrpc":"2.0","id":1,"error":{"code":1,"message":"getting blobs for namespace(00000000000000000000000000000000000000866269ddf77dbc40ed9d): blob: not found\nblob: not found"}}

ubuntu@ip-172-31-1-21:~$ curl -X POST -H "Content-Type: application/json" -H "Authorization: Bearer $CELESTIA_NODE_AUTH_TOKEN" -d '{ "id": 1, "jsonrpc": "2.0", "method": "blob.GetAll", "params": [ 745000, ["AAAAAAAAAAAAAAAAAAAAAAAAAIZiad33fbxA7Z0="] ] }' 127.0.0.1:26658
^C

ubuntu@ip-172-31-1-21:~$ curl -X POST -H "Content-Type: application/json" -H "Authorization: Bearer $CELESTIA_NODE_AUTH_TOKEN" -d '{ "id": 1, "jsonrpc": "2.0", "method": "blob.GetAll", "params": [ 745010, ["AAAAAAAAAAAAAAAAAAAAAAAAAIZiad33fbxA7Z0="] ] }' 127.0.0.1:26658
{"jsonrpc":"2.0","id":1,"error":{"code":1,"message":"getting blobs for namespace(00000000000000000000000000000000000000866269ddf77dbc40ed9d): blob: not found\nblob: not found"}}

ubuntu@ip-172-31-1-21:~$ curl -X POST -H "Content-Type: application/json" -H "Authorization: Bearer $CELESTIA_NODE_AUTH_TOKEN" -d '{ "id": 1, "jsonrpc": "2.0", "method": "blob.GetAll", "params": [ 745010, ["AAAAAAAAAAAAAAAAAAAAAAAAAIZiad33fbxA7Z0="] ] }' 127.0.0.1:26658
{"jsonrpc":"2.0","id":1,"error":{"code":1,"message":"getting blobs for namespace(00000000000000000000000000000000000000866269ddf77dbc40ed9d): blob: not found\nblob: not found"}}

ubuntu@ip-172-31-1-21:~$ curl -X POST -H "Content-Type: application/json" -H "Authorization: Bearer $CELESTIA_NODE_AUTH_TOKEN" -d '{ "id": 1, "jsonrpc": "2.0", "method": "blob.GetAll", "params": [ 745010, ["AAAAAAAAAAAAAAAAAAAAAAAAAIZiad33fbxA7Z0="] ] }' 127.0.0.1:26658
{"jsonrpc":"2.0","id":1,"error":{"code":1,"message":"getting blobs for namespace(00000000000000000000000000000000000000866269ddf77dbc40ed9d): blob: not found\nblob: not found"}}

ubuntu@ip-172-31-1-21:~$ curl -X POST -H "Content-Type: application/json" -H "Authorization: Bearer $CELESTIA_NODE_AUTH_TOKEN" -d '{ "id": 1, "jsonrpc": "2.0", "method": "blob.GetAll", "params": [ 745010, ["AAAAAAAAAAAAAAAAAAAAAAAAAIZiad33fbxA7Z0="] ] }' 127.0.0.1:26658
^C

@Wondertan
Member

@christopherbrumm, could you also test this behavior on a LN? I can see you're using a FN, and we wonder whether this is consistent across node types, particularly light nodes.

@christopherbrumm
Author

christopherbrumm commented Feb 29, 2024

@christopherbrumm, could you also test this behavior on a LN? I can see you're using a FN, and we wonder whether this is consistent across node types, particularly light nodes.

I was able to reproduce the issue on the LN. Will provide logs if required.

@christopherbrumm
Author

For completeness, linking the script to reproduce the issue in this thread:

https://github.com/christopherbrumm/scripts/tree/main/celestia

@vgonkivs
Member

Hello, @christopherbrumm. Thanks for the script. It helped a lot. Yesterday I had a chance to reproduce this issue. It seems that @renaynay was right: the root cause was in the syncer. This issue was fixed in celestiaorg/go-header#159. I tried your script on version v0.13 (which already contains the fix) and it now works fine. Could you please also try v0.13?

@christopherbrumm
Author

That's great news, will try and confirm it asap in order to close this issue.

Thanks a lot!

@christopherbrumm
Author

Hi @vgonkivs, after five attempts with different setups, I can confirm that the upgrade to v0.13 did not completely fix the reported issue.

The node still stalls at a certain height, making blob requests impossible. I am still using the provided script to replicate this issue.

Also, I've noticed another issue that occurs when requesting share.GetEDS. Essentially, it mirrors the blob issue: the node stalls when requesting data above a certain height. I have added an additional script to reproduce this issue, with more context in the README (an illustrative curl sketch follows this paragraph).
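
An illustrative share.GetEDS call mirroring the blob request above (the actual reproduction script is the one linked in the README; the parameter shape for share.GetEDS below is an assumption, since some node versions expect an extended header rather than a bare height, hence the header.GetByHeight step, and jq is assumed to be installed):

# Fetch the extended header for the height under test.
curl -s -X POST -H "Content-Type: application/json" \
  -H "Authorization: Bearer $CELESTIA_NODE_AUTH_TOKEN" \
  -d '{ "id": 1, "jsonrpc": "2.0", "method": "header.GetByHeight", "params": [ 745010 ] }' \
  127.0.0.1:26658 | jq '.result' > header.json

# Request the full EDS for that header.
curl -s -X POST -H "Content-Type: application/json" \
  -H "Authorization: Bearer $CELESTIA_NODE_AUTH_TOKEN" \
  -d "{ \"id\": 1, \"jsonrpc\": \"2.0\", \"method\": \"share.GetEDS\", \"params\": [ $(cat header.json) ] }" \
  127.0.0.1:26658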

I deployed five nodes on five different instances. While four of them show the issues listed, one is working fine. After realizing that one node has no problems providing blobs and shares, I tried to duplicate its setup. However, a node with an identical setup shows the same problems again.

cc @Wondertan

@christopherbrumm
Author

Currently running 3 nodes on a fresh server, and for all 3 the scripts run through smoothly.

Will keep this thread updated.

@vgonkivs
Member

Hello @christopherbrumm. Thanks for your update. We will continue investigating this issue. May I also ask you to attach logs from a failed attempt?

@christopherbrumm
Author

Update: The share.GetEDS query no longer causes problems.

It seems that v0.13 has solved the complete stalling when executing blob.GetAll, although the query duration still increases and sometimes exceeds 60 seconds (a timing sketch follows the attached log).

celestia-node_2.txt
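
A simple way to see the per-request latency described above is to wrap the call in time (the height 745010 and the namespace are reused from the earlier transcript; any affected height works):

# Measure how long a single blob.GetAll request takes to return.
time curl -s -X POST -H "Content-Type: application/json" \
  -H "Authorization: Bearer $CELESTIA_NODE_AUTH_TOKEN" \
  -d '{ "id": 1, "jsonrpc": "2.0", "method": "blob.GetAll", "params": [ 745010, ["AAAAAAAAAAAAAAAAAAAAAAAAAIZiad33fbxA7Z0="] ] }' \
  127.0.0.1:26658 > /dev/null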

@vgonkivs
Member

vgonkivs commented Mar 20, 2024

I guess we can close this issue then. Regarding the high duration of some requests: the current protocol for getting shares is slow (a known problem), and we are currently working on a better solution.

cc @Wondertan

@Wondertan
Member

Yes, let's close it then.
