
Failed to restore recent snapshot with strange error #1160

Closed
ghost opened this issue Aug 16, 2023 · 9 comments
Labels
bug ⚠️ Something isn't working

Comments

ghost commented Aug 16, 2023

Context & versions

Trying to restore a recent snapshot:

$ mithril-client --version
mithril-client 0.3.27+ff06651

Got the following error at unpacking stage:

$ mithril-client snapshot download fdd609c5affa627c9b19dfd32c5a370a9e6ba0f930ec50281b5f470fe3c955de
1/7 - Checking local disk info…
2/7 - Fetching the certificate's information…
3/7 - Verifying the certificate chain…
4/7 - Downloading the snapshot…
⠐ [00:03:46] [#####>               ] 15.33 GiB/69.44 GiB (892.8s)
5/7 - Unpacking the snapshot…
6/7 - Computing the snapshot digest…
Error: "An error occured: Could not compute digest in './db': At least two immutable chunks should exist in directory './db': expected 4655 but found Some(4654)."

Steps to reproduce

curl -L -o mithril-client.deb https://github.com/input-output-hk/mithril/releases/download/2331.1/mithril-client_0.3.27+ff06651_amd64.deb
dpkg -i mithril-client.deb
mithril-client snapshot download fdd609c5affa627c9b19dfd32c5a370a9e6ba0f930ec50281b5f470fe3c955de
ghost added the bug ⚠️ Something isn't working label on Aug 16, 2023
ghost (Author) commented Aug 16, 2023

Unpacking the tar.gz archive and counting the immutables every second, I see this strange pattern:

4742
4759
4772
4685
ls: cannot access 'immutable/03319.chunk': No such file or directory
4344
4098
96
119
139
158
165
183
202
212
226
237
266
292

e.g. files "disappear".
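For reference, the per-second counting loop can be sketched as below; the `./db/immutable` path is an assumption based on the error message above, and the helper name is illustrative.

```shell
# count_chunks DIR: print how many *.chunk files are currently in DIR.
count_chunks() {
  ls "$1"/*.chunk 2>/dev/null | wc -l
}

# Sample once per second while the archive unpacks, e.g.:
#   while sleep 1; do count_chunks ./db/immutable; done
```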

Alenar (Collaborator) commented Aug 16, 2023

Thanks for the report!

Sadly, this is not the first issue about snapshot unpacking (see #1140). It is probably related to the fact that we pack the snapshot without stopping the node.
Since changing that will be difficult, we added a verification of the snapshot before upload (#1138), hoping that it would catch most errors and avoid uploading a corrupted archive. It looks like this mechanism needs to be strengthened.

ghost (Author) commented Aug 16, 2023

That's unfortunate :( We should probably stop the node before snapshotting, or perhaps use some journaled FS?

Alenar (Collaborator) commented Aug 16, 2023

Stopping the node would be ideal, but it would totally change the relation between the node and the aggregator. Instead of being an addition running alongside the node, the aggregator would control it (and what about environment variables or parameters that may be needed for that?).
In some ways this would make sense, though, and further reinforce the fact that running an aggregator is a task that asks for some dedication, not something running on the sidelines like a Mithril signer.

We had not thought about a journaled FS, so I have no idea how that would work or what it would cost, but it is an interesting idea.

We mainly thought about mitigations. What we have in mind is copying the files that change (the last immutable trio, the ledger, and the volatiles) before making the snapshot (see #1140), but IMO this may just move the problem, since the copy would still happen while the node is running.
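One way the copy-before-archive mitigation could look, as a sketch only: the `ledger`/`volatile`/`immutable` directory layout is taken from the comment above, and the function name is hypothetical, not the aggregator's actual implementation.

```shell
# Stage a point-in-time copy of the files that change while the node runs,
# so the archive is built from the stage rather than the live db.
stage_snapshot() {
  db="$1"; stage="$2"
  mkdir -p "$stage/immutable"
  cp -a "$db/ledger" "$db/volatile" "$stage/"
  # Copy only the last immutable trio (chunk/primary/secondary),
  # identified by the highest file number.
  last=$(ls "$db/immutable" | sed 's/\..*$//' | sort -n | tail -n 1)
  cp -a "$db/immutable/$last".* "$stage/immutable/"
}
```

As noted above, this only narrows the race window; the copy itself still runs while the node is writing.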

ghost (Author) commented Aug 17, 2023

Yes, the aggregator is special so there's no reason to restrain ourselves in what it can do. I do think it makes total sense for it to control the node, even to fork one as part of its startup process. BTW, this could be a testbed for offering a package providing mithril+cardano-node ;)

Renaming (e.g. mv) is atomic on Linux, but that's not something we can control here.
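To illustrate the atomic-rename pattern mentioned here: write to a temporary file on the same filesystem, then `mv` it into place, so readers see either the old file or the complete new one, never a partial write. The function name is illustrative.

```shell
# Atomically replace DEST with the contents of SRC. rename(2) on the same
# filesystem swaps the file in one step, so no reader can observe a
# half-written DEST.
atomic_replace() {
  src="$1"; dest="$2"
  tmp="$dest.tmp.$$"
  cp "$src" "$tmp"   # stage next to the target (same filesystem)
  mv "$tmp" "$dest"  # atomic on POSIX filesystems
}
```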

jpraynaud (Member) commented

This is weird because I had no problem while doing the same operation on my computer with mithril-client 0.3.27+ff06651 🤔. Here is the output of the command:

$ mithril-client snapshot download fdd609c5affa627c9b19dfd32c5a370a9e6ba0f930ec50281b5f470fe3c955de
1/7 - Checking local disk info…
2/7 - Fetching the certificate's information…
3/7 - Verifying the certificate chain…
4/7 - Downloading the snapshot…
5/7 - Unpacking the snapshot…
6/7 - Computing the snapshot digest…
7/7 - Verifying the snapshot signature…
Snapshot 'fdd609c5affa627c9b19dfd32c5a370a9e6ba0f930ec50281b5f470fe3c955de' has been unpacked and successfully checked against Mithril multi-signature contained in the certificate.
                
Files in the directory './db' can be used to run a Cardano node.

If you are using Cardano Docker image, you can restore a Cardano Node with:

docker run -v cardano-node-ipc:/ipc -v cardano-node-data:/data --mount type=bind,source="./db",target=/data/db/ -e NETWORK=mainnet inputoutput/cardano-node:8.1.2

Maybe the aggregator is not responsible for that behavior; in any case, we have strengthened the archive verification: we now make sure that the archive can be fully unpacked before publishing it (see #1179).
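A minimal version of that pre-publication check, assuming a plain tar.gz archive (the helper name is illustrative): `tar -t` walks every entry, so a truncated or corrupted stream makes it fail.

```shell
# Return success only if every entry of the .tar.gz can be read end to end.
verify_archive() {
  tar -tzf "$1" > /dev/null 2>&1
}
```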

Can you provide more information about the environment where you encountered this problem?

ghost (Author) commented Aug 29, 2023

I could try again. I had some other issues last week before the workshop, but I suspect those were more a file corruption due to an interrupted network download than an issue in the snapshot.

jpraynaud (Member) commented

It looks like the warning stating that the available disk space is insufficient is not displayed when the -v option is not specified.
I suspect that this might be the source of the weird behavior.
#1192 should fix the problem 👍
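Until that fix lands, the check can be approximated by hand. A rough pre-flight sketch, assuming GNU `df`; the helper name and the 70 GiB figure (matching the ~69.44 GiB snapshot above) are illustrative:

```shell
# check_free_space DIR NEED_BYTES: succeed only if DIR's filesystem has at
# least NEED_BYTES available.
check_free_space() {
  avail=$(df -B1 --output=avail "$1" | tail -n 1)
  [ "$avail" -ge "$2" ]
}

# e.g. require ~70 GiB before running `mithril-client snapshot download …`:
#   check_free_space . $((70 * 1024 * 1024 * 1024)) || echo "not enough disk space"
```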

ghost (Author) commented Aug 30, 2023

OK, let's close this then. It's certainly a spurious problem; let's keep our eyes open should it reproduce.

ghost closed this as completed on Aug 30, 2023.