
Garbage collector deletes data stored in the MFS (which was pinned) #7008

Open

RubenKelevra opened this issue Mar 17, 2020 · 6 comments

Labels: kind/bug (a bug in existing code, including security flaws)

@RubenKelevra (Contributor) commented Mar 17, 2020

Version information:

go-ipfs version: 0.4.23-6ce9a355f
Repo version: 7
System version: amd64/linux
Golang version: go1.14

Description:

I'm using IPFS in a script that updates the local MFS as needed. New files are added with ipfs files cp /ipfs/<cid> /path/to/file after ipfs-cluster-ctl has added them to the cluster.

So the files are pinned locally (by the cluster service) and also stored in the MFS.

Files that should be deleted are removed from the MFS, and I use ipfs-cluster-ctl to set an expiry timeout of 14 days on the pin.
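For reference, the update loop looks roughly like this (a minimal sketch, not my actual script; the paths, the <cid> placeholders and the 336h expiry value are stand-ins):

```shell
# Sketch of the update workflow described above (placeholder paths/CIDs).

# 1. Add new content to the cluster; this pins it on the allocated nodes.
ipfs-cluster-ctl add /srv/source/file.img

# 2. Reference the same content from the local MFS
#    (<cid> is taken from the output of the previous command).
ipfs files cp /ipfs/<cid> /path/to/file.img

# 3. To delete: drop the MFS entry and let the cluster pin expire in 14 days.
ipfs files rm /path/to/old-file.img
ipfs-cluster-ctl pin add --expire-in 336h <old-cid>
```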

Since I started to add a lot of files to the repo, I decided to let the garbage collector deal with old stuff and clean up the repo.

After the garbage collector completed its work, I cannot get the hashes or the content of some files stored in the MFS. This is unexpected and, as far as I understand, should not happen.

ipfs files ls /path/to/file/ | grep "filename" shows that the directory still contains the file when the daemon is freshly started. After running files stat --hash on the file, the directory can no longer be listed until the daemon is restarted.

$ ipfs files stat --hash --timeout 120s /path/to/a/file.img
Error: Post "http://127.0.0.1:5001/api/v0/files/stat?...&hash=true&stream-channels=true&timeout=120s": context deadline exceeded

ipfs-cluster-ctl shows me the CID and that it's allocated on the local node (and pinned).

ipfs dht findprovs <CID> (the cid taken from ipfs-cluster-ctl) returns with no result - which explains why I cannot access the file anymore.

ipfs pin ls --timeout=120s /ipfs/<CID> results in a timeout.

$ ipfs repo verify returns with a successful integrity check of the repo.

IPFS/IPFS-Cluster stores the blocks and the databases on a ZFS filesystem which reports no integrity errors.
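One way to check directly whether the blocks behind the CID are present in the local repo, without waiting on DHT lookups (a sketch; <CID> stands for the affected file's CID, as above):

```shell
# Check whether the blocks behind <CID> exist in the local repo
# (sketch; <CID> is the affected file's CID from ipfs-cluster-ctl).
ipfs block stat --timeout 10s <CID>   # is the root block present locally?
ipfs refs -r --timeout 10s <CID>      # can all child blocks be resolved?
ipfs refs local | grep <CID>          # raw scan of the local blockstore
```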

@RubenKelevra added the kind/bug label on Mar 17, 2020
@RubenKelevra (Contributor, Author)

After a fresh start of the ipfs daemon, I cannot remove the one file I have identified so far from the MFS.

$ ipfs files rm /path/to/file.bin does not return

I'm trying to recover from the situation by adding all files to the IPFS repo again (with pin=0). Hopefully only the blocks are missing and the metadata is not corrupt.

@RubenKelevra (Contributor, Author)

So the issue is 'just' missing blocks, which also leads to unfulfillable requests such as files stat --hash on a file with missing blocks, or a non-working files rm.

After adding all files again without pinning, I could remove the problematic file, and I found 3 other files whose blocks were also missing. I re-added them from a backup and could continue.

So the GC seems to be unsafe to use while anything is happening in the MFS. Especially worrying for me was that the file was both in the MFS and pinned. Since the files were all pinned, I don't see how this happened in the first place. Maybe ipfs-cluster-service unpins and immediately re-pins when I set a timeout on a pin with ipfs-cluster-ctl pin add --expire-in, and the file got garbage-collected during the short window in which it was unpinned.

This still doesn't explain why a file which is in the MFS can lose its blocks while the GC is running.
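My best guess, then, is one of two failure modes: the GC's root set not (reliably) including the MFS root, or the GC running inside the unpin/re-pin window. A toy mark-and-sweep model (plain Python with illustrative names — this is not go-ipfs code) shows how either case sweeps blocks that a correct root set would keep:

```python
# Toy mark-and-sweep model of an IPFS-style blockstore GC.
# All names (reachable, gc, Qm... CIDs) are illustrative, not go-ipfs internals.

def reachable(root, links):
    """Return all blocks reachable from root via child links."""
    seen, stack = set(), [root]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        stack.extend(links.get(cid, []))
    return seen

def gc(blockstore, pin_roots, mfs_root, links, include_mfs=True):
    """Keep only blocks reachable from the GC roots; sweep the rest."""
    roots = set(pin_roots)
    if include_mfs and mfs_root is not None:
        roots.add(mfs_root)
    live = set()
    for r in roots:
        live |= reachable(r, links)
    return {cid: data for cid, data in blockstore.items() if cid in live}

# A file "QmFile" made of two leaf blocks, referenced by the MFS root dir.
links = {"QmMfsRoot": ["QmFile"], "QmFile": ["QmLeaf1", "QmLeaf2"]}
store = {c: b"data" for c in ["QmMfsRoot", "QmFile", "QmLeaf1", "QmLeaf2"]}

# Healthy GC: the file is pinned, so everything survives.
assert gc(store, {"QmFile"}, "QmMfsRoot", links) == store

# Failure: GC runs during the unpin/re-pin window (no pin root) AND the
# MFS root is missing from the root set -> the file's blocks are swept.
swept = gc(store, set(), "QmMfsRoot", links, include_mfs=False)
assert "QmLeaf1" not in swept
```

With a correct root set (include_mfs=True), the MFS reference alone keeps the blocks alive even while the pin is briefly gone; the observed data loss matches the case where that protection fails.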

@ribasushi (Contributor)

This sounds like a missing lock somewhere. The team is in overdrive right now trying to get #6776 out the door, so the response might be delayed by a week or two.
Sorry about that!

@RubenKelevra (Contributor, Author)

@ribasushi I don't expect this one to be prioritized, since it's just a race condition anyway. It may only be happening in my setup and similar ones.

But I think it should be reviewed once the first RC is out, just to make sure it's not a widespread issue. :)

I commented several times to document my recovery efforts and to capture as much information about this event as possible, not to bump the issue.

Some thoughts on this topic:

There was no error, warning, or info message while this happened, nor afterwards while access was not possible.

I'm wondering how files stat --hash can be impacted by missing data, since a simple files ls can list the content of the folder. I think a stat with --hash is trying to read too much data; it should just access the directory listing and return the hash.

I'm not sure why files rm can fail if the element is missing. I think this could be optimized too, so that it doesn't require access to the data behind a CID if the user requests its removal. GC would remove the CID and any remaining blocks anyway, since they are no longer referenced. Or am I missing something? 🤔

@RubenKelevra (Contributor, Author) commented Apr 5, 2020

I can confirm this bug for this version as well:

go-ipfs version: 0.5.0-dev-6c45f9ed9
Repo version: 9
System version: amd64/linux
Golang version: go1.13.8

After each run of the GC, I basically have to stop my scripts and add the data back to the repo with pin=0 to make sure everything is still available to IPFS :/
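Sketched, the workaround looks like this (assuming the original data is still available on disk; /srv/source is a hypothetical path standing in for my data directory):

```shell
# Workaround sketch: after each GC run, re-add all source data without
# pinning, so any blocks the GC wrongly deleted are restored.
# /srv/source is a hypothetical stand-in for the real data directory.
ipfs repo gc
ipfs add --recursive --pin=false --quieter /srv/source
```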

@schomatis (Contributor)

Probably related to #6113.
