[WIP] Towards Issue #875, avoid duplicating files added to ipfs #2600
Move the go commands that should run under cmd/ipfs into the Makefile in cmd/ipfs rather than doing a "cd cmd/ipfs && go ..." in the root Makefile. The "cd cmd/ipfs && go ..." lines cause problems with GNU Emacs's compilation mode. With the current setup Emacs is unable to jump to the location of an error reported by the Go compiler because it cannot find the source file. The embedded "cd" command causes Emacs's compilation mode to lose track of the current directory, so it looks for the source file in the wrong directory. License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
Hi Kevin. Has someone looked at your pull request? Do you need someone to try it? (Or wait for it to be more baked?)
I've looked over a bit of it, holding off on leaving any comments or doing a serious review until it's asked for.
@whyrusleeping Please do a serious review on what functions, keeping in mind this is a work in progress. Here is what works as of now:
There is still lots to do, but I am hesitant to put in too much work until I get some feedback that I am heading down the right path.
@jefft0 Yes, please try it out.
Both: Commits up to (and including) f26c2df are stable and I won't do any forced updates until this is closer to getting into master. Commits after that are newer and slightly less stable, and I might do a forced update if I discover a problem (for example, tests failing). If this is a problem for either of you, let me know.
Edit: I had to rebase to correct a mistake; now everything up to (and including) f26c2df should be stable.
Required for #875. License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
Also change other paths to be absolute. Required for #875. License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
Required for #875. License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
None of the other methods in the measure package return this error; instead they only call RecordValue() when the value is []byte. This change makes batch Put consistent with the other methods and allows non-[]byte data to be passed through the measure datastore. Required for #875. License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
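The behaviour described in this commit message can be illustrated with a minimal sketch. It is not the measure package's actual code: `sizeRecorder`, `measuredBatch`, and the map-backed store are hypothetical stand-ins, and only the idea (record a size when the value is a `[]byte`, forward every value regardless of type) comes from the message above.

```go
package main

import "fmt"

// sizeRecorder is a hypothetical stand-in for a metrics histogram.
type sizeRecorder struct {
	values []int
}

// RecordValue stores one observed size.
func (r *sizeRecorder) RecordValue(n int) {
	r.values = append(r.values, n)
}

// measuredBatch is a hypothetical batch wrapper around a map-backed store.
type measuredBatch struct {
	sizes *sizeRecorder
	store map[string]interface{}
}

// Put records the size only when the value is a []byte, and forwards every
// value to the underlying store instead of rejecting non-[]byte data.
func (b *measuredBatch) Put(key string, value interface{}) error {
	if data, ok := value.([]byte); ok {
		b.sizes.RecordValue(len(data))
	}
	b.store[key] = value
	return nil
}

func main() {
	rec := &sizeRecorder{}
	b := &measuredBatch{sizes: rec, store: map[string]interface{}{}}
	_ = b.Put("/raw", []byte("hello"))
	_ = b.Put("/ptr", struct{ Offset int64 }{Offset: 42})
	fmt.Println(len(rec.values), len(b.store))
}
```

Only the `[]byte` value contributes a size measurement, while both values reach the store.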
The DataPtr points to the location of the data within a file on the file system. If the node is a leaf it also contains an alternative serialization of the Node or Block that does not contain the data. Required for #875. License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
The datastore has an optional "advanced" datastore that handles Put requests for non-[]byte values, a "normal" datastore that handles all other Put requests, and then any number of other datastores, some of which can be designated read-only. Delete requests are passed on to all datastores not designated read-only. For now, querying will only work on a "normal" datastore. Note: Only tested in the case of just a "normal" datastore and the case of an "advanced" and "normal" datastore. Towards #875. License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
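The routing rules in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `memStore` and `multiStore` are hypothetical names, the stores are plain maps, and returning an error when no "advanced" datastore is configured is an assumption (the message does not say what happens in that case).

```go
package main

import (
	"errors"
	"fmt"
)

// memStore is a hypothetical map-backed datastore used for illustration.
type memStore struct {
	data     map[string]interface{}
	readOnly bool
}

func newMemStore(readOnly bool) *memStore {
	return &memStore{data: map[string]interface{}{}, readOnly: readOnly}
}

// multiStore is a hypothetical composite of the stores described above.
type multiStore struct {
	advanced *memStore   // optional; nil when not configured
	normal   *memStore   // handles []byte values
	others   []*memStore // additional stores, possibly read-only
}

// Put sends []byte values to the normal store and everything else to the
// advanced store. Failing without an advanced store is an assumption.
func (m *multiStore) Put(key string, value interface{}) error {
	if _, isBytes := value.([]byte); isBytes {
		m.normal.data[key] = value
		return nil
	}
	if m.advanced == nil {
		return errors.New("no advanced datastore configured for non-[]byte value")
	}
	m.advanced.data[key] = value
	return nil
}

// Delete is passed on to every store that is not read-only.
func (m *multiStore) Delete(key string) {
	stores := append([]*memStore{m.normal}, m.others...)
	if m.advanced != nil {
		stores = append(stores, m.advanced)
	}
	for _, s := range stores {
		if !s.readOnly {
			delete(s.data, key)
		}
	}
}

func main() {
	ro := newMemStore(true)
	ro.data["k"] = []byte("archived")
	m := &multiStore{advanced: newMemStore(false), normal: newMemStore(false), others: []*memStore{ro}}
	_ = m.Put("k", []byte("raw bytes"))
	_ = m.Put("p", struct{ Offset int64 }{Offset: 7})
	m.Delete("k")
	fmt.Println(len(m.normal.data), len(m.advanced.data), len(ro.data))
}
```

After the delete, the key is gone from the normal store but still present in the read-only one.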
This involved: 1) Constructing an alternative data object that instead of raw bytes is a DataPtr with information on where the data is on the file system and enough other information in AltData to reconstruct the Merkle-DAG node. 2) A new datastore "filestore" that stores just the information in DataPtr. When retrieving blocks the Merkle-DAG node is reconstructed by combining AltData with the data from the file in the file system. Because the datastore needs to reconstruct the node it needs access to the Protocol Buffers for "merkledag" and "unixfs" and thus, for now, lives in go-ipfs instead of go-datastore. The filestore uses another datastore to store the protocol buffer encoded DataPtr. By default this is the leveldb datastore, as the size of the encoded DataPtr is small. Towards #875. License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
Towards #875. License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
In merkledag.Node and blocks.Block maintain a DataPtr License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
This is looking promising. There are definitely a few problems we're going to have to think through, but this is progress. The first thing I suggest is to not add extra parameters to the blockstore methods, instead we can look at making |
@whyrusleeping I am not sure I am completely following. Are you saying you would like to see something like the |
@kevina yeah, basically. The idea is also to not have to do weird augments to the types like you're doing to |
@whyrusleeping I see what you are saying about not adding the Are you also against adding a generic In the future I might want a way to pass additional parameters to the |
I am also against adding extra parameters to the existing methods. I think keeping the methods simple and using the type system for conveying extra information is a better option.
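One way to read this suggestion is sketched below: the blockstore method keeps a single `Block` parameter, and a block that carries extra information is recognised by its concrete type. All names here (`Block`, `BasicBlock`, `FileBlock`, `blockstore`) are hypothetical simplifications for illustration, not the go-ipfs types or the design the reviewers settled on.

```go
package main

import "fmt"

// Block is a hypothetical minimal block interface.
type Block interface {
	Key() string
	Data() []byte
}

// BasicBlock holds its data directly.
type BasicBlock struct {
	key  string
	data []byte
}

func (b *BasicBlock) Key() string  { return b.key }
func (b *BasicBlock) Data() []byte { return b.data }

// FileBlock is hypothetical: a block that also knows where its data lives on
// disk. The extra information travels in the type, not in extra parameters.
type FileBlock struct {
	BasicBlock
	Path   string
	Offset int64
}

// blockstore keeps copied data and file references in separate maps.
type blockstore struct {
	copied     map[string][]byte
	referenced map[string]string
}

// Put has the same signature for every block; a type switch decides whether
// to store a reference or a copy.
func (bs *blockstore) Put(b Block) {
	switch blk := b.(type) {
	case *FileBlock:
		bs.referenced[blk.Key()] = fmt.Sprintf("%s@%d", blk.Path, blk.Offset)
	default:
		bs.copied[b.Key()] = b.Data()
	}
}

func main() {
	bs := &blockstore{copied: map[string][]byte{}, referenced: map[string]string{}}
	bs.Put(&BasicBlock{key: "k1", data: []byte("inline")})
	bs.Put(&FileBlock{BasicBlock: BasicBlock{key: "k2", data: []byte("on disk")}, Path: "/data/video.mp4", Offset: 262144})
	fmt.Println(len(bs.copied), bs.referenced["k2"])
}
```

Callers that only deal with ordinary blocks never see the extra fields, which keeps the method signatures unchanged.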
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
@whyrusleeping okay. I just pushed two commits to make |
@@ -25,6 +25,8 @@ type Repo interface {
	// SetAPIAddr sets the API address in the repo.
	SetAPIAddr(addr string) error

	Self() Repo
why is this needed?
I needed a way to get to the FSRepo, and hence to the filestore.Datastore, in the next commit (4ef5531). In this commit node.Repo.(*fsrepo.FSRepo) does not work as node.Repo is a repo.ref, so I needed to add the Self() method and use node.Repo.Self().(*fsrepo.FSRepo).
The node.Repo should either be an fsrepo or a mock repo. It should work with a simple type assertion.
"In this commit node.Repo.(*fsrepo.FSRepo) does not work as node.Repo is repo.ref"
I'm not sure what you mean by this. What is repo.ref?
node.Repo is not an FSRepo, it is a repo.ref. repo.ref is defined here: https://github.com/ipfs/go-ipfs/blob/c067fb9e83e89cf04226d2c43de7c6fd5ebbccd2/repo/onlyone.go#L50. It may be easier for you to just try it without the Self() and see for yourself.
Ah, I see now. I incorrectly assumed the fsrepo.Open constructor returned an fsrepo
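The point of this thread can be reproduced in a few lines. The sketch below uses simplified stand-ins (`Repo`, `FSRepo`, `ref`) rather than the real go-ipfs types, and assumes the wrapper embeds the `Repo` interface so that `Self()` is promoted from the wrapped value.

```go
package main

import "fmt"

// Repo is a cut-down, hypothetical version of the interface in the diff above.
type Repo interface {
	SetAPIAddr(addr string) error
	Self() Repo
}

// FSRepo stands in for the concrete file-system-backed repo.
type FSRepo struct {
	apiAddr string
}

func (r *FSRepo) SetAPIAddr(addr string) error {
	r.apiAddr = addr
	return nil
}

// Self returns the concrete repo itself.
func (r *FSRepo) Self() Repo { return r }

// ref stands in for a wrapper that embeds a Repo. Because Repo is embedded,
// calling Self() on a *ref is forwarded to the wrapped repo.
type ref struct {
	Repo
}

func main() {
	var r Repo = &ref{Repo: &FSRepo{}}

	// The dynamic type of r is *ref, so a direct assertion fails.
	_, direct := r.(*FSRepo)

	// Self() returns the wrapped value, so the assertion succeeds.
	_, viaSelf := r.Self().(*FSRepo)

	fmt.Println(direct, viaSelf) // prints "false true"
}
```

A type assertion inspects only the dynamic type held by the interface value, which is why the wrapper hides the concrete repo until it is unwrapped.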
@jefft0 if you want to do some preliminary testing now would be a good time Here is what you can try: Add a file with There is basic help now available for the Objects are pinned as usual when adding, but removing the pin and running |
Add tests for: filestore ls, filestore verify, filestore rm-invalid, filestore rm. Also rename t0046-add-no-copy.sh to t0260-filestore.sh. License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
Boy, that's a lot of work in a short amount of time! I found one counter-intuitive thing. I made a short hello file and did |
@jefft0 Thanks for testing it out!
I don't see it as a major issue. The objects are actually still valid; if you try to get the old object the contents would not have changed. "ipfs filestore verify" is meant to verify individual blocks, not complete files. It should be fairly easy to add a command that will verify files and check that the "WholeFile" flag is correct, but that is a low priority for me right now.
OK. Is it ready for me to stress test with a bunch of 200 MB video files (my use case)? If you're still tweaking performance, I'll hold off.
Simplify the files.File interface by combining the Offset() and AbsPath() methods into one that returns a files.ExtraInfo interface that can be extended with additional information. Simplify chunk.Splitter by returning a Bytes struct in the NextBytes() method. This eliminates the need for the AbsPath() method and the need to return the data and offset separately. License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
…der. License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
I have not tuned it for performance at all but I would be interested in how well it performs on a large file.
I just pushed some commits that basically implemented this and simplified a lot of code. The DataPtr stuff in merkledag.Node contains different information so I can not use this new type for that.
…eader. Remove ExtraInfo() method from the files.File interface as it is not strictly necessary. Also add SetExtraInfo() to AdvReader to eliminate the need for the NewReaderWaddOpts wrapper. License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT Signed-off-by: Kevin Atkinson <k@kevina.org>
On my MacBook Pro, I added a folder with 24 video files totaling 3GB. |
I found weird bug/glitch in It happened when I tried to add/verify ceph-base_10.2.0-1~bpo80%2b1_amd64.deb, (copy for archival reasons):
but
The block could be extracted with
badblock-6PKo.bin - extracted block Doing
and BUT, doing
For |
@jefft0 I am glad to hear that adding is faster. Retrieving locally is a bit slower, but my informal tests indicate that this is due to always verifying the hash. A better solution might be to use modification times and only verify when the file's modification timestamp has changed.
@kevina, If I already did |
for _, b := range bs {
	if _, ok := w.cache.Get(b.Key()); !ok {
		// Don't cache "advance" blocks
		if _, ok := b.(*blocks.BasicBlock); ok {
This check doesn't need to be just for a single block type. If we have a given block in the flatfs store, there's no need to add a filestore reference to it as well.
The reason for this change is related to #2600 (comment)
I think the next step here is going to be to break up the changes in this PR and start working some of the 'framework' stuff into the codebase. For example: the change to the blocks.Block type could be extracted and merged in separately.
At the moment the blocks will be in both datastores. I am working on doing something about this.
Okay, I will start by separating out the change of blocks.Block to an interface type, as I would like to see that go in to prevent bitrot.
100% agreed, avoiding stagnation is good.
Closing and creating new request, See #2634. |
NOT READY FOR MERGE
This is a work in progress, but I wanted to get some early feedback on my work towards #875. Implementing this feature touched a lot of code and requires some API changes.
Basic adding of files without storing the data now works. This currently needs to be done when the node is offline. To use:
will add the file without copying the blocks. If the file is moved or changed, then any blocks created from that file will become invalid.
Notes for review: