
[Feature Request] Recognize and render an mtime metadata field for "Directory" unixfs nodes #3451

Closed
mib-kd743naq opened this issue Dec 1, 2016 · 18 comments
Labels
kind/enhancement A net-new feature or improvement to an existing feature

Comments

@mib-kd743naq
Contributor

( Please move to notes if this is the wrong forum )

Version information:

N/A

Type:

Feature/Enhancement

Priority:

P4

Description:

Compare the experience of visiting a conventional mirror listing with the experience of an IPFS-rendered directory.

The presence of mtime ( at least for me ) is a very valuable indicator of "how stale is this mirror exactly?". I am aware that the number can be faked, and this is fine: it can't be authoritative anyway, only the content can be. Yet it is really useful when eyeballing things.

IPFS currently makes no provision for storing this information (neither in the current implementation nor in whatever I could find of the IPLD plans). I think making this possible would be a large step forward in usability.

On implementation: allowing specification of mtime within a UnixFS (or other type of) content node itself is a bad idea. Instead, mtimes (and possibly other future metadata) should be an optional property of links. This fits nicely into the sharded directory implementation we have going as well. Something like (note: this could very well be an IPLD-only feature, using mdag to kick-start discussion):

message PBLink {
  optional bytes Hash = 1;
  optional string Name = 2;
  optional uint64 Tsize = 3;
  optional uint64 PresentationTimestamp = 4;
}

or maybe even:

... optional timespec PresentationTimestamp = 4;

message timespec {
  required uint64 sec = 1;
  optional fixed32 nsec = 2;
}

Thoughts?

@whyrusleeping
Member

I think we should use the metadata unixfs type for this.

https://github.com/ipfs/go-ipfs/blob/master/unixfs/pb/unixfs.proto#L18

We haven't gotten around to actually doing anything with this yet, but it's basically what it was designed for.
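For reference, the Metadata message at that link looked roughly like this at the time (its only field being the required MimeType discussed below):

message Metadata {
  required string MimeType = 1;
}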

@mib-kd743naq
Contributor Author

@whyrusleeping I briefly considered it, and faced the following trouble:

  • It already defines a required MimeType, which is really... out of scope
  • It is a first class Node, just like 'File' itself. How would one attach one to the other...? With a link everything is clear - these extra fields describe exactly this link and whatever it points to.

@whyrusleeping
Member

The MimeType should be changed to optional. We shouldn't be using required fields anyway.

The metadata node links to the object it describes. They are designed to only point to a single child node.

@mib-kd743naq
Contributor Author

The metadata node links to the object it describes.

Hm... then... how does it work with a file in a dir? Let's take QmW4UsYZRtxSBtQgCkHEsFpJDMNYv8j4eFpEU97w1fEHtp:

QmW4UsYZRtxSBtQgCkHEsFpJDMNYv8j4eFpEU97w1fEHtp
  1 {
    1: Directory
  }
  2 {
    1: QmVzzr9ArZaQZSWTYgUpAN4fJrK46qmbYpf95CyD4fz7jd
    2: "foo"
    3: 109
  }
      |
      v  link "foo" resolves to
  QmVzzr9ArZaQZSWTYgUpAN4fJrK46qmbYpf95CyD4fz7jd
    1 {
      1: Directory
    }
    2 {
      1: QmSMxgMY4ZpB9H7zbn3AhLNobnjjnzkfeEKwXaAJ9ibr1G
      2: "bar"
      3: 60
    }
        |
        v  link "bar" resolves to
    QmSMxgMY4ZpB9H7zbn3AhLNobnjjnzkfeEKwXaAJ9ibr1G
      1 {
        1: Directory
      }
      2 {
        1: QmQpeUrxtQE5N2SVog1ZCxd7c7RN4fBNQu5aLwkk5RY9ER
        2: "baz"
        3: 11
      }
          |
          v  link "baz" resolves to
      QmQpeUrxtQE5N2SVog1ZCxd7c7RN4fBNQu5aLwkk5RY9ER
        1 {
          1: File
          2: "abc"
          3: 3
        }
Where do various Metadata nodes attach, and how does a containing (parent) directory find out about them for display?

@whyrusleeping whyrusleeping added the kind/enhancement A net-new feature or improvement to an existing feature label Dec 1, 2016
@whyrusleeping
Member

Think about metadata nodes as semi-transparent wrappers. They can wrap any other unixfs node to augment it with some extra information. If you point directly to a file, and not the metadata object, then you obviously don't have metadata for that file.
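To make that concrete, here is a rough sketch of the innermost "baz" entry from the listing above, rewritten with a wrapper in the same decoded notation (the wrapper hash is a placeholder, and the exact layout of the Metadata payload is an assumption, not something go-ipfs produces today):

<directory node containing "baz">
  2 {
    1: <hash of the Metadata wrapper, instead of the file's own hash>
    2: "baz"
    3: ...
  }
      |
      v
<Metadata wrapper node>
  1 {
    1: Metadata          <- unixfs Metadata payload (MimeType, and per this issue an mtime)
  }
  2 {
    1: QmQpeUrxtQE5N2SVog1ZCxd7c7RN4fBNQu5aLwkk5RY9ER   <- exactly one link, to the wrapped File
  }
      |
      v
QmQpeUrxtQE5N2SVog1ZCxd7c7RN4fBNQu5aLwkk5RY9ER
  1 {
    1: File
    2: "abc"
    3: 3
  }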

@mib-kd743naq
Contributor Author

Think about metadata nodes as semi-transparent wrappers.

Understood. Not ideal from a low-level PoV, as it doubles the number of nodes one needs to represent a directory listing, but it is workable.

I will add some tentative examples to the "unorthodox-dag" I am putting together.

@mib-kd743naq
Contributor Author

@whyrusleeping thinking about it more... I still think it is incorrect to have an extra metadata node as a standalone object. Consider the earlier example listing. One would ideally want to expand this with:

  • mtime
  • MIME (or some other) type (what is the entry pointing to)
  • the pointed-to object id ( a "comment" in the rightmost column to allow copy/pasting of the actual "direct pointer" )
  • the file/directory byte-size (not the underdefined "size" that we have at Field3 )
  • possibly POSIX perms

Currently, any of these would require an extra fetch for every entry in the directory, which even for moderately large directories will be expensive. If we add metadata nodes, there are potentially two lookups needed per entry (at least for the pointed-to object id part).

Instead, if we forget about the existence of metadata at the unixfs level and redefine:

message PBLink {
  optional bytes Hash = 1;
  optional string Name = 2;
  optional uint64 Tsize = 3;
  optional UnixfsMetadata UfsMeta = 4;
}

Then a lot of the problems with lookups go away and link nodes can remain lean if they simply omit the metadata sub-structure.
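For illustration only, such a UnixfsMetadata sub-message could carry roughly the fields from the wish-list above. Every name and field number below is hypothetical and does not exist in unixfs.proto; timespec is the message proposed at the top of this issue:

message UnixfsMetadata {
  optional timespec Mtime    = 1;  // modification time to show in listings
  optional string   MimeType = 2;  // what kind of content the link points at
  optional uint64   DataSize = 3;  // actual file/directory byte size, unlike the cumulative Tsize
  optional uint32   Mode     = 4;  // POSIX permission bits
}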

@jbenet thoughts on the above?

@mib-kd743naq
Contributor Author

P.S. Yes, I am aware of the IPLD specification effort. But it doesn't seem to cover any of this either...

@jbenet
Member

jbenet commented Jan 15, 2017

Hey @mib-kd743naq -- some quick comments:

Currently for any of these we would need to do an extra fetch for every entry in the directory

To confirm we're understanding each other: there would be at most one Metadata object for each file, not one Metadata object per metadata key-value entry. It's one dict. So a directory with N entries takes 2N nodes instead of N -- not kN (where k is the number of metadata entries used).

The reasoning for separating metadata out and making it a wrapper around the file (instead of embedding it within the file node OR in another node pointed to by the file node) is to make sure that all raw data versions of the file are deduped, and that changing the mode does not mean creating a new root object. Granted, the bulk of the data is at the leaves, so those wouldn't change anyway, only the very root, but I'm not sure that preventing the extra node fetch is ideal here.

Let me ask it this way: why is an additional node a problem?

  • (1) is it the additional RTT currently present due to an unoptimized bitswap? (i.e. each node link forces an RTT just because we haven't added path (nor selector) support to bitswap).
  • (2) or is network latency not the problem, and is it really just the additional overhead of another fetch into the disk datastore?

If it's (1), then we can proceed and optimize bitswap instead. If it's (2), then we should be working on optimizing the datastore.

I don't think either of these is a reason not to create a larger graph -- a good ipfs implementation should make additional nodes inexpensive. They are indeed expensive now, so could we work on making them not expensive, instead of hiding the expense by changing the data model?


Separately, I would like to know whether you think -- perf concerns completely aside -- the metadata object wrapper is confusing or bad from a data modeling perspective. I.e. is this the wrong thing to do, or just a "currently expensive but optimizable" thing to do?

@jbenet
Member

jbenet commented Jan 15, 2017

Quick note: since the addition of IPLD and "raw data" nodes, "unixfs-file" has become differentiated from the raw data it contains. Therefore, it may be that storing the metadata directly in the file is no longer that big of an issue. That said, we would need to do this twice, once for files and once for dirs, instead of once (once for metadata, wrapping both files or dirs).


Another point in favor of metadata wrappers: they can be fetched comfortably by an ls call, since a metadata object should not be large, so a bunch of important info can be figured out without accidentally pulling a bunch of ~256K objects.

@mib-kd743naq
Contributor Author

The reasoning for separating metadata out and making it a wrapper around the file (instead of embedded within the file node OR another node pointed to by the file node)

@jbenet this is not what I was suggesting. I am suggesting that the metadata become a property of the unixfs directory link itself, i.e. having the existing PBLink message become this:

message PBLink {
  optional bytes Hash = 1;
  optional string Name = 2;
  optional uint64 Tsize = 3;
  optional UnixfsMetadata UfsMeta = 4;
}

Could you reread my original comment/question with that in mind and respond to that? A lot of what you wrote above is very reasonable, but alas does not apply to what I said. Sorry for the confusion... :/

If my reasoning still isn't clear, I will try to explain it in much greater detail sometime towards the end of next week: work is going crazy...

@kalmi
Contributor

kalmi commented May 2, 2017

I would like to push this forward (willing to contribute code) as having optional mtime metadata could enable novel use cases such as:

  • faster, delta ipfs add of an mtime-enabled directory structure that was added in the past
  • faster, delta ipfs get of an mtime-enabled directory structure that was previously fetched to the same file system location
  • file synchronization with rsync-like performance characteristics (or maybe even faster)

I am perfectly okay with the current Metadata wrappers. Compared to @mib-kd743naq's suggestion, the only advantage I see in separate Metadata nodes is smaller memory/disk consumption when metadata is not present (and thus less to copy around). The different-root-hash argument is invalid as far as I can see, because the directory's root hash also changes with the introduction of a wrapper node. Metadata retrieval cost for large directories could be prohibitive, but I agree with @jbenet that that can be solved with bitswap.

If we are to extend the Metadata node with an optional ModTime field, is everyone okay with the timespec type proposed by @mib-kd743naq? I considered the possibility of having only second-level accuracy, but I believe that metadata should be preserved as accurately as possible.

@mib-kd743naq
Contributor Author

mib-kd743naq commented May 3, 2017

@kalmi thank you for pushing this further! Really glad someone else showed interest, as I myself got completely swamped at work and likely won't be able to get back to IPFS goodness for another 2 months :(

On the subject of where the metadata should live: I think I was misunderstood by pretty much everyone in this thread. I am still convinced that the mtime and other content-unaffecting metadata should be part of the link structure itself. The reason isn't efficiency during data access, but efficiency during metadata access ( i.e. browsing ).

To illustrate this differently: one of the projects for which I need this functionality is serving a standard ftp mirror over http, backed by IPFS. Under my proposal, displaying this listing requires downloading very few blocks. Under the "separate metadata nodes" proposal, one needs to grab thousands of them. This is a non-trivial difference.

All of this notwithstanding - I will take anything at this point.

@kalmi
Contributor

kalmi commented May 3, 2017

@mib-kd743naq, I totally get what you are saying, and yes, the performance implications also seem scary to me, but I also get the others who wish to keep PBLink lean and continue the existing path, the Metadata node. I am bound to accept that it can eventually be solved efficiently with bitswap selectors if @jbenet says so.

@kalmi
Contributor

kalmi commented May 3, 2017

I suggest adding a --save-mtime boolean flag to ipfs add that would wrap all unixfs nodes in Metadata nodes containing ModTime.
I suggest adding a --restore-mtime boolean flag to ipfs get that would set the mtime on the created files/directories/links based on the content of the Metadata wrappers.

The default behaviour wouldn't change one bit in either case.

After we have that, I will also suggest adding an Experimental.TrustModTimes boolean config option in a different issue. When this config option is true, IPFS could make use of some optimizations to make repeated adds and gets faster. I will get into that later in more detail. First, I just want to get ModTime into the Metadata structure, and implement the --save-mtime and --restore-mtime flags and their behaviour.
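As a sketch only -- the ModTime field and its number are hypothetical, and MimeType is shown already relaxed to optional as proposed earlier in this thread -- the extended Metadata message might look like:

message Metadata {
  optional string   MimeType = 1;  // existing field, proposed to become optional
  optional timespec ModTime  = 2;  // written by --save-mtime, applied by --restore-mtime
}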

I would love to have some assurance that these additions would be welcome.

@kalmi
Contributor

kalmi commented May 3, 2017

Moved timespec message format definition discussion to: ipfs/notes#60

@mib-kd743naq
Contributor Author

but I also get the others who wish to keep PBLink lean and continue the existing path

Actually, this was never explained. There was an unambiguous "we are not modifying UnixFS anymore, all new work is going into IPLD", but there was neither reasoning nor a timeline behind it.

I follow and am excited about many new developments: the OpenBazaar integration work, the Peergos stuff, and many other neat things built on top of IPFS. But as @mguentner said back in November: the fs-stuff in IPFS is still not really usable.

@jbenet @whyrusleeping: could you please add some clarity on what plans if any are there to be able to faithfully represent content like this? I know it is not sexy work, but still - inquiring minds want to know ;)

whyrusleeping pushed a commit to ipfs/go-unixfs that referenced this issue Jul 30, 2018
*** THIS IS A BREAKING CHANGE *** as per [1]: "Required is forever"

Nevertheless this seems like a good idea at this time: there are no known
producers ( nor consumers ) of MetaData nodes, and the current requirement
of MimeType has an extremely narrow application scope.

This change could very well be rejected in lieu of implementing a new type
of node ( e.g. TheRealMetadata ) in the DataType enum.

Based on ipfs/kubo#3451 (comment)

License: MIT
Signed-off-by: Mib Kd743naq <mib.kd743naq@gmail.com>

[1] https://developers.google.com/protocol-buffers/docs/proto#specifying-field-rules
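In unixfs.proto terms, the change described above amounts to roughly this (paraphrased):

message Metadata {
  optional string MimeType = 1;  // was: required string MimeType = 1;
}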
@Stebalien
Member

Will be fixed by #6920.

Jorropo pushed a commit to Jorropo/go-libipfs that referenced this issue Jan 25, 2023
This commit was moved from ipfs/go-unixfs@e94be52