Normative opinions on filesystem features #212

Closed
ahankinson opened this issue Oct 10, 2018 · 15 comments

Comments

@ahankinson
Contributor

I'm wondering if there is a place somewhere in the spec (i.e., normative) where we can say something about the use of symlinks, and that they MUST NOT be used to implement versioning.

Reasons:

  1. Not portable
  2. Not implementable on Object Stores
  3. May cause problems or unexpected behaviours with some clients
  4. Inconsistencies with relative v. absolute paths

Are there other filesystem-level features that we should mention? Permissions? Extended attributes? Character sets? Path length restrictions / considerations? (Our storage guys want the shortest path possible for performance reasons) Restricted words / characters?

@ahankinson ahankinson added the Question Further information is requested label Oct 10, 2018
@neilsjefferies
Member

@ahankinson
Contributor Author

File path separators as well; see #219

ahankinson added a commit that referenced this issue Oct 12, 2018
@zimeon zimeon added this to the Beta milestone Oct 12, 2018
@neilsjefferies
Member

I've been thinking about this and maybe the best way would be to say that OCFL treats symlinks and hard links as if the file was in situ for the purposes of operations and validation. In other words, OCFL does not recognise them as anything other than files.

A side effect is that you could use symlinks for dedupe and the corresponding OCFL inventory would appear to be an undeduplicated one, and would/should validate and copy as if it were so.

@ahankinson
Contributor Author

I've had problems in the past with file operations that helpfully 'resolve' symlinks, meaning copying from one FS to another can result in vastly different amounts of storage used. Or, the opposite, where paths are given as absolute symlinks, and thus everything breaks.

I think it would be easier all round (validation, checksumming, client operations) if we were to consider symlinks verboten.
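The storage difference described above can be reproduced with a small sketch. It is illustrative only: the file names, the payload size, and the helpers `tree_size` and `demo_copy_sizes` are hypothetical, and it assumes a platform where `os.symlink` is permitted. It relies on `shutil.copytree` following symlinks by default and preserving them only when `symlinks=True` is passed.

```python
import os
import shutil
import tempfile


def tree_size(root):
    """Sum the sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def demo_copy_sizes(payload=b"x" * 1000):
    """Copy a tree holding one file and one symlink to it, in both modes."""
    base = tempfile.mkdtemp()
    src = os.path.join(base, "src")
    os.mkdir(src)
    with open(os.path.join(src, "a.bin"), "wb") as fh:
        fh.write(payload)
    # Relative link target, so the link stays valid inside a copied tree.
    os.symlink("a.bin", os.path.join(src, "b.bin"))

    resolved = os.path.join(base, "resolved")
    preserved = os.path.join(base, "preserved")
    shutil.copytree(src, resolved)                  # default: links are followed
    shutil.copytree(src, preserved, symlinks=True)  # links are kept as links

    sizes = (tree_size(src), tree_size(resolved), tree_size(preserved))
    shutil.rmtree(base)
    return sizes
```

With the default payload, the "resolved" copy holds twice the bytes of the source, while the "preserved" copy holds the same amount but now depends on the link target surviving the transfer.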

@neilsjefferies
Member

That means we have to detect symlinks and hard links and that makes any validator rather less portable. I don't think OCFL should worry about how a filesystem path gets to a bytestream - or else we end up delving into the guts of filesystems. Do we want to detect mount points too, how about loopback and overlay mounts? Hierarchical filesystems?

@ahankinson
Contributor Author

I think it's exactly the point of OCFL that we worry about how a filesystem path gets to a bytestream.

@neilsjefferies
Member

As per my note - that is not actually very practical. Either we adopt a fairly minimal approach to filesystem operations or we get mired in filesystem specific features.

@ahankinson
Contributor Author

It is very practical. It's taking an opinionated stand on a well-understood and widely implemented filesystem feature that can have long-reaching implications on bitstream addressability across data transfers and technology migrations. I think we should come down on the side of 'no', for the reasons stated in my first post.

We have spent quite a bit of time and effort modelling de-duplication and content addressability in the inventory. We could have saved ourselves a lot of bother if we had just said "use symlinks if you want de-dupe". But we didn't because there is (I think) a general recognition that symlinks in the scenarios that we are envisioning OCFL will be applied are generally a Bad Idea™.

Detecting and displaying symlink aliases is built into just about every *nix utility that needs it (find, ls, bash (and other shells)). It's not a boutique problem. Higher-level languages (Python, Go, etc.) have cross-platform libraries that will also recognize symlinks and aliases across platforms.

I also don't buy the 'slippery slope' argument. We can have an opinion about symlinks without necessarily having one about mount points, or loopback mounts.
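As a rough illustration of the detection step mentioned above, a validator could walk an object directory and report any symlinks it finds. This is a minimal sketch, not anything the spec defines: `find_symlinks` is a hypothetical helper name, and it only uses `os.walk` and `os.path.islink` from the Python standard library. How it behaves for Windows-specific link types is not covered here.

```python
import os


def find_symlinks(root):
    """Return sorted paths (relative to root) of every symlink under root."""
    found = []
    # followlinks=False: linked directories are listed but not descended into.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Links to directories show up in dirnames, links to files
        # (including broken links) show up in filenames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                found.append(os.path.relpath(path, root))
    return sorted(found)
```

A validator taking the "no symlinks" position would treat a non-empty result as an error; one taking the transparent position would never call it.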

@neilsjefferies
Member

I'm not convinced they are a bad idea per se - they exist and are used for good reasons. The problems arise from inconsistent treatment - either treat them transparently as file/directory pointers like any other path or as data to be copied. I am suggesting the first approach is sufficient and requires nothing of OCFL. Essentially, as long as fopen(path)/seek/read/write works.

The cross-platform libraries are not that good - NTFS links work differently and you end up having to do Windows specific things to work with them.

Now I would certainly favour recommendations in Implementation Notes not to do any of the things I mentioned, when dealing with a simple, vanilla filesystem. But there are cases, such as hierarchical file stores, where files are replaced by symlinks when they are staged to other storage. Traversing the symlink is what triggers retrieval. I don't see why OCFL should mind that.

Let's face it, when we implement object support, symlinks are basically what OCFL is going to have to do.

@ahankinson
Contributor Author

ahankinson commented Oct 16, 2018

Considerations around hierarchical storage management, like Compellent arrays, are currently out-of-scope since they implement their bitstream linking at the system level, not at the userland level. They present the user with a consistent view of their directory structure, regardless of the underlying implementation of which physical drive stores the file (the answer is, of course, none of them...). This is no different than object stores implementing their file/folder structures on hadoop / HDFS / HBase. At some point it is all software, yes, but the underlying principle of OCFL is that the compatibility layer is the hierarchical file/directory tree model. Symlinks break this, turning a hierarchy into a "messy half-***ed graph." :) [1]

Inconsistent treatment can arise from innocuous and normal filesystem activities, like transferring files over the network with scp, SFTP, or rsync. Aliases / symlinks also do not work with object stores. And they don't transfer from *nix to non-*nix FSes. There are tons of reasons not to use them, and I'm saying we surface those to the point of making it part of the spec.

Symbolic links do exist and they solve certain problems. But I still think we should come down on the side of 'don't use them' if you have any interest in DP.


PS: Anecdotal, but interesting reading I uncovered while doing some background on this issue:

'Symlink race' security vulnerability : https://en.wikipedia.org/wiki/Symlink_race
Rob Pike on symlinks: https://9p.io/sys/doc/lexnames.html

A good example of the problems with dealing with symlinks can also be found in the Go os module documentation. CTRL-F for 'symbolic'. : https://golang.org/pkg/os/

Not offered as evidence, just as interest.

[1] https://news.ycombinator.com/item?id=3075522

@neilsjefferies
Member

I am well aware of the arguments on both sides. Symlinks are a slippery slope, however. At the very least, by your argument, hard links should also be included because they make a filesystem a hard graph with no underlying hierarchy (which symlinks don't do, there is always a primary hierarchy - or they're broken). However, they are almost always treated consistently by tools, so they are arguably better from a DP point of view, except that you often don't realise you only have one copy of a file until it's too late!
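The "only one copy" risk with hard links can at least be surfaced on POSIX-style filesystems by looking at link counts. The sketch below is an assumption-laden illustration: `find_multiply_linked` is a hypothetical helper, and it relies on `st_nlink`, `st_dev` and `st_ino` from `os.lstat` being meaningful on the host filesystem, which is not guaranteed everywhere.

```python
import os
import stat


def find_multiply_linked(root):
    """Group regular files under root that share one inode (hard links)."""
    by_inode = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                key = (st.st_dev, st.st_ino)
                by_inode.setdefault(key, []).append(os.path.relpath(path, root))
    # Each group lists names under root that point at the same stored bytes.
    return sorted(sorted(group) for group in by_inode.values())
```

A group with a single entry means the other name for that file lives outside `root`.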

I think it is clearer to require that an OCFL hosting filesystem supports a certain minimal set of features and, as long as every path in the inventory resolves to a bitstream (using fopen(pathname) or whatever) that can be subject to the basic OCFL file operations that we have defined, then OCFL need not have an opinion on the underlying mechanics as far as validation goes. However, we can and should make recommendations in the Implementation Notes about what is good practice.
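The "every path resolves to a bitstream" position can be sketched as a check that simply opens each path and compares digests, without asking how the filesystem got there. This is only an illustration: `verify_manifest` is a hypothetical helper, the manifest is assumed to be a mapping from hex digest to a list of content paths relative to the object root, and `sha512` is an assumed default algorithm.

```python
import hashlib
import os


def verify_manifest(object_root, manifest, algorithm="sha512"):
    """Return content paths that cannot be read or do not match their digest."""
    failures = []
    for expected, paths in manifest.items():
        for rel in paths:
            digest = hashlib.new(algorithm)
            try:
                # open() follows symlinks and hard links transparently.
                with open(os.path.join(object_root, rel), "rb") as fh:
                    for chunk in iter(lambda: fh.read(65536), b""):
                        digest.update(chunk)
            except OSError:
                failures.append(rel)
                continue
            if digest.hexdigest() != expected.lower():
                failures.append(rel)
    return sorted(failures)
```

Under this approach a path that is really a link validates exactly like a regular file, which is the behaviour being argued for here and the behaviour the opposing view wants to rule out.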

@zimeon
Contributor

zimeon commented Oct 17, 2018

I also think we shouldn't say anything normative about links. I think it is helpful to advise that, on systems that support them, they should not be used. I think a discussion of filesystem and object store affordances/issues should be included in the implementation notes.

@awoods
Member

awoods commented Oct 17, 2018

+1 to non-normative advice/suggestions in the implementation notes.

@zimeon
Contributor

zimeon commented Nov 7, 2018

Should have post-facto review of #246 before closing, perhaps 👍 on this comment.

@ahankinson ahankinson self-assigned this Feb 6, 2019
@zimeon
Contributor

zimeon commented Apr 27, 2019

Closing as we have 6 👍 on #212 (comment) above

@zimeon zimeon closed this as completed Apr 27, 2019

4 participants