Characters allowed in paths #407
I don't see a problem here, other than that on a Windows system you are going to have real problems recreating an original Linux path that validly contains a backslash. The example inventory above seems to be a valid representation of an OCFL object for two files:
Obviously the implementation isn't on Windows and hasn't cared about portability of the OCFL objects. Otherwise it could have normalized the internal content paths to avoid backslashes (the ones in
I guess my point is that it's easy to create paths that aren't portable, and, if the spec doesn't want to restrict this, I think it should include a stronger recommendation to normalize content paths at the very least in the implementation notes and possibly in the spec itself. I quoted the only passing mention of file name portability in the first post, and the implementation notes only address modifying content paths to accommodate objects with a deep hierarchy. |
On the community call today it was asked that I share some of the research I've done recently regarding path/filename restrictions on various systems. It's far from exhaustive, but below are some of the resources I've been referencing.
For myself, I am currently pursuing mapping logical paths directly to content paths, and returning errors if the paths cannot be mapped because they do not meet a configurable set of constraints. This is based on the assumption that users may want the paths within a version's content directory to be identical to their logical paths for preservation reasons and may not care if the paths are Windows or cloud safe. That said, this expectation immediately breaks down in the case of a rename operation.
Alternatively, users could enable sanitizers that attempt to keep the content path as close to the logical path as possible, but freely modify the content path to make it safe for storage. If you truly don't care about content paths, you could just store files using their content-addressable digest in a truncated n-tuple layout within the content directory, which would make objects extremely difficult to interpret without their inventory. That said, even with a direct logical path mapping, an OCFL object without its inventory file could be challenging to interpret a few versions in.
General
Regardless of filesystem limitations, there may be strings that should not be allowed in paths. The only examples of this that I can think of are
Filesystems
This table provides a good overview of the constraints of various filesystems. I haven't looked at all of the filesystems in detail, but, from what I'm seeing, the following are some general characterizations.
Linux
Windows
Cloud Object Stores
Amazon S3
Azure Blob Storage
Google Cloud Storage
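The two strategies described in this comment (a direct logical-to-content mapping that fails when a configurable constraint is violated, and a digest-based truncated n-tuple layout) can be roughly sketched as follows. This is only an illustration, not any particular implementation: the function names, the forbidden-character set (the characters Windows reserves in file names), the 255-byte segment limit, and the tuple sizes are all assumptions chosen for the example.

```python
import hashlib

# Assumed constraint set for illustration: characters Windows reserves in file names.
WINDOWS_RESERVED = frozenset('<>:"\\|?*')


class PathConstraintError(ValueError):
    """Raised when a logical path cannot be mapped directly to a content path."""


def map_logical_path(logical_path, forbidden_chars=WINDOWS_RESERVED, max_segment_bytes=255):
    """Return the logical path unchanged as the content path, or raise if a constraint fails."""
    for segment in logical_path.split("/"):
        if not segment:
            raise PathConstraintError(f"empty segment in {logical_path!r}")
        bad = sorted(forbidden_chars.intersection(segment))
        if bad:
            raise PathConstraintError(f"forbidden characters {bad} in {segment!r}")
        if len(segment.encode("utf-8")) > max_segment_bytes:
            raise PathConstraintError(f"segment longer than {max_segment_bytes} bytes")
    return logical_path


def digest_content_path(content, tuple_size=2, tuple_count=3):
    """Build a truncated n-tuple path from the SHA-512 digest of the file content."""
    digest = hashlib.sha512(content).hexdigest()
    tuples = [digest[i * tuple_size:(i + 1) * tuple_size] for i in range(tuple_count)]
    return "/".join(tuples + [digest])
```

With these assumed defaults, `map_logical_path("docs/readme.txt")` returns the path unchanged, while a path containing a backslash raises `PathConstraintError`. The digest layout sidesteps character problems entirely, at the cost of content paths that mean nothing without the inventory.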
|
I'm still wading through path-related edge cases, which has raised some additional questions around logical paths. All of the logical paths in the spec's examples are relative. However, as far as I can tell, the spec does not require this. Can logical paths be absolute? More generally, should logical paths be treated as entirely opaque, more as a key than as a path? This is the route that I'm currently pursuing, but I think it's worth clarifying. Are the following logical paths the same or different?
|
Thanks @pwinckles -- I think you make very good points about
|
The intention was that the logical paths represented a 'virtual view' of the object at a given point in time, and in that respect the paths "../" and "/" don't mean anything, since the 'Object Root' is the root. (I guess you could think of the logical paths being However, I do think leading characters should not be allowed, simply because it makes it much harder for naive clients to process potentially damaging things. Imagine if a quick-and-dirty client saw a logical path of |
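A minimal sketch of the kind of check a client could apply before writing a logical path to disk. It assumes the problematic forms are the ones named earlier in this comment (a leading "/" and "../"), and additionally rejects "." and empty segments; the function name is hypothetical and nothing here is taken from the spec.

```python
def is_safe_logical_path(logical_path):
    """Return True if the path is relative and contains no '.', '..' or empty segments."""
    if logical_path.startswith("/"):
        return False  # absolute path: would escape the object root on a naive client
    for segment in logical_path.split("/"):
        if segment in ("", ".", ".."):
            return False
    return True
```

For example, `is_safe_logical_path("foo/bar.txt")` is true, while `"/etc/passwd"`, `"../secret"`, and `"a/../b"` are all rejected.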
The Filename Wikipedia article has a good collection of banned characters, problem filenames (who knew that
Also, something to add to your list above @pwinckles. I mentioned this on the call.
Filesystem character normalization
Not all filesystems or storage systems treat non-ASCII characters the same way. http://www.i18nguy.com/unicode/filename-issues-iuc33.pdf
This may be a problem when translating file paths from a tool to JSON (UTF-8) and back again. It may also be a problem when transferring files to a cloud provider if you write the manifest on your local machine and assume that the paths on the cloud service will be the same. For 99.999% of uses this is probably not a problem, but it would be a problem if a file gets effectively 'lost' because the system cannot address it by path.
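To make the normalization issue concrete, here is a small sketch using Python's standard `unicodedata` module. The file name is a made-up example; the point is that the same visible name can have two different byte encodings, so a path recorded in one form may not match the path a storage system reports in another.

```python
import unicodedata

# Hypothetical file name containing U+00E9 (e with acute accent).
name = "caf\u00e9.txt"

nfc = unicodedata.normalize("NFC", name)  # precomposed: a single code point for the accented e
nfd = unicodedata.normalize("NFD", name)  # decomposed: 'e' followed by a combining acute accent


def same_name(a, b, form="NFC"):
    """Compare two names after normalizing both to the same Unicode form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(nfc == nfd)           # the raw strings differ
print(same_name(nfc, nfd))  # they match once normalized
```

A tool that compares inventory paths against a directory listing byte for byte would treat these as two different files, which is how a file can end up effectively unaddressable.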
|
From the filename wiki:
But, I'd imagine if you're still running Windows 95/98/ME you have bigger problems to worry about than naive OCFL clients overwriting system files with
While we are on file name oddities, there is also the issue of various Mac filesystems (HFS and HFS+, I think) not being case sensitive (they preserve case but don't distinguish it).
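A rough way to spot paths that would collide on a case-insensitive filesystem is to group them by a case-folded key. This sketch uses Python's `str.casefold()` as an approximation; the actual folding rules of any given filesystem may differ, and the function name is hypothetical.

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Return groups of paths that differ only by case (approximated with casefold)."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

For example, `find_case_collisions(["README.txt", "readme.txt", "other.txt"])` reports the first two paths as one colliding group.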
Resolved with #423 ? |
Resolved with the flurry of activity from above. |
The spec does not prohibit the use of any characters in paths (logical and content) and states:
The only requirement is that a / is used as the path separator:
This leaves the door open to significant portability problems. Take this inventory as an example:
On a Linux system that inventory will be interpreted correctly, but on Windows those paths are identical! I wonder what would happen if that object was copied to Windows...
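The collision can be demonstrated without touching a filesystem, using the pure path classes in Python's standard `pathlib`. The two content paths below are hypothetical stand-ins for the kind of paths in question, one containing a backslash and one containing a forward slash in the same position.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical content paths that differ only in "\" versus "/".
paths = ["v1/content/dir\\file.txt", "v1/content/dir/file.txt"]

posix_parts = [PurePosixPath(p).parts for p in paths]
windows_parts = [PureWindowsPath(p).parts for p in paths]

print(posix_parts[0] == posix_parts[1])      # False: on POSIX the backslash is part of the file name
print(windows_parts[0] == windows_parts[1])  # True: on Windows both characters are separators
```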
This problem also applies to the contentDirectory field, which is not allowed to contain a / but can contain a \.
Even if the spec is uninterested in imposing character restrictions, I think it should at the least draw more attention to file name related portability issues.