Characters allowed in paths #407

Closed
pwinckles opened this issue Nov 8, 2019 · 12 comments
@pwinckles

The spec does not prohibit the use of any characters in paths (logical and content) and states:

Except for location within the appropriate version directory, v1/content in this example, the OCFL specification does not constrain the choice of content paths used when creating or updating an OCFL object. The choice might depend on particular limitations of, or optimizations for, the target storage system, or on portability considerations. Any compliant implementation will be able to recover version state with the original logical paths.

The only requirement is that a / is used as the path separator:

The forward slash (/) path separator MUST be used in content and logical paths in the manifest, fixity, and state blocks within the inventory. Implementations that target systems using other separators will need to translate paths appropriately.

This leaves the door open to significant portability problems. Take this inventory as an example:

{
  "id" : "o1",
  "type" : "https://ocfl.io/1.0/spec/#inventory",
  "digestAlgorithm" : "sha512",
  "head" : "v1",
  "manifest" : {
    "96a2...c39e" : [ "v1/content/path\\with\\backslashes" ],
    "5j1h...6j3m" : [ "v1/content/path/with/backslashes" ]
  },
  "versions" : {
    "v1" : {
      "created" : "2019-08-05T15:57:53Z",
      "state" : {
        "96a2...c39e" : [ "path\\with\\backslashes" ],
        "5j1h...6j3m" : [ "path/with/backslashes" ]
      }
    }
  }
}

On a Linux system that inventory will be interpreted correctly, but on Windows, where both \ and / act as path separators, those two paths are identical! I wonder what would happen if that object was copied to Windows...
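To make the collision concrete, here is a minimal sketch using Python's `pathlib` pure-path classes, which model POSIX and Windows path parsing without touching a filesystem. The helper name `paths_collide` is made up for this illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The two content paths from the example inventory above.
BACKSLASH_PATH = "v1/content/path\\with\\backslashes"
SLASH_PATH = "v1/content/path/with/backslashes"


def paths_collide(path_a: str, path_b: str, flavour) -> bool:
    """Return True if both strings name the same file under the given path flavour."""
    return flavour(path_a) == flavour(path_b)


# POSIX parsing keeps the backslashes as ordinary filename characters,
# while Windows parsing treats them as separators.
posix_collision = paths_collide(BACKSLASH_PATH, SLASH_PATH, PurePosixPath)
windows_collision = paths_collide(BACKSLASH_PATH, SLASH_PATH, PureWindowsPath)
```

Under POSIX rules the two strings stay distinct; under Windows rules they compare equal, so one of the two files would overwrite the other when the object is written out.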

This problem also applies to the contentDirectory field, which is not allowed to contain a / but can contain a \.

Even if the spec is uninterested in imposing character restrictions, I think it should at the least draw more attention to file name related portability issues.

@zimeon
Contributor

zimeon commented Nov 8, 2019

I don't see a problem here, other than that on a Windows system you are going to have real problems recreating an original linux path that completely validly contains a backslash.

The example inventory above seems to be a valid representation of an OCFL object for two files:

.
├── path
│   └── with
│       └── backslashes
└── path\with\backslashes

Obviously the implementation isn't on Windows and hasn't cared about portability of the OCFL objects. Otherwise it could have normalized the internal content paths to avoid backslashes (the ones in manifest, which may be anything convenient within the correct v#/content/ dirs). The logical paths in state are fixed by the object one is trying to represent, unless normalized by some archival process and metadata description beyond the scope of OCFL.

@pwinckles
Author

I guess my point is that it's easy to create paths that aren't portable, and, if the spec doesn't want to restrict this, I think it should include a stronger recommendation to normalize content paths at the very least in the implementation notes and possibly in the spec itself. I quoted the only passing mention of file name portability in the first post, and the implementation notes only address modifying content paths to accommodate objects with a deep hierarchy.

@awoods added this to the 1.0 milestone Nov 12, 2019
@pwinckles
Author

pwinckles commented Nov 13, 2019

On the community call today it was asked that I share some of the research I've done recently regarding path/filename restrictions on various systems. It's far from exhaustive, but below are some of the resources I've been referencing.

For myself, I am currently pursuing mapping logical paths directly to content paths, and returning errors if the paths cannot be mapped because they do not meet a configurable set of constraints. This is based on the assumption that users may want the paths within a version's content directory to be identical to their logical paths for preservation reasons and may not care if the paths are Windows or cloud safe. That said, this expectation immediately breaks down in the case of a rename operation.

Alternatively, users could enable sanitizers that attempt to keep the content path as close to the logical path as possible, but freely modify the content path to make it safe for storage. If you truly don't care about content paths, you could just store files using their content-addressable digest in a truncated n-tuple layout within the content directory, which would make objects extremely difficult to interpret without their inventory. That said, even with a direct logical path mapping, an OCFL object without its inventory file could be challenging to interpret a few versions in.
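As a rough sketch of the digest-based option, the function below derives a content path from a file's digest using a truncated n-tuple layout. The function name, the tuple size, and the number of tuples are assumptions chosen for illustration; nothing in the spec prescribes them.

```python
import hashlib


def ntuple_content_path(digest: str, tuple_size: int = 3, num_tuples: int = 3) -> str:
    """Build a path like 'abc/def/012/<full digest>' relative to the content directory.

    Only the first tuple_size * num_tuples characters become directory names
    (the "truncated" part); the full digest is used as the filename.
    """
    needed = tuple_size * num_tuples
    if len(digest) < needed:
        raise ValueError(f"digest must have at least {needed} characters")
    tuples = [digest[i:i + tuple_size] for i in range(0, needed, tuple_size)]
    return "/".join(tuples + [digest])


# Example: derive a content path from the SHA-512 digest of some bytes.
example_digest = hashlib.sha512(b"hello").hexdigest()
example_path = ntuple_content_path(example_digest)
```

Because hex digests contain only 0-9 and a-f, paths built this way sidestep every character restriction listed below, at the cost of being unreadable without the inventory.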

General

Regardless of filesystem limitations, there may be strings that should not be allowed in paths. The only examples of this that I can think of are the path elements . and .. (current and parent directory).

Filesystems

This table provides a good overview of the constraints of various filesystems.

I haven't looked at all of the filesystems in detail, but, from what I'm seeing, the following are some general characterizations.

Linux

  • No character restrictions other than the NUL character (ASCII 0) and, within a single filename, the / separator
    • Problematic characters discussed at length here
  • Max filename length: 255 bytes

Windows

  • Official doc
  • Illegal chars: <, >, :, ", /, \, |, ?, *, and ASCII chars 0-31
  • A list of reserved words, including those words with file extensions
  • Cannot end a filename with a space or period
  • Max filename length: 255 chars
  • Max path length: ~32,767 chars
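The Windows rules above could be checked roughly as follows. This is a sketch rather than a complete validator: the function name is hypothetical, the character set and reserved-name list follow Microsoft's file naming documentation (which also lists the double quote), and it only examines a single filename, not a whole path.

```python
import re

# Characters Windows does not allow in a filename, plus ASCII control chars 0-31.
_ILLEGAL_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

# Reserved device names; they stay reserved when followed by an extension.
_RESERVED_NAMES = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}

MAX_FILENAME_CHARS = 255


def windows_filename_problems(name: str) -> list[str]:
    """Return a list of reasons why `name` is not a safe Windows filename."""
    problems = []
    if _ILLEGAL_CHARS.search(name):
        problems.append("illegal character")
    # "NUL.txt" and "nul.tar.gz" are both treated as the reserved name NUL.
    if name.split(".", 1)[0].upper() in _RESERVED_NAMES:
        problems.append("reserved name")
    if name.endswith((" ", ".")):
        problems.append("ends with space or period")
    if len(name) > MAX_FILENAME_CHARS:
        problems.append("too long")
    return problems
```

Returning a list of problems instead of a boolean fits the "configurable set of constraints" idea: a caller can decide which problems are errors and which are only warnings.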

Cloud Object Stores

Amazon S3

  • Official doc
  • Max key length: 1024 bytes (UTF-8)
  • Nominally there are no character restrictions. However, there are reports of AWS SDK implementations not handling certain special characters gracefully. This post has a more exhaustive list of problematic characters, but the source of this information is unclear.

Azure Blob Storage

  • Official doc
  • Max key length: 1024 chars
    • Keys are URL encoded, and it's unclear if this restriction is applied to the encoded or unencoded value.
  • Max filename length: 254 chars
  • There are reports of a number of additional, undocumented restrictions, such as no ASCII control characters and automatically converting \ into /

Google Cloud Storage

  • Official doc
  • Max key length: 1024 bytes (UTF-8)
  • Key cannot contain \r or \n
  • Key cannot be . or ..
  • Key cannot start with .well-known/acme-challenge
  • There are a number of additional restrictions listed for characters that, while not prohibited, cause problems for their tooling
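For the object stores, the key point is that the S3 and Google Cloud Storage limits are counted in UTF-8 bytes, not characters. The sketch below combines the documented limits listed above into one hypothetical check; it ignores the Azure limits (which are stated in characters) and all of the undocumented or tooling-related restrictions.

```python
MAX_KEY_BYTES = 1024


def object_key_problems(key: str) -> list[str]:
    """Return reasons why `key` may be rejected by S3 or Google Cloud Storage."""
    problems = []
    # The limit applies to the encoded form, so non-ASCII text uses it up faster.
    if len(key.encode("utf-8")) > MAX_KEY_BYTES:
        problems.append("longer than 1024 bytes when UTF-8 encoded")
    if "\r" in key or "\n" in key:
        problems.append("contains carriage return or line feed")
    if key in (".", ".."):
        problems.append("key is . or ..")
    return problems
```

For example, a key made of 600 copies of "é" is only 600 characters long but encodes to 1200 bytes, so it fails the length check even though a character count would pass it.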

@pwinckles
Author

pwinckles commented Nov 15, 2019

I'm still wading through path related edge cases, which has raised some additional questions around logical paths.

All of the logical paths in the spec's examples are relative. However, as far as I can tell, the spec does not require this. Can logical paths be absolute? More generally, should logical paths be treated as entirely opaque, more as keys than as paths? This is the route that I'm currently pursuing, but I think it's worth clarifying. Are the following logical paths the same or different?

/file.txt vs file.txt vs ./file.txt vs sub/dir/../../file.txt vs //file.txt

@zimeon
Contributor

zimeon commented Nov 15, 2019

Thanks @pwinckles -- I think you make very good points about . , .. and // -- the first two are not defined to have any meaning in OCFL and the latter is perhaps unclear. I think we should probably:

  1. be explicit that both logical and content paths are relative, they MUST NOT start with /
  2. outlaw . and .. as path elements (and hence no special meaning needs to be defined or denied, and paths starting with ./ are thus also not allowed)
  3. outlaw empty path elements or repeated // (so any OCFL creating code should s#(/+)#/#)
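A minimal sketch of the three proposed rules, plus a Python equivalent of the s#(/+)#/# substitution. The function names are invented for this example, and the rules are the ones proposed in this comment, not quoted from the spec.

```python
import re


def collapse_slashes(path: str) -> str:
    """Python equivalent of s#(/+)#/#g: squeeze runs of / into a single /."""
    return re.sub(r"/+", "/", path)


def logical_path_problems(path: str) -> list[str]:
    """Check a logical or content path against the three proposed rules."""
    problems = []
    is_absolute = path.startswith("/")
    if is_absolute:
        problems.append("starts with /")
    elements = path.split("/")
    if any(element in (".", "..") for element in elements):
        problems.append(". or .. element")
    # A leading / produces one empty first element; skip it so that
    # rule 1 and rule 3 are not both reported for the same slash.
    rest = elements[1:] if is_absolute else elements
    if any(element == "" for element in rest):
        problems.append("empty path element")
    return problems
```

Applied to the five candidates from the earlier comment, only file.txt passes: /file.txt and //file.txt fail rule 1 (the latter also rule 3), and ./file.txt and sub/dir/../../file.txt fail rule 2.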

@ahankinson
Contributor

The intention was that the logical paths represented a 'virtual view' of the object at a given point in time, and in that respect the paths "../" and "/" don't mean anything, since the 'Object Root' is the root. (I guess you could think of the logical paths being chrooted to a virtual Object Root?)

However, I do think certain leading characters should not be allowed, simply because banning them makes it much harder for naive clients to act on potentially damaging paths. Imagine if a quick-and-dirty client saw a logical path of /etc/hosts and just charged on ahead. So we should add a line that says logical paths MUST NOT begin with ., .., /, or \ -- any others?
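The /etc/hosts scenario is easy to reproduce. In the sketch below, a hypothetical client exports logical paths under a made-up root of /srv/export (assumed to be absolute); the naive version simply joins the two strings, while the second version refuses any result that lands outside the root.

```python
import posixpath


def naive_export_path(export_root: str, logical_path: str) -> str:
    """What a quick-and-dirty client might do: join and charge ahead."""
    return posixpath.join(export_root, logical_path)


def contained_export_path(export_root: str, logical_path: str) -> str:
    """Join, normalize, and reject anything that escapes export_root."""
    root = posixpath.normpath(export_root)
    candidate = posixpath.normpath(posixpath.join(root, logical_path))
    if posixpath.commonpath([root, candidate]) != root:
        raise ValueError(f"logical path escapes the export root: {logical_path!r}")
    return candidate


# posixpath.join discards the root when the second argument is absolute.
escaped = naive_export_path("/srv/export", "/etc/hosts")
```

Here `escaped` is "/etc/hosts", not a path under /srv/export. Note that the containment check only guards against escapes: a path such as a/../b is still accepted and silently normalized, which is a separate question from whether . and .. should be allowed at all.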

@ahankinson
Contributor

ahankinson commented Nov 15, 2019

The Filename Wikipedia article has a good collection of banned characters and problem filenames (who knew that $Extend could cause problems on NTFS? not me!)

Also, something to add to your list above @pwinckles. I mentioned this on the call.

Filesystem character normalization

Not all filesystems or storage systems treat non-ASCII characters the same way.
https://cloud.google.com/storage/docs/gsutil/addlhelp/Filenameencodingandinteroperabilityproblems

http://www.i18nguy.com/unicode/filename-issues-iuc33.pdf

This may be a problem when translating file paths from a tool to JSON (UTF-8) and back again. It may also be a problem when transferring files to a cloud provider if you write the manifest on your local machine and assume that the paths on the cloud service will be the same. For 99.999% of uses this is probably not a problem, but it would be a problem if a file gets effectively 'lost' because the system cannot address it by path.
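One way to see the risk is to compare composed (NFC) and decomposed (NFD) spellings of the same name, which look identical but are different strings. The sketch below uses Python's `unicodedata` module to find paths that would collapse into one if a storage system normalized them; the function name is made up, and which normalization form a given system applies (if any) is an assumption to verify per system.

```python
import unicodedata
from collections import defaultdict


def normalization_collisions(paths: list[str], form: str = "NFC") -> list[list[str]]:
    """Group distinct paths that become identical after Unicode normalization."""
    groups = defaultdict(list)
    for path in paths:
        groups[unicodedata.normalize(form, path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


composed = "caf\u00e9.txt"      # "é" as a single code point
decomposed = "cafe\u0301.txt"   # "e" followed by a combining acute accent

collisions = normalization_collisions([composed, decomposed, "plain.txt"])
```

The two spellings are different keys in an inventory, yet a normalizing filesystem would store them as a single file, which is exactly how a file could become unaddressable by the path recorded in the manifest.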

@pwinckles
Author

\\ and C:\-style paths also create problems on Windows, though I am less worried about them.

@pwinckles
Author

From the filename wiki:

In Unix-like systems, DOS, and Windows, the filenames "." and ".." have special meanings (current and parent directory respectively). Windows 95/98/ME also uses names like "...", "...." and so on to denote grandparent or great-grandparent directories.[18] All Windows versions forbid creation of filenames that consist of only dots, although names consist of three dots ("...") or more are legal in Unix.

But, I'd imagine if you're still running Windows 95/98/ME you have bigger problems to worry about than naive OCFL clients overwriting system files with .... paths.

@zimeon
Contributor

zimeon commented Nov 19, 2019

While we are on file name oddities, there is also the issue of various Mac filesystems (HFS and HFS+, I think) not being case sensitive (they preserve case but don't distinguish by it).
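A rough way to flag paths that would clash on a case-insensitive filesystem is to group them by a case-folded form. This sketch uses Python's `str.casefold()` as an approximation; real filesystems apply their own folding rules, so treat the result as a warning rather than an exact prediction. The function name is hypothetical.

```python
from collections import defaultdict


def case_collisions(paths: list[str]) -> list[list[str]]:
    """Group distinct paths that differ only by letter case."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return [group for group in groups.values() if len(group) > 1]


clashes = case_collisions(["README.txt", "readme.txt", "notes.txt"])
```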

@awoods
Member

awoods commented Mar 11, 2020

Resolved with #423 ?

@rosy1280
Contributor

Resolved with the flurry of activity from above.

@rosy1280 moved this to Done in OCFL 1.0 Aug 29, 2024