Extension Parameterisation (for comment, DO NOT merge!) #3

Closed
wants to merge 30 commits into from
30 commits
3864050
Update README.md to include parameters
neilsjefferies Jan 22, 2020
a96b873
Update README.md
neilsjefferies Feb 4, 2020
bda6212
Update README.md
neilsjefferies Feb 18, 2020
2413950
Update README.md
neilsjefferies Feb 18, 2020
0272098
Update README.md
neilsjefferies Feb 18, 2020
721203e
Update README.md
neilsjefferies Feb 18, 2020
51e428b
Update 0000-example-extension.md
neilsjefferies Feb 18, 2020
643f95e
Update 0001-digest-algorithms.md
neilsjefferies Feb 18, 2020
efae56a
Update README.md
neilsjefferies Feb 18, 2020
6f949c3
Update 0001-digest-algorithms.md
neilsjefferies Feb 18, 2020
bd57205
Update 0000-example-extension.md
neilsjefferies Feb 18, 2020
fa6b47c
Create 0002-N-tuple-tree.md
neilsjefferies Feb 18, 2020
06a11cc
Update 0002-N-tuple-tree.md
neilsjefferies Feb 18, 2020
9b0b183
Update 0002-N-tuple-tree.md
neilsjefferies Feb 18, 2020
cf29454
Update 0002-N-tuple-tree.md
neilsjefferies Feb 18, 2020
9d15c12
Update README.md
neilsjefferies Feb 18, 2020
be25e2c
Update README.md
neilsjefferies Feb 18, 2020
770e6de
Update README.md
neilsjefferies Feb 18, 2020
14e16c0
Update README.md
neilsjefferies Feb 18, 2020
b6e615d
Update README.md
neilsjefferies Feb 18, 2020
96a6713
Update 0000-example-extension.md
neilsjefferies Feb 18, 2020
e83bfbb
Update 0001-digest-algorithms.md
neilsjefferies Feb 18, 2020
f7ad427
Update 0002-N-tuple-tree.md
neilsjefferies Feb 18, 2020
fd2f044
Update README.md
neilsjefferies Feb 21, 2020
9849d2a
Update 0002-N-tuple-tree.md
neilsjefferies Feb 21, 2020
7df3954
Update 0002-N-tuple-tree.md
neilsjefferies Feb 25, 2020
44dbe28
Update 0002-N-tuple-tree.md
neilsjefferies Feb 25, 2020
be4f6ee
Patch 1 (#1)
neilsjefferies Feb 26, 2020
e24187b
Update README.md
neilsjefferies Apr 21, 2020
7f7d964
Update README.md
neilsjefferies Apr 21, 2020
37 changes: 36 additions & 1 deletion README.md
Expand Up @@ -10,4 +10,39 @@ See also [pending pull requests](https://github.com/OCFL/extensions/pulls) for e

Community extensions should be written as GitHub flavored markdown in the `docs` directory of this repository. They should be numbered sequentially using a 4-digit, zero-padded prefix; should use hyphens to separate words; and have the `.md` extension.

An example/template is available in this repository as [OCFL Community Extension](docs/0000-example-extension) and is rendered via GitHub pages as https://ocfl.github.io/extensions/0000-example-extension
An example/template is available in this repository as [OCFL Community Extension](docs/0000-example-extension) and is rendered
via GitHub pages as https://ocfl.github.io/extensions/0000-example-extension

## Extension Parameters

For efficiency, a single extension definition is likely to cover a number of variants. Therefore, when an
extension is referenced, it may be accompanied by a number of parameters that specify the particular variant in use. This
provides more effective documentation of an OCFL structure and also allows the implementation of generic extension code that
covers a wider variety of use cases. Parameters MUST be single valued. For each parameter the following properties should
be defined:

* Name: A short name for the parameter. Since this has the potential to be used as part of programmatic access, the name MUST NOT contain control characters and SHOULD be shorter than 127 characters. The length limit is based on a survey of the defaults for various JSON parsers.
* Description: A brief description of the function of the parameter. This should be expanded in the main description of the
extension which MUST reference all the parameters.
* Type: Data type for the parameter. In order to allow validation and limit the scope for implementation-specific variations, parameters are typed.
  * integer - may be signed or not as specified in the range parameter.
  * string - aligned with JSON strings, these should be UTF-8 encoded and avoid control characters.
  * enumerated - one of an ordered set of labels which MUST conform to the same limitations as parameter names. No specific values are associated with a label other than its ordinality in the set, which is zero-based.
* Range: For each parameter type, a range must be specified that limits the values that a parameter may take.
  * For integer parameters, the range specifies minimum and maximum values, separated by a comma, which MUST be integers themselves.
  * For string parameters, the range specifies the maximum length of the string as an integer number of characters, not bytes. Again, based on a survey of parsers, try to keep strings shorter than 4095 characters.
  * For enumerated parameters, the range is a comma-separated ordered list of valid labels. Enumerated parameters are case sensitive. A boolean value is a special case of an enumerated type with the values: {FALSE, TRUE}.
* Default: Default value for the parameter, which MUST be consistent with the range limitations. If this is left blank then the parameter is mandatory.
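
The type and range rules above lend themselves to mechanical checking. The following is a minimal Python sketch, not part of the proposal: the dictionary layout of a parameter definition and the function names `validate_parameter` and `enumerated_ordinal` are assumptions made for illustration.

```python
def validate_parameter(spec, value):
    """Check a value against the 'type' and 'range' properties of a parameter.

    `spec` is a hypothetical dict with 'type' and 'range' keys holding the
    strings used in an extension's parameter list.
    """
    kind, value_range = spec["type"], spec["range"]
    if kind == "integer":
        low, high = (int(part) for part in value_range.split(","))
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool) and low <= value <= high
    if kind == "string":
        # len() counts characters (code points), not bytes.
        return isinstance(value, str) and len(value) <= int(value_range)
    if kind == "enumerated":
        # Labels are case sensitive, so a plain membership test is enough.
        return value in value_range.split(",")
    raise ValueError(f"unknown parameter type: {kind}")


def enumerated_ordinal(spec, label):
    """Return the zero-based position of a label in an enumerated range."""
    return spec["range"].split(",").index(label)


boolean = {"type": "enumerated", "range": "FALSE,TRUE"}
print(validate_parameter(boolean, "TRUE"), enumerated_ordinal(boolean, "TRUE"))
```

Treating a boolean as a two-label enumeration means it needs no special handling in the validator.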

## Referencing Parameters

Wherever a parameterised extension is referenced, include any parameters in an accompanying JSON file. If using an extensions directory, the JSON file MUST be named for the extension and included in the directory. For example, the example extension above would have an accompanying file *0000-example-extension.json* which might contain:

    {
      "0000-example-extension.md": {
        "first example parameter": "12",
        "second example parameter": "Hello",
        "third example parameter": "Green"
      }
    }

If, instead, the extension is referenced in a JSON file (e.g. additional digests in a manifest) then these can be included immediately following the reference. However, there MUST only be one parameter set defined for each extension.
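
As a rough illustration of how a client might consume such a file, the sketch below reads a parameter set and fills in defaults. It assumes the file is a single JSON object keyed by the extension file name, that values are JSON strings as in the example, and that a blank default is represented as `None`; the names `SPECS` and `resolve_parameters` are hypothetical.

```python
import json

# Hypothetical in-memory form of the example extension's parameter list.
SPECS = {
    "first example parameter": {"type": "integer", "default": None},
    "second example parameter": {"type": "string", "default": "Not applicable"},
    "third example parameter": {"type": "enumerated", "default": "Green"},
}


def resolve_parameters(json_text, extension_key, specs):
    """Merge supplied parameters with defaults; a blank default means mandatory."""
    supplied = json.loads(json_text).get(extension_key, {})
    resolved = {}
    for name, spec in specs.items():
        if name in supplied:
            raw = supplied[name]
        elif spec["default"] is not None:
            raw = spec["default"]
        else:
            raise KeyError(f"mandatory parameter missing: {name}")
        # Integer values arrive as strings in the example file, so convert them.
        resolved[name] = int(raw) if spec["type"] == "integer" else raw
    return resolved


example = '{"0000-example-extension.md": {"first example parameter": "12"}}'
print(resolve_parameters(example, "0000-example-extension.md", SPECS))
```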
18 changes: 18 additions & 0 deletions docs/0000-example-extension.md
Expand Up @@ -11,6 +11,24 @@

This extension is but an example and has no content, but if it did the content would be summarized in a paragraph here.

## Parameters

* name: first example parameter
  * description: a mandatory 8 bit unsigned value
  * type: integer
  * range: 0,255
  * default:
* name: second example parameter
  * description: an optional 64 character long string, defaulting to "Not applicable", if omitted.
  * type: string
  * range: 64
  * default: "Not applicable"
* name: third example parameter
  * description: An example enumerated parameter
  * type: enumerated
  * range: Red,Yellow,Orange,Green,Blue,Indigo,Violet
  * default: Green

## Other Sections Providing Details, Examples, etc.

... more in here ...
25 changes: 25 additions & 0 deletions docs/0001-digest-algorithms.md
Expand Up @@ -9,8 +9,33 @@

This extension is an index of additional digest algorithms. It provides a controlled vocabulary of digest algorithm names that may be used to indicate the given algorithm in `fixity` blocks of OCFL Objects, and links their defining extensions.

## Parameters

* name: blake2b-160
  * description: Indicates if this algorithm is used
  * type: enumerated
  * range: FALSE,TRUE
  * default: FALSE
* name: blake2b-256
  * description: Indicates if this algorithm is used
  * type: enumerated
  * range: FALSE,TRUE
  * default: FALSE
* name: blake2b-384
  * description: Indicates if this algorithm is used
  * type: enumerated
  * range: FALSE,TRUE
  * default: FALSE
* name: sha512/256
  * description: Indicates if this algorithm is used
  * type: enumerated
  * range: FALSE,TRUE
  * default: FALSE

## Digest Algorithms Defined in Community Extensions

Each parameter corresponds to a Digest Algorithm Name used in the table below, and indicates if this algorithm is in use as a consequence of including this extension. As the parameters default to FALSE, only the algorithms used need to be listed in *0001-digest-algorithms.json*.

| Digest Algorithm Name | Note |
| --------------------- | ---- |
| `blake2b-160` | BLAKE2 digest using the 2B variant (64 bit) with size 160 bits as defined by [RFC7693](https://tools.ietf.org/html/rfc7693). MUST be encoded using hex (base16) encoding [RFC4648](https://tools.ietf.org/html/rfc4648). For example, the `blake2b-160` digest of a zero-length bitstream is `3345524abf6bbe1809449224b5972c41790b6cf2` (40 hex digits long). |
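
One way to check the zero-length example quoted for `blake2b-160` is to compute it with Python's standard `hashlib`, where a 160-bit digest corresponds to `digest_size=20` bytes. This is only a sketch; the printed value can be compared by eye against the table entry.

```python
import hashlib

# BLAKE2b with a 20-byte (160-bit) digest over a zero-length bitstream,
# rendered as lowercase hex (base16).
digest = hashlib.blake2b(b"", digest_size=20).hexdigest()
print(digest, len(digest))
```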
180 changes: 180 additions & 0 deletions docs/0002-N-tuple-tree.md
@@ -0,0 +1,180 @@
# OCFL Community Extension 0002: N-tuple Trees for OCFL Storage Hierarchies

* Authors: Neil Jefferies
* Minimum OCFL Version: 1.0
* Obsoletes: n/a
* Obsoleted by: n/a

## Overview

This extension provides a general mechanism for describing a set of algorithms for mapping numerical identifiers onto a tree structure for constructing an OCFL Storage Hierarchy in a way that maintains performance. The aim is to balance the problem of having too many files or subdirectories in any one directory with having an overly deeply nested set of subdirectories. As both identifier formats and specific filesystem instances may vary, there is not a single optimal approach.

## Parameters

* name: identifier length
  * description: The number of meaningful characters in the base identifier
  * type: integer
  * range: 0,4096
  * default:
* name: case mapping
  * description: Indicates how case in the source identifier should be handled when mapping to a path
  * type: enumerated
  * range: ToUpper,ToLower,Literal
  * default:
* name: invert mapping
  * description: Indicate if mapping should begin at the rightmost (least significant) character of the stripped identifier
  * type: enumerated
  * range: FALSE,TRUE
  * default: FALSE
* name: tuple size
  * description: Indicates the size of the chunks (in characters) that the identifier is split into during mapping
  * type: integer
  * range: 1,32
  * default: 2
* name: number of tuples
  * description: Indicates how many chunks are used for path generation
  * type: integer
  * range: 0,32
  * default:
Review comment (Member): Giving a default here may encourage consistency of approach. For example, a "tuple size" of "2" (hex characters) and a "number of tuples" of "4" provides for four billion options. That could be a reasonable default starting point.

Reply (Member Author): Default is to "pairtree" behaviour but I'm open to change.

* name: short object root
Review comment (Member): There are obviously other potential options for naming the object root. We should be open to the possibility of community feedback on that point.

Reply (Member Author): Those can be other extensions. This is an extension based on generalising pairtree.

  * description: Indicates how the OCFL object root directory name should be generated from the identifier
  * type: enumerated
  * range: FALSE,TRUE
  * default: FALSE

## Detailed explanation

The approach described here is a generalisation of the [PairTree](https://tools.ietf.org/html/draft-kunze-pairtree-01) algorithm designed to be more flexible as file storage technologies have developed. Conventional filesystems are generally better able to handle large numbers of files in a directory and object stores tend to favour much flatter storage hierarchies. In short, the approach is to derive a unique file path for an OCFL object from its unique identifier in a programmatic and repeatable manner.

### identifier length

This extension assumes that object unique identifiers are all the same length and that all the characters of the identifier are reasonably well distributed (e.g. for a hexadecimal-based identifier, each character can be any value from 0-f). This may mean that it is prudent to adjust the format of the identifier before it can be safely used to generate a path. For example, a [UUID](https://tools.ietf.org/html/rfc4122) is typically written in the form "uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6" where the "uuid:" preamble and the hyphens are clearly non-unique elements. The "stripped" version used for path generation would be "f81d4fae7dec11d0a76500a0c91e6bf6". The **identifier length** parameter indicates the length of the stripped version of the identifier, in this example, 32.
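
The extension does not prescribe how stripping is done, so the following is only a sketch for the UUID case described above. The function name `strip_uuid` is hypothetical, and other identifier schemes would need their own rules.

```python
def strip_uuid(identifier):
    """Drop the 'uuid:' preamble and hyphens, which carry no uniqueness."""
    prefix = "uuid:"
    if identifier.startswith(prefix):
        identifier = identifier[len(prefix):]
    return identifier.replace("-", "")


stripped = strip_uuid("uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6")
print(stripped, len(stripped))
```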

### case mapping

Depending on how identifiers are generated they may not always be consistent in their usage of case. A UUID is actually a hexadecimal number, so both "f81d4fae7dec11d0a76500a0c91e6bf6" and "F81D4FAE7DEC11D0A76500A0C91E6BF6" would be valid renderings of the example given above. However, many storage systems are case sensitive so if we want to map identifiers to paths consistently we need to specify which case to map to. The **case mapping** parameter allows this to be specified but also allows for identifiers that are lexical and thus should not necessarily be case mapped. Note especially that upper/lower case mappings are often language/locale dependent for characters outside the basic \[A-Z\]\[a-z\] range and thus quite likely to be non-portable.
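
A minimal sketch of the three **case mapping** values follows, using Python's built-in string methods. The function name is hypothetical. Python applies Unicode case rules rather than locale rules, which is one reason results outside the basic ASCII letters may differ between implementations.

```python
def apply_case_mapping(identifier, case_mapping):
    """Apply the 'case mapping' parameter to a stripped identifier."""
    if case_mapping == "ToUpper":
        return identifier.upper()
    if case_mapping == "ToLower":
        return identifier.lower()
    if case_mapping == "Literal":
        return identifier  # lexical identifiers are left untouched
    raise ValueError(f"unknown case mapping: {case_mapping}")


print(apply_case_mapping("F81D4FAE7DEC11D0A76500A0C91E6BF6", "ToLower"))
# Outside ASCII, mapping can even change the length of the string:
print(len("straße"), len(apply_case_mapping("straße", "ToUpper")))
```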

### invert mapping

Some identifiers are generated sequentially. In this case the N-tuple tree approach to generating paths does not generate a well distributed tree if we begin at the leftmost (most significant) end of the stripped identifier. Instead, we end up with a small number of fully populated individual branches which can be inefficient for larger numbers of objects. In order to avoid this, the **invert mapping** option indicates that mapping should start from the rightmost (least significant) end of the identifier, which changes most rapidly and therefore distributes more effectively. In the example, "f81d4fae7dec11d0a76500a0c91e6bf6" would be flipped to "6fb6e19c0a00567a0d11ced7eaf4d18f" before path conversion.

Note that the mapping, and its inverse, operates at a character level and not bit-wise. Many identifier schemes are designed to be opaque and thus have pseudo-random characteristics so the mapping defaults to the more intuitive most-significant to least-significant approach.
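
Character-level inversion is a plain string reversal. The sketch below also illustrates the distribution argument with 256 made-up sequential identifiers; the 12-character zero-padded hex format is an assumption chosen for the illustration.

```python
def invert_identifier(stripped):
    """Reverse the stripped identifier character by character."""
    return stripped[::-1]


print(invert_identifier("f81d4fae7dec11d0a76500a0c91e6bf6"))

# Hypothetical sequential identifiers: 12 hex characters, zero padded.
sequential = [f"{n:012x}" for n in range(256)]
leading = {ident[:2] for ident in sequential}
leading_inverted = {invert_identifier(ident)[:2] for ident in sequential}
# Without inversion every identifier shares one top-level directory;
# with inversion they spread over many.
print(len(leading), len(leading_inverted))
```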

### tuple size

Indicates the size of the chunks that the identifier is split into during path generation. The optimal chunk size depends on a number of factors:
* The number of values that each character in the identifier can have. For example, UUIDs are hexadecimal based so each character may be in the range \[0-9,a-f\] giving 16 different values whereas an alphanumeric identifier might have the range \[0-9,a-z,A-Z\] giving 62 values.
* The characteristics of the underlying storage and associated code libraries. Although not the case in the past, modern storage systems can generally handle tens of thousands of files in a directory without difficulty. It is more likely that the code libraries and tools used to access and parse these systems will encounter some performance limitations when handling large numbers of files. In particular, Linux command-line wildcard expansions are typically limited to just under 128K characters which equates to around 4000 directory names if they have 32 characters.
* Human readability is also reduced for long lists of files, which may make recovery *in extremis* more difficult.

For a tuple size of 3, our example UUID of "f81d4fae7dec11d0a76500a0c91e6bf6" would be split up into a path beginning "/f81/d4f/ae7/...".
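
Splitting into tuples can be sketched as fixed-width slicing. The function name is hypothetical, and how a short final chunk should be treated is not stated in the text, so this sketch simply returns it as-is.

```python
def split_into_tuples(identifier, tuple_size):
    """Cut the identifier into consecutive chunks of tuple_size characters."""
    return [identifier[i:i + tuple_size] for i in range(0, len(identifier), tuple_size)]


chunks = split_into_tuples("f81d4fae7dec11d0a76500a0c91e6bf6", 3)
print("/" + "/".join(chunks[:3]) + "/...")
```

With 32 characters and a tuple size of 3, the last chunk holds only two characters, which is one reason to limit how many tuples are actually used for the path.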

### number of tuples

In practice, you may wish to limit the depth of the OCFL Storage Hierarchy tree to avoid overly deep nesting of directories. The **number of tuples** determines this depth. Splitting the entire identifier down into tuples may not be necessary since the number of objects to be stored is often much less than all the possible values for the source identifier.

For example, if we split the example UUID into size 3 tuples, each directory can contain 4096 subdirectories so, with the number of tuples also set to three, the resulting tree would have 4096^3 (=68719476736) directories, each of which could contain one or more objects. All the objects with UUIDs that begin f81d4fae7... would be stored in OCFL object roots in the directory /f81/d4f/ae7/. However, if the UUIDs are reasonably pseudo-randomly distributed, the likelihood of many object identifiers sharing even the first 9 characters is quite low until a significant number of objects have been created.
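
The directory counts above follow directly from the alphabet size, tuple size and number of tuples. A short sketch of the arithmetic, with a hypothetical function name:

```python
def leaf_directory_count(alphabet_size, tuple_size, number_of_tuples):
    """Number of lowest-level directories the tree can address."""
    per_level = alphabet_size ** tuple_size  # subdirectories per directory
    return per_level ** number_of_tuples


print(16 ** 3)                         # subdirectories per level for hex triples
print(leaf_directory_count(16, 3, 3))  # hex triples, three levels
print(leaf_directory_count(16, 2, 4))  # hex pairs, four levels
```

Dividing the expected number of objects by this count gives a rough idea of how many object roots end up in each lowest-level directory, assuming identifiers are evenly distributed.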

### short object root

Once a Storage Hierarchy has been created, the name of each OCFL Object Root directory should also be determined from the identifier in a consistent manner. The default approach is to use the full (stripped) identifier but, if identifiers are long or there is the need to keep Storage Hierarchy paths short because of object complexity, there is the option to just use the portion of the identifier that remains after Storage Hierarchy path generation. To continue the UUID example the full OCFL Object Root path could be:
* **short object root = FALSE** /f81/d4f/ae7/f81d4fae7dec11d0a76500a0c91e6bf6/
* **short object root = TRUE** /f81/d4f/ae7/dec11d0a76500a0c91e6bf6/
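
The two naming options can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the function name is hypothetical, the identifier is taken to be already stripped and case mapped, and the interaction with **invert mapping** is left out because the text does not spell it out.

```python
def object_root_path(identifier, tuple_size, number_of_tuples, short_object_root):
    """Build the OCFL Object Root path for an already stripped identifier."""
    used = tuple_size * number_of_tuples  # characters consumed by the hierarchy
    tuples = [identifier[i:i + tuple_size] for i in range(0, used, tuple_size)]
    # Short form keeps only the characters not already used in the path.
    name = identifier[used:] if short_object_root else identifier
    return "/" + "/".join(tuples + [name]) + "/"


uuid = "f81d4fae7dec11d0a76500a0c91e6bf6"
print(object_root_path(uuid, 3, 3, False))
print(object_root_path(uuid, 3, 3, True))
```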

## Examples

These examples are taken from the OCFL Implementation notes:

* *Flat*: Each object is contained in a directory with a name that is simply derived from the unique identifier of the object.
  * identifier length = 12
  * case mapping = ToLower
  * invert mapping = FALSE
  * tuple size = 12
  * number of tuples = 0
  * short object root = FALSE

    [storage_root]
        ├── 0=ocfl_1.0
        ├── ocfl_1.0.html (optional copy of the OCFL specification)
        ├── d45be626e024
        |   ├── 0=ocfl_object_1.0
        |   ├── inventory.json
        |   ├── inventory.json.sha512
        |   └── v1...
        ├── d45be626e036
        |   ├── 0=ocfl_object_1.0
        |   ├── inventory.json
        |   ├── inventory.json.sha512
        |   └── v1...
        ├── 3104edf0363a
        |   ├── 0=ocfl_object_1.0
        |   ├── inventory.json
        |   ├── inventory.json.sha512
        |   └── v1...
        └── ...


* PairTree: [PairTree] is designed to overcome the limitations on the number of files in a directory that most file systems have. It creates a hierarchy of directories by mapping identifier strings to directory paths two characters at a time. For numerical identifiers specified in hexadecimal this means that there are a maximum of 256 items in any directory, which is well within the capacity of any modern filesystem. However, for long identifiers, pairtree creates a large number of directories which will be sparsely populated unless the number of objects is very large. Traversing all these directories during validation or rebuilding operations can be slow.
  * identifier length = 12
  * case mapping = ToLower
  * invert mapping = FALSE
  * tuple size = 2
  * number of tuples = 6
  * short object root = FALSE

    [storage_root]
        ├── 0=ocfl_1.0
        ├── ocfl_1.0.html (optional copy of the OCFL specification)
        ├── d4
        |   └── 5b
        |       └── e6
        |           └── 26
        |               └── e0
        |                   ├── 24
        |                   |   └── d45be626e024
        |                   |       ├── 0=ocfl_object_1.0
        |                   |       └── ...
        |                   └── 36
        |                       └── d45be626e036
        |                           ├── 0=ocfl_object_1.0
        |                           └── ...
        ├── 31
        |   └── 04
        |       └── ed
        |           └── f0
        |               └── 36
        |                   └── 3a
        |                       └── 3104edf0363a
        |                           ├── 0=ocfl_object_1.0
        |                           └── ...
        └── ...


* Truncated n-tuple Tree: This approach aims to achieve some of the scalability benefits of PairTree whilst limiting the depth of the resulting directory hierarchy. To achieve this, the source identifier can be split at a higher level of granularity, and only a limited number of the identifier digits are used to generate directory paths. For example, using triples and three levels with the example above yields:
  * identifier length = 12
  * case mapping = ToLower
  * invert mapping = FALSE
  * tuple size = 3
  * number of tuples = 3
  * short object root = FALSE

    [storage_root]
        ├── 0=ocfl_1.0
        ├── ocfl_1.0.html (optional copy of the OCFL specification)
        ├── d45
        |   └── be6
        |       └── 26e
        |           ├── d45be626e024
        |           |   ├── 0=ocfl_object_1.0
        |           |   └── ...
        |           └── d45be626e036
        |               ├── 0=ocfl_object_1.0
        |               └── ...
        ├── 310
        |   └── 4ed
        |       └── f03
        |           └── 3104edf0363a
        |               ├── 0=ocfl_object_1.0
        |               └── ...
        └── ...
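
To see how one piece of generic code might cover all three layouts, the sketch below derives the object root location (relative to the storage root) for one identifier under each parameter set. The dictionary layout and the function name `storage_path` are assumptions for illustration. **invert mapping** and **short object root** are omitted because all three examples leave them FALSE.

```python
EXAMPLES = {
    "Flat": {"case mapping": "ToLower", "tuple size": 12, "number of tuples": 0},
    "PairTree": {"case mapping": "ToLower", "tuple size": 2, "number of tuples": 6},
    "Truncated n-tuple Tree": {"case mapping": "ToLower", "tuple size": 3, "number of tuples": 3},
}


def storage_path(identifier, params):
    """Return the object root location relative to the storage root."""
    if params["case mapping"] == "ToLower":
        identifier = identifier.lower()
    elif params["case mapping"] == "ToUpper":
        identifier = identifier.upper()
    size, count = params["tuple size"], params["number of tuples"]
    # Only the first `count` tuples contribute directory levels.
    tuples = [identifier[i * size:(i + 1) * size] for i in range(count)]
    return "/".join(tuples + [identifier])


for name, params in EXAMPLES.items():
    print(f"{name}: {storage_path('D45BE626E024', params)}")
```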