OCFL · neilsjefferies · Jan 22, 2020 · Feb 4, 2020 · Feb 18, 2020 · Feb 18, 2020
diff --git a/README.md b/README.md
@@ -10,4 +10,39 @@ See also [pending pull requests](https://github.com/OCFL/extensions/pulls) for e
 
 Community extensions should be written as GitHub flavored markdown in the `docs` directory of this repository. They should be numbered sequentially using a 4-digit, zero-padded prefix; should use hyphens to separate words; and have the `.md` extension.
 
-An example/template is available in this repository as [OCFL Community Extension](docs/0000-example-extension) and is rendered via GitHub pages as https://ocfl.github.io/extensions/0000-example-extension
+An example/template is available in this repository as [OCFL Community Extension](docs/0000-example-extension) and is rendered
+via GitHub pages as https://ocfl.github.io/extensions/0000-example-extension
+
+## Extension Parameters
+
+For efficiency, a single extension definition will often cover a number of variants. Therefore, when an
+extension is referenced, it may be accompanied by a number of parameters that specify the particular variant in use. This
+provides more effective documentation of an OCFL structure and also allows the implementation of generic extension code that
+covers a wider variety of use cases. Parameters MUST be single valued. For each parameter the following properties should
+be defined:
+
+* Name: A short name for the parameter. Since this has the potential to be used as part of programmatic access, the name MUST
+not contain control characters and SHOULD be shorter than 127 characters. The length limit is based on a survey of the defaults for various JSON parsers.
+* Description: A brief description of the function of the parameter. This should be expanded in the main description of the
+extension which MUST reference all the parameters.
+* Type: Data type for the parameter. In order to allow validation and limit the scope for implementation-specific variations,
+parameters are typed.
+  * integer - may be signed or not, as determined by the range specified for the parameter.
+  * string - aligned with JSON strings, these should be UTF-8 encoded and avoid control characters.
+  * enumerated - one of an ordered set of labels which MUST conform to the same limitations as parameter names. No specific values are associated with a label other than its ordinal position in the set, which is zero-based.
+* Range: For each parameter type a range must be specified that limits the values that a parameter may take.
+  * For integer parameters, the range specifies minimum and maximum values, separated by a comma, which MUST be integers themselves.
+  * For string parameters, the range specifies the maximum length of the string as an integer number of characters, not bytes. Again, based on a survey of parsers, try to keep strings shorter than 4095 characters.
+  * For enumerated parameters, the range is a comma-separated ordered list of valid labels. Enumerated labels are case sensitive. A boolean value is a special case of an enumerated type with the values: {FALSE, TRUE}
+* Default: Default value for the parameter, which MUST be consistent with the range limitations. If this is left blank then the parameter is mandatory.
+
+## Referencing Parameters
+
+Wherever a parameterised extension is referenced, include any parameters in an accompanying JSON file. If using an extensions directory, the JSON file MUST be named for the extension and included in the directory. For example, the example extension above would have an accompanying file *0000-example-extension.json* which might contain:
+
+    {"0000-example-extension.md": {
+        "first example parameter": "12",
+        "second example parameter": "Hello",
+        "third example parameter": "Green"
+    }}
+If, instead, the extension is referenced in a JSON file (e.g. additional digests in a manifest) then the parameters can be included immediately following the reference. However, there MUST be only one parameter set defined for each extension.
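The parameter rules added to README.md above can be sketched in code. The following is a minimal illustration only, not part of the proposal: `ParamDef` and `resolve` are hypothetical names, values are assumed to arrive as JSON strings as in the example file, and rejecting parameters that the extension does not define is an assumption the text does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParamDef:
    """One parameter definition: name, type, range and optional default."""
    name: str
    type: str                      # "integer", "string" or "enumerated"
    range: str                     # e.g. "0,255", "64", "Red,Green"
    default: Optional[str] = None  # None means the parameter is mandatory


def resolve(defs, supplied):
    """Return validated values for every parameter, applying defaults."""
    resolved = {}
    for d in defs:
        raw = supplied.get(d.name, d.default)
        if raw is None:
            raise ValueError(f"mandatory parameter missing: {d.name}")
        if d.type == "integer":
            low, high = (int(x) for x in d.range.split(","))
            value = int(raw)
            if not low <= value <= high:
                raise ValueError(f"{d.name}: {value} outside {low},{high}")
        elif d.type == "string":
            value = str(raw)
            if len(value) > int(d.range):  # len() counts characters, not bytes
                raise ValueError(f"{d.name}: longer than {d.range} characters")
        elif d.type == "enumerated":
            value = str(raw)
            if value not in d.range.split(","):  # case-sensitive match
                raise ValueError(f"{d.name}: {value!r} is not a valid label")
        else:
            raise ValueError(f"{d.name}: unknown type {d.type!r}")
        resolved[d.name] = value
    # Assumption: parameters not defined by the extension are an error.
    unknown = set(supplied) - {d.name for d in defs}
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return resolved


# Definitions mirroring docs/0000-example-extension.md.
EXAMPLE_DEFS = [
    ParamDef("first example parameter", "integer", "0,255"),
    ParamDef("second example parameter", "string", "64", "Not applicable"),
    ParamDef("third example parameter", "enumerated",
             "Red,Yellow,Orange,Green,Blue,Indigo,Violet", "Green"),
]

print(resolve(EXAMPLE_DEFS, {"first example parameter": "12"}))
```

Only the mandatory parameter is supplied here, so the other two fall back to their defaults. Omitting the first parameter, or passing a value such as `"256"`, raises `ValueError`.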
diff --git a/docs/0000-example-extension.md b/docs/0000-example-extension.md
@@ -11,6 +11,24 @@
 
 This extension is but an example and has no content, but if it did the content would be summarized in a paragraph here.
 
+## Parameters
+
+* name: first example parameter
+  * description: a mandatory 8-bit unsigned value
+  * type: integer
+  * range: 0,255
+  * default:
+* name: second example parameter
+  * description: an optional string of up to 64 characters, defaulting to "Not applicable" if omitted.
+  * type: string
+  * range: 64
+  * default: "Not applicable"
+* name: third example parameter
+  * description: An example enumerated parameter
+  * type: enumerated
+  * range: Red,Yellow,Orange,Green,Blue,Indigo,Violet
+  * default: Green
+
 ## Other Sections Providing Details, Examples, etc.
 
 ... more in here ...
diff --git a/docs/0001-digest-algorithms.md b/docs/0001-digest-algorithms.md
@@ -9,8 +9,33 @@
 
 This extension is an index of additional digest algorithms. It provides a controlled vocabulary of digest algorithm names that may be used to indicate the given algorithm in `fixity` blocks of OCFL Objects, and links their defining extensions.
 
+## Parameters
+
+* name: blake2b-160
+  * description: Indicates if this algorithm is used
+  * type: enumerated
+  * range: FALSE,TRUE
+  * default: FALSE
+* name: blake2b-256
+  * description: Indicates if this algorithm is used
+  * type: enumerated
+  * range: FALSE,TRUE
+  * default: FALSE
+* name: blake2b-384
+  * description: Indicates if this algorithm is used
+  * type: enumerated
+  * range: FALSE,TRUE
+  * default: FALSE
+* name: sha512/256
+  * description: Indicates if this algorithm is used
+  * type: enumerated
+  * range: FALSE,TRUE
+  * default: FALSE
+
 ## Digest Algorithms Defined in Community Extensions
 
+Each parameter corresponds to a Digest Algorithm Name used in the table below, and indicates if this algorithm is in use as a consequence of including this extension. As the parameters default to FALSE, only the algorithms used need to be listed in *0001-digest-algorithms.json*.
+
 | Digest Algorithm Name | Note |
 | --------------------- | ---- |
 | `blake2b-160`         | BLAKE2 digest using the 2B variant (64 bit) with size 160 bits as defined by [RFC7693](https://tools.ietf.org/html/rfc7693). MUST be encoded using hex (base16) encoding [RFC4648](https://tools.ietf.org/html/rfc4648). For example, the `blake2b-160` digest of a zero-length bitstream is `3345524abf6bbe1809449224b5972c41790b6cf2` (40 hex digits long). |

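As a rough illustration of the boolean parameters added to 0001-digest-algorithms.md, the sketch below lists the algorithms switched on by a parameter set and computes a BLAKE2b digest with Python's standard `hashlib`. The helper names and the `BLAKE2B_SIZES` table are hypothetical. `sha512/256` is left out because its availability in `hashlib` depends on the underlying OpenSSL build.

```python
import hashlib

# Digest sizes in bytes for the BLAKE2b variants named by this extension.
BLAKE2B_SIZES = {"blake2b-160": 20, "blake2b-256": 32, "blake2b-384": 48}


def enabled_algorithms(params):
    """Names whose parameter is "TRUE"; absent names default to FALSE."""
    return [name for name, value in params.items() if value == "TRUE"]


def blake2b_hex(name, data):
    """Hex (base16) digest of data using the named BLAKE2b variant."""
    return hashlib.blake2b(data, digest_size=BLAKE2B_SIZES[name]).hexdigest()


# Hypothetical contents of 0001-digest-algorithms.json, already parsed.
params = {"blake2b-160": "TRUE", "blake2b-384": "FALSE"}
for name in enabled_algorithms(params):
    # Digest of a zero-length bitstream; 160 bits gives 40 hex digits.
    print(name, blake2b_hex(name, b""))
```

The printed `blake2b-160` value can be compared against the zero-length digest quoted in the table above.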
diff --git a/docs/0002-N-tuple-tree.md b/docs/0002-N-tuple-tree.md
@@ -0,0 +1,180 @@
+# OCFL Community Extension 0002: N-tuple Trees for OCFL Storage Hierarchies
+
+  * Authors: Neil Jefferies
+  * Minimum OCFL Version: 1.0
+  * Obsoletes: n/a
+  * Obsoleted by: n/a
+
+## Overview
+
+This extension provides a general mechanism for describing a set of algorithms for mapping numerical identifiers onto a tree structure for constructing an OCFL Storage Hierarchy in a way that maintains performance. The aim is to balance the problem of having too many files or subdirectories in any one directory against that of having an overly deeply nested set of subdirectories. As both identifier formats and specific filesystem instances may vary, there is no single optimal approach.
+
+## Parameters
+
+* name: identifier length
+  * description: The number of meaningful characters in the base identifier 
+  * type: integer
+  * range: 0,4096
+  * default:
+* name: case mapping
+  * description: Indicates how case in the source identifier should be handled when mapping to a path
+  * type: enumerated
+  * range: ToUpper,ToLower,Literal
+  * default:
+* name: invert mapping
+  * description: Indicates whether mapping should begin at the rightmost (least significant) character of the stripped identifier
+  * type: enumerated
+  * range: FALSE,TRUE
+  * default: FALSE
+* name: tuple size
+  * description: Indicates the size of the chunks (in characters) that the identifier is split into during mapping
+  * type: integer
+  * range: 1,32
+  * default: 2
+* name: number of tuples
+  * description: Indicates how many chunks are used for path generation
+  * type: integer
+  * range: 0,32
+  * default:
+* name: short object root
+  * description: Indicates whether the OCFL object root directory name should use only the portion of the identifier that remains after path generation, rather than the full stripped identifier
+  * type: enumerated
+  * range: FALSE,TRUE
+  * default: FALSE
+
+## Detailed explanation
+
+The approach described here is a generalisation of the [PairTree](https://tools.ietf.org/html/draft-kunze-pairtree-01) algorithm designed to be more flexible as file storage technologies have developed. Conventional filesystems are generally better able to handle large numbers of files in a directory and object stores tend to favour much flatter storage hierarchies. In short, the approach is to derive a unique file path for an OCFL object from its unique identifier in a programmatic and repeatable manner. 
+
+### identifier length
+
+This extension assumes that object unique identifiers are all the same length and that all the characters of the identifier are reasonably well distributed (e.g. for a hexadecimal-based identifier, each character can be any value from 0-f). This may mean that it is prudent to adjust the format of the identifier before it can be safely used to generate a path. For example, a [UUID](https://tools.ietf.org/html/rfc4122) is typically written in the form "uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6" where the "uuid:" preamble and the hyphens are clearly non-unique elements. The "stripped" version used for path generation would be "f81d4fae7dec11d0a76500a0c91e6bf6". The **identifier length** parameter indicates the length of the stripped version of the identifier, in this example, 32.
+
+### case mapping
+
+Depending on how identifiers are generated, they may not always be consistent in their usage of case. A UUID is actually a hexadecimal number, so both "f81d4fae7dec11d0a76500a0c91e6bf6" and "F81D4FAE7DEC11D0A76500A0C91E6BF6" would be valid renderings of the example given above. However, many storage systems are case sensitive so, if we want to map identifiers to paths consistently, we need to specify which case to map to. The **case mapping** parameter allows this to be specified but also allows for identifiers that are lexical and thus should not necessarily be case mapped. Note especially that upper/lower case mappings are often language/locale dependent for characters outside the basic \[A-Z\]\[a-z\] range and thus quite likely to be non-portable.
+
+### invert mapping
+
+Some identifiers are generated sequentially. In this case the N-tuple tree approach to generating paths does not generate a well distributed tree if we begin at the leftmost (most significant) end of the stripped identifier. Instead, we end up with a small number of fully populated individual branches, which can be inefficient for larger numbers of objects. In order to avoid this, the **invert mapping** option indicates that mapping should start from the rightmost (least significant) end of the identifier, which changes most rapidly and therefore distributes more effectively. In the example, "f81d4fae7dec11d0a76500a0c91e6bf6" would be flipped to "6fb6e19c0a00567a0d11ced7eaf4d18f" before path conversion.
+
+Note that the mapping, and its inverse, operates at a character level and not bit-wise. Many identifier schemes are designed to be opaque and thus have pseudo-random characteristics, so the mapping defaults to the more intuitive most-significant to least-significant approach.
+
+### tuple size
+
+Indicates the size of the chunks that the identifier is split into during path generation. The optimal chunk size depends on a number of factors:
+* The number of values that each character in the identifier can have. For example, UUIDs are hexadecimal based so each character may be in the range \[0-9,a-f\] giving 16 different values, whereas an alphanumeric identifier might have the range \[0-9,a-z,A-Z\] giving 62 values.
+* The characteristics of the underlying storage and associated code libraries. Although not the case in the past, modern storage systems can generally handle tens of thousands of files in a directory without difficulty. It is more likely that the code libraries and tools used to access and parse these systems will encounter some performance limitations when handling large numbers of files. In particular, Linux command-line wildcard expansions are typically limited to just under 128K characters which equates to around 4000 directory names if they have 32 characters. 
+* Human readability is also reduced for long lists of files, which may make recovery *in extremis* more difficult.
+
+For a tuple size of 3, our example UUID of "f81d4fae7dec11d0a76500a0c91e6bf6" would be split up into a path beginning "/f81/d4f/ae7/...".
+
+### number of tuples
+
+In practice, you may wish to limit the depth of the OCFL Storage Hierarchy tree to avoid overly deep nesting of directories. The **number of tuples** determines this depth. Splitting the entire identifier down into tuples may not be necessary since the number of objects to be stored is often much less than all the possible values for the source identifier. 
+
+For example, if we split the example UUID into size 3 tuples, each directory can contain 4096 subdirectories so, with the number of tuples also set to three, the resulting tree would have 4096^3 (=68719476736) directories, each of which could contain one or more objects. All the objects with UUIDs that begin f81d4fae7... would be stored in OCFL object roots in the directory /f81/d4f/ae7/. However, if the UUIDs are reasonably pseudo-randomly distributed, the likelihood of many object identifiers sharing even the first 9 characters is quite low until a significant number of objects have been created.
+
+### short object root
+
+Once a Storage Hierarchy has been created, the name of each OCFL Object Root directory should also be determined from the identifier in a consistent manner. The default approach is to use the full (stripped) identifier but, if identifiers are long or there is a need to keep Storage Hierarchy paths short because of object complexity, there is the option to use just the portion of the identifier that remains after Storage Hierarchy path generation. To continue the UUID example, the full OCFL Object Root path could be:
+* **short object root = FALSE** /f81/d4f/ae7/f81d4fae7dec11d0a76500a0c91e6bf6/
+* **short object root = TRUE** /f81/d4f/ae7/dec11d0a76500a0c91e6bf6/
+
+## Examples
+
+These examples are taken from the OCFL Implementation Notes:
+
+* *Flat*: Each object is contained in a directory with a name that is simply derived from the unique identifier of the object.
+  * identifier length = 12
+  * case mapping = ToLower
+  * invert mapping = FALSE
+  * tuple size = 12
+  * number of tuples = 0
+  * short object root = FALSE
+
+                [storage_root]
+                    ├── 0=ocfl_1.0
+                    ├── ocfl_1.0.html (optional copy of the OCFL specification)
+                    ├── d45be626e024
+                    |   ├── 0=ocfl_object_1.0
+                    |   ├── inventory.json
+                    |   ├── inventory.json.sha512
+                    |   └── v1...
+                    ├── d45be626e036
+                    |   ├── 0=ocfl_object_1.0
+                    |   ├── inventory.json
+                    |   ├── inventory.json.sha512
+                    |   └── v1...
+                    ├── 3104edf0363a
+                    |   ├── 0=ocfl_object_1.0
+                    |   ├── inventory.json
+                    |   ├── inventory.json.sha512
+                    |   └── v1...
+                    └── ...
+
+
+* *PairTree*: [PairTree](https://tools.ietf.org/html/draft-kunze-pairtree-01) is designed to overcome the limitations on the number of files in a directory that most file systems have. It creates a hierarchy of directories by mapping identifier strings to directory paths two characters at a time. For numerical identifiers specified in hexadecimal this means that there are a maximum of 256 items in any directory, which is well within the capacity of any modern filesystem. However, for long identifiers, PairTree creates a large number of directories which will be sparsely populated unless the number of objects is very large. Traversing all these directories during validation or rebuilding operations can be slow.
+  * identifier length = 12
+  * case mapping = ToLower
+  * invert mapping = FALSE
+  * tuple size = 2
+  * number of tuples = 6
+  * short object root = FALSE
+
+                [storage_root]
+                    ├── 0=ocfl_1.0
+                    ├── ocfl_1.0.html (optional copy of the OCFL specification)
+                    ├── d4
+                    |   └── 5b
+                    |       └── e6
+                    |           └── 26
+                    |               └── e0
+                    |                   ├── 24
+                    |                   |   └── d45be626e024
+                    |                   |       ├── 0=ocfl_object_1.0
+                    |                   |       └── ...
+                    |                   └── 36
+                    |                       └── d45be626e036
+                    |                           ├── 0=ocfl_object_1.0
+                    |                           └── ...
+                    ├── 31
+                    |   └── 04
+                    |       └── ed
+                    |           └── f0
+                    |               └── 36
+                    |                   └── 3a
+                    |                       └── 3104edf0363a
+                    |                           ├── 0=ocfl_object_1.0
+                    |                           └── ...
+                    └── ...
+
+
+* *Truncated n-tuple Tree*: This approach aims to achieve some of the scalability benefits of PairTree whilst limiting the depth of the resulting directory hierarchy. To achieve this, the source identifier can be split at a higher level of granularity, and only a limited number of the identifier digits are used to generate directory paths. For example, using triples and three levels with the example above yields:
+  * identifier length = 12
+  * case mapping = ToLower
+  * invert mapping = FALSE
+  * tuple size = 3
+  * number of tuples = 3
+  * short object root = FALSE
+
+                [storage_root]
+                    ├── 0=ocfl_1.0
+                    ├── ocfl_1.0.html (optional copy of the OCFL specification)
+                    ├── d45
+                    |   └── be6
+                    |       └── 26e
+                    |           ├── d45be626e024
+                    |           |   ├── 0=ocfl_object_1.0
+                    |           |   └── ...
+                    |           └── d45be626e036
+                    |               ├── 0=ocfl_object_1.0
+                    |               └── ...
+                    ├── 310
+                    |   └── 4ed
+                    |       └── f03
+                    |           └── 3104edf0363a
+                    |               ├── 0=ocfl_object_1.0
+                    |               └── ...
+                    └── ...
+
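The mapping described in 0002-N-tuple-tree.md can be sketched as a single function. This is a minimal illustration under stated assumptions, not a reference implementation. `ntuple_path` is a hypothetical name, and the identifier is assumed to be already stripped. With **invert mapping** set, the sketch assumes a full object root keeps the original (non-reversed) identifier while a short object root is the remainder of the reversed one; the extension text does not settle this. The **identifier length** check is omitted.

```python
def ntuple_path(identifier, case_mapping, number_of_tuples, tuple_size=2,
                invert_mapping=False, short_object_root=False):
    """Map a stripped identifier to a path relative to the storage root."""
    if case_mapping == "ToLower":
        identifier = identifier.lower()
    elif case_mapping == "ToUpper":
        identifier = identifier.upper()
    elif case_mapping != "Literal":
        raise ValueError(f"unknown case mapping: {case_mapping!r}")

    # Inversion is character-wise: the whole identifier is reversed.
    source = identifier[::-1] if invert_mapping else identifier

    used = tuple_size * number_of_tuples
    if used > len(source):
        raise ValueError("identifier too short for the requested tuples")
    tuples = [source[i:i + tuple_size] for i in range(0, used, tuple_size)]

    # Assumption: a short object root is whatever follows the consumed tuples.
    root = source[used:] if short_object_root else identifier
    if not root:
        raise ValueError("short object root would be empty")
    return "/".join(tuples + [root])


# The three layouts from the Examples section, then the short-root UUID case.
print(ntuple_path("d45be626e024", "ToLower", 0, tuple_size=12))
print(ntuple_path("d45be626e024", "ToLower", 6, tuple_size=2))
print(ntuple_path("d45be626e024", "ToLower", 3, tuple_size=3))
print(ntuple_path("F81D4FAE7DEC11D0A76500A0C91E6BF6", "ToLower", 3,
                  tuple_size=3, short_object_root=True))
```

The returned paths are relative to the storage root, so they omit the leading `/` used in the prose examples.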