This repository contains a specification for delayed array operations stored in an HDF5 file.
The concept of delayed operations is taken from Bioconductor's DelayedArray package,
where any operations on a DelayedArray
are cached in memory and evaluated on an as-needed basis.
Our aim is to save these operations to file in a well-defined, cross-language format;
this avoids the need to compute and store the results of such operations, which may be prohibitively expensive.
Several use cases benefit from the serialization of delayed operations:
- We have an immutable array dataset stored in a database. Rather than making a copy for manipulation, we can hold a reference to the original and save the operations (slicing, arithmetic, etc.). This avoids duplication of large datasets.
- We have a dataset that can be represented in a small type, e.g., uint8_t values. We apply a transformation that promotes the type, e.g., log-transformation to floats or doubles. By saving the delayed operation, we can maintain our compact representation to reduce the file size.
- We have a sparse dataset that is subjected to a sparsity-breaking operation, e.g., centering. Rather than saving the dense matrix, we keep the efficient sparse representation and save the delayed operation.
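To make the type-promotion use case concrete, here is a minimal, self-contained C++ sketch. The class name DelayedLog1p and its methods are hypothetical and are not part of chihaya or DelayedArray; the sketch only illustrates keeping the compact uint8_t storage and applying the transformation when a value is requested.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical illustration type; not part of chihaya or DelayedArray.
class DelayedLog1p {
public:
    explicit DelayedLog1p(std::vector<std::uint8_t> raw) : raw_(std::move(raw)) {}

    std::size_t size() const { return raw_.size(); }

    // The log-transformation is applied only when a value is requested,
    // so the promoted doubles are never stored.
    double get(std::size_t i) const {
        assert(i < raw_.size());
        return std::log1p(static_cast<double>(raw_[i]));
    }

    // Bytes held for the data: one per element, instead of sizeof(double)
    // per element for a materialized result.
    std::size_t stored_bytes() const {
        return raw_.size() * sizeof(std::uint8_t);
    }

private:
    std::vector<std::uint8_t> raw_;
};
```

Constructing a DelayedLog1p from a vector of three uint8_t values holds three bytes of data, whereas materializing the transformed values as doubles would typically need 24.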
In the chihaya specification, we store a "delayed object" as an HDF5 group in the file. Delayed operations are represented as further nested groups, terminating in an array containing the original data (or a reference to it). The type of delayed operation/array is specified in the group's attributes. By recursively inspecting the contents of each HDF5 group, applications can reconstitute the original delayed object in their framework of choice.
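The recursive inspection described above can be sketched with an in-memory stand-in for the nested groups. Everything here is an assumption made for illustration: the Node struct, its field names, the type labels ("dense array", "add scalar", "log") and the helper functions are hypothetical and are not taken from the specification or the chihaya library. The sketch also evaluates the operations eagerly for simplicity, whereas a real application would more likely build a delayed object in its own framework.

```cpp
#include <cassert>
#include <cmath>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for an HDF5 group.
struct Node {
    std::string type;            // plays the role of the group's type attribute
    std::vector<double> values;  // only used by the terminal "dense array"
    std::unique_ptr<Node> seed;  // nested group that an operation applies to
    double scalar = 0;           // operand for "add scalar"
};

// Terminal node holding the original data.
std::unique_ptr<Node> make_array(std::vector<double> values) {
    std::unique_ptr<Node> n(new Node());
    n->type = "dense array";
    n->values = std::move(values);
    return n;
}

// Operation node wrapping a nested seed.
std::unique_ptr<Node> make_op(std::string type, std::unique_ptr<Node> seed, double scalar = 0) {
    assert(seed != nullptr);
    std::unique_ptr<Node> n(new Node());
    n->type = std::move(type);
    n->seed = std::move(seed);
    n->scalar = scalar;
    return n;
}

// Recurse until the terminal array is reached, then apply each
// operation on the way back out.
std::vector<double> evaluate(const Node& node) {
    if (node.type == "dense array") {
        return node.values;
    }
    if (!node.seed) {
        throw std::runtime_error("operation '" + node.type + "' has no seed");
    }
    std::vector<double> out = evaluate(*node.seed);
    if (node.type == "add scalar") {
        for (double& v : out) v += node.scalar;
    } else if (node.type == "log") {
        for (double& v : out) v = std::log(v);
    } else {
        throw std::runtime_error("unknown type '" + node.type + "'");
    }
    return out;
}
```

For example, wrapping make_array({0.0, 1.0}) in an "add scalar" node with operand 1 and then a "log" node, and calling evaluate() on the outermost node, visits the array first and applies the addition and the logarithm in that order.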
The chihaya specification currently supports a range of delayed operations including subsetting, combining, transposition, matrix products, and an assortment of unary and binary operations. It also supports dense arrays, sparse matrices, constant arrays and custom arrays. More details about the on-disk representation of each operation can be found in the specifications.
In C++, a delayed object in a file can be validated by calling the validate function:
#include "chihaya/chihaya.hpp"
chihaya::validate("path_to_file.h5", "delayed/object/name");
In R, DelayedArray objects (from the DelayedArray package) can be saved to a chihaya-compliant HDF5 file using our R package.
The same package also reconstitutes a DelayedArray from the file.
library(DelayedArray)
X <- DelayedArray(matrix(runif(2000), 100, 20)) # 100 x 20 matrix, one random value per entry
X <- log(t(t(X) / runif(ncol(X))) + 1)
library(chihaya)
tmp <- tempfile(fileext=".h5")
saveDelayed(X, tmp)
Y <- loadDelayed(tmp)
If you're using CMake, you just need to add something like this to your CMakeLists.txt:
include(FetchContent)
FetchContent_Declare(
chihaya
GIT_REPOSITORY https://github.com/ArtifactDB/chihaya
GIT_TAG master # or any version of interest
)
FetchContent_MakeAvailable(chihaya)
Then you can link to chihaya to make the headers available during compilation:
# For executables:
target_link_libraries(myexe chihaya)
# For libraries:
target_link_libraries(mylib INTERFACE chihaya)
You can install the library by cloning a suitable version of this repository and running the following commands:
mkdir build && cd build
cmake .. -DTATAMI_TESTS=OFF
cmake --build . --target install
Then you can use find_package() as usual:
find_package(artifactdb_chihaya CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE artifactdb::chihaya)
If you're not using CMake, the simple approach is to just copy the files in the include/ subdirectory, either directly or with Git submodules, and include their path during compilation with, e.g., GCC's -I flag.
You will also need to link to the HDF5 library, usually from a system installation (1.10 or higher).
Web applications can read delayed matrices into memory using the chihaya JavaScript package.
At some point, we may also add tatami bindings to load the delayed operations into memory. This would enable C++ applications to natively read from the HDF5 files that comply with chihaya's specification.
The library is provisionally named after Chihaya Kisaragi, one of my favorite characters.