The emerging ecosystem of complementary conventions for archiving data sets is colliding, in a good way, with innovations in semantic annotation for federated web APIs. smartBag blends these with a view to making semantically rich, machine readable data the norm.
The NIH Data Commons seeks to create a unified platform for biomedial computing. It makes use of a number of data protocols including:
- Bagit is a file packaging protocol.
- Bagit-RO integrates the Bagit and Research Object frameworks.
- BDBags extends Bagit-RO so that referenced data files may be remote, referenced via ids with checksums.
The NCATS Data Translator is annotating federated web APIs with semantic information. This makes biomedical data amenable to automated discovery, access, and reasoning.
- RDF Is a knowledge representation format.
- JSON Is a data serialization format widely used on the web.
- OpenAPI Is a specification for describing web data interfaces.
- smartAPI Extends the OpenAPI spec with additional metadata.
- JSON-LD Is an RDF serialization format for JSON.
Translator is making automated reasoning over biomedical data tractable but is gated by
- Development: Exposing data as web APIs is tedious and expensive.
- Technology: The underlying technologies to do this are in perpetual flux.
The Data Commons is providing scalable computing and a home for large biomedical data but would benefit from
- Semantic Annotation: A base line approach for publishing data sets for dynamic query with semantic metadata.
- Support for AI: Methods for data stewards to make data amenable to automated resoning.
It would be better to
- Annotate: Annotate data archives with appropriate semantic and ontological metadata.
- Generate: Compile the data and semantics to publish them into various evolving tech pipelines.
The smartBag tool chain lets data stewards (optimally) or consumers (pragmatically) semantically annotate their data sources using Research Object (RO) conventions. Use JSON-LD contexts to specify the identifiers and ontologies describing tabular data. smartBag integrates the Bagit suite of conventions with the Data Translator.
The smartBag toolchain will generate an executable smartAPI from a properly annotated bag. Data stewards who annotate their data will be rewarded with the flexibility to compile data publishing pipelines to target arbitrary data delivery and execution platforms.
In the following steps we use the endpoints/ctd directory to compile a smartAPI for accessing a subset of the Clinical Toxicogenomic Database.
These steps clone the repo and set up your path to use the toolchain:
git clone git@github.com:NCATS-Tangerine/smartBag.git
cd smartBag
pip install -r requirements.txt
export PATH=$PWD/bin:$PATH
Next, we download data files for the data set we're working with. In this case, they're for CTD. The metadata frame of the bag is in the endpoints/ctd directory and is structured like this:
└── metadata
├── annotations
│ ├── CTD_chem_gene_ixn_types.csv.jsonld
│ ├── CTD_chemicals.csv.jsonld
│ └── CTD_pathways.csv.jsonld
├── manifest.json
└── provenance
└── results.prov.jsonld
This step also configures the bag we'll create by copying JSON-LD and other metadata as well as data files into a bag staging directory.
cd endpoints/ctd
./configure
This stages a bag directory structure blending selected data files with metadata like this:
├── CTD_chem_gene_ixn_types.csv
├── CTD_chemicals.csv
├── CTD_pathways.csv
└── metadata
├── annotations
│ ├── CTD_chem_gene_ixn_types.csv.jsonld
│ ├── CTD_chemicals.csv.jsonld
│ └── CTD_pathways.csv.jsonld
├── manifest.json
└── provenance
└── results.prov.jsonld
This next command creates a BDBag archive (bag.tgz) of the configured data. Note this is automatically done in the previous "configure" step.
./smartbag make bag
Next we generate the smartAPI based on the provided metadata.
- Generate code for a smartAPI based on the bag
- Create a sqlite3 database per tabular file, inserting all rows
- Generate an OpenAPI interface able to query all rows by each column
- Add smartAPI specific tags based on accompanying JSON-LD annotations
- A configuration file must be specified to declare the properties of Swagger and website (flask) settings.
- The command line below should be run in the bin/smartbag directory.
./smartbag make smartapi --bag ../endpoints/ctd/bag.tgz --opts ../endpoints/ctd/options.json
Finally, run the smartAPI. Here's a link to your server once you've run the command.
./smartbag run smartapi
The generated user interface looks like this:
To query one of the services, use a valid column value like this:
It also serves its own self describign JSON-LD metadata:
The API makes it easy to introspect example values to help explore the interface.
It also serves its own smartAPI schema document.
This alpha release is applicabile to data stewards or consumers with tabular data.
Of course, this is preliminary. Candidate next steps include:
- Consolidating all datasets into different tables.
- Supporting join and aggregate queries.
- Generating different back ends.
- Generating more of the Research Object metadata infrastructure like the manifest.