Repo for the Tanagra service being developed by the All of Us DRC

DataBiosphere/tanagra

Tanagra

Tanagra is a project to build a configurable cohort builder and data explorer. Our goal is to make it easy to set up a new dataset for exploring with little or no custom code required, so everything we've built is configuration-driven.

Project overview

The project has three main pieces: indexer, service, UI. All three pieces are highly interconnected and are not intended to be used or deployed separately. Everything lives in this single GitHub repository.

The indexer takes the source dataset and produces a logical copy that's better suited to the types of queries the UI needs to run. It denormalizes some data, precomputes some values, and reorganizes tables. The goal is not to hit a particular query benchmark, only to keep UI queries from timing out.
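As a rough illustration of what "denormalize and precompute" can mean, the sketch below folds a per-person count from a child table into each person row, so a later query can filter on the count without a join. This is not the Tanagra indexer: the function name `build_person_index`, the row shapes, and the `condition_count` column are all hypothetical and chosen only for the example.

```python
from collections import Counter


def build_person_index(persons, condition_occurrences):
    """Return person rows with a precomputed condition count attached.

    Hypothetical sketch: both inputs are lists of dicts standing in for
    source tables; neither reflects Tanagra's real schema or output.
    """
    # Precompute: one pass over the child table, counting rows per person.
    counts = Counter(row["person_id"] for row in condition_occurrences)

    # Denormalize: copy each person row and attach its count (0 if absent).
    return [
        {**person, "condition_count": counts.get(person["person_id"], 0)}
        for person in persons
    ]
```

With the count stored on the row, a query such as "people with at least two conditions" becomes a simple filter over one table instead of a join plus aggregation at request time.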

The service processes queries for the UI and manages the application database, which stores user-managed artifacts like cohorts and data feature sets.

The UI includes the cohort builder, data feature set builder, export, and cohort review interfaces.

Configure a new dataset

Tanagra supports data patterns, rather than specific SQL schemas. Check the list of currently supported patterns to see how they map to your dataset.

Tanagra defines a custom object model on top of the underlying relational data. The dataset configuration language is based on this object model, so it's helpful to be familiar with the main concepts.

A dataset configuration is spread across multiple files to improve readability and make it easier to share configuration across datasets. See the overview of the different files and directory structure, which also points to example files. Check the full dataset configuration schema documentation to look up specific properties. Separate documentation covers the protocol buffers used for visualizations and criteria plugins.

Set up a new deployment

Choose a deployment pattern and configure the GCP project(s).

Once you've defined the configuration files for a dataset, run the indexer. Check the full indexer CLI documentation to look up specific commands.

Tanagra does not provide an API for managing access control for a population of users. Instead, we provide an interface for calling an external access control service. For example, the VUMC admin service serves as the external access control service for the SD deployment. Either reuse an existing access control implementation or add your own.
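The general shape of such a pluggable interface can be sketched as follows. This is a minimal illustration of the pattern, not Tanagra's actual interface: the names `AccessControl`, `ResourceId`, `is_authorized`, `AllowListAccessControl`, and `require_access` are all assumptions made for the example, and the in-memory allow list merely stands in for a call to an external service.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceId:
    """Hypothetical identifier for a user-managed artifact, e.g. a cohort."""
    kind: str
    name: str


class AccessControl(ABC):
    """Hypothetical plugin interface; a deployment supplies one implementation."""

    @abstractmethod
    def is_authorized(self, user_email: str, action: str, resource: ResourceId) -> bool:
        """Return True if the user may perform the action on the resource."""


class AllowListAccessControl(AccessControl):
    """Stand-in for an external service: a fixed in-memory set of grants."""

    def __init__(self, grants):
        # Each grant is a (user_email, action, resource) tuple.
        self._grants = set(grants)

    def is_authorized(self, user_email, action, resource):
        return (user_email, action, resource) in self._grants


def require_access(access_control, user_email, action, resource):
    """Raise PermissionError unless the plugged-in implementation allows the call."""
    if not access_control.is_authorized(user_email, action, resource):
        raise PermissionError(
            f"{user_email} may not {action} {resource.kind}/{resource.name}"
        )
```

The calling code depends only on the abstract interface, so a deployment can swap the allow list for an implementation that queries its own access control service without touching the callers.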

We expect deployments to require varied methods of exporting data. Either reuse an existing export implementation or add your own.

Check the full application configuration documentation to look up specific deployment properties.

Once your deployment is up and running, create a regression test suite to detect unexpected changes due to config or underlying data changes and run it re