theislab · davidsebfischer · Dec 10, 2020 · Nov 6, 2020 · Nov 9, 2020 · Nov 9, 2020
diff --git a/README.rst b/README.rst
@@ -1,14 +1,31 @@
-Managing single-cell data sets and neural networks used for analysis
-=====================================================================
+|Stars| |PyPI| |PyPIDownloads|
 
-.. image:: https://github.com/theislab/sfaira/blob/master/resources/images/concept.jpeg
-   :width: 600px
+.. |Stars| image:: https://img.shields.io/github/stars/theislab/sfaira?logo=GitHub&color=yellow
+   :target: https://github.com/theislab/sfaira/stargazers
+.. |PyPI| image:: https://img.shields.io/pypi/v/sfaira?logo=PyPI
+   :target: https://pypi.org/project/sfaira
+.. |PyPIDownloads| image:: https://pepy.tech/badge/sfaira
+   :target: https://pepy.tech/project/sfaira
+
+
+sfaira - data and model repository for single-cell data
+=======================================================
+
+.. image:: https://github.com/theislab/sfaira/blob/master/resources/images/concept.png
+   :width: 1000px
    :align: center
 
 sfaira_ is a model and a data repository in a single python package. 
-Its data API gives users access to streamlined data loaders that allow reproducible use of published and private data sets for model training and exploration.
-Its model API gives user streamlined access to pre-trained models and to common model architectures to ease usage of neural networks in common single-cell analysis workflows.
+We provide an interactive overview of the current state of the zoos on sfaira-site_.
+
+Its data zoo gives users access to streamlined data loaders that allow reproducible use of published and private data sets for model training and exploration.
+Its model zoo gives users streamlined access to pre-trained models and to common model architectures to ease usage of neural networks in common single-cell analysis workflows.
+A model zoo is a software infrastructure that improves user access to pre-trained models which are separately published, such as DCA_ or scArches_:
+Instead of focussing on developing new models, we focus on making models easily accessible to users and distributable by developers.
 sfaira integrates into scanpy_ workflows.
 
 .. _scanpy: https://github.com/theislab/scanpy
 .. _sfaira: https://sfaira.readthedocs.io
+.. _DCA: https://github.com/theislab/dca
+.. _scArches: https://github.com/theislab/scarches
+.. _sfaira-site: https://theislab.github.io/sfaira-site/index.html
diff --git a/docs/api/index.rst b/docs/api/index.rst
@@ -0,0 +1,154 @@
+.. module:: sfaira
+.. automodule:: sfaira
+   :noindex:
+
+API
+===
+
+Import sfaira as::
+
+   import sfaira
+
+
+
+Data: `data`
+------------
+
+.. module:: sfaira.data
+.. currentmodule:: sfaira
+
+The sfaira data zoo API.
+
+
+Pre-defined data set collections
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This sub-module gives you access to curated subsets of the data zoo, e.g. all data sets from human lungs.
+
+.. autosummary::
+   :toctree: .
+
+   data.human
+   data.mouse
+
+
+Functionalities for interactive data analysis
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This sub-module gives you access to functionalities you need to define your own data set collections based on the sfaira data zoo.
+
+.. autosummary::
+   :toctree: .
+
+   data.DatasetBase
+   data.DatasetGroupBase
+   data.DatasetSuperGroup
+
+
+Functionalities for interactive data analysis
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This sub-module gives you access to functionalities you need to load a new, raw data set into the data zoo on the fly, so that it can be handled in the context of zoo data sets.
+
+.. autosummary::
+   :toctree: .
+
+   data.DatasetInteractive
+
+
+Genomes: `genomes`
+------------------
+
+.. module:: sfaira.genomes
+.. currentmodule:: sfaira
+
+This sub-module gives you access to properties of the genome representations used in sfaira.
+
+.. autosummary::
+   :toctree: .
+
+   genomes.ExtractFeatureListEnsemble
+
+
+Models: `models`
+----------------
+
+.. module:: sfaira.models
+.. currentmodule:: sfaira
+
+The sfaira model zoo API for advanced use.
+This API is structured by streamlined, task-specific APIs for specific analysis problems.
+This API is targeted at developers; see also `ui` for a user-centric API that wraps this model zoo.
+
+
+Cell-type predictor models
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This sub-module handles models that predict cell types.
+
+.. autosummary::
+   :toctree: .
+
+   models.celltype
+
+
+Embedding models
+~~~~~~~~~~~~~~~~
+
+This sub-module handles models that embed expression vectors (cells) into a latent space.
+
+.. autosummary::
+   :toctree: .
+
+   models.embedding
+
+
+Train: `train`
+--------------
+
+.. module:: sfaira.train
+.. currentmodule:: sfaira
+
+The interface for training sfaira compatible models.
+This sub-module is dedicated to developers and eases model training and deployment.
+
+Trainer classes
+~~~~~~~~~~~~~~~
+
+Trainer classes wrap estimator classes (which wrap model classes) and handle grid-search-specific tasks centred on model fits,
+such as saving evaluation metrics and model weights.
+
+.. autosummary::
+   :toctree: .
+
+   train.TargetZoos
+   train.TrainModelCelltype
+   train.TrainModelEmbedding
+
+
+Grid search summary classes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Grid search summary classes allow a developer to easily interact with a finished grid search by loading and summarising results,
+which were saved through Trainer classes.
+
+.. autosummary::
+   :toctree: .
+
+   train.GridsearchContainer
+   train.SummarizeGridsearchCelltype
+   train.SummarizeGridsearchEmbedding
+
+User interface: `ui`
+--------------------
+
+.. module:: sfaira.ui
+.. currentmodule:: sfaira
+
+This sub-module gives users access to the model zoo, including model query from remote servers.
+This API is designed to be used in analysis workflows and does not require any understanding of the way models are defined and stored.
+
+.. autosummary::
+   :toctree: .
+
+   ui.UserInterface
diff --git a/docs/data.rst b/docs/data.rst
@@ -1,22 +1,46 @@
 Data
 ======
 
+.. image:: https://raw.githubusercontent.com/theislab/sfaira/master/resources/images/data_zoo.png
+   :width: 600px
+   :align: center
+
 Build data repository locally
 ------------------------------
 
-Build a repository structure:
-1. Choose a directory to dedicate to the data base, called root in the following.
-2. Make subfolders in root for each organism for which you want to build a data base.
-3. Make subfolders for each organ whithin each organism for which you want to build a data base.
+Build a repository structure
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+    1. Choose a directory to dedicate to the data base, called root in the following.
+    2. Make subfolders in root for each organism for which you want to build a data base.
+    3. Make subfolders for each organ within each organism for which you want to build a data base.
+
+We maintain a couple of download scripts that automate this process. Each has to be executed once in a shell to download a specific subset of the full data zoo.
+These scripts can be found in sfaira.data.download_scripts.
+
+Use 3rd party repositories
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+Some organizations provide streamlined data objects that can be directly consumed by data zoos such as sfaira.
+One example of such an organization is the cellxgene_ data portal.
+Through these repositories, one can build or extend a collection of data sets that can be easily interfaced with sfaira.
+Data loaders for cellxgene structured data objects will be available soon!
+Contact us for support of any other repositories.
+
+.. _cellxgene: https://cellxgene.cziscience.com/
+
+Add data sets
+~~~~~~~~~~~~~
 
-Add data sets:
-4. For each species and organ combination, choose the data sets that you want to use.
-5. Identify the raw files as indicated in the data loader classes and copy them into the folder. Use processed data
-using the described processing if this is required: This is usually done to speed up loading for file
-formats that are difficult to access.
+    4. For each species and organ combination, choose the data sets that you want to use.
+    5. Identify the raw files as indicated in the data loader classes and copy them into the folder. Use processed data
+    using the described processing if this is required: This is usually done to speed up loading for file
+    formats that are difficult to access.
+
+Data loaders
+------------
 
 Use data loaders on existing data repository
---------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 You only want to use data sets with existing data loaders and have adapted your directory structure as above?
 In that case, you can immediately start using the data loader functions, you just need to supply the root directory
@@ -25,10 +49,8 @@ Depending on the functionalities you want to use, you need to create a directory
 can be easily done via the data set api itself, example python scripts are under benchmarks/data_preparation. This
 meta information is necessary to anticipate file sizes for backing merged adata objects for example.
 
-TODO example.
-
 Contribute data loaders
------------------------
+~~~~~~~~~~~~~~~~~~~~~~~
 
 Each data set (organism, organ, protocol, optionally also batches) has its own data loader class. Each such class is
 in a separate file and inherits from a base class that contains most functionalities. Accordingly, the data loader class
@@ -74,7 +96,7 @@ before it is loaded into memory:
         if fn is None:
             if self.path is None:
                 raise ValueError("provide either fn in load or path in constructor")
-            fn = os.path.join(self.path, "human/eye/my_data.h5ad")  defined file in streamlined directory structure
+            fn = os.path.join(self.path, "human", "eye", "my_data.h5ad")  # defined file in streamlined directory structure
         self.adata = anndata.read(fn)  # loading instruction into .adata, use other ones if the data is not h5ad
 
         self.adata.uns["lab"] = x  # load the adata.uns with meta data
@@ -108,13 +130,59 @@ in which local data and cell type annotation can be managed separately but still
 The data loaders and cell type annotation formats between sfaira and sfaira_extensions are identical and can be easily
 copied over.
 
-
-Handling ontologies in data loaders
------------------------------------
-
-Each data loader has a versioned cell type annotation map, a dictionary.
-This dictionary allows mapping of the cell type annotations that come with the raw form of the data set to the cell type
-universe or ontology terms defined in sfaira, this is, however, only done upon loading of the data (.load()).
-The outcome of this map is a new set of cell type labels that can be propagated to leave nodes of the ontology graph.
-This dictionary requires a new entry for each new version of the corresponding cell type universe.
-
+Ontology management
+-------------------
+
+Sfaira maintains versioned cell type universes and ontologies by species and organ.
+A cell type universe is a list of the unique, most fine-grained cell type definitions available.
+These cell types can be referred to by a human readable cell type name or a structure identifier within an ontology,
+an ontology ID.
+Often, one is also interested in access to more coarse-grained groups of cell types, for example if the data quality
+does not allow one to distinguish between T cell subtypes.
+To allow coarser type definitions, sfaira maintains hierarchies of cell types, in which each hierarchical level is again
+defined by a cell type identifier.
+Such a hierarchy can be written as a directed acyclic graph which has the cell type universe as its leaf nodes.
+Intuitively, the cell type hierarchy graph depends on the cell type universe.
+Accordingly, both are versioned together in sfaira:
+Updates in the cell type universe, such as the discovery of a new cell type, lead to an update of the ontology and an
+increment in both of their versions.
+These versioned changes materialise as a distinct list (universe) and dictionary (ontology) for each version in the
+file that harbors the species- and organ-specific class that inherits from CelltypeVersionsBase and thus remain available
+even after updates.
+This versioning without deprecation of the old objects allows sfaira to execute and train models that were designed
+for older cell type universes and thus ensures reproducibility.
+
+Contribute cell types to ontologies
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To contribute new cell types or change existing cell type universe entries, the cell type universe version has to be
+incremented; the new entry can then simply be added to the list or modified in the list.
+We do not increment the universe version if a change does not influence the identity of a leaf node with respect to
+the other types in the universe, i.e. if it simply changes the spelling of a cell type or if an ontology ID is added to
+a type that previously did not have one.
+
+Contribute hierarchies to ontologies
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To contribute a term to a cell type ontology, one just has to add a dictionary item that defines the new term as a set
+of the leaf nodes (cell type universe) of the corresponding universe version.
+
+
+Using ontologies to train cell type classifiers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Cell type classifiers can be trained on data sets with different granularity of cell type annotation using aggregate
+cross-entropy as a loss and aggregate accuracy as a metric.
+The one-hot encoded cell type label matrix is accordingly modified in the estimator class in data loading if terms
+that correspond to intermediate nodes (rather than leaf nodes) are encountered in the label set.
+
+Genome management
+-----------------
+
+We streamline feature spaces used by models by defining standardized gene sets that are used as model input.
+By default, sfaira currently works with the protein coding genes of a genome assembly.
+A model topology version includes the genome it was trained for, which also defines the features of this model as genes.
+As genome assemblies are updated, model topology versions can be updated and models retrained to reflect these changes.
+Note that because protein coding genes do not change drastically between genome assemblies,
+samples can be carried over to assemblies they were not aligned against by matching gene identifiers.
+Sfaira automatically tries to match gene identifiers to the genome assembly selected through the current model.