Master merge into dev #26

Merged 66 commits on Dec 10, 2020

Commits
a34ad0e
add plot_npc and plot_active_latent_units (#9)
mk017 Nov 6, 2020
e45e5e1
added data loader for interactive workflows with unprocessed data
davidsebfischer Nov 9, 2020
9307be3
made cell type loading optional in dataset .load()
davidsebfischer Nov 9, 2020
e266773
enabled usage of type estimator on data without labels in prediction …
davidsebfischer Nov 9, 2020
35404ef
recursively search custom model repo for weights files
le-ander Nov 10, 2020
dd04526
sort model lookuptable alphabetically before writing it
le-ander Nov 10, 2020
8c665c0
make sure mode_path is set correctly in model_lookuptable when recurs…
le-ander Nov 10, 2020
78b1243
fix os.path.join usage in dataloaders
le-ander Nov 10, 2020
2685944
replace path handling through string concatenations with os.paths.joi…
le-ander Nov 10, 2020
511f9a3
fix bug in lookup table writing
le-ander Nov 11, 2020
5781c7e
add mdoel file path to lookup table
le-ander Nov 11, 2020
c18bf68
reset index in model lookuptable before saving
le-ander Nov 11, 2020
77102ef
add method to user interface for pushing local model weights to zenodo
le-ander Nov 11, 2020
7621069
fix bug in user interface
le-ander Nov 11, 2020
22cfca3
fix bux in summaries.py
le-ander Nov 11, 2020
ab185f1
use absolute model paths when model_lookuptable is used
le-ander Nov 11, 2020
71d4136
fix bug in pretrained weights loading
le-ander Nov 11, 2020
9d97d65
fix bug in pretrained weights loading
le-ander Nov 11, 2020
cce9e95
automatically create an InteractiveDataset when loading data through …
le-ander Nov 11, 2020
ab9c5c7
fix bug inUI data loading
le-ander Nov 11, 2020
b54cdb5
Explicitly cast indices and indptr of final backed file to int64. (#17)
Abdul-Moeed Nov 20, 2020
7273b50
update human lung dataset doi
le-ander Nov 26, 2020
76ca89b
align mouse organ names with human organ names
le-ander Dec 1, 2020
84c3a8d
fix typo in trachea organ naming in mouse
le-ander Dec 1, 2020
f7eb320
rename mouse ovary organ to femalegonad
le-ander Dec 1, 2020
ac0f959
rename mouse ovary organ to femalegonad
le-ander Dec 1, 2020
7e84260
sort by model type in classwise f1 heatmap plot
le-ander Dec 2, 2020
1afe596
another hacky solution to ensure a summary tab can be created when bo…
le-ander Dec 2, 2020
c71e2b3
allow custom metadata in zenodo submission
le-ander Dec 3, 2020
2d71aef
do not return doi but deposit url after depositing to zenodo sandbox …
le-ander Dec 3, 2020
b01c566
updated model zoo description
davidsebfischer Dec 3, 2020
c5eaa4e
recognise all .h5 and .data-0000... files as sfaira weights when cons…
le-ander Dec 3, 2020
b0e92bf
Update README.rst
davidsebfischer Dec 4, 2020
82993e5
Add selu activation and lecun_normal weight_init scheme for human VAE…
Abdul-Moeed Dec 4, 2020
ace102c
update sfaira erpo url and handle .h5 extension in model lookuptable id
le-ander Dec 5, 2020
b04aacd
add meta_data download information to all human dataloaders
le-ander Dec 7, 2020
183bc94
updated docs
davidsebfischer Dec 7, 2020
b7b2d41
updated reference to README in docs
davidsebfischer Dec 7, 2020
c3cb1be
updated index
davidsebfischer Dec 7, 2020
33028d0
included reference to svensson et al data base in docs
davidsebfischer Dec 7, 2020
a88e49a
fixed typo in docs
davidsebfischer Dec 7, 2020
7835f5c
fixed typos in docs
davidsebfischer Dec 7, 2020
3af489c
restructured docs
davidsebfischer Dec 8, 2020
01758b7
fixed bug in reference roadmap in docs
davidsebfischer Dec 8, 2020
1990fa3
updated data and model zoo description
davidsebfischer Dec 8, 2020
4d1f10a
added summary picture into index of docs
davidsebfischer Dec 8, 2020
44eda0d
fixed typo in docs
davidsebfischer Dec 8, 2020
c5199a9
updated summary panel
davidsebfischer Dec 8, 2020
7a05af8
add badges to readme and docs index
le-ander Dec 8, 2020
55beb98
Merge branch 'master' of https://github.com/theislab/sfaira
davidsebfischer Dec 8, 2020
3e77e72
updated summary panel (#20)
davidsebfischer Dec 8, 2020
10b360c
Merge branch 'master' of https://github.com/theislab/sfaira
davidsebfischer Dec 8, 2020
cae6067
Doc updates (#21)
davidsebfischer Dec 8, 2020
2e2f98c
Doc updates (#22)
davidsebfischer Dec 8, 2020
ef8f9e2
move from `import sfaira.api as sfaira` to `import sfaira`
le-ander Dec 9, 2020
ff2427f
add custom genomes to sfaira_extension
le-ander Dec 9, 2020
fc1463e
fix loading of custom topology versions from sfaira_extension
le-ander Dec 9, 2020
0bfaeaf
fix circular imports between sfaira_extension and sfaira
le-ander Dec 9, 2020
10a7998
fix dataloader
le-ander Dec 9, 2020
6ad19e3
fix celltype versioning through sfaira_extension
le-ander Dec 9, 2020
176142c
fix celltype versioning through sfaira_extension
le-ander Dec 9, 2020
8480a21
formatting
le-ander Dec 9, 2020
89a7b53
Merge branch 'master' of https://github.com/theislab/sfaira
davidsebfischer Dec 10, 2020
5df5db5
Doc updates (#25)
davidsebfischer Dec 10, 2020
631221e
Merge branch 'master' of https://github.com/theislab/sfaira
davidsebfischer Dec 10, 2020
1775a5e
Merge branch 'master' into dev
davidsebfischer Dec 10, 2020
29 changes: 23 additions & 6 deletions README.rst
@@ -1,14 +1,31 @@
Managing single-cell data sets and neural networks used for analysis
=====================================================================
|Stars| |PyPI| |PyPIDownloads|

.. image:: https://github.com/theislab/sfaira/blob/master/resources/images/concept.jpeg
:width: 600px
.. |Stars| image:: https://img.shields.io/github/stars/theislab/sfaira?logo=GitHub&color=yellow
:target: https://github.com/theislab/sfaira/stargazers
.. |PyPI| image:: https://img.shields.io/pypi/v/sfaira?logo=PyPI
:target: https://pypi.org/project/sfaira
.. |PyPIDownloads| image:: https://pepy.tech/badge/sfaira
:target: https://pepy.tech/project/sfaira


sfaira - data and model repository for single-cell data
=======================================================

.. image:: https://github.com/theislab/sfaira/blob/master/resources/images/concept.png
:width: 1000px
:align: center

sfaira_ is a model and a data repository in a single python package.
Its data API gives users access to streamlined data loaders that allow reproducible use of published and private data sets for model training and exploration.
Its model API gives users streamlined access to pre-trained models and to common model architectures to ease usage of neural networks in common single-cell analysis workflows.
We provide an interactive overview of the current state of the zoos on sfaira-site_.

Its data zoo gives users access to streamlined data loaders that allow reproducible use of published and private data sets for model training and exploration.
Its model zoo gives users streamlined access to pre-trained models and to common model architectures to ease usage of neural networks in common single-cell analysis workflows.
A model zoo is a software infrastructure that improves user access to pre-trained models which are separately published, such as DCA_ or scArches_.
Instead of focusing on developing new models, we focus on making models easily accessible to users and distributable by developers.
sfaira integrates into scanpy_ workflows.

.. _scanpy: https://github.com/theislab/scanpy
.. _sfaira: https://sfaira.readthedocs.io
.. _DCA: https://github.com/theislab/dca
.. _scArches: https://github.com/theislab/scarches
.. _sfaira-site: https://theislab.github.io/sfaira-site/index.html
154 changes: 154 additions & 0 deletions docs/api/index.rst
@@ -0,0 +1,154 @@
.. module:: sfaira
.. automodule:: sfaira
:noindex:

API
===

Import sfaira as::

import sfaira



Data: `data`
------------

.. module:: sfaira.data
.. currentmodule:: sfaira

The sfaira data zoo API.


Pre-defined data set collections
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This sub-module gives you access to curated subsets of the data zoo, e.g. all data sets from human lungs.

.. autosummary::
:toctree: .

data.human
data.mouse


Functionalities for defining data set collections
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This sub-module gives you access to functionalities you need to define your own data set collections based on the sfaira data zoo.

.. autosummary::
:toctree: .

data.DatasetBase
data.DatasetGroupBase
data.DatasetSuperGroup


Functionalities for interactive data analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This sub-module gives you access to functionalities you need to load new data into the data zoo on the fly, so that a raw data set can be handled in the context of zoo data sets.

.. autosummary::
:toctree: .

data.DatasetInteractive


Genomes: `genomes`
------------------

.. module:: sfaira.genomes
.. currentmodule:: sfaira

This sub-module gives you access to properties of the genome representations used in sfaira.

.. autosummary::
:toctree: .

genomes.ExtractFeatureListEnsemble


Models: `models`
----------------

.. module:: sfaira.models
.. currentmodule:: sfaira

The sfaira model zoo API for advanced use.
This API is structured by streamlined, task-specific APIs for specific analysis problems.
This API is targeted at developers; see also `ui` for a user-centric API that wraps this model zoo.


Cell-type predictor models
~~~~~~~~~~~~~~~~~~~~~~~~~~

This sub-module handles models that predict cell types.

.. autosummary::
:toctree: .

models.celltype


Embedding models
~~~~~~~~~~~~~~~~

This sub-module handles models that embed expression vectors (cells) into a latent space.

.. autosummary::
:toctree: .

models.embedding


Train: `train`
--------------

.. module:: sfaira.train
.. currentmodule:: sfaira

The interface for training sfaira compatible models.
This is a sub-module dedicated for developers to ease model training and deployment.

Trainer classes
~~~~~~~~~~~~~~~

Trainer classes wrap estimator classes (which wrap model classes) and handle grid-search specific tasks centred on model fits,
such as saving evaluation metrics and model weights.

.. autosummary::
:toctree: .

train.TargetZoos
train.TrainModelCelltype
train.TrainModelEmbedding


Grid search summary classes
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Grid search summary classes allow a developer to easily interact with a finished grid search by loading and summarising results,
which were saved through Trainer classes.

.. autosummary::
:toctree: .

train.GridsearchContainer
train.SummarizeGridsearchCelltype
train.SummarizeGridsearchEmbedding

User interface: `ui`
--------------------

.. module:: sfaira.ui
.. currentmodule:: sfaira

This sub-module gives users access to the model zoo, including model query from remote servers.
This API is designed to be used in analysis workflows and does not require any understanding of the way models are defined and stored.

.. autosummary::
:toctree: .

ui.UserInterface
116 changes: 92 additions & 24 deletions docs/data.rst
@@ -1,22 +1,46 @@
Data
======

.. image:: https://raw.githubusercontent.com/theislab/sfaira/master/resources/images/data_zoo.png
:width: 600px
:align: center

Build data repository locally
------------------------------

Build a repository structure:
1. Choose a directory to dedicate to the data base, called root in the following.
2. Make subfolders in root for each organism for which you want to build a data base.
3. Make subfolders for each organ within each organism for which you want to build a data base.
Build a repository structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Choose a directory to dedicate to the data base, called root in the following.
2. Make subfolders in root for each organism for which you want to build a data base.
3. Make subfolders for each organ within each organism for which you want to build a data base.

We maintain a couple of download scripts that automatise this process, which have to be executed in a shell once to download specific subsets of the full data zoo.
These scripts can be found in sfaira.data.download_scripts.
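
The three steps above can also be carried out by hand. The following is a minimal sketch, assuming a root/organism/organ layout as described; the function name ``build_repository`` and the example organisms and organs are hypothetical and not part of sfaira.

```python
import os
import tempfile


def build_repository(root, layout):
    """Create root/<organism>/<organ> folders and return the created paths.

    `layout` maps each organism name to a list of organ names.
    This helper is illustrative only and is not part of sfaira.
    """
    created = []
    for organism, organs in layout.items():
        for organ in organs:
            path = os.path.join(root, organism, organ)
            os.makedirs(path, exist_ok=True)  # no error if the folder already exists
            created.append(path)
    return created


# Hypothetical selection of organisms and organs.
layout = {"human": ["lung", "eye"], "mouse": ["lung"]}
root = tempfile.mkdtemp()  # stand-in for the directory you dedicate to the data base
paths = build_repository(root, layout)
```

Using ``os.path.join`` rather than string concatenation keeps the paths portable across operating systems.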

Use 3rd party repositories
~~~~~~~~~~~~~~~~~~~~~~~~~~
Some organizations provide streamlined data objects that can be directly consumed by data zoos such as sfaira.
One example of such an organization is the cellxgene_ data portal.
Through these repositories, one can build or extend a collection of data sets that can be easily interfaced with sfaira.
Data loaders for cellxgene structured data objects will be available soon!
Contact us for support of any other repositories.

.. _cellxgene: https://cellxgene.cziscience.com/

Add data sets
~~~~~~~~~~~~~

Add data sets:
4. For each species and organ combination, choose the data sets that you want to use.
5. Identify the raw files as indicated in the data loader classes and copy them into the folder. If a data loader
requires processed data, apply the processing it describes first: This is usually done to speed up loading for file
formats that are difficult to access.
4. For each species and organ combination, choose the data sets that you want to use.
5. Identify the raw files as indicated in the data loader classes and copy them into the folder. If a data loader
requires processed data, apply the processing it describes first: This is usually done to speed up loading for file
formats that are difficult to access.

Data loaders
------------

Use data loaders on existing data repository
--------------------------------------------
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You only want to use data sets with existing data loaders and have adapted your directory structure as above?
In that case, you can immediately start using the data loader functions; you just need to supply the root directory
@@ -25,10 +49,8 @@ Depending on the functionalities you want to use, you need to create a directory
can be easily done via the data set API itself; example Python scripts are under benchmarks/data_preparation. This
meta information is necessary, for example, to anticipate file sizes for backing merged adata objects.

TODO example.

Contribute data loaders
-----------------------
~~~~~~~~~~~~~~~~~~~~~~~

Each data set (organism, organ, protocol, optionally also batches) has its own data loader class. Each such class is
in a separate file and inherits from a base class that contains most functionalities. Accordingly, the data loader class
@@ -74,7 +96,7 @@ before it is loaded into memory:
        if fn is None:
            if self.path is None:
                raise ValueError("provide either fn in load or path in constructor")
            fn = os.path.join(self.path, "human/eye/my_data.h5ad")  # defined file in streamlined directory structure
            fn = os.path.join(self.path, "human", "eye", "my_data.h5ad")  # defined file in streamlined directory structure
        self.adata = anndata.read(fn)  # loading instruction into .adata, use other ones if the data is not h5ad

        self.adata.uns["lab"] = x  # load the adata.uns with meta data
@@ -108,13 +130,59 @@ in which local data and cell type annotation can be managed separately but still
The data loaders and cell type annotation formats of sfaira and sfaira_extension are identical, so they can be easily
copied over.


Handling ontologies in data loaders
-----------------------------------

Each data loader has a versioned cell type annotation map, a dictionary.
This dictionary maps the cell type annotations that come with the raw form of the data set to the cell type
universe or ontology terms defined in sfaira. This mapping is only applied when the data is loaded (.load()).
The outcome of this map is a new set of cell type labels that can be propagated to leaf nodes of the ontology graph.
This dictionary requires a new entry for each new version of the corresponding cell type universe.
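
As a rough illustration of such a versioned map, the sketch below keys a dictionary of label translations by universe version. The variable ``class_maps``, the function ``map_celltypes``, and all labels are hypothetical; the actual attribute names and label vocabulary in sfaira data loaders may differ.

```python
# Hypothetical versioned annotation map: universe version -> {raw label: sfaira label}.
class_maps = {
    "0": {
        "T-cell CD4": "CD4+ T cell",
        "Tcell_CD8": "CD8+ T cell",
        "AT2": "type II pneumocyte",
    },
}


def map_celltypes(raw_labels, class_maps, version):
    """Translate raw annotations into the labels of one universe version.

    Raises KeyError if the version is unknown or a raw label has no entry,
    so that gaps in the map are noticed at load time.
    """
    mapping = class_maps[version]
    unmapped = sorted(set(raw_labels) - set(mapping))
    if unmapped:
        raise KeyError(f"no mapping for raw labels: {unmapped}")
    return [mapping[label] for label in raw_labels]


mapped = map_celltypes(["AT2", "T-cell CD4", "AT2"], class_maps, version="0")
```

Supporting a new universe version then amounts to adding one more top-level entry to the dictionary, while older entries stay untouched.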

Ontology management
-------------------

Sfaira maintains versioned cell type universes and ontologies by species and organ.
A cell type universe is a list of the unique, most fine-grained cell type definitions available.
These cell types can be referred to by a human readable cell type name or a structure identifier within an ontology,
an ontology ID.
Often, one is also interested in access to more coarse-grained groups of cell types, for example if the data quality
does not allow T cell subtypes to be distinguished.
To allow coarser type definitions, sfaira maintains hierarchies of cell types, in which each hierarchical level is again
defined by a cell type identifier.
Such a hierarchy can be written as a directed acyclic graph which has the cell type universe as its leaf nodes.
Intuitively, the cell type hierarchy graph depends on the cell type universe.
Accordingly, both are versioned together in sfaira:
Updates in the cell type universe, such as the discovery of a new cell type, lead to an update of the ontology and an
increment of both of their versions.
These versioned changes materialise as a distinct list (universe) and dictionary (ontology) for each version in the
file that harbors the species- and organ-specific class that inherits from CelltypeVersionsBase, and thus remain
available even after updates.
This versioning without deprecation of the old objects allows sfaira to execute and train models that were designed
for older cell type universes and thus ensures reproducibility.
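
The relationship between a versioned universe (list) and a versioned ontology (dictionary) can be sketched as follows. This is an illustration under assumptions, not the CelltypeVersionsBase interface: the names ``universe``, ``ontology``, ``validate`` and ``leaves_of`` and all cell type labels are hypothetical.

```python
# Hypothetical versioned universe: version -> list of leaf cell types.
universe = {
    "0": ["CD4+ T cell", "CD8+ T cell", "B cell"],
    "1": ["CD4+ T cell", "CD8+ T cell", "B cell", "NK cell"],  # new cell type added
}

# Hypothetical versioned ontology: version -> {coarse term: set of leaf cell types}.
ontology = {
    "0": {
        "T cell": {"CD4+ T cell", "CD8+ T cell"},
        "lymphocyte": {"CD4+ T cell", "CD8+ T cell", "B cell"},
    },
    "1": {
        "T cell": {"CD4+ T cell", "CD8+ T cell"},
        "lymphocyte": {"CD4+ T cell", "CD8+ T cell", "B cell", "NK cell"},
    },
}


def validate(universe, ontology):
    """Check that every version has unique leaves and that each coarse term
    is defined only in terms of leaves of the same version."""
    for version, leaves in universe.items():
        if len(leaves) != len(set(leaves)):
            raise ValueError(f"duplicate leaves in universe version {version}")
        for term, members in ontology[version].items():
            missing = members - set(leaves)
            if missing:
                raise ValueError(f"{term} (version {version}) uses unknown leaves: {missing}")


def leaves_of(term, version):
    """Return the leaf cell types that a leaf or coarse term stands for."""
    if term in universe[version]:
        return {term}
    return set(ontology[version][term])


validate(universe, ontology)
```

Because older versions stay in place, a model trained against version "0" can still resolve its labels after version "1" has been added.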

Contribute cell types to ontologies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To contribute new cell types or change existing cell type universe entries, increment the cell type universe version
and add the new entry to the list or modify the existing entry in the list.
We do not increment the universe version if a change does not influence the identity of a leaf node with respect to
the other types in the universe, i.e. if it simply changes the spelling of a cell type or if an ontology ID is added to
a type that previously did not have one.

Contribute hierarchies to ontologies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To contribute a term to a cell type ontology, add a dictionary item that defines the new term as a set
of the leaf nodes (cell type universe) of the corresponding universe version.


Using ontologies to train cell type classifiers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cell type classifiers can be trained on data sets with different granularity of cell type annotation using aggregate
cross-entropy as a loss and aggregate accuracy as a metric.
The one-hot encoded cell type label matrix is modified accordingly in the estimator class during data loading if terms
that correspond to intermediate nodes (rather than leaf nodes) are encountered in the label set.
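
The text does not spell out the formulas, so the following is only one plausible reading: an intermediate term is encoded as a multi-hot row over its leaf cell types, the loss is the negative log of the predicted probability mass summed over those leaves, and a prediction counts as correct if the most probable leaf belongs to the labelled group. All names and labels here are hypothetical, and the sketch uses plain Python rather than the estimator classes of sfaira.

```python
import math

# Hypothetical leaf universe and coarse terms.
leaves = ["CD4+ T cell", "CD8+ T cell", "B cell"]
ontology = {"T cell": {"CD4+ T cell", "CD8+ T cell"}}


def encode_label(label):
    """Return a one-hot row for a leaf label, or a multi-hot row for a coarse term."""
    if label in leaves:
        members = {label}
    elif label in ontology:
        members = ontology[label]
    else:
        raise ValueError(f"unknown label: {label}")
    return [1.0 if leaf in members else 0.0 for leaf in leaves]


def aggregate_cross_entropy(y, p, eps=1e-10):
    """Negative log of the probability mass assigned to all permitted leaves."""
    return -math.log(sum(yi * pi for yi, pi in zip(y, p)) + eps)


def aggregate_accuracy(y, p):
    """1.0 if the most probable leaf is among the permitted leaves, else 0.0."""
    predicted = max(range(len(p)), key=lambda i: p[i])
    return 1.0 if y[predicted] == 1.0 else 0.0


y = encode_label("T cell")  # coarse label covering two leaves
p = [0.5, 0.3, 0.2]         # predicted leaf probabilities for one cell
loss = aggregate_cross_entropy(y, p)
acc = aggregate_accuracy(y, p)
```

For a leaf-level label the row is one-hot and the loss reduces to ordinary categorical cross-entropy, which is why coarse and fine annotations can be mixed in one training set under this formulation.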

Genome management
-----------------

We streamline feature spaces used by models by defining standardized gene sets that are used as model input.
By default, sfaira currently works with the protein coding genes of a genome assembly.
A model topology version includes the genome it was trained for, which also defines the features of this model as genes.
As genome assemblies are updated, model topology versions can be updated and models retrained to reflect these changes.
Note that because protein coding genes do not change drastically between genome assemblies,
samples can be carried over to assemblies they were not aligned against by matching gene identifiers.
Sfaira automatically tries to overlap gene identifiers with the genome assembly selected through the current model.
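
Matching by gene identifier can be pictured with the sketch below. It assumes that genes missing from the sample are filled with zeros and that genes unknown to the model are dropped; both are illustrative choices, and the function ``align_to_model_genes`` is hypothetical rather than part of sfaira.

```python
def align_to_model_genes(expression, sample_gene_ids, model_gene_ids, fill_value=0.0):
    """Reorder one cell's expression vector into the model's feature space.

    Genes present in the model but absent from the sample receive `fill_value`
    (an assumption made for this sketch); genes the model does not know are dropped.
    """
    position = {gene: i for i, gene in enumerate(sample_gene_ids)}
    return [
        expression[position[gene]] if gene in position else fill_value
        for gene in model_gene_ids
    ]


# Illustrative Ensembl gene identifiers; the expression values are made up.
sample_gene_ids = ["ENSG00000141510", "ENSG00000012048", "ENSG00000139618"]
expression = [3.0, 0.0, 7.0]
model_gene_ids = ["ENSG00000139618", "ENSG00000141510", "ENSG00000157764"]

aligned = align_to_model_genes(expression, sample_gene_ids, model_gene_ids)
```

The output always has the length and order of the model's gene list, so the same model input layer can be used for samples aligned against different assemblies.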