Preparing release 0.3.5 (#325) (#356)
* Cellxgene export (#315)

* updated count rounding warning in streamlining
* improved meta data streamlining
* updated DOIs to distinguish preprint and journal

* CLI improvements #321 #314 (#332)

* add new adding datasets figure
* add sample_source
* renamed assay to assay_sc
* fix assay_sc template
* add cell_types_original_obs_key
* add sfaira annotate-dataloader hints

Signed-off-by: zethson <lukas.heumos@posteo.net>

* added lazy ontology loading in OCS (#334, #335)

* reassigned gamma cell in pancreas to pancreatic PP cell (CL:0002275) (#338)

- affects d10_1016_j_cmet_2016_08_020, d10_1016_j_cels_2016_08_011

* added new edge types (#341)

* Improve CLI documentation (#320)

* improved error reporting in annotate
* improved file not found reporting in annotate
* update template creation workflow
* fix doi prompting
* update download urls
* fix data path handling in CLI
* fix disease default in cli
* fix test-dataloader [skip ci]
* fix CI (#339)

Co-authored-by: david.seb.fischer <david.seb.fischer@gmail.com>
Co-authored-by: le-ander <20015434+le-ander@users.noreply.github.com>
Co-authored-by: Lukas Heumos <lukas.heumos@posteo.net>

* Feature/dao improvements (#318)

* updated rounding in cellxgene format export warning
* updated DOIs to distinguish preprint and journal
* fixed issue with ethnicity handling in cellxgene export
* reordered obs in cellxgene streamlining
* added store benchmark script
* added multi-organism store
* update doi setting in datasetinteractive
* added mock data for unit test
* added msle metric
* enabled in memory handling of h5ad backed store
* added infrastructure for ontology re-caching
* fixed all unit tests and optimised run time a bit

Co-authored-by: Abdul Moeed <abdulmoeed444@gmail.com>
Co-authored-by: le-ander <20015434+le-ander@users.noreply.github.com>

* store improvements (#346)

* improvements to store API
* added retrieval index sort to dask store
* fixed bug in single store generator if index input was None
* added sliced X and adata object emission to single store
* moved memory footprint into store base class
* fixed h5ad store indexing
* restructured meta data streamlining code (#347)
- includes a bug fix that led to missing meta data import from cellxgene structured data sets
- simplified meta data streamlining code and enhanced code readability
- deprecated the distinction between cell type and cell type original in the data set definition in favor of a single attribute
- allowed all ontology constrained meta data items to be supplied in any format (original + map, symbol, or id) via the `*_obs_col` attribute of the loader
- removed resetting of _obs_col attributes in streamlining in favor of AdataIds-controlled obs column names that extend to IDs and original labels
- updated cell type entry in all data loaders
* added attribute check for dictionary formatted attributes from YAML
* added processing of obs columns in cellxgene import
* extended error reporting in data loader discovery
* fixed value protection in meta data streamlining
* fixed cellxgene obs adapter
* added additional mock data set with little meta data annotation
* refactored cellxgene streamlining and added HANCESTRO support via EBI
* fixed handling of missing ethnicity ontology for mouse
* fixed EBI EFO backend
* ontology unit tests now check that ontologies can be downloaded
* added new generator interface, restructured batch index design interface and fixed adata uns merge in DatasetGroup (#351)
- Iterators for tf datasets and similar are now emitted as an instance of a class that has a property that emits the iterator. This class keeps a pointer to the data set that is iterated over in its attributes. Thus, if this instance stays in the namespace in which tensorflow uses the iterator, the iterator can be restarted without creating a new pointer. This had previously delayed training because tensorflow restarted the validation data set for each epoch, thus creating a new dask data set in each epoch at relatively high cost. See the sketch after this list.
- There is now only one iterator end point for stores (before there were base and balanced). The different index shuffling / sampling schedules are now refactored into functions and can be chosen based on string names. This makes creation and addition of new index schedules ("batch designs") easier.
- Direct conversion of adata objects in memory to a store is now supported via a new multi store class.
- Estimators no longer contain adata processing code but still accept adata next to store instances. The adata objects are directly converted to an adata store instance, though. All previous code related to adata processing in the estimators is deprecated.
- The interface of the store to the estimators is heavily simplified through the new generator interface of the store. The generator instances are placed in the train namespace for efficiency but not in the testing and evaluation namespaces, in which only a single pass over the data set is required.
* Added new batch index design code
- Batch schedules are now classes rather than functions.
- Introduced epoch-wise reshuffling of indices in the batch schedule: the reshuffling is achieved by transferring the schedule from a one-time function evaluation in the generator constructor to an evaluation of a schedule instance property that shuffles at the beginning of the iterator.
* Fixed balanced batch schedule.
* Added merging of shared uns fields in DatasetGroup so that uns streamlining is maintained across merge of adatas.
* passed empty store index validation
* passed zero length index processing in batch schedule
* allowed re-indexing of generator and batch schedule
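
A minimal sketch of the generator/batch-schedule pattern described above; all class and method names are illustrative only and are not sfaira's actual API:

    import numpy as np


    class BatchScheduleShuffled:
        """Batch schedule as a class: indices are reshuffled each time the design is requested."""

        def __init__(self, idx: np.ndarray, batch_size: int):
            self._idx = idx
            self.batch_size = batch_size

        @property
        def design(self):
            # Evaluated at the start of every pass over the data, so every epoch
            # sees a freshly shuffled index order (epoch-wise reshuffling).
            idx = self._idx.copy()
            np.random.shuffle(idx)
            return [idx[i:i + self.batch_size] for i in range(0, len(idx), self.batch_size)]


    class GeneratorContainer:
        """Keeps a pointer to the array it iterates over, so the iterator can be
        restarted (e.g. by tensorflow at every epoch) without rebuilding the data set."""

        def __init__(self, x, schedule: BatchScheduleShuffled):
            self.x = x  # e.g. a dask array backing the store
            self.schedule = schedule

        @property
        def iterator(self):
            def _iter():
                for batch_idx in self.schedule.design:
                    yield self.x[batch_idx, :]
            return _iter

If an instance of such a container stays in the training namespace, a framework like tensorflow can restart the emitted iterator at every epoch without re-creating the underlying dask array.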

* added uberon versioning (#354)

* added data life cycle rst (#355)

Co-authored-by: Lukas Heumos <lukas.heumos@posteo.net>
Co-authored-by: le-ander <20015434+le-ander@users.noreply.github.com>
Co-authored-by: Abdul Moeed <abdulmoeed444@gmail.com>
4 people authored Sep 7, 2021
1 parent dab9cb3 commit 6c4dbff
Showing 180 changed files with 5,818 additions and 3,100 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build_package.yml
@@ -31,7 +31,7 @@ jobs:
- name: Import sfaira
run: python -c "import sfaira"

# Verify that the package does adhere to PyPI's standards
# Verify that the package adheres to PyPI's standards
- name: Install required twine packaging dependencies
run: pip install setuptools wheel twine

6 changes: 2 additions & 4 deletions .github/workflows/create_templates.yml
@@ -9,7 +9,6 @@ jobs:
strategy:
matrix:
os: [ubuntu-latest]
python: [3.7, 3.8]
env:
PYTHONIOENCODING: utf-8

@@ -20,7 +19,7 @@ jobs:
- name: Setup Python
uses: actions/setup-python@v2.1.4
with:
python-version: ${{ matrix.python }}
python-version: 3.8

- name: Upgrade and install pip
run: python -m pip install --upgrade pip
@@ -30,6 +29,5 @@

- name: Create single_dataset template
run: |
cd ..
echo -e "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n" | sfaira create-dataloader
echo -e "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n" | sfaira create-dataloader
rm -rf d10_1000_j_journal_2021_01_001/
6 changes: 3 additions & 3 deletions .gitignore
@@ -5,9 +5,9 @@ cache/ontologies/cl/*
docs/api/

# Unit test temporary data:
sfaira/unit_tests/test_data_loaders/*
sfaira/unit_tests/test_data/*
sfaira/unit_tests/template_data/*
sfaira/unit_tests/data_for_testing/mock_data/store*
**cache
**temp

# General patterns:
git abuild
92 changes: 45 additions & 47 deletions docs/adding_datasets.rst
@@ -1,15 +1,18 @@
.. _adding_data_rst:

Contributing data
==================

For a high-level overview of data management in sfaira, read :ref:`data_life_cycle_rst` first.
Adding datasets to sfaira is a great way to increase the visibility of your dataset and to make it available to a large audience.
This process requires a couple of steps as outlined in the following sections.


.. figure:: https://user-images.githubusercontent.com/21954664/117845386-c6744a00-b280-11eb-9d86-8c47132a3949.png
.. figure:: https://user-images.githubusercontent.com/21954664/126300611-c5ba18b7-7c88-4bb1-8865-a20587cd5f7b.png
:alt: sfaira adding datasets

Overview of contributing dataloaders to sfaira. First, ensure that your data is not yet available as a dataloader.
Next, create a dataloader and validate it. Afterwards, annotate it to finally test it. Finally, submit your dataloader to sfaira.
Next, create a dataloader. Afterwards, validate/annotate it to finally test it. Finally, submit your dataloader to sfaira.

sfaira features an interactive way of creating, formatting and testing dataloaders through a command line interface (CLI).
The common workflow using the CLI looks as follows:
@@ -24,7 +27,7 @@ The common workflow using the CLI looks as follows:
preprint and publication DOIs if both are available.
We will also mention publication names in issues, you will however not find these in the code.

.. _code: https://github.com/theislab/sfaira/tree/dev
.. _code: https://github.com/theislab/sfaira/tree/dev/sfaira/data/dataloaders/loaders
.. _issues: https://github.com/theislab/sfaira/issues

2. Install sfaira.
@@ -43,93 +46,88 @@ The common workflow using the CLI looks as follows:
3. Create a new dataloader.
When creating a dataloader with ``sfaira create-dataloader`` dataloader specific attributes such as organ, organism
and many more are prompted for.
We provide a description of all meta data items at the bottom of this file.
We provide a description of all meta data items at the bottom of this page.
If the requested information is not available simply hit enter and continue until done.

.. code-block::
# make sure you are in the top-level sfaira directory from step 1
git checkout -b YOUR_BRANCH_NAME # create a new branch for your data loader.
sfaira create-dataloader
sfaira create-dataloader [--doi] [--path_loader] [--path_data]
The created files are created in the sfaira installation under `sfaira/data/dataloaders/loaders/--DOI-folder--`,
If `--doi` is not provided in the command above, the user will be prompted to enter it in the creation process.
If `--path-loader` is not provided the following default location will be used: `./sfaira/data/dataloaders/loaders/`.
If `--path-data` is not provided, the empty folder for the data files will be created in the following default location: `./sfaira/unit_tests/template_data/`.
The created files are created in the sfaira installation under `<path_loader>/--DOI-folder--`,
where the DOI-specific folder starts with `d` and is followed by the DOI in which all special characters are replaced
by `_`, below referred to as `--DOI-folder--`:

.. code-block::
├──sfaira/data/dataloaders/loaders/--DOI-folder--
├── <path_loader>/--DOI-folder--
├── extra_description.txt <- Optional extra description file
├── __init__.py
├── NA_NA_2021_NA_Einstein_001.py <- Contains the load function to load the data
├── NA_NA_2021_NA_Einstein_001.yaml <- Specifies all data loader data
├── <path_data>/--DOI-folder--
..
4. Correct yaml file.
Correct errors in `sfaira/data/dataloaders/loaders/--DOI-folder--/NA_NA_2021_NA_Einstein_001.yaml` file and add
Correct errors in `<path_loader>/--DOI-folder--/NA_NA_2021_NA_Einstein_001.yaml` file and add
further attributes you may have forgotten in step 2.
This step is optional.

5. Make downloaded data available to sfaira data loader testing.
Identify the raw files as indicated in the dataloader classes and copy them into your directory structure as
required by your data loader.
Note that this should be the exact files that are uploaded to cloud servers such as GEO:
Do not decompress these files ff these files are archives such as zip, tar or gz.
Identify the raw data files as indicated in the dataloader classes and copy them into the datafolder created by
the previous command (`<path_data>/--DOI-folder--/`).
Note that this should be the exact files that are downloadable from the download URL you provided in the dataloader.
Do not decompress these files if these files are archives such as zip, tar or gz.
Instead, navigate the archives directly in the load function (step 5).
Copy the data into `sfaira/unit_tests/template_data/--DOI-folder--/`.
Copy the data into `<path_data>/--DOI-folder--/`.
This folder is masked from git and only serves for temporarily using this data for loader testing.
After finishing loader contribution, you can delete this data again without any consequences for your loader.

6. Write load function.
Fill load function in `sfaira/data/dataloaders/loaders/--DOI-folder--NA_NA_2021_NA_Einstein_001.py`.

7. Validate the dataloader with the CLI.
Next validate the integrity of your dataloader content with ``sfaira validate-dataloader <path to *.yaml>``.
All tests must pass! If any of the tests fail please revisit your dataloader and add the missing information.

.. code-block::
Complete the load function in `<path_loader>/--DOI-folder--/NA_NA_2021_NA_Einstein_001.py`.

# make sure you are in the top-level sfaira directory from step 1
sfaira validate-dataloader <path>``
..
8. Create cell type annotation if your data set is annotated.
7. Create cell type annotation if your data set is annotated.
This function will run fuzzy string matching between the annotations in the metadata column you provided in the
`cell_types_original_obs_key` attribute of the yaml file and the Cell Ontology Database.
Note that this will abort with error if there are bugs in your data loader.

.. code-block::
# make sure you are in the top-level sfaira directory from step 1
# sfaira annotate <path>`` TODO
sfaira annotate-dataloader [--doi] [--path_loader] [--path_data]
..
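
To illustrate the kind of fuzzy string matching performed by this step, here is a minimal sketch using the fuzzywuzzy package (a sfaira dependency); the labels are hypothetical and this is not sfaira's actual implementation:

.. code-block::

    # Illustrative only: fuzzy matching of free-text cell type labels against
    # Cell Ontology term names, similar in spirit to what annotate-dataloader does.
    from fuzzywuzzy import process

    # Hypothetical free-text labels from the obs column named in `cell_types_original_obs_key`:
    original_labels = ["T cells", "B-cell", "alveolar macroph."]
    # Hypothetical subset of Cell Ontology term names:
    ontology_names = ["T cell", "B cell", "alveolar macrophage", "natural killer cell"]

    for label in original_labels:
        # Top 3 ontology candidates per free-text label, ranked by string similarity.
        print(label, process.extract(label, ontology_names, limit=3))
..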
9. Mitigate automated cell type maps.
Sfaira creates a cell type mapping `.tsv` file in the directory in which your data loaders is located if you
indicated that annotation is present by filling `cell_types_original_obs_key`.
This file is: `NA_NA_2021_NA_Einstein_001.tsv`.
8. Clean up the automated cell type maps.
Sfaira creates suggestions for cell type mapping in a `.tsv` file in the directory in which your data loader is
located if you indicated that annotation is present by filling `cell_types_original_obs_key`.
This file is: `<path_loader>/--DOI-folder--/NA_NA_2021_NA_Einstein_001.tsv`.
This file contains two columns with one row for each unique cell type label.
The free text identifiers are in the first column, "source",
and the corresponding ontology terms are in the second column, "target".
You can write this file entirely from scratch.
Sfaira also allows you to generate a first guess of this file using fuzzy string matching
which is automatically executed when you run the template data loader unit test for the first time with you new
loader.
Conflicts are not resolved in this first guess and you have to manually decide which free text field corresponds
to which ontology term in the case of conflicts.
Still, this first guess usually drastically speeds up this annotation harmonization.
Note that you do not have to include the non-human-readable IDs here as they are added later in a fully
After running the `annotate-dataloader` function, you can find a number of suggestions for matching the existing
celltype labels to cell labels from the cell ontology. It is now up to you to pick the best match from the
suggestions and delete all others from the line in the `.tsv` file. In certain cases the string matching might
not give the desired result. In such a case you can manually search the Cell Ontology database for the best
match via the OLS_ web-interface.
Note that you do not have to include the non-human-readable `target_id` here as they are added later in a fully
automated fashion.
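
A hypothetical example of what such a tab-separated `.tsv` file could look like after clean-up (labels and exact formatting are illustrative only):

.. code-block::

    source                  target
    T cells                 T cell
    B-cell                  B cell
    alveolar macroph.       alveolar macrophage
..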

10. Test data loader.
.. _OLS: https://www.ebi.ac.uk/ols/ontologies/cl
9. Test data loader.
Note that this will abort with error if there are bugs in your data loader.

.. code-block::
# make sure you are in the top-level sfaira directory from step 1
# sfaira test-dataloader <path>`` TODO
sfaira test-dataloader [--doi] [--path_loader] [--path_data]
..
11. Make loader public.
10. Make loader public.
You can contribute the data loader to public sfaira as code through a pull request.
Note that you can also just keep the data loader in your local installation or keep it in sfaira_extensions
if you do not want to make it public.
@@ -151,7 +149,7 @@ by `_`, below referred to as `--DOI-folder--`:
..
The following sections will first describe the underlying design principles of sfaira dataloaders and
then explain how to interactively create, validate and test dataloaders.
then explain how to interactively create, annotate and test dataloaders.


Writing dataloaders
@@ -185,7 +183,8 @@ before it is loaded into memory:
sample_fns:
dataset_wise:
author:
doi:
doi_preprint:
doi_journal:
download_url_data:
download_url_meta:
normalization:
@@ -254,9 +253,8 @@ In summary, the dataloader for a mouse lung data set could look like this:
sample_fns:
dataset_wise:
author: "me"
doi:
- "my preprint"
- "my peer-reviewed publication"
doi_preprint: "my preprint"
doi_journal: "my journal"
download_url_data: "my GEO upload"
download_url_meta:
normalization: "raw"
6 changes: 5 additions & 1 deletion docs/consuming_data.rst
@@ -1,10 +1,14 @@
Consuming Data
.. _consuming_data_rst:

Consuming data
===============

.. image:: https://raw.githubusercontent.com/theislab/sfaira/master/resources/images/data_zoo.png
:width: 600px
:align: center

For a high-level overview of data management in sfaira, read :ref:`data_life_cycle_rst` first.

Build data repository locally
------------------------------

33 changes: 33 additions & 0 deletions docs/data_life_cycle.rst
@@ -0,0 +1,33 @@
.. _data_life_cycle_rst:

The data life cycle
===================

The life cycle of a single-cell count matrix often looks as follows:

1. **Generation** from primary read data in a read alignment pipeline.
2. **Annotation** with cell types and sample meta data.
3. **Publication** of annotated data, often together with a manuscript.
4. **Curation** of this public data set for the purpose of a meta study. In a python workflow, this curation step could be a scanpy script based on data from step 3, for example.
5. **Usage** of data curated specifically for the use case at hand, for example for a targeted analysis or a training of a machine learning model.

where steps 1-3 are often only performed once by the original authors of the data set,
while steps 4 and 5 are repeated multiple times in the community for different meta studies.
Sfaira offers the following functionality groups that accelerate steps along this pipeline:

I) Data loaders
~~~~~~~~~~~~~~~
We maintain streamlined data loader code that improves **Curation** (step 4) and makes this step sharable and iteratively improvable.
Read more in our guide to data contribution :ref:`adding_data_rst`.

II) Dataset, DatasetGroup, DatasetSuperGroup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Using the data loaders from (I), we built an interface that can flexibly download, subset and curate data sets from the sfaira data zoo, thus improving **Usage** (step 5).
This interface can yield adata instances to be used in a scanpy pipeline, for example.
Read more in our guide to data consumption :ref:`consuming_data_rst`.

III) Stores
~~~~~~~~~~~
Using the streamlined data set collections from (II), we built a computationally efficient data interface for machine learning on such large distributed data set collections, thus improving **Usage** (step 5):
Specifically, this interface is optimised for out-of-core observation-centric indexing in scenarios that are typical to machine learning on single-cell data.
Read more in our guide to data stores :ref:`distributed_data_rst`.
42 changes: 42 additions & 0 deletions docs/distributed_data.rst
@@ -0,0 +1,42 @@
.. _distributed_data_rst:

Distributed data
================

For a high-level overview of data management in sfaira, read :ref:`data_life_cycle_rst` first.
Sfaira supports usage of distributed data for model training and execution.
The tools are summarized under `sfaira.data.store`.
In contrast to using an instance of AnnData in memory, these tools make it possible to use data sets that are saved
in different files (because they come from different studies) flexibly and out-of-core,
that is, without loading them into memory.
A general use case is the training of a model on a large set of data sets, subsetted by particular cell-wise meta
data, without creating a merged AnnData instance in memory first.

Build a distributed data repository
-----------------------------------

You can use the sfaira dataset API to write streamlined groups of adata instances to a particular disk location that
then is the store directory.
Some of the array backends used for loading stores, such as dask, can read arrays from cloud servers.
Therefore, these store directories can also be on cloud servers in some cases.
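
A minimal sketch of such a write step; the dataset-collection object and the `subset`, `load` and `write_distributed_store` calls below are assumptions for illustration and may differ from the actual sfaira API:

.. code-block::

    import sfaira

    # All names and arguments below are assumptions for illustration only.
    universe = sfaira.data.Universe(data_path="raw/", meta_path="meta/", cache_path="cache/")
    universe.subset(key="organism", values=["human"])  # subset by data set wise meta data
    universe.load()  # download and load the selected data sets
    # Write the streamlined adata instances into the store directory:
    universe.write_distributed_store(dir_cache="store/", store_format="dao")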

Reading from a distributed data repository
------------------------------------------

The core use-case is the consumption of data in batches from a python iterator (a "generator").
In contrast to using the full data matrix, this allows for workflows that never require the full data matrix in memory.
These generators can, for example, be used directly in tensorflow or pytorch stochastic mini-batch learning pipelines.
The core interface is `sfaira.data.load_store()` which can be used to initialise a store instance that exposes a
generator, for example.
An important concept in store reading is that the data sets are already streamlined on disk, which means that they have
the same feature space for example.
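
A minimal usage sketch built around the `sfaira.data.load_store()` entry point named above; the keyword arguments and the generator/iterator calls are assumptions for illustration and may differ from the actual API:

.. code-block::

    import sfaira

    # Keyword arguments and method names below are assumptions for illustration only.
    store = sfaira.data.load_store(cache_path="store/", store_format="dao")
    store.subset(attr_key="cell_type", values=["T cell"])  # cell-wise meta data subsetting
    gen = store.generator(batch_size=128)  # object exposing a python iterator over mini-batches
    for batch in gen.iterator():
        pass  # feed each mini-batch into a tensorflow or pytorch training loop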

Distributed access optimised (DAO) store
----------------------------------------

The DAO store format is an on-disk representation of single-cell data which is optimised for generator-based access and
distributed access.
In brief, DAO stores optimize memory consumption and data batch access speed.
Right now, we are using zarr and parquet; this may change in the future, and we will continue to work on this format under
the project name "dao".
Note that data sets represented as DAO on disk can still be read into AnnData instances in memory if you wish!
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -38,8 +38,10 @@ Latest additions
api
commandline_interface
tutorials
data_life_cycle
adding_datasets
consuming_data
distributed_data
models
ecosystem
roadmap
2 changes: 2 additions & 0 deletions requirements.txt
@@ -1,8 +1,10 @@
anndata>=0.7.6
crossref_commons
cellxgene-schema
dask
docutils
fuzzywuzzy
IPython
loompy
matplotlib
networkx