neurips loader update (#472)

* neurips loader update
* added documentation of compressed and r file reading

Co-authored-by: davidsebfischer <david.seb.fischer@gmail.com>
theislab · Feb 8, 2022 · 1c6d08a · 1c6d08a
1 parent 07353d7
commit 1c6d08a
Showing 9 changed files with 225 additions and 28 deletions.
diff --git a/docs/adding_datasets.rst b/docs/adding_datasets.rst
@@ -242,6 +242,11 @@ Phase 1 is sub-structured into 2 sub-phases:
     all files associated with the current dataset.
     The CLI tells you how to continue from here, phase 1b) is always necessary, phase 2) is case-dependent and mistakes
     in naming the data folder in phase Pd) are flagged here.
+    As indicated at appropriate places by the CLI, some meta data are ontology constrained.
+    You should input symbols, i.e. readable words and not IDs in these places.
+    For example, the `.yaml` entry ``organ`` could be "lung", which is a symbol in the UBERON ontology,
+    whereas ``organ_obs_key`` could be any string pointing to a column in the ``.obs`` in the ``anndata`` instance
+    that is output by ``load()``, where the elements of the column are then mapped to UBERON terms in phase 2.
 
     1a-docker.
         .. code-block::
@@ -259,13 +264,40 @@ Phase 1 is sub-structured into 2 sub-phases:
             sfaira create-dataloader --path-data DATA_DIR
         ..
 1b. Manual completion of created files (manual).
-    1. Correct yaml file.
+    1. Correct the `.yaml` file.
         Correct errors in `<path_loader>/<DOI-name>/ID.yaml` file and add
         further attributes you may have forgotten in step 2.
         See :ref:`sec-multiple-files` for short-cuts if you have multiple data sets.
         This step can be skipped if the `.yaml` is complete after phase 1a).
-    2. Write load function.
-        Complete the `load()` function in `<path_loader>/<DOI-name>/ID.py`.
+        Note on lists and dictionaries in the yaml file format:
+        Sometimes, you need to write a list in yaml, e.g. because you have multiple data URLs.
+        A list looks as follows:
+        .. code-block::
+
+                # Single URL:
+                download_url_data: "URL1"
+                # Two URLs:
+                download_url_data:
+                    - "URL1"
+                    - "URL2"
+        ..
+        As suggested in this example, do not use lists of length 1.
+        In contrast, you may need to map each entry of ``sample_fns`` to a meta data value in multi-file loaders:
+        .. code-block::
+
+                sample_fns:
+                    - "FN1"
+                    - "FN2"
+                [...]
+                assay_sc:
+                    FN1: 10x 3' v2
+                    FN2: 10x 3' v3
+        ..
+        Take particular care with the usage of quotes and ":" when using maps as outlined in this example.
+    2. Complete the load function.
+        Complete the ``load()`` function in `<path_loader>/<DOI-name>/ID.py`.
+        If you need to read compressed files directly from python, consider our guide :ref:`sec-reading-compressed-files`.
+        If you need to read R files directly from python, consider our guide :ref:`sec-reading-r-files`.
 
 Phase 2: annotate
 ~~~~~~~~~~~~~~~~~~~
@@ -341,6 +373,9 @@ Phase 2 is sub-structured into 2 sub-phases:
     If you accidentally replace it with `" "`, you will receive errors in phase 3, so do a visual check after finishing
     your work on each `ID*.tsv` file.
 
+    Note 3: Perfect matches are filled without further suggestions;
+    you can often leave these rows as they are after a brief sanity check.
+
 .. _OLS: https://www.ebi.ac.uk/ols/ontologies/cl
 
 Phase 3: finalize
@@ -596,6 +631,83 @@ You can use any combination of orthogonal meta data, e.g. organ and disease anno
     which are all direct outputs of V(D)J alignment pipelines and are stored in ``.obs``.
     These features are documented in :ref:`feature-wise`.
 
+.. _sec-reading-compressed-files:
+Reading compressed files
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This is a collection of code snippets that can be used in the ``load()`` function to read compressed download files.
+See also the anndata_ and scanpy_ IO documentation.
+
+- Read a .gz compressed .mtx (.mtx.gz):
+    Note that this often occurs in cellranger output, for which there is a scanpy load function that
+    applies to data of the following structure: ``./PREFIX_matrix.mtx.gz``, ``./PREFIX_barcodes.tsv.gz``, and
+    ``./PREFIX_features.tsv.gz``. This can be read as:
+
+.. code-block:: python
+
+    import scanpy
+    adata = scanpy.read_10x_mtx("./", prefix="PREFIX_")
+..
+- Read from within a .gz archive (.gz):
+    Note: this requires temporary files, so avoid if read_function can read directly from .gz.
+
+.. code-block:: python
+
+    import gzip
+    from tempfile import TemporaryDirectory
+    import shutil
+    # Insert the file type as a string here so that read_function recognizes the decompressed file:
+    uncompressed_file_type = ""
+    with TemporaryDirectory() as tmpdir:
+        tmppth = tmpdir + f"/decompressed.{uncompressed_file_type}"
+        with gzip.open(fn, "rb") as input_f, open(tmppth, "wb") as output_f:
+            shutil.copyfileobj(input_f, output_f)
+        x = read_function(tmppth)
+..
+
+- Read from within a .tar archive (.tar.gz):
+    It is often useful to decompress the tar archive once manually to understand its internal directory structure.
+    Let's assume you are interested in a file ``fn_target`` within a tar archive ``fn_tar``,
+    i.e. after decompressing the tar archive, the path is ``<fn_tar>/<fn_target>``.
+
+.. code-block:: python
+
+    import pandas
+    import tarfile
+    with tarfile.open(fn_tar) as tar:
+        # Access files in archive with tar.extractfile(fn_target), e.g.
+        tab = pandas.read_csv(tar.extractfile(fn_target))
+..
+
+.. _anndata: https://anndata.readthedocs.io/en/latest/api.html#reading
+.. _scanpy: https://scanpy.readthedocs.io/en/stable/api.html#reading
+
+.. _sec-reading-r-files:
+Reading R files
+~~~~~~~~~~~~~~~~
+
+Some studies deposit single-cell data in R language files, e.g. ``.rdata``, ``.Rds`` or Seurat objects.
+These objects can be read with python functions in sfaira using anndata2ri and rpy2.
+These modules allow you to run R code from within python:
+
+.. code-block:: python
+
+    def load(data_dir, **kwargs):
+        import os
+        import anndata2ri
+        from rpy2.robjects import r
+
+        fn = os.path.join(data_dir, "SOME_FILE.rdata")
+        anndata2ri.activate()
+        adata = r(
+            f"library(Seurat)\n"
+            f"load('{fn}')\n"
+            f"new_obj = CreateSeuratObject(counts = tissue@raw.data)\n"
+            f"new_obj@meta.data = tissue@meta.data\n"
+            f"as.SingleCellExperiment(new_obj)\n"
+        )
+        return adata
+..
+
 
 Loading third party annotation
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -631,6 +743,7 @@ Here an example of a `.py` file with additional annotation:
         "meta_study_y": load_annotation_meta_study_y,
     }
 
+..
 
 The table returned by `load_annotation_meta_study_x` needs to be indexed with the observation names used in `.adata`,
 the object generated in `load()`.
@@ -748,6 +861,11 @@ Note that in both cases the value, or the column values, have to fulfill constra
 - feature_reference and feature_reference_var_key [string]
     The genome annotation release that was used to quantify the features presented here,
     e.g. "Homo_sapiens.GRCh38.105".
+    You can find all ENSEMBL gtf files on the ensembl_ ftp server.
+    Here, you'll find a summary of the gtf files by release, e.g. for 105_.
+    You will find a list across organisms for this release; the reference name is the name of the gtf file that
+    ends in ``.RELEASE.gtf.gz`` under the corresponding organism, without the ``.gtf.gz`` suffix.
+    For homo_sapiens_ and release 105, this yields the following reference name "Homo_sapiens.GRCh38.105".
 - feature_type and feature_type_var_key {"rna", "protein", "peak"}
     The type of a feature:
 
@@ -758,6 +876,10 @@ Note that in both cases the value, or the column values, have to fulfill constra
     - "peak": chromatin accessibility by peak
         e.g. from scATAC-seq
 
+.. _ensembl: http://ftp.ensembl.org/pub/
+.. _105: http://ftp.ensembl.org/pub/release-105/gtf/
+.. _homo_sapiens: http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/
+
 .. _sec-dataset-or-observation-wise:
 Dataset- or observation-wise
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -824,8 +946,11 @@ outlined below.
     The UBERON_ label of the sample.
     This meta data item ontology is for tissue or organ identifiers from UBERON.
 - organism and organism_obs_key. [ontology term]
-    The NCBItaxon_ label of the sample.
+    The NCBItaxon_ label of the main organism sampled here.
+    For a data matrix of an infection sample aligned against a human and virus joint reference genome,
+    this would be "Homo sapiens" as it is the "main organism" in this case.
     For example, "Homo sapiens" or "Mus musculus".
+    See also the documentation of feature_reference to see which organisms are supported.
 - primary_data [bool]
     Whether the dataset contains cells that were measured in this study (i.e. this is not a meta study on published data).
 - sample_source and sample_source_obs_key. {"primary_tissue", "2d_culture", "3d_culture", "tumor"}
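
Note on the tar snippet documented in this file: a minimal, self-contained sketch of reading one member of a `.tar.gz` archive without unpacking it to disk. The archive is built in memory only so that the example runs on its own. The member name `study/obs.csv`, the table contents, and both helper functions are hypothetical, and the standard-library `csv` module stands in for `pandas.read_csv`.

```python
import csv
import io
import tarfile


def make_demo_tar(fn_target: str, payload: bytes) -> bytes:
    """Build a small gzip-compressed tar archive in memory (demo helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=fn_target)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_table_from_tar(tar_bytes: bytes, fn_target: str) -> list:
    """Read a CSV member directly from the archive and return its rows as dicts."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        member = tar.extractfile(fn_target)  # binary file-like object
        text = io.TextIOWrapper(member, encoding="utf-8")
        # Materialize the rows before the archive is closed.
        return list(csv.DictReader(text))


archive = make_demo_tar("study/obs.csv", b"cell,cluster\nc1,B cell\nc2,monocyte\n")
rows = read_table_from_tar(archive, "study/obs.csv")
```

In a loader, `tarfile.open(fn_tar)` on the downloaded file would replace the in-memory archive, and the member path has to match the internal directory structure of the archive.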

diff --git a/sfaira/commands/create_dataloader.py b/sfaira/commands/create_dataloader.py
@@ -315,7 +315,10 @@ def format_q_mat_key(attr) -> str:
               'If these meta data vary across cells in a data set, skip them here and annotate them in the next '
               'section. '
               'These items can also later be modified manually in the .yaml which has the same '
-              'effect as setting them here.')
+              'effect as setting them here. '
+              'A lot of these meta data are ontology constrained. '
+              'You should input symbols, i.e. readable words and not IDs here. '
+              'You can look up term symbols here https://www.ebi.ac.uk/ols/index.')
 
         def format_q_uns_key(attr, onto) -> str:
             return f"Dataset-wide {attr} annotation (from {onto})"
@@ -324,14 +327,18 @@ def format_q_uns_key(attr, onto) -> str:
             function='text',
             question=format_q_uns_key("assay", "EFO"),
             default='')
+        self.template_attributes.cell_type = sfaira_questionary(
+            function='text',
+            question=format_q_uns_key("cell type", "CL (Cell ontology)"),
+            default='')
         self.template_attributes.development_stage = sfaira_questionary(
             function='text',
             question=format_q_uns_key("developmental stage", "hsapdv for human, mmusdv for mouse"),
             default='')
         self.template_attributes.disease = sfaira_questionary(
             function='text',
             question=format_q_uns_key("disease", "MONDO"),
-            default='healthy')
+            default='')
         self.template_attributes.ethnicity = sfaira_questionary(
             function='text',
             question=format_q_uns_key("ethnicity", "HANCESTRO for human, skip for non-human"),
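
Note on multi-line messages such as the help text in this hunk: Python joins adjacent string literals without inserting whitespace, so each fragment needs its own trailing space. A minimal sketch (the variable names are illustrative and not part of sfaira):

```python
# Adjacent string literals are concatenated with nothing in between.
without_spaces = (
    'A lot of these meta data are ontology constrained.'
    'You should input symbols, i.e. readable words and not IDs here.'
)

# A trailing space on every fragment but the last keeps sentences apart.
with_spaces = (
    'A lot of these meta data are ontology constrained. '
    'You should input symbols, i.e. readable words and not IDs here.'
)
```

The first string renders as "...constrained.You should...", the second as "...constrained. You should...".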

diff --git a/sfaira/commands/test_dataloader.py b/sfaira/commands/test_dataloader.py
@@ -4,7 +4,6 @@
 import shutil
 import pydoc
 
-from rich import print
 from sfaira.consts.utils import clean_doi
 from sfaira.data import DatasetGroupDirectoryOriented
 

diff --git a/...dataloaders/loaders/d10_1126_sciimmunol_abd1554/human_blood_2020_10xsequencing_lee_001.py b/...dataloaders/loaders/d10_1126_sciimmunol_abd1554/human_blood_2020_10xsequencing_lee_001.py
@@ -1,6 +1,6 @@
 import os
 import pandas as pd
-import scanpy
+import scanpy as sc
 
 # This is provided in plain text on GEO
 sample_map = {"Sample1": "nCoV 1 scRNA-seq",
@@ -27,7 +27,7 @@
 
 
 def load(data_dir, sample_fn, **kwargs):
-    adata = scanpy.read_10x_mtx(data_dir, prefix="GSE149689_")
+    adata = sc.read_10x_mtx(data_dir, prefix="GSE149689_")
     adata.obs["sample"] = "Sample" + adata.obs.index.str.split("-").str[1]
     adata.obs["GEO upload info"] = adata.obs["sample"].map(sample_map)
 

diff --git a/sfaira/data/dataloaders/loaders/dno_doi_luecken/homosapiens_blood_2021_10x3v3_luecken_001.py b/sfaira/data/dataloaders/loaders/dno_doi_luecken/homosapiens_blood_2021_10x3v3_luecken_001.py
@@ -1,12 +1,19 @@
 import anndata
+import gzip
 import os
+import shutil
+from tempfile import TemporaryDirectory
 
 
 def load(data_dir, sample_fn, **kwargs):
     fn = os.path.join(data_dir, sample_fn)
-    adata = anndata.read(fn)
-    adata.X = adata.layers["counts"]
-    adata.obs["donor"] = ["d" + x.split("d")[1] for x in adata.obs["batch"].values]
-    adata.obs["site"] = [x.split("d")[0] for x in adata.obs["batch"].values]
-
+    with TemporaryDirectory() as tmpdir:
+        tmppth = tmpdir + "/decompressed.h5ad"
+        with gzip.open(fn, "rb") as input_f, open(tmppth, "wb") as output_f:
+            shutil.copyfileobj(input_f, output_f)
+        adata = anndata.read_h5ad(tmppth)
+    adata.var["feature_types"] = [
+        {"ATAC": "peak", "GEX": "rna", "ADT": "protein"}[x]
+        for x in adata.var["feature_types"].values
+    ]
     return adata
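
Note on the decompression pattern used in this loader: the same steps can be wrapped in a small helper. This is a sketch under assumptions, not part of sfaira: `read_gz_via_tempfile` and `read_text` are hypothetical names, and the reader passed in is assumed to load the file fully into memory, because the temporary directory is deleted as soon as the `with` block exits.

```python
import gzip
import os
import shutil
from tempfile import TemporaryDirectory


def read_gz_via_tempfile(fn, read_function, suffix):
    """Decompress `fn` into a temporary file and pass its path to `read_function`."""
    with TemporaryDirectory() as tmpdir:
        # The suffix lets readers that dispatch on file extension recognize the file.
        tmppth = os.path.join(tmpdir, f"decompressed.{suffix}")
        with gzip.open(fn, "rb") as input_f, open(tmppth, "wb") as output_f:
            shutil.copyfileobj(input_f, output_f)
        # Read inside the `with` block: the directory is removed on exit.
        return read_function(tmppth)


def read_text(path):
    """Stand-in reader for the demo; a loader would pass its own read function."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Demo: write a small gzip file, then read it back through the helper.
with TemporaryDirectory() as demo_dir:
    demo_fn = os.path.join(demo_dir, "demo.txt.gz")
    with gzip.open(demo_fn, "wt", encoding="utf-8") as f:
        f.write("counts\n")
    content = read_gz_via_tempfile(demo_fn, read_text, "txt")
```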
diff --git a/...a/data/dataloaders/loaders/dno_doi_luecken/homosapiens_blood_2021_10x3v3_luecken_001.yaml b/...a/data/dataloaders/loaders/dno_doi_luecken/homosapiens_blood_2021_10x3v3_luecken_001.yaml
@@ -1,34 +1,36 @@
 dataset_structure:
     dataset_index: 1
     sample_fns:
-        - "cite/cite_gex_processed_training.h5ad"
-        - "multiome/multiome_gex_processed_training.h5ad"
+        - "GSE194122_openproblems_neurips2021_cite_BMMC_processed.h5ad.gz"
+        - "GSE194122_openproblems_neurips2021_multiome_BMMC_processed.h5ad.gz"
 dataset_wise:
     author: "Luecken, Malte"
-    default_embedding:
+    default_embedding: "GEX_X_umap"
     doi_preprint:
     doi_journal: "no_doi_luecken"
-    download_url_data: "s3://openproblems-bio/public/explore"
+    download_url_data:
+        - "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE194122&format=file&file=GSE194122%5Fopenproblems%5Fneurips2021%5Fcite%5FBMMC%5Fprocessed%2Eh5ad%2Egz"
+        - "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE194122&format=file&file=GSE194122%5Fopenproblems%5Fneurips2021%5Fmultiome%5FBMMC%5Fprocessed%2Eh5ad%2Egz"
     download_url_meta:
     primary_data: True
     year: 2021
 layers:
-    layer_counts: "X"
-    layer_processed:
+    layer_counts: "counts"
+    layer_processed: "X"
     layer_spliced_counts:
     layer_spliced_processed:
     layer_unspliced_counts:
     layer_unspliced_processed:
     layer_velocity:
 dataset_or_feature_wise:
-    feature_reference:
+    feature_reference: "Homo_sapiens.GRCh38.98"
     feature_reference_var_key:
-    feature_type: "rna"
-    feature_type_var_key:
+    feature_type:
+    feature_type_var_key: "feature_types"
 dataset_or_observation_wise:
     assay_sc:
-        cite/cite_gex_processed_training.h5ad: "10x 3' v3"
-        multiome/multiome_gex_processed_training.h5ad: "10x 3' v3"
+        GSE194122_openproblems_neurips2021_cite_BMMC_processed.h5ad.gz: "CITE-seq (cell surface protein profiling)"
+        GSE194122_openproblems_neurips2021_multiome_BMMC_processed.h5ad.gz: "10x multiome"
     assay_sc_obs_key:
     assay_differentiation:
     assay_differentiation_obs_key:
@@ -45,7 +47,7 @@ dataset_or_observation_wise:
     disease: "healthy"
     disease_obs_key:
     ethnicity:
-    ethnicity_obs_key:
+    ethnicity_obs_key: "Ethnicity"
     gm:
     gm_obs_key:
     individual:
@@ -57,11 +59,11 @@ dataset_or_observation_wise:
     sample_source: "primary_tissue"
     sample_source_obs_key:
     sex:
-    sex_obs_key:
+    sex_obs_key: "DonorGender"
     source_doi:
     source_doi_obs_key:
     state_exact:
-    state_exact_obs_key:
+    state_exact_obs_key: "DonorSmoker"
     tech_sample:
     tech_sample_obs_key: "site*donor"
     treatment:
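
Note on per-file maps such as `assay_sc` in this `.yaml`: every key of the map has to match an entry of `sample_fns` exactly, which is easy to get wrong with long file names. The check below is a hypothetical helper, not a sfaira function, and it operates on plain Python objects as they would look after parsing the yaml; sfaira may perform its own validation.

```python
def unmapped_sample_fns(sample_fns, per_file_map):
    """Return the sample file names that have no entry in a per-file meta data map."""
    return [fn for fn in sample_fns if fn not in per_file_map]


sample_fns = ["FN1", "FN2"]

# Complete map: every sample file has a value.
assay_sc = {"FN1": "10x 3' v2", "FN2": "10x 3' v3"}
missing = unmapped_sample_fns(sample_fns, assay_sc)

# A typo in one key ("FN_2") leaves "FN2" without a value.
assay_sc_typo = {"FN1": "10x 3' v2", "FN_2": "10x 3' v3"}
missing_after_typo = unmapped_sample_fns(sample_fns, assay_sc_typo)
```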

diff --git a/...taloaders/loaders/dno_doi_luecken/homosapiens_blood_2021_10x3v3_luecken_001_cell_type.tsv b/...taloaders/loaders/dno_doi_luecken/homosapiens_blood_2021_10x3v3_luecken_001_cell_type.tsv
@@ -0,0 +1,51 @@
+source	target	target_id
+B1 B	B cell	CL:0000236
+B1 B IGKC+	B cell	CL:0000236
+B1 B IGKC-	B cell	CL:0000236
+CD14+ Mono	monocyte	CL:0000576
+CD16+ Mono	monocyte	CL:0000576
+CD4+ T CD314+ CD45RA+	CD4-positive, alpha-beta T cell	CL:0000624
+CD4+ T activated	activated CD4-positive, alpha-beta T cell	CL:0000896
+CD4+ T activated integrinB7+	activated CD4-positive, alpha-beta T cell	CL:0000896
+CD4+ T naive	CD4-positive, alpha-beta T cell	CL:0000624
+CD8+ T	CD8-positive, alpha-beta T cell	CL:0000625
+CD8+ T CD49f+	CD8-positive, alpha-beta T cell	CL:0000625
+CD8+ T CD57+ CD45RA+	CD8-positive, alpha-beta T cell	CL:0000625
+CD8+ T CD57+ CD45RO+	CD8-positive, alpha-beta T cell	CL:0000625
+CD8+ T CD69+ CD45RA+	CD8-positive, alpha-beta T cell	CL:0000625
+CD8+ T CD69+ CD45RO+	CD8-positive, alpha-beta T cell	CL:0000625
+CD8+ T TIGIT+ CD45RA+	CD8-positive, alpha-beta T cell	CL:0000625
+CD8+ T TIGIT+ CD45RO+	CD8-positive, alpha-beta T cell	CL:0000625
+CD8+ T naive	CD8-positive, alpha-beta T cell	CL:0000625
+CD8+ T naive CD127+ CD26- CD101-	CD8-positive, alpha-beta T cell	CL:0000625
+Erythroblast	erythroblast	CL:0000765
+G/M prog	granulocyte monocyte progenitor cell	CL:0000557
+HSC	hematopoietic stem cell	CL:0000037
+ID2-hi myeloid prog	common myeloid progenitor	CL:0000049
+ILC	lymphocyte	CL:0000542
+ILC1	lymphocyte	CL:0000542
+Lymph prog	early lymphoid progenitor	CL:0000936
+MAIT	mucosal invariant T cell	CL:0000940
+MK/E prog	megakaryocyte-erythroid progenitor cell	CL:0000050
+NK	natural killer cell	CL:0000623
+NK CD158e1+	natural killer cell	CL:0000623
+Naive CD20+ B	naive B cell	CL:0000788
+Naive CD20+ B IGKC+	naive B cell	CL:0000788
+Naive CD20+ B IGKC-	naive B cell	CL:0000788
+Normoblast	erythroblast	CL:0000765
+Plasma cell	plasma cell	CL:0000786
+Plasma cell IGKC+	plasma cell	CL:0000786
+Plasma cell IGKC-	plasma cell	CL:0000786
+Plasmablast IGKC+	plasmablast	CL:0000980
+Plasmablast IGKC-	plasmablast	CL:0000980
+Proerythroblast	proerythroblast	CL:0000547
+Reticulocyte	reticulocyte	CL:0000558
+T prog cycling	T cell	CL:0000084
+T reg	regulatory T cell	CL:0000815
+Transitional B	transitional stage B cell	CL:0000818
+cDC1	conventional dendritic cell	CL:0000990
+cDC2	conventional dendritic cell	CL:0000990
+dnT	double negative thymocyte	CL:0002489
+gdT	gamma-delta T cell	CL:0000798
+gdT	gamma-delta T cell	CL:0000798
+pDC	plasmacytoid dendritic cell	CL:0000784
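
Note on the table above: the `gdT` row appears twice with identical content. A small check like the following could flag repeated `source` labels in a cell type `.tsv` before committing. It is a sketch, not part of sfaira; `duplicated_sources` is a hypothetical name and the inline table is a shortened excerpt used only to keep the example self-contained.

```python
import csv
import io
from collections import Counter

# Shortened excerpt of a cell type mapping table (tab-separated).
tsv_text = (
    "source\ttarget\ttarget_id\n"
    "gdT\tgamma-delta T cell\tCL:0000798\n"
    "gdT\tgamma-delta T cell\tCL:0000798\n"
    "pDC\tplasmacytoid dendritic cell\tCL:0000784\n"
)


def duplicated_sources(text):
    """Return the sorted `source` labels that occur more than once."""
    rows = csv.DictReader(io.StringIO(text), delimiter="\t")
    counts = Counter(row["source"] for row in rows)
    return sorted(label for label, n in counts.items() if n > 1)


dupes = duplicated_sources(tsv_text)
```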