Master dev merge (#298)

Resolves conflicts from master -> dev
theislab · Jun 9, 2021 · 6e81f75
1 parent 680511c
commit 6e81f75

Showing 19 changed files with 941 additions and 14 deletions.
diff --git a/docs/adding_dataset_classes.rst b/docs/adding_dataset_classes.rst
@@ -0,0 +1,112 @@
+The class-based data loader python file
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+As an alternative to the preferred YAML-based data loaders, users can provide a data loader class together with the load function.
+In this scenario, metadata is described in the constructor of a class in the same Python file as the loading function. This file contains:
+
+1. A constructor of the following form that contains all relevant metadata that is available before the actual data set is loaded into memory.
+
+.. code-block:: python
+
+    def __init__(
+            self,
+            path: Union[str, None] = None,
+            meta_path: Union[str, None] = None,
+            cache_path: Union[str, None] = None,
+            **kwargs
+    ):
+        super().__init__(path=path, meta_path=meta_path, cache_path=cache_path, **kwargs)
+        # Data set meta data: You do not have to include all of these and can simply skip lines corresponding
+        # to attributes that you do not have access to. These are meta data on a sample level.
+        # The meta data attributes labeled with (*) may also be supplied per cell, see below.
+        # In this case, if you supply an .obs_key_* attribute, you can leave out the sample-wise attribute.
+
+        self.id = x  # unique identifier of data set (Organism_Organ_Year_AssaySc_NumberOfDataset_FirstAuthorLastname_doi).
+
+        self.author = x  # author (list) who sampled / created the data set
+        self.doi = x  # doi of data set accompanying manuscript
+
+        self.download_url_data = x  # download website(s) of data files
+        self.download_url_meta = x  # download website(s) of meta data files
+
+        self.assay_sc = x  # (*, optional) protocol used to sample data (e.g. smart-seq2)
+        self.assay_differentiation = x  # (*, optional) protocol used to differentiate the cell line (e.g. Lancaster, 2014)
+        self.assay_type_differentiation = x  # (*, optional) type of protocol used to differentiate the cell line (guided/unguided)
+        self.cell_line = x  # (*, optional) cell line used (for cell culture samples)
+        self.dev_stage = x  # (*, optional) developmental stage of organism
+        self.ethnicity = x  # (*, optional) ethnicity of sample
+        self.healthy = x  # (*, optional) whether sample represents a healthy organism
+        self.normalisation = x  # (optional) normalisation applied to raw data loaded (ideally counts, "raw")
+        self.organ = x  # (*, optional) organ (anatomical structure)
+        self.organism = x  # (*) species / organism
+        self.sample_source = x  # (*) whether the sample came from primary tissue or cell culture
+        self.sex = x  # (*, optional) sex
+        self.state_exact = x  # (*, optional) exact disease, treatment or perturbation state of sample
+        self.year = x  # year in which sample was acquired
+
+        # The following meta data may instead be supplied on a cell level if an appropriate column is present in the
+        # anndata instance (specifically in .obs) after loading.
+        # You need to make sure that this column is loaded in the loading script!
+        # See above for a description of what these meta data attributes mean.
+        # Again, if these attributes are not available, you can simply leave them out.
+        self.obs_key_assay_sc = x  # (optional, see above, do not provide if .assay_sc is provided)
+        self.obs_key_assay_differentiation = x  # (optional, see above, do not provide if .assay_differentiation is provided)
+        self.obs_key_assay_type_differentiation = x  # (optional, see above, do not provide if .assay_type_differentiation is provided)
+        self.obs_key_cell_line = x  # (optional, see above, do not provide if .cell_line is provided)
+        self.obs_key_dev_stage = x  # (optional, see above, do not provide if .dev_stage is provided)
+        self.obs_key_ethnicity = x  # (optional, see above, do not provide if .ethnicity is provided)
+        self.obs_key_healthy = x  # (optional, see above, do not provide if .healthy is provided)
+        self.obs_key_organ = x  # (optional, see above, do not provide if .organ is provided)
+        self.obs_key_organism = x  # (optional, see above, do not provide if .organism is provided)
+        self.obs_key_sample_source = x  # (optional, see above, do not provide if .sample_source is provided)
+        self.obs_key_sex = x  # (optional, see above, do not provide if .sex is provided)
+        self.obs_key_state_exact = x  # (optional, see above, do not provide if .state_exact is provided)
+        # Additionally, cell type annotation is ALWAYS provided per cell in .obs; this annotation is optional, though.
+        # Name of the column which contains the original cell type labels:
+        self.obs_key_cell_types_original = x  # (optional)
+        # This cell type annotation is free text but is mapped to an ontology via a .tsv file with the same name and
+        # directory as the python file of this data loader (see below).
+
+
+2. A function called to load the data set into memory.
+
+It is important to set an automated path indicating the location of the raw files here.
+We recommend that you define one directory for all raw files (``self.path``) and add a sub-directory to it
+named after ``self.directory_formatted_doi`` (i.e. the DOI with all special characters replaced by "_").
+Place the raw files directly into this sub-directory.
+
+.. code-block:: python
+
+    def load(data_dir, fn=None) -> anndata.AnnData:
+        fn = os.path.join(data_dir, "my.h5ad")
+        adata = anndata.read_h5ad(fn)  # load into adata; use another reader if the data is not h5ad
+        return adata
+
+In summary, a python file for a mouse lung data set could look like this:
+
+.. code-block:: python
+
+    class MyDataset(DatasetBase):
+        def __init__(
+                self,
+                path: Union[str, None] = None,
+                meta_path: Union[str, None] = None,
+                cache_path: Union[str, None] = None,
+                **kwargs
+        ):
+            super().__init__(path=path, meta_path=meta_path, cache_path=cache_path, **kwargs)
+            self.author = "me"
+            self.doi = ["my preprint", "my peer-reviewed publication"]
+            self.download_url_data = "my GEO upload"
+            self.normalisation = "raw"  # because I uploaded raw counts, which is good practice!
+            self.organ = "lung"
+            self.organism = "mouse"
+            self.assay_sc = "smart-seq2"
+            self.year = "2020"
+            self.sample_source = "primary_tissue"
+
+            self.obs_key_cell_types_original = "louvain_named"  # I save my cell type names in here
+
+    def load(data_dir, fn=None) -> anndata.AnnData:
+        fn = os.path.join(data_dir, "my.h5ad")
+        adata = anndata.read_h5ad(fn)
+        return adata
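
Note on docs/adding_dataset_classes.rst: the recommended directory layout (raw files placed in a sub-directory of `self.path` named after the formatted DOI) could be sketched roughly as below. This is only an illustration and not sfaira's implementation. The helper names `format_doi_for_directory` and `raw_data_dir`, the example DOI, and the exact set of characters treated as "special" (here: anything that is not an ASCII letter or digit) are all assumptions.

```python
import os
import re


def format_doi_for_directory(doi: str) -> str:
    """Replace every character that is not an ASCII letter or digit with "_".

    Assumption: the docs only say "all special characters"; the precise
    character set used by sfaira may differ.
    """
    return re.sub(r"[^0-9a-zA-Z]", "_", doi)


def raw_data_dir(base_path: str, doi: str) -> str:
    """Build the sub-directory of base_path that would hold the raw files."""
    return os.path.join(base_path, format_doi_for_directory(doi))


# Hypothetical DOI and base directory, used purely for illustration.
example_dir = raw_data_dir("/data/raw", "10.1000/xyz-123")
print(example_dir)
```

A `load(data_dir, fn=None)` function like the one in the diff would then receive such a directory as `data_dir` and join the raw file name onto it.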
diff --git a/docs/api/sfaira.data.DatasetBase.rst b/docs/api/sfaira.data.DatasetBase.rst
@@ -0,0 +1,106 @@
+sfaira.data.DatasetBase
+=======================
+
+.. currentmodule:: sfaira.data
+
+.. autoclass:: DatasetBase
+
+
+   .. automethod:: __init__
+
+
+   .. rubric:: Methods
+
+   .. autosummary::
+
+      ~DatasetBase.__init__
+      ~DatasetBase.clear
+      ~DatasetBase.collapse_counts
+      ~DatasetBase.download
+      ~DatasetBase.load
+      ~DatasetBase.load_meta
+      ~DatasetBase.load_ontology_class_map
+      ~DatasetBase.project_celltypes_to_ontology
+      ~DatasetBase.set_dataset_id
+      ~DatasetBase.show_summary
+      ~DatasetBase.streamline_features
+      ~DatasetBase.streamline_metadata
+      ~DatasetBase.subset_cells
+      ~DatasetBase.write_backed
+      ~DatasetBase.write_distributed_store
+      ~DatasetBase.write_meta
+      ~DatasetBase.write_ontology_class_map
+
+
+
+
+
+   .. rubric:: Attributes
+
+   .. autosummary::
+
+      ~DatasetBase.additional_annotation_key
+      ~DatasetBase.annotated
+      ~DatasetBase.assay_differentiation
+      ~DatasetBase.assay_differentiation_obs_key
+      ~DatasetBase.assay_sc
+      ~DatasetBase.assay_sc_obs_key
+      ~DatasetBase.assay_type_differentiation
+      ~DatasetBase.assay_type_differentiation_obs_key
+      ~DatasetBase.author
+      ~DatasetBase.bio_sample
+      ~DatasetBase.bio_sample_obs_key
+      ~DatasetBase.cache_fn
+      ~DatasetBase.cell_line
+      ~DatasetBase.cell_line_obs_key
+      ~DatasetBase.cell_ontology_map
+      ~DatasetBase.cell_types_original_obs_key
+      ~DatasetBase.cellontology_class_obs_key
+      ~DatasetBase.cellontology_id_obs_key
+      ~DatasetBase.celltypes_universe
+      ~DatasetBase.citation
+      ~DatasetBase.data_dir
+      ~DatasetBase.default_embedding
+      ~DatasetBase.development_stage
+      ~DatasetBase.development_stage_obs_key
+      ~DatasetBase.directory_formatted_doi
+      ~DatasetBase.disease
+      ~DatasetBase.disease_obs_key
+      ~DatasetBase.doi
+      ~DatasetBase.doi_cleaned_id
+      ~DatasetBase.doi_main
+      ~DatasetBase.download_url_data
+      ~DatasetBase.download_url_meta
+      ~DatasetBase.ethnicity
+      ~DatasetBase.ethnicity_obs_key
+      ~DatasetBase.fn_ontology_class_map_tsv
+      ~DatasetBase.gene_id_ensembl_var_key
+      ~DatasetBase.gene_id_symbols_var_key
+      ~DatasetBase.id
+      ~DatasetBase.individual
+      ~DatasetBase.individual_obs_key
+      ~DatasetBase.loaded
+      ~DatasetBase.meta
+      ~DatasetBase.meta_fn
+      ~DatasetBase.ncells
+      ~DatasetBase.normalization
+      ~DatasetBase.ontology_celltypes
+      ~DatasetBase.ontology_organ
+      ~DatasetBase.organ
+      ~DatasetBase.organ_obs_key
+      ~DatasetBase.organism
+      ~DatasetBase.organism_obs_key
+      ~DatasetBase.primary_data
+      ~DatasetBase.sample_source
+      ~DatasetBase.sample_source_obs_key
+      ~DatasetBase.sex
+      ~DatasetBase.sex_obs_key
+      ~DatasetBase.source
+      ~DatasetBase.state_exact
+      ~DatasetBase.state_exact_obs_key
+      ~DatasetBase.tech_sample
+      ~DatasetBase.tech_sample_obs_key
+      ~DatasetBase.title
+      ~DatasetBase.year
+
+
diff --git a/docs/api/sfaira.data.DatasetInteractive.rst b/docs/api/sfaira.data.DatasetInteractive.rst
@@ -0,0 +1,106 @@
+sfaira.data.DatasetInteractive
+==============================
+
+.. currentmodule:: sfaira.data
+
+.. autoclass:: DatasetInteractive
+
+
+   .. automethod:: __init__
+
+
+   .. rubric:: Methods
+
+   .. autosummary::
+
+      ~DatasetInteractive.__init__
+      ~DatasetInteractive.clear
+      ~DatasetInteractive.collapse_counts
+      ~DatasetInteractive.download
+      ~DatasetInteractive.load
+      ~DatasetInteractive.load_meta
+      ~DatasetInteractive.load_ontology_class_map
+      ~DatasetInteractive.project_celltypes_to_ontology
+      ~DatasetInteractive.set_dataset_id
+      ~DatasetInteractive.show_summary
+      ~DatasetInteractive.streamline_features
+      ~DatasetInteractive.streamline_metadata
+      ~DatasetInteractive.subset_cells
+      ~DatasetInteractive.write_backed
+      ~DatasetInteractive.write_distributed_store
+      ~DatasetInteractive.write_meta
+      ~DatasetInteractive.write_ontology_class_map
+
+
+
+
+
+   .. rubric:: Attributes
+
+   .. autosummary::
+
+      ~DatasetInteractive.additional_annotation_key
+      ~DatasetInteractive.annotated
+      ~DatasetInteractive.assay_differentiation
+      ~DatasetInteractive.assay_differentiation_obs_key
+      ~DatasetInteractive.assay_sc
+      ~DatasetInteractive.assay_sc_obs_key
+      ~DatasetInteractive.assay_type_differentiation
+      ~DatasetInteractive.assay_type_differentiation_obs_key
+      ~DatasetInteractive.author
+      ~DatasetInteractive.bio_sample
+      ~DatasetInteractive.bio_sample_obs_key
+      ~DatasetInteractive.cache_fn
+      ~DatasetInteractive.cell_line
+      ~DatasetInteractive.cell_line_obs_key
+      ~DatasetInteractive.cell_ontology_map
+      ~DatasetInteractive.cell_types_original_obs_key
+      ~DatasetInteractive.cellontology_class_obs_key
+      ~DatasetInteractive.cellontology_id_obs_key
+      ~DatasetInteractive.celltypes_universe
+      ~DatasetInteractive.citation
+      ~DatasetInteractive.data_dir
+      ~DatasetInteractive.default_embedding
+      ~DatasetInteractive.development_stage
+      ~DatasetInteractive.development_stage_obs_key
+      ~DatasetInteractive.directory_formatted_doi
+      ~DatasetInteractive.disease
+      ~DatasetInteractive.disease_obs_key
+      ~DatasetInteractive.doi
+      ~DatasetInteractive.doi_cleaned_id
+      ~DatasetInteractive.doi_main
+      ~DatasetInteractive.download_url_data
+      ~DatasetInteractive.download_url_meta
+      ~DatasetInteractive.ethnicity
+      ~DatasetInteractive.ethnicity_obs_key
+      ~DatasetInteractive.fn_ontology_class_map_tsv
+      ~DatasetInteractive.gene_id_ensembl_var_key
+      ~DatasetInteractive.gene_id_symbols_var_key
+      ~DatasetInteractive.id
+      ~DatasetInteractive.individual
+      ~DatasetInteractive.individual_obs_key
+      ~DatasetInteractive.loaded
+      ~DatasetInteractive.meta
+      ~DatasetInteractive.meta_fn
+      ~DatasetInteractive.ncells
+      ~DatasetInteractive.normalization
+      ~DatasetInteractive.ontology_celltypes
+      ~DatasetInteractive.ontology_organ
+      ~DatasetInteractive.organ
+      ~DatasetInteractive.organ_obs_key
+      ~DatasetInteractive.organism
+      ~DatasetInteractive.organism_obs_key
+      ~DatasetInteractive.primary_data
+      ~DatasetInteractive.sample_source
+      ~DatasetInteractive.sample_source_obs_key
+      ~DatasetInteractive.sex
+      ~DatasetInteractive.sex_obs_key
+      ~DatasetInteractive.source
+      ~DatasetInteractive.state_exact
+      ~DatasetInteractive.state_exact_obs_key
+      ~DatasetInteractive.tech_sample
+      ~DatasetInteractive.tech_sample_obs_key
+      ~DatasetInteractive.title
+      ~DatasetInteractive.year
+
+
diff --git a/docs/api/sfaira.data.DatasetSuperGroup.rst b/docs/api/sfaira.data.DatasetSuperGroup.rst
@@ -0,0 +1,55 @@
+sfaira.data.DatasetSuperGroup
+=============================
+
+.. currentmodule:: sfaira.data
+
+.. autoclass:: DatasetSuperGroup
+
+
+   .. automethod:: __init__
+
+
+   .. rubric:: Methods
+
+   .. autosummary::
+
+      ~DatasetSuperGroup.__init__
+      ~DatasetSuperGroup.collapse_counts
+      ~DatasetSuperGroup.delete_backed
+      ~DatasetSuperGroup.download
+      ~DatasetSuperGroup.extend_dataset_groups
+      ~DatasetSuperGroup.flatten
+      ~DatasetSuperGroup.get_gc
+      ~DatasetSuperGroup.load
+      ~DatasetSuperGroup.load_cached_backed
+      ~DatasetSuperGroup.load_config
+      ~DatasetSuperGroup.ncells
+      ~DatasetSuperGroup.ncells_bydataset
+      ~DatasetSuperGroup.ncells_bydataset_flat
+      ~DatasetSuperGroup.project_celltypes_to_ontology
+      ~DatasetSuperGroup.remove_duplicates
+      ~DatasetSuperGroup.set_dataset_groups
+      ~DatasetSuperGroup.show_summary
+      ~DatasetSuperGroup.streamline_features
+      ~DatasetSuperGroup.streamline_metadata
+      ~DatasetSuperGroup.subset
+      ~DatasetSuperGroup.subset_cells
+      ~DatasetSuperGroup.write_backed
+      ~DatasetSuperGroup.write_config
+      ~DatasetSuperGroup.write_distributed_store
+
+
+
+
+
+   .. rubric:: Attributes
+
+   .. autosummary::
+
+      ~DatasetSuperGroup.adata
+      ~DatasetSuperGroup.adata_ls
+      ~DatasetSuperGroup.additional_annotation_key
+      ~DatasetSuperGroup.datasets
+      ~DatasetSuperGroup.ids
+
+