Master dev merge (#298)
Resolves conflicts from master -> dev
davidsebfischer authored Jun 9, 2021
1 parent 680511c commit 6e81f75
Showing 19 changed files with 941 additions and 14 deletions.
112 changes: 112 additions & 0 deletions docs/adding_dataset_classes.rst
@@ -0,0 +1,112 @@
The class-based data loader python file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As an alternative to the preferred yaml-based data loaders, users can provide a data loader class together with the load function.
In this scenario, the meta data is described in the constructor of a class that lives in the same python file as the loading function.

1. A constructor of the following form that contains all the relevant metadata that is available before the actual dataset is loaded to memory.

.. code-block:: python

    def __init__(
            self,
            path: Union[str, None] = None,
            meta_path: Union[str, None] = None,
            cache_path: Union[str, None] = None,
            **kwargs
    ):
        super().__init__(path=path, meta_path=meta_path, cache_path=cache_path, **kwargs)
        # Data set meta data: You do not have to include all of these and can simply skip lines corresponding
        # to attributes that you do not have access to. These are meta data on a sample level.
        # The meta data attributes labeled with (*) may also be supplied per cell, see below.
        # In this case, if you supply an .obs_key_* attribute, you can leave out the sample-wise attribute.
        self.id = x  # unique identifier of data set (Organism_Organ_Year_AssaySc_NumberOfDataset_FirstAuthorLastname_doi).
        self.author = x  # author (list) who sampled / created the data set
        self.doi = x  # doi of data set accompanying manuscript
        self.download_url_data = x  # download website(s) of data files
        self.download_url_meta = x  # download website(s) of meta data files
        self.assay_sc = x  # (*, optional) protocol used to sample data (e.g. smart-seq2)
        self.assay_differentiation = x  # (*, optional) protocol used to differentiate the cell line (e.g. Lancaster, 2014)
        self.assay_type_differentiation = x  # (*, optional) type of protocol used to differentiate the cell line (guided/unguided)
        self.cell_line = x  # (*, optional) cell line used (for cell culture samples)
        self.dev_stage = x  # (*, optional) developmental stage of organism
        self.ethnicity = x  # (*, optional) ethnicity of sample
        self.healthy = x  # (*, optional) whether sample represents a healthy organism
        self.normalisation = x  # (optional) normalisation applied to raw data loaded (ideally counts, "raw")
        self.organ = x  # (*, optional) organ (anatomical structure)
        self.organism = x  # (*) species / organism
        self.sample_source = x  # (*) whether the sample came from primary tissue or cell culture
        self.sex = x  # (*, optional) sex
        self.state_exact = x  # (*, optional) exact disease, treatment or perturbation state of sample
        self.year = x  # year in which sample was acquired
        # The following meta data may instead be supplied on a cell level if an appropriate column is present in the
        # anndata instance (specifically in .obs) after loading.
        # You need to make sure this column is loaded in the loading script!
        # See above for a description of what these meta data attributes mean.
        # Again, if these attributes are not available, you can simply leave them out.
        self.obs_key_assay_sc = x  # (optional, see above, do not provide if .assay_sc is provided)
        self.obs_key_assay_differentiation = x  # (optional, see above, do not provide if .assay_differentiation is provided)
        self.obs_key_assay_type_differentiation = x  # (optional, see above, do not provide if .assay_type_differentiation is provided)
        self.obs_key_cell_line = x  # (optional, see above, do not provide if .cell_line is provided)
        self.obs_key_dev_stage = x  # (optional, see above, do not provide if .dev_stage is provided)
        self.obs_key_ethnicity = x  # (optional, see above, do not provide if .ethnicity is provided)
        self.obs_key_healthy = x  # (optional, see above, do not provide if .healthy is provided)
        self.obs_key_organ = x  # (optional, see above, do not provide if .organ is provided)
        self.obs_key_organism = x  # (optional, see above, do not provide if .organism is provided)
        self.obs_key_sample_source = x  # (optional, see above, do not provide if .sample_source is provided)
        self.obs_key_sex = x  # (optional, see above, do not provide if .sex is provided)
        self.obs_key_state_exact = x  # (optional, see above, do not provide if .state_exact is provided)
        # Additionally, cell type annotation is ALWAYS provided per cell in .obs; this annotation is optional though.
        # Name of the column that contains the cell type labels:
        self.obs_key_cell_types_original = x  # (optional)
        # This cell type annotation is free text but is mapped to an ontology via a .tsv file with the same name and
        # directory as the python file of this data loader (see below).

2. A function called to load the data set into memory.
   It is important to set an automated path indicating the location of the raw files here.
   We recommend that you define a folder in your directory structure that holds all of these raw files (self.path),
   add a sub-directory named after `self.directory_formatted_doi` (i.e. the doi with all special characters
   replaced by "_") and place the raw files directly into this sub-directory.

.. code-block:: python

    def load(data_dir, fn=None) -> anndata.AnnData:
        # data_dir is the sub-directory described above that holds the raw files.
        fn = os.path.join(data_dir, "my.h5ad")
        adata = anndata.read_h5ad(fn)  # loading instruction into adata, use other ones if the data is not h5ad
        return adata

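The directory recommendation above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names ``format_doi_for_directory`` and ``raw_data_dir`` are hypothetical and not part of sfaira, the DOI and base path are placeholders, and "special characters" is assumed here to mean every non-alphanumeric character; the exact rule behind `self.directory_formatted_doi` may differ.

```python
import os
import re


def format_doi_for_directory(doi: str) -> str:
    """Replace every non-alphanumeric character of a DOI with "_" (assumed rule)."""
    return re.sub(r"[^0-9A-Za-z]", "_", doi)


def raw_data_dir(base_path: str, doi: str) -> str:
    """Return <base_path>/<formatted doi>, the folder that holds the raw files."""
    return os.path.join(base_path, format_doi_for_directory(doi))


# Placeholder values, chosen only for illustration.
data_dir = raw_data_dir("/data/raw", "10.1000/example.2020-01")
fn = os.path.join(data_dir, "my.h5ad")  # file name used in the load function above
```

With this layout, the load function only needs the sub-directory as ``data_dir`` and can build every file path relative to it.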
In summary, a python file for a mouse lung data set could look like this:

.. code-block:: python

    import os
    from typing import Union

    import anndata
    from sfaira.data import DatasetBase


    class MyDataset(DatasetBase):

        def __init__(
                self,
                path: Union[str, None] = None,
                meta_path: Union[str, None] = None,
                cache_path: Union[str, None] = None,
                **kwargs
        ):
            super().__init__(path=path, meta_path=meta_path, cache_path=cache_path, **kwargs)
            self.author = "me"
            self.doi = ["my preprint", "my peer-reviewed publication"]
            self.download_url_data = "my GEO upload"
            self.normalisation = "raw"  # because I uploaded raw counts, which is good practice!
            self.organ = "lung"
            self.organism = "mouse"
            self.assay_sc = "smart-seq2"
            self.year = "2020"
            self.sample_source = "primary_tissue"
            self.obs_key_cell_types_original = "louvain_named"  # I save my cell type names in here


    def load(data_dir, fn=None) -> anndata.AnnData:
        fn = os.path.join(data_dir, "my.h5ad")
        adata = anndata.read_h5ad(fn)
        return adata
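
The constructor comments state that an attribute marked with (*) should be given either per sample or through its ``obs_key_*`` counterpart, but not both. A minimal sketch of that check follows. It assumes the meta data has been collected in a plain dictionary rather than on a ``DatasetBase`` instance, and the helper ``find_double_specified`` is hypothetical and not part of sfaira.

```python
from typing import Any, Dict, List

# Attributes marked (*) in the constructor comments: each may be set per sample
# (<name>) or per cell (obs_key_<name>).
DUAL_LEVEL_ATTRIBUTES = [
    "assay_sc", "assay_differentiation", "assay_type_differentiation",
    "cell_line", "dev_stage", "ethnicity", "healthy", "organ", "organism",
    "sample_source", "sex", "state_exact",
]


def find_double_specified(meta: Dict[str, Any]) -> List[str]:
    """Return the attribute names that are set both sample-wise and via obs_key_*."""
    return [
        name for name in DUAL_LEVEL_ATTRIBUTES
        if meta.get(name) is not None and meta.get(f"obs_key_{name}") is not None
    ]


# Meta data of the mouse lung example above, written as a dictionary.
example_meta = {
    "organ": "lung",
    "organism": "mouse",
    "assay_sc": "smart-seq2",
    "sample_source": "primary_tissue",
    "obs_key_cell_types_original": "louvain_named",
}
conflicts = find_double_specified(example_meta)
```

For the example meta data, ``conflicts`` is an empty list. Adding an ``obs_key_organ`` entry next to ``organ`` would make the helper report ``organ``.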
106 changes: 106 additions & 0 deletions docs/api/sfaira.data.DatasetBase.rst
@@ -0,0 +1,106 @@
sfaira.data.DatasetBase
=======================

.. currentmodule:: sfaira.data

.. autoclass:: DatasetBase

   .. automethod:: __init__

   .. rubric:: Methods

   .. autosummary::

      ~DatasetBase.__init__
      ~DatasetBase.clear
      ~DatasetBase.collapse_counts
      ~DatasetBase.download
      ~DatasetBase.load
      ~DatasetBase.load_meta
      ~DatasetBase.load_ontology_class_map
      ~DatasetBase.project_celltypes_to_ontology
      ~DatasetBase.set_dataset_id
      ~DatasetBase.show_summary
      ~DatasetBase.streamline_features
      ~DatasetBase.streamline_metadata
      ~DatasetBase.subset_cells
      ~DatasetBase.write_backed
      ~DatasetBase.write_distributed_store
      ~DatasetBase.write_meta
      ~DatasetBase.write_ontology_class_map

   .. rubric:: Attributes

   .. autosummary::

      ~DatasetBase.additional_annotation_key
      ~DatasetBase.annotated
      ~DatasetBase.assay_differentiation
      ~DatasetBase.assay_differentiation_obs_key
      ~DatasetBase.assay_sc
      ~DatasetBase.assay_sc_obs_key
      ~DatasetBase.assay_type_differentiation
      ~DatasetBase.assay_type_differentiation_obs_key
      ~DatasetBase.author
      ~DatasetBase.bio_sample
      ~DatasetBase.bio_sample_obs_key
      ~DatasetBase.cache_fn
      ~DatasetBase.cell_line
      ~DatasetBase.cell_line_obs_key
      ~DatasetBase.cell_ontology_map
      ~DatasetBase.cell_types_original_obs_key
      ~DatasetBase.cellontology_class_obs_key
      ~DatasetBase.cellontology_id_obs_key
      ~DatasetBase.celltypes_universe
      ~DatasetBase.citation
      ~DatasetBase.data_dir
      ~DatasetBase.default_embedding
      ~DatasetBase.development_stage
      ~DatasetBase.development_stage_obs_key
      ~DatasetBase.directory_formatted_doi
      ~DatasetBase.disease
      ~DatasetBase.disease_obs_key
      ~DatasetBase.doi
      ~DatasetBase.doi_cleaned_id
      ~DatasetBase.doi_main
      ~DatasetBase.download_url_data
      ~DatasetBase.download_url_meta
      ~DatasetBase.ethnicity
      ~DatasetBase.ethnicity_obs_key
      ~DatasetBase.fn_ontology_class_map_tsv
      ~DatasetBase.gene_id_ensembl_var_key
      ~DatasetBase.gene_id_symbols_var_key
      ~DatasetBase.id
      ~DatasetBase.individual
      ~DatasetBase.individual_obs_key
      ~DatasetBase.loaded
      ~DatasetBase.meta
      ~DatasetBase.meta_fn
      ~DatasetBase.ncells
      ~DatasetBase.normalization
      ~DatasetBase.ontology_celltypes
      ~DatasetBase.ontology_organ
      ~DatasetBase.organ
      ~DatasetBase.organ_obs_key
      ~DatasetBase.organism
      ~DatasetBase.organism_obs_key
      ~DatasetBase.primary_data
      ~DatasetBase.sample_source
      ~DatasetBase.sample_source_obs_key
      ~DatasetBase.sex
      ~DatasetBase.sex_obs_key
      ~DatasetBase.source
      ~DatasetBase.state_exact
      ~DatasetBase.state_exact_obs_key
      ~DatasetBase.tech_sample
      ~DatasetBase.tech_sample_obs_key
      ~DatasetBase.title
      ~DatasetBase.year


106 changes: 106 additions & 0 deletions docs/api/sfaira.data.DatasetInteractive.rst
@@ -0,0 +1,106 @@
sfaira.data.DatasetInteractive
==============================

.. currentmodule:: sfaira.data

.. autoclass:: DatasetInteractive

   .. automethod:: __init__

   .. rubric:: Methods

   .. autosummary::

      ~DatasetInteractive.__init__
      ~DatasetInteractive.clear
      ~DatasetInteractive.collapse_counts
      ~DatasetInteractive.download
      ~DatasetInteractive.load
      ~DatasetInteractive.load_meta
      ~DatasetInteractive.load_ontology_class_map
      ~DatasetInteractive.project_celltypes_to_ontology
      ~DatasetInteractive.set_dataset_id
      ~DatasetInteractive.show_summary
      ~DatasetInteractive.streamline_features
      ~DatasetInteractive.streamline_metadata
      ~DatasetInteractive.subset_cells
      ~DatasetInteractive.write_backed
      ~DatasetInteractive.write_distributed_store
      ~DatasetInteractive.write_meta
      ~DatasetInteractive.write_ontology_class_map

   .. rubric:: Attributes

   .. autosummary::

      ~DatasetInteractive.additional_annotation_key
      ~DatasetInteractive.annotated
      ~DatasetInteractive.assay_differentiation
      ~DatasetInteractive.assay_differentiation_obs_key
      ~DatasetInteractive.assay_sc
      ~DatasetInteractive.assay_sc_obs_key
      ~DatasetInteractive.assay_type_differentiation
      ~DatasetInteractive.assay_type_differentiation_obs_key
      ~DatasetInteractive.author
      ~DatasetInteractive.bio_sample
      ~DatasetInteractive.bio_sample_obs_key
      ~DatasetInteractive.cache_fn
      ~DatasetInteractive.cell_line
      ~DatasetInteractive.cell_line_obs_key
      ~DatasetInteractive.cell_ontology_map
      ~DatasetInteractive.cell_types_original_obs_key
      ~DatasetInteractive.cellontology_class_obs_key
      ~DatasetInteractive.cellontology_id_obs_key
      ~DatasetInteractive.celltypes_universe
      ~DatasetInteractive.citation
      ~DatasetInteractive.data_dir
      ~DatasetInteractive.default_embedding
      ~DatasetInteractive.development_stage
      ~DatasetInteractive.development_stage_obs_key
      ~DatasetInteractive.directory_formatted_doi
      ~DatasetInteractive.disease
      ~DatasetInteractive.disease_obs_key
      ~DatasetInteractive.doi
      ~DatasetInteractive.doi_cleaned_id
      ~DatasetInteractive.doi_main
      ~DatasetInteractive.download_url_data
      ~DatasetInteractive.download_url_meta
      ~DatasetInteractive.ethnicity
      ~DatasetInteractive.ethnicity_obs_key
      ~DatasetInteractive.fn_ontology_class_map_tsv
      ~DatasetInteractive.gene_id_ensembl_var_key
      ~DatasetInteractive.gene_id_symbols_var_key
      ~DatasetInteractive.id
      ~DatasetInteractive.individual
      ~DatasetInteractive.individual_obs_key
      ~DatasetInteractive.loaded
      ~DatasetInteractive.meta
      ~DatasetInteractive.meta_fn
      ~DatasetInteractive.ncells
      ~DatasetInteractive.normalization
      ~DatasetInteractive.ontology_celltypes
      ~DatasetInteractive.ontology_organ
      ~DatasetInteractive.organ
      ~DatasetInteractive.organ_obs_key
      ~DatasetInteractive.organism
      ~DatasetInteractive.organism_obs_key
      ~DatasetInteractive.primary_data
      ~DatasetInteractive.sample_source
      ~DatasetInteractive.sample_source_obs_key
      ~DatasetInteractive.sex
      ~DatasetInteractive.sex_obs_key
      ~DatasetInteractive.source
      ~DatasetInteractive.state_exact
      ~DatasetInteractive.state_exact_obs_key
      ~DatasetInteractive.tech_sample
      ~DatasetInteractive.tech_sample_obs_key
      ~DatasetInteractive.title
      ~DatasetInteractive.year


55 changes: 55 additions & 0 deletions docs/api/sfaira.data.DatasetSuperGroup.rst
@@ -0,0 +1,55 @@
sfaira.data.DatasetSuperGroup
=============================

.. currentmodule:: sfaira.data

.. autoclass:: DatasetSuperGroup

   .. automethod:: __init__

   .. rubric:: Methods

   .. autosummary::

      ~DatasetSuperGroup.__init__
      ~DatasetSuperGroup.collapse_counts
      ~DatasetSuperGroup.delete_backed
      ~DatasetSuperGroup.download
      ~DatasetSuperGroup.extend_dataset_groups
      ~DatasetSuperGroup.flatten
      ~DatasetSuperGroup.get_gc
      ~DatasetSuperGroup.load
      ~DatasetSuperGroup.load_cached_backed
      ~DatasetSuperGroup.load_config
      ~DatasetSuperGroup.ncells
      ~DatasetSuperGroup.ncells_bydataset
      ~DatasetSuperGroup.ncells_bydataset_flat
      ~DatasetSuperGroup.project_celltypes_to_ontology
      ~DatasetSuperGroup.remove_duplicates
      ~DatasetSuperGroup.set_dataset_groups
      ~DatasetSuperGroup.show_summary
      ~DatasetSuperGroup.streamline_features
      ~DatasetSuperGroup.streamline_metadata
      ~DatasetSuperGroup.subset
      ~DatasetSuperGroup.subset_cells
      ~DatasetSuperGroup.write_backed
      ~DatasetSuperGroup.write_config
      ~DatasetSuperGroup.write_distributed_store

   .. rubric:: Attributes

   .. autosummary::

      ~DatasetSuperGroup.adata
      ~DatasetSuperGroup.adata_ls
      ~DatasetSuperGroup.additional_annotation_key
      ~DatasetSuperGroup.datasets
      ~DatasetSuperGroup.ids

