Commit
Showing 19 changed files with 941 additions and 14 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,112 @@ | ||
The class-based data loader python file | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
As an alternative to the preferred yaml-based dataloaders, users can provide a dataloader class together with the load function. | ||
In this scenario, the metadata are described in the constructor of a class that lives in the same python file as the loading function. Such a file contains two parts: | ||
|
||
1. A constructor of the following form that contains all the relevant metadata that is available before the actual dataset is loaded to memory. | ||
|
||
.. code-block:: python | ||
    def __init__( | ||
            self, | ||
            path: Union[str, None] = None, | ||
            meta_path: Union[str, None] = None, | ||
            cache_path: Union[str, None] = None, | ||
            **kwargs | ||
    ): | ||
        super().__init__(path=path, meta_path=meta_path, cache_path=cache_path, **kwargs) | ||
        # Data set meta data: You do not have to include all of these and can simply skip lines corresponding | ||
        # to attributes that you do not have access to. These are meta data on a sample level. | ||
        # The meta data attributes labeled with (*) may also be supplied per cell, see below. | ||
        # In this case, if you supply a .obs_key* attribute, you can leave out the sample-wise attribute. | ||
        self.id = x  # unique identifier of data set (Organism_Organ_Year_AssaySc_NumberOfDataset_FirstAuthorLastname_doi). | ||
        self.author = x  # author (list) who sampled / created the data set | ||
        self.doi = x  # doi of data set accompanying manuscript | ||
        self.download_url_data = x  # download website(s) of data files | ||
        self.download_url_meta = x  # download website(s) of meta data files | ||
        self.assay_sc = x  # (*, optional) protocol used to sample data (e.g. smart-seq2) | ||
        self.assay_differentiation = x  # (*, optional) protocol used to differentiate the cell line (e.g. Lancaster, 2014) | ||
        self.assay_type_differentiation = x  # (*, optional) type of protocol used to differentiate the cell line (guided/unguided) | ||
        self.cell_line = x  # (*, optional) cell line used (for cell culture samples) | ||
        self.dev_stage = x  # (*, optional) developmental stage of organism | ||
        self.ethnicity = x  # (*, optional) ethnicity of sample | ||
        self.healthy = x  # (*, optional) whether sample represents a healthy organism | ||
        self.normalisation = x  # (optional) normalisation applied to raw data loaded (ideally counts, "raw") | ||
        self.organ = x  # (*, optional) organ (anatomical structure) | ||
        self.organism = x  # (*) species / organism | ||
        self.sample_source = x  # (*) whether the sample came from primary tissue or cell culture | ||
        self.sex = x  # (*, optional) sex | ||
        self.state_exact = x  # (*, optional) exact disease, treatment or perturbation state of sample | ||
        self.year = x  # year in which sample was acquired | ||
        # The following meta data may instead be supplied on a cell level if an appropriate column is present in the | ||
        # anndata instance (specifically in .obs) after loading. | ||
        # You need to make sure this column is loaded in the loading script! | ||
        # See above for a description of what these meta data attributes mean. | ||
        # Again, if these attributes are not available, you can simply leave them out. | ||
        self.obs_key_assay_sc = x  # (optional, see above, do not provide if .assay_sc is provided) | ||
        self.obs_key_assay_differentiation = x  # (optional, see above, do not provide if .assay_differentiation is provided) | ||
        self.obs_key_assay_type_differentiation = x  # (optional, see above, do not provide if .assay_type_differentiation is provided) | ||
        self.obs_key_cell_line = x  # (optional, see above, do not provide if .cell_line is provided) | ||
        self.obs_key_dev_stage = x  # (optional, see above, do not provide if .dev_stage is provided) | ||
        self.obs_key_ethnicity = x  # (optional, see above, do not provide if .ethnicity is provided) | ||
        self.obs_key_healthy = x  # (optional, see above, do not provide if .healthy is provided) | ||
        self.obs_key_organ = x  # (optional, see above, do not provide if .organ is provided) | ||
        self.obs_key_organism = x  # (optional, see above, do not provide if .organism is provided) | ||
        self.obs_key_sample_source = x  # (optional, see above, do not provide if .sample_source is provided) | ||
        self.obs_key_sex = x  # (optional, see above, do not provide if .sex is provided) | ||
        self.obs_key_state_exact = x  # (optional, see above, do not provide if .state_exact is provided) | ||
        # Additionally, cell type annotation is optional, but if present it is ALWAYS provided per cell in .obs. | ||
        # Name of the column which contains streamlined cell ontology cell type classes: | ||
        self.obs_key_cell_types_original = x  # (optional) | ||
        # This cell type annotation is free text but is mapped to an ontology via a .tsv file with the same name and | ||
        # directory as the python file of this data loader (see below). | ||
2. A function called to load the data set into memory: | ||
It is important to set an automated path indicating the location of the raw files here. | ||
We recommend that you define one directory in your directory structure | ||
that holds all of these raw files (self.path) and then add a sub-directory named | ||
`self.directory_formatted_doi` (i.e. the doi with all special characters replaced by "_"), and place the raw files | ||
directly into this sub-directory. | ||
|
||
.. code-block:: python | ||
    def load(data_dir, fn=None) -> anndata.AnnData: | ||
        fn = os.path.join(data_dir, "my.h5ad") | ||
        adata = anndata.read_h5ad(fn)  # loading instruction into adata, use other ones if the data is not h5ad | ||
        return adata | ||
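The recommended directory layout can be sketched in a few lines. This is only an illustration: `format_doi_for_directory` and `raw_data_dir` are hypothetical helpers, not sfaira functions, and the sketch assumes that "special characters" means everything except ASCII letters and digits, which the text above does not spell out.

```python
import os
import re


def format_doi_for_directory(doi: str) -> str:
    """Replace every character that is not an ASCII letter or digit with "_".

    Assumption: the exact set of characters that sfaira replaces is not
    stated in the text; this sketch replaces everything non-alphanumeric.
    """
    return re.sub(r"[^0-9A-Za-z]", "_", doi)


def raw_data_dir(path: str, doi: str) -> str:
    """Build <path>/<formatted doi>, the sub-directory that holds the raw files."""
    return os.path.join(path, format_doi_for_directory(doi))


if __name__ == "__main__":
    # "10.1000/example-doi" is a made-up DOI used purely for illustration.
    print(raw_data_dir("raw_data", "10.1000/example-doi"))
```

With this layout, the `data_dir` argument passed to `load` would be the sub-directory returned by `raw_data_dir`, so the loader only needs to join file names onto it.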
In summary, a python file for a mouse lung data set could look like this: | ||
|
||
.. code-block:: python | ||
    import os | ||
    from typing import Union | ||
    import anndata | ||
    from sfaira.data import DatasetBase | ||
    class MyDataset(DatasetBase): | ||
        def __init__( | ||
                self, | ||
                path: Union[str, None] = None, | ||
                meta_path: Union[str, None] = None, | ||
                cache_path: Union[str, None] = None, | ||
                **kwargs | ||
        ): | ||
            super().__init__(path=path, meta_path=meta_path, cache_path=cache_path, **kwargs) | ||
            self.author = "me" | ||
            self.doi = ["my preprint", "my peer-reviewed publication"] | ||
            self.download_url_data = "my GEO upload" | ||
            self.normalisation = "raw"  # because I uploaded raw counts, which is good practice! | ||
            self.organ = "lung" | ||
            self.organism = "mouse" | ||
            self.assay_sc = "smart-seq2" | ||
            self.year = "2020" | ||
            self.sample_source = "primary_tissue" | ||
            self.obs_key_cell_types_original = "louvain_named"  # I save my cell type names in here | ||
    def load(data_dir, fn=None) -> anndata.AnnData: | ||
        fn = os.path.join(data_dir, "my.h5ad") | ||
        adata = anndata.read_h5ad(fn) | ||
        return adata |
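The constructor comments ask for either a sample-wise attribute or the matching `obs_key_*` attribute, not both. A minimal sketch of how one might check a set of planned values before writing the class is shown here. `find_conflicts` is a hypothetical helper and not part of sfaira; the attribute names are copied from the constructor template, and the metadata is passed as a plain dictionary purely for illustration.

```python
from typing import Any, Dict, List

# Attributes that the constructor template allows on a sample level or,
# via the matching obs_key_* attribute, on a cell level.
CELL_LEVEL_CAPABLE = [
    "assay_sc",
    "assay_differentiation",
    "assay_type_differentiation",
    "cell_line",
    "dev_stage",
    "ethnicity",
    "healthy",
    "organ",
    "organism",
    "sample_source",
    "sex",
    "state_exact",
]


def find_conflicts(meta: Dict[str, Any]) -> List[str]:
    """Return the attribute names that are set both sample-wise and per cell.

    Hypothetical helper: `meta` maps attribute names (without "self.") to the
    values you plan to assign in the constructor.
    """
    return [
        name
        for name in CELL_LEVEL_CAPABLE
        if meta.get(name) is not None and meta.get(f"obs_key_{name}") is not None
    ]


if __name__ == "__main__":
    planned = {"organ": "lung", "obs_key_organ": "tissue", "sex": "female"}
    print(find_conflicts(planned))
```

An empty result means that every attribute is defined at exactly one level, which is what the template comments request.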
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
sfaira.data.DatasetBase | ||
======================= | ||
|
||
.. currentmodule:: sfaira.data | ||
|
||
.. autoclass:: DatasetBase | ||
|
||
|
||
.. automethod:: __init__ | ||
|
||
|
||
.. rubric:: Methods | ||
|
||
.. autosummary:: | ||
|
||
~DatasetBase.__init__ | ||
~DatasetBase.clear | ||
~DatasetBase.collapse_counts | ||
~DatasetBase.download | ||
~DatasetBase.load | ||
~DatasetBase.load_meta | ||
~DatasetBase.load_ontology_class_map | ||
~DatasetBase.project_celltypes_to_ontology | ||
~DatasetBase.set_dataset_id | ||
~DatasetBase.show_summary | ||
~DatasetBase.streamline_features | ||
~DatasetBase.streamline_metadata | ||
~DatasetBase.subset_cells | ||
~DatasetBase.write_backed | ||
~DatasetBase.write_distributed_store | ||
~DatasetBase.write_meta | ||
~DatasetBase.write_ontology_class_map | ||
|
||
|
||
|
||
|
||
|
||
.. rubric:: Attributes | ||
|
||
.. autosummary:: | ||
|
||
~DatasetBase.additional_annotation_key | ||
~DatasetBase.annotated | ||
~DatasetBase.assay_differentiation | ||
~DatasetBase.assay_differentiation_obs_key | ||
~DatasetBase.assay_sc | ||
~DatasetBase.assay_sc_obs_key | ||
~DatasetBase.assay_type_differentiation | ||
~DatasetBase.assay_type_differentiation_obs_key | ||
~DatasetBase.author | ||
~DatasetBase.bio_sample | ||
~DatasetBase.bio_sample_obs_key | ||
~DatasetBase.cache_fn | ||
~DatasetBase.cell_line | ||
~DatasetBase.cell_line_obs_key | ||
~DatasetBase.cell_ontology_map | ||
~DatasetBase.cell_types_original_obs_key | ||
~DatasetBase.cellontology_class_obs_key | ||
~DatasetBase.cellontology_id_obs_key | ||
~DatasetBase.celltypes_universe | ||
~DatasetBase.citation | ||
~DatasetBase.data_dir | ||
~DatasetBase.default_embedding | ||
~DatasetBase.development_stage | ||
~DatasetBase.development_stage_obs_key | ||
~DatasetBase.directory_formatted_doi | ||
~DatasetBase.disease | ||
~DatasetBase.disease_obs_key | ||
~DatasetBase.doi | ||
~DatasetBase.doi_cleaned_id | ||
~DatasetBase.doi_main | ||
~DatasetBase.download_url_data | ||
~DatasetBase.download_url_meta | ||
~DatasetBase.ethnicity | ||
~DatasetBase.ethnicity_obs_key | ||
~DatasetBase.fn_ontology_class_map_tsv | ||
~DatasetBase.gene_id_ensembl_var_key | ||
~DatasetBase.gene_id_symbols_var_key | ||
~DatasetBase.id | ||
~DatasetBase.individual | ||
~DatasetBase.individual_obs_key | ||
~DatasetBase.loaded | ||
~DatasetBase.meta | ||
~DatasetBase.meta_fn | ||
~DatasetBase.ncells | ||
~DatasetBase.normalization | ||
~DatasetBase.ontology_celltypes | ||
~DatasetBase.ontology_organ | ||
~DatasetBase.organ | ||
~DatasetBase.organ_obs_key | ||
~DatasetBase.organism | ||
~DatasetBase.organism_obs_key | ||
~DatasetBase.primary_data | ||
~DatasetBase.sample_source | ||
~DatasetBase.sample_source_obs_key | ||
~DatasetBase.sex | ||
~DatasetBase.sex_obs_key | ||
~DatasetBase.source | ||
~DatasetBase.state_exact | ||
~DatasetBase.state_exact_obs_key | ||
~DatasetBase.tech_sample | ||
~DatasetBase.tech_sample_obs_key | ||
~DatasetBase.title | ||
~DatasetBase.year | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
sfaira.data.DatasetInteractive | ||
============================== | ||
|
||
.. currentmodule:: sfaira.data | ||
|
||
.. autoclass:: DatasetInteractive | ||
|
||
|
||
.. automethod:: __init__ | ||
|
||
|
||
.. rubric:: Methods | ||
|
||
.. autosummary:: | ||
|
||
~DatasetInteractive.__init__ | ||
~DatasetInteractive.clear | ||
~DatasetInteractive.collapse_counts | ||
~DatasetInteractive.download | ||
~DatasetInteractive.load | ||
~DatasetInteractive.load_meta | ||
~DatasetInteractive.load_ontology_class_map | ||
~DatasetInteractive.project_celltypes_to_ontology | ||
~DatasetInteractive.set_dataset_id | ||
~DatasetInteractive.show_summary | ||
~DatasetInteractive.streamline_features | ||
~DatasetInteractive.streamline_metadata | ||
~DatasetInteractive.subset_cells | ||
~DatasetInteractive.write_backed | ||
~DatasetInteractive.write_distributed_store | ||
~DatasetInteractive.write_meta | ||
~DatasetInteractive.write_ontology_class_map | ||
|
||
|
||
|
||
|
||
|
||
.. rubric:: Attributes | ||
|
||
.. autosummary:: | ||
|
||
~DatasetInteractive.additional_annotation_key | ||
~DatasetInteractive.annotated | ||
~DatasetInteractive.assay_differentiation | ||
~DatasetInteractive.assay_differentiation_obs_key | ||
~DatasetInteractive.assay_sc | ||
~DatasetInteractive.assay_sc_obs_key | ||
~DatasetInteractive.assay_type_differentiation | ||
~DatasetInteractive.assay_type_differentiation_obs_key | ||
~DatasetInteractive.author | ||
~DatasetInteractive.bio_sample | ||
~DatasetInteractive.bio_sample_obs_key | ||
~DatasetInteractive.cache_fn | ||
~DatasetInteractive.cell_line | ||
~DatasetInteractive.cell_line_obs_key | ||
~DatasetInteractive.cell_ontology_map | ||
~DatasetInteractive.cell_types_original_obs_key | ||
~DatasetInteractive.cellontology_class_obs_key | ||
~DatasetInteractive.cellontology_id_obs_key | ||
~DatasetInteractive.celltypes_universe | ||
~DatasetInteractive.citation | ||
~DatasetInteractive.data_dir | ||
~DatasetInteractive.default_embedding | ||
~DatasetInteractive.development_stage | ||
~DatasetInteractive.development_stage_obs_key | ||
~DatasetInteractive.directory_formatted_doi | ||
~DatasetInteractive.disease | ||
~DatasetInteractive.disease_obs_key | ||
~DatasetInteractive.doi | ||
~DatasetInteractive.doi_cleaned_id | ||
~DatasetInteractive.doi_main | ||
~DatasetInteractive.download_url_data | ||
~DatasetInteractive.download_url_meta | ||
~DatasetInteractive.ethnicity | ||
~DatasetInteractive.ethnicity_obs_key | ||
~DatasetInteractive.fn_ontology_class_map_tsv | ||
~DatasetInteractive.gene_id_ensembl_var_key | ||
~DatasetInteractive.gene_id_symbols_var_key | ||
~DatasetInteractive.id | ||
~DatasetInteractive.individual | ||
~DatasetInteractive.individual_obs_key | ||
~DatasetInteractive.loaded | ||
~DatasetInteractive.meta | ||
~DatasetInteractive.meta_fn | ||
~DatasetInteractive.ncells | ||
~DatasetInteractive.normalization | ||
~DatasetInteractive.ontology_celltypes | ||
~DatasetInteractive.ontology_organ | ||
~DatasetInteractive.organ | ||
~DatasetInteractive.organ_obs_key | ||
~DatasetInteractive.organism | ||
~DatasetInteractive.organism_obs_key | ||
~DatasetInteractive.primary_data | ||
~DatasetInteractive.sample_source | ||
~DatasetInteractive.sample_source_obs_key | ||
~DatasetInteractive.sex | ||
~DatasetInteractive.sex_obs_key | ||
~DatasetInteractive.source | ||
~DatasetInteractive.state_exact | ||
~DatasetInteractive.state_exact_obs_key | ||
~DatasetInteractive.tech_sample | ||
~DatasetInteractive.tech_sample_obs_key | ||
~DatasetInteractive.title | ||
~DatasetInteractive.year | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
sfaira.data.DatasetSuperGroup | ||
============================= | ||
|
||
.. currentmodule:: sfaira.data | ||
|
||
.. autoclass:: DatasetSuperGroup | ||
|
||
|
||
.. automethod:: __init__ | ||
|
||
|
||
.. rubric:: Methods | ||
|
||
.. autosummary:: | ||
|
||
~DatasetSuperGroup.__init__ | ||
~DatasetSuperGroup.collapse_counts | ||
~DatasetSuperGroup.delete_backed | ||
~DatasetSuperGroup.download | ||
~DatasetSuperGroup.extend_dataset_groups | ||
~DatasetSuperGroup.flatten | ||
~DatasetSuperGroup.get_gc | ||
~DatasetSuperGroup.load | ||
~DatasetSuperGroup.load_cached_backed | ||
~DatasetSuperGroup.load_config | ||
~DatasetSuperGroup.ncells | ||
~DatasetSuperGroup.ncells_bydataset | ||
~DatasetSuperGroup.ncells_bydataset_flat | ||
~DatasetSuperGroup.project_celltypes_to_ontology | ||
~DatasetSuperGroup.remove_duplicates | ||
~DatasetSuperGroup.set_dataset_groups | ||
~DatasetSuperGroup.show_summary | ||
~DatasetSuperGroup.streamline_features | ||
~DatasetSuperGroup.streamline_metadata | ||
~DatasetSuperGroup.subset | ||
~DatasetSuperGroup.subset_cells | ||
~DatasetSuperGroup.write_backed | ||
~DatasetSuperGroup.write_config | ||
~DatasetSuperGroup.write_distributed_store | ||
|
||
|
||
|
||
|
||
|
||
.. rubric:: Attributes | ||
|
||
.. autosummary:: | ||
|
||
~DatasetSuperGroup.adata | ||
~DatasetSuperGroup.adata_ls | ||
~DatasetSuperGroup.additional_annotation_key | ||
~DatasetSuperGroup.datasets | ||
~DatasetSuperGroup.ids | ||
|
||
|