diff --git a/docs/adding_dataset_classes.rst b/docs/adding_dataset_classes.rst
deleted file mode 100644
index cb499949d..000000000
--- a/docs/adding_dataset_classes.rst
+++ /dev/null
@@ -1,112 +0,0 @@
-The class-based data loader python file
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-As an alternative to the preferred yaml-based dataloaders, users can provide a dataloader class together with the load function.
-In this scenario, meta data is described in a constructor of a class in the same python file as the loading function.
-
-1. A constructor of the following form that contains all the relevant metadata that is available before the actual dataset is loaded to memory.
-
-.. code-block:: python
-
-    def __init__(
-            self,
-            path: Union[str, None] = None,
-            meta_path: Union[str, None] = None,
-            cache_path: Union[str, None] = None,
-            **kwargs
-    ):
-        super().__init__(path=path, meta_path=meta_path, cache_path=cache_path, **kwargs)
-        # Data set meta data: You do not have to include all of these and can simply skip lines corresponding
-        # to attributes that you do not have access to. These are meta data on a sample level.
-        # The meta data attributes labeled with (*) may also be supplied per cell, see below,
-        # in this case, if you supply a .obs_key* attribute, you can leave out the sample-wise attribute.
-
-        self.id = x  # unique identifier of data set (Organism_Organ_Year_AssaySc_NumberOfDataset_FirstAuthorLastname_doi).
-
-        self.author = x  # author (list) who sampled / created the data set
-        self.doi = x  # doi of data set accompanying manuscript
-
-        self.download_url_data = x  # download website(s) of data files
-        self.download_url_meta = x  # download website(s) of meta data files
-
-        self.assay_sc = x  # (*, optional) protocol used to sample data (e.g. smart-seq2)
-        self.assay_differentiation = x  # (*, optional) protocol used to differentiate the cell line (e.g.
Lancaster, 2014)
-        self.assay_type_differentiation = x  # (*, optional) type of protocol used to differentiate the cell line (guided/unguided)
-        self.cell_line = x  # (*, optional) cell line used (for cell culture samples)
-        self.dev_stage = x  # (*, optional) developmental stage of organism
-        self.ethnicity = x  # (*, optional) ethnicity of sample
-        self.healthy = x  # (*, optional) whether sample represents a healthy organism
-        self.normalisation = x  # (optional) normalisation applied to raw data loaded (ideally counts, "raw")
-        self.organ = x  # (*, optional) organ (anatomical structure)
-        self.organism = x  # (*) species / organism
-        self.sample_source = x  # (*) whether the sample came from primary tissue or cell culture
-        self.sex = x  # (*, optional) sex
-        self.state_exact = x  # (*, optional) exact disease, treatment or perturbation state of sample
-        self.year = x  # year in which sample was acquired
-
-        # The following meta data may instead also be supplied on a cell level if an appropriate column is present in the
-        # anndata instance (specifically in .obs) after loading.
-        # You need to make sure this is loaded in the loading script!
-        # See above for a description of what these meta data attributes mean.
-        # Again, if these attributes are not available, you can simply leave this out.
-        self.obs_key_assay_sc = x  # (optional, see above, do not provide if .assay_sc is provided)
-        self.obs_key_assay_differentiation = x  # (optional, see above, do not provide if .assay_differentiation is provided)
-        self.obs_key_assay_type_differentiation = x  # (optional, see above, do not provide if .assay_type_differentiation is provided)
-        self.obs_key_cell_line = x  # (optional, see above, do not provide if .cell_line is provided)
-        self.obs_key_dev_stage = x  # (optional, see above, do not provide if .dev_stage is provided)
-        self.obs_key_ethnicity = x  # (optional, see above, do not provide if .ethnicity is provided)
-        self.obs_key_healthy = x  # (optional, see above, do not provide if .healthy is provided)
-        self.obs_key_organ = x  # (optional, see above, do not provide if .organ is provided)
-        self.obs_key_organism = x  # (optional, see above, do not provide if .organism is provided)
-        self.obs_key_sample_source = x  # (optional, see above, do not provide if .sample_source is provided)
-        self.obs_key_sex = x  # (optional, see above, do not provide if .sex is provided)
-        self.obs_key_state_exact = x  # (optional, see above, do not provide if .state_exact is provided)
-        # Additionally, cell type annotation is ALWAYS provided per cell in .obs, this annotation is optional though.
-        # name of the column that contains the original, free-text cell type labels:
-        self.obs_key_cell_types_original = x  # (optional)
-        # This cell type annotation is free text but is mapped to an ontology via a .tsv file with the same name and
-        # directory as the python file of this data loader (see below).
-
-
-2. A function called to load the data set into memory:
-It is important to set an automated path indicating the location of the raw files here.
-Our recommendation for this directory set-up is that you define a directory folder in your directory structure
-in which all of these raw files will be (self.path) and then add a sub-directory named as
-`self.directory_formatted_doi` (i.e.
the doi with all special characters replaced by "_") and place the raw files
-directly into this sub directory.
-
-.. code-block:: python
-
-    def load(data_dir, fn=None) -> anndata.AnnData:
-        fn = os.path.join(data_dir, "my.h5ad")
-        adata = anndata.read_h5ad(fn)  # loading instruction into adata, use other ones if the data is not h5ad
-        return adata
-
-In summary, a python file for a mouse lung data set could look like this:
-
-.. code-block:: python
-
-    class MyDataset(DatasetBase):
-        def __init__(
-                self,
-                path: Union[str, None] = None,
-                meta_path: Union[str, None] = None,
-                cache_path: Union[str, None] = None,
-                **kwargs
-        ):
-            super().__init__(path=path, meta_path=meta_path, cache_path=cache_path, **kwargs)
-            self.author = "me"
-            self.doi = ["my preprint", "my peer-reviewed publication"]
-            self.download_url_data = "my GEO upload"
-            self.normalisation = "raw"  # because I uploaded raw counts, which is good practice!
-            self.organ = "lung"
-            self.organism = "mouse"
-            self.assay_sc = "smart-seq2"
-            self.year = "2020"
-            self.sample_source = "primary_tissue"
-
-            self.obs_key_cell_types_original = "louvain_named"  # I save my cell type names in here
-
-    def load(data_dir, fn=None) -> anndata.AnnData:
-        fn = os.path.join(data_dir, "my.h5ad")
-        adata = anndata.read_h5ad(fn)
-        return adata
diff --git a/docs/development.rst b/docs/development.rst
deleted file mode 100644
index 1d8488c31..000000000
--- a/docs/development.rst
+++ /dev/null
@@ -1,45 +0,0 @@
-Development
-===========
-
-Data zoo FAQ
-------------
-
-How are the meta data entries that I define in the constructor constrained or protected?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The items that are not free text are documented in the readthedocs data section; often,
-these require entries to be terms in an ontology.
-If you make a mistake in defining these fields in a data loader that you contribute,
-the template test data loader and any loading operation will throw an error
-pointing at this meta data element.
-
-How is _load() used in data loading?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-`_load()` contains all processing steps that load raw data files into a ready-to-use adata object.
-`_load()` is wrapped in `load()`, the main loading function of a `Dataset` instance.
-This adata object can be cached as an h5ad file named after the dataset ID for faster reloading
-(if allow_caching=True). `_load()` can be triggered to reload from scratch even if cached data is available
-(if use_cached=False).
-
-How is the feature space (gene names) manipulated during data loading?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Sfaira provides both gene names and ENSEMBL IDs. Missing IDs will automatically be inferred from the gene names and
-vice versa.
-Version tags on ENSEMBL gene IDs will be removed if specified (if remove_gene_version=True);
-in this case, counts are aggregated across these features.
-Sfaira makes sure that gene IDs in a dataset match IDs of chosen reference genomes.
-
-Datasets, DatasetGroups, DatasetSuperGroups - what are they?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Dataset: Custom class that loads a specific dataset.
-DatasetGroup: A dataset group manages a collection of data loaders (multiple instances of Dataset).
-This is useful to group, for example, all data loaders corresponding to a certain study or a certain tissue.
-DatasetSuperGroups: A group of DatasetGroups that allows easy addition of multiple instances of DatasetGroup.
-
-Basics of sfaira lazy loading via split into constructor and _load function.
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The constructor of a dataset defines all metadata associated with this data set.
-The loading of the actual data happens in the `load()` function and not in the constructor.
-This is useful as it allows initialising the datasets and accessing dataset metadata
-without loading the actual count data.
-DatasetGroups can contain initialised Datasets and can be subset based on metadata
-before loading is triggered across the entire group.
diff --git a/docs/using_data.rst b/docs/using_data.rst
deleted file mode 100644
index 24f0a1cbb..000000000
--- a/docs/using_data.rst
+++ /dev/null
@@ -1,153 +0,0 @@
-Using Data
-==========
-
-.. image:: https://raw.githubusercontent.com/theislab/sfaira/master/resources/images/data_zoo.png
-   :width: 600px
-   :align: center
-
-Build data repository locally
-------------------------------
-
-Build a repository structure
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-    1. Choose a directory to dedicate to the data base, called root in the following.
-    2. Run the sfaira download script (sfaira.data.utils.download_all). Alternatively, you can manually set up a data base by making subfolders for each study.
-
-Note that the automated download is a feature of sfaira but not the core purpose of the package:
-Sfaira allows you to efficiently interact with such a local data repository.
-Some data sets cannot be automatically downloaded and need your manual intervention, which we report in the download script output.
-
-Use 3rd party repositories
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-Some organizations provide streamlined data objects that can be directly consumed by data zoos such as sfaira.
-One example of such an organization is the cellxgene_ data portal.
-Through these repositories, one can easily build or extend a collection of data sets that can be easily interfaced with sfaira.
-Data loaders for cellxgene structured data objects will be available soon!
-Contact us for support of any other repositories.
-
-..
_cellxgene: https://cellxgene.cziscience.com/
-
-Genome management
------------------
-
-We streamline feature spaces used by models by defining standardized gene sets that are used as model input.
-Per default, sfaira works with the protein coding genes of a genome assembly right now.
-A model topology version includes the genome it was trained for, which also defines the features of this model as genes.
-As genome assemblies are updated, model topology versions can be updated and models retrained to reflect these changes.
-Note that because protein coding genes do not change drastically between genome assemblies,
-samples can be carried over to assemblies they were not aligned against by matching gene identifiers.
-Sfaira automatically tries to overlap gene identifiers to the genome assembly selected through the current model.
-
-FAQ
----
-
-How is the dataset’s ID structured?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Organism_Organ_Year_AssaySc_NumberOfDataset_FirstAuthorLastname_doi
-
-How do I assemble the data set ID if some of its element meta data are not unique?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The data set ID is designed to be a unique identifier of a data set.
-Therefore, it is not an issue if it does not capture the full complexity of the data.
-Simply choose the meta data value out of the list of corresponding values which comes first in the alphabet.
-
-What are cell-wise and sample-wise meta data?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Metadata can be set on a per sample level or, in some cases, per cell.
-Sample-wise meta data can be directly set in the constructor (e.g. self.organism = "human").
-Cell-wise metadata can be provided in `.obs` of the loaded data; here,
-a Dataset attribute contains the name of the `.obs` column that contains these cell-wise labels
-(e.g. self.obs_key_organism).
-Note that sample-wise meta data should be provided as such and not as a column in `.obs` to simplify loading.
-
-Which meta data objects are mandatory?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Mandatory on sample (self.attribute) or cell level (self.obs_key_attribute):
-
-    - .id: Dataset ID. This is used to identify the data set uniquely.
-      Example: self.id = "human_colon_2019_10x_smilie_001_10.1016/j.cell.2019.06.029"
-    - .download_url_data: Link to data download website.
-      Example: self.download_url_data = "some URL"
-    - .download_url_meta: Download link to metadata. Assumes that meta data is defined in .download_url_data if not
-      specified.
-      Example: self.download_url_meta = "some URL"
-    - .gene_id_symbols_var_key, .gene_id_ensembl_var_key: Location of gene name as gene symbol and/or ENSEMBL ID in adata.var
-      (if index of adata.var, set to "index", otherwise to column name). One of the two must be provided.
-      Example: self.gene_id_symbols_var_key = "index", self.gene_id_ensembl_var_key = "GeneID"
-    - .author: First author of publication (or list of all authors).
-      Example: self.author = "Last name, first name"  # or ["Last name, first name", "Last name, first name"]
-    - .doi: Doi of publication
-      Example: self.doi = "10.1016/j.cell.2019.06.029"
-    - .organism (or .obs_key_organism): Organism sampled.
-      Example: self.organism = "human"
-    - .sample_source (or .obs_key_sample_source): Whether data was obtained from primary tissue or cell culture
-      Example: self.sample_source = "primary_tissue"
-
-Highly recommended:
-
-    - .normalization: Normalization of count data:
-      Example: self.normalization = "raw"
-    - .organ (or .obs_key_organ): Organ sampled.
-      Example: self.organ = "liver"
-    - .assay_sc (or .obs_key_assay_sc): Protocol with which data was collected.
-      Example: self.assay_sc = "10x"
-
-Optional (if available):
-
-    - .age (or .obs_key_age): Age of individual sampled.
-      Example: self.age = 80  # (80 years old for human)
-    - .dev_stage (or .obs_key_dev_stage): Developmental stage of individual sampled.
-      Example: self.dev_stage = "mature"
-    - .ethnicity (or .obs_key_ethnicity): Ethnicity of individual sampled (only for human).
-      Example: self.ethnicity = "free text"
-    - .healthy (or .obs_key_healthy): Is the sample from a healthy individual? (bool)
-      Example: self.healthy = True
-    - .sex (or .obs_key_sex): Sex of individual sampled.
-      Example: self.sex = "male"
-    - .state_exact (or .obs_key_state_exact): Exact disease state
-      Example: self.state_exact = "free text"
-    - .obs_key_cell_types_original: Column in .obs in which free text cell type names are stored.
-      Example: self.obs_key_cell_types_original = "CellType"
-    - .year: Year of publication:
-      Example: self.year = 2019
-    - .cell_line: Which cell line was used for the experiment (for cell culture samples)
-      Example: self.cell_line = "409B2 (CVCL_K092)"
-    - .assay_differentiation: Which protocol was used for the differentiation of the cells (for cell culture samples)
-    - .assay_type_differentiation: Which protocol-type was used for the differentiation of the cells: guided or unguided
-      (for cell culture samples)
-
-How do I cache data sets?
-~~~~~~~~~~~~~~~~~~~~~~~~~
-When loading a dataset with `Dataset.load()`, you can specify if the adata object
-should be cached or not (allow_caching=True).
-If set to True, the loaded adata object will be cached as an h5ad object for faster reloading.
-
-How do I add cell type annotation?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-We are simplifying this right now; new instructions will be available in the second half of January.
-
-Why are constructor (`__init__`) and loading function (`_load`) split in the template data loader?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Initialisation and data set loading are handled separately to allow lazy loading.
-All steps that are required to load the count data and
-additional metadata should be defined solely in the `_load` section.
-Setting of class metadata such as `.doi`, `.id` etc. should be done in the constructor.
-
-How do I tell sfaira where the gene names are?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-By setting the attributes `.gene_id_symbols_var_key` or `.gene_id_ensembl_var_key` in the constructor.
-If the gene names are in the index of `adata.var`, you can set "index" as the value of these attributes.
-
-I only have gene symbols (human readable names, often abbreviations), such as HGNC or MGI, but not ENSEMBL identifiers, is that a problem?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-No, that is not a problem. They will automatically be converted to Ensembl IDs.
-You can, however, specify the reference genome in `Dataset.load(match_to_reference = ReferenceGenomeName)`
-to which the names should be mapped.
-
-I have CITE-seq data, where can I put the protein quantification?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-We will soon provide a structured interface for loading and accessing CITE-seq data;
-for now you can add it into `self.adata.obsm["CITE"]`.