From 3757a87673e0a6eba811e55f0c283719f05fcc01 Mon Sep 17 00:00:00 2001
From: "david.seb.fischer" <david.seb.fischer@gmail.com>
Date: Wed, 9 Jun 2021 14:17:35 +0200
Subject: [PATCH] removed outdated files

---
 docs/adding_dataset_classes.rst | 112 -----------------------
 docs/development.rst            |  45 ----------
 docs/using_data.rst             | 153 --------------------------------
 3 files changed, 310 deletions(-)
 delete mode 100644 docs/adding_dataset_classes.rst
 delete mode 100644 docs/development.rst
 delete mode 100644 docs/using_data.rst

diff --git a/docs/adding_dataset_classes.rst b/docs/adding_dataset_classes.rst
deleted file mode 100644
index cb499949d..000000000
--- a/docs/adding_dataset_classes.rst
+++ /dev/null
@@ -1,112 +0,0 @@
-The class-based data loader python file
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
-As an alternative to the preferred yaml-based dataloaders, users can provide a dataloader class together with the load function.
-In this scenario, meta data is described in a constructor of a class in the same python file as the loading function.
-
-1. A constructor of the following form that contains all the relevant metadata that is available before the actual dataset is loaded to memory.
-
-.. code-block:: python
-
-    def __init__(
-            self,
-            path: Union[str, None] = None,
-            meta_path: Union[str, None] = None,
-            cache_path: Union[str, None] = None,
-            **kwargs
-    ):
-        super().__init__(path=path, meta_path=meta_path, cache_path=cache_path, **kwargs)
-        # Data set meta data: You do not have to include all of these and can simply skip lines corresponding
-        # to attributes that you do not have access to. These are meta data on a sample level.
-        # The meta data attributes labeled with (*) may also be supplied per cell, see below;
-        # in this case, if you supply a .obs_key* attribute, you can leave out the sample-wise attribute.
-
-        self.id = x  # unique identifier of data set (Organism_Organ_Year_AssaySc_NumberOfDataset_FirstAuthorLastname_doi).
-
-        self.author = x  # author (list) who sampled / created the data set
-        self.doi = x  # doi of data set accompanying manuscript
-
-        self.download_url_data = x  # download website(s) of data files
-        self.download_url_meta = x  # download website(s) of meta data files
-
-        self.assay_sc = x  # (*, optional) protocol used to sample data (e.g. smart-seq2)
-        self.assay_differentiation = x  # (*, optional) protocol used to differentiate the cell line (e.g. Lancaster, 2014)
-        self.assay_type_differentiation = x  # (*, optional) type of protocol used to differentiate the cell line (guided/unguided)
-        self.cell_line = x # (*, optional) cell line used (for cell culture samples)
-        self.dev_stage = x  # (*, optional) developmental stage of organism
-        self.ethnicity = x  # (*, optional) ethnicity of sample
-        self.healthy = x  # (*, optional) whether sample represents a healthy organism
-        self.normalisation = x  # (optional) normalisation applied to raw data loaded (ideally counts, "raw")
-        self.organ = x  # (*, optional) organ (anatomical structure)
-        self.organism = x  # (*) species / organism
-        self.sample_source = x  # (*) whether the sample came from primary tissue or cell culture
-        self.sex = x  # (*, optional) sex
-        self.state_exact = x  # (*, optional) exact disease, treatment or perturbation state of sample
-        self.year = x  # year in which sample was acquired
-
-        # The following meta data may instead also be supplied on a cell level if an appropriate column is present in the
-        # anndata instance (specifically in .obs) after loading.
-        # You need to make sure this is loaded in the loading script!
-        # See above for a description of what these meta data attributes mean.
-        # Again, if these attributes are not available, you can simply leave this out.
-        self.obs_key_assay_sc = x  # (optional, see above, do not provide if .assay_sc is provided)
-        self.obs_key_assay_differentiation = x  # (optional, see above, do not provide if .assay_differentiation is provided)
-        self.obs_key_assay_type_differentiation = x  # (optional, see above, do not provide if .assay_type_differentiation is provided)
-        self.obs_key_cell_line = x # (optional, see above, do not provide if .cell_line is provided)
-        self.obs_key_dev_stage = x  # (optional, see above, do not provide if .dev_stage is provided)
-        self.obs_key_ethnicity = x  # (optional, see above, do not provide if .ethnicity is provided)
-        self.obs_key_healthy = x  # (optional, see above, do not provide if .healthy is provided)
-        self.obs_key_organ = x  # (optional, see above, do not provide if .organ is provided)
-        self.obs_key_organism = x  # (optional, see above, do not provide if .organism is provided)
-        self.obs_key_sample_source = x  # (optional, see above, do not provide if .sample_source is provided)
-        self.obs_key_sex = x  # (optional, see above, do not provide if .sex is provided)
-        self.obs_key_state_exact = x  # (optional, see above, do not provide if .state_exact is provided)
-        # Additionally, cell type annotation is ALWAYS provided per cell in .obs; this annotation is optional though.
-        # Name of the column which contains streamlined cell ontology cell type classes:
-        self.obs_key_cell_types_original = x  # (optional)
-        # This cell type annotation is free text but is mapped to an ontology via a .tsv file with the same name and
-        # directory as the python file of this data loader (see below).
-
-
-2. A function called to load the data set into memory:
-It is important to set an automated path indicating the location of the raw files here.
-Our recommendation for this directory set-up is that you define a folder in your directory structure
-in which all of these raw files will be (self.path) and then add a sub-directory named as
-`self.directory_formatted_doi` (i.e. the doi with all special characters replaced by "_") and place the raw files
-directly into this sub-directory.
-
-.. code-block:: python
-
-    def load(data_dir, fn=None) -> anndata.AnnData:
-        fn = os.path.join(data_dir, "my.h5ad")
-        adata = anndata.read_h5ad(fn)  # loading instruction into adata, use other ones if the data is not h5ad
-        return adata
-
-In summary, a python file for a mouse lung data set could look like this:
-
-.. code-block:: python
-
-    class MyDataset(DatasetBase):
-        def __init__(
-                self,
-                path: Union[str, None] = None,
-                meta_path: Union[str, None] = None,
-                cache_path: Union[str, None] = None,
-                **kwargs
-        ):
-            super().__init__(path=path, meta_path=meta_path, cache_path=cache_path, **kwargs)
-            self.author = "me"
-            self.doi = ["my preprint", "my peer-reviewed publication"]
-            self.download_url_data = "my GEO upload"
-            self.normalisation = "raw"  # because I uploaded raw counts, which is good practice!
-            self.organ = "lung"
-            self.organism = "mouse"
-            self.assay_sc = "smart-seq2"
-            self.year = "2020"
-            self.sample_source = "primary_tissue"
-
-            self.obs_key_cell_types_original = "louvain_named"  # I save my cell type names in here
-
-    def load(data_dir, fn=None) -> anndata.AnnData:
-        fn = os.path.join(data_dir, "my.h5ad")
-        adata = anndata.read_h5ad(fn)
-        return adata
diff --git a/docs/development.rst b/docs/development.rst
deleted file mode 100644
index 1d8488c31..000000000
--- a/docs/development.rst
+++ /dev/null
@@ -1,45 +0,0 @@
-Development
-===========
-
-Data zoo FAQ
-------------
-
-How are the meta data entries that I define in the constructor constrained or protected?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The items that are not free text are documented in the readthedocs data section. Often,
-these require entries to be terms in an ontology.
-If you make a mistake in defining these fields in a data loader that you contribute,
-the template test data loader and any loading operation will throw an error
-pointing at this meta data element.
-
-How is _load() used in data loading?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-`_load()` contains all processing steps that load raw data files into a ready to use adata object.
-`_load()` is wrapped in `load()`, the main loading function of a `Dataset` instance.
-This adata object can be cached as an h5ad file named after the dataset ID for faster reloading
-(if allow_caching=True). `_load()` can be triggered to reload from scratch even if cached data is available
-(if use_cached=False).
-
-How is the feature space (gene names) manipulated during data loading?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Sfaira provides both gene names and ENSEMBL IDs. Missing IDs will automatically be inferred from the gene names and
-vice versa.
-Version tags on ENSEMBL gene IDs will be removed if specified (if remove_gene_version=True);
-in this case, counts are aggregated across these features.
-Sfaira makes sure that gene IDs in a dataset match IDs of chosen reference genomes.
-
-Datasets, DatasetGroups, DatasetSuperGroups - what are they?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Dataset: Custom class that loads a specific dataset.
-DatasetGroup: A dataset group manages a collection of data loaders (multiple instances of Dataset).
-This is useful for grouping, for example, all data loaders corresponding to a certain study or a certain tissue.
-DatasetSuperGroups: A group of DatasetGroups that allows easy addition of multiple instances of DatasetGroup.
-
-Basics of sfaira lazy loading via split into constructor and _load function.
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The constructor of a dataset defines all metadata associated with this data set.
-The loading of the actual data happens in the `load()` function and not in the constructor.
-This is useful as it allows initialising the datasets and accessing dataset metadata
-without loading the actual count data.
-DatasetGroups can contain initialised Datasets and can be subsetted based on metadata
-before loading is triggered across the entire group.
diff --git a/docs/using_data.rst b/docs/using_data.rst
deleted file mode 100644
index 24f0a1cbb..000000000
--- a/docs/using_data.rst
+++ /dev/null
@@ -1,153 +0,0 @@
-Using Data
-==========
-
-.. image:: https://raw.githubusercontent.com/theislab/sfaira/master/resources/images/data_zoo.png
-   :width: 600px
-   :align: center
-
-Build data repository locally
-------------------------------
-
-Build a repository structure
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-    1. Choose a directory to dedicate to the data base, called root in the following.
-    2. Run the sfaira download script (sfaira.data.utils.download_all). Alternatively, you can manually set up a data base by making subfolders for each study.
-
-Note that the automated download is a feature of sfaira but not the core purpose of the package:
-Sfaira allows you to efficiently interact with such a local data repository.
-Some data sets cannot be automatically downloaded and need your manual intervention, which we report in the download script output.
-
-Use 3rd party repositories
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-Some organizations provide streamlined data objects that can be directly consumed by data zoos such as sfaira.
-One example of such an organization is the cellxgene_ data portal.
-Through these repositories, one can build or extend a collection of data sets that can be easily interfaced with sfaira.
-Data loaders for cellxgene structured data objects will be available soon!
-Contact us for support of any other repositories.
-
-.. _cellxgene: https://cellxgene.cziscience.com/
-
-Genome management
------------------
-
-We streamline feature spaces used by models by defining standardized gene sets that are used as model input.
-By default, sfaira currently works with the protein-coding genes of a genome assembly.
-A model topology version includes the genome it was trained for, which also defines the features of this model as genes.
-As genome assemblies are updated, model topology versions can be updated and models retrained to reflect these changes.
-Note that because protein-coding genes do not change drastically between genome assemblies,
-samples can be carried over to assemblies they were not aligned against by matching gene identifiers.
-Sfaira automatically tries to match gene identifiers to the genome assembly selected through the current model.
-
-FAQ
----
-
-How is the dataset’s ID structured?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Organism_Organ_Year_AssaySc_NumberOfDataset_FirstAuthorLastname_doi
-
-How do I assemble the data set ID if some of its element meta data are not unique?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The data set ID is designed to be a unique identifier of a data set.
-Therefore, it is not an issue if it does not capture the full complexity of the data.
-Simply choose the meta data value out of the list of corresponding values which comes first in the alphabet.
-
-What are cell-wise and sample-wise meta data?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Metadata can be set on a per sample level or, in some cases, per cell.
-Sample-wise meta data can be directly set in the constructor (e.g. self.organism = "human").
-Cell-wise metadata can be provided in `.obs` of the loaded data; in this case,
-a Dataset attribute contains the name of the `.obs` column that contains these cell-wise labels
-(e.g. self.obs_key_organism).
-Note that sample-wise meta data should be set as such and not as a column in `.obs` to simplify loading.
-
-Which meta data objects are mandatory?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Mandatory on sample (self.attribute) or cell level (self.obs_key_attribute):
-
-    - .id: Dataset ID. This is used to identify the data set uniquely.
-        Example: self.id = "human_colon_2019_10x_smilie_001_10.1016/j.cell.2019.06.029"
-    - .download_url_data: Link to data download website.
-        Example: self.download_url_data = "some URL"
-    - .download_url_meta: Download link to metadata. Assumes that meta data is defined in .download_url_data if not
-        specified.
-        Example: self.download_url_meta = "some URL"
-    - .gene_id_symbols_var_key, .gene_id_ensembl_var_key: Location of gene name as gene symbol and/or ENSEMBL ID in adata.var
-        (if index of adata.var, set to "index", otherwise to column name). One of the two must be provided.
-        Example: self.gene_id_symbols_var_key = "index", self.gene_id_ensembl_var_key = "GeneID"
-    - .author: First author of publication (or list of all authors).
-        Example: self.author = "Last name, first name"  # or ["Last name, first name", "Last name, first name"]
-    - .doi: Doi of publication
-        Example: self.doi = "10.1016/j.cell.2019.06.029"
-    - .organism (or .obs_key_organism): Organism sampled.
-        Example: self.organism = "human"
-    - .sample_source (or .obs_key_sample_source): Whether data was obtained from primary tissue or cell culture
-        Example: self.sample_source = "primary_tissue"
-
-Highly recommended:
-
-    - .normalization: Normalization of count data:
-        Example: self.normalization = "raw"
-    - .organ (or .obs_key_organ): Organ sampled.
-        Example: self.organ = "liver"
-    - .assay_sc (or .obs_key_assay_sc): Protocol with which data was collected.
-        Example: self.assay_sc = "10x"
-
-Optional (if available):
-
-    - .age (or .obs_key_age): Age of individual sampled.
-        Example: self.age = 80  # (80 years old for human)
-    - .dev_stage (or .obs_key_dev_stage): Developmental stage of individual sampled.
-        Example: self.dev_stage = "mature"
-    - .ethnicity (or .obs_key_ethnicity): Ethnicity of individual sampled (only for human).
-        Example: self.ethnicity = "free text"
-    - .healthy (or .obs_key_healthy): Whether the sample is from a healthy individual (bool).
-        Example: self.healthy = True
-    - .sex (or .obs_key_sex): Sex of individual sampled.
-        Example: self.sex = "male"
-    - .state_exact (or .obs_key_state_exact): Exact disease state
-        Example: self.state_exact = "free text"
-    - .obs_key_cell_types_original: Column in .obs in which free text cell type names are stored.
-        Example: self.obs_key_cell_types_original = 'CellType'
-    - .year: Year of publication:
-        Example: self.year = 2019
-    - .cell_line: Which cell line was used for the experiment (for cell culture samples)
-        Example: self.cell_line = "409B2 (CVCL_K092)"
-    - .assay_differentiation: Which protocol was used for the differentiation of the cells (for cell culture samples)
-    - .assay_type_differentiation: Which protocol-type was used for the differentiation of the cells: guided or unguided
-        (for cell culture samples)
-
-How do I cache data sets?
-~~~~~~~~~~~~~~~~~~~~~~~~~
-When loading a dataset with `Dataset.load()`, you can specify whether the adata object
-should be cached (allow_caching=True).
-If set to True, the loaded adata object will be cached as an h5ad object for faster reloading.
-
-How do I add cell type annotation?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-We are simplifying this right now, new instructions will be available second half of January.
-
-Why are constructor (`__init__`) and loading function (`_load`) split in the template data loader?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Initialisation and data set loading are handled separately to allow lazy loading.
-All steps that are required to load the count data and
-additional metadata should be defined solely in the `_load` section.
-Setting of class metadata such as `.doi`, `.id` etc. should be done in the constructor.
-
-How do I tell sfaira where the gene names are?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-By setting the attributes `.gene_id_symbols_var_key` or `.gene_id_ensembl_var_key` in the constructor.
-If the gene names are in the index of `adata.var`, you can set "index" as the value of these attributes.
-
-I only have gene symbols (human readable names, often abbreviations), such as HGNC or MGI, but not ENSEMBL identifiers, is that a problem?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-No, that is not a problem. They will automatically be converted to Ensembl IDs.
-You can, however, specify the reference genome to which the names should be mapped
-via `Dataset.load(match_to_reference = ReferenceGenomeName)`.
-
-I have CITE-seq data, where can I put the protein quantification?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-We will soon provide a structured interface for loading and accessing CITE-seq data;
-for now you can add it into `self.adata.obsm["CITE"]`.