neurips loader update (#472)
* neurips loader update
* added documentation of compressed and r file reading

Co-authored-by: davidsebfischer <david.seb.fischer@gmail.com>
xlancelottx and davidsebfischer authored Feb 8, 2022
1 parent 07353d7 commit 1c6d08a
Showing 9 changed files with 225 additions and 28 deletions.
133 changes: 129 additions & 4 deletions docs/adding_datasets.rst
@@ -242,6 +242,11 @@ Phase 1 is sub-structured into 2 sub-phases:
all files associated with the current dataset.
The CLI tells you how to continue from here, phase 1b) is always necessary, phase 2) is case-dependent and mistakes
in naming the data folder in phase Pd) are flagged here.
As indicated at appropriate places by the CLI, some meta data are ontology constrained.
You should input symbols, i.e. readable words and not IDs, in these places.
For example, the `.yaml` entry ``organ`` could be "lung", which is a symbol in the UBERON ontology,
whereas ``organ_obs_key`` could be any string pointing to a column in the ``.obs`` in the ``anndata`` instance
that is output by ``load()``, where the elements of the column are then mapped to UBERON terms in phase 2.

1a-docker.
.. code-block::
@@ -259,13 +264,40 @@ Phase 1 is sub-structured into 2 sub-phases:
sfaira create-dataloader --path-data DATA_DIR
..
1b. Manual completion of created files (manual).
1. Correct yaml file.
1. Correct the `.yaml` file.
Correct errors in `<path_loader>/<DOI-name>/ID.yaml` file and add
further attributes you may have forgotten in step 2.
See :ref:`sec-multiple-files` for short-cuts if you have multiple data sets.
This step can be skipped if the `.yaml` is complete after phase 1a).
2. Write load function.
Complete the `load()` function in `<path_loader>/<DOI-name>/ID.py`.
Note on lists and dictionaries in the yaml file format:
Sometimes, you need to write a list in yaml, e.g. because you have multiple data URLs.
A list looks as follows:
.. code-block::
# Single URL:
download_url_data: "URL1"
# Two URLs:
download_url_data:
- "URL1"
- "URL2"
..
As suggested in this example, do not use lists of length 1.
In contrast, you may need to map each entry of ``sample_fns`` to a meta data value in multi-file loaders:
.. code-block::
sample_fns:
- "FN1"
- "FN2"
[...]
assay_sc:
FN1: 10x 3' v2
FN2: 10x 3' v3
..
Take particular care with the usage of quotes and ":" when using maps as outlined in this example.
2. Complete the load function.
Complete the ``load()`` function in `<path_loader>/<DOI-name>/ID.py`.
If you need to read compressed files directly from python, consider our guide :ref:`sec-reading-compressed-files`.
If you need to read R files directly from python, consider our guide :ref:`sec-reading-r-files`.
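
The relation between the ``sample_fns`` list and a per-sample map such as ``assay_sc`` can be sanity-checked in plain Python. The following is a minimal sketch, not part of sfaira: the helper name ``check_sample_map`` is hypothetical, and the list and dictionary are the Python equivalents of the yaml list and map from the example above.

```python
def check_sample_map(sample_fns, sample_map):
    """Compare the keys of a per-sample meta data map with sample_fns.

    Returns a tuple (missing, unknown): file names without an entry in the
    map, and map keys that are not listed in sample_fns.
    This helper is hypothetical and only illustrates the yaml structure.
    """
    missing = [fn for fn in sample_fns if fn not in sample_map]
    unknown = [key for key in sample_map if key not in sample_fns]
    return missing, unknown


# Python equivalents of the yaml list and map shown above.
sample_fns = ["FN1", "FN2"]
assay_sc = {"FN1": "10x 3' v2", "FN2": "10x 3' v3"}

missing, unknown = check_sample_map(sample_fns, assay_sc)
```

If both returned lists are empty, every file name in ``sample_fns`` has exactly one matching key in the map.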

Phase 2: annotate
~~~~~~~~~~~~~~~~~~~
@@ -341,6 +373,9 @@ Phase 2 is sub-structured into 2 sub-phases:
If you accidentally replace it with `" "`, you will receive errors in phase 3, so do a visual check after finishing
your work on each `ID*.tsv` file.

Note 3: Perfect matches are filled without further suggestions;
you can often leave these rows as they are after a brief sanity check.

.. _OLS: https://www.ebi.ac.uk/ols/ontologies/cl
Phase 3: finalize
@@ -596,6 +631,83 @@ You can use any combination of orthogonal meta data, e.g. organ and disease anno
which are all direct outputs of V(D)J alignment pipelines and are stored in ``.obs``.
These features are documented in :ref:`feature-wise`.

.. _sec-reading-compressed-files:
Reading compressed files
~~~~~~~~~~~~~~~~~~~~~~~~~

This is a collection of code snippets that can be used in the ``load()`` function to read compressed download files.
See also the anndata_ and scanpy_ IO documentation.

- Read a .gz compressed .mtx (.mtx.gz):
Note that this often occurs in cellranger output, for which there is a scanpy load function that
applies to data of the following structure: ``./PREFIX_matrix.mtx.gz``, ``./PREFIX_barcodes.tsv.gz``, and
``./PREFIX_features.tsv.gz``. This can be read as:

.. code-block:: python
import scanpy
adata = scanpy.read_10x_mtx("./", prefix="PREFIX_")
..
- Read from within a .gz archive (.gz):
Note: this requires temporary files, so avoid if read_function can read directly from .gz.

.. code-block:: python
    import gzip
    from tempfile import TemporaryDirectory
    import shutil
    # Insert the file type as a string here so that read_function recognizes the decompressed file:
    uncompressed_file_type = ""
    with TemporaryDirectory() as tmpdir:
        tmppth = tmpdir + f"/decompressed.{uncompressed_file_type}"
        with gzip.open(fn, "rb") as input_f, open(tmppth, "wb") as output_f:
            shutil.copyfileobj(input_f, output_f)
        x = read_function(tmppth)
..
- Read from within a .tar archive (.tar.gz):
It is often useful to decompress the tar archive once manually to understand its internal directory structure.
Let's assume you are interested in a file ``fn_target`` within a tar archive ``fn_tar``,
i.e. after decompressing the tar the directory is ``<fn_tar>/<fn_target>``.

.. code-block:: python
    import pandas
    import tarfile
    with tarfile.open(fn_tar) as tar:
        # Access files in archive with tar.extractfile(fn_target), e.g.
        tab = pandas.read_csv(tar.extractfile(fn_target))
..
.. _anndata: https://anndata.readthedocs.io/en/latest/api.html#reading
.. _scanpy: https://scanpy.readthedocs.io/en/stable/api.html#reading
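
The ``.gz`` and ``.tar.gz`` recipes above can be tried end to end on small files created on the fly. The following is a minimal sketch that only uses the Python standard library: the helper names ``read_gz_via_tempfile``, ``read_member_from_tar`` and ``read_text`` are hypothetical, and ``read_text`` merely stands in for ``read_function``, which in a real loader would be e.g. an anndata or pandas reader.

```python
import gzip
import io
import os
import shutil
import tarfile
from tempfile import TemporaryDirectory


def read_gz_via_tempfile(fn, read_function, uncompressed_file_type):
    """Decompress fn into a temporary file and pass its path to read_function."""
    with TemporaryDirectory() as tmpdir:
        tmppth = os.path.join(tmpdir, f"decompressed.{uncompressed_file_type}")
        with gzip.open(fn, "rb") as input_f, open(tmppth, "wb") as output_f:
            shutil.copyfileobj(input_f, output_f)
        # Read while the temporary directory still exists.
        return read_function(tmppth)


def read_member_from_tar(fn_tar, fn_target):
    """Return the raw bytes of the file fn_target inside the tar archive fn_tar."""
    with tarfile.open(fn_tar) as tar:
        return tar.extractfile(fn_target).read()


def read_text(path):
    # Stand-in for read_function (assumption for this demonstration only).
    with open(path, "r") as f:
        return f.read()


# Self-contained demonstration on small files created on the fly.
with TemporaryDirectory() as workdir:
    fn_gz = os.path.join(workdir, "table.csv.gz")
    with gzip.open(fn_gz, "wb") as f:
        f.write(b"gene,count\nA,1\n")
    gz_content = read_gz_via_tempfile(fn_gz, read_text, "csv")

    fn_tar = os.path.join(workdir, "archive.tar.gz")
    payload = b"gene,count\nB,2\n"
    with tarfile.open(fn_tar, "w:gz") as tar:
        info = tarfile.TarInfo(name="archive/table.csv")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    tar_content = read_member_from_tar(fn_tar, "archive/table.csv")
```

Note that the read call sits inside the ``with TemporaryDirectory()`` block, because the decompressed file is deleted as soon as that block is left.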

.. _sec-reading-r-files:
Reading R files
~~~~~~~~~~~~~~~~

Some studies deposit single-cell data in R language files, e.g. ``.rdata``, ``.Rds`` or Seurat objects.
These objects can be read with python functions in sfaira using anndata2ri and rpy2.
These modules allow you to run R code from within this python code:

.. code-block:: python
    import os

    def load(data_dir, **kwargs):
        import anndata2ri
        from rpy2.robjects import r
        fn = os.path.join(data_dir, "SOME_FILE.rdata")
        # Activate the automatic conversion of SingleCellExperiment objects to anndata:
        anndata2ri.activate()
        adata = r(
            f"library(Seurat)\n"
            f"load('{fn}')\n"
            f"new_obj = CreateSeuratObject(counts = tissue@raw.data)\n"
            f"new_obj@meta.data = tissue@meta.data\n"
            f"as.SingleCellExperiment(new_obj)\n"
        )
        return adata
..

Loading third party annotation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -631,6 +743,7 @@ Here an example of a `.py` file with additional annotation:
"meta_study_y": load_annotation_meta_study_y,
}
..
The table returned by `load_annotation_meta_study_x` needs to be indexed with the observation names used in `.adata`,
the object generated in `load()`.
@@ -748,6 +861,11 @@ Note that in both cases the value, or the column values, have to fulfill constra
- feature_reference and feature_reference_var_key [string]
The genome annotation release that was used to quantify the features presented here,
e.g. "Homo_sapiens.GRCh38.105".
You can find all ENSEMBL gtf files on the ensembl_ ftp server.
Here, you'll find a summary of the gtf files by release, e.g. for 105_.
You will find a list across organisms for this release, the target release name is the name of the gtf files that
ends on ``.RELEASE.gtf.gz`` under the corresponding organism.
For homo_sapiens_ and release 105, this yields the following reference name "Homo_sapiens.GRCh38.105".
- feature_type and feature_type_var_key {"rna", "protein", "peak"}
The type of a feature:

@@ -758,6 +876,10 @@ Note that in both cases the value, or the column values, have to fulfill constra
- "peak": chromatin accessibility by peak
e.g. from scATAC-seq

.. _ensembl: http://ftp.ensembl.org/pub/
.. _105: http://ftp.ensembl.org/pub/release-105/gtf/
.. _homo_sapiens: http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/
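
The naming rule for ``feature_reference`` described above, i.e. taking the gtf file name that ends on ``.RELEASE.gtf.gz`` and dropping the file extension, can be written as a small helper. This is a minimal sketch under that reading of the rule; the function name ``reference_from_gtf_name`` is hypothetical and not part of sfaira.

```python
def reference_from_gtf_name(gtf_fn, release):
    """Derive the reference name from an ENSEMBL gtf file name.

    Only file names ending on '.<release>.gtf.gz' are accepted, so that other
    gtf files in the same directory are rejected instead of silently yielding
    a wrong reference name.
    """
    suffix = f".{release}.gtf.gz"
    if not gtf_fn.endswith(suffix):
        raise ValueError(f"expected a file name ending on {suffix}, got {gtf_fn}")
    # Drop only the file extension, the release stays part of the name.
    return gtf_fn[: -len(".gtf.gz")]


reference = reference_from_gtf_name("Homo_sapiens.GRCh38.105.gtf.gz", 105)
```

For the human gtf file of release 105, this yields the reference name given in the text above.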

.. _sec-dataset-or-observation-wise:
Dataset- or observation-wise
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -824,8 +946,11 @@ outlined below.
The UBERON_ label of the sample.
This meta data item ontology is for tissue or organ identifiers from UBERON.
- organism and organism_obs_key. [ontology term]
The NCBItaxon_ label of the sample.
The NCBItaxon_ label of the main organism sampled here.
For a data matrix of an infection sample aligned against a human and virus joint reference genome,
this would be "Homo sapiens" as it is the "main organism" in this case.
For example, "Homo sapiens" or "Mus musculus".
See also the documentation of feature_reference to see which organisms are supported.
- primary_data [bool]
Whether the dataset contains cells that were measured in this study (i.e. this is not a meta study on published data).
- sample_source and sample_source_obs_key. {"primary_tissue", "2d_culture", "3d_culture", "tumor"}
11 changes: 9 additions & 2 deletions sfaira/commands/create_dataloader.py
@@ -315,7 +315,10 @@ def format_q_mat_key(attr) -> str:
'If these meta data vary across cells in a data set, skip them here and annotate them in the next '
'section. '
'These items can also later be modified manually in the .yaml which has the same '
'effect as setting them here.')
'effect as setting them here. '
'A lot of these meta data are ontology constrained. '
'You should input symbols, ie. readable words and not IDs here. '
'You can look up term symbols here https://www.ebi.ac.uk/ols/index.')

def format_q_uns_key(attr, onto) -> str:
return f"Dataset-wide {attr} annotation (from {onto})"
@@ -324,14 +327,18 @@ def format_q_uns_key(attr, onto) -> str:
function='text',
question=format_q_uns_key("assay", "EFO"),
default='')
self.template_attributes.cell_type = sfaira_questionary(
function='text',
question=format_q_uns_key("cell type", "CL (Cell ontology)"),
default='')
self.template_attributes.development_stage = sfaira_questionary(
function='text',
question=format_q_uns_key("developmental stage", "hsapdv for human, mmusdv for mouse"),
default='')
self.template_attributes.disease = sfaira_questionary(
function='text',
question=format_q_uns_key("disease", "MONDO"),
default='healthy')
default='')
self.template_attributes.ethnicity = sfaira_questionary(
function='text',
question=format_q_uns_key("ethnicity", "HANCESTRO for human, skip for non-human"),
1 change: 0 additions & 1 deletion sfaira/commands/test_dataloader.py
@@ -4,7 +4,6 @@
import shutil
import pydoc

from rich import print
from sfaira.consts.utils import clean_doi
from sfaira.data import DatasetGroupDirectoryOriented

@@ -1,6 +1,6 @@
import os
import pandas as pd
import scanpy
import scanpy as sc

# This is provided in plain text on GEO
sample_map = {"Sample1": "nCoV 1 scRNA-seq",
@@ -27,7 +27,7 @@


def load(data_dir, sample_fn, **kwargs):
adata = scanpy.read_10x_mtx(data_dir, prefix="GSE149689_")
adata = sc.read_10x_mtx(data_dir, prefix="GSE149689_")
adata.obs["sample"] = "Sample" + adata.obs.index.str.split("-").str[1]
adata.obs["GEO upload info"] = adata.obs["sample"].map(sample_map)

@@ -1,12 +1,19 @@
import anndata
import gzip
import os
import shutil
from tempfile import TemporaryDirectory


def load(data_dir, sample_fn, **kwargs):
fn = os.path.join(data_dir, sample_fn)
adata = anndata.read(fn)
adata.X = adata.layers["counts"]
adata.obs["donor"] = ["d" + x.split("d")[1] for x in adata.obs["batch"].values]
adata.obs["site"] = [x.split("d")[0] for x in adata.obs["batch"].values]

with TemporaryDirectory() as tmpdir:
tmppth = tmpdir + "/decompressed.h5ad"
with gzip.open(fn, "rb") as input_f, open(tmppth, "wb") as output_f:
shutil.copyfileobj(input_f, output_f)
adata = anndata.read_h5ad(tmppth)
adata.var["feature_types"] = [
{"ATAC": "peak", "GEX": "rna", "ADT": "protein"}[x]
for x in adata.var["feature_types"].values
]
return adata
@@ -1,34 +1,36 @@
dataset_structure:
dataset_index: 1
sample_fns:
- "cite/cite_gex_processed_training.h5ad"
- "multiome/multiome_gex_processed_training.h5ad"
- "GSE194122_openproblems_neurips2021_cite_BMMC_processed.h5ad.gz"
- "GSE194122_openproblems_neurips2021_multiome_BMMC_processed.h5ad.gz"
dataset_wise:
author: "Luecken, Malte"
default_embedding:
default_embedding: "GEX_X_umap"
doi_preprint:
doi_journal: "no_doi_luecken"
download_url_data: "s3://openproblems-bio/public/explore"
download_url_data:
- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE194122&format=file&file=GSE194122%5Fopenproblems%5Fneurips2021%5Fcite%5FBMMC%5Fprocessed%2Eh5ad%2Egz"
- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE194122&format=file&file=GSE194122%5Fopenproblems%5Fneurips2021%5Fmultiome%5FBMMC%5Fprocessed%2Eh5ad%2Egz"
download_url_meta:
primary_data: True
year: 2021
layers:
layer_counts: "X"
layer_processed:
layer_counts: "counts"
layer_processed: "X"
layer_spliced_counts:
layer_spliced_processed:
layer_unspliced_counts:
layer_unspliced_processed:
layer_velocity:
dataset_or_feature_wise:
feature_reference:
feature_reference: "Homo_sapiens.GRCh38.98"
feature_reference_var_key:
feature_type: "rna"
feature_type_var_key:
feature_type:
feature_type_var_key: "feature_types"
dataset_or_observation_wise:
assay_sc:
cite/cite_gex_processed_training.h5ad: "10x 3' v3"
multiome/multiome_gex_processed_training.h5ad: "10x 3' v3"
GSE194122_openproblems_neurips2021_cite_BMMC_processed.h5ad.gz: "CITE-seq (cell surface protein profiling)"
GSE194122_openproblems_neurips2021_multiome_BMMC_processed.h5ad.gz: "10x multiome"
assay_sc_obs_key:
assay_differentiation:
assay_differentiation_obs_key:
@@ -45,7 +47,7 @@ dataset_or_observation_wise:
disease: "healthy"
disease_obs_key:
ethnicity:
ethnicity_obs_key:
ethnicity_obs_key: "Ethnicity"
gm:
gm_obs_key:
individual:
@@ -57,11 +59,11 @@ dataset_or_observation_wise:
sample_source: "primary_tissue"
sample_source_obs_key:
sex:
sex_obs_key:
sex_obs_key: "DonorGender"
source_doi:
source_doi_obs_key:
state_exact:
state_exact_obs_key:
state_exact_obs_key: "DonorSmoker"
tech_sample:
tech_sample_obs_key: "site*donor"
treatment:
@@ -0,0 +1,51 @@
source target target_id
B1 B B cell CL:0000236
B1 B IGKC+ B cell CL:0000236
B1 B IGKC- B cell CL:0000236
CD14+ Mono monocyte CL:0000576
CD16+ Mono monocyte CL:0000576
CD4+ T CD314+ CD45RA+ CD4-positive, alpha-beta T cell CL:0000624
CD4+ T activated activated CD4-positive, alpha-beta T cell CL:0000896
CD4+ T activated integrinB7+ activated CD4-positive, alpha-beta T cell CL:0000896
CD4+ T naive CD4-positive, alpha-beta T cell CL:0000624
CD8+ T CD8-positive, alpha-beta T cell CL:0000625
CD8+ T CD49f+ CD8-positive, alpha-beta T cell CL:0000625
CD8+ T CD57+ CD45RA+ CD8-positive, alpha-beta T cell CL:0000625
CD8+ T CD57+ CD45RO+ CD8-positive, alpha-beta T cell CL:0000625
CD8+ T CD69+ CD45RA+ CD8-positive, alpha-beta T cell CL:0000625
CD8+ T CD69+ CD45RO+ CD8-positive, alpha-beta T cell CL:0000625
CD8+ T TIGIT+ CD45RA+ CD8-positive, alpha-beta T cell CL:0000625
CD8+ T TIGIT+ CD45RO+ CD8-positive, alpha-beta T cell CL:0000625
CD8+ T naive CD8-positive, alpha-beta T cell CL:0000625
CD8+ T naive CD127+ CD26- CD101- CD8-positive, alpha-beta T cell CL:0000625
Erythroblast erythroblast CL:0000765
G/M prog granulocyte monocyte progenitor cell CL:0000557
HSC hematopoietic stem cell CL:0000037
ID2-hi myeloid prog common myeloid progenitor CL:0000049
ILC lymphocyte CL:0000542
ILC1 lymphocyte CL:0000542
Lymph prog early lymphoid progenitor CL:0000936
MAIT mucosal invariant T cell CL:0000940
MK/E prog megakaryocyte-erythroid progenitor cell CL:0000050
NK natural killer cell CL:0000623
NK CD158e1+ natural killer cell CL:0000623
Naive CD20+ B naive B cell CL:0000788
Naive CD20+ B IGKC+ naive B cell CL:0000788
Naive CD20+ B IGKC- naive B cell CL:0000788
Normoblast erythroblast CL:0000765
Plasma cell plasma cell CL:0000786
Plasma cell IGKC+ plasma cell CL:0000786
Plasma cell IGKC- plasma cell CL:0000786
Plasmablast IGKC+ plasmablast CL:0000980
Plasmablast IGKC- plasmablast CL:0000980
Proerythroblast proerythroblast CL:0000547
Reticulocyte reticulocyte CL:0000558
T prog cycling T cell CL:0000084
T reg regulatory T cell CL:0000815
Transitional B transitional stage B cell CL:0000818
cDC1 conventional dendritic cell CL:0000990
cDC2 conventional dendritic cell CL:0000990
dnT double negative thymocyte CL:0002489
gdT gamma-delta T cell CL:0000798
gdT gamma-delta T cell CL:0000798
pDC plasmacytoid dendritic cell CL:0000784