Releases: Munfred/wormcells-data
packer2019taylor2020bendavid2021_scvi_model_v0.11.0
Model for scvi-tools v0.11.0 integrating three C. elegans datasets.
More information at https://wormbase.github.io/single-cell/
Short Name | Total cells | Method | h5ad | Summary | Article/preprint | Original Data | Notes |
---|---|---|---|---|---|---|---|
Taylor 2020 | 100,955 | 10x v2/v3 | Download at Caltech Data | L4 larvae neurons selected via flow cytometry | Molecular topography of an entire nervous system. | GSE136049 | CeNGEN website Shiny R app to explore the data |
Ben-David 2021 | 55,508 | 10x v2 | Download at Caltech Data | L2 larvae | Whole-organism mapping of the genetics of gene expression at cellular resolution biorxiv 2020. | PRJNA658829 | Gene count matrix was kindly provided by the authors on request |
Packer 2019 | 89,701 | 10x v2 | Download at Caltech Data | Several timepoints of embryo development | A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution Science 2019. | GSE126954 | VisCello app for data exploration |
Taylor et al: Cengen 2020 data release 100955 cells
Data from the Cengen 2020 preprint https://www.biorxiv.org/content/10.1101/2020.12.15.422897v1
These are the counts as output by cellranger, without the SoupX modifications. The 100955 barcodes that were labeled as cells were retained, including both neuronal and non-neuronal cells.
AnnData object with n_obs × n_vars = 100955 × 46911
obs: 'dropbox_id', 'counts', 'experiment_code', 'cell_type', 'tissue'
print(adata.obs.head())
dropbox_id counts experiment_code cell_type \
1806-ST-1-AAACCTGAGAGACGAA 1806-ST-1 65 Pan-1 Unannotated
1806-ST-1-AAACCTGAGGTAAACT 1806-ST-1 367 Pan-1 AVF
1806-ST-1-AAACCTGAGGTAGCCA 1806-ST-1 1792 Pan-1 AVH
1806-ST-1-AAACCTGAGTAACCCT 1806-ST-1 1229 Pan-1 RIA
1806-ST-1-AAACCTGAGTACGCGA 1806-ST-1 1401 Pan-1 AUA
tissue
1806-ST-1-AAACCTGAGAGACGAA Unannotated
1806-ST-1-AAACCTGAGGTAAACT Neuron
1806-ST-1-AAACCTGAGGTAGCCA Neuron
1806-ST-1-AAACCTGAGTAACCCT Neuron
1806-ST-1-AAACCTGAGTACGCGA Neuron
print(adata.var.head())
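Because both neuronal and non-neuronal barcodes were retained, a common first step is to subset on the `tissue` column. The following is a minimal sketch, not part of the original release notes: it uses a small pandas stand-in for `adata.obs` built from the five rows printed above, so the variable names `obs`, `is_neuron` and `neurons` are illustrative only.

```python
import pandas as pd

# Toy stand-in for adata.obs, copied from the five rows printed above.
obs = pd.DataFrame(
    {
        "cell_type": ["Unannotated", "AVF", "AVH", "RIA", "AUA"],
        "tissue": ["Unannotated", "Neuron", "Neuron", "Neuron", "Neuron"],
    },
    index=[
        "1806-ST-1-AAACCTGAGAGACGAA",
        "1806-ST-1-AAACCTGAGGTAAACT",
        "1806-ST-1-AAACCTGAGGTAGCCA",
        "1806-ST-1-AAACCTGAGTAACCCT",
        "1806-ST-1-AAACCTGAGTACGCGA",
    ],
)

# Boolean mask selecting the barcodes annotated as neurons.
is_neuron = obs["tissue"] == "Neuron"
neurons = obs[is_neuron]

print(neurons["cell_type"].tolist())
```

With the real file loaded as an AnnData object, the same mask should be usable to subset the whole object, for example `adata[is_neuron.to_numpy()]`.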
bendavid2020
bendavid2020.h5ad
AnnData object with n_obs × n_vars = 55508 × 20138
obs: 'experiment', 'neuronal_subtype', 'barcode', 'study', 'cell_class', 'cell_type'
var: 'wbps_gene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'external_gene_id', 'external_transcript_id', 'wormbase_locus', 'wormbase_gseq', 'gene_short_name', 'gene_name'
obs entries look like so:
experiment neuronal_subtype cell_class \
barcode
F4_1_TGTAACGGTTAGCTAC-1 F4_1 nan Intestine
F4_1_GGCAGTCCAGCCTATA-1 F4_1 nan Intestine
F4_1_AAGTACCGTCATCCCT-1 F4_1 nan Somatic Gonad
F4_1_AAGATAGTCCCTCTAG-1 F4_1 nan Intestine
F4_1_ACCAAACCAGCTGTAT-1 F4_1 nan Pharynx and Arcade Cells
cell_type
barcode
F4_1_TGTAACGGTTAGCTAC-1 Intestine
F4_1_GGCAGTCCAGCCTATA-1 Intestine
F4_1_AAGTACCGTCATCCCT-1 Somatic Gonad
F4_1_AAGATAGTCCCTCTAG-1 Intestine
F4_1_ACCAAACCAGCTGTAT-1 Pharynx and Arcade Cells
var entries look like so:
wbps_gene_id chromosome_name start_position end_position \
gene_id
WBGene00010957 WBGene00010957 MtDNA 113 549
WBGene00010958 WBGene00010958 MtDNA 549 783
WBGene00010959 WBGene00010959 MtDNA 1763 2635
WBGene00010960 WBGene00010960 MtDNA 2634 3235
WBGene00010961 WBGene00010961 MtDNA 3389 4269
strand external_gene_id external_transcript_id wormbase_locus \
gene_id
WBGene00010957 1 nduo-6 MTCE.3.1 nduo-6
WBGene00010958 1 ndfl-4 MTCE.4.1 ndfl-4
WBGene00010959 1 nduo-1 MTCE.11.1 nduo-1
WBGene00010960 1 atp-6 MTCE.12.1 atp-6
WBGene00010961 1 nduo-2 MTCE.16.1 nduo-2
wormbase_gseq gene_short_name gene_name
gene_id
WBGene00010957 MTCE.3 nduo-6 nduo-6
WBGene00010958 MTCE.4 ndfl-4 ndfl-4
WBGene00010959 MTCE.11 nduo-1 nduo-1
WBGene00010960 MTCE.12 atp-6 atp-6
WBGene00010961 MTCE.16 nduo-2 nduo-2
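Since `var` is indexed by WormBase gene ID and carries the gene names as columns, translating between the two is a simple lookup. This is a hedged sketch on a toy copy of the five `var` rows shown above; the names `id_to_name` and `name_to_id` are illustrative, and it assumes gene names are unique, which holds for these five rows but is not checked for the full file.

```python
import pandas as pd

# Toy stand-in for adata.var, using the five mitochondrial genes shown above.
var = pd.DataFrame(
    {
        "chromosome_name": ["MtDNA"] * 5,
        "gene_name": ["nduo-6", "ndfl-4", "nduo-1", "atp-6", "nduo-2"],
    },
    index=pd.Index(
        [
            "WBGene00010957",
            "WBGene00010958",
            "WBGene00010959",
            "WBGene00010960",
            "WBGene00010961",
        ],
        name="gene_id",
    ),
)

# Map WormBase gene IDs to gene names, and back.
id_to_name = var["gene_name"].to_dict()
name_to_id = {name: gene_id for gene_id, name in id_to_name.items()}

print(id_to_name["WBGene00010960"])  # atp-6
print(name_to_id["nduo-1"])          # WBGene00010959
```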
packer2019.h5ad
Packer 2019 C. elegans 10xv2 data
Original article:
A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution
https://science.sciencemag.org/content/365/6459/eaax1971.long
Data on GEO:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126954
The data is annotated with the following obs fields (one example shown):
index AAACCTGAGACAATAC-300.1.1
cell AAACCTGAGACAATAC-300.1.1
n.umi 1630
time.point 300_minutes
batch Waterston_300_minutes
Size_Factor 1.02319
cell.type Body_wall_muscle
cell.subtype BWM_head_row_1
plot.cell.type BWM_head_row_1
raw.embryo.time 360
embryo.time 380
embryo.time.bin 330-390
raw.embryo.time.bin 330-390
lineage MSxpappp
passed_initial_QC_or_later_whitelisted True
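The `embryo.time.bin` and `raw.embryo.time.bin` fields are strings, so they need parsing before they can be compared with the numeric time fields. The helper functions below are hypothetical and only a sketch: they assume labels of the form `start-end` as in the example cell, treat both ends as inclusive (an assumption, the source does not say), and do not handle any other label form.

```python
def parse_time_bin(label):
    """Split a bin label such as '330-390' into integer (start, end) minutes."""
    start, end = label.split("-")
    return int(start), int(end)


def in_bin(time, label):
    """Return True if `time` lies inside the bin, treating both ends as inclusive."""
    start, end = parse_time_bin(label)
    return start <= time <= end


# Values taken from the example cell shown above.
print(parse_time_bin("330-390"))
print(in_bin(380, "330-390"))  # embryo.time vs embryo.time.bin
print(in_bin(360, "330-390"))  # raw.embryo.time vs raw.embryo.time.bin
```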
eyal.h5ad
Data from the preprint "Whole-organism mapping of the genetics of gene expression at cellular resolution"
by Eyal Ben-David, James Boocock, Longhua Guo, Stefan Zdraljevic, Joshua S. Bloom, and Leonid Kruglyak
https://doi.org/10.1101/2020.08.23.263798
https://www.biorxiv.org/content/10.1101/2020.08.23.263798v1
AnnData object with n_obs × n_vars = 55508 × 20138
obs: 'Batch', 'Size_Factor', 'cell_type', 'neuronal_subtype', 'total', 'barcode', 'doublet'
var: 'wbps_gene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'external_gene_id', 'external_transcript_id', 'wormbase_locus', 'wormbase_gseq', 'gene_short_name', 'gene_name'
adata.obs['Batch'].value_counts()
F4_5 11633
F4_4 11464
F4_2 11424
F4_1 11336
F4_3 9651
adata.obs['cell_type'].value_counts()
Hypodermis 13219
Body Wall Muscle 9630
Intestine 4859
Pharynx and Arcade Cells 2842
Seam Cells 2717
Glia 2034
Germline 1373
Coelomocytes 1151
Somatic Gonad 951
VA 872
DD_VD 705
Pharyngeal Gland Cells 697
VB 622
Excretory Gland 586
Unknown 528
AVK 498
GLR 486
Vulval Precursor Cells 471
DA 465
Sex Myoblast 463
Excretory Cells 460
SIA_SIB 333
XXX 313
RMH 267
Sphincter and Anal Muscles 261
ALM_PLM_AVM_PVM 254
PVP 254
Unknown_glut_2 246
AIB 240
RIC 233
RIF 224
AIZ 203
AVL 203
AVJ 193
Unknown_touch 181
ALN_PLN_SDQ 179
M2 178
RIA 172
AQR_PQR_URX 171
CAN 171
AVH 165
BDU 163
IL2_DV 162
AVF 161
RME 160
DVB 159
PVQ 159
Unknown_ACh_3 156
AIN 152
AFD 148
AIA 148
PVD 145
AIM 143
I5 137
PDE 137
PHB 136
OLL_URY 135
I2_I3 131
ADA 130
PHA 130
MC 123
RMD 108
DVC 107
ADE 105
AVD_PVC 104
Unknown_3 102
RIM 102
RMG 98
AVG 97
Unknown_2 95
PVT 95
ASI_ASJ 92
RIG 91
Unknown_1 90
CEM 85
URA 85
MI 83
AUA 71
AWA 70
Unknown_4 70
LUA 69
ADL 69
IL1 61
I1 60
ALA 57
PVR 52
ASK 45
AWC 42
BAG 40
AWB 40
ASER 35
OLQ 34
M1 31
ADF 31
DVA 24
ASH 23
IL2_LR 22
FLP 21
ASG 17
First 5 entries of adata.obs and adata.var:
print(adata.obs.head())
Batch Size_Factor cell_type neuronal_subtype total \
0 F4_1 102.532200 Intestine nan 104863
1 F4_1 60.046012 Intestine nan 61411
2 F4_1 60.384321 Somatic Gonad nan 61757
3 F4_1 51.692898 Intestine nan 52868
4 F4_1 59.001750 Pharynx and Arcade Cells nan 60343
barcode doublet
0 F4_1_TGTAACGGTTAGCTAC-1 False
1 F4_1_GGCAGTCCAGCCTATA-1 False
2 F4_1_AAGTACCGTCATCCCT-1 False
3 F4_1_AAGATAGTCCCTCTAG-1 False
4 F4_1_ACCAAACCAGCTGTAT-1 False
wbps_gene_id chromosome_name start_position end_position \
wormbase_gene
WBGene00010957 WBGene00010957 MtDNA 113 549
WBGene00010958 WBGene00010958 MtDNA 549 783
WBGene00010959 WBGene00010959 MtDNA 1763 2635
WBGene00010960 WBGene00010960 MtDNA 2634 3235
WBGene00010961 WBGene00010961 MtDNA 3389 4269
strand external_gene_id external_transcript_id wormbase_locus \
wormbase_gene
WBGene00010957 1 nduo-6 MTCE.3.1 nduo-6
WBGene00010958 1 ndfl-4 MTCE.4.1 ndfl-4
WBGene00010959 1 nduo-1 MTCE.11.1 nduo-1
WBGene00010960 1 atp-6 MTCE.12.1 atp-6
WBGene00010961 1 nduo-2 MTCE.16.1 nduo-2
wormbase_gseq gene_short_name gene_name
wormbase_gene
WBGene00010957 MTCE.3 nduo-6 nduo-6
WBGene00010958 MTCE.4 ndfl-4 ndfl-4
WBGene00010959 MTCE.11 nduo-1 nduo-1
WBGene00010960 MTCE.12 atp-6 atp-6
WBGene00010961 MTCE.16 nduo-2 nduo-2
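In this file the obs index is a plain integer and the barcode is stored as a column, with the batch name appearing as a prefix of each barcode. The sketch below, on a toy copy of the five obs rows shown above, checks that prefix against `Batch` and re-indexes by barcode. Whether the prefix rule holds for every cell in the full file is an assumption; only the five printed rows are checked here.

```python
import pandas as pd

# Toy stand-in for adata.obs, copied from the five rows printed above.
obs = pd.DataFrame(
    {
        "Batch": ["F4_1"] * 5,
        "barcode": [
            "F4_1_TGTAACGGTTAGCTAC-1",
            "F4_1_GGCAGTCCAGCCTATA-1",
            "F4_1_AAGTACCGTCATCCCT-1",
            "F4_1_AAGATAGTCCCTCTAG-1",
            "F4_1_ACCAAACCAGCTGTAT-1",
        ],
    }
)

# Everything before the last underscore is the batch prefix.
prefix = obs["barcode"].str.rsplit(pat="_", n=1).str[0]
print((prefix == obs["Batch"]).all())

# Use the barcode as the index instead of the integer position.
by_barcode = obs.set_index("barcode")
print(by_barcode.index[0])
```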
Packer 2019 Taylor2019 Cao2019 data wrangle 2020-03-30
VAE trained on full data with scVI v0.6.1 (works on v0.6.3)
New data wrangle with Packer labels to include cell_plot_type as cell_type.
The wormcells-data-2020-03-30.h5ad anndata file is provided with the following entries:
AnnData object with n_obs × n_vars = 191138 × 22761
obs: 'barcode', 'cell_subtype', 'cell_type', 'embryo_time', 'embryo_time_bin', 'experiment', 'lineage', 'numi', 'passed_qc', 'raw_embryo_time', 'raw_embryo_time_bin', 'size_factor', 'study', 'time_point', 'tissue_type'
var: 'gene_name', 'gene_description'
The first and last entries of the data for each study can be printed with this snippet:
import anndata
import pandas as pd

# Load the combined dataset (read_h5ad replaces the deprecated anndata.read)
adata = anndata.read_h5ad('wormcells-data-2020-03-30.h5ad')

# First and last cell of each study, transposed so that obs fields become rows
pd.concat([adata.obs[adata.obs['study'] == 'cao'].head(1).T,
           adata.obs[adata.obs['study'] == 'cao'].tail(1).T,
           adata.obs[adata.obs['study'] == 'packer'].head(1).T,
           adata.obs[adata.obs['study'] == 'packer'].tail(1).T,
           adata.obs[adata.obs['study'] == 'taylor'].head(1).T,
           adata.obs[adata.obs['study'] == 'taylor'].tail(1).T],
          axis=1)
The output looks as below. Note that the display is transposed for convenience: the entries in the first column are the anndata obs field names, and each remaining column is one cell.
0-cao 35986-cao 0-packer 89700-packer 0-taylor 65449-taylor
barcode A01_A02_AACTACCGAC B02_B42_TTCTACGCCA AAACCTGAGACAATAC-300.1.1 TGGGCGTTCAGGCCCA-b02 acr2_AAACCCAAGATCGCTT-1 u3_TTTGTCATCTTCGGTC-1
cell_subtype nan nan BWM_head_row_1 nan nan nan
cell_type hyp_4_to_7_bin_3_around_L2_molt Intestine_far_posterior BWM_head_row_1 nan Unknown_NT VB
embryo_time NaN NaN 380 265 NaN NaN
embryo_time_bin nan nan 330-390 210-270 nan nan
experiment L2_experiment_1 L2_experiment_2 Waterston_300_minutes Murray_b02 acr-2 unc-3
lineage nan nan MSxpappp nan nan nan
numi NaN NaN 1630 1132 NaN NaN
passed_qc nan nan True True nan nan
raw_embryo_time NaN NaN 360 260 NaN NaN
raw_embryo_time_bin nan nan 330-390 210-270 nan nan
size_factor NaN NaN 1.02319 0.70682 NaN NaN
study cao cao packer packer taylor taylor
time_point nan nan 300_minutes mixed nan nan
tissue_type nan nan Body_wall_muscle nan Neuron Neuron
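The same first-and-last summary can be produced without naming each study by hand. The sketch below is an alternative, not the method used for the release: the function `first_and_last_per_study` is hypothetical, and the toy `obs` frame keeps only two fields and adds one made-up middle row (`1-cao`) so that the selection is visible.

```python
import pandas as pd

# Toy stand-in for adata.obs with the obs names and experiments shown above.
# The row '1-cao' is a made-up filler so that 'cao' has a non-boundary row.
obs = pd.DataFrame(
    {
        "study": ["cao", "cao", "cao", "packer", "packer", "taylor", "taylor"],
        "experiment": [
            "L2_experiment_1",
            "L2_experiment_1",
            "L2_experiment_2",
            "Waterston_300_minutes",
            "Murray_b02",
            "acr-2",
            "unc-3",
        ],
    },
    index=[
        "0-cao", "1-cao", "35986-cao",
        "0-packer", "89700-packer",
        "0-taylor", "65449-taylor",
    ],
)


def first_and_last_per_study(obs):
    """Return the first and last row of each study, transposed (fields as rows)."""
    # observed=True skips empty categories if 'study' is a categorical column.
    parts = [group.iloc[[0, -1]] for _, group in obs.groupby("study", observed=True)]
    return pd.concat(parts).T


summary = first_and_last_per_study(obs)
print(summary)
```

Note that `groupby` orders the studies alphabetically, which here matches the order used in the table above.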
In the variables, gene annotations include WormBase short gene descriptions. For example, the first 5 entries look like:
gene_id gene_name gene_description
0 WBGene00000001 aap-1 Exhibits protein kinase binding activity. Involved in dauer larval development; determination of adult lifespan; and insulin receptor signaling pathway. Localizes to the phosphatidylinositol 3-kinase complex. Human ortholog(s) of this gene implicated in several diseases, including astroblastoma; carcinoma (multiple); endometrial cancer (multiple); primary immunodeficiency disease (multiple); and type 2 diabetes mellitus. Is expressed in intestine and neurons. Orthologous to several human genes including PIK3R3 (phosphoinositide-3-kinase regulatory subunit 3).
1 WBGene00000002 aat-1 Contributes to L-amino acid transmembrane transporter activity. Involved in amino acid transmembrane transport. Localizes to the amino acid transport complex. Is expressed in several structures, including excretory system; gonadal sheath cell; nervous system; pharynx; and rectal gland cell. Orthologous to several human genes including SLC7A8 (solute carrier family 7 member 8).
2 WBGene00000003 aat-2 Predicted to have L-amino acid transmembrane transporter activity. Predicted to be involved in amino acid transmembrane transport. Predicted to localize to the integral component of membrane. Human ortholog(s) of this gene implicated in lysinuric protein intolerance. Orthologous to several human genes including SLC7A7 (solute carrier family 7 member 7).
3 WBGene00000004 aat-3 Contributes to L-amino acid transmembrane transporter activity. Involved in amino acid transmembrane transport. Localizes to the amino acid transport complex. Orthologous to human SLC7A5 (solute carrier family 7 member 5) and SLC7A8 (solute carrier family 7 member 8).
4 WBGene00000005 aat-4 Predicted to have L-amino acid transmembrane transporter activity. Predicted to be involved in amino acid transmembrane transport. Predicted to localize to the integral component of membrane. Human ortholog(s) of this gene implicated in lysinuric protein intolerance. Orthologous to human SLC7A6 (solute carrier family 7 member 6) and SLC7A7 (solute carrier family 7 member 7).
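Because the descriptions are free text, they can be searched by keyword to find genes of interest. This is a minimal sketch on a toy table that keeps only the first sentence of each of the five descriptions above; the search term and the variable name `hits` are illustrative.

```python
import pandas as pd

# Toy stand-in for the gene annotations, first sentence of each description only.
var = pd.DataFrame(
    {
        "gene_id": [
            "WBGene00000001",
            "WBGene00000002",
            "WBGene00000003",
            "WBGene00000004",
            "WBGene00000005",
        ],
        "gene_name": ["aap-1", "aat-1", "aat-2", "aat-3", "aat-4"],
        "gene_description": [
            "Exhibits protein kinase binding activity.",
            "Contributes to L-amino acid transmembrane transporter activity.",
            "Predicted to have L-amino acid transmembrane transporter activity.",
            "Contributes to L-amino acid transmembrane transporter activity.",
            "Predicted to have L-amino acid transmembrane transporter activity.",
        ],
    }
)

# Case-insensitive plain-text search; na=False treats missing descriptions as no match.
mask = var["gene_description"].str.contains("transporter", case=False, regex=False, na=False)
hits = var[mask]

print(hits["gene_name"].tolist())
```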
taylor2019
H5AD file with data from Taylor et al., bioRxiv 2019:
"Expression profiling of the mature C. elegans nervous system by single-cell RNA-Sequencing"
https://www.biorxiv.org/content/10.1101/737577v2
https://doi.org/10.1101/737577
The cell annotations have the following structure (2 entries included as example)
barcode acr2_AAACCCAAGATCGCTT-1 acr2_AAACCCAAGTCATAGA-1
barcode acr2_AAACCCAAGATCGCTT-1 acr2_AAACCCAAGTCATAGA-1
experiment acr-2 acr-2
tissue Neuron Neuron
neuron_type Unknown_NT VB
AnnData has the following entries
AnnData object with n_obs × n_vars = 65450 × 21393
obs: 'barcode', 'experiment', 'tissue', 'neuron_type'
var: 'gene_id', 'gene_symbol'
Packer and Taylor data
Concatenated C. elegans data from Packer 2019 (89k cells) and Taylor 2019 (65k cells), together with a pre-trained scVI model.
Packer 2019
A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution
https://science.sciencemag.org/content/365/6459/eaax1971.long
Taylor 2019
"Expression profiling of the mature C. elegans nervous system by single-cell RNA-Sequencing"
Concatenated anndata with Cao 2017, Packer 2019 and Taylor 2019 data
AnnData object with n_obs × n_vars = 191138 × 22761
obs: 'barcode', 'cell_type', 'embryo_time', 'embryo_time_bin', 'experiment', 'lineage', 'numi', 'passed_qc', 'plot_cell_type', 'raw_embryo_time', 'raw_embryo_time_bin', 'size_factor', 'study', 'time_point', 'tissue_type'
var: 'gene_id', 'gene_name', 'gene_description'
Cells from each study
packer 89701
taylor 65450
cao 35987
Cells in each experiment (batches)
L2_experiment_1 35480
Waterston_400_minutes 25875
Waterston_300_minutes 17168
eat-4 12743
Murray_b01 12129
acr-2 11719
Waterston_500_minutes_batch_2 11589
Waterston_500_minutes_batch_1 10532
Murray_r17 9363
Pan 9216
unc-3 6165
tph-1_ceh-10 4810
ift-20 4056
cho-1_1 3849
cho-1_2 3471
unc-47_2 3123
Murray_b02 3045
ceh-34 2648
nmr-1 2389
unc-47_1 1261
L2_experiment_2 507
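The per-experiment counts can be cross-checked against the per-study counts. The sketch below assigns each experiment to a study by name prefix and sums the cells. The prefix rule in the hypothetical function `study_of` is an assumption inferred from the experiment names (`L2_experiment_*` for Cao, `Waterston_*` and `Murray_*` for Packer, everything else for Taylor); in the actual file the `study` column in obs should be used instead.

```python
# Cells per experiment, copied from the list above.
experiment_counts = {
    "L2_experiment_1": 35480, "Waterston_400_minutes": 25875,
    "Waterston_300_minutes": 17168, "eat-4": 12743, "Murray_b01": 12129,
    "acr-2": 11719, "Waterston_500_minutes_batch_2": 11589,
    "Waterston_500_minutes_batch_1": 10532, "Murray_r17": 9363, "Pan": 9216,
    "unc-3": 6165, "tph-1_ceh-10": 4810, "ift-20": 4056, "cho-1_1": 3849,
    "cho-1_2": 3471, "unc-47_2": 3123, "Murray_b02": 3045, "ceh-34": 2648,
    "nmr-1": 2389, "unc-47_1": 1261, "L2_experiment_2": 507,
}

# Cells per study, copied from the list above.
study_counts = {"packer": 89701, "taylor": 65450, "cao": 35987}


def study_of(experiment):
    """Guess the study from the experiment name (assumed prefix rule)."""
    if experiment.startswith("L2_experiment"):
        return "cao"
    if experiment.startswith(("Waterston_", "Murray_")):
        return "packer"
    return "taylor"


# Sum the cells of all experiments assigned to each study.
totals = {}
for experiment, n_cells in experiment_counts.items():
    study = study_of(experiment)
    totals[study] = totals.get(study, 0) + n_cells

print(totals == study_counts)  # True
print(sum(totals.values()))    # 191138
```

The grand total matches the `n_obs` of 191138 reported for the concatenated AnnData object.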
cao2017
h5ad file with data for 36k C. elegans cells from two sci-RNA-seq experiments published in Cao et al., Science 2017 (https://doi.org/10.1126/science.aam8940)
Breakdown of cells per experiment:
L2_experiment_1 35480
L2_experiment_2 507
Head of cell annotation as example:
barcode experiment cell_type
0 A01_A02_AACTACCGAC L2_experiment_1 hyp_4_to_7_bin_3_around_L2_molt
1 A01_A02_AACTACGGCT L2_experiment_1 ASI
2 A01_A02_AACTATTATA L2_experiment_1 mu_sph
3 A01_A02_AAGACGGCCA L2_experiment_1 Germline
4 A01_A02_AAGTTGCCAT L2_experiment_1 Germline
Gene annotations include WormBase short gene descriptions, for example:
gene_id gene_name gene_description
0 WBGene00000001 aap-1 Exhibits protein kinase binding activity. Involved in dauer larval development; determination of adult lifespan; and insulin receptor signaling pathway. Localizes to the phosphatidylinositol 3-kinase complex. Human ortholog(s) of this gene implicated in several diseases, including astroblastoma; carcinoma (multiple); endometrial cancer (multiple); primary immunodeficiency disease (multiple); and type 2 diabetes mellitus. Is expressed in intestine and neurons. Orthologous to several human genes including PIK3R3 (phosphoinositide-3-kinase regulatory subunit 3).
1 WBGene00000002 aat-1 Contributes to L-amino acid transmembrane transporter activity. Involved in amino acid transmembrane transport. Localizes to the amino acid transport complex. Is expressed in several structures, including excretory system; gonadal sheath cell; nervous system; pharynx; and rectal gland cell. Orthologous to several human genes including SLC7A8 (solute carrier family 7 member 8).
2 WBGene00000003 aat-2 Predicted to have L-amino acid transmembrane transporter activity. Predicted to be involved in amino acid transmembrane transport. Predicted to localize to the integral component of membrane. Human ortholog(s) of this gene implicated in lysinuric protein intolerance. Orthologous to several human genes including SLC7A7 (solute carrier family 7 member 7).
3 WBGene00000004 aat-3 Contributes to L-amino acid transmembrane transporter activity. Involved in amino acid transmembrane transport. Localizes to the amino acid transport complex. Orthologous to human SLC7A5 (solute carrier family 7 member 5) and SLC7A8 (solute carrier family 7 member 8).
4 WBGene00000005 aat-4 Predicted to have L-amino acid transmembrane transporter activity. Predicted to be involved in amino acid transmembrane transport. Predicted to localize to the integral component of membrane. Human ortholog(s) of this gene implicated in lysinuric protein intolerance. Orthologous to human SLC7A6 (solute carrier family 7 member 6) and SLC7A7 (solute carrier family 7 member 7).