ID in all.cell.annotation.meta.txt does not correspond to Barcodes in GEO .h5 files #6

djinnome · 2020-05-27T18:41:36Z

Hi folks,

I am interested in analyzing the single cell expression of the Epithelial cells from your severe and healthy controls, but the ID in the all.cell.annotation.met.txt does not appear to correspond to the Barcodes in the data files on GEO.

For example:

import pandas as pd, os
all_cells = pd.read_csv('all.cell.annotation.meta.txt',sep='\t')
s2_cells = all_cells[(all_cells['sample_new'] == 'S2')]
s2_cells.head().to_html())

	ID	sample	sample_new	group	disease	hasnCoV	cluster	celltype
29194	AAACCTGAGACTTTCG_8	C143	S2	S	Y	N	4	Macrophages
29195	AAACCTGAGCATGGCA_8	C143	S2	S	Y	N	0	Macrophages
29196	AAACCTGAGCCATCGC_8	C143	S2	S	Y	N	3	Macrophages
29197	AAACCTGAGCGTGAAC_8	C143	S2	S	Y	N	15	Neutrophil
29198	AAACCTGAGGACATTA_8	C143	S2	S	Y	N	5	Macrophages

Contrast this with the Barcode IDs from the raw data for patient S2:

s2_raw = get_matrix_from_h5(os.path.join(balf_dir, "GSM4339772_C144_filtered_feature_bc_matrix.h5"))
print(s2_raw.iloc[:10,:10].to_html())

<class 'str'> genome

	AAACCTGAGCTAGTGG-1	AAACCTGCAGCCTTTC-1	AAACCTGCAGCTGTGC-1	AAACCTGGTCTAGTGT-1	AAACCTGGTGACTCAT-1	AAACCTGTCGTCCGTT-1	AAACGGGAGACGCTTT-1	AAACGGGAGATCGGGT-1	AAACGGGAGCGTCTAT-1	AAACGGGCAATCGAAA-1
MIR1302-2HG	0	0	0	0	0	0	0	0	0	0
FAM138A	0	0	0	0	0	0	0	0	0	0
OR4F5	0	0	0	0	0	0	0	0	0	0
AL627309.1	0	0	0	0	0	0	0	0	0	0
AL627309.3	0	0	0	0	0	0	0	0	0	0
AL627309.2	0	0	0	0	0	0	0	0	0	0
AL627309.4	0	0	0	0	0	0	0	0	0	0
AL732372.1	0	0	0	0	0	0	0	0	0	0
OR4F29	0	0	0	0	0	0	0	0	0	0
AC114498.1	0	0	0	0	0	0	0	0	0	0

Note that there is an underscore with the metadata ID, but a hyphen with the S2 barcodes.

However, even if I switch the metadata to hyphen, I still am unable to align the IDs with the barcodes from the s2 dataset for all but 45 barcodes:

all_barcodes = all_cells['ID'].str.replace('_','-').values.tolist()
len(s2_raw.columns), len(all_barcodes)

(3716, 65813)

Note that the number of s2 barcodes is much smaller than the number of all barcodes, which is as it should be. However, when I check to see how many s2 raw columns there are, there is barely any overlap:

len(set(s2_raw.columns) & set(all_barcodes))

45

When I read the HC4 datafile I had a similar problem:

hc4_raw =parse_mtx(os.path.join(balf_dir,'GSM3660650_SC249NORbal_fresh_barcodes.tsv.gz'),
          os.path.join(balf_dir,'GSM3660650_SC249NORbal_fresh_genes.tsv.gz'),
          os.path.join(balf_dir,'GSM3660650_SC249NORbal_fresh_matrix.mtx.gz'))
print(hc4_raw.iloc[:10,:10].to_html())

barcode	AAACCTGAGAAACCAT-1	AAACCTGAGAAACCGC-1	AAACCTGAGAAACCTA-1	AAACCTGAGAAACGAG-1	AAACCTGAGAAACGCC-1	AAACCTGAGAAAGTGG-1	AAACCTGAGAACAACT-1	AAACCTGAGAACAATC-1	AAACCTGAGAACTCGG-1	AAACCTGAGAACTGTA-1
gene_name
MIR1302-2HG	0	0	0	0	0	0	0	0	0	0
FAM138A	0	0	0	0	0	0	0	0	0	0
OR4F5	0	0	0	0	0	0	0	0	0	0
AL627309.1	0	0	0	0	0	0	0	0	0	0
AL627309.3	0	0	0	0	0	0	0	0	0	0
AL627309.2	0	0	0	0	0	0	0	0	0	0
AL627309.4	0	0	0	0	0	0	0	0	0	0
AL732372.1	0	0	0	0	0	0	0	0	0	0
OR4F29	0	0	0	0	0	0	0	0	0	0
AC114498.1	0	0	0	0	0	0	0	0	0	0

len(hc4_raw.columns), len(all_barcodes)

(737280, 65813)

Note that in this case, the number of HC4 barcodes dwarfs the number of all barcodes.
but there still isn't much of an overlap:

len(set(hc4_raw.columns) & set(all_barcodes))

8454

Please advise.

The text was updated successfully, but these errors were encountered:

polojacky · 2020-05-28T01:42:15Z

Hi Jeremy,

We used Seurat to integrate the samples. Since the same barcode exists in different samples, so Seurat removed the _1 in the original barcode and add a suffix serially based on the order of the sample integrated (HC1 HC2 HC3 HC4 M1 M2 M3 S2 S1 S3 S4 S5 S6). So you see the barcode of S2 end with _8.

So to match the barcode, you can just remove the suffix in metadata and barcode of original data (ID_new), then use ID_new and sample to match the data.

Xiaojieqiu · 2020-12-05T21:47:56Z

@polojacky I would like to double-check with you on the mapping from samples to suffix (batch) added to the cell barcodes. From your all.cell.annotation.meta.txt file, I found the batch with smallest number of cells is batch 7 (which is M3 according to the order you mentioned above HC1 HC2 HC3 HC4 M1 M2 M3 S2 S1 S3 S4 S5 S6 :

8     16883
9     11762
1      8454
2      8153
5      3539
6      3409
13     2897
4      2710
3      2566
12     2069
11     1716
10     1292
7       363

However, I found that the smallest batches are batch 1/2 instead after I re-quantify the raw fastq files with kb-python.

./SRR11181955_10xV2_kb_output cur_batch:  6 12032
./SRR11181956_10xV2_kb_output cur_batch:  8 22418
./SRR11181957_10xV2_kb_output cur_batch:  7 8553
./SRR11181958_10xV2_kb_output cur_batch:  9 21382
./SRR11181959_10xV2_kb_output cur_batch:  10 15167
./SRR11537946_10xV3_kb_output cur_batch:  1 1541
./SRR11537947_10xV3_kb_output cur_batch:  2 1578
./SRR11537948_10xV3_kb_output cur_batch:  3 9123
./SRR11537949_10xV2_kb_output cur_batch:  11 3160
./SRR11537950_10xV2_kb_output cur_batch:  12 4210
./SRR11537951_10xV2_kb_output cur_batch:  13 12882

The mapping from SRR run IDs to the batches are based on this file

Run	sample	GEO_Accession (exp)	sample_new	experiment	10x_scRNA_seq_tech
SRR11537946	C51	GSM4475048	HC1	scRNA-seq	10xV3
SRR11537947	C52	GSM4475049	HC2	scRNA-seq	10xV3
SRR11537948	C100	GSM4475050	HC3	scRNA-seq	10xV3
SRR11181955	C142	GSM4339770	M2	scRNA-seq	10xV2
SRR11181957	C144	GSM4339772	M3	scRNA-seq	10xV2
SRR11245351	C141	GSM4385990	M1	TCR-seq	None
SRR11245352	C142	GSM4385991	M2	TCR-seq	None
SRR11537949	C148	GSM4475051	S4	scRNA-seq	10xV2
SRR11537950	C149	GSM4475052	S5	scRNA-seq	10xV2
SRR11537951	C152	GSM4475053	S6	scRNA-seq	10xV2
SRR11537952	C148	GSM4475054	S4	TCR-seq	None
SRR11537953	C149	GSM4475055	S5	TCR-seq	None
SRR11537954	C152	GSM4475056	S6	TCR-seq	None
SRR11181954	C141	GSM4339769	M1	scRNA-seq	10xV3
SRR11181956	C143	GSM4339771	S2	scRNA-seq	10xV2
SRR11181958	C145	GSM4339773	S1	scRNA-seq	10xV2
SRR11181959	C146	GSM4339774	S3	scRNA-seq	10xV2
SRR11245353	C143	GSM4385992	S2	TCR-seq	None
SRR11245354	C144	GSM4385993	M3	TCR-seq	None
SRR11245355	C145	GSM4385994	S1	TCR-seq	None
SRR11245356	C146	GSM4385995	S3	TCR-seq	None

This is obviously a mismatch. for example, how came your all.cell.annotation.meta.txt file shows there is 8000+ cells for batch 1/2 while in my case there is only about 1000 + cells. Please advise, appreciated!

zhangzlab · 2020-12-06T02:38:23Z

The metadata in all.cell.annotation.meta.txt is correct. The sample quality is good for healthy controls in our previous experiment so you can see batch 1/2 get many cells (8000+). And for M3, the sample quality is not so good and we harvest fewer cells in our experiment.

We used cellranger to quantify the gene expression. I don't know if there is some discrepancy between the two tools.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ID in all.cell.annotation.meta.txt does not correspond to Barcodes in GEO .h5 files #6

ID in all.cell.annotation.meta.txt does not correspond to Barcodes in GEO .h5 files #6

djinnome commented May 27, 2020 •

edited

Loading

polojacky commented May 28, 2020

Xiaojieqiu commented Dec 5, 2020 •

edited

Loading

zhangzlab commented Dec 6, 2020 •

edited

Loading

ID in all.cell.annotation.meta.txt does not correspond to Barcodes in GEO .h5 files #6

ID in all.cell.annotation.meta.txt does not correspond to Barcodes in GEO .h5 files #6

Comments

djinnome commented May 27, 2020 • edited Loading

polojacky commented May 28, 2020

Xiaojieqiu commented Dec 5, 2020 • edited Loading

zhangzlab commented Dec 6, 2020 • edited Loading

djinnome commented May 27, 2020 •

edited

Loading

Xiaojieqiu commented Dec 5, 2020 •

edited

Loading

zhangzlab commented Dec 6, 2020 •

edited

Loading