Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ID in all.cell.annotation.meta.txt does not correspond to Barcodes in GEO .h5 files #6

Open
djinnome opened this issue May 27, 2020 · 3 comments

Comments

@djinnome
Copy link

djinnome commented May 27, 2020

Hi folks,

I am interested in analyzing the single cell expression of the Epithelial cells from your severe and healthy controls, but the ID in the all.cell.annotation.met.txt does not appear to correspond to the Barcodes in the data files on GEO.

For example:

import pandas as pd, os
all_cells = pd.read_csv('all.cell.annotation.meta.txt',sep='\t')
s2_cells = all_cells[(all_cells['sample_new'] == 'S2')]
s2_cells.head().to_html())
ID sample sample_new group disease hasnCoV cluster celltype
29194 AAACCTGAGACTTTCG_8 C143 S2 S Y N 4 Macrophages
29195 AAACCTGAGCATGGCA_8 C143 S2 S Y N 0 Macrophages
29196 AAACCTGAGCCATCGC_8 C143 S2 S Y N 3 Macrophages
29197 AAACCTGAGCGTGAAC_8 C143 S2 S Y N 15 Neutrophil
29198 AAACCTGAGGACATTA_8 C143 S2 S Y N 5 Macrophages
Contrast this with the Barcode IDs from the raw data for patient S2:
s2_raw = get_matrix_from_h5(os.path.join(balf_dir, "GSM4339772_C144_filtered_feature_bc_matrix.h5"))
print(s2_raw.iloc[:10,:10].to_html())

<class 'str'> genome

AAACCTGAGCTAGTGG-1 AAACCTGCAGCCTTTC-1 AAACCTGCAGCTGTGC-1 AAACCTGGTCTAGTGT-1 AAACCTGGTGACTCAT-1 AAACCTGTCGTCCGTT-1 AAACGGGAGACGCTTT-1 AAACGGGAGATCGGGT-1 AAACGGGAGCGTCTAT-1 AAACGGGCAATCGAAA-1
MIR1302-2HG 0 0 0 0 0 0 0 0 0 0
FAM138A 0 0 0 0 0 0 0 0 0 0
OR4F5 0 0 0 0 0 0 0 0 0 0
AL627309.1 0 0 0 0 0 0 0 0 0 0
AL627309.3 0 0 0 0 0 0 0 0 0 0
AL627309.2 0 0 0 0 0 0 0 0 0 0
AL627309.4 0 0 0 0 0 0 0 0 0 0
AL732372.1 0 0 0 0 0 0 0 0 0 0
OR4F29 0 0 0 0 0 0 0 0 0 0
AC114498.1 0 0 0 0 0 0 0 0 0 0

Note that there is an underscore with the metadata ID, but a hyphen with the S2 barcodes.

However, even if I switch the metadata to hyphen, I still am unable to align the IDs with the barcodes from the s2 dataset for all but 45 barcodes:

all_barcodes = all_cells['ID'].str.replace('_','-').values.tolist()
len(s2_raw.columns), len(all_barcodes)

(3716, 65813)

Note that the number of s2 barcodes is much smaller than the number of all barcodes, which is as it should be. However, when I check to see how many s2 raw columns there are, there is barely any overlap:

len(set(s2_raw.columns) & set(all_barcodes))

45

When I read the HC4 datafile I had a similar problem:

hc4_raw =parse_mtx(os.path.join(balf_dir,'GSM3660650_SC249NORbal_fresh_barcodes.tsv.gz'),
          os.path.join(balf_dir,'GSM3660650_SC249NORbal_fresh_genes.tsv.gz'),
          os.path.join(balf_dir,'GSM3660650_SC249NORbal_fresh_matrix.mtx.gz'))
print(hc4_raw.iloc[:10,:10].to_html())          
barcode AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 AAACCTGAGAAACCTA-1 AAACCTGAGAAACGAG-1 AAACCTGAGAAACGCC-1 AAACCTGAGAAAGTGG-1 AAACCTGAGAACAACT-1 AAACCTGAGAACAATC-1 AAACCTGAGAACTCGG-1 AAACCTGAGAACTGTA-1
gene_name
MIR1302-2HG 0 0 0 0 0 0 0 0 0 0
FAM138A 0 0 0 0 0 0 0 0 0 0
OR4F5 0 0 0 0 0 0 0 0 0 0
AL627309.1 0 0 0 0 0 0 0 0 0 0
AL627309.3 0 0 0 0 0 0 0 0 0 0
AL627309.2 0 0 0 0 0 0 0 0 0 0
AL627309.4 0 0 0 0 0 0 0 0 0 0
AL732372.1 0 0 0 0 0 0 0 0 0 0
OR4F29 0 0 0 0 0 0 0 0 0 0
AC114498.1 0 0 0 0 0 0 0 0 0 0
len(hc4_raw.columns), len(all_barcodes)

(737280, 65813)

Note that in this case, the number of HC4 barcodes dwarfs the number of all barcodes.
but there still isn't much of an overlap:

len(set(hc4_raw.columns) & set(all_barcodes))

8454

Please advise.

@polojacky
Copy link

Hi Jeremy,

We used Seurat to integrate the samples. Since the same barcode exists in different samples, so Seurat removed the _1 in the original barcode and add a suffix serially based on the order of the sample integrated (HC1 HC2 HC3 HC4 M1 M2 M3 S2 S1 S3 S4 S5 S6). So you see the barcode of S2 end with _8.

So to match the barcode, you can just remove the suffix in metadata and barcode of original data (ID_new), then use ID_new and sample to match the data.

@Xiaojieqiu
Copy link

Xiaojieqiu commented Dec 5, 2020

@polojacky I would like to double-check with you on the mapping from samples to suffix (batch) added to the cell barcodes. From your all.cell.annotation.meta.txt file, I found the batch with smallest number of cells is batch 7 (which is M3 according to the order you mentioned above HC1 HC2 HC3 HC4 M1 M2 M3 S2 S1 S3 S4 S5 S6 :

8     16883
9     11762
1      8454
2      8153
5      3539
6      3409
13     2897
4      2710
3      2566
12     2069
11     1716
10     1292
7       363

However, I found that the smallest batches are batch 1/2 instead after I re-quantify the raw fastq files with kb-python.

./SRR11181955_10xV2_kb_output cur_batch:  6 12032
./SRR11181956_10xV2_kb_output cur_batch:  8 22418
./SRR11181957_10xV2_kb_output cur_batch:  7 8553
./SRR11181958_10xV2_kb_output cur_batch:  9 21382
./SRR11181959_10xV2_kb_output cur_batch:  10 15167
./SRR11537946_10xV3_kb_output cur_batch:  1 1541
./SRR11537947_10xV3_kb_output cur_batch:  2 1578
./SRR11537948_10xV3_kb_output cur_batch:  3 9123
./SRR11537949_10xV2_kb_output cur_batch:  11 3160
./SRR11537950_10xV2_kb_output cur_batch:  12 4210
./SRR11537951_10xV2_kb_output cur_batch:  13 12882

The mapping from SRR run IDs to the batches are based on this file

Run sample GEO_Accession (exp) sample_new experiment 10x_scRNA_seq_tech
SRR11537946 C51 GSM4475048 HC1 scRNA-seq 10xV3
SRR11537947 C52 GSM4475049 HC2 scRNA-seq 10xV3
SRR11537948 C100 GSM4475050 HC3 scRNA-seq 10xV3
SRR11181955 C142 GSM4339770 M2 scRNA-seq 10xV2
SRR11181957 C144 GSM4339772 M3 scRNA-seq 10xV2
SRR11245351 C141 GSM4385990 M1 TCR-seq None
SRR11245352 C142 GSM4385991 M2 TCR-seq None
SRR11537949 C148 GSM4475051 S4 scRNA-seq 10xV2
SRR11537950 C149 GSM4475052 S5 scRNA-seq 10xV2
SRR11537951 C152 GSM4475053 S6 scRNA-seq 10xV2
SRR11537952 C148 GSM4475054 S4 TCR-seq None
SRR11537953 C149 GSM4475055 S5 TCR-seq None
SRR11537954 C152 GSM4475056 S6 TCR-seq None
SRR11181954 C141 GSM4339769 M1 scRNA-seq 10xV3
SRR11181956 C143 GSM4339771 S2 scRNA-seq 10xV2
SRR11181958 C145 GSM4339773 S1 scRNA-seq 10xV2
SRR11181959 C146 GSM4339774 S3 scRNA-seq 10xV2
SRR11245353 C143 GSM4385992 S2 TCR-seq None
SRR11245354 C144 GSM4385993 M3 TCR-seq None
SRR11245355 C145 GSM4385994 S1 TCR-seq None
SRR11245356 C146 GSM4385995 S3 TCR-seq None

This is obviously a mismatch. for example, how came your all.cell.annotation.meta.txt file shows there is 8000+ cells for batch 1/2 while in my case there is only about 1000 + cells. Please advise, appreciated!

@zhangzlab
Copy link
Owner

zhangzlab commented Dec 6, 2020

The metadata in all.cell.annotation.meta.txt is correct. The sample quality is good for healthy controls in our previous experiment so you can see batch 1/2 get many cells (8000+). And for M3, the sample quality is not so good and we harvest fewer cells in our experiment.

We used cellranger to quantify the gene expression. I don't know if there is some discrepancy between the two tools.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants