Sample ID substr and duplication problem #349

Closed
ShixiangWang opened this issue Aug 12, 2024 · 5 comments

@ShixiangWang
Member

@lishensuo I found a problem with this dataset. When you have time, could you check it and regenerate it? It is not urgent.

#--------purity and ploidy data-----------------------------
# access date:2020-06-17
# from https://gdc.cancer.gov/about-data/publications/PanCanStemness-2018
## genome instability
gi_data <- data.table::fread("data-raw/Purity_Ploidy_All_Samples_9_28_16.tsv", data.table = F)
gi_data <- gi_data %>%
  dplyr::select(c(3, 5, 6, 7, 9, 10))
gi_data <- gi_data %>%
  dplyr::select(sample, purity, ploidy, Genome_doublings = `Genome doublings`,
                Cancer_DNA_fraction = `Cancer DNA fraction`,
                Subclonal_genome_fraction = `Subclonal genome fraction`) %>%
  dplyr::mutate(sample = stringr::str_sub(sample, 1, 15))
tcga_genome_instability <- gi_data
attr(tcga_genome_instability, "data_source") <- "DOI:https://doi.org/10.1016/j.cell.2018.03.034"

@lishensuo
Collaborator

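Is the problem you mean in the data manipulation, or in the biological meaning of the data? I re-ran the steps and they appear to reproduce the same result.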

@ShixiangWang
Member Author

@lishensuo The call stringr::str_sub(sample, 1, 15) is the problem: not all samples are labeled with standard TCGA sample IDs.
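
To make the failure concrete, here is a minimal sketch. The first ID is a made-up barcode in the standard format (an assumption for illustration); the other two are prefixed IDs of the kind that occur in this file. It uses base substr(), which behaves like stringr::str_sub() for these positive indices.

```r
# Illustrative IDs: the first is a made-up standard-format barcode (assumption),
# the other two are prefixed IDs of the kind found in this file.
ids <- c(
  "TCGA-02-0001-01A-01D-0182-01",
  "GBM-TCGA-02-0001-Tumor",
  "WU-BRCA-TCGA-A8-A06Y-01A-21W-A019-09"
)

# Keep the first 15 characters, as the original script does.
truncated <- substr(ids, 1, 15)
print(truncated)

# Only IDs that start with "TCGA" are turned into a 15-character sample ID;
# the prefixed ones are cut off in the middle of the barcode.
is_standard <- grepl("^TCGA", ids)
print(data.frame(id = ids, truncated = truncated, is_standard = is_standard))
```

For the prefixed IDs the truncated string is not a valid sample ID, so those rows would silently fail to match other TCGA tables.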

@lishensuo
Collaborator

Got it. I have found the problem; the fix is described below.

library(dplyr)

# Some TCGA sample IDs do not follow the standard format
gi_data_raw <- data.table::fread("Purity_Ploidy_All_Samples_9_28_16.tsv", data.table = F)
gi_data <- gi_data_raw %>%
  dplyr::select(c(3, 5, 6, 7, 9, 10)) %>%
  dplyr::select(sample, purity, ploidy, Genome_doublings = `Genome doublings`, 
                Cancer_DNA_fraction = `Cancer DNA fraction`, 
                Subclonal_genome_fraction = `Subclonal genome fraction`)

head(gi_data$sample[!grepl("^TCGA", gi_data$sample)])
# [1] "GBM-TCGA-02-0001-Tumor" "GBM-TCGA-02-0006-Tumor" "GBM-TCGA-02-0007-Tumor" "GBM-TCGA-02-0009-Tumor"
# [5] "GBM-TCGA-02-0010-Tumor" "GBM-TCGA-02-0011-Tumor"

table(grepl("^TCGA", gi_data$sample))
# FALSE  TRUE 
# 795  9997 
  • 9997 samples have standard TCGA IDs
  • 795 samples do not

Two possible fixes came to mind.

Option 1: drop the 795 samples

gi_data_res1 = gi_data %>% 
  dplyr::filter(grepl("^TCGA", sample)) %>%
  dplyr::mutate(sample = stringr::str_sub(sample, 1, 15))


gi_data_res1 = gi_data_res1[order(gi_data_res1$sample),]
rownames(gi_data_res1) = seq(nrow(gi_data_res1))

write.csv(gi_data_res1, file = "gi_data_res1.csv", row.names = FALSE)
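
Because truncating to 15 characters can map several aliquots of the same sample onto one ID, it may be worth checking for duplicates before writing the file. A minimal sketch on invented toy data follows; the data frame `toy` and its values are hypothetical, and whether the real table contains duplicates is not shown in this thread.

```r
# Toy data frame standing in for gi_data_res1 (all values are invented).
toy <- data.frame(
  sample = c("TCGA-AA-0001-01", "TCGA-AA-0001-01", "TCGA-BB-0002-01"),
  purity = c(0.80, 0.75, 0.60),
  stringsAsFactors = FALSE
)

# Collect every sample ID that occurs more than once, then show all its rows.
dup_ids <- unique(toy$sample[duplicated(toy$sample)])
dup_rows <- toy[toy$sample %in% dup_ids, ]
print(dup_rows)

# One possible policy (an assumption, not the project's rule):
# keep the first row for each sample ID.
deduped <- toy[!duplicated(toy$sample), ]
print(nrow(deduped))
```

Keeping the first row is only one choice; averaging the numeric columns per sample would be another, depending on how the table is used downstream.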

Option 2: convert the 795 sample IDs to standard IDs

gi_data_out = gi_data[!grepl("^TCGA", gi_data$sample),]
table(!grepl("Tumor", gi_data_out$sample) )
# FALSE  TRUE 
# 775    20 

There are two types of nonconforming IDs.

# (1) IDs with a "Tumor" suffix
head(gi_data_out$sample[grepl("Tumor", gi_data_out$sample)])
# [1] "GBM-TCGA-02-0001-Tumor" "GBM-TCGA-02-0006-Tumor" "GBM-TCGA-02-0007-Tumor"
# [4] "GBM-TCGA-02-0009-Tumor" "GBM-TCGA-02-0010-Tumor" "GBM-TCGA-02-0011-Tumor"

# (2) Other types
gi_data_out$sample[!grepl("Tumor", gi_data_out$sample)]
# [1] "WU-BRCA-TCGA-A8-A06Y-01A-21W-A019-09"  "WU-BRCA-TCGA-A8-A07R-01A-21W-A050-09" 
# [3] "WU-BRCA-TCGA-A8-A083-01A-21W-A019-09"  "WU-BRCA-TCGA-A8-A084-01A-21W-A019-09" 
# [5] "WU-BRCA-TCGA-A8-A08F-01A-11W-A019-09"  "WU-BRCA-TCGA-A8-A09E-01A-11W-A019-09" 
# [7] "WU-BRCA-TCGA-AO-A0JL-01A-11W-A071-09"  "WU-BRCA-TCGA-AO-A12B-01A-11D-A10M-09" 
# [9] "WU-BRCA-TCGA-AR-A0TS-01A-11D-A10Y-09"  "WU-BRCA-TCGA-AR-A2LL-01A-11D-A17W-09" 
# [11] "WU-BRCA-TCGA-AR-A2LR-01A-12D-A18P-09"  "WU-BRCA-TCGA-B6-A0X4-01A-11D-A10G-09" 
# [13] "WU-BRCA-TCGA-BH-A1F5-01A-12D-A13L-09"  "WU-BRCA-TCGA-D8-A146-01A-31D-A10Y-09" 
# [15] "WU-BRCA-TCGA-E9-A1RD-01A-11D-A159-09"  "BCM-KICH-TCGA-KL-8327-01A-11D-2310-10"
# [17] "BCM-KICH-TCGA-KL-8333-01A-11D-2310-10" "BCM-KICH-TCGA-KL-8342-01A-11D-2310-10"
# [19] "BCM-KICH-TCGA-KN-8419-01A-11D-2310-10" "BCM-KICH-TCGA-KN-8430-01A-11D-2310-10"
  • Fix: relabel all of them with the TCGA "01" tumor sample-type code
gi_data_out = gi_data_out %>% 
  dplyr::mutate(sample = stringr::str_extract(sample, 'TCGA-\\w{2}-\\w{4}')) %>% 
  dplyr::mutate(sample = paste0(sample, "-01")) %>% 
  dplyr::filter(sample != "NA-01")

gi_data_res2 = rbind(gi_data_res1, gi_data_out)
gi_data_res2 = gi_data_res2[order(gi_data_res2$sample),]
rownames(gi_data_res2) = seq(nrow(gi_data_res2))

dim(gi_data_res2)
# [1] 10791     6

write.csv(gi_data_res2, file = "gi_data_res2.csv", row.names = FALSE)
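
The extraction step in Option 2 can be checked on a few IDs in isolation. The sketch below is a base-R stand-in for stringr::str_extract(), not the code that was committed; the third ID is hypothetical and only shows what happens when no barcode is present.

```r
ids <- c(
  "GBM-TCGA-02-0001-Tumor",
  "WU-BRCA-TCGA-A8-A06Y-01A-21W-A019-09",
  "no-barcode-here"  # hypothetical ID with no TCGA participant barcode
)

# Base-R equivalent of str_extract(): first match per element, or NA.
pattern <- "TCGA-\\w{2}-\\w{4}"
hit <- regexpr(pattern, ids)
participant <- rep(NA_character_, length(ids))
participant[hit != -1] <- regmatches(ids, hit)

# Append the "01" sample-type code only where a barcode was found.
sample_id <- ifelse(is.na(participant), NA_character_, paste0(participant, "-01"))
print(sample_id)
```

Keeping unmatched IDs as NA, rather than pasting and then filtering the string "NA-01", makes the rows that will be dropped easier to inspect.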

The attached files contain the results of the two options. Which one do you think is better?

gi_data_res1.csv
gi_data_res2.csv

@ShixiangWang
Member Author

Let's go with Option 1. For now, the data for the other ID types is not of much use anyway.

@lishensuo
Collaborator

OK. I have made the change following Option 1 and committed it.
