Analyses including multiple samples from the same individual #155

jashapiro · 2019-10-11T20:07:01Z

Question/issue

Many of the analyses that have been proposed are sensitive to the fact that we have multiple tumor samples from different time points or tumors from the same individual. For example, biospecimens BS_K07KNTFY and BS_AQMKA8NC are both tumor WGS data (initial and recurrence, respectively) from participant PT_00G007DM. While this is extremely useful data, it raises questions for many specific analyses, which I would like to discuss in this issue.

In particular, analyses of mutation prevalence, variant allele frequency distributions, classification accuracy, etc. are likely to be affected by these non-independent samples. In some cases, a simple awareness of the issue will be sufficient, and analyses can be written to account for or take advantage of the redundancy in the data. However, for many analyses, decisions about which samples to include or exclude will need to be made, and it would be good to have an agreed-upon set of standards and procedures.

For a specific example, in the analysis of mutation co-occurrence (#13), including all samples would result in many spurious reports of co-occurrence, as it is quite common for two samples from the same individual to have the same sets of mutations. Similarly, analyses of recurrent fusions (#10), distribution of tumor mutation burden (#3), etc. will likely be affected.
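To make the inflation concrete, here is a minimal sketch with made-up data. The participant IDs, sample IDs, and gene names are hypothetical and do not come from the project data; the point is only to show how counting every sample as independent differs from counting one sample per participant.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: (participant, sample, set of mutated genes).
# All identifiers and gene names are invented for illustration.
samples = [
    ("PT_A", "BS_A1", {"GENE1", "GENE2"}),
    ("PT_A", "BS_A2", {"GENE1", "GENE2"}),  # later sample, same individual
    ("PT_A", "BS_A3", {"GENE1", "GENE2"}),  # later sample, same individual
    ("PT_B", "BS_B1", {"GENE1"}),
    ("PT_C", "BS_C1", {"GENE2"}),
]


def cooccurrence_counts(gene_sets):
    """Count how many gene sets contain each unordered pair of genes."""
    counts = Counter()
    for genes in gene_sets:
        for pair in combinations(sorted(genes), 2):
            counts[pair] += 1
    return counts


# Naive count: every sample is treated as an independent observation.
per_sample = cooccurrence_counts(genes for _, _, genes in samples)

# One observation per participant: keep the first sample listed for each.
first_per_participant = {}
for participant, _, genes in samples:
    first_per_participant.setdefault(participant, genes)
per_participant = cooccurrence_counts(first_per_participant.values())

print(per_sample[("GENE1", "GENE2")])       # 3
print(per_participant[("GENE1", "GENE2")])  # 1
```

In this toy case the pair appears to co-occur three times, but all three observations come from a single individual.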

One potential solution is to use only primary tumors and/or the earliest sampled tumor from each individual in analyses such as this. However, this would miss some potential co-occurrence patterns that may be important in progression and recurrence, which might suggest that the latest tumor from each individual would be better. Doing both is of course an option as well, but I am curious to hear what others think is most appropriate. Ultimately, we may want to add a recommendation to the documentation for future analyses.


jharenza · 2019-10-11T20:15:57Z

@jashapiro I recently had this on my mind as I saw some of the mutation analyses you described, so I am glad you are bringing this up in an issue. As you allude to, I think which samples we include depends on how the data are being used or plotted. For example, I think it makes sense to limit to one primary tumor per patient when doing mutual exclusivity/co-occurrence analyses (#13), but we could calculate TMB (#3) for all samples and, when these are plotted, plot diagnosis and relapse separately. In fact, we expect a higher TMB in relapse tumors compared to primary, so this would be good validation. For fusions (#10), I think this is also important to discuss. We also have cell lines, and some of the fusions may be lost in culture or at relapse, so these would be important things to note and I would think could be something within the scope of another issue.

We also have WGS and WXS for some samples at the same phase of therapy, so I was thinking that for consensus calls (#69), we should collapse these data to the per-patient level. For the oncoprint (#6), we could add tracks for phase of therapy and/or separate dx/relapse/cell lines, but we can flesh this out in this issue.

cgreene · 2019-10-13T20:00:46Z

I agree that we should ask people to - by default - report on only primary tumors and filter to one per patient for most of the overall atlas work. How many patients had multiple primaries?

We'll probably also want to go back and make sure that existing analyses thoughtfully consider whether this rule should be applied.

jashapiro · 2019-10-18T19:04:41Z

I thought I would give a bit of specificity to this question. First up: highlighting the WGS samples where we have multiple biospecimens from the same participants, with no obvious way to decide which of these is earliest.

For these data, I have chosen the samples with the smallest age at diagnosis for each participant, and removed all derived cell line samples.

Kids_First_Participant_ID	samples	descriptors	age_diagnosis	OS_days
PT_9GKVQ9QS	BS_JRFVST47, BS_3Z40EZHD	Diagnosis	1825	171
PT_DTP4MMRA	BS_KH3859M5, BS_NASADC3P	Second Malignancy	4988	842
PT_KBFM551M	BS_M0B42FPR, BS_M5FM63EB, BS_J8EK6RNF, BS_9P4NDTKJ	Diagnosis, Progressive Disease Post-Mortem	3285	395
PT_KTRJ8TFY	BS_BQ81D2BP, BS_HYKV2TH9, BS_EE73VE7V, BS_3VKW5988, BS_AF5D41PD, BS_5968GBGT	Diagnosis, Progressive Disease Post-Mortem	1825	274
PT_KZ56XHJT	BS_H8NWA41N, BS_AK9BV52G, BS_X5VN0FW0, BS_22VCR7DF, BS_D6STCMQS, BS_0ATJ22QA, BS_YHXMYDBN, BS_1Q524P3B	Progressive, Progressive Disease Post-Mortem	2555	241
PT_M9XXJ4GR	BS_R6CKWZW6, BS_682Z7WH6	Diagnosis	3650	260
PT_MNSEJCDM	BS_ZSH09N84, BS_J8EH1N7V, BS_Y74XAFJX, BS_CBMAWSAR	Diagnosis	2190
PT_NK8A49X5	BS_AHAXPFG3, BS_1MME7FBS, BS_HEJ72V3F	Diagnosis, Progressive	4745	704
PT_QD6KKKJH	BS_B91XGSA5, BS_9BN45DFK	Initial CNS Tumor	3129	106

As you can see, there are still many participants with multiple samples. I am not sure how we should choose among these, or if they should be merged before analysis. In some cases, all of a participant's samples have the same tumor descriptor, but sometimes there is more than one. Deciding which descriptor to prefer might help, but it would not fully resolve the problem, as most of these participants have more than one sample with each descriptor.
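The selection described above (drop derived cell lines, then keep the samples with the smallest age at diagnosis for each participant) can be sketched roughly as follows. This is not the code used to produce the table: the `sample_id` and `composition` keys, the `"Derived Cell Line"` label, and the toy rows are assumptions for illustration. Only `Kids_First_Participant_ID` and `age_diagnosis` come from the table header.

```python
def earliest_samples(rows):
    """Drop derived cell lines, then keep, for each participant, every
    sample that shares the smallest age at diagnosis.

    Returns {participant_id: [row, ...]}. A list with more than one row
    means the rule alone did not narrow the choice to a single sample.
    """
    kept = {}  # participant_id -> (smallest age seen, rows at that age)
    for row in rows:
        # "composition" and its label are assumed names, not confirmed here.
        if row["composition"] == "Derived Cell Line":
            continue
        pid = row["Kids_First_Participant_ID"]
        age = int(row["age_diagnosis"])
        current = kept.get(pid)
        if current is None or age < current[0]:
            kept[pid] = (age, [row])
        elif age == current[0]:
            current[1].append(row)
    return {pid: tied for pid, (_, tied) in kept.items()}


# Hypothetical rows; identifiers and values are invented.
rows = [
    {"Kids_First_Participant_ID": "PT_X", "sample_id": "BS_X1",
     "age_diagnosis": "1000", "composition": "Solid Tissue"},
    {"Kids_First_Participant_ID": "PT_X", "sample_id": "BS_X2",
     "age_diagnosis": "1000", "composition": "Solid Tissue"},
    {"Kids_First_Participant_ID": "PT_X", "sample_id": "BS_X3",
     "age_diagnosis": "1500", "composition": "Solid Tissue"},
    {"Kids_First_Participant_ID": "PT_X", "sample_id": "BS_X4",
     "age_diagnosis": "900", "composition": "Derived Cell Line"},
    {"Kids_First_Participant_ID": "PT_Y", "sample_id": "BS_Y1",
     "age_diagnosis": "2000", "composition": "Solid Tissue"},
]

selected = earliest_samples(rows)

# Participants for whom more than one sample remains after the rule.
unresolved = {
    pid: [r["sample_id"] for r in tied]
    for pid, tied in selected.items()
    if len(tied) > 1
}
print(unresolved)  # {'PT_X': ['BS_X1', 'BS_X2']}
```

Returning all tied samples rather than silently picking one keeps the remaining ambiguity visible, which is the situation the table illustrates.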

More generally, is there somewhere where the fields are described? I am unsure of what the OS_days field refers to.

jaclyn-taroni · 2019-10-31T13:09:22Z

With #191, I reran @jashapiro's independent-samples module with the data from release-v6-20191030.

That means that we have 4 lists of independent samples here that can be used for downstream analysis when there should not be multiple samples from the same individual in an analysis (from #171):

WGS primary specimens
WGS specimens (including non-primary)
WGS+WXS primary specimens
WGS+WXS specimens (including non-primary)

Because we would like all analyses to use the same set of samples when accounting for the multiple samples issue (see thread here), we need to figure out how to facilitate use by 1) deciding if/how to include these files in the data download process before the next planned data release and 2) figuring out how to document this in the README.

@jashapiro and I have talked about putting the files (using the specific commit linked above) in S3 and having that be included in download-data.sh.

As far as the second point goes, it might be helpful for me to first complete the breakdown by tumor_descriptor portion of #162 so we have something we can link to in the README alongside this issue.

jharenza · 2019-10-31T13:14:35Z

@jaclyn-taroni - I had added his files to the latest release (v6) - forgot to note this. Do you want these in a new release or can we use the first set?

cgreene · 2019-10-31T13:22:02Z

I think that we should advise people to filter samples to WGS primary specimens.

This way folks would need to actively think about the scientific purpose to go beyond that. It seems like this best balances ease (providing one recommendation) with care. We should provide guidance on how to filter. Perhaps we could provide template filtering code for this default, for each of the file types that have already been worked with, either in the README or as an importable function; either option would ease review.
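As a rough idea of what such an importable function might look like, here is a minimal sketch. It assumes the independent specimen list is a tab-separated file with a biospecimen ID column; the function names, the default column name `Kids_First_Biospecimen_ID`, and the toy data are all assumptions rather than the project's actual layout.

```python
import csv
import io


def read_independent_ids(handle, id_column="Kids_First_Biospecimen_ID"):
    """Read the set of specimen IDs from a tab-separated independent list.

    The default column name is an assumption; pass the real one if it differs.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column] for row in reader}


def filter_to_independent(records, independent_ids,
                          id_column="Kids_First_Biospecimen_ID"):
    """Keep only records whose specimen ID is in the independent set."""
    return [rec for rec in records if rec[id_column] in independent_ids]


# Hypothetical stand-in for an independent specimen list file.
independent_list = io.StringIO(
    "Kids_First_Participant_ID\tKids_First_Biospecimen_ID\n"
    "PT_X\tBS_X1\n"
    "PT_Y\tBS_Y1\n"
)

# Hypothetical per-sample records from some analysis input.
records = [
    {"Kids_First_Biospecimen_ID": "BS_X1", "gene": "GENE1"},
    {"Kids_First_Biospecimen_ID": "BS_X2", "gene": "GENE1"},
    {"Kids_First_Biospecimen_ID": "BS_Y1", "gene": "GENE2"},
]

independent_ids = read_independent_ids(independent_list)
filtered = filter_to_independent(records, independent_ids)
print([rec["Kids_First_Biospecimen_ID"] for rec in filtered])
# ['BS_X1', 'BS_Y1']
```

Taking the ID column as a parameter lets the same function be reused across file types that name the specimen column differently.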

jaclyn-taroni · 2019-10-31T13:22:29Z

Did you rerun it with the v6 data or use the files already in the repository (v5)? I downloaded the v6 files yesterday and they were not present. Perhaps an issue with them missing from md5sum.txt?

cgreene · 2019-10-31T13:22:47Z

I guess the question was about method of distribution: adding this list to the md5sum and data download would seem like the preferred method of distribution.

jharenza · 2019-10-31T13:27:00Z

Ahh, ok, those were missed in the S3 transfer on our end - I took the lists from the v5 release (clinical subjects did not change) that @jashapiro created from his PR. We can add these with the new controlFreeC file - just let me know which versions you want!

jaclyn-taroni · 2019-10-31T13:41:40Z

Can you add the files here: https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/0c2d0d25c01dcbbbd63f94b064a69afc9dc44ea8/analyses/independent-samples/results

Those are updated to use v6. The files did change because of changes to the clinical data: #191 (review)

jharenza · 2019-10-31T13:44:25Z

Will do!

jharenza · 2019-10-31T13:52:35Z

Should we rename these files as independent-patient-cohort-** to be clear these are from unique patient cohorts, since the specimens are already all unique?

jashapiro · 2019-10-31T13:56:56Z

I like independent-specimens, as the other lists include replicate (non-independent) specimens, whereas the ones here are truly independent.

jharenza · 2019-11-01T00:39:45Z

@jashapiro, once #205 is merged, will you please add a description of the 4 independent specimen files to the README, along with usage information?

jashapiro · 2019-11-01T13:39:47Z

Will do.

jashapiro added the data label Oct 11, 2019

jharenza mentioned this issue Oct 11, 2019

Planned Analysis: Analysis of recurrent fusions #10

Closed

jaclyn-taroni mentioned this issue Oct 23, 2019

Update: sample distribution plots accounting for multiple samples from the same individual #162

Closed

jashapiro mentioned this issue Oct 23, 2019

Independent sample analysis (1 of 2?) #167

Merged

2 tasks

jaclyn-taroni mentioned this issue Oct 24, 2019

Sample distribution plots: account for multiple samples from same individual #170

Merged

2 tasks

jashapiro mentioned this issue Oct 24, 2019

Independent sample analysis (2 of 2) #171

Merged

2 tasks

This was referenced Oct 25, 2019

Add an issue template specifically for updating an analysis #175

Merged

Data dictionary for tumor_descriptor field #178

Closed

jharenza mentioned this issue Oct 31, 2019

Planned Data release: v7 #194

Closed

This was referenced Oct 31, 2019

V7 Release #200

Merged

V7 Release #204

Merged

This was referenced Nov 1, 2019

Description of independent sample selection AlexsLemonade/OpenPBTA-manuscript#50

Merged

Independent specimen docs #206

Merged

jaclyn-taroni closed this as completed in #206 Nov 1, 2019
