This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Analyses including multiple samples from the same individual #155

Closed
jashapiro opened this issue Oct 11, 2019 · 15 comments · Fixed by #206

@jashapiro
Member

Question/issue

Many of the analyses that have been proposed are sensitive to the fact that we have multiple tumor samples from different time points or tumors from the same individual. For example, biospecimens BS_K07KNTFY and BS_AQMKA8NC are both tumor WGS data (initial and recurrence, respectively) from participant PT_00G007DM. While these data are extremely useful, they raise questions for many particular analyses, which I would like to discuss in this issue.

In particular, analyses of mutation prevalence, variant allele frequency distributions, classification accuracy, etc. are likely to be affected by these non-independent samples. In some cases, a simple awareness of the issue will be sufficient, and analyses can be written to account for or take advantage of the redundancy in the data. However, for many analyses, decisions about which samples to include or exclude will need to be made, and it would be good to have an agreed-upon set of standards and procedures.

For a specific example, in the analysis of mutation co-occurrence (#13), including all samples would result in many spurious reports of co-occurrence, as it is quite common for two samples from the same individual to have the same sets of mutations. Similarly, analyses of recurrent fusions (#10), distribution of tumor mutation burden (#3), etc. will likely be affected.

One potential solution is to use only primary tumors and/or the earliest sampled tumor from each individual in analyses such as this. However, this would miss some potential co-occurrence patterns that may be important in progression and recurrence, which might suggest that the latest tumor from each individual would be better. Doing both is of course an option as well, but I am curious to hear what others think is most appropriate. Ultimately, we may want to add a recommendation to the documentation for future analyses.
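As a rough sketch of the two options above (earliest versus latest tumor per individual), the following assumes each sample record carries a participant ID and an age at collection in days. The field names `participant`, `sample`, and `age_days`, the `PT_EXAMPLE` row, and all age values are illustrative assumptions; only the two biospecimen IDs for PT_00G007DM, and the fact that one is initial and the other a recurrence, come from the example above.

```python
# Hypothetical records: the IDs for PT_00G007DM come from the example above,
# but the age_days values and the PT_EXAMPLE row are made up for illustration.
samples = [
    {"participant": "PT_00G007DM", "sample": "BS_K07KNTFY", "age_days": 1000},  # initial
    {"participant": "PT_00G007DM", "sample": "BS_AQMKA8NC", "age_days": 1400},  # recurrence
    {"participant": "PT_EXAMPLE", "sample": "BS_EXAMPLE1", "age_days": 2000},
]


def one_per_participant(records, latest=False):
    """Return {participant: sample}, keeping the earliest (or latest) sample."""
    pick = max if latest else min
    by_participant = {}
    for rec in records:
        by_participant.setdefault(rec["participant"], []).append(rec)
    return {
        pid: pick(recs, key=lambda r: r["age_days"])["sample"]
        for pid, recs in by_participant.items()
    }


earliest = one_per_participant(samples)
latest = one_per_participant(samples, latest=True)
print(earliest["PT_00G007DM"], latest["PT_00G007DM"])
```

Running an analysis once on each selection would be one way to "do both" while keeping every run limited to a single sample per individual.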

@jashapiro jashapiro added the data label Oct 11, 2019
@jharenza
Collaborator

@jashapiro I recently had this on my mind as I saw some of the mutation analyses you described, so I am glad you are bringing this up in an issue. As you allude to, I think which samples we use also depends on how the data are being analyzed and plotted. For example, I think it makes sense to limit to one primary tumor per patient when doing mutual exclusivity/co-occurrence analyses (#13), but we could calculate TMB (#3) for all samples and, when these are plotted, plot diagnosis and relapse separately. In fact, we expect a higher TMB in relapse tumors compared to primary, so this would be good validation. For fusions (#10), I think this is also important to discuss. We also have cell lines, and some of the fusions may be lost in culture or at relapse; these would be important things to note, and I think they could fall within the scope of another issue.

We also have WGS and WXS for some samples at the same phase of therapy, so I was thinking that for consensus calls (#69), we should collapse these data to a per-patient level. For the oncoprint (#6), we could add tracks for phase of therapy and/or separate dx/relapse/cell lines, but we can flesh this out in this issue.

@cgreene
Collaborator

cgreene commented Oct 13, 2019

I agree that we should ask people to - by default - report on only primary tumors and filter to one per patient for most of the overall atlas work. How many patients had multiple primaries?

We'll probably also want to go back and make sure that existing analyses thoughtfully consider whether this rule should be applied.

@jashapiro
Member Author

I thought I would give a bit of specificity to this question. First up: highlighting the WGS samples where we have multiple biospecimens from the same participants, with no obvious way to decide which of these is earliest.

For these data, I have chosen the samples with the smallest age at diagnosis for each participant, and removed all derived cell line samples.

| Kids_First_Participant_ID | samples | descriptors | age_diagnosis | OS_days |
| --- | --- | --- | --- | --- |
| PT_9GKVQ9QS | BS_JRFVST47, BS_3Z40EZHD | Diagnosis | 1825 | 171 |
| PT_DTP4MMRA | BS_KH3859M5, BS_NASADC3P | Second Malignancy | 4988 | 842 |
| PT_KBFM551M | BS_M0B42FPR, BS_M5FM63EB, BS_J8EK6RNF, BS_9P4NDTKJ | Diagnosis, Progressive Disease Post-Mortem | 3285 | 395 |
| PT_KTRJ8TFY | BS_BQ81D2BP, BS_HYKV2TH9, BS_EE73VE7V, BS_3VKW5988, BS_AF5D41PD, BS_5968GBGT | Diagnosis, Progressive Disease Post-Mortem | 1825 | 274 |
| PT_KZ56XHJT | BS_H8NWA41N, BS_AK9BV52G, BS_X5VN0FW0, BS_22VCR7DF, BS_D6STCMQS, BS_0ATJ22QA, BS_YHXMYDBN, BS_1Q524P3B | Progressive, Progressive Disease Post-Mortem | 2555 | 241 |
| PT_M9XXJ4GR | BS_R6CKWZW6, BS_682Z7WH6 | Diagnosis | 3650 | 260 |
| PT_MNSEJCDM | BS_ZSH09N84, BS_J8EH1N7V, BS_Y74XAFJX, BS_CBMAWSAR | Diagnosis | 2190 | |
| PT_NK8A49X5 | BS_AHAXPFG3, BS_1MME7FBS, BS_HEJ72V3F | Diagnosis, Progressive | 4745 | 704 |
| PT_QD6KKKJH | BS_B91XGSA5, BS_9BN45DFK | Initial CNS Tumor | 3129 | 106 |

As you can see, there are still many participants with multiple samples. I am not sure how we should choose among these, or if they should be merged before analysis. In some cases, all of a participant's samples share the same tumor descriptor, but in others there is more than one. Choosing a single descriptor might help, but it would not resolve the problem, as most of these participants have more than one sample with each descriptor.
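The selection described above (drop derived cell lines, then keep the samples with the smallest age for each participant) can be sketched as follows, to show why ties remain. The field names `participant`, `sample`, `age_days`, and `is_cell_line` are assumptions for illustration. The first two rows reuse IDs and the age from the table; the other rows are hypothetical.

```python
from collections import defaultdict

# Rows 1-2 reuse IDs and the age from the table above. The remaining rows
# (a later sample, a derived cell line, a single-sample participant) are
# hypothetical, as are all field names.
records = [
    {"participant": "PT_9GKVQ9QS", "sample": "BS_JRFVST47", "age_days": 1825, "is_cell_line": False},
    {"participant": "PT_9GKVQ9QS", "sample": "BS_3Z40EZHD", "age_days": 1825, "is_cell_line": False},
    {"participant": "PT_9GKVQ9QS", "sample": "BS_HYPOTH_LATE", "age_days": 2100, "is_cell_line": False},
    {"participant": "PT_9GKVQ9QS", "sample": "BS_HYPOTH_CELL", "age_days": 1825, "is_cell_line": True},
    {"participant": "PT_HYPOTHETICAL", "sample": "BS_HYPOTH_ONLY", "age_days": 900, "is_cell_line": False},
]


def earliest_samples(rows):
    """Drop cell lines, then keep every sample tied for the smallest age per participant."""
    by_participant = defaultdict(list)
    for rec in rows:
        if not rec["is_cell_line"]:
            by_participant[rec["participant"]].append(rec)
    kept = {}
    for pid, recs in by_participant.items():
        min_age = min(r["age_days"] for r in recs)
        kept[pid] = [r["sample"] for r in recs if r["age_days"] == min_age]
    return kept


kept = earliest_samples(records)
# Participants for whom the age filter alone does not yield a single sample.
ambiguous = {pid: ids for pid, ids in kept.items() if len(ids) > 1}
print(ambiguous)
```

Any participant left in `ambiguous` needs a further tie-breaking rule or a merge step, which is the open question here.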

More generally, is there somewhere where the fields are described? I am unsure of what the OS_days field refers to.

@jaclyn-taroni
Member

With #191, I reran @jashapiro's independent-samples module with the data from release-v6-20191030.

That means that we have 4 lists of independent samples here that can be used for downstream analysis when there should not be multiple samples from the same individual in an analysis (from #171):

  • WGS primary specimens
  • WGS specimens (including non-primary)
  • WGS+WXS primary specimens
  • WGS+WXS specimens (including non-primary)

Because we would like all analyses to use the same set of samples when accounting for the multiple samples issue (see thread here), we need to figure out how to facilitate use by 1) deciding if/how to include these files in the data download process before the next planned data release and 2) figuring out how to document this in the README.

@jashapiro and I have talked about putting the files (using the specific commit linked above) in S3 and having that be included in download-data.sh.

As far as the second point goes, it might be helpful for me to first complete the breakdown by tumor_descriptor portion of #162 so we have something we can link to in the README alongside this issue.

@jharenza
Collaborator

@jaclyn-taroni - I had added his files to the latest release (v6) - forgot to note this. Do you want these in a new release or can we use the first set?

@cgreene
Collaborator

cgreene commented Oct 31, 2019

I think that we should advise people to filter samples to WGS primary specimens.

This way folks would need to actively think about the scientific purpose to go beyond that. It seems like this best balances ease (providing one recommendation) with care. We should provide guidance on how to filter. Perhaps we could provide, either in the README or as an importable function, template filtering code for this default for each of the file types that have already been worked with; either would ease review.
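A minimal sketch of what such an importable filtering function might look like is below. It assumes the independent-specimens list is a tab-separated file with a biospecimen ID column; the column name `Kids_First_Biospecimen_ID` (by analogy with `Kids_First_Participant_ID`), the `gene` column, and the contents of both in-memory tables are illustrative assumptions, not the actual files.

```python
import csv
import io

ID_COLUMN = "Kids_First_Biospecimen_ID"  # assumed column name


def read_independent_ids(handle, id_column=ID_COLUMN):
    """Read the set of biospecimen IDs from a tab-separated list."""
    return {row[id_column] for row in csv.DictReader(handle, delimiter="\t")}


def filter_to_independent(rows, independent_ids, id_column=ID_COLUMN):
    """Keep only rows whose biospecimen ID is in the independent set."""
    return [row for row in rows if row[id_column] in independent_ids]


# Hypothetical stand-ins for an independent-specimens list and a data table.
independent_list = io.StringIO(
    "Kids_First_Participant_ID\tKids_First_Biospecimen_ID\n"
    "PT_00G007DM\tBS_K07KNTFY\n"
)
data_table = io.StringIO(
    "Kids_First_Biospecimen_ID\tgene\n"
    "BS_K07KNTFY\tGENE_A\n"
    "BS_AQMKA8NC\tGENE_A\n"
)

independent_ids = read_independent_ids(independent_list)
rows = list(csv.DictReader(data_table, delimiter="\t"))
filtered = filter_to_independent(rows, independent_ids)
print([row[ID_COLUMN] for row in filtered])
```

Keeping the filter in one shared function would let reviewers confirm at a glance that an analysis used the recommended default.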

@jaclyn-taroni
Member

Did you rerun it with the v6 data or use the files already in the repository (v5)? I downloaded the v6 files yesterday and they were not present. Perhaps an issue with them missing from md5sum.txt?

@cgreene
Collaborator

cgreene commented Oct 31, 2019

I guess the question was about method of distribution: adding this list to the md5sum and data download would seem like the preferred method of distribution.
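For illustration, a downloaded file could be checked against an `md5sum`-style manifest roughly as follows, which would also flag files missing from the manifest. This sketch assumes the manifest uses the usual `<hash>  <filename>` line format; the file names and contents are hypothetical, and this is not the project's actual download script.

```python
import hashlib
import tempfile
from pathlib import Path


def parse_md5sum(text):
    """Map file name -> expected MD5 from md5sum-style lines."""
    expected = {}
    for line in text.splitlines():
        if line.strip():
            digest, name = line.split(maxsplit=1)
            # md5sum marks binary-mode entries with a leading "*".
            expected[name.lstrip("*").strip()] = digest
    return expected


def check_file(path, expected):
    """Return 'ok', 'mismatch', or 'not listed' for one downloaded file."""
    name = Path(path).name
    if name not in expected:
        return "not listed"
    actual = hashlib.md5(Path(path).read_bytes()).hexdigest()
    return "ok" if actual == expected[name] else "mismatch"


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical files: one present in the manifest, one left out of it.
    listed = Path(tmp, "listed-example.tsv")
    listed.write_text("BS_K07KNTFY\n")
    unlisted = Path(tmp, "unlisted-example.tsv")
    unlisted.write_text("BS_AQMKA8NC\n")

    manifest = f"{hashlib.md5(listed.read_bytes()).hexdigest()}  {listed.name}\n"
    expected = parse_md5sum(manifest)
    listed_status = check_file(listed, expected)
    unlisted_status = check_file(unlisted, expected)

print(listed_status, unlisted_status)
```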

@jharenza
Collaborator

Ahh, ok, those were missed in the S3 transfer on our end - I took the lists from the v5 release (clinical subjects did not change) that @jashapiro created from his PR. We can add these with the new controlFreeC file - just let me know which versions you want!

@jaclyn-taroni
Member

Can you add the files here: https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/0c2d0d25c01dcbbbd63f94b064a69afc9dc44ea8/analyses/independent-samples/results

Those are updated to use v6. The files did change because of changes to the clinical data: #191 (review)

@jharenza
Collaborator

Will do!

@jharenza
Collaborator

jharenza commented Oct 31, 2019

Should we rename these files as independent-patient-cohort-** to be clear these are from unique patient cohorts, since the specimens are already all unique?

@jashapiro
Member Author

I like independent-specimens, as the other lists include replicate (non-independent) specimens, whereas the ones here are truly independent.

@jharenza
Collaborator

jharenza commented Nov 1, 2019

@jashapiro, once #205 is merged, will you please add a description of the 4 independent specimen files to the README, along with usage information?

@jashapiro
Member Author

Will do.
