Analyses including multiple samples from the same individual #155
Comments
@jashapiro I recently had this on my mind as I saw some of the mutation analyses you described, so I am glad you are bringing this up in an issue. As you allude to, I think which samples we include depends on how the data are being analyzed and plotted. For example, I think it makes sense to limit to one primary tumor per patient when doing mutual exclusivity/co-occurrence analyses (#13), but we could calculate TMB (#3) for all patients and, when these are plotted, separate diagnosis and relapse. In fact, we expect a higher TMB in relapse tumors compared to primary, so this would be good validation. For fusions (#10), I think this is also important to discuss. We also have cell lines, and some of the fusions may be lost in culture or at relapse, so these would be important things to note and could fall within the scope of another issue. We also have WGS and WXS for some samples at the same phase of therapy, so I was thinking that for consensus calls (#69), we should collapse these data on a per-patient level. For the oncoprint (#6), we could add tracks for phase of therapy and/or separate dx/relapse/cell lines, but we can flesh this out in this issue.
I agree that we should ask people to - by default - report on only primary tumors and filter to one per patient for most of the overall atlas work. How many patients had multiple primaries? We'll probably also want to go back and make sure that existing analyses thoughtfully consider whether this rule should be applied.
I thought I would give a bit of specificity to this question. First up: highlighting the WGS samples where we have multiple biospecimens from the same participants, with no obvious way to decide which of these is earliest. For these data, I have chosen the samples with the smallest age at diagnosis for each participant, and removed all derived cell line samples.
As you can see, there are still many participants with multiple samples. I am not sure how we should choose among these, or if they should be merged before analysis. In some cases, you can see that they all have the same tumor descriptor, but sometimes there is more than one. While deciding which descriptor to select might be helpful, it would not eliminate the problem, as most of these participants have more than one sample with each descriptor. More generally, is there somewhere the fields are described? I am unsure of what the OS_days field refers to.
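The selection described in this comment (drop derived cell lines, then keep the samples with the smallest age at diagnosis for each participant) can be sketched roughly as follows. This is a minimal illustration only: the records, the field names (`participant`, `composition`, `age_at_diagnosis_days`, and so on), and the `"Derived Cell Line"` label are assumptions made for the example, not the actual schema of the clinical file.

```python
from collections import defaultdict

# Hypothetical records. The field names and values are assumptions for
# illustration, not a documented schema for the clinical file.
samples = [
    {"biospecimen": "BS_A1", "participant": "PT_1", "composition": "Solid Tissue",
     "tumor_descriptor": "Initial CNS Tumor", "age_at_diagnosis_days": 1200},
    {"biospecimen": "BS_A2", "participant": "PT_1", "composition": "Solid Tissue",
     "tumor_descriptor": "Progressive", "age_at_diagnosis_days": 1200},
    {"biospecimen": "BS_A3", "participant": "PT_1", "composition": "Derived Cell Line",
     "tumor_descriptor": "Initial CNS Tumor", "age_at_diagnosis_days": 1200},
    {"biospecimen": "BS_B1", "participant": "PT_2", "composition": "Solid Tissue",
     "tumor_descriptor": "Initial CNS Tumor", "age_at_diagnosis_days": 800},
    {"biospecimen": "BS_B2", "participant": "PT_2", "composition": "Solid Tissue",
     "tumor_descriptor": "Recurrence", "age_at_diagnosis_days": 1500},
]


def earliest_non_cell_line(records):
    """Drop derived cell lines, then keep every sample that ties for the
    smallest age at diagnosis within its participant."""
    tissue = [r for r in records if r["composition"] != "Derived Cell Line"]

    # First pass: smallest age at diagnosis per participant.
    min_age = {}
    for r in tissue:
        pid = r["participant"]
        age = r["age_at_diagnosis_days"]
        if pid not in min_age or age < min_age[pid]:
            min_age[pid] = age

    # Second pass: keep all samples at that age (ties are not resolved).
    kept = defaultdict(list)
    for r in tissue:
        if r["age_at_diagnosis_days"] == min_age[r["participant"]]:
            kept[r["participant"]].append(r["biospecimen"])
    return dict(kept)


kept = earliest_non_cell_line(samples)
# Participants that still have more than one candidate sample.
still_multiple = {pid: ids for pid, ids in kept.items() if len(ids) > 1}
print(still_multiple)
```

In this toy data, `PT_1` still has two candidate samples after filtering, which is the situation the comment describes: age at diagnosis alone does not single out one sample per participant.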
With #191, I reran @jashapiro's That means that we have 4 lists of independent samples here that can be used for downstream analysis when there should not be multiple samples from the same individual in an analysis (from #171):
Because we would like all analyses to use the same set of samples when accounting for the multiple samples issue (see thread here), we need to figure out how to facilitate use by 1) deciding if/how to include these files in the data download process before the next planned data release and 2) figuring out how to document this in the README. @jashapiro and I have talked about putting the files (using the specific commit linked above) in S3 and having that be included in As far as the second point goes, it might be helpful for me to first complete the breakdown by
@jaclyn-taroni - I had added his files to the latest release (v6) - forgot to note this. Do you want these in a new release or can we use the first set?
I think that we should advise people to filter samples to WGS primary specimens. This way folks would need to actively think about the scientific purpose to go beyond that. It seems like this best balances ease (providing one recommendation) with care. We should provide guidance on how to filter. Perhaps we could provide template filtering code for this default, for each of the file types that have already been worked with, either in the README or as an importable function; either would ease review.
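A template of the kind suggested here might look like the sketch below: an importable function that keeps WGS primary specimens and returns at most one per participant. Everything specific in it is an assumption for illustration, including the function name `default_independent_filter`, the field names, and the `"Initial CNS Tumor"` label for primaries. The tie-break (lowest biospecimen ID) is an arbitrary but deterministic placeholder; the thread does not settle how to choose among multiple primaries.

```python
def default_independent_filter(records, strategy="WGS",
                               primary_label="Initial CNS Tumor"):
    """Return biospecimen IDs for `strategy` primary tumors, one per participant.

    Hypothetical helper: field names and labels are assumptions, and ties are
    broken by lowest biospecimen ID purely to make the result reproducible.
    """
    chosen = {}
    for r in sorted(records, key=lambda rec: rec["biospecimen"]):
        if r["experimental_strategy"] != strategy:
            continue
        if r["tumor_descriptor"] != primary_label:
            continue
        # setdefault keeps the first (lowest-ID) record seen per participant.
        chosen.setdefault(r["participant"], r)
    return sorted(r["biospecimen"] for r in chosen.values())


# Hypothetical records for demonstration.
records = [
    {"biospecimen": "BS_01", "participant": "PT_1",
     "experimental_strategy": "WGS", "tumor_descriptor": "Initial CNS Tumor"},
    {"biospecimen": "BS_02", "participant": "PT_1",
     "experimental_strategy": "WGS", "tumor_descriptor": "Recurrence"},
    {"biospecimen": "BS_03", "participant": "PT_1",
     "experimental_strategy": "WXS", "tumor_descriptor": "Initial CNS Tumor"},
    {"biospecimen": "BS_04", "participant": "PT_2",
     "experimental_strategy": "WGS", "tumor_descriptor": "Initial CNS Tumor"},
    {"biospecimen": "BS_05", "participant": "PT_2",
     "experimental_strategy": "WGS", "tumor_descriptor": "Initial CNS Tumor"},
    {"biospecimen": "BS_06", "participant": "PT_3",
     "experimental_strategy": "RNA-Seq", "tumor_descriptor": "Initial CNS Tumor"},
]

selected = default_independent_filter(records)
print(selected)  # ['BS_01', 'BS_04']
```

Keeping the strategy and primary label as parameters means an analysis that needs something other than the default has to say so explicitly in its call, which fits the goal of making people think before going beyond WGS primaries.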
Did you rerun it with the v6 data or use the files already in the repository (v5)? I downloaded the v6 files yesterday and they were not present. Perhaps an issue with them missing from
I guess the question was about method of distribution: adding this list to the md5sum and data download would seem like the preferred method of distribution.
Ahh, ok, those were missed in the S3 transfer on our end - I took the lists from the v5 release (clinical subjects did not change) that @jashapiro created from his PR. We can add these with the new controlFreeC file - just let me know which versions you want!
Can you add the files here: https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/0c2d0d25c01dcbbbd63f94b064a69afc9dc44ea8/analyses/independent-samples/results Those are updated to use v6. The files did change because of changes to the clinical data: #191 (review)
Will do!
Should we rename these files as
I like independent-specimens, as the other lists include replicate (non-independent) specimens, whereas the ones here are truly independent.
@jashapiro, once #205 is merged, will you please add a description of the 4 independent specimen files to the README, along with usage information?
Will do. |
Question/issue
Many of the analyses that have been proposed are sensitive to the fact that we have multiple tumor samples from different time points or tumors from the same individual. For example, biospecimens BS_K07KNTFY and BS_AQMKA8NC are both tumor WGS data (initial and recurrence, respectively) from participant PT_00G007DM. While this is extremely useful data, it presents questions for many particular analyses, which I would like to discuss in this issue.
In particular, analyses of mutation prevalence, variant allele frequency distributions, classification accuracy, etc. are likely to be affected by these non-independent samples. In some cases, a simple awareness of the issue will be sufficient, and analyses can be written to account for or take advantage of the redundancy in the data. However, for many analyses, decisions of which samples to include or exclude will need to be made, and it would be good to have an agreed-upon set of standards and procedures.
For a specific example, in the analysis of mutation co-occurrence (#13), including all samples would result in many spurious reports of co-occurrence, as it is quite common for two samples from the same individual to have the same sets of mutations. Similarly, analyses of recurrent fusions (#10), distribution of tumor mutation burden (#3), etc. will likely be affected.
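To make the co-occurrence concern concrete, here is a small toy sketch with invented data (the sample IDs, participant IDs, and gene names are all hypothetical). It counts how often two genes are mutated together, first over all samples and then over one sample per participant. The choice of "first sample by ID" as the representative is only a placeholder for whatever selection rule is agreed on.

```python
# Hypothetical data: sample ID -> (participant ID, set of mutated genes).
samples = {
    "S1": ("P1", {"GENE_A", "GENE_B"}),
    "S2": ("P1", {"GENE_A", "GENE_B"}),  # later sample, same individual
    "S3": ("P1", {"GENE_A", "GENE_B"}),  # later sample, same individual
    "S4": ("P2", {"GENE_A"}),
    "S5": ("P3", {"GENE_B"}),
    "S6": ("P4", set()),
}


def cooccurrence_counts(gene_sets, a, b):
    """Return the 2x2 table (both, only_a, only_b, neither) for genes a and b."""
    both = sum(1 for g in gene_sets if a in g and b in g)
    only_a = sum(1 for g in gene_sets if a in g and b not in g)
    only_b = sum(1 for g in gene_sets if b in g and a not in g)
    neither = sum(1 for g in gene_sets if a not in g and b not in g)
    return both, only_a, only_b, neither


def one_per_participant(sample_map):
    """Keep the gene set of the first sample (by sample ID) per participant."""
    first = {}
    for sample_id in sorted(sample_map):
        participant, genes = sample_map[sample_id]
        first.setdefault(participant, genes)
    return list(first.values())


all_counts = cooccurrence_counts(
    [genes for _, genes in samples.values()], "GENE_A", "GENE_B")
independent_counts = cooccurrence_counts(
    one_per_participant(samples), "GENE_A", "GENE_B")

print(all_counts)          # (3, 1, 1, 1)
print(independent_counts)  # (1, 1, 1, 1)
```

With all samples included, the "both mutated" cell is three times larger, yet all three observations come from a single individual. Any test run on that table would be treating repeated measurements as independent evidence of co-occurrence.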
One potential solution is to use only primary tumors and/or the earliest sampled tumor from each individual in analyses such as this. However, this would miss some potential co-occurrence patterns that may be important in progression and recurrence, which might suggest that the latest tumor from each individual would be better. Doing both is of course an option as well, but I am curious to hear what others think is most appropriate. Ultimately, we may want to add a recommendation to the documentation for future analyses.