-
Notifications
You must be signed in to change notification settings - Fork 67
Convert the cnv consensus output table to a segfile #441
Convert the cnv consensus output table to a segfile #441
Conversation
Follows the format of the cnvkit seg file in OpenPBTA/data/
@jashapiro CI is failing with:
Full output is here: build_2567_step_116_container_0.txt |
@jashapiro I am unable to update this branch so it is up to date with the AlexsLemonade master branch. Can you allow edits from maintainers? If that is already enabled, then something else is going on and I'm not sure what it might be but I'm sure we could get to the bottom of it. Something similar happened on #439 so would love to figure it out if it's not that! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. I had a few questions about the output and I think the README needs more documentation before merging. You and I had talked in person about revisiting how CNVkit calculates seg.mean
-- did you look at that and come to the decision that weighted mean is the way to go here?
@@ -0,0 +1,7070 @@ | |||
ID chrom loc.start loc.end num.mark seg.mean copy.num | |||
BS_007JTNB8 chr11 771036 866778 NA 0.214821 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like all of the num.mark
values are NA
- is this expected behavior? Also some NA
values in the seg.mean
column. I don't have an intuition for whether or not this is expected or if NA
values should be handed in the seg_*
functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, does the seg.mean
become NA
when a call is the result of consensus between Manta and ControlFreeC but not CNVkit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct. Only CNVkit uses seg.mean, so if there is no CNVkit call, we have nothing to work from. The num.mark column is also something potentially specific to CNVkit, but was not carried through in the consensus scripts. It seems to be optional to the .seg file format; I could just leave it off if nothing downstream depends on it. I thought preserving the column structure was worth doing for the first pass.
As to the contents of num.mark: it is "bins" from CNVkit, so it doesn't really correspond to anything we can calculate easily, though I suppose we could fill in length or something like it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you've implemented sounds good to me in light of this additional information, thanks for the explanation!
Co-Authored-By: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spotted one typo - LGTM otherwise!
@@ -35,6 +43,10 @@ Since there are 3 callers, there were 3 comparisons: `manta-cnvkit`, `manta-free | |||
|
|||
11) **Sort and merge** the CNVs from the comparison pairs ,`manta-cnvkit` `manta-freec` `cnvkit-freec`, together into 1 file | |||
12) After every samples' consensus CNVs were called, **combine all merged files** from step 10 and output to `results/cnv_consensus.tsv` | |||
13) The `results/cnv_consensus.tsv` is translated into a `pbta-cnv-consensus.seg` file in the same format as `pbta-cnv-cnvkit.seg.gz`. | |||
When a consensus CNV contains from multiple source CNV segments, we take the mean of the CNVkit `seg.mean` values from the source segments, weighted by segment lenth. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a consensus CNV contains from multiple source CNV segments, we take the mean of the CNVkit `seg.mean` values from the source segments, weighted by segment lenth. | |
When a consensus CNV contains from multiple source CNV segments, we take the mean of the CNVkit `seg.mean` values from the source segments, weighted by segment length. |
Purpose/implementation Section
What scientific question is your analysis addressing?
How should various CNV calls be merged for downstream analysis?
What was your approach?
I wrote an R script to take the output from the copy number consensus analysis, and reduce it to a set of data that can be consumed as if it had come from cnvkit (more or less). Notably, as the copy number consensus may merge multiple nearby CNVs into a single event, this this requires some way of merging
seg.mean
values from multiple CNV calls.In order to merge
seg.mean
values, I am currently taking the approach of calculating the weighted mean of all CNVkit calls that comprise each consensus call. The code does include the possibility of various other schemes such as weighted medians with or without interpolation.While not used in this output, the script does also calculate a weighted median of copy number values (integers).
What GitHub issue does your pull request address?
#392
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Is the choice of weighted means to produce a single value reasonable?
I am using weighted median copy number from CNVkit where I can, and ControlFreeC values when CNVkit did not contribute to the consensus. Does that make sense?
Is there anything that you want to discuss further?
I do not think the discussions around #392 are complete: is this the format of output that we want to use, or should we include some of the other information in the final output file, such as alternative consensus seg.mean estimates?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Results
What types of results are included (e.g., table, figure)?
A segfile table:
pbta-cnv-consensus.seg
What is your summary of the results?
Reproducibility Checklist
Documentation Checklist
README
and it is up to date.analyses/README.md
and the entry is up to date.