Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Convert the cnv consensus output table to a segfile #441

Merged

Conversation

jashapiro
Copy link
Member

@jashapiro jashapiro commented Jan 15, 2020

Purpose/implementation Section

What scientific question is your analysis addressing?

How should various CNV calls be merged for downstream analysis?

What was your approach?

I wrote an R script to take the output from the copy number consensus analysis, and reduce it to a set of data that can be consumed as if it had come from cnvkit (more or less). Notably, as the copy number consensus may merge multiple nearby CNVs into a single event, this this requires some way of merging seg.mean values from multiple CNV calls.

In order to merge seg.mean values, I am currently taking the approach of calculating the weighted mean of all CNVkit calls that comprise each consensus call. The code does include the possibility of various other schemes such as weighted medians with or without interpolation.

While not used in this output, the script does also calculate a weighted median of copy number values (integers).

What GitHub issue does your pull request address?

#392

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is the choice of weighted means to produce a single value reasonable?

I am using weighted median copy number from CNVkit where I can, and ControlFreeC values when CNVkit did not contribute to the consensus. Does that make sense?

Is there anything that you want to discuss further?

I do not think the discussions around #392 are complete: is this the format of output that we want to use, or should we include some of the other information in the final output file, such as alternative consensus seg.mean estimates?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

A segfile table: pbta-cnv-consensus.seg

What is your summary of the results?

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@jashapiro jashapiro marked this pull request as ready for review January 16, 2020 16:24
@jaclyn-taroni jaclyn-taroni self-requested a review January 16, 2020 19:50
@jaclyn-taroni
Copy link
Member

@jashapiro CI is failing with:

Complete log: /rocker-build/analyses/copy_number_consensus_call/.snakemake/log/2020-01-16T171857.497395.snakemake.log
Error in file.path(analysis_dir, opts$cnv_file) : 
  object 'analysis_dir' not found
Calls: <Anonymous> ... read_delimited -> source_name -> is.connection -> file.path
Execution halted

Full output is here: build_2567_step_116_container_0.txt

@jaclyn-taroni
Copy link
Member

@jashapiro I am unable to update this branch so it is up to date with the AlexsLemonade master branch. Can you allow edits from maintainers? If that is already enabled, then something else is going on and I'm not sure what it might be but I'm sure we could get to the bottom of it. Something similar happened on #439 so would love to figure it out if it's not that!

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. I had a few questions about the output and I think the README needs more documentation before merging. You and I had talked in person about revisiting how CNVkit calculates seg.mean -- did you look at that and come to the decision that weighted mean is the way to go here?

analyses/copy_number_consensus_call/README.md Show resolved Hide resolved
@@ -0,0 +1,7070 @@
ID chrom loc.start loc.end num.mark seg.mean copy.num
BS_007JTNB8 chr11 771036 866778 NA 0.214821 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like all of the num.mark values are NA - is this expected behavior? Also some NA values in the seg.mean column. I don't have an intuition for whether or not this is expected or if NA values should be handed in the seg_* functions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, does the seg.mean become NA when a call is the result of consensus between Manta and ControlFreeC but not CNVkit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. Only CNVkit uses seg.mean, so if there is no CNVkit call, we have nothing to work from. The num.mark column is also something potentially specific to CNVkit, but was not carried through in the consensus scripts. It seems to be optional to the .seg file format; I could just leave it off if nothing downstream depends on it. I thought preserving the column structure was worth doing for the first pass.

As to the contents of num.mark: it is "bins" from CNVkit, so it doesn't really correspond to anything we can calculate easily, though I suppose we could fill in length or something like it.

Copy link
Member

@jaclyn-taroni jaclyn-taroni Jan 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you've implemented sounds good to me in light of this additional information, thanks for the explanation!

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spotted one typo - LGTM otherwise!

@@ -35,6 +43,10 @@ Since there are 3 callers, there were 3 comparisons: `manta-cnvkit`, `manta-free

11) **Sort and merge** the CNVs from the comparison pairs ,`manta-cnvkit` `manta-freec` `cnvkit-freec`, together into 1 file
12) After every samples' consensus CNVs were called, **combine all merged files** from step 10 and output to `results/cnv_consensus.tsv`
13) The `results/cnv_consensus.tsv` is translated into a `pbta-cnv-consensus.seg` file in the same format as `pbta-cnv-cnvkit.seg.gz`.
When a consensus CNV contains from multiple source CNV segments, we take the mean of the CNVkit `seg.mean` values from the source segments, weighted by segment lenth.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When a consensus CNV contains from multiple source CNV segments, we take the mean of the CNVkit `seg.mean` values from the source segments, weighted by segment lenth.
When a consensus CNV contains from multiple source CNV segments, we take the mean of the CNVkit `seg.mean` values from the source segments, weighted by segment length.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants