update focal-cn-file-preparation to resolve duplicate status calls #292
Thanks for working on this!
The code runs on EC2 with no issues. Merging and overlap calculations look fine. A small percentage of duplications remain unresolved; however, for downstream analyses we can use a bootstrap method to resolve these when generating plots and reporting in tables.
The latest changes have reduced the number of duplicated calls from 126 to 99, with the majority (N=94) being in TARGET samples.
Purpose/implementation Section
Resolve duplicate locus status calls generated by the focal-cn-file-preparation analysis module.
What scientific question is your analysis addressing?
In cases where multiple status calls are present for the same gene in the same biospecimen, determine if these can be resolved to only include a single status call per gene and biospecimen ID.
What was your approach?
Duplicate locus status calls were resolved depending on the cause of duplication:
- CNV segment annotation was duplicated in the output file because the overlapping gene maps to multiple cytobands. In many instances, the AnnotationDbi::select() function maps a gene to >1 cytoband, resulting in a duplicated segment annotation for each unique gene-cytoband pair. To resolve this, the annotation file was filtered to include only one unique cytoband annotation per gene.
- Multiple segments with the same status call overlap a gene. In these instances, only one segment annotation with the status call is now retained.
- Multiple segments with conflicting status calls overlap a gene. In these instances, the resolve_duplicate_annotations() function in the resolve_duplicate_annotations.R script is implemented to identify a dominant status call per gene using the following criteria:
  - NA status --> retain non-NA status call
  - neutral status --> retain non-neutral status call
  - amplification status with different copy numbers --> retain one call with average copy number across segments
  - amplification/gain or deep deletion/loss --> retain only amplification or deep deletion call, respectively.
What GitHub issue does your pull request address?
Ticket #436 and Ticket #437
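The resolution criteria listed under "What was your approach?" can be illustrated with a minimal Python sketch. This is not the module's R implementation of resolve_duplicate_annotations(); the function name `resolve_status_calls`, the tuple layout, and the exact status label strings are assumptions made for illustration only.

```python
from statistics import mean


def resolve_status_calls(calls):
    """Collapse the status calls for one gene in one biospecimen.

    `calls` is a list of (status, copy_number) tuples, where status is None
    for NA. This layout is hypothetical, not the module's actual schema.
    Returns the calls that remain; more than one entry means the duplicate
    could not be resolved by these rules.
    """
    # NA status: retain non-NA calls when any exist.
    non_na = [c for c in calls if c[0] is not None]
    if non_na:
        calls = non_na

    # neutral status: retain non-neutral calls when any exist.
    non_neutral = [c for c in calls if c[0] != "neutral"]
    if non_neutral:
        calls = non_neutral

    statuses = {status for status, _ in calls}

    # amplification/gain or deep deletion/loss: keep the stronger call.
    if "amplification" in statuses:
        calls = [c for c in calls if c[0] != "gain"]
    if "deep deletion" in statuses:
        calls = [c for c in calls if c[0] != "loss"]

    # Several amplification calls: keep one call at the mean copy number.
    amp_copy_numbers = [cn for status, cn in calls if status == "amplification"]
    if len(amp_copy_numbers) > 1:
        others = [c for c in calls if c[0] != "amplification"]
        calls = others + [("amplification", mean(amp_copy_numbers))]

    # Identical calls collapse to a single entry, preserving order.
    return list(dict.fromkeys(calls))
```

Under these assumptions, a gene with conflicting calls that no rule covers (for example gain and loss) is returned with both calls intact, which corresponds to the unresolved duplicates that remain in the output.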
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
- 04-prepare-cn-file.R script should be reviewed to ensure that merging of overlapping exonic regions in the gtf annotation file is performed correctly.
- process_annotate_overlaps.R script should be reviewed to ensure that % gene exon overlap for each segment is calculated correctly.
- resolve_duplicated_annotations.R script and function should be reviewed to ensure that filtering and resolution of duplicated calls is being performed as expected.
Is there anything that you want to discuss further?
The changes made in this PR resolve >99% of duplicated locus status calls in the consensus_wgs_plus_cnvkit_wxs.tsv.gz output file; however, 126 duplicated calls could not be resolved and are included in the output as duplicates. These genes/calls should be analyzed further to determine whether any further resolution can be reached. Additionally, 792 duplicate calls (43.5%) in the controlfreec_annotated_cn_wxs_autosomes.tsv.gz and controlfreec_annotated_cn_wxs_x_and_y.tsv.gz files could not be resolved.
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes
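For reviewers checking the exon merging and % gene exon overlap steps named above, the intended calculation can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in 04-prepare-cn-file.R or process_annotate_overlaps.R: the function names are hypothetical, and intervals are treated as half-open (start, end) pairs on a single chromosome, whereas GTF coordinates are 1-based and inclusive.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def pct_overlap(exons, segment):
    """Percent of a gene's merged exonic bases covered by one segment."""
    seg_start, seg_end = segment
    merged = merge_intervals(exons)
    total = sum(end - start for start, end in merged)
    if total == 0:
        return 0.0
    # Sum the per-exon intersection lengths with the segment.
    covered = sum(
        max(0, min(end, seg_end) - max(start, seg_start))
        for start, end in merged
    )
    return 100.0 * covered / total


exons = [(100, 200), (150, 300), (400, 500)]  # two exons overlap
print(merge_intervals(exons))          # [(100, 300), (400, 500)]
print(pct_overlap(exons, (200, 450)))  # 50.0
```

Merging before summing matters because exons from different transcripts of the same gene often overlap; without it, shared bases would be counted more than once in the denominator.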
Results
What types of results are included (e.g., table, figure)?
NOTE: due to large size of some results files, they were not pushed as part of this PR. Please run analysis module to generate all results.
- results directory
- plots directory
What is your summary of the results?
Over 99% of duplicate calls in consensus_wgs_plus_cnvkit_wxs.tsv.gz were resolved, and 126 duplicate calls are still present in this results file. The column pct_overlap is now included and denotes the percent of the gene's exonic regions that are overlapped by the segment. 56.51% of duplicate calls in the controlfreec_annotated_cn_wxs_autosomes.tsv.gz and controlfreec_annotated_cn_wxs_x_and_y.tsv.gz results files were also resolved, and these files also now include the pct_overlap column.
Reproducibility Checklist
Documentation Checklist
- README and it is up to date.
- analyses/README.md and the entry is up to date.