Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update focal-cn-file-preparation to resolve duplicate status calls #292

Merged
merged 13 commits into from
Jan 7, 2023

Conversation

rjcorb
Copy link

@rjcorb rjcorb commented Nov 15, 2022

Purpose/implementation Section

Resolve duplicate locus status calls generated by the focal-cn-file-preparation analysis module

What scientific question is your analysis addressing?

In cases where multiple status calls are present for the same gene in the same biospecimen, determine if these can be resolved to only include a single status call per gene and biospecimen ID.

What was your approach?

Duplicate locus status calls were resolved depending on the cause of duplication:

- CNV segment annotation was duplicated in output file due to overlapping gene mapping to multiple cytobands. In many instances, the AnnotationDbi::select() function maps genes to >1 cytoband, resulting in duplicated segment annotation for each unique gene-cytoband pair. To resolve this, the annotation file was filtered to include only one unique cytoband annotation per gene.

- Multiple segments with the same status call overlap a gene. In these instances, only one segment annotation with the status call is now retained.

- Multiple segments with conflicting status calls overlap a gene. In these instances, the resolve_duplicate_annotations() function in the resolve_duplicate_annotations.R script is implemented to identify a dominant status call per gene using the following criteria:

  1. One of two duplicate calls has NA status --> retain non-NA status call
  2. One of two duplicate calls has neutral status --> retain non-neutral status call
  3. One duplicate call comes from segment with >50% exon overlap, or segment with >1.5x % exon overlap relative to other segments --> maintain major segment call
  4. Duplicate calls both have amplification status with different copy numbers --> retain one call with average copy number across segments
  5. Duplicate calls are amplification/gain or deep deletion/loss --> retain only amplification or deep deletion call, respectively.

What GitHub issue does your pull request address?

Ticket #436 and Ticket #437

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

  • The 04-prepare-cn-file.R script should be reviewed to ensure that merging of overlapping exonic regions in the gtf annotation file is performed correctly.
  • The process_annotate_overlaps.R script should be reviewed to ensure that % gene exon overlap for each segment is calculated correctly.
  • The resolve_duplicated_annotations.R script and function should be reviewed to ensure that filtering and resolution of duplicated calls is being performed as expected.

Is there anything that you want to discuss further?

The changes made in this PR resolves >99% of duplicated locus status calls in the consensus_wgs_plus_cnvkit_wxs.tsv.gz output file; however 126 duplicated calls could not be resolved and are included in the output as duplicates:

# A tibble: 252 x 8
   biospecimen_id           status        copy_number ploidy ensembl         gene_symbol cytoband     pct_overlap
   <chr>                    <chr>               <dbl>  <dbl> <chr>           <chr>       <chr>              <dbl>
 1 TARGET-10-PANVKH-09A-01D gain                    3      2 ENSG00000066735 KIF26A      14q32.33            43.9
 2 TARGET-10-PANVKH-09A-01D loss                    1      2 ENSG00000066735 KIF26A      14q32.33            40.4
 3 TARGET-10-PAPADT-09A-01D loss                    1      3 ENSG00000271968 ADAM6       14q32.33            48.1
 4 TARGET-10-PAPADT-09A-01D gain                    4      3 ENSG00000271968 ADAM6       14q32.33            42.2
 5 TARGET-10-PAPAVR-09A-01D loss                    1      2 ENSG00000206503 HLA-A       6p22.1              47.3
 6 TARGET-10-PAPAVR-09A-01D gain                    3      2 ENSG00000206503 HLA-A       6p22.1              42.4
 7 TARGET-10-PAPDDA-09A-01D gain                    4      3 ENSG00000134545 KLRC1       12p13.2             45.0
 8 TARGET-10-PAPDDA-09A-01D loss                    2      3 ENSG00000134545 KLRC1       12p13.2             34.8
 9 TARGET-10-PARPGJ-09A-01D gain                    3      2 ENSG00000259458 MGC15885    15q22.2             44.7
10 TARGET-10-PARPGJ-09A-01D loss                    1      2 ENSG00000259458 MGC15885    15q22.2             34.7
11 TARGET-10-PARSKH-09A-01D gain                    5      3 ENSG00000185158 LRRC37B     17q11.2             49.0
12 TARGET-10-PARSKH-09A-01D loss                    2      3 ENSG00000185158 LRRC37B     17q11.2             48.8
13 TARGET-10-PATXNK-09A-01D gain                    3      2 ENSG00000120159 CAAP1       9p21.2              49.1
14 TARGET-10-PATXNK-09A-01D loss                    1      2 ENSG00000120159 CAAP1       9p21.2              35.0
15 TARGET-20-PANLIZ-09A-01D loss                    1      2 ENSG00000104388 RAB2A       8q12.1-q12.2        45.5
16 TARGET-20-PANLIZ-09A-01D gain                    3      2 ENSG00000104388 RAB2A       8q12.1-q12.2        41.9
17 TARGET-30-PAIMDT-01A-01W amplification           8      2 ENSG00000089876 DHX32       10q26.2             45.6
18 TARGET-30-PAIMDT-01A-01W loss                    1      2 ENSG00000089876 DHX32       10q26.2             39.1
19 TARGET-30-PAKZRE-01A-01W loss                    2      3 ENSG00000152518 ZFP36L2     2p21                49.0
20 TARGET-30-PAKZRE-01A-01W gain                    4      3 ENSG00000152518 ZFP36L2     2p21                37.5
21 TARGET-30-PALBFW-01A-01W gain                    3      2 ENSG00000196531 NACA        12q13.3             20.3
22 TARGET-30-PALBFW-01A-01W loss                    1      2 ENSG00000196531 NACA        12q13.3             17.1
23 TARGET-30-PALETP-01A-01W gain                    3      2 ENSG00000185437 SH3BGR      21q22.2             49.6
24 TARGET-30-PALETP-01A-01W loss                    1      2 ENSG00000185437 SH3BGR      21q22.2             44.4
25 TARGET-30-PALETP-01A-01W gain                    3      2 ENSG00000182670 TTC3        21q22.13            49.5
26 TARGET-30-PALETP-01A-01W loss                    1      2 ENSG00000182670 TTC3        21q22.13            47.3
27 TARGET-30-PALEVG-01A-01W loss                    1      2 ENSG00000161958 FGF11       17p13.1             44.5
28 TARGET-30-PALEVG-01A-01W gain                    3      2 ENSG00000161958 FGF11       17p13.1             40.6
29 TARGET-30-PALEVG-01A-01W gain                    3      2 ENSG00000084764 MAPRE3      2p23.3              44.3
30 TARGET-30-PALEVG-01A-01W loss                    1      2 ENSG00000084764 MAPRE3      2p23.3              40.5
# … with 222 more rows

These genes/calls should be analyzed further to determine if any further resolutions can be reached. Additionally, 792 duplicate calls (43.5%) in controlfreec_annotated_cn_wxs_autosomes.tsv.gz and controlfreec_annotated_cn_wxs_x_and_y.tsv.gz files could not be resolved.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

NOTE: due to large size of some results files, they were not pushed as part of this PR. Please run analysis module to generate all results.

  • annotated cnv files in results directory
  • cnv plots in plots directory

What is your summary of the results?

Over 99% of duplicate calls in consensus_wgs_plus_cnvkit_wxs.tsv.gz were resolved, and 126 duplicate calls are still present in this results file. The column pct_overlap is now included and denotes the percent of the gene's exonic regions that are overlapping by the segment. 56.51% of duplicate calls in controlfreec_annotated_cn_wxs_autosomes.tsv.gz and controlfreec_annotated_cn_wxs_x_and_y.tsv.gz results file were also resolved, and these files also now include the pct_overlap column.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

Copy link

@afarrel afarrel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!
The code runs on EC2 with no issue. Merging and overlap calculations look fine. There is a small percentage of unresolved duplications, however for downstream analyses we can use a bootstrap method to resolve this when generating plots and reporting in tables.

@rjcorb
Copy link
Author

rjcorb commented Dec 20, 2022

The latest changes have reduced the number of duplicated calls from 126 to 99, with the majority (N=94) being in TARGET samples.

@jharenza jharenza merged commit 9ad3755 into dev Jan 7, 2023
@jharenza jharenza deleted the update-focal-cn branch February 19, 2023 02:11
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants