Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Planned Analysis: Copy number plot showing recurrently amplified/deleted regions in different PBTA cancers #8

Closed
PichaiRaman opened this issue Jul 12, 2019 · 22 comments
Assignees
Labels
cnv Related to or requires CNV data in progress Someone is working on this issue, but feel free to propose an alternative approach! updated analysis

Comments

@PichaiRaman
Copy link
Contributor

The idea would be to visualize trends of deletions / amplifications in each of the cancer types (where relevant). Pointing out highly amplified/deleted regions with oncogenes/tumor suppressors may be interesting.

@cgreene
Copy link
Collaborator

cgreene commented Sep 19, 2019

@jharenza : is this still blocked? It seems like something that could be done now and if the CNV calls get improved, then this would just get re-run.

@jharenza
Copy link
Collaborator

@jharenza : is this still blocked? It seems like something that could be done now and if the CNV calls get improved, then this would just get re-run.

I think I may have added it that way thinking consensus calls @gonzolgarcia creates would go into this, but perhaps consensus for CNV and SV would be separate issues?

@cgreene
Copy link
Collaborator

cgreene commented Sep 19, 2019

Ok!

@cbethell
Copy link
Contributor

I have begun to tackle this issue. Thus far, I believe the package GenVisR would be useful for this issue based on the literature I found here. If there are any suggestions for the implementation of an alternative package, please feel free to express them here.

@jaclyn-taroni jaclyn-taroni added the in progress Someone is working on this issue, but feel free to propose an alternative approach! label Sep 25, 2019
@jaclyn-taroni
Copy link
Member

Related: #128

@jharenza
Copy link
Collaborator

jharenza commented Sep 25, 2019

Hi @cbethell ! Thanks for working on this! I also just created an issue #128 for creation of consensus calls for CN to improve our accuracy for downstream analytics like this issue. I imagine the output would still be a SEG file, such that you can start with any of the SEGs we provide. I previously have used GISTIC, but it does not work well for small cohorts (ie you need a large N to call recurrence within a caner). We may want to focus this issue on histologies with a specified >= N so that we have enough power to detect recurrent changes already known in the literature. For example, we have a large cohort of low-grade gliomas, we have a good number of medulloblastomas (could see how many of each subtype we have since we know each subtype has different CNAs - Figure 2), high-grade gliomas. I found this recent paper describing TAGCNA, which may work similar to GISTIC.

@jaclyn-taroni jaclyn-taroni added the cnv Related to or requires CNV data label Oct 26, 2019
@jaclyn-taroni
Copy link
Member

Summarizing my understanding of the status of this issue:

  • The mention of GISTIC and TAGCNA = I understand we're looking for significant amplifications or deletions and therefore some metric and test for significance.

  • As far as visualization is concerned, @jharenza has requested

    having this plotted in GISTIC-type format for the different histologies or as an IGV-type figure, grouped by histology, would be most effective for showing recapitulation of previous findings. Addition of R script to compare CNV caller output #142 (comment)

    Here's an example from Chekaluk et al.:


Also I would note if anyone is going to continue development prior to #128, probably use CNVkit and not ControlFREEC (see caveat in README and the caveat in action in this plot).

@jaclyn-taroni
Copy link
Member

GISTIC 2.0 is available as a GenePattern module. This requires a reference file:

Reference Genome File (-refgene) (REQUIRED)
The reference genome file contains information about the location of genes and cytobands on a given build of the genome. Reference genome files are created in MATLAB TM and are not viewable with a text editor. The GISTIC 2.0 release includes the following reference genomes: hg16.mat, hg17.mat, hg18.mat, and hg19.mat).

That description makes it seem like it might not be the most straightforward to obtain/derive?

Also requires:

Markers File (-mk) (REQUIRED)
The markers file identifies the marker names and positions of the markers in the original dataset (before segmentation). It is a three-column, tab-delimited file with an optional header. The column headers are:

Marker Name (marker name)
Chromosome (chromosome number)
Marker Position (in bases)

I think if we could run this in something like GenePattern Notebook so it's something public facing we can link to in a README, that could be good. Alternatively, something publicly facing on CAVATICA that can be linked to in the same way.

This requires a little bit more research before we know if it's feasible but wanted to report what I know so far.

@jharenza
Copy link
Collaborator

jharenza commented Oct 28, 2019

@jaclyn-taroni I may be able to run GISTIC tomorrow and provide in the data release, once I get the updated seg files.

@cansavvy
Copy link
Collaborator

I am working on getting this GISTIC plot set up in preparation for when the GISTIC data should be added to the release on v12.

@cansavvy
Copy link
Collaborator

Did we want this plot to be exactly like the example above? Here is what I have so far:
Screen Shot 2019-12-13 at 11 55 48 AM

@cansavvy
Copy link
Collaborator

@jharenza , do you have thoughts about this rough draft GISTIC plot? What changes would you like to see?

jharenza pushed a commit that referenced this issue Dec 17, 2019
### release-v12-20191217
- release date: 2019-12-17
- status: available
- changes:
  - Add `data-file-descriptions.md` with data release to better track file types, origins, and workflows per [#334](#334) and [#336](#336)
  - Add stranded RNA-Seq for 23 PNOC samples and 21 CBTTC samples previously sequenced using a polyA library prep. Files updated:
    - pbta-fusion-arriba.tsv.gz
    - pbta-fusion-starfusion.tsv.gz
    - pbta-gene-expression-rsem-tpm.stranded.rds
    - pbta-gene-expression-rsem-fpkm.stranded.rds
    - pbta-isoform-expression-rsem-tpm.stranded.rds
    - pbta-isoform-counts-rsem-expected_count.stranded.rds
    - pbta-gene-counts-rsem-expected_count.stranded.rds
    - pbta-gene-expression-kallisto.stranded.rds
    - pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds
  - Add recurrently-fused genes by histology and matrix of recurrently-fused genes by biospecimen from [fusion filtering and prioritization analysis](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering)
  - Update consensus TMB files and MAF [#333]](#333)
  - Add RNA-Seq [collapsed matrices](#287) - wrong files (tables of transcripts removed) were included with [V10](#273)
  - Rename `WGS.hg38.mutect2.unpadded.bed` to `WGS.hg38.mutect2.vardict.unpadded.bed` to better reflect usage
  - Update `pbta-histologies.tsv` to add new RNA-Seq samples listed above, [#222 harmonize disease separators](#222), and reran [medulloblastoma classifier](https://github.com/d3b-center/medullo-classifier-package) using V12 RSEM fpkm collapsed files
    - BS_2Z1MKS84, BS_5VQP0E6K re-classified from Group4 to WNT and BS_3BDAG9YN, BS_8T7DZV2F, and BS_JTMXAMB7 re-classified from Group3 to WNT
  - Add CNVkit GISTIC results focal CN analyses, eg: [#244](#244) and [#8](#8)
jaclyn-taroni pushed a commit that referenced this issue Dec 19, 2019
* Release V12 data

### release-v12-20191217
- release date: 2019-12-17
- status: available
- changes:
  - Add `data-file-descriptions.md` with data release to better track file types, origins, and workflows per [#334](#334) and [#336](#336)
  - Add stranded RNA-Seq for 23 PNOC samples and 21 CBTTC samples previously sequenced using a polyA library prep. Files updated:
    - pbta-fusion-arriba.tsv.gz
    - pbta-fusion-starfusion.tsv.gz
    - pbta-gene-expression-rsem-tpm.stranded.rds
    - pbta-gene-expression-rsem-fpkm.stranded.rds
    - pbta-isoform-expression-rsem-tpm.stranded.rds
    - pbta-isoform-counts-rsem-expected_count.stranded.rds
    - pbta-gene-counts-rsem-expected_count.stranded.rds
    - pbta-gene-expression-kallisto.stranded.rds
    - pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds
  - Add recurrently-fused genes by histology and matrix of recurrently-fused genes by biospecimen from [fusion filtering and prioritization analysis](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering)
  - Update consensus TMB files and MAF [#333]](#333)
  - Add RNA-Seq [collapsed matrices](#287) - wrong files (tables of transcripts removed) were included with [V10](#273)
  - Rename `WGS.hg38.mutect2.unpadded.bed` to `WGS.hg38.mutect2.vardict.unpadded.bed` to better reflect usage
  - Update `pbta-histologies.tsv` to add new RNA-Seq samples listed above, [#222 harmonize disease separators](#222), and reran [medulloblastoma classifier](https://github.com/d3b-center/medullo-classifier-package) using V12 RSEM fpkm collapsed files
    - BS_2Z1MKS84, BS_5VQP0E6K re-classified from Group4 to WNT and BS_3BDAG9YN, BS_8T7DZV2F, and BS_JTMXAMB7 re-classified from Group3 to WNT
  - Add CNVkit GISTIC results focal CN analyses, eg: [#244](#244) and [#8](#8)

* Update release-notes.md

fix link

* Update data-files-description.md

fix GISTIC table sectioning

* Update data-files-description.md

fix spacing on data description table

* Update data-files-description.md

fix more spacing in data file description file

* Update download-data.sh

add new release date to download script

* Update the TMB file descriptions

* Update TMB file formats section

* Update fusion section of data formats

Also more specific description of the by sample file

* Add GISTIC file to data-formats

* Update download-data.sh

* Update download-data.sh

* data description md is also included in md5sum

* TMB exon -> coding sequence

* Coding TMB CDS, not exon
@cansavvy
Copy link
Collaborator

Per @jharenza , I will update this graph to be split up by specific histologies and annotate the chromosomes more clearly, perhaps like this example:
Screen Shot 2019-12-19 at 2 17 15 PM

@cansavvy
Copy link
Collaborator

@jharenza It doesn't look like the GISTIC results are split up by histology or have scores by samples, so I won't be able to do the individually plots by histology unless you run it separately. That being said, here's a re-done version of the GISTIC plot with the other fixes. I can functionalize this and apply it to histology groups if we get that data.
gistic_plot

@jharenza
Copy link
Collaborator

@cansavvy yeah, that's right - I can't really do it by histology because there will be too low of an N for many histologies. For that reason, what I had done in the past is just subset the samples from the seg file per each histology and plot LRR as in the NBL figure above, pointing out key regions of amp/del, so there was no statistical test for those plots. I wonder if there is a way to use GISTIC all_data_by_genes.txt or the chr arm data matrices to do enrichments? Otherwise, maybe we can focus on just the histologies with a certain N samples and I can rerun.

The plot looks better! Is there a way to scale by chr size such that chr 1 is the largest and 22 the smallest? Are 23 and 24 X and Y?

@cansavvy
Copy link
Collaborator

Ah good point, I’ll try to make the chromosomes scaled and change 23 and 24 to X and Y. I’ll look into making LRR plots by histology.

@cansavvy
Copy link
Collaborator

cansavvy commented Dec 27, 2019

@jharenza , do you have code associated with the graph above that you would be able to send me so I can see how the y axis data was calculated?

@jharenza
Copy link
Collaborator

Hi @cansavvy - I added it here but could not get the hg19.mat added and able to read into R - I tried to add and change permissions - even if I moved it on my computer to be outside of the GISTIC program, I was unable to the read it into R. Strange... so, unfortunately, it seems the user will have to unpack GISTIC to get that file. Sent you another message about that.

@cansavvy
Copy link
Collaborator

@jharenza , Per our discussion in Slack, it appears the difference between the code you linked above (which used ControlFreeC) and our data we are using here (CNVKit derived) is that we do not have Log R Ratios calculated. This being said, what stat would we like to use instead to report a CNV summary for each histology group? Median copy number (using copy.num from the CNVKit seg file) seems like the simplest approach so I will use it as a stand in for now.

I am not familiar with what is standard practice for reporting these data so if there is a preferred alternative to median copy number, let me know and I should be able to fairly easily switch out the calculation. I will look at some papers and see if I can figure out something better as well.

@cansavvy
Copy link
Collaborator

It appears that Log R Ratio/LRR that ControlFreeC gives is conceptually similar/equivalent to CNVKit's seg.mean column (See CNVKit's docs). We should probably add this revelation to the documentation and I will make an issue for that and for other CNV related data descriptions we may need to elaborate on.

That being said I'm going to move forward with using seg.mean data and take some sort of average across samples of the same histology group.

@jaclyn-taroni
Copy link
Member

jaclyn-taroni commented Jan 18, 2020

We have a consensus SEG file that includes the seg.mean as of #441: analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg

The cnv-chrom-plot module plots the mean seg.mean value for a histology. This plotting should be updated to use the consensus file. This is why I have added the updated analysis label.

Updating the GISTIC plots portion of that module depends on #453.

@cansavvy
Copy link
Collaborator

cansavvy commented Feb 4, 2020

I believe this issue can be closed now?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cnv Related to or requires CNV data in progress Someone is working on this issue, but feel free to propose an alternative approach! updated analysis
Projects
None yet
Development

No branches or pull requests

6 participants