Planned Analysis: Copy number plot showing recurrently amplified/deleted regions in different PBTA cancers #8

PichaiRaman · 2019-07-12T14:36:12Z

The idea would be to visualize trends of deletions / amplifications in each of the cancer types (where relevant). Pointing out highly amplified/deleted regions with oncogenes/tumor suppressors may be interesting.

cgreene · 2019-09-19T18:24:52Z

@jharenza : is this still blocked? It seems like something that could be done now and if the CNV calls get improved, then this would just get re-run.

jharenza · 2019-09-19T18:40:10Z

@jharenza : is this still blocked? It seems like something that could be done now and if the CNV calls get improved, then this would just get re-run.

I think I may have added it that way thinking consensus calls @gonzolgarcia creates would go into this, but perhaps consensus for CNV and SV would be separate issues?

cgreene · 2019-09-19T18:52:03Z

Ok!

cbethell · 2019-09-23T16:34:23Z

I have begun to tackle this issue. Thus far, I believe the package GenVisR would be useful for this issue based on the literature I found here. If there are any suggestions for the implementation of an alternative package, please feel free to express them here.

jaclyn-taroni · 2019-09-25T13:18:53Z

Related: #128

jharenza · 2019-09-25T13:19:35Z

Hi @cbethell ! Thanks for working on this! I also just created an issue #128 for creation of consensus calls for CN to improve our accuracy for downstream analytics like this issue. I imagine the output would still be a SEG file, such that you can start with any of the SEGs we provide. I previously have used GISTIC, but it does not work well for small cohorts (ie you need a large N to call recurrence within a caner). We may want to focus this issue on histologies with a specified >= N so that we have enough power to detect recurrent changes already known in the literature. For example, we have a large cohort of low-grade gliomas, we have a good number of medulloblastomas (could see how many of each subtype we have since we know each subtype has different CNAs - Figure 2), high-grade gliomas. I found this recent paper describing TAGCNA, which may work similar to GISTIC.

jaclyn-taroni · 2019-10-26T17:51:50Z

Summarizing my understanding of the status of this issue:

The mention of GISTIC and TAGCNA = I understand we're looking for significant amplifications or deletions and therefore some metric and test for significance.
As far as visualization is concerned, @jharenza has requested

having this plotted in GISTIC-type format for the different histologies or as an IGV-type figure, grouped by histology, would be most effective for showing recapitulation of previous findings. Addition of R script to compare CNV caller output #142 (comment)

Here's an example from Chekaluk et al.:

Once a consensus set of copy number calls has been generated (Proposed Analysis: Copy number consensus calls #128; output should still be a SEG file), we will want to examine results by histology as specified by the title of this issue and Addition of R script to compare CNV caller output #142 (comment). Medulloblastoma is a good starting point because we have prior knowledge about what CNAs to expect (Figure 2 linked above by @jharenza).
GISTIC files may be included in a later release (Addition of R script to compare CNV caller output #142 (comment)), but given the limitations of GISTIC, we would drop histologies with low numbers (Planned Analysis: Copy number plot showing recurrently amplified/deleted regions in different PBTA cancers #8 (comment)).

If GISTIC files are included in a later release, we could use maftools, which is already on the Docker container, to produce plots: https://bioconductor.org/packages/release/bioc/vignettes/maftools/inst/doc/maftools.html#8_processing_copy-number_data
We could also try using TAGCNA as an alternative to GISTIC. It is an R package on sourceforge rather than CRAN or Bioconductor that does not appear to be actively maintained and appears to be less frequently used compared to GISTIC.

Also I would note if anyone is going to continue development prior to #128, probably use CNVkit and not ControlFREEC (see caveat in README and the caveat in action in this plot).

jaclyn-taroni · 2019-10-28T12:36:13Z

GISTIC 2.0 is available as a GenePattern module. This requires a reference file:

Reference Genome File (-refgene) (REQUIRED)
The reference genome file contains information about the location of genes and cytobands on a given build of the genome. Reference genome files are created in MATLAB TM and are not viewable with a text editor. The GISTIC 2.0 release includes the following reference genomes: hg16.mat, hg17.mat, hg18.mat, and hg19.mat).

That description makes it seem like it might not be the most straightforward to obtain/derive?

Also requires:

Markers File (-mk) (REQUIRED)
The markers file identifies the marker names and positions of the markers in the original dataset (before segmentation). It is a three-column, tab-delimited file with an optional header. The column headers are:

Marker Name (marker name)
Chromosome (chromosome number)
Marker Position (in bases)

I think if we could run this in something like GenePattern Notebook so it's something public facing we can link to in a README, that could be good. Alternatively, something publicly facing on CAVATICA that can be linked to in the same way.

This requires a little bit more research before we know if it's feasible but wanted to report what I know so far.

jharenza · 2019-10-28T20:09:06Z

@jaclyn-taroni I may be able to run GISTIC tomorrow and provide in the data release, once I get the updated seg files.

cansavvy · 2019-12-13T16:49:19Z

I am working on getting this GISTIC plot set up in preparation for when the GISTIC data should be added to the release on v12.

cansavvy · 2019-12-13T16:56:51Z

Did we want this plot to be exactly like the example above? Here is what I have so far:

cansavvy · 2019-12-16T18:52:04Z

@jharenza , do you have thoughts about this rough draft GISTIC plot? What changes would you like to see?

### release-v12-20191217 - release date: 2019-12-17 - status: available - changes: - Add `data-file-descriptions.md` with data release to better track file types, origins, and workflows per [#334](#334) and [#336](#336) - Add stranded RNA-Seq for 23 PNOC samples and 21 CBTTC samples previously sequenced using a polyA library prep. Files updated: - pbta-fusion-arriba.tsv.gz - pbta-fusion-starfusion.tsv.gz - pbta-gene-expression-rsem-tpm.stranded.rds - pbta-gene-expression-rsem-fpkm.stranded.rds - pbta-isoform-expression-rsem-tpm.stranded.rds - pbta-isoform-counts-rsem-expected_count.stranded.rds - pbta-gene-counts-rsem-expected_count.stranded.rds - pbta-gene-expression-kallisto.stranded.rds - pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds - Add recurrently-fused genes by histology and matrix of recurrently-fused genes by biospecimen from [fusion filtering and prioritization analysis](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering) - Update consensus TMB files and MAF [#333]](#333) - Add RNA-Seq [collapsed matrices](#287) - wrong files (tables of transcripts removed) were included with [V10](#273) - Rename `WGS.hg38.mutect2.unpadded.bed` to `WGS.hg38.mutect2.vardict.unpadded.bed` to better reflect usage - Update `pbta-histologies.tsv` to add new RNA-Seq samples listed above, [#222 harmonize disease separators](#222), and reran [medulloblastoma classifier](https://github.com/d3b-center/medullo-classifier-package) using V12 RSEM fpkm collapsed files - BS_2Z1MKS84, BS_5VQP0E6K re-classified from Group4 to WNT and BS_3BDAG9YN, BS_8T7DZV2F, and BS_JTMXAMB7 re-classified from Group3 to WNT - Add CNVkit GISTIC results focal CN analyses, eg: [#244](#244) and [#8](#8)

* Release V12 data ### release-v12-20191217 - release date: 2019-12-17 - status: available - changes: - Add `data-file-descriptions.md` with data release to better track file types, origins, and workflows per [#334](#334) and [#336](#336) - Add stranded RNA-Seq for 23 PNOC samples and 21 CBTTC samples previously sequenced using a polyA library prep. Files updated: - pbta-fusion-arriba.tsv.gz - pbta-fusion-starfusion.tsv.gz - pbta-gene-expression-rsem-tpm.stranded.rds - pbta-gene-expression-rsem-fpkm.stranded.rds - pbta-isoform-expression-rsem-tpm.stranded.rds - pbta-isoform-counts-rsem-expected_count.stranded.rds - pbta-gene-counts-rsem-expected_count.stranded.rds - pbta-gene-expression-kallisto.stranded.rds - pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds - Add recurrently-fused genes by histology and matrix of recurrently-fused genes by biospecimen from [fusion filtering and prioritization analysis](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering) - Update consensus TMB files and MAF [#333]](#333) - Add RNA-Seq [collapsed matrices](#287) - wrong files (tables of transcripts removed) were included with [V10](#273) - Rename `WGS.hg38.mutect2.unpadded.bed` to `WGS.hg38.mutect2.vardict.unpadded.bed` to better reflect usage - Update `pbta-histologies.tsv` to add new RNA-Seq samples listed above, [#222 harmonize disease separators](#222), and reran [medulloblastoma classifier](https://github.com/d3b-center/medullo-classifier-package) using V12 RSEM fpkm collapsed files - BS_2Z1MKS84, BS_5VQP0E6K re-classified from Group4 to WNT and BS_3BDAG9YN, BS_8T7DZV2F, and BS_JTMXAMB7 re-classified from Group3 to WNT - Add CNVkit GISTIC results focal CN analyses, eg: [#244](#244) and [#8](#8) * Update release-notes.md fix link * Update data-files-description.md fix GISTIC table sectioning * Update data-files-description.md fix spacing on data description table * Update data-files-description.md fix more spacing in data file description file * Update download-data.sh add new release date to download script * Update the TMB file descriptions * Update TMB file formats section * Update fusion section of data formats Also more specific description of the by sample file * Add GISTIC file to data-formats * Update download-data.sh * Update download-data.sh * data description md is also included in md5sum * TMB exon -> coding sequence * Coding TMB CDS, not exon

cansavvy · 2019-12-19T19:19:28Z

Per @jharenza , I will update this graph to be split up by specific histologies and annotate the chromosomes more clearly, perhaps like this example:

cansavvy · 2019-12-20T17:49:55Z

@jharenza It doesn't look like the GISTIC results are split up by histology or have scores by samples, so I won't be able to do the individually plots by histology unless you run it separately. That being said, here's a re-done version of the GISTIC plot with the other fixes. I can functionalize this and apply it to histology groups if we get that data.

jharenza · 2019-12-20T18:09:36Z

@cansavvy yeah, that's right - I can't really do it by histology because there will be too low of an N for many histologies. For that reason, what I had done in the past is just subset the samples from the seg file per each histology and plot LRR as in the NBL figure above, pointing out key regions of amp/del, so there was no statistical test for those plots. I wonder if there is a way to use GISTIC all_data_by_genes.txt or the chr arm data matrices to do enrichments? Otherwise, maybe we can focus on just the histologies with a certain N samples and I can rerun.

The plot looks better! Is there a way to scale by chr size such that chr 1 is the largest and 22 the smallest? Are 23 and 24 X and Y?

cansavvy · 2019-12-20T18:12:23Z

Ah good point, I’ll try to make the chromosomes scaled and change 23 and 24 to X and Y. I’ll look into making LRR plots by histology.

cansavvy · 2019-12-27T16:10:37Z

@jharenza , do you have code associated with the graph above that you would be able to send me so I can see how the y axis data was calculated?

jharenza · 2019-12-27T18:36:52Z

Hi @cansavvy - I added it here but could not get the hg19.mat added and able to read into R - I tried to add and change permissions - even if I moved it on my computer to be outside of the GISTIC program, I was unable to the read it into R. Strange... so, unfortunately, it seems the user will have to unpack GISTIC to get that file. Sent you another message about that.

cansavvy · 2019-12-27T21:07:47Z

@jharenza , Per our discussion in Slack, it appears the difference between the code you linked above (which used ControlFreeC) and our data we are using here (CNVKit derived) is that we do not have Log R Ratios calculated. This being said, what stat would we like to use instead to report a CNV summary for each histology group? Median copy number (using copy.num from the CNVKit seg file) seems like the simplest approach so I will use it as a stand in for now.

I am not familiar with what is standard practice for reporting these data so if there is a preferred alternative to median copy number, let me know and I should be able to fairly easily switch out the calculation. I will look at some papers and see if I can figure out something better as well.

cansavvy · 2019-12-27T21:17:12Z

It appears that Log R Ratio/LRR that ControlFreeC gives is conceptually similar/equivalent to CNVKit's seg.mean column (See CNVKit's docs). We should probably add this revelation to the documentation and I will make an issue for that and for other CNV related data descriptions we may need to elaborate on.

That being said I'm going to move forward with using seg.mean data and take some sort of average across samples of the same histology group.

jaclyn-taroni · 2020-01-18T18:45:19Z

We have a consensus SEG file that includes the seg.mean as of #441: analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg

The cnv-chrom-plot module plots the mean seg.mean value for a histology. This plotting should be updated to use the consensus file. This is why I have added the updated analysis label.

Updating the GISTIC plots portion of that module depends on #453.

cansavvy · 2020-02-04T17:02:32Z

I believe this issue can be closed now?

jaclyn-taroni added the in progress Someone is working on this issue, but feel free to propose an alternative approach! label Sep 25, 2019

cbethell mentioned this issue Oct 2, 2019

Addition of R script to compare CNV caller output #142

Merged

6 tasks

jaclyn-taroni added the cnv Related to or requires CNV data label Oct 26, 2019

jharenza mentioned this issue Dec 17, 2019

Release V12 data #347

Merged

cansavvy mentioned this issue Dec 30, 2019

GISTIC Plots #377

Merged

3 tasks

jaclyn-taroni mentioned this issue Jan 2, 2020

Discussion: consensus CN files for downstream analyses #392

Closed

jaclyn-taroni assigned cansavvy Jan 18, 2020

jaclyn-taroni mentioned this issue Jan 18, 2020

Updated analysis: rerun GISTIC with consensus SEG file #453

Closed

jaclyn-taroni added the updated analysis label Jan 18, 2020

cansavvy mentioned this issue Feb 3, 2020

Update files used for input in cnv-chrom-plot #508

Merged

5 tasks

jaclyn-taroni closed this as completed Feb 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Planned Analysis: Copy number plot showing recurrently amplified/deleted regions in different PBTA cancers #8

Planned Analysis: Copy number plot showing recurrently amplified/deleted regions in different PBTA cancers #8

PichaiRaman commented Jul 12, 2019

cgreene commented Sep 19, 2019

jharenza commented Sep 19, 2019

cgreene commented Sep 19, 2019

cbethell commented Sep 23, 2019

jaclyn-taroni commented Sep 25, 2019

jharenza commented Sep 25, 2019 •

edited

Loading

jaclyn-taroni commented Oct 26, 2019

jaclyn-taroni commented Oct 28, 2019

jharenza commented Oct 28, 2019 •

edited

Loading

cansavvy commented Dec 13, 2019

cansavvy commented Dec 13, 2019

cansavvy commented Dec 16, 2019

cansavvy commented Dec 19, 2019

cansavvy commented Dec 20, 2019

jharenza commented Dec 20, 2019

cansavvy commented Dec 20, 2019

cansavvy commented Dec 27, 2019 •

edited

Loading

jharenza commented Dec 27, 2019

cansavvy commented Dec 27, 2019

cansavvy commented Dec 27, 2019

jaclyn-taroni commented Jan 18, 2020 •

edited

Loading

cansavvy commented Feb 4, 2020

Planned Analysis: Copy number plot showing recurrently amplified/deleted regions in different PBTA cancers #8

Planned Analysis: Copy number plot showing recurrently amplified/deleted regions in different PBTA cancers #8

Comments

PichaiRaman commented Jul 12, 2019

cgreene commented Sep 19, 2019

jharenza commented Sep 19, 2019

cgreene commented Sep 19, 2019

cbethell commented Sep 23, 2019

jaclyn-taroni commented Sep 25, 2019

jharenza commented Sep 25, 2019 • edited Loading

jaclyn-taroni commented Oct 26, 2019

jaclyn-taroni commented Oct 28, 2019

jharenza commented Oct 28, 2019 • edited Loading

cansavvy commented Dec 13, 2019

cansavvy commented Dec 13, 2019

cansavvy commented Dec 16, 2019

cansavvy commented Dec 19, 2019

cansavvy commented Dec 20, 2019

jharenza commented Dec 20, 2019

cansavvy commented Dec 20, 2019

cansavvy commented Dec 27, 2019 • edited Loading

jharenza commented Dec 27, 2019

cansavvy commented Dec 27, 2019

cansavvy commented Dec 27, 2019

jaclyn-taroni commented Jan 18, 2020 • edited Loading

cansavvy commented Feb 4, 2020

jharenza commented Sep 25, 2019 •

edited

Loading

jharenza commented Oct 28, 2019 •

edited

Loading

cansavvy commented Dec 27, 2019 •

edited

Loading

jaclyn-taroni commented Jan 18, 2020 •

edited

Loading