RNA-seq data visualization in the UCSC Genome Browser

The UC Santa Cruz Genome Browser provides a great way to visualize genomic data in context.

Note
ENSEMBL is a similar tool, but we’ll focus on the UCSC Browser in this tutorial.
Note
UCSC has a European mirror here

The UCSC Browser contains many publicly available datasets (ENCODE data, ESTs, RefSeq, GENCODE gene models, etc.) organized in tracks. These make up its "core" set of tracks, which can be turned on and off at will (see Track display below).

You can also add your own data to it, via Custom Tracks and Track Hubs.

In this part of the tutorial, we’ll learn how to use the UCSC Browser to visualize the data you’ve just generated with GRAPE (BAMs and bigWigs).

Basics

Tip
A lot of UCSC training resources are available here.

Accessing a locus / genome position

We’ll be looking at the top two differentially expressed genes in our dataset:

ENSEMBL id            Genome position (mm9)          Differential expression
ENSMUSG00000052187    chr7:111,000,259-111,001,754   E14 > E18
ENSMUSG00000032936    chr9:107,838,251-107,852,022   E14 < E18

You can input ENSEMBL ids, gene symbols or genome positions into the search box at the top of the page. Hit enter and you’ll be taken to the region of your gene of interest, which will be highlighted in the image.
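
A region can also be opened directly by URL rather than through the search box. The sketch below is only an illustration: it assumes the hgTracks `db` and `position` URL parameters, and the helper name `ucsc_position_url` is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a UCSC Genome Browser link for a region.
# Assumes the hgTracks "db" and "position" URL parameters.
ucsc_position_url() {
    local db=$1 position=$2
    # Drop the thousands separators so the coordinates are plain integers.
    position=${position//,/}
    printf 'http://genome.ucsc.edu/cgi-bin/hgTracks?db=%s&position=%s\n' "$db" "$position"
}

# The two genes from the table above:
ucsc_position_url mm9 'chr7:111,000,259-111,001,754'
ucsc_position_url mm9 'chr9:107,838,251-107,852,022'
```

Pasting either printed link into your Web browser should open the mm9 browser on that region.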

Track display

Each track can be displayed with different levels of detail. These are, in ascending order:

  • hide

  • dense

  • squish

  • pack

  • full

You can access these settings either through individual drop-down track menus below the genome image:

UCSC-drop-down1

or by right-clicking the corresponding area on the image:

UCSC-drop-down2

Some track types (bigWigs, BAMs) have much more detailed configuration options:

bigWig-config

Visualizing your own data

  • If you have not done so already, set up the custom shell environment:

    source ~ngs00/env/.ngsenv

Custom Tracks

The easiest way to load your data into the UCSC Browser is through a Custom Track.

First, we need to make this data accessible from the web, so that UCSC can download it. In your home directory you will find a public_docs/ folder, which is reachable through HTTP at this address: http://public-docs.crg.es/NGS/$USER (replace $USER with your ngsXX username, or type

echo http://public-docs.crg.es/NGS/$USER

in your terminal, and paste the output in your Web browser).

  • Make Custom Track directory (web-accessible through http://public-docs.crg.es/NGS/$USER/custom_tracks/)

    mkdir -p $customTrackDir
  • Copy GRAPE output files there (bigWigs + BAMs)

    awk '$5~/GenomeAlignment|^PlusRawSignal|^MinusRawSignal/{print $3}' "$grapeDb" | while read -r f; do
    # copy data files:
    rsync -av "$f" "$customTrackDir/"
    # copy BAM indices as well:
    [[ "$f" =~ bam$ ]] && rsync -av "$f.bai" "$customTrackDir/"
    done

Can you see the files in your Web Browser?
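
The same check can be done from the terminal. This is a minimal sketch, assuming `curl` is installed on your machine; the helper names `custom_track_url` and `http_status` are hypothetical, and the URL layout is the one given above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: public URL of a file in the custom_tracks directory.
custom_track_url() {
    local user=$1 file=$2
    printf 'http://public-docs.crg.es/NGS/%s/custom_tracks/%s\n' "$user" "$file"
}

# Hypothetical helper: print the HTTP status code returned for a URL.
# A HEAD request (-I) is enough, so the data file itself is not downloaded.
http_status() {
    curl -s -o /dev/null -I -w '%{http_code}\n' "$1"
}

# Usage (needs network access):
#   for f in "$customTrackDir"/*; do
#       url=$(custom_track_url "$USER" "$(basename "$f")")
#       echo "$(http_status "$url") $url"
#   done
```

A status of 200 for every file suggests that UCSC will be able to fetch them too.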

  • Open the Genome Browser

  • Make sure you’re using the correct genome assembly (mouse/mm9)

  • Click on "add custom tracks"

  • Go back to your terminal and convert local data file paths to global web URLs:

    cd "$customTrackDir"
    # print every data file (skipping BAM indices) as a public URL
    for file in *; do
    [[ "$file" == *.bai ]] && continue
    echo "http://public-docs.crg.es/NGS/$USER/custom_tracks/$file"
    done

    Copy the output

  • Switch to your Web Browser, paste the URLs into the "Paste URLs or data:" text box and click "Submit". Your data will then be fetched by UCSC servers.

  • Check out our two gene examples:

ENSEMBL id            Genome position (mm9)          Differential expression
ENSMUSG00000052187    chr7:111,000,259-111,001,754   E14 > E18
ENSMUSG00000032936    chr9:107,838,251-107,852,022   E14 < E18

Custom tracks are viewable only on the machine from which they were uploaded and are automatically discarded 48 hours after the last time they are accessed, unless they are saved in a Session (in which case UCSC will erase them after 4 months). For a permanent solution, use Track Hubs instead.

Another important limitation is that the track display options need to be configured individually, which is cumbersome if you have multiple datasets.
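
One way to soften this limitation is to paste custom track lines instead of bare URLs, so that each track arrives with some display settings already applied. The sketch below is an illustration only: the helper `track_line` is hypothetical, the chosen settings are arbitrary defaults, and it assumes bigWigs end in `.bw` and alignments in `.bam`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit one UCSC custom track line for a data file URL.
# The display settings are illustrative defaults, not values from GRAPE.
track_line() {
    local url=$1 name
    name=$(basename "$url")
    case "$url" in
        *.bw)
            name=${name%.bw}
            printf 'track type=bigWig name="%s" bigDataUrl=%s visibility=full autoScale=on alwaysZero=on\n' "$name" "$url"
            ;;
        *.bam)
            name=${name%.bam}
            printf 'track type=bam name="%s" bigDataUrl=%s visibility=dense\n' "$name" "$url"
            ;;
        *)
            echo "unsupported file type: $url" >&2
            return 1
            ;;
    esac
}

# Usage, from the Custom Track directory:
#   for file in *.bw *.bam; do
#       track_line "http://public-docs.crg.es/NGS/$USER/custom_tracks/$file"
#   done
```

The printed lines can be pasted into the "Paste URLs or data:" text box in place of the plain URLs.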

Track Hubs

Overview

Track Hubs are Custom Tracks on steroids:

                                Custom Tracks                                             Track Hubs
Configure tracks by groups      No                                                        Yes
Where is the data?              Uploaded to UCSC servers (except binary indexed files)    Stays on your server
Accepted file types             All the most common (BED, GTF, etc.)                      Only binary indexed (bigWig, bigBed, BAM+BAI)
How long will it live?          48h                                                       "Forever"
On exotic genome assemblies?    No                                                        Yes (Assembly hubs)

Although originally developed at UCSC, they are also supported by ENSEMBL.

Track Hubs are very powerful: they allow you to reach the same level of sophistication as some "core" ENCODE tracks such as this one:

UCSC-mouseEncode-longRNAtracks

They are relatively complex to set up, though.

Introduction to quickTrackHub

Here we will use the quickTrackHub framework to make this task easier.

  • The idea is to group similar tracks together, based on their associated metadata (represented in their file names). Let’s see what our grouping options are:

    grouping

    We can organize our tracks the following way:

    • One superTrack per file type:

      • BAM: ReadAligns

      • bigWig: ReadSignal

    • Split each superTrack into composite dimensions:

      • (tissue, lifeStage) (matrix’s X dimension)

      • replicate (matrix’s Y dimension)

      • strand (for bigWigs only)

  • quickTrackHub will:

    • Read a Track Hub Definition File (JSON) that contains:

      • Basic track settings (genome assembly, URL, name, visibility, etc.)

      • Track grouping instructions

      • Filename parsing instructions (i.e. how to extract metadata from filenames)

        trackHubDefinition.json example
        {
        	"longLabel" : "ENCODE GRAPE sample data track hub, user ngs00",
        	"track" : "crgGrapeSample-ngs00",
        	"trackHubAssociatedEmail" : "your.email@yourinstitution.org",
        	"webPublicDir" : "http://public-docs.crg.es/NGS/ngs00/track_hub",
        	"superTracks" : [
        		{
        			"track" : "ENCODE_GRAPE_sample",
        			"longLabel" : "ENCODE GRAPE sample superTrack",
        			"visibility": "dense"
        		},
        		{
        			"track" : "ReadAligns",
        			"parent" : "ENCODE_GRAPE_sample",
        			"longLabel" : "Read alignments (BAMs)",
        			"visibility" : "dense",
        			"type" : "bam",
        			"fileNameMatch" : {
        				"fileExtension" : "bam"
        			},
        			"compositeDimensions" : {
        				"x" : [
        					"lifeStage",
        					"tissue"
        				],
        				"y" : [
        					"replicate"
        				]
        			}
        		},
        		{
        			"track" : "ReadSignal",
        			"parent" : "ENCODE_GRAPE_sample",
        			"longLabel" : "Read signal (BigWigs)",
        			"visibility" : "dense",
        			"type" : "bigWig",
        			"autoScale" : "on",
        			"alwaysZero" : "on",
        			"maxHeightPixels" : "128:28:11",
        			"fileNameMatch" : {
        				"fileExtension" : "bw"
        			},
        			"compositeDimensions" : {
        				"x" : [
        					"lifeStage",
        					"tissue"
        				],
        				"y" : [
        					"replicate"
        				],
        				"a" : [
        					"strand"
        				]
        			}
        		}
        	],
        	"dataFilesList" : "/users/ngs00/public_docs/track_hub/dataFiles.list",
        	"dataFileNameParsingInstructions" :	{
        		"fieldSeparator" : "_",
        		"fields" : {
        			"genome" : 0,
        			"tissue" : 1,
        			"lifeStage" : 2,
        			"replicate" : 3,
        			"strand" : 5,
        			"fileExtension" : -1
        		}
        	}
        }
    • Output the corresponding Track Hub file and directory structure that will be parsed by UCSC.
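
To make the filename parsing instructions more concrete, here is a rough bash sketch of how the `fields` indices in the example above could map onto a file name. It is not quickTrackHub's own code. The file name `mm9_brain_E14_1_Unique_plusRaw.bw` is hypothetical, and taking the extension as the text after the last dot is an assumption about how `"fileExtension" : -1` is resolved.

```shell
#!/usr/bin/env bash
# Sketch of "dataFileNameParsingInstructions" (not quickTrackHub's own code).
parse_track_filename() {
    local name stem ext
    local -a parts
    name=$(basename "$1")
    ext=${name##*.}                      # assumption: extension = text after the last dot
    stem=${name%.*}                      # file name without its extension
    IFS=_ read -r -a parts <<< "$stem"   # split on the fieldSeparator "_"
    # Field indices follow the "fields" block: 0, 1, 2, 3 and 5.
    # BAM names may have no strand field, hence the NA fallback.
    printf 'genome=%s tissue=%s lifeStage=%s replicate=%s strand=%s fileExtension=%s\n' \
        "${parts[0]}" "${parts[1]}" "${parts[2]}" "${parts[3]}" "${parts[5]:-NA}" "$ext"
}

# Hypothetical file name, for illustration only:
parse_track_filename mm9_brain_E14_1_Unique_plusRaw.bw
# prints: genome=mm9 tissue=brain lifeStage=E14 replicate=1 strand=plusRaw fileExtension=bw
```

The extracted values are the metadata that the `compositeDimensions` entries (`lifeStage`, `tissue`, `replicate`, `strand`) refer to.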

quickTrackHub in practice

  • First, create a new public subdirectory for the Track Hub

    mkdir -p $trackHubDir
  • Copy the Custom Track data files there and rename them.

    Note
    GRAPE’s native output filenames are not (yet) quickTrackHub-compliant, which is why we need this extra renaming step.
    # loop over data files; BAM indices are handled together with their BAM
    find "$customTrackDir/" -type f ! -name '*.bai' | while read -r f; do
    # perform some string substitution magic to rename the files
    outFile=$(basename "$f")
    outFile=${outFile/mouse/mm9}
    outFile=${outFile//.Unique./_Unique_}
    # copy/rename data files:
    rsync -av "$f" "$trackHubDir/$outFile"
    # copy/rename BAM indices as well:
    [[ "$f" =~ bam$ ]] && rsync -av "$f.bai" "$trackHubDir/$outFile.bai"
    done
  • Download quickTrackHub from its github repository to your home directory:

    cd $HOME
    git clone https://github.com/julienlag/quickTrackHub.git
  • Make the script executable:

    chmod u+x $HOME/quickTrackHub/quickTrackHub.pl
  • Download the hubCheck utility from UCSC (somewhat useful for Track Hub debugging purposes), and place it into $HOME/bin/

    mkdir -p $HOME/bin/
    wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/hubCheck -O $HOME/bin/hubCheck
  • Make it executable

    chmod u+x $HOME/bin/hubCheck
  • Change to the public Track Hub directory

    cd $trackHubDir
  • Copy the template Track Hub Definition JSON file to your public Track Hub directory

    cp $HOME/quickTrackHub/trackHubDefinition.json .
  • Open and edit the JSON file:

    gedit trackHubDefinition.json &
    • Find and replace all instances of ngsXX in the file with your username.

    • Replace your.email@yourinstitution.org with your email address (Optional).

    • Save

  • Generate the list of files (BAMs + bigWigs) to include in the Track Hub:

    find . -type f | grep "\.bam\|\.bw" | grep -v "\.bai" > dataFiles.list
  • Make the Track Hub:

    $HOME/quickTrackHub/quickTrackHub.pl trackHubDefinition.json
  • Load the Track Hub in the UCSC Browser

    Your hub’s URL is output by the following command:

    echo http://public-docs.crg.es/NGS/$USER/track_hub/hub.txt

    There are two ways to load your Track Hub:

    • Load manually:

      • Click on the "track hub" button below the genome image in the UCSC Browser

      • Select the "My Hubs" tab

      • In the "URL" box, paste the URL of your hub (http://public-docs.crg.es/NGS/$USER/track_hub/hub.txt)

      • Click on "Add Hub"

      • You should be redirected to the mm9 Browser Gateway

    • Load directly through URL:

      Get the direct link via:

      echo "http://genome.ucsc.edu/cgi-bin/hgTracks?db=mm9&hubUrl=http://public-docs.crg.es/NGS/$USER/track_hub/hub.txt"

      And copy/paste the output in your browser.

      Tip
      Use this direct link to share your Track Hub with collaborators.

      The settings of your Track Hub are accessible here (below the genome image):

      trackHubsettings

  • Look at our two favorite differentially expressed genes:

    ENSEMBL id            Genome position (mm9)          Differential expression
    ENSMUSG00000052187    chr7:111,000,259-111,001,754   E14 > E18
    ENSMUSG00000032936    chr9:107,838,251-107,852,022   E14 < E18

  • Tune the track display parameters to better visualize the differential expression.
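
To jump straight to one of these genes with the Track Hub loaded, the direct link can be extended with a region. This is a sketch under assumptions: the helper names `hub_url` and `hub_region_link` are hypothetical, and it assumes hgTracks accepts a `position` parameter alongside the `db` and `hubUrl` parameters used in the direct link above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (illustrative names, not part of quickTrackHub or the UCSC tools).

# Public URL of a user's hub.txt, following the layout used in this tutorial.
hub_url() {
    local user=$1
    printf 'http://public-docs.crg.es/NGS/%s/track_hub/hub.txt\n' "$user"
}

# Direct browser link that loads the hub and opens a region.
# Assumes the hgTracks "position" URL parameter alongside "db" and "hubUrl".
hub_region_link() {
    local user=$1 position=${2//,/}   # strip thousands separators
    printf 'http://genome.ucsc.edu/cgi-bin/hgTracks?db=mm9&hubUrl=%s&position=%s\n' \
        "$(hub_url "$user")" "$position"
}

hub_region_link ngs00 'chr7:111,000,259-111,001,754'
hub_region_link ngs00 'chr9:107,838,251-107,852,022'

# If the hub does not load, the hubCheck binary downloaded earlier can be
# pointed at the hub URL (needs network access):
#   "$HOME/bin/hubCheck" "$(hub_url "$USER")"
```

Replace `ngs00` with your own username before sharing the printed links.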