Katharina Hayer edited this page Sep 18, 2013 · 5 revisions

Output files

RUM creates a number of output files, which are described below. To understand these files, note that there are three kinds of reads:

  • Those that align to a unique location (unique mappers)
  • Those that align to multiple locations (non-unique mappers)
  • Those that don't align anywhere (non-mappers)

feature_quantifications_NAME

This file gives the counts and normalized counts for all features (transcripts, exons, introns). A 'min' and a 'max' value are given: the 'min' value is based on the unique mappers only, while the 'max' value is based on all mappers. The normalized value is the number of fragments mapping to the feature, divided by the length of the feature and by the number of reads that aligned, then multiplied by 10^9 (the so-called 'FPKM' value). As long as differential expression is reasonably well balanced between two samples, it should be meaningful to compare FPKM values across them, even if the samples have different numbers of reads. Normalized intensities of different features in the same sample are also comparable, even if the features have different lengths.
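
As a rough illustration of this normalization, the sketch below computes FPKM in the conventional way: fragments × 10^9 / (feature length × aligned reads). The function name `fpkm` and the example numbers are made up for illustration and are not taken from RUM itself.

```python
def fpkm(fragments: int, feature_length: int, aligned_reads: int) -> float:
    """Fragments per kilobase of feature per million aligned reads."""
    if feature_length <= 0 or aligned_reads <= 0:
        raise ValueError("feature_length and aligned_reads must be positive")
    # Dividing by length (in bases) and by read count, then scaling by 10^9,
    # is the same as dividing by kilobases and by millions of reads.
    return fragments * 1e9 / (feature_length * aligned_reads)


# Hypothetical feature: 500 fragments on 2,000 bases, 10 million aligned reads.
print(fpkm(500, 2000, 10_000_000))  # 25.0
```

Calling the function once with the unique-mapper count and once with the all-mapper count would give values analogous to the 'min' and 'max' columns.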

RUM_Unique.cov, RUM_NU.cov

These are coverage files; strictly speaking, they are bedGraph files. They give the depth of coverage at every location. The coordinates are zero-based at the start and one-based at the end, so the files can be uploaded directly to the UCSC Genome Browser. UCSC accepts compressed files (zip and gzip), so you should probably compress files before uploading, as they will upload faster. Files may be too big to upload even when compressed; in that case, use the "BigWig" format so that the file stays on your server and the browser downloads only what it needs in real time.
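
To make the coordinate convention concrete, here is a minimal sketch that summarizes a coverage file. It assumes the standard four-column bedGraph layout (chromosome, start, end, depth) and skips any `track` or `browser` header lines; the function name `coverage_summary` and the sample lines are hypothetical.

```python
def coverage_summary(lines):
    """Return (covered_bases, mean_depth) for an iterable of bedGraph lines."""
    covered = 0
    depth_sum = 0.0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue  # header or comment line, not a data row
        chrom, start, end, depth = line.split()[:4]
        # Zero-based start and one-based end: the span width is end - start.
        width = int(end) - int(start)
        covered += width
        depth_sum += width * float(depth)
    mean_depth = depth_sum / covered if covered else 0.0
    return covered, mean_depth


example = [
    "track type=bedGraph",
    "chr1\t0\t10\t2",   # bases 1..10 (one-based) at depth 2
    "chr1\t10\t15\t4",  # bases 11..15 (one-based) at depth 4
]
covered, mean_depth = coverage_summary(example)
print(covered)  # 15
```

For a real file, the same function can be fed an open file handle, for example `coverage_summary(open("RUM_Unique.cov"))`.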

RUM_Unique, RUM_NU

The unique and non-unique mappers, respectively. Each of these two files gives one alignment per line. Forward and reverse reads are merged into one line if their alignments overlap, and the ID is given as a plain integer. Otherwise they are given on separate lines, with the forward read ID marked with an 'a' and the reverse read ID with a 'b'. The forward read always comes first, even if it maps downstream of the reverse read. Each line has five fields:

  1. The sequence number
  2. Chromosome
  3. Spans of the alignment in genome coordinates
  4. Strand ("+" or "-")
  5. The sequence of the alignment
    • All sequence is plus strand
    • Sequence has a colon ":" where there is a junction
    • Sequence has a +XXX+ if XXX is an insertion, e.g. +AG+ means AG inserted in the sequenced genome w.r.t. the reference
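
The sequence conventions in the fifth field can be unpacked with a few lines of Python. This is a minimal sketch, not part of RUM: the function name `split_alignment_sequence` and the example sequence are made up for illustration.

```python
import re

# An insertion is written as +XXX+, e.g. +AG+.
INSERTION = re.compile(r"\+([A-Za-z]+)\+")


def split_alignment_sequence(sequence):
    """Return (segments, insertions) for a RUM alignment sequence.

    segments:   the pieces between junctions (":"), with insertions removed
    insertions: the inserted bases, in the order they appear
    """
    insertions = INSERTION.findall(sequence)
    without_insertions = INSERTION.sub("", sequence)
    segments = without_insertions.split(":")
    return segments, insertions


# Hypothetical read with one insertion (AG) and one junction.
print(split_alignment_sequence("ACGT+AG+TT:GGCA"))
# (['ACGTTT', 'GGCA'], ['AG'])
```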

RUM.sam

All alignments in SAM format. This one file has all the information contained in RUM_Unique and RUM_NU and the original reads files (including quality scores if those are provided). The following tags are used:

  • IH:i:N means the read aligns to N locations
  • HI:i:N is the N-th alignment for this read
  • XO:A:F means the forward and reverse reads do not overlap
  • XO:A:T means the forward and reverse reads do overlap

In the tags above, only N varies. The entry between the colons is the value type: "i" means integer, and "A" means a single printable character, in this case "T" for true or "F" for false.
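
The TAG:TYPE:VALUE layout can be read programmatically. The following is a small sketch that relies only on the SAM convention that optional fields follow the 11 mandatory tab-separated columns; the function name `sam_tags` and the example record (including its read name) are hypothetical.

```python
def sam_tags(sam_line):
    """Return the optional tags of one SAM alignment line as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # columns 1-11 are mandatory SAM fields
        tag, type_code, value = field.split(":", 2)
        # "i" values are integers; other types are kept as strings.
        tags[tag] = int(value) if type_code == "i" else value
    return tags


# Made-up record: second of two locations, mates do not overlap.
record = "seq.1a\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\tIH:i:2\tHI:i:2\tXO:A:F"
print(sam_tags(record))  # {'IH': 2, 'HI': 2, 'XO': 'F'}
```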

The full SAM specification can be found at http://samtools.sourceforge.net/SAM1.pdf

From the specification, it seems that the only reliable way to tell whether a read is mapped or unmapped is to inspect the bitwise FLAG field. Specifically, if the 0x4 bit (segment unmapped) is set, the read is unmapped. With samtools:

# The following excludes unmapped reads
samtools view -S -F 0x4 RUM.sam
# The following prints out records for only the unmapped reads
samtools view -S -f 0x4 RUM.sam
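
If samtools is not at hand, the same FLAG test can be sketched in plain Python. This assumes only that FLAG is the second tab-separated column and that header lines start with "@", as the SAM specification states; the function name `split_by_mapping` and the sample records are made up.

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def split_by_mapping(sam_lines):
    """Split SAM alignment lines into (mapped, unmapped) lists."""
    mapped, unmapped = [], []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, carries no FLAG
        flag = int(line.split("\t")[1])
        if flag & UNMAPPED:
            unmapped.append(line)
        else:
            mapped.append(line)
    return mapped, unmapped


example = [
    "@HD\tVN:1.0",
    "seq.1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "seq.2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
]
mapped, unmapped = split_by_mapping(example)
print(len(mapped), len(unmapped))  # 1 1
```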

junctions_all.bed, junctions_all.rum, junctions_high-quality.bed

These files give information on junctions. The bed files can be uploaded directly to the UCSC browser.

The 'high-quality' bed file contains junctions that have known splice signals and are crossed by at least one uniquely mapping read with at least 8 bases on each side of the junction. Known junctions (those in your transcript database) are colored blue; the others are colored green. Junctions with non-canonical splice signals are colored slightly lighter. The score is the number of uniquely mapping reads crossing the junction with at least 8 bases on each side.

The 'all' bed file contains every junction that was found. The score is the number of reads (uniquely mapping or not, with any number of bases on either side) that cross the junction. In this file, high-quality junctions are colored blue and all others are colored red.

The '.rum' file has expanded information on each junction in spreadsheet (tab-delimited) format, with the following columns:

  • intron: exact coordinates of the gap
  • strand: the strand of the intron. If the data are strand specific, then the strand of the intron is deduced from the direction of the read. If the data are not strand specific, then it is deduced from the splice signal. If, however, the splice signal is ambiguous (e.g. AT-AT), then this field is given a period "."
  • known: This is one if the intron is in the transcript annotation file, zero otherwise
  • standard_splice_signal: This is one if the splice signal is among the known splice signals: GTAG, GCAG, GCTG, GCAA, GCCG, GTTG, GTAA, ATAC, ATAG, ATAT.
  • signal_not_canonical: this is zero if the splice signal is GTAG, it is one otherwise.
  • ambiguous: This equals one if some number of initial bases of the intron could just as easily be the terminal bases of the exon. BLAT picks one; RUM includes the other(s) and marks them all as ambiguous.
  • long_overlap_unique_reads: number of reads that align across the intron and align uniquely and with at least 8 bases on each side.
  • short_overlap_unique_reads: number of reads that align across the intron and align uniquely but do not have at least 8 bases on each side
  • long_overlap_nu_reads: number of reads that align across the intron and do not align uniquely but have at least 8 bases on each side.
  • short_overlap_nu_reads: number of reads that align across the intron and do not align uniquely and do not have at least 8 bases on each side.
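
The columns above are enough to re-derive the 'high quality' criterion (a known splice signal plus at least one long-overlap unique read). The sketch below assumes that the '.rum' file starts with a header row using exactly these column names, which should be checked against a real file; the function name `high_quality_introns` and the sample rows are hypothetical.

```python
import csv
import io


def high_quality_introns(handle):
    """Return the intron field of every junction meeting the high-quality rule."""
    reader = csv.DictReader(handle, delimiter="\t")  # assumes a header row
    return [
        row["intron"]
        for row in reader
        if int(row["standard_splice_signal"]) == 1
        and int(row["long_overlap_unique_reads"]) >= 1
    ]


# Made-up excerpt with only the columns this sketch needs.
sample = io.StringIO(
    "intron\tstandard_splice_signal\tlong_overlap_unique_reads\n"
    "chr1:100-200\t1\t3\n"
    "chr1:300-400\t0\t5\n"
    "chr2:500-900\t1\t0\n"
)
print(high_quality_introns(sample))  # ['chr1:100-200']
```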

inferred_internal_exons.bed

novel_inferred_internal_exons_quantifications_NAME

These files give information on novel (unannotated) exons that are not in your gene model file but were inferred to exist from the data. The bed file is formatted for the UCSC genome browser.

mapping_stats.txt

Gives a breakdown of what percentage of the reads mapped and how many reads mapped to each chromosome.

Log files

RUM produces several different types of log files: debug-level and error-level logs for the master job, debug-level and error-level logs for each chunk (if you ran the job with --chunks), and the stdout output for each chunk.

log/rum_errors.log

Records any errors the pipeline has thrown. Always check this file for every run. It is updated as the run proceeds, so you should keep an eye on it while the job is running. Ideally it should be completely empty. If it is not, that may indicate a serious problem, or a minor one that did not affect the actual results. If your job fails, this is the first place to look.

log/rum_errors_CHUNK.log

Errors from the individual chunks are logged here (if you ran the job with --chunks). If the main rum_errors.log file indicates that one or more chunks had errors, it would be a good idea to take a look at one or more of these log files. Again, if everything goes perfectly, these files should all be empty.
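
Checking that every error log is empty is easy to automate. The sketch below lists the error logs in a log directory that contain anything at all. It assumes the per-chunk files match the pattern `rum_errors*.log` (for example `rum_errors_1.log`), which is a guess at how CHUNK is substituted; the function name `nonempty_error_logs` is made up.

```python
from pathlib import Path


def nonempty_error_logs(log_dir):
    """Return the names of error logs in log_dir that are not empty."""
    return sorted(
        path.name
        for path in Path(log_dir).glob("rum_errors*.log")
        if path.stat().st_size > 0  # an empty file means no errors were logged
    )
```

Calling `nonempty_error_logs("log")` from the output directory returns an empty list when nothing was logged; any names it does return are the files to read first.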

log/rum.log

Contains detailed messages about the progress of the job. It's probably not worth looking at the file unless something went wrong and you're trying to track down an issue.

log/rum_CHUNK.log

Contains details about an individual chunk.

Next (optional): Running a job on a cluster