MUSIC is an algorithm for identification of enriched regions at multiple scales in the read depth signals from ChIP-Seq experiments.
It takes as input:
- Smoothing scale window lengths (in base pairs),
and outputs:
- Significantly enriched regions from all the scales.
Unlike other ER identification methods, MUSIC allows analyzing the scale length spectrum of ChIP-Seq datasets and also selecting a user specific slice in the scale length spectrum with custom granularity and generates the enriched regions at each length scale.
MUSIC does not strictly require a control experiment (for example mock IP, input DNA) to be performed. It is, however, strongly advised to generate control datasets matching the sample of interest (See Landt et al 2012).
You can download MUSIC C++ code here. There are no dependencies for building MUSIC. After download, type:cd MUSIC
make clean
make
to build MUSIC. The executable is located under directory bin/. It may be useful to install samtools for processing BAM files.
To get help on which options are available, use:
samtools view chip.bam | MUSIC -preprocess SAM stdin chip/
samtools view input.bam | MUSIC -preprocess SAM stdin input/
If there are multiple replicates to be pooled, they can be done at once or separately. If done separately, MUSIC pools the reads automatically. Then it is necessary to sort and remove duplicate reads in control and ChIP samples:
MUSIC -sort_reads chip chip/sorted
MUSIC -sort_reads input input/sorted
MUSIC -remove_duplicates chip/sorted 2 chip/dedup
MUSIC -remove_duplicates input/sorted 2 input/dedup
We do enriched region identification:
This code tells MUSIC to identify the enriched regions starting from 1kb smoothing window length upto 16kb with multiplicative factor of 1.5 using the default parameters for the remaining parameters. The ERs for each scale are dumped.
In case there is no control, skip the option that specifies preprocessed control reads directory (i.e., '-control input/dedup')
There are 3 different ER identification modes by default. Identifies point binding events. This uses very small scale level to identify the small transcription factor binding peaks. Use this mode for TF's like CTCF. MUSIC aims at trimming the reported regions and identifying the peaks with most dense signal in it. Identifies punctate enriched regions. Uses 100 base pairs to 2 kbps scale levels. This mode is useful for punctate histone marks like H3K4me3, H3K27ac, and marks that behave in a mixed manner with a dominating spectrum at punctate scales like H3K4me1. Identifies the ERs at broad scales that can span megabases. Use this option for marks like H3K9me3, H3K27me3, H3K36me3, H3K79me2, H4K20me1... Make sure you select p-value normalization parameter to balance power and false positive rate.
MUSIC can save the smoothed tracks in bedGraph format that can be viewed locally:
The smoothed bedGraph files are usually very small in size and can easily be stored/transferred. MUSIC output a large number of files that contain the SSERs at different scales (named SSER_....bed).
The final set of ERs are reported in two files: One is broadPeak formatted (http://genome.ucsc.edu/FAQ/FAQformat.html#format13)
Other file is in an extended BED format and has 9 columns:
The entries are sorted with respect to increasing Q-values.
Note that MUSIC reports log_10(Q-values) and not -log_10(Q-value).
If you are vieweing the BED file with IGV, it sometimes does not plot the ERs correctly. If you wish to look at the bed file in IGV, use only the first 6 columns in the file:There are two steps to parameter selection:
This option estimates the false positive and negative rates using a large selection of p-value window lengths. The output is a file where each row look like this:
...
l_win: 1700 FNR: (FC:0.001) (p-val:0.001) FPR: (FC:0.010) (p-val:0.005) Sentitivity: 0.999
...
This option evaluates several windows lengths and estimates the false positive rate and false negative rates. We recommends using the maximum window length (l_win) where false positive rate (FPR) for FC and p-val values are below 1%.
Note that you can skip going through the file manually if you specify p-value normalization window length as 0 ('-l_p 0') in the peak calling step; which tells MUSIC to select p-value normalization window length from the above file automatically. For this to work, make sure that you do not delete any files after running -get_per_win_p_vals_FC option, otherwise MUSIC will complain that it cannot find the file. See below for complete automation (of p-value normalization window length selection) with default parameters.
MUSIC generates the spectrum (using the scale lengths 100 base pairs to 1 megabase. The output is reported in a text file where each row corresponds to a scale length. In each row, the coverage of the SSERs is given. For example:
...
27 56815.13 111 11675769
...
where 2nd column is the scale length and 4th column is the coverage of the ERs that are specific to that scale. It is best to plot the spectrum, i.e., the scale lengths versus the fraction of coverage of the SSERs (4th column in the file), then match the spectrum with the studied HMs in the manuscript. If the spectrum is very different from all of the parametrized ChIP-Seq datasets, it is useful to generate the statistics on ER length distribution and distribution of ER-ER distances and use the procedure described in the manuscript to select the scale levels.
The other parameters (namely, gamma and sigma) do not depend on experimental variables and are optimized for minimizing overmerging and maximizing sensitivity. We suggest using the values specified in the manuscript, i.e., gamma=4, sigma=1.5.
After this analysis, if the scale spectrum turns out to be punctate, i.e., the dominating scale is smaller than 10kb; there is a possibility that the p-value normalization window length parameter (l_p) yields low sensitivity. To compensate for this, we recommend running with default l_p parameter of MUSIC. We just added a new script run_MUSIC.csh. This script automates the parameter selection for -l_p option. This script automates running MUSIC with default parameters. It simply calls MUSIC with above parameters in order. Here is how this script can be used:
mappability_map="mappability/36bp";
input_processed_dir="input/pruned";
mkdir preprocessed sorted pruned
run_MUSIC.csh -preprocess ${read_fp} preprocessed
run_MUSIC.csh -remove_duplicates preprocessed sorted pruned
run_MUSIC.csh -get_optimal_broad_ERs pruned ${input_processed_dir} ${mappability_map}
Make sure that run_MUSIC.csh is included in PATH. Currently only BAM files are supported in preprocessing. We will add more file types, soon. Note that it is necessary to install samtools for making sure this runs smoothly. The above code calls the ER's using the optimal l_p selection with the default parameters. Using Mappability correction increases the accuracy of MUSIC. You can download the multi-mappability signals for several common read lengths here. To generate the multi-mappability profile, MUSIC depends on a short read aligner. By default, bowtie2 generated SAM alignments are supported. Following are necessary for multi-Mappability profile generation:
- The read length
- bowtie2 installation and the genome indices for bowtie2
- Read length
Use this script generate_multimappability_signal.csh under bin/ directory to generate the multi-mappability profile. For example, to generate the 50 bp multi-Mappability profile for human genome,
./bin/generate_multimappability_signal.csh hg19.fa 50 /home/users/music_user/bt2_indices/hg19_indexes/hg19
Note that the bowtie indices must be supplied in the command line and the path must be the absolute path. For the above command, they are under /home/users/music_user/bt2_indices/hg19_indexes/hg19.
This command processes the FASTA file and writes two temporary scripts (temp_map_reads.csh, temp_process_mapping.csh) that fragments the genome, maps the fragments, and then generates the profile. Next, it is necessary to run these two scripts, in order:
./temp_process_mapping.csh
After the scripts are run, multi-mappability profile for each chromosome should be created with '.bin' extension. 'temp_map_reads.csh' is the most time consuming script that maps the fragments to the genome. Each line in the script is a command that maps the reads to a chromosome that can be run in parallel on a cluster. It is important to make sure 'temp_map_reads.csh' finishes before running 'temp_process_mapping.csh'.
Your can also email me (arif.o.harmanci@uth.tmc.edu) to generate multimappability profiles for new species.
The ENCODE datasets can be downloaded from UCSC Genome Browser.The H3K36me3 datasets for K562 and GM12878 cell lines can be downloaded from here.