This document presents several scripts and simple commands for the processing, vizualisation and primary analysis of 3C/Hi-C datasets. These scripts were used during the INSERM workshop Capturing chromosome conformation: toward a 3D view of genome regulation, May 9-13, 2016 in Paris at the Pasteur Institute and during the last session of [C3BI Training – Introduction to NGS data analysis] (https://c3bi.pasteur.fr/training-introduction-to-ngs-data-analysis ), 28th of June 2016.
For queries or help getting these running, you can send an email or open an issue at the github repository.
- Dependencies
- Raw data extraction and alignment
- Building of the contacts map
- Normalization of the data
- Computation of genomic distance law
- Computation of correlation matrices
- Directional Index tool to detect TADs
- Decomposition into eigen vectors
- Use of sparse formalism
- Miscellaneous
These scripts will run on Unix-based systems such as Linux or OS X. It basically requires to have Python installed on your machine which is commonly installed on Unix-based systems. For Windows, you may download Python at https://www.python.org/downloads/windows/. In order to run the Unix commands below, you will need something like Cygwin, Windows Subsystems for Linux, or manually install the command line tools with a package manager like Scoop. Then, a few python modules are necessary for diverses operations on arrays and vizualisation.
- Numpy
- Matplotlib
- pysam
- Scipy
- Biopython
- snakemake
These can be readily installed using the supplied requirements.txt
file:
pip install -Ur requirements.txt
A minimal test dataset is provided in the data
folder. It consists of a reference fasta file and two fastq files containing the pairs of Hi-C reads.
The whole pipeline is implemented in a Snakefile where file path and variables can easily be changed in the "USER CONFIG" section. Each step of the pipeline is described individually below.
You can run the whole pipeline at once using :
# change the value of -j to the number of cores on your machine
snakemake -j8
The python script iterative_alignment2.py
performs iterative mapping by first truncating the reads to a given length (20bp by default) and then iteratively extending and remapping the reads that did not map uniquely. This allows to map reads that would otherwise have a ligation site in the middle. Running python python_codes/iterative_alignment2.py -h
shows the different arguments it can take.
usage: iterative_alignment2.py [-h] -r REFERENCE [-p NB_PROCESSORS]
[-o OUT_SAM] [-T TEMPDIR] [-m] [-l MIN_LEN]
in_fq
positional arguments:
in_fq The fastq file containing Hi-C reads.
optional arguments:
-h, --help show this help message and exit
-r REFERENCE, --reference REFERENCE
Path to the reference genome, in FASTA format.
-p NB_PROCESSORS, --nb_processors NB_PROCESSORS
number of CPUs used for alignment.
-o OUT_SAM, --out_sam OUT_SAM
Path to the output SAM file for the alignment of
in_fq.
-T TEMPDIR, --tempdir TEMPDIR
Directory to write temporary files. Defaults to
current directory.
-m, --minimap2 Use minimap2 instead of bowtie for the alignment.
-l MIN_LEN, --min_len MIN_LEN
Minimum length to which reads should be truncated.
For example:
python python_codes/iterative_alignment2.py \
-r data/chr01.fa
-o end1.sam data/hic.end1.fastq.gz
The script needs to be run separately on both fastq files and each will return a SAM file containing each read once, including those that are not aligned.
The SAM files must then be sorted by read name, and the header remove to facilitate parsing. This can be achieved using the samtools
utilities.
samtools sort -n end1.sam | samtools view > end1.sorted.sam
samtools sort -n end2.sam | samtools view > end2.sorted.sam
The sam files can then be parsed and combined into a single file using the GNU coreutils suite of tools to match reads of the same pair. We also filter pairs where both reads have a mapping quality above 30.
paste <(awk -v OFS="\t" '{if($2 == 0) {$2 = "+"}
else {$2 = "-"}
print $3, $4, $5, $2}' end1.sorted.sam) \
<(awk -v OFS="\t" '{if($2 == 0) {$2 = "+"}
else {$2 = "-"}
print $3, $4, $5, $2}' end2.sorted.sam) \
| awk '{if($3 >= 30 && $7 >= 30)
print $1, $2, $4, $5, $6, $8}' > pairs.dat
The SAM flags 0 and 16 are converted into + and - strands, respectively and the reads are ordered into a convenient table with one pair per line. For each read, the chromosome, position and strand informations are present.
At this stage, you should have a file containing this information:
Sc_chr01 54412 - Sc_chr01 54276 +
Sc_chr01 52956 + Sc_chr01 45745 +
Sc_chr01 69448 - Sc_chr01 69297 +
Sc_chr01 79289 - Sc_chr01 79177 +
Sc_chr01 38824 + Sc_chr01 36185 -
Sc_chr01 131720 + Sc_chr01 131874 -
Sc_chr01 195698 - Sc_chr01 195459 +
Sc_chr01 138110 + Sc_chr01 138287 -
You can then assign a restriction fragment to each read using the python code fragment_attribution.py
:
Assigns restriction fragment ID to Hi-C records in a .dat file.
usage: fragment_attribution.py [-h] -e ENZYME -r REFERENCE
input_file [output_file]
positional arguments:
input_file Path to the input file containing the coordinates of
Hi-C interacting pairs (dat format). Use '-' to read
from stdin.
output_file Path to the output file (dat.indices format). Defaults
to stdout.
optional arguments:
-h, --help show this help message and exit
-e ENZYME, --enzyme ENZYME
The restriction enzyme used in the Hi-C protocol. E.g.
DpnII.
-r REFERENCE, --reference REFERENCE
Path to the reference genome in FASTA format. Can be
multiple files separated by commas.
For example
python python_codes/fragment_attribution.py \
-e DpnII \
-r data/chr01.fa \
pairs.dat \
pairs.dat.indices
This will output a file with two additional column (4th and 8th) containing the indices of the restriction fragments on which the reads are located.
Sc_chr01 1665 + 4 Sc_chr01 1838 - 4
Sc_chr01 157259 + 437 Sc_chr01 157541 - 437
Sc_chr01 107877 - 290 Sc_chr01 126038 - 343
Sc_chr01 68529 - 184 Sc_chr01 68276 + 184
Sc_chr01 81348 + 215 Sc_chr01 81527 - 215
Sc_chr01 126315 + 344 Sc_chr01 121347 + 326
A check for the proportion of uncrosslinked events (uncuts, loops...) can be carried out at this stage.
With these information, one can remove uninformative events more cautiously then just removing all events below 10 kb. This filtering is optional and might be necessary when you want to study the structure of chromatin at very short scales like several kb.
We used the home-made python code filter_events.py
:
usage: filter_events.py [-h] [-i | -t THRESHOLDS THRESHOLDS] [-p]
input_file [output_file]
positional arguments:
input_file The file containing the coordinates of Hi-C
interacting pairs and, the indices of their
restriction fragments (.dat.indices format).
output_file Path to the output file (filtered dat.indices).
Defaults to stdout.
optional arguments:
-h, --help show this help message and exit
-i, --interactive Interactively shows plots and asks the user for
thresholds. Overrides predefined thresholds. Disabled
by default.
-t THRESHOLDS THRESHOLDS, --thresholds THRESHOLDS THRESHOLDS
The minimum number of restriction fragments between
reads to consider loops and uncut events. Estimated
automatically by default. Must be two integers
separated by a space (-t <loop> <uncut>).
-p, --plot_summary Output a piechart summarizing library composition.
For example:
python filter_events.py pairs.dat.indices pairs.dat.indices.filtered
By default, the program will automatically estimate a threshold to remove loop and uncut events. This threshold represents the minimum number of restriction fragments separating reads in a Hi-C pair. This can be done interactively using the -i
flag and the proportion of each type of library events can also be displayed using the -p
flag.
You can have a look to the behaviors of the different pairs of reads taking into account the directions of the reads with this kind of plots:
At this step, close the window, the program will ask you the two thresholds for the uncuts and loops events that you can determine from the previous plot. Then, the code counts the different types of events, it gives information about the quality of your library preparation.
A file where non-crosslinked events have been removed is created and name output_alignment_idpt.dat.indices.filtered.
From the alignment file, you can build a binned contacts map. We used the home-made python code Matrice_Creator.py
: that converts the alignment file information into a dictionary structure and then into an array of the chosen chromosome(s).
There are two arguments to enter: the path of your output alignment file and the size of the bin (in bp).
Ex:
python Matrice_Creator.py pairs.indices.filtered 1000000
You should have a picture with all the chromosomes :
In this code, you can modify the list of the chromosome(s) you want to display. For example, you can display only the human chromosome 3 (list_chr = ["chr3"]) you shoud have a picture looking like this:
We used the normalization procedure called SCN (for Sequential Component Normalization presented in http://www.biomedcentral.com/1471-2164/13/436).
This procedure assumes that every bin should be detected with the same strength. We divide each line of the previous matrice by its sum then each column by its sum. We reiterate this loop several times. It can be shown that after several iterations, the matrice has converged.
Before the iterations, poor interacting bins must be discarded and considered as non detectable bins (due to experimental biases, unmapple sequences etc).
To do that, a threshold is given as an argument to the SCN function so that every bin with a reads number below this value will be replaced by vectors of zeros. To determine this threshold, you can plot the distribution in the number of reads by doing the histogram in a python terminal with histo_r.py
:
from pylab import *
from histo_r import *
MATRICE=loadtxt("chr3_RAW.txt");
histo_r(MATRICE.sum(axis=0),100);
The function histo_r(V,N) makes a histogram of the vector V in N bins and plots the result as a bar plot (it is an equivalent of the R function hist() ).
To normalise the raw contact map, we used the function scn_func coded in the module scn_human.
import scn_human
mn=scn_human.scn_func(MATRICE,1500);
imshow(mn**0.2,interpolation="none");
colorbar();
savefig('chr3_NORMALISED.png');
np.savetxt('chr3_NORMALISED.txt',MATRICE);
The normalised contacts map should look like this:
The matrice elements can be viewed as relative frequencies of contacts between different loci (the scores go now from 0 to 1). It should be noticed that this procedure does not conserve the symetry property of the matrix. To recover the symetry property, you can use a simple line in python:
mn=(mn+mn.T)/2;
where mn.T is the transposed matrice of mn.
This plot is important and must be computed at the very first steps of data processing. It reflects the polymer behaviour of the chromatin and thus allows to check the presence or absence of 3D signal. It consists in computing the mean number of reads in function of the genomic distance separating the two loci. The computation thus consists in scanning every diagonal of the matrice and taking the average of this sets of elements. We coded this computation in the function distance_law_human.
from pylab import *
import distance_law_human
# mn: normalised matrice
d=distance_law_human.dist_law(mn)
plt.plot(d,label="Chr 3",linewidth=3.0);
plt.loglog();
plt.xlabel("Genomic distance in bins of 100kb");
plt.ylabel("Contacts Frequency");
plt.legend();
savefig('chr3_distance_law.png');
Example of plot obtained for the chromosome 3:
The correlation matrice is often computed in Hi-C data analysis. Even if not always present in final publication, it can render more visible structures that are present in the data especially domains structures (squares in the contacts maps). It consists in looking for correlation in the shared neighbors between each line and column. It is simply computed by taking the Pearson Coefficient (or another correlation coefficient) between each line and each column of a matrice coded in the function corrcoef from Numpy.
n1=mn.shape[0];
# Computation of genomic distance law matrice:
MAT_DIST = np.zeros((n1, n1));
for i in range(0,n1) :
for j in range(0,n1) :
MAT_DIST[i,j] = d[abs(j-i)] ;
#imshow(MAT_DIST**0.2);
MAT2=mn/MAT_DIST;
imshow( MAT2**0.2);
# Computation of the correlation matrice:
mc=np.corrcoef(MAT2);
mc[np.isnan(mc)] = 0.0;
mc2=np.corrcoef(mc);
mc2[np.isnan(mc2)] = 0.0;
imshow( mc2,interpolation='none');
colorbar();
savefig('chr3_CORRELATION2.png');
Example of plot for the human chromosome 3 with 100 kb bins and 2 iterations of correlation:
This tool is commonly used in Hi-C data analysis. It looks for change in the directionality between "left vector" and "right vector" at a certain loci in the genome. A change could come from the presence of a border between two different compartments in the genome. It consists in doing a paired T test between "left vector" and "right vector" on each bin along the genome. The size of the "left vector" and "right vector" is put as a parameter and allows to look for domains structures at a specific scale. We coded this computation in the function directional_indice.
import directional_indice
import matplotlib.gridspec as gridspec
# Computation and plot of Directional Index:
M = np.corrcoef(mn);
n1 = M.shape[0];
scale=20;
DI = directional_indice.directional(M,scale).T;
# PLOT of contacts Map and Directional Index:
gs = gridspec.GridSpec(2, 1, height_ratios=[5,1] ); # to have several subplots on the same graph
ax0 = plt.subplot(gs[0]);
ax0.imshow(mn**0.2,interpolation="none")
ax0.set_title("Contacts Map");
ax2 = plt.subplot(gs[1], sharex=ax0);
b1=0;
b2=len(DI.T);
borders = list(range(b1,b2));
borders2 = DI.T
borders2 = array(borders2[:,0]);
ax2.set_xlim([b1, b2]);
ax2.set_ylim([-2.0, 2.0]);
ax2.fill_between( borders, 0,borders2, color='red' );
ax2.fill_between( borders, 0,borders2,borders2<0 ,color='green' );
ax2.set_ylabel("DI scale = 2Mb)");
savefig('chr3_DI_bin100kb.png');
Example of plot for the human chromosome 3 with 100 kb bins and DI carried out at the scale of 2Mb:
This geometrical transformation allows to decompose the matrice into eigen values and vectors. It is a way to simplify the data or at least to decrease the dimentionality of the mathematical object. It can be shown that the first eigen vector corresponds to the 2 compartments partition signal of the genome. It should be kept in mind that this is a first approximation of the 3D structure of a genome and other or sub-compartments can be detected using higher resolution and/or using other tools of compartment detection.
import matplotlib.gridspec as gridspec
import scipy as sc
from scipy import signal
from scipy.linalg import eig
# Computation of eigen vectors:
(V,D) = sc.linalg.eig(mc);
plot(D[:,0]);
# Plot of contacts Map and Eigen vector:
gs = gridspec.GridSpec(2, 1, height_ratios=[5,1] );
ax0 = plt.subplot(gs[0]);
ax0.imshow(mn**0.2,interpolation="none")
ax0.set_title("Contacts Map");
ax2 = plt.subplot(gs[1], sharex=ax0 );
b1=0;
b2=len(D[:,0]);
borders = list(range(b1,b2));
borders2 = D[:,0]
borders2 = array(borders2[:,0]);
ax2.set_xlim([b1, b2]);
#ax2.set_ylim([-2.0, 2.0]);
ax2.fill_between( borders, 0,borders2, color='darkblue');
ax2.fill_between( borders, 0,borders2,borders2<0 ,color='darkblue' );
ax2.set_xlabel("Position along the genome (in bins of 200kb)");
ax2.set_ylabel("Eigen vector");
savefig('chr3_Eigen_bin100kb.png');
Example of plot for the human chromosome 3 with 100 kb bins (zoom at the end of the chromosome):
An alternative and interesting way to mathematically represent the data is the sparse formalism. It is very relevant for matrices in which most of the elements are zero which is ofter the case for human, mouse contacts maps.
In python, functions implemented for the use of sparse formalism can be found in the module scipy.
From the contacts map, you can do a kind of 4C plot of a position of interest by simply select the corresponding line in the matrix:
plot(mn[ int(37000000/BIN),: ] );
xlabel("Postion along the chromosome in bins of 100kb")
ylabel("Normalised Score");
savefig('4Cplot_chr3.png');
Example of 4C plot from the normalised contacts map for the position 37000000 bp on the chromosome 3: