There are many methods for identifying significant Hi-C contacts between a particular genomic region and its neighborhood within a given range of distances. One popular method was introduced by Won et al. in 2016 (https://doi.org/10.1038/nature19847). Here we present a handy tool that applies this method (with minor technical differences). It allows the user to extract meaningful contacts from a Hi-C map for a predefined list of genomic coordinates corresponding to SNPs, TSSs, or any other features.
The package was developed to detect significant contacts in human Hi-C data. It has not been tested on other species.
Requirements
- Python 3.6+
Install from PyPI using pip.
pip install contact-hunter
- Hi-C map in .cool format
- Tab-delimited file with the genomic features to be explored. It must not contain a header; two columns are expected: chromosome, start
- Tab-delimited file with the background. It must not contain a header; two columns are expected: chromosome, start
The background file can be generated from the data you are exploring. For example, if you want to find contacts for a list of specific SNPs, it is reasonable to use all remaining SNPs from the relevant GWAS as the background. For a set of differentially expressed genes, the TSSs of all other genes can serve as the background. More details on the background can be found in the Methods section of https://doi.org/10.1038/nature19847.
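As a minimal sketch of preparing the two input files described above, the snippet below splits a full list of loci into a test set and a background set and writes each as a headerless, two-column, tab-delimited file. The coordinates, file names, and helper functions are hypothetical and are not part of contact_hunter; chromosome names are assumed to match those used in the .cool file.

```python
import csv
import os
import tempfile


def write_loci(path, loci):
    """Write (chromosome, start) pairs as a headerless, tab-delimited file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for chrom, start in loci:
            writer.writerow([chrom, int(start)])


def split_test_and_background(all_loci, test_loci):
    """Return (test, background); background is every locus not in the test set."""
    test_set = set(test_loci)
    background = [locus for locus in all_loci if locus not in test_set]
    return sorted(test_set), background


# Hypothetical data: all SNPs from a study and the subset of interest.
all_snps = [("chr1", 1_050_000), ("chr1", 2_300_500),
            ("chr2", 750_250), ("chr2", 9_100_000)]
snps_of_interest = [("chr1", 2_300_500), ("chr2", 750_250)]

test, background = split_test_and_background(all_snps, snps_of_interest)

# A temporary directory is used here; replace it with your own working directory.
out_dir = tempfile.mkdtemp()
test_path = os.path.join(out_dir, "snps_test.tsv")
background_path = os.path.join(out_dir, "snps_background.tsv")
write_loci(test_path, test)
write_loci(background_path, background)
```

The two resulting paths can then be passed as the test and background inputs.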
Run in the terminal:
contact_hunter COOL_PATH LOCUS_BACKGROUND LOCUS_TEST RESOLUTION DISTANCE RESULTS_FILE
Type contact_hunter -h in the terminal to view all the parameters.
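To make the order of the positional arguments concrete, here is a small sketch that assembles the command line from Python. The file names are hypothetical, and the assumption that RESOLUTION and DISTANCE are given in base pairs is not confirmed by this README; check contact_hunter -h before running.

```python
import shlex


def build_command(cool_path, background, test, resolution, distance, results):
    """Assemble the positional arguments in the order shown in the usage line."""
    return ["contact_hunter", cool_path, background, test,
            str(resolution), str(distance), results]


# Hypothetical inputs; resolution and distance are ASSUMED to be in base pairs.
cmd = build_command("sample.cool", "snps_background.tsv", "snps_test.tsv",
                    10_000, 5_000_000, "contacts.tsv")
command_line = " ".join(shlex.quote(arg) for arg in cmd)
print(command_line)

# To actually run it (requires contact_hunter to be installed and the files to exist):
# import subprocess
# subprocess.run(cmd, check=True)
```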
Import the module:
import contact_hunter
Use the get_contacts function:
contact_hunter.get_contacts(cool, background_locus, tested_locus, resolution, distance)
Type help(contact_hunter.get_contacts), or ?contact_hunter.get_contacts in a Jupyter notebook, to view all the parameters.
The tool returns a table with 5 columns:
- chr - chromosome
- bin_start - start of the target bins (the algorithm detects significant interactions between these and the surrounding bins)
- list_of_loci - list of the precise coordinates of the features of interest (SNPs, TSSs, etc.) falling within the bins
- interacting_locus_coord - start of significantly interacting bins
- pval - p-value
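The relationship between bin_start and list_of_loci can be sketched as follows. This assumes fixed-size bins aligned to position 0, as in a uniformly binned .cool file; the helper names and coordinates are hypothetical, and the tool's internal binning may differ in detail.

```python
from collections import defaultdict


def bin_start(position, resolution):
    """Start coordinate of the fixed-size bin that contains `position`."""
    return (position // resolution) * resolution


def group_loci_by_bin(loci, resolution):
    """Map (chromosome, bin_start) to the sorted feature coordinates in that bin."""
    bins = defaultdict(list)
    for chrom, pos in loci:
        bins[(chrom, bin_start(pos, resolution))].append(pos)
    return {key: sorted(positions) for key, positions in bins.items()}


# Hypothetical SNP coordinates binned at 10 kb resolution.
loci = [("chr1", 2_300_500), ("chr1", 2_304_999),
        ("chr1", 2_310_000), ("chr2", 750_250)]
grouped = group_loci_by_bin(loci, resolution=10_000)

for (chrom, start), positions in sorted(grouped.items()):
    print(chrom, start, positions)
```

Here the first two SNPs fall into the same bin, so they would share one bin_start and appear together in list_of_loci.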
When using the CLI version, you get a file containing the table described above, since the output file name is a required argument.
When used as a Python module, the get_contacts function returns a table, but no output file is created.
The tool has been tested on human data, with the goal of detecting genomic regions that interact significantly with a list of target SNPs or with the TSSs of a gene set. It can also be used to explore contacts in other species or with other features (for example, to get contacts for a particular set of ATAC-seq peaks). In that case, generating an average heatmap is recommended; a dedicated option enables it. In addition to the basic output, this option yields an average heatmap around the significant contacts, which gives a rough estimate of how well the tool performs on the user's data. Clear enrichment in the central pixel is a good sign! :)
- add --avr_heatmap to the command when using the CLI version
- specify plot_generate=True when using the tool as a Python module
One important issue is the resolution of the Hi-C data. It is tempting to set the bin size as small as possible, since this allows the resulting contacts to be annotated more accurately in subsequent analysis. Unfortunately, sparse data are not appropriate here. The choice of resolution should be guided by the quality of the Hi-C map.
In accordance with the original paper (https://doi.org/10.1038/nature19847), an appropriate distance limiting the contact search range is ±5 Mb for human data.
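To give a sense of how the distance and the resolution together determine the number of bins tested per target, the sketch below enumerates the bins within ±distance of a target bin. It is an illustration only: the function is hypothetical, and whether the tool clips the window at chromosome ends or excludes the target bin in exactly this way is an assumption.

```python
def window_bin_starts(target_bin, resolution, distance, chrom_length):
    """Bin starts within +/- distance of target_bin, clipped to the chromosome.

    The target bin itself is excluded (an assumption of this sketch).
    """
    first = max(0, target_bin - distance)
    first = (first // resolution) * resolution
    last = min(target_bin + distance, chrom_length - 1)
    last = (last // resolution) * resolution
    return [b for b in range(first, last + resolution, resolution)
            if b != target_bin]


CHR1_LENGTH = 248_956_422  # length of human chr1 in GRCh38

# A target far from the chromosome ends: the full +/- 5 Mb window at 10 kb.
mid_window = window_bin_starts(100_000_000, 10_000, 5_000_000, CHR1_LENGTH)
# A target near the chromosome start: the window is truncated on the left.
edge_window = window_bin_starts(2_300_000, 10_000, 5_000_000, CHR1_LENGTH)

print(len(mid_window), len(edge_window))
```

Halving the bin size doubles the number of candidate bins per target while spreading the same reads over more pixels, which is one way to see why very small bins are problematic on sparse maps.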
The algorithm selects significant contacts by FDR; the default FDR value is 0.01. The pval column of the output table contains the p-values of the contacts that survived the correction. Importantly, if the user plans to select contacts by p-value (e.g., to consider only the contacts with the lowest p-values), this selection should be done separately for each chromosome; a single threshold should not be applied across chromosomes. The reason is that the algorithm processes each chromosome separately and calculates the critical values individually.
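The per-chromosome selection recommended above could look like the sketch below. The rows are hypothetical and are represented as plain dictionaries keyed by the output column names; the actual object returned by get_contacts may have a different type.

```python
from collections import defaultdict


def top_contacts_per_chromosome(rows, n):
    """Keep the n contacts with the lowest p-values within each chromosome."""
    by_chrom = defaultdict(list)
    for row in rows:
        by_chrom[row["chr"]].append(row)
    selected = []
    for chrom_rows in by_chrom.values():
        selected.extend(sorted(chrom_rows, key=lambda r: r["pval"])[:n])
    return selected


# Hypothetical output rows (the list_of_loci column is omitted for brevity).
rows = [
    {"chr": "chr1", "bin_start": 2_300_000, "interacting_locus_coord": 2_450_000, "pval": 1e-6},
    {"chr": "chr1", "bin_start": 2_300_000, "interacting_locus_coord": 2_900_000, "pval": 3e-4},
    {"chr": "chr1", "bin_start": 2_300_000, "interacting_locus_coord": 3_100_000, "pval": 2e-8},
    {"chr": "chr2", "bin_start": 750_000, "interacting_locus_coord": 900_000, "pval": 5e-3},
    {"chr": "chr2", "bin_start": 750_000, "interacting_locus_coord": 1_200_000, "pval": 1e-3},
]

best = top_contacts_per_chromosome(rows, n=1)
for row in best:
    print(row["chr"], row["interacting_locus_coord"], row["pval"])
```

In this toy example, a single global cutoff such as pval < 1e-5 would discard every chr2 contact, whereas ranking within each chromosome keeps the strongest contact from both.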
Distributed under the GPLv3 License. See LICENSE for more information.
Anna Kononkova - a.kononkova@yandex.ru
Project Link: https://github.com/Khrameeva-Lab/contact-hunter