Muhammad Saad Shamim edited this page Mar 10, 2022 · 6 revisions
Data

The data file formats used in the Juicer / Juicebox / Straw ecosystem are described below.

The Juicer data archive is available at aidenlab.org/data.html and consists of .hic files, described below.

.hic files

The .hic file is a highly compressed binary file that stores contact matrices at multiple resolutions and is indexed so that any region can be read by random access, without loading the whole file.

To create a .hic file, use Pre. To extract data from a .hic file, use dump, or access the data directly with the Straw API. All of the feature annotation algorithms operate directly on .hic files. Juicebox relies on the fast querying capabilities of .hic files to let users zoom quickly between many different resolutions.

The .hic file format is described in detail at: https://github.com/aidenlab/hic-format. The Straw API can be used to read data from the .hic file into Java, Python, C++, R, and MATLAB.
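As a rough illustration of what "binary file" means here, the sketch below reads the first two header fields from a byte stream. It assumes the layout given in the format specification linked above: a null-terminated magic string followed by a little-endian 32-bit version number. The function name `read_hic_header` and the synthetic bytes are hypothetical and exist only for this example; for real work, the Straw API is the intended way to read .hic files.

```python
import io
import struct


def read_hic_header(stream):
    """Return (magic, version) from the start of a .hic byte stream.

    Assumes a null-terminated magic string followed by a 4-byte
    little-endian integer version; check the format spec before relying on it.
    """
    # Read the null-terminated magic string one byte at a time.
    magic = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            raise ValueError("unexpected end of data while reading magic string")
        if byte == b"\x00":
            break
        magic.append(byte[0])

    # Read the version as a 4-byte little-endian signed integer.
    raw = stream.read(4)
    if len(raw) != 4:
        raise ValueError("unexpected end of data while reading version")
    (version,) = struct.unpack("<i", raw)
    return magic.decode("ascii"), version


# Synthetic bytes standing in for the start of a real file (hypothetical data).
fake_header = b"HIC\x00" + struct.pack("<i", 8)
magic, version = read_hic_header(io.BytesIO(fake_header))
print(magic, version)
```

To try this on an actual file, pass a handle opened with `open(path, "rb")` instead of the `io.BytesIO` object.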

Hi-C contacts

Hi-C contacts are represented as one line per read pair; the minimal necessary information is the chromosome and position of each read in the pair. Pre accepts several different formats for representing Hi-C contacts.
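To make the step from contact lines to a contact matrix concrete, here is a minimal sketch that bins contacts at a single resolution. The four-column line layout (`chr1 pos1 chr2 pos2`) is a simplification assumed for this example and is not one of the formats Pre accepts; the function names and sample data are likewise hypothetical.

```python
from collections import Counter


def parse_contact(line):
    """Parse one hypothetical 'chr1 pos1 chr2 pos2' line into a tuple."""
    chr1, pos1, chr2, pos2 = line.split()
    return chr1, int(pos1), chr2, int(pos2)


def bin_contacts(lines, resolution):
    """Count contacts per pair of bins at the given resolution (in bp)."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chr1, pos1, chr2, pos2 = parse_contact(line)
        a = (chr1, pos1 // resolution)
        b = (chr2, pos2 // resolution)
        # Store each pair in a canonical order so (a, b) and (b, a) share a key.
        key = (a, b) if a <= b else (b, a)
        counts[key] += 1
    return counts


# Made-up contacts for illustration only.
contacts = [
    "chr1 10500 chr1 250300",
    "chr1 251000 chr1 12000",
    "chr1 10100 chr2 5000",
]
matrix = bin_contacts(contacts, resolution=100_000)
for (a, b), n in sorted(matrix.items()):
    print(a, b, n)
```

Calling `bin_contacts` again with a different `resolution` gives the same contacts at a coarser or finer bin size, which is the idea behind keeping several resolutions in one .hic file.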

Fastq files

These are the raw data files that come off the sequencer. They include the read name, the read (a string of A, C, G, T, or N), and base quality information. See this article for more information. Juicer takes .fastq files and transforms them into contact matrices stored in a .hic file.
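The three pieces of information named above can be pulled out of a FASTQ file with a few lines of code. The sketch below assumes the common four-line record layout (`@name`, sequence, `+`, quality) with no line-wrapped sequences; `read_fastq` and the sample reads are hypothetical and are not part of Juicer.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].strip(), sequence, quality


# Two made-up reads for illustration only.
example = io.StringIO("@read1\nACGTN\n+\nIIII#\n@read2\nGGCA\n+\nFFFF\n")
records = list(read_fastq(example))
for name, sequence, quality in records:
    print(name, sequence, quality)
```

For a file on disk, pass `open(path)` as the handle; compressed `.fastq.gz` files can be opened with `gzip.open(path, "rt")`.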