-
Notifications
You must be signed in to change notification settings - Fork 1
/
manual.html
191 lines (139 loc) · 7.59 KB
/
manual.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head>
<meta content="text/html; charset=ISO-8859-1" http-equiv="content-type"><title>ACCOST: ACCOST: Altered Chromatin COnformation STatistics</title></head>
<body>
<h2>ACCOST: Altered Chromatin COnformation STatistics</h2>
<h3>Description</h3>
ACCOST assigns statistical significance to differences in contact counts in Hi-C experiments. The program uses a negative binomial to model the contact counts, and pools contacts at the same genomic distance to aid in estimating the mean and variance of the data.<br>
<br>
Note that ACCOST takes as input Hi-C that has already been mapped, filtered and binned. Normalization must also be performed in advance, and the bias values provided alongside the raw contact counts, as described below.<br>
<p><b>Prerequisites & credits</b></p>
<p>ACCOST was written with Python 2.7 and R, and requires both. It also requires numpy and scipy, as well as some sklearn libraries.</p>
<h3>Usage</h3>
<pre>accost.py [options] <bin size> <bin file> <filename input file> <ID A> <ID B> <output dir></pre>
<p><b>Required inputs</b></p>
<ul>
<li>
<p><span style="font-family: monospace">bin size</span> is an integer value corresponding to the number of bases in each bin</p></li>
<li>
<p><span style="font-family: monospace">bin file</span> is a tab-delimited file containing bin midpoints and mappability data, with the format:</p></li>
<pre>
<chr ID> <mid> <anythingElse> <mappabilityValue> <anythingElse>+
chr10 50000 NA 0.65 ...
</pre>
<li>
<p><span style="font-family: monospace">filename input file</span> is a table of filenames and their assignments to celltypes/controls/etc., which are themselves specified in the <span style="font-family: monospace">ID A</span> and <span style="font-family: monospace">ID B</span> columns. The <span style="font-family: monospace">idfile</span> has the following tab-delimited format:</p>
<div style="font-family: monospace">
<id A> <hi-c> <bias><br>
<id A> <hi-c> <bias><br>
<id B> <hi-c> <bias><br>
<id B> <hi-c> <bias><br>
</div>
<br>
</li>
<li>
<span style="font-family: monospace"><id A></span>
and
<span style="font-family: monospace"><id B></span>
can be any string (e.g. 1 and 2, or A and B, or WT and KO), but only two classes are supported.<br>
</li>
<li>
<p><span style="font-family: monospace">output dir</span> is the directory name where the output will be written. It will be created and shouldn't exist.</p></li>
</ul>
<p><b>Input format</b></p>
<p>The inputs are <b>raw</b> Hi-C contact count files in
tab-delimited format, with columns
<span style="font-family: monospace"><chr1> <mid1> <chr2> <mid2> <# reads></span><br>
<br>
E.g.:<br>
<pre>
chr6 225000 chr11 795000 1
chr5 425000 chr8 1195000 1
chr5 425000 chr8 1105000 1
chr5 425000 chr8 1115000 1
chr5 425000 chr8 1145000 2
chr5 425000 chr8 1155000 3
chr2 685000 chr14 2355000 1
</pre>
</div>
<br>
The data do not need to include zeros and do not have to be sorted. Gzipped input (detected by providing a filename ending in gz) is supported. Note that the data must be raw; non-integer counts will produce an error.<br>
<br>
<span style="font-family: monospace"><bias 1></span> and
<span style="font-family: monospace"><bias 2></span> are normalization biases (e.g. from
ICE normalization). The files should either have three tab-delimited columns specifying the genomic coordinates of each bin and the corresponding bias, e.g.:<br>
<pre>
chr6 225000 0.87
chr6 325000 0.92
</pre>
or could be provided as a single column of floats such that<br>
<br>
<span style="font-family: monospace">normalized_i,j = raw_i,j / (bias_i * bias_j)</span><br>
<br>
In practice, one can easily obtain the single column of biases by running the <a href="http://members.cbio.mines-paristech.fr/~nvaroquaux/iced/index.html">ICE normalization package</a> with output_bias=True.
<h3>Output format</h3>
<p>Running <span style="font-family: monospace">ACCOST.py</span> will produce three output files:
<span style="font-family: monospace"><chr_id>_ln_pvals.txt</span>
(containing i,j indices and natural log of p-values),
<span style="font-family: monospace"><chr_id>_stats.csv</span>
(containing additional statistics useful for debugging), and
<span style="font-family: monospace"><chr_id>_differential_ln_pvals_expanded.txt</span>
(containing more user friendly expansion of the <span style="font-family: monospace"><chr_id>_ln_pvals.txt</span> file) in the form:
</p>
<pre>
i j chr1 mid1 chr2 mid2 dist ln_pval log10_pval pval
0 1 chr1 5000 chr1 15000 1 -13.708 -5.953 1.114e-06
0 2 chr1 5000 chr1 25000 2 -11.676 -5.071 8.497e-06
0 3 chr1 5000 chr1 35000 3 -2.251 -0.978 1.053e-01
0 9 chr1 5000 chr1 95000 9 -6.114 -2.655 2.212e-03
0 10 chr1 5000 chr1 105000 10 -4.342 -1.886 1.300e-02
</pre>
<b>Full argument list:</b><br>
<pre>
usage: ACCOST.py [-h] [-d DISTANCES] [-dr DISTANCE_REVERSE] [-o OUTPUT_PREFIX]
[-mind MIN_DIST] [-maxd MAX_DIST]
[--filter_diagonal FILTER_DIAGONAL] [-m MAP_THRESH]
[-p MIN_PERCENTILE] [-ds DIST_SMOOTH] [-om OUTPUT_MATRIX]
[--output_p_thresh OUTPUT_P_THRESH]
[--output_q_thresh OUTPUT_Q_THRESH] [--no_dist_norm]
binsize binfile idfile id_A id_B
positional arguments:
binsize the bin size of the input file
binfile bin and mappability file
idfile input file of count and bias filenames
id_A ID for class A
id_B ID for class B
outdir name of the output directory
optional arguments:
-h, --help show this help message and exit
-d DISTANCES, --distances DISTANCES
pregenerated matrix with distance, to speed up
calculations
-dr DISTANCE_REVERSE, --distance_reverse DISTANCE_REVERSE
reverse matrix with distance, to speed up
calcaulations
-mind MIN_DIST, --min_dist MIN_DIST
the lower threshold for distance (in bins, default 0)
-maxd MAX_DIST, --max_dist MAX_DIST
the upper threshold for length (in bins, default
10000)
--filter_diagonal FILTER_DIAGONAL
filter counts on the diagonal?
-m MAP_THRESH, --map_thresh MAP_THRESH
the mappability threshold below which to ignore bins
(default 0.25)
-p MIN_PERCENTILE, --min_percentile MIN_PERCENTILE
the count percentile threshold, below which to ignore
counts (default 80 (%))
-ds DIST_SMOOTH, --dist_smooth DIST_SMOOTH
the number of locus pairs to smooth over for
calculating mean/variance (default 10)
-om OUTPUT_MATRIX, --output_matrix OUTPUT_MATRIX
Whether to output the full, dense matrix of P-values
(rather than just those passing a threshold) (default:
F)
--output_p_thresh OUTPUT_P_THRESH
--output_q_thresh OUTPUT_Q_THRESH
--no_dist_norm Disable distance size factors, for testing
</pre>
</body></html>