-
Notifications
You must be signed in to change notification settings - Fork 0
/
EXAMPLES.txt
238 lines (188 loc) · 13.2 KB
/
EXAMPLES.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
To run theses examples, you will need fasta files to run as input. You can download peaks from T. Bailey's website under the supplemental material for DREME:
Directory: https://tlbailey.bitbucket.io/supplementary_data/Bailey2011/
Full set: https://tlbailey.bitbucket.io/supplementary_data/Bailey2011/CHEN.tgz
We will assume that a directory named CHEN contains the fasta sequences under files with the .pos.fasta extension
=======================================================================================
0) Basic run: MotifGP for 10 generations
=======================================================================================
Description:
This example describes a typical dry run for MotifGP. 10 Generations will terminate quickly and will not explore as many possibilities as a longer execution would, but it's enough to give an idea of the expected output.
MotifGP will first print the information about the parameters. Once the motif finding starts, a log of statistics is printed for every generation. The statistics include:
gen: The current generation
evals: The number of solutions evaluated per run (excludes invalid/discarded individuals)
memoize: Number of solution fitnesses stored in cache
maxdepth: The maximum tree height encountered in all solutions in the population
Also, for each fitness objective, it prints the average, minimum, maximum and standard deviation of all the solutions in the population. The default objectives are Discrimination and Fisher's exact test.
Command:
$ python motifgp.py -p CHEN/Smad1.pos.fasta -n 10 --popsize=100
Options:
{'backpad': False, 'verbose': True, 'ngen': 10, 'bg_algo': 'dinuclShuffle', 'erase': None, 'inspector': False, 'no_seed': True, 'checkpoint_path': None, 'mutpb': 0.3, 'revcomp': False, 'cxpb': 0.7, 'random_seed': 42, 'matcher': 'grep', 'training_path': 'CHEN/Smad1.pos.fasta', 'hardmask': False, 'moo': 'FORTIN', 'background_path': None, 'short': 0, 'grammar': 'iupac', 'clutch': False, 'popsize': 100, 'timelimit': None, 'tag': 'default', 'output_path': './OUT/', 'hamming': False, 'fitness': 'DF'}
[['Discrimination', <function fit_discrimination at 0x7ff8e03bfde8>, 1.0], ['Fisher', <function fit_fisher at 0x7ff8e03bfe60>, -1.0]]
[<function fit_discrimination at 0x7ff8e03bfde8>, <function fit_fisher at 0x7ff8e03bfe60>]
[1.0, -1.0]
Creating pset for Regex
HARDMASK: False
Lengths: 1126 1126
Writing background to runtime_tmp/Smad1.pos.fasta_background_855b93d2-30af-45f3-8c9b-eabebe8cb74b
Writing positive to runtime_tmp/Smad1.pos.fasta_foreground_855b93d2-30af-45f3-8c9b-eabebe8cb74b
Loading grep matcher
Loaded.
Initial pop size: 100
Input sequences# 1126
Control sequences# 1126
Running algorithm: FORTIN
Crossover/Mutations 0.7 0.3
Discrimination Fisher
------------------------------------------------ ----------------------------------------
gen evals memoize maxdepth avg max min std avg max min std
0 0 0 4 0.472553 0.531073 0 0.109893 0.5565 1 0.216836 0.133272
0 100 108 8 0.508866 0.807692 0.5 0.0345004 0.450098 0.522975 0.00117953 0.136495
Writing gen 0 at path ./OUT/default/FORTIN/42/0.checkpoint
1 100 176 12 0.528465 1 0.5 0.0677212 0.319425 0.520942 0.00117953 0.170171
... Truncated statistics output ...
9 100 693 17 0.801609 1 0.622378 0.120997 0.0028191 0.0311113 1.7699e-06 0.00589427
10 100 745 17 0.814121 1 0.628472 0.120244 0.00317617 0.0311113 1.7699e-06 0.00604208
Writing gen 10 at path ./OUT/default/FORTIN/42/10.checkpoint
Memoized 745 calls.
regex in 4.4266140461 time
NEF File written at ./OUT/default/FORTIN/42.nef
=======================================================================================
=======================================================================================
1) Run MotifGP for 10 seconds
=======================================================================================
Description:
Instead of terminating on a generation, terminates after a certain amount of time has been elapsed.
Command:
$ python motifgp.py -p CHEN/Smad1.pos.fasta --timelimit 10
Ouput:
.....
regex in 10.3056499958 time
NEF File written at ./OUT/default/FORTIN/42.nef
=======================================================================================
2) Search for motif including their reverse complement
=======================================================================================
Description:
For each individual to be evaluated, compute the reverse complement of the motif search sequences for both.
A sequence is matched if it contains the motif or its reverse complement.
Command:
$ python motifgp.py -p CHEN/Smad1.pos.fasta -n 10 --revcomp
The output motif in the .nef file will not have the reverse complement to keep the data compact, but it is considered in the evaluation of solutions.
=======================================================================================
3) Select different multi-objective parameters
=======================================================================================
Description:
The multi-objective optimization depends on a few parameters, mainly:
1) The fitness objectives
F - Fisher's exact test
D - Discrimination
S - Support
E.g: argument: --fitness=DF, for a Discrimination/Fisher's exact test fitness function
2) The multiobjective optimizer (NSGA-II, SPEA2, or the improved NSGA-IIr. NSGA-II and IIr are dubbed NSGA2 and NSGAR in this software.)
E.g: argument: --moo=NSGA2
3) Evolutionary parameters
a) Population size
E.g --popsize=3000
b) Crossover and mutation rates
E.g. --cxpb=0.6 --mutpb=0.4
Our default settings are well suited for different types of datasets, but it is possible that certain dataset benefit from modified parameters.
For example:
$ python motifgp.py -p CHEN/Smad1.pos.fasta -n 10 --fitness=DS --moo=NSGAR --popsize=3000 --cxpb=0.6 --mutpb=0.4
=======================================================================================
4) Provide a background dataset
=======================================================================================
Description:
The -b argument allows usage of a background dataset instead of the dinucleotide shuffle.
For example:
python motifgp.py -p CHEN/CTCF.pos.fasta -b CHEN/Smad1.pos.fasta -n 1
Options:
{'backpad': False, 'verbose': True, 'ngen': 1, 'bg_algo': 'dinuclShuffle', 'erase': None, 'inspector': False, 'no_seed': True, 'checkpoint_path': None, 'mutpb': 0.3, 'revcomp': False, 'cxpb': 0.7, 'random_seed': 42, 'matcher': 'grep', 'training_path': 'CHEN/CTCF.pos.fasta', 'hardmask': False, 'moo': 'FORTIN', 'background_path': 'CHEN/Smad1.pos.fasta', 'short': 0, 'grammar': 'iupac', 'clutch': False, 'popsize': 1000, 'timelimit': None, 'tag': 'default', 'output_path': './OUT/', 'hamming': False, 'fitness': 'DF'}
[['Discrimination', <function fit_discrimination at 0x7fea287b5e60>, 1.0], ['Fisher', <function fit_fisher at 0x7fea287b5ed8>, -1.0]]
[<function fit_discrimination at 0x7fea287b5e60>, <function fit_fisher at 0x7fea287b5ed8>]
[1.0, -1.0]
Creating pset for Regex
HARDMASK: False
Lengths: 39609 1126
Writing background to runtime_tmp/CTCF.pos.fasta_background_929da026-8273-4361-b523-5cfa2690f527
Writing positive to runtime_tmp/CTCF.pos.fasta_foreground_929da026-8273-4361-b523-5cfa2690f527
Loading grep matcher
Loaded.
Initial pop size: 1000
Input sequences# 39609
Control sequences# 1126
WARNING: Uneven number of input and control sequences.
The output shows a warning because of the uneven sequence count between input and control.
This has an effect on fitness objectives like Discrimination, where fewer input sequences can make
the score easier to maximize.
======================================================================================
5) Evaluating predictions with TOMTOM
======================================================================================
To determine if predictions are correct, we recommend using the TOMTOM motif comparison tool, which is installed through the MEME suite (http://meme-suite.org/doc/download.html?man_type=web). The databases required by TOMTOM are also on the same page (http://meme-suite.org/meme-software/Databases/motifs/motif_databases.12.12.tgz)
The script nef_to_neft.py appends the TOMTOM output appended to each row of an output .nef file from MotifGP.
The first argument is the .nef file
The second argument is the path to the tomtom executable
The third argument is the a comma separated list of motif databases compatible with TOMTOM:
Example:
python nef_to_neft.py OUT/default/FORTIN/42.nef "/home/y/meme/bin/tomtom" ~/meme/db/motif_databases/MOUSE/uniprobe_mouse.meme,~/meme/db/motif_databases/HUMAN/HOCOMOCOv10_HUMAN_mono_meme_format.meme
The output will be written at the same location as the input file, but with a .neft extension.
======================================================================================
6) Creating/Editing objectives
======================================================================================
MotifGP already has a few objective functions implemented in objectives.py, but it is also customizable. Create or edit functions in objective.py, and configure them in config/objectives.map.
As an example, we can create a variant of the Support objective (Total positive sequences matched/Total positive sequences) where the function is changed to (Total negative sequences matched/Total negative sequences).
class NegativeSupport(Objective): # Create a class for objective, and extend Objective (from objective.py)
# In the initialation, you need to assign:
# tag: String - The objective name
# w: float - The weight for minimization or maximization
# f: function - The objective to compute
def __init__(self):
self.tag = "NegativeSupport" # Name the objective
self.w = -1.0 # Set positive weight for maximization objectives, and -1.0 for a minimization objective.
def fit_negative_support(p, P, n, N): # Create a fitness function that accepts p, P, n and N, even if they're unused.
# p : total positive matches
# P : total positive sequences
# n : total negative matches
# N : total negative sequences
# By convention, you can name unused arguments with _, __ and ___. e.g def fitness(p, _, P, __):
return n/(1.0*N)
self.f = fit_negative_support # Designate the fitness
Note that we set w (the weight) to -1.0. The idea is that we should want to minimize the number of negative matches, so a negative weigth seems more appropriate.
Then, simply update the config/objectives.map file to assign a key as the parameter for MotifGP.
F=Fisher
I=ScipyFisher
D=Discrimination
S=Support
O=OddsRatio
R=ScipyOddsRatio
Q=FalseDiscove
# My objectives
N=NegativeSupport
And you should be able to use your objective with the key N. For example a Fisher+NegativeSupport:
python motifgp.py -f FN -p CHEN/Smad1.pos.fasta
...
[<function fit_negative_support at 0x7f3d80bcdc80>, <function fit_fisher at 0x7f3d80bcdcf8>]
[1.0, -1.0]
...
Fisher NegativeSupport
------------------------------------------------ ------------------------------------------------
gen evals memoize maxdepth avg max min std avg max min std
0 0 0 4 -0.700303 1 -11.8035 0.794532 0.722226 0.847247 0 0.244139
0 795 720 8 -1.08205 -0.606384 -20.2213 1.50511 0.797282 0.847247 0.0159858 0.135432
1 762 1079 8 -1.26253 -0.647503 -20.2213 2.18183 0.804621 0.847247 0.0595027 0.109026
2 781 1386 9 -1.87303 -0.647503 -20.2213 3.75498 0.797226 0.847247 0.218472 0.120874
3 798 1682 11 -2.7606 -0.647503 -20.2213 5.03724 0.774728 0.847247 0.305506 0.149269
4 802 1981 11 -3.08767 -0.647503 -21.802 5.12837 0.766437 0.847247 0.371226 0.149922
5 778 2254 11 -3.42811 -0.647503 -21.802 5.06522 0.756964 0.847247 0.371226 0.129628
6 785 2549 10 -3.51545 -0.647503 -21.802 5.46597 0.761878 0.847247 0.371226 0.130661
7 779 2799 12 -4.73884 -0.647503 -24.4557 5.87413 0.731313 0.847247 0.298401 0.136101
8 770 2998 12 -5.54918 -0.647503 -24.4557 6.5863 0.710751 0.847247 0.298401 0.150391
9 787 3194 10 -5.62263 -0.647503 -24.4557 6.54813 0.707751 0.847247 0.298401 0.149553
10 783 3378 12 -5.90551 -0.647503 -24.4557 6.93355 0.705282 0.847247 0.298401 0.156731
Also, your new objective should show up in the --help menu.
-f FITNESS, --fitness=FITNESS
Objective fitness function. Available objectives: D=Di
scrimination,F=Fisher,I=ScipyFisher,O=OddsRatio,N=Nega
tiveSupport,Q=FalseDiscoveryRate,S=Support,R=ScipyOdds
Ratio. Each single character in the string represents
an objective. Objectives are mapped by the
configuration file at config/objectives. Default is
'DF' for [Discrimination,Fisher] (2-objectives).