GARLIC - An artificial non-functional realistic DNA sequence generator.
Copyright (C) 2011-2015 Juan Caballero [Institute for Systems Biology]
DESCRIPTION
GARLIC is an artificial non-functional realistic DNA sequence generator.
Why do we need it? With current sequencing technology we can sequence
any organism and start mining its genome. Many genomic analyses use strong
statistical tests, but until now a good negative control has not been developed.
For example, we can use several programs (Genscan, Augustus, GlimmerHMM, ...)
to predict coding genes. All of these programs have been trained to
recognize what a gene looks like using well-characterized genes, so you expect a
low false negative rate. However, none of them implements a negative control, so
you will find a lot of false positive genes. This is currently true in many model
organisms (including human), where the number of predicted coding genes is greater
than the number of coding genes with evidence. Of course you can use intergenic
regions as a negative control, but these regions are limited in number and size,
and you cannot be 100% sure that they do not contain genes.
So we developed this new tool to generate realistic sequences based on the
properties of the background genome. We define the background genome as the
sequence that remains after removing genes (coding and non-coding),
pseudogenes, interspersed repeats and low complexity sequences (600 Mb in hg19).
We modeled:
(1) composition of the background genome
(2) interspersed repeats
(3) low complexity sequences
The current algorithm first creates a base sequence, then bombards it
with artificially evolved elements (interspersed repeats, low complexity
sequences) as expected in the reference genome. The final output is a FASTA
file containing the newly generated sequence.
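The two-stage idea described above can be illustrated with a toy Python
sketch. This is not GARLIC's implementation (GARLIC is written in Perl and uses
trained genome models); the function names, the single GC-content parameter,
the uniform substitution model and the placeholder elements are all assumptions
made for illustration only.

```python
import random

def make_base_sequence(length, gc_content, rng):
    """Draw bases independently so the expected G+C fraction is gc_content.
    (Assumption: a real model would capture far more than one GC value.)"""
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))

def mutate(element, divergence, rng):
    """'Evolve' an element: replace each base with a different base
    with probability `divergence` (toy substitution-only model)."""
    out = []
    for base in element:
        if rng.random() < divergence:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def insert_elements(base, elements, divergence, rng):
    """Insert a mutated copy of each element at a random position."""
    seq = base
    for element in elements:
        pos = rng.randint(0, len(seq))  # both ends inclusive
        seq = seq[:pos] + mutate(element, divergence, rng) + seq[pos:]
    return seq

def to_fasta(name, seq, width=70):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"

rng = random.Random(42)
base = make_base_sequence(1000, 0.41, rng)
# Placeholder elements: a made-up "repeat" and a simple dinucleotide run.
elements = ["ACGTTGCAAGGCTTAACCGG", "CA" * 15]
fake = insert_elements(base, elements, 0.1, rng)
fasta = to_fasta("fake", fake)
```

Because the toy mutation step only substitutes bases, the final length is the
base length plus the lengths of the inserted elements.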
REQUIREMENTS
- Perl
- Genome models. You can download them from: http://www.repeatmasker.org/garlic/
or create your own (see USAGE below).
- RepBase consensus sequences. You need to download the EMBL file from RepBase
[http://www.girinst.org/repbase] (registration required) and put the file in
data/repbase [suggested].
USAGE
1. Create a new sequence
perl createFakeSequence.pl -m hg19 -s 1Mb -o fake.fa
For more options, please read the documentation using:
perl createFakeSequence.pl --help
2. Train a new model
You can obtain the models from our website. We currently support some
organisms with complete annotation in the UCSC Genome Database:
[http://hgdownload.cse.ucsc.edu/downloads.html]
You can create your own model by fetching the data from the UCSC site:
perl createModel.pl -m hg19
You can also use your own sequences and annotations to create a model:
perl createModel.pl -m myOrg -f myOrg.fa -r RM.out -t TRF.out -g Genes.table
Please read the documentation using:
perl createModel.pl --help
CITATION
Realistic artificial DNA sequences as negative controls for computational genomics.
Caballero J, Smit AF, Hood L, Glusman G.
Nucl. Acids Res. 2014
doi: 10.1093/nar/gku356
LICENSE
All the code is released under the GPLv3 license; see the LICENSE file for details.
CHANGES
1.5 :
- "uninitialized value" warnings caused by UCSC model lookups fixed.
- TRF data should be in UCSC BED format, which uses 0-based/half-open coordinates.
Fixed a bug in the code where 1-based coordinates were assumed, causing a
zero-valued start coordinate to go negative.
- createModel.pl: Added support for relative paths in input parameters.
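
The coordinate issue mentioned in the 1.5 notes can be illustrated with a small
Python sketch. It is not taken from the GARLIC code; the function names are
hypothetical and only show the arithmetic of the two conventions.

```python
def bed_to_one_based(start, end):
    """0-based, half-open [start, end) -> 1-based, closed [start+1, end]."""
    return start + 1, end

def one_based_to_bed(start, end):
    """1-based, closed [start, end] -> 0-based, half-open [start-1, end)."""
    return start - 1, end

def bed_length(start, end):
    """Length of a 0-based, half-open interval."""
    return end - start

# A BED interval covering the first 10 bases of a sequence.
bed_start, bed_end = 0, 10

# Correct handling: the input is already 0-based, so use it as is.
correct_length = bed_length(bed_start, bed_end)

# Buggy handling: wrongly assume the input is 1-based and shift it again,
# which pushes a zero start coordinate below zero.
wrong_start, wrong_end = one_based_to_bed(bed_start, bed_end)
```

Here `wrong_start` becomes -1, which is the kind of negative start coordinate
the fix addresses.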