This repository has been archived by the owner on Nov 7, 2018. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 21
/
Copy pathREADME
1433 lines (1075 loc) · 63.5 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
0. Availability
============
The source code for this package is available from
http://research-pub.gene.com/gmap. License terms are provided in the
COPYING file.
1. Building and installing GMAP and GSNAP
==========================================
Prerequisites: a Unix system (including Cygwin on Windows), a C
compiler, and Perl
Step 1: Set your site-specific variables by editing the file
config.site. In particular, you should set appropriate values for
"prefix" and probably for "with_gmapdb", as explained in that file.
If you are compiling this package on a Macintosh, you may need to edit
CFLAGS to be
CFLAGS = '-O3 -m64'
since Macintosh machines will make only 32-bit executables by default.
Step 2: Build, test, and install the programs, by running the
following GNU commands
./configure
make
make check (optional)
make install
Note 1: Instead of editing the config.site file in step 1, you may type
everything on the command line for the ./configure script in step 2,
like this
./configure --prefix=/your/usr/local/path --with-gmapdb=/path/to/gmapdb
If you omit --with-gmapdb, it defaults to ${prefix}/share. If you
omit --prefix, it defaults to /usr/local. Note that on the command
line, it is "with-gmapdb" with a hyphen, but in a config.site file,
it is "with_gmapdb" with an underscore.
Note 2: If you want to keep your version of config.site or have
multiple versions, you can save the file to a different filename, and
then refer to it like this
./configure CONFIG_SITE=<config site file>
Note 3: GSNAP is designed for short reads of a limited length, and
uses a configure variable called MAX_READLENGTH (default 300) as a
guide to the maximum read length. You may set this variable by
providing it to configure like this
./configure MAX_READLENGTH=<length>
or by defining it in your config.site file (or in the file provided to
configure as the value of CONFIG_SITE). Or you may set the value of
MAX_READLENGTH as an environment variable before calling ./configure.
If you do not set MAX_READLENGTH, it will have the default value shown
when you run "./configure --help".
Note that MAX_READLENGTH applies only to GSNAP. GMAP, on the other
hand, can process queries up to 1 million bp.
Also, starting with version 2014-08-20, if your C compiler can
handle stack-based memory allocation using the alloca() function,
GSNAP ignores MAX_READLENGTH, and can handle reads longer than that
value.
Note 4: GSNAP can read from gzip-compressed FASTA or FASTQ input
files. This feature requires the zlib library to be present
(available from http://www.zlib.net). The configure program will
detect the availability of zlib automatically. However, to disable
this feature, you can add "--disable-zlib" to the ./configure command
or edit your config.site file to have the command "disable_zlib".
Note 5: GSNAP can read from bzip2-compressed FASTA or FASTQ input
files. This feature requires the bzlib library to be present. The
configure program will detect the availability of bzlib automatically.
However, to disable this feature, you can add "--disable-bzlib" to the
./configure command or edit your config.site file to have the command
"disable_bzlib".
2. Possible issues during compilation
======================================
Recent versions of GMAP and GSNAP use certain techniques to achieve
increased speed. One of these techniques is special SIMD
(single-instruction, multiple data) instructions that can perform some
computations in parallel. There are various levels of SIMD
instructions, and we use those from the SSE2 and SSE4.1 instruction
sets. Most computers built within the last 10 years should support
SSE2 at least. However, you may run into the following issues:
Compiler issue 1. If you compile the code successfully, but when you
run the program, you see something that says "Illegal instruction",
then you must be running GMAP or GSNAP on a different computer than
the one where you compiled it. You may be using a computer cluster
with a variety of different computer models. Each time you start GMAP
or GSNAP, the program tests to see if the computer can run the same
set of SSE2 or SSE4.1 instructions as were available when the programs
were compiled. (The programs also check for if the popcnt
instructions work, but popcnt is so widely implemented that they
generally do not cause any problems.)
In that case, you may need to compile your program for the lowest
common denominator by by providing --disable-avx, --disable-sse4.1, or
--disable-sse2 to ./configure as necessary. Alternatively, your
computer cluster may have the ability to detect the capabilities of
each computer when it receives a job. Then, you may want to create
different compiled versions of GMAP and GSNAP, and call the
appropriate binary for that particular job. You will have to work
with your system administrator if you want to accomplish this.
Compiler issue 2. The most recent versions of GSNAP (starting with
2013-10-01) build a suffix array to help with speed. If your computer
does not have sufficient RAM, it is possible that the gmap_build
step fails after printing something like this:
SACA_K called with n = 3095693984, K = 5, level 0
If this happens, you either need to find a computer with more RAM, or
you can build a genome without a suffix array, by providing
--no-sarray to gmap_build. The GSNAP program will still work with the
resulting genome index, but it won't achieve optimal speed. GMAP does
not use the suffix array at all. Also, large genomes of over 4
billion bp also will not create or use a suffix array.
3. Downloading a pre-built GMAP/GSNAP database
===============================================
A GMAP/GSNAP "database" is a set of genomic index files, representing
the genome in a hash table format. You can use the gmap_build program
to build your own database (as described below), but you can started
quickly by downloading a pre-built GMAP/GSNAP database from the same
place you obtained the GMAP program (see above for URL).
Place the database in the GMAPDB directory you specified in the
config.site file when you built the gmap program. You should include
a subdirectory for each GMAP database; for example, if you downloaded
a database called <genome>, your directory structure should look like
this
/path/to/gmapdb/<genome>/
/path/to/gmapdb/<genome>/<genome>.chromosome
/path/to/gmapdb/<genome>/<genome>.chromosome.iit
...
/path/to/gmapdb/<genome>/<genome>.version
Note that the GMAP database format has changed with the 2013-10-01
release to add a suffix array and other new file formats. However,
GMAP and GSNAP are both backwards and forwards compatible, meaning
that new source code will work with older genome indices and that old
source code will work with newer genome indices. Mixing up the two,
though, will result in slower running speed. To achieve optimal
speed, you should use both the latest source code and rebuild your
genome indices with the latest gmap_build program. You can tell if
your database has the most recent format if it has files of the type
<genome>.sarray, <genome>.ref12153bitpackptrs, and
<genome>.ref12153bitpackcomp.
Note that the hg19 database provided on the Web site lacks the suffix
array, although GMAP and GSNAP will work fine, albeit more slowly.
The suffix array for hg19 requires 15 GB of disk space. If you want
the suffix array, you will have to build it yourself, as discussed in
the next section.
4. Setting up to build a GMAP/GSNAP database (one chromosome per FASTA entry)
==============================================================================
You can also build your own genomic database, using the gmap_build
program provided with this package. (Note: the gmap_setup
method from older versions is no longer provided or supported.)
Previous versions limited the total sequence length in your database
to 2^32 = 4,294,967,296 (about 4 billion) bp. However, starting with
version 2013-04-01, a total "genome" may now contain up to 2^64 bp.
However, each individual chromosome is still limited to 2^32 bp. The
utility programs below will automatically recognize when a genome is
larger than 2^32 bp and build the index files accordingly. Below I
use the "genome" and "chromosome", but the input sequences can be
anything you wish to align to, including transcripts or small
fragments.
You will need to start with a set of FASTA files containing either
entire chromosomes or contigs that represent pieces of chromosomes.
For example, for the human genome, you can retrieve all of the FASTA
files under the ftp directory at
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/
If your FASTA entries each contain a single chromosome, and the
accession for each chromosome is the chromosome number/letter, you can
simply run this command
gmap_build -d <genome> [-k <kmer size>] <fasta_files...>
which will build and install the GMAP index files in your default
GMAPDB location.
You can see the full usage of gmap_build by doing "gmap_build --help",
but here are some useful flags. If your FASTA files are compressed
using gzip, you can add the flag "-g" to gmap_build. You can control
the k-mer size for the genomic index with the -k flag, which can range
from 12 to 15. The default value for -k is 15, but this requires your
machine to have 4 GB of RAM to build the indices. If you do not have
4 GB of RAM, then you will need to reduce the value of -k or find
another machine. Here are the RAM requirements for building various
indices:
k-mer of 12: 64 MB
k-mer of 13: 256 MB
k-mer of 14: 1 GB
k-mer of 15: 4 GB
These are the RAM requirements for building indices, but not to run
the GMAP/GSNAP programs once the indices are built, because the
genomic indices are compressed. For example, the genomic index for a
k-mer of 15 gives a gammaptrs file of 64 MB and an offsetscomp file of
about 350 MB, much smaller than the 4 GB that would otherwise be
required. Therefore, you may want to build your genomic index on a
computer with sufficient RAM, and distribute that index to be used by
computers with less RAM.
The amount of compression can be controlled using the -b or --basesize
parameter to gmap_build. By default, the value for k-mer size is 15,
and the value for basesize is 12. If you select a different value for
k-mer size, then basesize is made by default to be equal to that k-mer
size.
If you want to build your genomic databases with more than one k-mer
size, you can re-run gmap_build with different values of -k. This
will overwrite only the identical files from the previous runs. You
can then choose the k-mer size at run-time by using the -k flag for
either GMAP or GSNAP.
Finally, you can provide information to gmap_build that certain
chromosomes are circular, with the -c or --circular flag. The value
for these flags is a list of chromosomes, separated by commas. The
gmap_build program will then allow GSNAP and GMAP to align reads
across the ends of the chromosome. For example, the mitochondrial
genome in human beings is circular.
5. Setting up to build a GMAP/GSNAP database (more complex cases)
==================================================================
If you are indexing a genome, where each chromosome is contained in a
separate FASTA entry, or a set of contigs, where the contigs are
independent of each other, then you can just skip to section 6.
Otherwise, if you have more complicated needs, you may need to add
options to gmap_build.
Note that the term
<fasta_file>...
above indicates that multiple files can be listed. The files can be
in any order, and the contigs can be in any order within these files.
By default, the gmap_build process will sort the contigs and
chromosomes into their appropriate "chrom" order. For the human
genome, this order is 1, 2, ..., 10, 11, ..., 22, X, Y, M, followed by
all other chromosomes in numeric/alphabetical order. If you don't
want this sort, provide the "-s none" flag to gmap_build. Other sort
options besides "none" and "chrom" are "alpha" and "numeric-alpha".
You can type "gmap_build --help" to see the full set of options. We
discuss some specific situations below.
5a. Contigs are mapped to chromosomes
======================================
If your FASTA entries consist of contigs, each of which has a mapping
to a chromosomal region in the header, you may add the -C (or
--contigs-are-mapped) flag to gmap_build, like this
gmap_build -d <genome> -C <fasta_file>...
Then gmap_build will try to parse a chromosomal region from each
header. The program knows how to parse the following patterns:
chr=1:217281..257582 [may insert spaces around '=', or omit '=' character]
chr=1 [may insert spaces around '=', or omit '=' character]
chromosome 1 [NCBI .mfa format]
chromosome:NCBI35:22:1:49554710:1 [Ensembl format]
/chromosome=2 [Celera format]
/chromosome=2 /alignment=(88840247-88864134) /orientation=rev [Celera format]
chr1:217281..257582
chr1 [may insert spaces after 'chr']
If only the chromosome is specified, without coordinates, the program
will assign its own chromosomal coordinates by concatenating the
contigs within each chromosome. If gmap_build cannot figure out the
chromosome, it will assign it to chromosome "NA".
5b. Using an MD file
=====================
Another possibility is that your FASTA entries consist of contigs that
mapped to chromosomes, where the mapping information is in an external
file. Genomes from NCBI typically include an ".md" file (like
seq_contig.md) that specifies the chromosomal coordinates for each
contig. To use this information, provide the -M (or --mdfile) flag to
gmap_build like this
gmap_build -d <genome> -M <mdfile> <fasta_file>...
The program will then try to parse the mdfile (which often changes
formats) and verify with you which columns contain the contig names
and chromosomal coordinates.
5c. Compressed FASTA files or files requiring processing
=========================================================
If your genome files are compressed using gzip, you can simply add the
flag "-g" to gmap_build. But if your genome files require some other
type of processing, you may need to write a small script that pipes
the sequences in FASTA format to gmap_build. You can tell gmap_build
about your script with the -E (or --fasta-pipe) flag, like this
gmap_build -d <genome> -E 'gunzip -c chr*.gz'
gmap_build -d <genome> -E 'cat *.fa | ./add-chromosomal-info.pl'
You can think of the command as a Unix pipe for processing each FASTA
file before it is read by gmap_build.
6. Running GMAP
================
To see the full set of options, type "gmap --help". The following are
some common examples of usage. For more examples, see the document
available at http://www.gene.com/share/gmap/paper/demo-slides.pdf
For each of the examples below, we assume that you have installed a
genome database called <genome> in your GMAPDB directory. (If your
database is located elsewhere, you can specify the -D flag to gmap or
set the environment variable GMAPDB to point to that directory.)
* Mapping only: To map one or more cDNAs in a FASTA file onto a
genome, run GMAP as follows:
gmap -d <genome> <cdna_file>
* Mapping and alignment: If you want to map the cDNAs to a genome,
and show the full alignment, provide the -A flag:
gmap -d <genome> -A <cdna_file>
* Alignment only: To align one or more cDNAs in a FASTA file onto a
given genomic segment (also in a FASTA file), use the -g flag
instead of the -d flag:
gmap -g <genome_segment> -A <cdna_file>
* Batch mode: If you have a large number of cDNAs to run, and you have
sufficient RAM to run in batch mode, add the "-B 3", "-B 4", or "-B
5" option. Details for these options are provided by running "gmap
--help".
gmap -d <genome> -B 5 -A <cdna_file>
* Multithreaded mode: If your machine has several processors, you can
make batch mode run even faster by specifying multiple threads with
the -t flag:
gmap -d <genome> -B 5 -A -t <nthreads> <cdna_file>
Note that with multiple threads, the output results will appear in
random order, depending on which thread finishes its computation
first. If you wish your output to be in the same order as the
input cDNA file, add the '-O' (letter O, not the number 0) flag
to get ordered output.
Guidelines: The -t flag specifies the number of computational
threads. In addition, if your machine supports threads, GMAP also
uses one thread for reading the input query sequences, and one
thread for printing the output results. Therefore, the total number
of threads will be 2 plus the number you specify. The program will
work optimally if it uses one thread per available processor. If
you specify too many threads, you can cause your computer to thrash
and slow down. Note that other programs running on your computer
also need processors.
* Compressed output: If you want to store the alignment results in a
compressed format, use the -Z flag. You can uncompress the results
by using the gmap_uncompress.pl program:
gmap -d <genome> -Z <cdna_file> > x
cat x | gmap_uncompress
Note that gmap is written for small genomes (less than 2^32 bp in
total length). With large genomes, there is an equivalent program
called gmapl, which you should run instead of gmap. The gmapl program
is equivalent to gmap, and is based on the same source code, but is
compiled to use 64-bit index files instead of 32-bit files. The gmap
and gmapl programs will detect whether the genomes are the correct
size, and will exit if you try to run them on the wrong-sized genomes.
7. Building map files
======================
This package includes an implementation of interval index trees
(IITs), which permits efficient lookup of interval information. The
gmap program also allows you (with its -m flag) to look up pre-mapped
annotation information that overlaps your query cDNA sequence. These
interval index trees (or map files) are built using the iit_store
program included in this package. To build a map file, do the
following:
Step 1: Put your map information for a given genome into a map file
with the following FASTA-like format:
>label coords optional_tag
optional_annotation (which may be zero, one, or multiple lines)
For example, the label may be an EST accession, with the coords
representing its genomic position. Labels may be duplicated if
necessary.
The coords should be of the form
chr:position
chr:startposition..endposition
The term "chr:position" is equivalent to "chr:position..position". If
you want to indicate that the interval is on the minus strand or
reverse direction, then <endposition> may be less than
<startposition>.
Tags are very general and can be used for a variety of purposes. For
example, you could
Step 2: Run iit_store on this map file as follows
cat <mapfile> | iit_store -o <mapname>
The program will create a file called <mapname>.iit.
Now you can retrieve this information with iit_get
iit_get <mapname>.iit <coords>
where <coords> has the format "chr:position" or
"chr:startposition..endposition". The iit_get program has other
capabilities, including the ability to retrieve information by label,
like this:
iit_get <mapname>.iit <label>
More details can be found by running "iit_get --help".
If you plan to use this map information frequently, you may want to
place it with its corresponding genome for future use. In each
GMAP/GSNAP database, there is a subdirectory for storing map files,
like this
/path/to/gmapdb/<genome>/<genome>.maps/
(If you don't remember where your default gmap directory is, run "gmap
--version" to find it.) If you put your <mapname>.iit file into this
maps subdirectory, you can get additional functionality. First, you
can run the program get-genome, which is used mainly for getting
genome sequence, to get map information instead, like this
get-genome -d <genome> -m <mapname> <coords>
Second, you can use GMAP with the -m flag to retrieve map information
that corresponds to a given cDNA sequence like this
gmap -d <genome> -m <mapname> <cdna_file>
Finally, GMAP and the IIT utilities support the GFF3 format. GMAP can
generate its results in GFF3 format, and iit_store can parse GFF3
files using its -G and -l flags. More details about iit_store can be
found by doing "iit_store --help".
8. Running GSNAP
=================
GSNAP uses the same database as GMAP does, so you will need to process
the genome using gmap_build as explained above, if you haven't done
that already.
To see the full set of options for GSNAP, type "gsnap --help". A key
parameter you will need to set is the "-m" flag, which is the maximal
score you will allow per read (or each end of a paired-end read). The
score equals the number of mismatches, plus penalties for indels and
local or distant splicing, if any. If you do not set a value for
"-m", then GSNAP will pick a value, depending on the length of each
read, that will allow it to run fairly quickly.
For DNA-Seq, the automatic setting should be fine, unless you need to
accommodate penalty values for indels or splicing, or your reads are
of poor quality.
For RNA-Seq, in previous versions, we recommended a moderately high
value of -m, such as 10 or so, to handle alignments that cross an
intron-exon boundary. But now that GSNAP can find terminal alignments
and has GMAP integrated in its algorithm, it is better to select a
small value for -m, such as the default value or something small like
4 or 5 for a 75-bp read.
Input to GSNAP should be either in FASTQ or FASTA format. The FASTQ
input may include quality scores, which will then be included in SAM
output, if that output format is selected. For single-end reads, the
FASTQ file may be piped into GSNAP, or given as its command-line
argument, like this
cat <fastq_file> | gsnap -d <genome>
or
gsnap -d <genome> <fastq_file>
For paired-end reads, the two corresponding FASTQ files should be
given as command-line arguments in pairs, like this
gsnap -d <genome> <fastq_file_1> <fastq_file_2> [<fastq_file_3> <fastq_file_4>...]
A pipe cannot work since GSNAP needs to access both FASTQ files in
parallel. The reads in FASTQ files may have varying lengths, if
desired. Note that GSNAP can process multiple sets of paired-end
reads, by adding the files in pairs. If you want to provide multiple
single-end files, you can either use "cat" to concatenate them into
the stdin of gsnap, like this:
cat <fastq_file_1> [<fastq_file_2>...] | gsnap -d <genome>
or you can provide them all on the command line with the
--force-single-end flag, like this:
gsnap -d <genome> --force-single-end <fastq_file_1> [<fastq_file_2>...]
which will process each FASTQ file one at a time as single-end reads,
and not try to pair them up.
GSNAP also has the ability to deal with files compressed with gzip, if
the configure script at compile time can find a zlib library in your
system (see Note 3 in the section above about building and installing
GMAP and GSNAP). If so, and your files are gzipped, you can then read
in gzipped files directly like this
gsnap --gunzip -d <genome> <fastq.gz>, or
gsnap --gunzip -d <genome> --force-single-end <fastq1.gz> [<fastq2.gz>...]
for single-end reads, or
gsnap --gunzip -d <genome> <fastq_1.gz> <fastq_2.gz> [<fastq_3.gz> <fastq_4.gz>...]
for paired-end reads.
Likewise, GSNAP can handle files compressed with bzip2, if the
configure script at compile time can find a bzlib library in your
system (see Note 3 in the section above about building and installing
GMAP and GSNAP). If so, and your files are bzip2-compressed, you can
then read in those files directly like this
gsnap --bunzip2 -d <genome> <fastq.bz2>, or
gsnap --bunzip2 -d <genome> --force-single-end <fastq1.bz2> [<fastq2.bz2>...]
for single-end reads, or
gsnap --bunzip2 -d <genome> <fastq_1.bz2> <fastq_2.bz2> [<fastq_1.bz2> <fastq_2.bz2>...]
for paired-end reads.
For FASTA format, you should include one line per read (or end of a
paired-end read). The same FASTA file can have a mixture of
single-end and paired-end reads of varying lengths, if desired.
Single-end reads:
Each FASTA entry should contain one short read per line, like this
>Header information
AAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTA
Each short read can have a different length. However, the entire read
needs to be on a single line, and may not wrap around multiple lines.
If it extends to a second line, GSNAP will think that the read is
paired-end.
Paired-end reads:
Each FASTA entry should contain two short reads, one per line, like
this
>Header information
AAAACATTCTCCTCCGCATAAGCCTAGTAGATTA
GGCGTAGGTAGAAGTAGAGGTTAAGGCGCGTCAG
By default, the program assumes that the second end is in the reverse
complement direction compared with the first end. If they are in the
same direction, you may need to use the --circular-input (or -c) flag.
GSNAP and GMAP can also read an extended FASTA format that include
quality scores, which look like this
>Header information
AAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTA
+
<quality scores>
for single-end reads, or
<Header information
AAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTA
+
<quality scores>
for the second-end of a paired-end read. In addition, GSNAP can read
an extended FASTA format for paired-end reads, like this:
>Header information
AAAACATTCTCCTCCGCATAAGCCTAGTAGATTA
GGCGTAGGTAGAAGTAGAGGTTAAGGCGCGTCAG
+
<quality scores 1>
<quality scores 2>
This extended FASTA format is useful if paired-end information needs
to be piped into GSNAP via stdin.
As with gmap, gsnap is written for small genomes (less than 2^32 bp in
total length). With large genomes, there is an equivalent program
called gsnapl, which you should run instead of gsnap. The gsnapl
program is equivalent to gmap, and is based on the same source code,
but is compiled to use 64-bit index files instead of 32-bit files.
The gsnap and gsnapl programs will detect whether the genomes are the
correct size, and will exit if you try to run them on the wrong-sized
genomes.
9. SAM output format
=====================
GSNAP can generate SAM output format, by providing the "-A sam" flag
to GSNAP. In addition, GMAP can also print its alignments in SAM
output, using the "-f samse" or "-f sampe" options, for single-end or
paired-end data. The sampe option will generate SAM flags to indicate
whether the read is the first or second end of a pair, which requires
that you provide GMAP with an extended FASTA format having a ">" or
"<" character in the header to indicate that information. However,
the sampe option will change only the SAM flags, and not change the
underlying alignment algorithm. GMAP does not know how to find
concordance between paired-end reads like GSNAP does.
GSNAP provides some special SAM flags as follows:
XQ: A non-normalized mapping quality score
X2: The second best XQ score among all multimapping alignments for a
given read. If there is only a single alignment, this value is 0.
XO: Output type. GSNAP categorizes its alignments into output types,
as follows. Note that the --split-output option will create separate
output files for each output type. Alternatively, if you use
sam_sort, you should provide --split-output to that program instead
and achieve the same functionality. (The reason for this is that
there may be situations where GSNAP assigns different output types to
the first and second ends of the reads and sam_sort needs to see
alignments from both ends together.) In either case, the output types
have the following meanings and filename suffixes:
NM (nomapping) (filename suffix ".nomapping"): The entire read
(single-end or paired-end) could not be aligned. If the
--failed-input=FILENAME flag is specified, then these reads are also
printed in FASTQ or FASTA format (depending on the input format) in
the given file (plus a .1 or .2 ending for the first and second ends
of a paired-end read).
CU (concordant unique) (filename suffix ".concordant_uniq"): Both
ends of a paired-end read could be aligned concordantly to a single
position in the genome. For a definition of concordance, see the
section "Output types" below.
CM (concordant multiple) (filename suffix ".concordant_mult"): Both
ends of a paired-end read could be aligned concordantly, but to more
than one position in the genome.
CX (concordant multiple excess) (filename suffix
".concordant_mult_xs"): Multiple concordant alignments, but user
specified --quiet-if-excessive and the number of alignments exceeds
"-n" threshold. If the --failed-input option is given, these reads
are also printed in FASTA or FASTQ format in the given file.
CT (concordant translocation) (filename suffix
".concordant_transloc"): Both ends of a paired-end read could be
aligned concordantly, but one end requires a split alignment to a
distant location, such as another chromosome, or a different strand
on that chromosome, or a far distance on that strand. Note that
translocation alignments need to be printed on two separate SAM
lines.
CC (concordant circular) (filename suffix ".concordant_circular"):
Both ends of a paired-end read could be aligned concordantly, but
one or both ends require an alignment that goes around the origin of
a circular chromosome. Circular chromosomes are specified in the
gmap_build step by using the -c or --circular flag, as described
previously. Note that circular alignments need to be printed on two
separate SAM lines.
PI (paired unique inversion) (filename suffix ".paired_uniq_inv"):
Both ends of a paired-end read could be aligned uniquely, but in a
way that indicates that a genomic inversion has occurred between the
two ends.
PS (paired unique scramble) (filename suffix ".paired_uniq_scr"):
Both ends of a paired-end read could be aligned uniquely, but in a
way that indicates that the genomic order is scrambled. This
typically occurs because of tandem duplications.
PL (paired unique long) (filename suffix ".paired_uniq_long"): Both
ends of a paired-end read could be aligned uniquely, but in a way
that indicates that a large genomic deletion has occurred between
the two ends.
PC (paired unique circular) (filename suffix
".paired_uniq_circular"): Both ends of a paired-end read could be
aligned uniquely, but not concordantly, representing an inversion,
scramble, or deletion. In addition, one or both ends of the read
goes around the origin of a circular chromosome.
PM (paired multiple) (filename suffix ".paired_mult"): Both
ends of a paired-end read could be aligned near each other,
representing an inversion, scramble, or deletion, but there are
multiple places in the genome where an alignment is found.
PX (paired multiple excess) (filename suffix ".paired_mult_xs"):
Multiple paired alignments, but user specified --quiet-if-excessive
and the number of alignments exceeds the "-n" threshold. If the
--failed-input option is given, these reads are also printed in
FASTA or FASTQ format in the given file.
HU, HM, HT, HC (halfmapping unique, halfmapping multiple,
halfmapping translocation, and halfmapping circular, respectively)
(filename suffixes: ".halfmapping_uniq", ".halfmapping_mult",
".halfmapping_transloc", ".halfmapping_circular): Same as for the
concordant output types, except that only one end of the paired-end
read could be aligned, and the other end could not be aligned
anywhere in the genome.
HX (halfmapping multiple excess) (filename suffix
".halfmapping_mult_xs"): Multiple halfmapping alignments, but user
specified --quiet-if-excessive and the number of alignments exceeds
the "-n" threshold. If the --failed-input option is given, these
reads are also printed in FASTA or FASTQ format in the given file.
UU, UM, UT, UC (unpaired unique, unpaired multiple, unpaired
translocation, and unpaired circular, respectively) (filename
suffixes: ".unpaired_uniq", ".unpaired_mult", ".unpaired_transloc",
".unpaired_circular): Same as for the concordant output types,
except that the two ends could not be aligned concordantly or even
paired. These "unpaired" categories are also used for single-end
reads, since they lack a mate end that can allow for concordance,
pairing, or halfmapping.
UX (unpaired multiple excess) (filename suffix ".unpaired_mult_xs"):
Multiple unpaired alignments, but user specified
--quiet-if-excessive and the number of alignments on one end exceeds
the "-n" threshold. If the --failed-input option is given, these
reads are also printed in FASTA or FASTQ format in the given file.
XB: Prints the barcode extracted from the end of the read. Applies only
if --barcode-length is not 0.
XP: Prints the primer inferred from a paired-end read. Applies only
if the --adapter-strip flag is specified.
XH: Prints the part of the query sequence that was hard-clipped.
Sequence is printed in plus-genomic order and replaces the "H" part of
the CIGAR string.
XI: Prints the part of the quality string that was hard-clipped.
Sequence is printed in plus-genomic order and replaces the "H" part of
the CIGAR string.
XS: Prints the strand orientation (+ or -) for a splice. Appears only
if splicing is allowed (-N or -s flag provided), and only for reads
containing a splice. The value "+" means the expected GT-AG, GC-AG,
or AT-AC dinucleotide pair is on the plus strand of the genome, and
"-" means the dinucleotides are on the minus strand. If the
orientation is not obvious, because the dinucleotides do not match
GT-AG, GC-AG, AT-AC, or their complements, then the program applies a
probabilistic splice model to determine the orientation. If the
splice sites have low probability, then the program may not be able to
determine an orientation, and the result will be printed as XS:A:?.
To prevent this flag, which cannot be handled by such programs (such
as Cufflinks), use the --force-xs-dir flag. However, this flag will
merely change occurrences of XS:A:? arbitrarily to XS:A:+.
XA: Indicates an ambiguous splice. If GSNAP finds two or more
possible splices at a given position, it will try to resolve the
ambiguity if possible based on the other end of the paired-end read.
If the ambiguity cannot be resolved, GSNAP will not report any of the
splices, but will report a soft clip instead. The XA field indicates
which end or ends are ambiguous and the number of matches found on
each ambiguous end, based on the output XA:Z:i,j. If i or j is
greater than 0, that indicates that the lower or higher chromosomal
end is ambiguous, respectively. The value given indicates the number
of matches found in the ambiguous end. This number may be smaller
than the number of bases soft-clipped, due to mismatches.
XC: Indicates whether the alignment crosses over the origin of a
circular chromosome. If so, the string XC:A:+ is printed.
XT: Prints the intron dinucleotides and splice probabilities around a
distant splicing event (genomic deletion, inversion, scramble, or
translocation).
XW and XV: Printed only when SNP-tolerant alignment is enabled. XW
provides the number of mismatches against both the reference and
alternate alleles (or the "World" population). Therefore, these are
true mismatches. XV provides the number of positions that are
mismatches against the reference genome, but do match the alternate
genome. Therefore, these are known variant positions. The sum of XW
and XV provides the number of differences relative to the reference
genome, and with the exception of indels, should equal the value of
NM.
XG: Indicates which method within GSNAP generated the alignment. A:
suffix array method, B: GMAP alignment produced from suffix array, M:
GMAP alignment produced from GSNAP hash table method, T: terminal
alignment, O: merging of overlaps. Absence of XG flag indicates the
standard GSNAP hash table method. (Note: older versions of GSNAP used
"PG:", but some downstream software required all PG methods to be
listed in the header section, so we changed the field name to "XG:")
10. GSNAP output format
========================
By default, GSNAP prints its output in a FASTA-like format, which we
developed before we incorporated the SAM format. The default GSNAP
output has some advantages over SAM output, especially for debugging
purposes. However, we routinely use SAM output for our own pipeline,
and it has been subject to the most testing by ourselves and by outside
users.
Here is some output from GSNAP on a paired-end read:
>GGACTGCGCACCGAACGGCAGCGACTTCCCGTAGTAGCGGTGCTCCGCGAAGACCAGTAGAGCCCCCCGCTCGGCC 1 concordant ILLUMINA-A1CCE9_0004:1:1:1510:2090#0
GGACTGCGCACCGAACGGCAGCGACTTCCCGTAGTAGCGct----------------------------------- 1..39 +9:139128263..139128301 start:0..acceptor:0.99,dir:antisense,splice_dist:214,sub:0+0=0,label:NM_013379.DPP7.exon4/13 segs:2,align_score:2 pair_score:5,pair_length:112
,-------------------------------------acGTGCTCCGCGAAGACCAGTAGAGCCCCCCGCTCGGCC 40..76 +9:139128516..139128552 donor:0.96..end:0,dir:antisense,splice_dist:214,sub:0+0=0,label:NM_013379.DPP7.exon3/13
<CTTCGCCAACAACTCGGGCTTCGTCGCGGAGCTGGCGGCCGAGCGGGGGGCTCTACTGGTCTTCGCGGAGCACCGC 1 concordant ILLUMINA-A1CCE9_0004:1:1:1510:2090#0
CTTCGCCAACAACTCGGcCTTCGTCGCGGAGCTGGCGGCCGAGCGGGGGGCTCTACTGGTCTTCGCGGAGCACgtg 1..73 -9:139128588..139128516 start:0..end:3,sub:3+1=4 segs:1,align_score:3 pair_score:5,pair_length:112
Each end of a read gets its own block, with the first read starting
with ">" and the second read for paired-end reads starting with "<".
The block starts with a header line that has in column 1, the query
sequence in its original direction (and with lower-case preserved if
any); in column 2, the number of hits for that query and if the read
is paired-end, the relationship between the ends (as discussed in the
next paragraph); and in column 3, the accession number for the query.
11. Output types
=================
The two ends of a paired-end read can have the following
relationships: "concordant", "paired", or "unpaired". A paired-end
read is concordant if the ends align to the same chromosome, in the
expected relative orientations, and having an inferred insert length
greater than zero and within the "--pairmax" parameter. The inferred
insert length is the distance from the end of the first-end alignment
to the start of the second-end alignment, plus the read lengths of the
two ends. There may be more than one concordant alignment for a given
read, and if so, the alignments for each end are reported in
corresponding order.
If a concordant relationship cannot be found, then the program will
report any paired relationships it can find. A paired alignment
occurs when the two ends align to the same chromosome, but fail some
criterion for concordance. There are different subtypes of paired
alignments, depending on which criterion is violated. If the
orientations are opposite what is expected, the paired subtype is
"inversion". If they are in the expected orientation, but the
distance is greater than the "--pairmax" parameter, then the paired
subtype is "toolong". If they are in the expected orientation, but
the inferred insert length appears to be negative, then the paired
subtype is "scramble". In GSNAP output, a paired subtype is shown in
a label called "pairtype", which can have the values
"pairtype:inversion", "pairtype:toolong", and "pairtype:scramble".
Otherwise, if neither a concordant nor paired alignment can be found,
then the program will align each end separately, and report the
relationship as being "unpaired".
GSNAP can find translocation splices within a single end of a read,
but it tries to be conservative about reporting them. If there is any
alignment that does not involve such a translocation, then it will not
report the translocation. It therefore reports translocation splices
only when no other alignment is found within the concordant, paired,
or unpaired categories. Therefore, such results are listed in the
header as having "(transloc)" appear after the "concordant", "paired",
or "unpaired" result type.
After the query line, each of the genomic hits is shown, up to the
'-n' parameter. If too many hits were found (more than the '-n'
parameter), the behavior depends on whether the "--quiet-if-excessive"
flag is given to GSNAP. If not, then the first n hits will be printed
and the rest will not be printed. If the "--quiet-if-excessive" flag
is given to GSNAP, then no hits will be printed if the number exceeds
n.
Each of the genomic hits contains one or more alignment segments,
which is a region of continuous one-to-one correspondence (matches or
mismatches) between the query and the genome. Multiple segments occur
when the alignment contains an insertion, deletion, or splice. The
first segment is marked by a space (" ") at the beginning of the line,
while the second and following segments are marked by a comma (",") at
the beginning of the line. (In the current implementation of GSNAP
that allows only a single indel or splice, the number of segments is
at most two.)
The segments contain information in tab-delimited columns as follows:
Column 1: Genomic sequence with matches in capital letters, mismatches
in lower-case letters, and regions outside the segment with dashes.
For deletions in the query, the deleted genomic sequence is also
included in lower case. For spliced reads, the two dinucleotides at
the intron ends are included in lower case.
Column 2: Range of query sequence aligned in the segment. Coordinates
are inclusive, with the first nucleotide considered to be position 1.
Column 3: Range of genomic segment aligned, again with inclusive
coordinates, with the first nucleotide in each chromosome considered
to be position 1. Plus and minus strands are marked with a "+" or "-"
sign.
Column 4: Segment information, delimited by commas. The first item
reports on the ends of the segment, which can be of type "start",
"end", "ins", "del", "donor", "acceptor", or "term". After "start"
and "end", we report the number of nucleotides clipped or trimmed from
the segment. In our example above, "end:3" means that 3 nucleotides
should be trimmed from the end. Trimming finds a local maximum of
matches to mismatches from the end and is computed only if the "-T"
flag is specified, and the value for "-T" limits the amount of
trimming allowed. After "ins" and "del", we report the number of
nucleotides that were inserted or deleted in the query relative to the
genome. After "donor" or "acceptor", we report the probability of the
splice site, based on the MaxEnt model. The "term" label indicate a
terminal segment, where the entire read could not be aligned, but more
than half of the read could be aligned from either end.
Each segment will also show after the "sub" tag, he number of
mismatches in that segment including the part that is trimmed, if any.
If SNP-tolerant alignment is chosen, then the number of SNPs is also
shown (see details below under SNP-tolerant alignment). Other
information may also be included with the segment information, such as
the orientation and distance of the splice or known splice labels, if
appropriate flags and information are given to GSNAP. Splices are
marked with a splice_type, which can be "consistent", "inversion",
"scramble", or "translocation". A "translocation" splice includes
splices on the same chromosome where the splice distance exceeds the
parameter for localsplicedist.
Column 5: Alignment or hit information, delimited by commas. For the
first segment in a hit (the one starting with a space), this column
provides the number of segments (denoted by "segs:") and the score of
the alignment (denoted by "align_score:").