-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathfasta20.doc
1130 lines (895 loc) · 47.8 KB
/
fasta20.doc
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
COPYRIGHT NOTICE
Copyright 1988, 1991, 1992, 1994, 1995, 1996 by William R.
Pearson and the University of Virginia. All rights reserved. The
FASTA program and documentation may not be sold or incorporated
into a commercial product, in whole or in part, without written
consent of William R. Pearson and the University of Virginia.
For further information regarding permission for use or
reproduction, please contact: David Hudson, Assistant Provost for
Research, University of Virginia, P.O. Box 9025, Charlottesville,
VA 22906-9025, (804) 924-6853
The FASTA program package
Introduction
This documentation describes the version 2.0x of the FASTA
program package (see W. R. Pearson and D. J. Lipman (1988),
"Improved Tools for Biological Sequence Analysis", PNAS 85:2444-
2448, and W. R. Pearson (1990) "Rapid and Sensitive Sequence
Comparison with FASTP and FASTA" Methods in Enzymology 183:63-
98). Version 2.0 modifies version 1.8 to include explicit
statistical estimates for similarity scores based on the extreme
value distribution. In addition, FASTA protein alignments now
use the Smith-Waterman algorithm with no limitation on gap size.
FASTA and SSEARCH now use the BLOSUM50 matrix by default, with
options to change gap penalties on the command line. Version 1.7
replaces rdf2 and rss with prdf and prss, which use the extreme-
value distribution to calculate accurate probability estimates.
Although there are a large number of programs in this package,
they belong to four groups:
Library search programs: FASTA, FASTX, TFASTA, TFASTX, SSEARCH
Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN, FLALIGN
Statistical significance: PRDF, RELATE, PRSS, RANDSEQ
Global alignment: ALIGN
In addition, I have included several programs for protein
sequence analysis, including a Kyte-Doolittle hydropathicity
plotting program (GREASE, TGREASE), and a secondary structure
prediction package (GARNIER).
The FASTA sequence comparison programs on this disk are
improved versions of the FASTP program, originally described in
Science (Lipman and Pearson, (1985) Science 227:1435-1441). We
have made several improvements. First, the library search
programs use a more sensitive method for the initial comparison
of two sequences which allows the scores of several similar
regions to be combined. As a result, the results of a library
search are now given with three scores, initn (the new initial
score which may include several similar regions), init1 (the old
fastp initial score from the best initial region), and opt (the
old fastp optimized score allowing gaps in a 32 residue wide
band).
These programs have also been modified to become "universal"
(hence FAST-A, for FASTA-All, as opposed to FAST-P (protein) or
FAST-N (nucleotides)); by changing the environment variable
SMATRIX, the programs can be used to search protein sequences,
DNA sequences, or whatever you like. By default, FASTA, LFASTA,
and the PRDF programs automatically recognize protein and DNA
sequences. Sequences are first read as amino acids, and then
converted to nucleotides if the sequence is greater than 85%
A,C,G,T (the '-n' option can be used to indicate DNA sequences).
TFASTA compares protein sequences to a translated DNA sequence.
Alternative scoring matrices can also be used. In addition to
the BLOSUM50 matrix for proteins, the PAM250 matrix or matrices
based on simple identities or the genetic code can also be used
for sequence comparisons or evaluation of significance. Several
different protein sequence matrices have been included;
instructions for constructing your own scoring matrix are
included in the file FORMAT.DOC.
The remainder of this document is divided into three sections:
(1) a brief history of the changes to the FASTA package; (2) A
guide to installing the programs and databases; (3) A guide to
using the FASTA programs. The programs are very easy to use, so
if you are using them on a machine that is administered by
someone else, you may want to skip to section (3) to learn how to
use the programs, and then read section (1) to look at some of
the more recent changes. If you are installing the programs on
your own machine, you will need to read section (2) carefully.
1. Revision History
1.1. Changes with version 2.0u
Version 2.0u provides several major improvements over
previous versions of FASTA (and SSEARCH). The most important is
the incorporation of explicit statistical estimates and
appropriate normalization of similarity scores. This improvement
is discussed in more detail below in the section entitled
Statistical Significance. In addition, all of the protein
comparison programs now use the BLOSUM50 matrix, with gap
penalties of -12, -2, by default. BLOSUM50 performs
significantly better than the older PAM250 matrix. PAM250 can
still be used with the command line option: -s 250. (DNA
sequence comparisons use a more stringent gap penalty of -16, -4,
which produces excellent statistical estimates when optimized
scores are used. TFASTA uses -16, -4 as well.)
The quality of the fit of the extreme value distribution to
the actual distribution of similarity scores is summarized with
the Kolmogorov-Smirnov statistic. The acceptance limits for this
statistic can be found in many statistics books. In general,
values <0.10 (N=30) indicate excellent agreement between the
actual and theoretical distributions. If this statistic is >
0.2, consider using a higher (more stringent) gap penalty, e.g.
-16, -4 rather than -12, -2. The default scoring matrix for DNA
has been changed to score +5 for an identity and -4 for a
mismatch. These are the same scores used by BLASTN.
With explicit expectation calculations, the program now
shows all scores and alignments with expectations less than 10.0
(with optimized scores, 2.0 without optimization) when the "-Q"
(quiet) mode is used. The expectation threshold can be changed
with the "-E" option.
Finally, the algorithm used to produce the final alignments
of protein sequences is now a full Smith-Waterman, with unlimited
gaps. (The older band-limited alignments are used for DNA
sequences and TFASTA by default, because Smith-Waterman
alignments are very slow for long sequences.) Both the optimized
and Smith-Waterman scores are reported; if the Smith-Waterman
score is higher, then additional gaps allowed a better alignment
and similarity score to be calculated.
FASTA searches now optimize similarity scores by default
(this slows searches about 2-fold (worst case) for ktup=2). Thus,
the meaning of the "-o" option has been reversed; "-o" now turns
off optimization and reports results sorted by "initn" scores.
Optimization significantly improves the sensitivity of FASTA, so
that it almost matches Smith-Waterman. With version 2.0, the
default band width used for optimized calculations can be varied
with the "-y" option. For proteins with ktup=2, a width of 16
(-y 16) is used; 16 is also used for DNA sequences. For proteins
and ktup=1, a width of 32 is used. Searches that disable
optimization with the "-o" option will work fine for sequences
that share 25% or more identity in general, but to detect
evolutionary relationships with 20% - 25% identity, the more
sensitive default optimization is often required. Optimization
is required for accurate statistical estimates with either
protein or DNA sequences.
The FASTA package now includes FASTX, a program that
compares a DNA sequence to a protein sequence database by
translating the DNA sequence in three frames (the reverse frames
are selected with the -i option) and aligning the three-frame
translation with the sequences in the protein database.
Alignment scores allow frameshifts so that a cDNA or EST sequence
with insertion/deletion errors can be aligned with its homologues
from beginning to end.
With release 20u6, there is also a TFASTX program, which is
a replacement for TFASTA. TFASTA treats each of the six reading
frames of a DNA library sequence as a different sequence; TFASTX
compares a protein sequence against only two sequences from each
DNA sequence - the forward and reverse orientation. For a given
orientation, TFASTX calculates a similarity score for alignments
that allow frameshifts, thus considering all possible reading
frames.
Another new program is included - randseq - which will
produce a randomly shuffled (uniform or local shuffle) from an
input sequence. This randomly shuffled sequence can be used to
evaluate the statistical estimates produced by FASTA, SSEARCH, or
BLAST.
1.2. Changes with version 1.7
Version 1.7 has been released to provide the PRDF and PRSS
programs for shuffling sequences and estimating accurately the
probabilities of the unshuffled-sequence scores.
PRDF a version of RDF2 that uses calculates the probability
of a similarity score more accurately by using a fit to
an extreme value distribution. Code to fit the extreme
value distribution parameters and the impetus to update
RDF2 was provided by Phil Green, U. of Washington.
PRSS a version of PRDF that uses a rigorous Smith-Waterman
calculation to score similarities
1.3. Changes with version 1.6
FASTA version 1.6 uses a new method for calculating optimal
scores in a band (the optimization or last step in the FASTA
algorithm). In addition, it uses a linear-space method for
calculating the actual alignments. FASTA v1.6 package includes
several new programs:
SSEARCH a program to search a sequence database using the
rigorous Smith-Waterman algorithm (this program is
about 100-fold slower than FASTA with ktup=2 (for
proteins).
LALIGN A rigorous local sequence alignment program that will
display the N-best local alignments (N=10 by default).
PLALIGN a version of lalign that plots the local alignments to
a postscript file.
FLALIGN a version of lalign that plots the local alignments to
a GCG Figure file.
The LALIGN/PLALIGN/FLALIGN programs incorporate the "sim"
algorithm described by Huang and Miller (1991) Adv. Appl. Math.
12:337-357. The SSEARCH and PRSS programs incorporate algorithms
described by Huang, Hardison, and Miller (1990) CABIOS 6:373-381.
LFASTA and PLFASTA now calculate a different number of local
similarities; they now behave more like LALIGN/PLALIGN. Since
local alignments of identical sequences produce "mirror-image"
alignments, lalign and lfasta consider only one-half of the
potential alignments between sequences from identical file names.
Thus
lfasta mchu.aa mchu.aa
Displays only two alignments, with earlier versions of the
program, it would have displayed five, including the identity
alignment. PLFASTA does display five alignments; when two
identical filenames are given, it draws the identity alignment,
calculates the two unique local alignments, draws them, and draws
their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the
filenames, rather than the actual sequences, to determine whether
sequences are identical; you can "trick" the programs into
behaving the old way by putting the same sequence in two
different files.
1.4. Changes with version 1.5
FASTA version 1.5 includes a number of substantial revisions
to improve the performance and sensitivity of the program. It is
now possible to tell the program to optimize all of the initn
scores greater than a threshold. The threshold is set at the
same value as the old FASTA cutoff score. Alternatively, you can
tell FASTA to sort the results by the init1, rather than the
initn, score by using the -1 option. FASTA -1 ... will report
the results the way the older FASTP program did.
A new method has been provided for selecting libraries. In
the past, one could enter the name of a sequence file to be
searched or a single letter that would specify a library from the
list included in the $FASTLIBS file. Now, you can specify a set
of library files with a string of letters preceded by a '%'.
Thus, if the FASTLIBS file has the lines:
Genbank 70 primates$1P/seqlib/gbpri.seq 1
Genbank 70 rodents$1R/seqlib/gbrod.seq 1
Genbank 70 other mammals$1M/seqlib/gbmam.seq 1
Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1
Then the string: "%PRMB" would tell FASTA to search the four
libraries listed above. The %PRMB string can be entered either
on the command line or when the program asks for a filename or
library letter.
FASTA1.5 also provides additional flexibility for specifying
the number of results and alignments to be displayed with the -Q
(quiet) option. The -b number option allows you to specify the
number of sequence scores to show when the search is finished.
Thus
FASTA -b 100 ...
tells the program to display the top 100 sequence scores. In the
past, if you displayed 100 scores (in -Q mode), you would also
have store 100 alignments. The -d option allows you to limit the
number of alignments shown. FASTA -b 100 -d 20 would show 100
scores and 20 alignments.
Finally, FASTA can provide a complete list of all of the
sequences and scores calculated to a file with the -r (results)
option. FASTA -r results.out ... creates a file with a list of
scores for every sequence in the library. The list is not
sorted, and only includes those scores calculated during the
initial scan of the library.
2. Installing the FASTA package
2.1. Installing the programs
2.1.1. Unix version
The FASTA distribution comes with several makefile's that
can be used to compile the FASTA programs. Over the years, as
ATT Unix System 5 and BSD unix have converged, these files have
become very similar. To begin with, I recommend using the
standard Makefile. There are two values in the makefile that
should be checked against the values used on your system: the HZ
value, which is the frequency in ticks per second used by the
times() system call, this value can usually be found by running:
grep HZ /usr/include/sys/*
and the functions available to return random numbers. If you
have a rand48() function that returns a 32-bit random number, use
it and use the lines:
NRAND=nrand48
RANFLG= -DRAND32
If not, you will need to use the rand() function call and
determine whether it returns a 16-bit or a 32-bit value. These
functions are used by PRDF and PRSS. If you have problems
compiling the programs, you may want to examine the makefile.unx
and makefile.sun files, to look for differences. I have tried to
use very standard unix functions in these programs, and they have
been successfully compiled, with very small changes to the
Makefile, on Sun's (Sun OS 4.1), IBM RS/6000's (AIX), and MIPS
machines (under the BSD environment).
2.1.2. IBM-PC/DOS version
For the IBM-PC/DOS version, the FASTA source code disk
contains the complete source code to all of the programs on the
other disks. The programs were compiled with Borland's Turbo
'C++', using Borland's MAKE utility. The graphics programs
(PLFASTA, TGREASE) use the graphics device drivers supplied with
the Turbo 'C' V2.0 package. Also included are the documentation
files PROGRAMS.DOC and FORMAT.DOC. You do not need any of the
files the source code disk to run the programs. The files on
this disk are identical to the UNIX and VMS versions that run on
larger machines. Also included is the code to compile
ALIGN0.EXE. ALIGN0 is the same as ALIGN, but does not penalize
for end-gaps.
If you have the DOS or Macintosh version of the FASTA
package, to install the programs you should:
(1) Make a new directory (folder) for the FASTA programs.
This need not be the same as the directory for your
sequence databases.
(2) Copy the files from the FASTA source disk to the new
directory.
(3) (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your
PATH command to include the FASTA directory and (b) add
the line:
set FASTLIBS=c:\yourfastadirectory\fastgbs
On the Macintosh, you may need to edit the "environment"
file and change the line that reads:
FASTLIBS=fastgbs
to indicate the full directory path for the fastgbs file,
for example:
FASTLIBS=Q105:FASTA:fastgbs
(4) Finally, you will need to edit the fastgbs file. This is
usually the most confusing part of the installation. An
example of this file is shown below; to customize this
file for your machine, you will need to change the file
names from those provided in the fastgbs file to ones that
reflect the directory names and file names you use on your
machine. This is explained in more detail below. In
addition, some entries in the fastgbs file refer to other
files of file names. These files of file names (as
opposed to actual database files) may also need to be
edited.
2.2. Installing the libraries
2.2.1. The NBRF protein sequence library
The FASTA program package does not include any protein or
DNA sequence libraries. You can obtain the PIR protein sequence
database from:
National Biomedical Research Foundation
Georgetown University Medical Center
3900 Reservoir Rd, N.W.
Washington, D.C. 20007
In addition, this database is available via anonymous ftp from
the host "ftp.bchs.uh.edu". It is available in two formats, VMS
and CODATA format. The "VMS" format (library type 5 below) can
be searched much faster, can be easily reformatted for use by the
"BLAST" rapid searching program, and is compatible with the
Genetics Computer Group package of programs. The CODATA format
is used by the EUGENE/MBIR computing package from Baylor (library
type 2).
2.2.2. The GENBANK DNA sequence library
FASTA, and TFASTA search sequences from the GENBANK
"flatfile" (not ASN.1) DNA sequence library in the flat-file
format distributed by the National Center for Biotechnology
Information and the PIR format used by EBI/EMBL. CD-ROMs can be
obtained from:
Genbank
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
8600 Rockville Pike
Bethesda, MD 20894
The GenBank DNA sequence library is also available via
anonymous FTP from ncbi.nlm.nih.gov.
2.2.3. The EBI/EMBL CD-ROM libraries
The European Bioinformatics Institute (EBI) is now
distributing the EMBL CD-ROM that contains both the complete EMBL
DNA sequence database (which should be essentially identical to
the GenBank DNA sequence database) and the SWISS-PROT protein
sequence database. SWISS-PROT is derived from the NBRF Protein
sequence database with additions from the EBI/EMBL DNA sequence
database. This CD-ROM is a "best-buy," since it provides both
DNA and protein sequence libraries. It is available from:
European Bioinformatics Institute
Hinxton Genome Campus, Hinxton Hall
Hinxton, Cambridge CB10 1RQ,
United Kingdom
Tel: +44 1223 4944
Fax: +44 1223 494468
Email: DATALIB@ebi.ac.uk
In addition, the SWISS-PROT protein sequence database is
available via anonymous FTP from ncbi.nlm.nih.gov.
2.3. Finding the libraries: FASTLIBS
FASTA and TFASTA use the environment variable FASTLIBS to
find the protein and DNA sequence libraries. The FASTLIBS
variable contains the name of a file that has the actual
filenames of the libraries. The FASTGBS file on is an example of
a file that can be referred to by FASTLIBS. To use the FASTGBS
file, type:
setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
or
export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)
Then edit the FASTGBS file to indicate where the protein and DNA
sequence libraries can be found. If you have a hard disk and
your protein sequence library is kept in the file
/usr/lib/aabank.lib and your Genbank DNA sequence library is kept
in the directory: /usr/lib/genbank, then fastgbs might contain:
NBRF Protein$0P/usr/lib/seq/aabank.lib 0
SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
GB Primate$1P@/usr/lib/genbank/gpri.nam
GB Rodent$1R@/usr/lib/genbank/grod.nam
GB Mammal$1M@/usr/lib/genbank/gmammal.nam
^ 1 ^^^^ 4 ^ ^
23 (5)
The first line of this file says that there is a copy of the NBRF
protein sequence database (which is a protein database) that can
be selected by typing "P" on the command line or when the
database menu is presented in the file /usr/lib/seq/aabank.lib.
Note that there are 4 or 5 fields in the lines in fastgbs.
The first field is the description of the library which will be
displayed by FASTA; it ends with a '$'. The second field (1
character), is a 0 if the library is a protein library and 1 if
it is a DNA library. The third field (1 character) is the
character to be typed to select the library.
The fourth field is the name of the library file. In the
example above, the /usr/lib/seq/aabank.lib file contains the
entire protein sequence library. However the DNA library file
names are preceded by a '@', because these files (gpri.nam,
grod.nam, gmammal.nam) do not contain the sequences; instead they
contain the names of the files which contain the sequences. This
is done because the GENBANK DNA database is broken down in to a
large number of smaller files. In order to search the entire
primate database, you must search more than a dozen files.
In addition, an optional fifth field can be used to specify
the format of the library file. Alternatively, you can specify
the library format in a file of file names (a file preceded by an
'@'). This field must be separated from the file name by a space
character (' ') from the filename. In the example above, the
aabank.lib file is in Pearson/FASTA format, while the swiss.seq
file is in PIR/VMS format (from the EMBL CD-ROM). Currently,
FASTA can read the following formats:
0 Pearson/FASTA (>SEQID - comment/sequence)
1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
2 NBRF CODATA (ENTRY/SEQUENCE)
3 EMBL/SWISS-PROT (ID/DE/SQ)
4 Intelligenetics (;comment/SEQID/sequence)
5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
6 GCG (version 8.0) Unix Protein and DNA (compressed)
11 NCBI Blast1.3.2 format (unix only)
In particular, this version will work with the EMBL and PIR VMS
formats that are distributed on the EMBL CD-ROM. The latter
format (PIR VMS) is much faster to search than EMBL format. This
release also works with the protein and DNA database formats
created for the BLASTP and BLASTN programs by SETDB and PRESSDB
and with the new NCBI search format. If a library format is not
specified, for example, because you are just comparing two
sequences, Pearson/FASTA (format 0) is used by default. To
change this default, you may set the LIBTYPE environment variable
to a number. For example,
setenv LIBTYPE 1
would cause the program to use the GenBank LOCUS format by
default for libraries (or the second sequence file), but the
Pearson/FASTA format would still be used for the query sequence.
You can specify a group of library files by putting a '@'
symbol before a file that contains a list of file names to be
searched. For example, if @gmam.nam is in the fastgbs file, the
file "gmam.nam" might contain the lines:
</usr/lib/genbank
gbpri.seq 1
gbrod.seq 1
gbmam.seq 1
In this case, the line beginning with a '<' indicates the
directory the files will be found in. The remaining lines name
the actual sequence files. So the first sequence file to be
searched would be:
/usr/lib/genbank/gbpri.seq
The notation "<PIRNAQ:" might be used under the VAX/VMS operating
system. Under UNIX, the trailing '/' is left off, so the library
directory might be written as "</usr/seqlib".
With version 1.4 of the FASTA package, the FASTA and TFASTA
programs can search a library composed of different files in
different sequence formats. For example, you may wish to search
the Genbank files (in GenBank flat file format) and the EMBL DNA
sequence database on CD-ROM. To do this, you simply list the
names and filetypes of the files to be searched in a file of
filenames. For example, to search the mammalian portion of
Genbank, the unannotated portion of Genbank, and the unannotated
portion of the EMBL library, you could use the file:
</usr/lib/DNA
gbpri.seq 1
# (this '#' causes the program to display the size of the library)
gbrod.seq 1
gbmam.seq 1
gbuna.seq 1
unanno.seq 5
#
You do not need to include library format numbers if you
only use the Pearson/FASTA version of the PIR protein se-
quence library. If no library type is specified, the
program assumes that type 0 is being used (unless you
have set LIBTYPE).
Support for the old compressed GenBank files, which have not been
distributed for more than four years, has been removed from
programs in the FASTA package.
Test the setup by running FASTA. Enter the sequence file
'MUSPLFM.AA' when the program requests it (this file is included
with the programs). The program should then ask you to select a
protein sequence library. Alternatively, if you run the TFASTA
program and use the MUSPLFM.AA query sequence, the program should
show you a selection of DNA sequence libraries. Once the fastgbs
file has been set up correctly, you can set FASTLIBS=fastgbs in
your AUTOEXEC.BAT file, and you will not need to remember where
the libraries are kept or how they are named.
FASTA and TFASTA must open a large number of files when
searching and reporting the results of a GENBANK floppy disk
format library search. You may have problems with the large
number of files under DOS on IBM-PC's (Unix and VMS users will
not have these problems). If you are going to search the GENBANK
floppy disk format DNA sequence library under DOS, you should add
the line:
FILES=16
to your CONFIG.SYS file. (Typically this is already done for
programs like Windows or WordPerfect.)
3. Using the FASTA Package
3.1. Overview
The FASTA sequence comparison programs all require similar
information, the name of a query sequence file, a library file,
and the ktup parameter. All of the programs can accept arguments
on the command line, or they will prompt for the file names and
ktup value.
To use FASTA, simply type:
FASTA
and you will be prompted for :
the name of the test sequence file
the name of the library file
and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
ktup of 2 is about 5 times faster than ktup = 1.
For a 200 aa sequence against a 10,000,000 aa
library, the program takes about 30 min with
ktup = 2, 150 min with ktup = 1, on a 12 Mhz 286
IBM-PC.
The program can also be run by typing
FASTA test.aa /lib/bigfile.lib ktup (1 or 2)
Included with the package are the test files, MUSPLFM.AA,
LCBO.AA, MCHU.AA and BOVPRL.SEQ. To check to make certain that
everything is working, you can try:
fasta musplfm.aa lcbo.aa
and
tfasta musplfm.aa bovprl.seq
To test the local similarity programs LFASTA and PLFASTA, try:
lfasta mchu.aa mchu.aa
and
plfasta mchu.aa mchu.aa
MCHU (calmodulin) has four duplicated calcium binding sites that
are clearly detected by LFASTA. For a more complicated example,
try MWRTC1.aa, myosin heavy chain.
3.2. Sequence files
The FASTA programs know about three kinds of sequence files
(four under VMS): (1) plain sequence files that can only be used
as query sequences or for LFASTA, PRDF, and ALIGN. (2) Standard
library files. These are the same as plain sequence files, each
sequence is preceded by a comment line with a '>' in the first
column. (3) distributed sequence libraries (this is a broad class
that includes the NBRF/PIR VMS and blocked ascii formats, Genbank
flat-file format, EMBL flat-file format, and Intelligenetics
format. All of the files that you create should be of type (1)
or (2). Type (2) files (ones with a be used as query or library
sequence files by all of the programs.
I have included several sample test files, *.AA. The first
line may begin with a '>' or ';' followed by a comment. The
text after ';' in other lines will be ignored. Spaces and
tabs (and anything else that is not an amino-acid code) are
ignored.
Library files should have the form:
>Sequence name and identifier
A F A S Y T .... actual sequence.
F S S .... second line of sequence.
>Next sequence name and identifier
This is often referred to as "FASTA" or "Pearson" format. You
can build your own library by concatenating several sequence
files. Just be sure that each sequence is preceded by a line
beginning with a '>' with a sequence name.
The test file should not have lines longer than 120
characters, and sequences entered with word processors should use
a document mode, with normal carriage returns at the end of
lines.
Program Summary
3.3. Sequence search programs
FASTA universal sequence comparison. Defaults to comparing
protein sequences; if the sequences are > 85% A+C+G+T
or the -n option is used, a DNA sequence is assumed.
FASTX Search a protein sequence library using amino acid
sequence comparison to the forward three frames of a
translated DNA query sequence. (The reverse frames are
specified with the -i option.) Alignment scores allow
frameshifts; the final alignment uses a Smith-Waterman
type alignment routine (no limit on gaps) that allows
frameshifts.
TFASTA Search DNA library for a protein sequence by
translating the DNA sequence to protein in all six
frames (three forward frames with the -3 command line
option). TFASTA with ktup=2 is about as fast as a DNA
FASTA with ktup=4, and is substantially more sensitive.
(also reads the GENBANK library)
TFASTX Search DNA library for a protein sequence by
translating the DNA sequence to protein in all six
frames (three forward frames with the -3 command line
option) calculating similarity scores that allow
frameshifts. TFASTX produces an optimal Smith-Waterman
alignment of the query and translated-library sequence.
SSEARCH Universal sequence comparison using the Smith-Waterman
algorithm ( T. F. Smith and M. S. Waterman (1981) J.
Mol. Biol. 147:195-197). This program uses code
developed by Huang and Miller (X. Huang, R. C.
Hardison, W. Miller (1990) CABIOS 6:373-381) for
calculating the local similarity score and code from
the ALIGN program (see below) for calculating the local
alignment. SSEARCH is about 50-times slower than FASTA
with ktup=2 (for proteins).
ALIGN optimal global alignment of two sequences with no
short-cuts. This program is a slightly modified
version of one taken from E. Myers and W. Miller. The
algorithm is described in E. Myers and W. Miller,
"Optimal Alignments in Linear Space" (CABIOS (1988)
4:11-17).
3.4. Local similarity programs
LFASTA local similarity searches showing local alignments.
The algorithm used to calculate the local alignment in
a band has been improved (Chao, Pearson, and Miller,
submitted).
PLFASTA local similarity searches with plot output (on the IBM,
this program requires that the environment variable
BGIDIR be set).
PCLFASTA (unix only) local similarity searches with plot output
using pic commands.
LALIGN Calculates the N-best local alignments using a rigorous
algorithm. (N=10 by default.) The algorithm was
developed by Huang and Miller (X. Huang and W. Miller
(1991) Adv. Appl. Math. 12:337-357), which is a
linear-space version of an algorithm described by M. S.
Waterman and M. Eggert (J. Mol. Biol. 197:723-728).
Like SSEARCH, LALIGN is rigorous, but also very slow.
PLALIGN A version of LALIGN that plots its output to a
postscript file.
3.5. Statistical Significance
With version 2.0 of the FASTA program distribution, FASTA,
TFASTA, and SSEARCH now provide estimates of statistical
significance for library searches. Work by Altschul, Arratia,
Karlin, Mott, Waterman, and others (see Altschul et al. (1994)
Nature Genetics 6:119 for an excellent review) suggests that
local sequence similarity scores follow the extreme value
distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u)) where
u = ln(Kmn)/lambda and m,m are the lengths of the query and
library sequence. This formula can be rewritten as: 1 - exp(-Kmn
exp(-lambda x), which shows that the average score for an
unrelated library sequence increases with the logarithm of the
length of the library sequence. FASTA and SSEARCH use simple
linear regression against the the log of the library sequence
length to calculate a normalized "z-score" with mean 50,
regardless of library sequence length, and variance 10. These
z-scores can then be used with the extreme value distribution and
the poisson distribution (to account for the fact that each
library sequence comparison is an independent test) to calculate
the number of library sequences to obtain a score greater than or
equal to the score obtained in the search. The original idea and
routines to do the linear regression on library sequence length
were provided Phil Green, U. Washington. This version of FASTA
and SSEARCH uses a slightly different strategy for fitting the
data than those originally provided by Dr. Green.
The expected number of sequences is plotted in the histogram
using an "*". Since the parameters for the extreme value
distribution are not calculated directly from the distribution of
similarity scores, the pattern of "*'s" in the histogram gives a
qualitative view of how well the statistical theory fits the
similarity scores calculated by FASTA and SSEARCH. For FASTA, if
optimized scores are calculated for each sequence in the database
(the default), the agreement between the actual distribution of
"z-scores" and the expected distribution based on the length
dependence of the score and the extreme value distribution is
usually very good. Likewise, the distribution of SSEARCH Smith-
Waterman scores typically agrees closely with the actual
distribution of "z-scores." The agreement with unoptimized
scores, ktup=2, is often not very good, with too many high
scoring sequences and too few low scoring sequences compared with
the predicted relationship between sequence length and similarity
score. In those cases, the expectation values may be
overestimates.
The statistical routines assume that the library contains a
large sample of unrelated sequences. If this is not the case,
then the expectation values are meaningless. Likewise, if there
are fewer than 20 sequences in the library, the statistical
calculations are not done.
For protein searches, library sequences with E() values <
0.01 for searches of a 10,000 entry protein database are almost
always homologous. Frequently sequences with E()-values from 1 -
10 are related as well. Remember, however, that these E() values
also reflect differences between the amino acid composition of
the query sequence and that of the "average" library sequence.
Thus, when searches are done with query sequences with "biased"
amino-acid composition, unrelated sequences may have
"significant" scores because of sequence bias. The programs
below, PRDF and PRSS, can address this problem by calculating
similarity scores for random sequences with the same length and
amino acid composition.
If optimization is not used ("-o"), E-values for DNA
sequences overestimate the significance of the scores that are
obtained and unrelated sequences frequently have E()-values <
0.0005. With optimization, the agreement between E()-value
compares favorably with protein sequence comparison. This is in
part due to the use of more stringent gap penalties for DNA
sequence comparison, -16, -4 rather than -12, -2. With the
latter penalties, many unrelated sequences appear to have
significant similarity. Nevertheless, since protein sequence
comparison is much more sensitive, DNA sequence comparison should
not be used to identify sequences that encode protein. Even with
ktup=6, optimization rarely increases run-times more than 50%
with mRNA-size query sequences. Optimization should be used
whenever possible.
Similar comments apply to TFASTA, where higher gap
penalties (-16,-4) are required for accurate statistical
estimates. Because TFASTA produces so many artificial "coding"
sequences with atypical amino acid compositions, the statistical
estimates with TFASTA are often over estimates. With optimized
scores, ktup=1, and gap penalties of -16, -4, unrelated sequences
will sometimes have E() values of 0.1. If initn scores are used,
unrelated sequences may have have E() values < 0.01.
PRDF improved version of RDF program that includes accurate
probability estimates for all three scoring methods
(includes local or window shuffle routine)
PRSS A version of PRDF that uses the rigorous Smith-Waterman
calculation used by SSEARCH.
RANDSEQ produces a randomly shuffled sequence from a query
sequence.
RELATE significance program described by Dayhoff (Atlas of
Protein Sequence and Structure, Vol. 5, Supplement 3).
Each chunk of 25 residues in one sequence is compared
to every 25 residue fragment of the second sequence.
Sequences which are genuinely related will have a large
number of scores greater than 3 standard deviations
above the mean score of all of the comparisons.
3.6. Other analysis programs
AACOMP calculate the amino acid composition and molecular
weight of a sequence.
BESTSCOR calculate the best self-comparison score.
GREASE Kyte-Doolittle hydropathicity profile
TGREASE graphic plot of Kyte-Doolittle profile
FROMGB convert from GenBank LOCUS format (also used by the
IBI-Pustell programs) to Pearson/FASTA format.
GARNIER A secondary structure prediction program using the
method of Garnier, Osgusthorpe, and Robson, J. Mol.
Biol., (1978) 120:97-120.
3.7. Options
These programs have a number of output options, which are
invoked by the environment variables LINLEN, SHOWALL, and MARKX.
Alternatively, these values can be controlled by command line
options. The number of sequence residues per output line is now
adjustable by setting the environment variable LINLEN, or the
command line option -w. LINLEN is normally 60, to change it set
LINLEN=80 before running the program or add -w 80 to the command
line. LINLEN can be set up to 200. SHOWALL (-a) determines
whether all, or just a portion, of the aligned sequences are
displayed. Previously, FASTP would show the entire length of
both sequences in an alignment while FASTN would only show the
portions of the two sequences that overlapped. Now the default is
to show only the overlap between the two sequences, to show
complete sequences, set SHOWALL=1, or use the -a option on the
command line.
The differences between the two aligned sequences can be
highlighted in three different ways by changing the environment
variable MARKX or the -m option. Normally (MARKX=0) the program
uses ':' do denote identities and '.' to denote conservative
replacements. If MARKX=1, the program will not mark identities;
instead conservative replacements are denoted by a 'x' and non-
conservative substitutions by a 'X'. If MARKX=2, the residues in
the second sequence are only shown if they are different from the
first. MARKX=3 displays the aligned library sequences without the
query sequence; these can be used to build a primitive multiple
alignment. MARKX=4 provides a graphical display of the
boundaries of the alignments. Thus the five options are:
MARKX=0 MARKX=1 MARKX=2 MARKX=3 MARKX=4
MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT
::..:: ::: xx X ..KS..Y... MWKSCGYPYT ----------
MWKSCGYPYT MWKSCGYPYT
(fasta20u4, Feb. 1996) In addition MARKX=10 is a new, parseable
format for use with other programs. See the file"readme.v20u4"
for a more complete description.
3.8. Command line options
It is now possible to specify several options on the
command line, instead of using environment variables. The
command line options are preceded by a dash; the following
options are available:
-a same as showall=1
-A force Smith-Waterman alignments for DNA sequences and
TFASA. By default, only FASTA protein sequence
comparisons use Smith-Waterman alignments.
-b # Number of sequence scores to be shown on output. In
the absence of this option, fasta (and tfasta and
ssearch) display all library sequences obtaining
similarity scores with expectations less than 10.0 if
optimized score are used, or 2.0 if they are not. The
-b option can limit the display further, but it will
not cause additional sequences to be displayed.
-c # Threshold score for optimization (OPTCUT). Set "-c 1"
to optimize every sequence in a database. (This slows
the program down about 5-fold).
-E # Limit the number of scores and alignments shown based
on the expected number of scores. Used to override the
expectation value of 10.0 used by default. When used
with -Q, -E 2.0 will show all library sequences with
scores with an expectation value <= 2.0.
-d # Number of alignments to be reported by default. (Used
in conjunction with -Q). No longer necessary, see "-b"
above.
-f Penalty for the first residue in a gap (-12 by default
for proteins, -16 for DNA or for TFASTA).
-g Penalty for additional residues in a gap (-2 by default
for proteins, -4 for DNA and TFASTA ).
-h Penalty for frameshift (FASTX, TFASTX only).
-H Omit histogram.
-i Invert (reverse complement) the query sequence if it is
DNA. For TFASTX, search the reverse complement of the
library sequence only.
-k # Threshold for joining init1 segments to build an initn
score (GAPCUT).
-l file Location of library menu file (FASTLIBS).
-L Display more information about the library sequence in
the alignment.
-m # MARKX = # (0, 1, 2, 3, 4, 10)
-n Force the query sequence to be treated as a DNA
sequence. This is particularly useful for query
sequences that contain a large number of ambiguous
residues, e.g. transcription factor binding sites.
-O Send copy of results to "filename." Helpful for
environments without STDOUT.
-o Turn off default optimization of all scores greater
than OPTCUT. Sort results by "initn" scores.
-Q,-q Quiet - does not prompt for any input. Writes scores