Releases: wrpearson/fasta36
fasta_v36.3.8i_Nov2022
The FASTA package - protein and DNA sequence similarity searching and alignment programs
This directory contains the source code for the FASTA package of
programs (W. R. Pearson and D. J. Lipman (1988), "Improved Tools
for Biological Sequence Analysis", PNAS 85:2444-2448). The current version of the program is fasta-36.3.8i.
The FASTA package offers many of the same programs as BLAST, but
takes a different approach to statistical estimates, and provides
additional optimal programs for local (ssearch36) and global
(ggsearch36, glsearch36) alignment, and for non-overlapping
internal local alignments (lalign36).
The programs available include:
FASTA | BLAST | description |
---|---|---|
fasta36 | blastp/blastn | Protein and DNA local similarity search |
ssearch36 | | optimal Smith-Waterman search -- vectorized on Intel and Arm architectures |
ggsearch36 | | optimal global Needleman-Wunsch search -- vectorized on Intel and Arm architectures |
glsearch36 | | optimal global (query)/local (library) search -- vectorized on Intel and Arm architectures |
fastx36 / fasty36 | blastx | DNA query search against protein sequence database. (fasty36 uses a slower, more sophisticated frame shift aligner) |
tfastx36 / tfasty36 | tblastn | protein query search against DNA database |
fastf36 / tfastf36 | | compares an ordered peptide mixture against a protein (fastf36) or DNA (tfastf36) database |
fastm36 / tfastm36 | | compares a set of ordered peptides against a protein (fastm36) or DNA (tfastm36) database, or oligonucleotides against a DNA database |
fasts36 / tfasts36 | | compares an unordered set of peptides against a protein (fasts36) or DNA (tfasts36) database |
lalign36 | | looks for non-overlapping internal alignments, similar to a "dot-plot," but with statistical significance |
Changes in fasta-36.3.8i Nov, 2022
- bug fix to remove duplicate variant annotations
- update to scripts/get_protein.py and annotation scripts.
- modify code to reduce mktemp compilation warning messages
- changes to annotation scripts for Pfam shutdown; new ann_pfam_www.py, ann_pfam_sql.py
Changes in fasta-36.3.8i Sept, 2021
- Enable translation table -t 9 for Echinoderms. This bug has existed
since alternate translation tables were first made available.
Changes in fasta-36.3.8i May, 2021
- Add an option, -Xg, that preserves the gi|12345 string in the score
summary and alignment output.
Changes in fasta-36.3.8i Nov, 2020
- fasta-36.3.8i (November, 2020) incorporates the SIMDe
(SIMD-everywhere,
https://github.com/simd-everywhere/simde/blob/master/simde/x86/sse2.h)
macro definitions that allow the smith_waterman_sse2.c,
global_sse2.c, and glocal_sse2.c code to be compiled on non-Intel
architectures (currently tested on ARM/NEON). Many thanks to
Michael R. Crusoe (https://orcid.org/0000-0002-2961-9670) for the
SIMDe code conversion, and to Evan Nemerson for creating SIMDe.
- The code to read FASTA format sequence files now ignores lines with
'#' at the beginning, for compatibility with PSI Extended FASTA
Format (PEFF) files (http://www.psidev.info/peff).
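
The '#' handling described above can be illustrated with a minimal Python sketch. This is not the package's C reader; the function name read_fasta and the sample records are hypothetical, and the sketch only assumes that any line starting with '#' is skipped wherever it appears.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, skipping lines that begin with '#'.

    Illustrative only: the real FASTA programs do this in C.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("#"):
            # PEFF-style header/comment line: ignore it
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None and line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical input with '#' lines before and inside a record
example = [
    "# PEFF 1.0",
    ">sp|P09488|GSTM1_HUMAN",
    "MPMILGYW",
    "# a comment line",
    "DIRGL",
    ">seq2",
    "ACDE",
]
records = list(read_fasta(example))
```

The generator accepts any iterable of lines, so an open file handle can be passed directly.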
Changes in fasta-36.3.8h May, 2020
- fasta-36.3.8h (May 2020) fixes a bug that appeared when
multiple query sequences were searched against a large library
that would not fit in memory. In that case, the number of
library sequences and residues increased by the library size
with each new search.
- More consistent formats for ERROR and Warning messages.
- Corrections to code to address compiler warnings with gcc8/9.
- addition of 's' option to show similarity in -m8CBls (or -m8CBs, -m8CBsl) and 'd' option to show raw (unaligned) domain information.
Changes in fasta-36.3.8h February, 2020
- The license for Michael Farrar's Smith-Waterman sse2 code and global/glocal sse2 code is now open source (BSD), see COPYRIGHT.sse2 for details.
Changes in fasta-36.3.8h August, 2019
- Modifications to support makeblastdb format v5 databases. Currently, only simple database reads have been tested.
Changes in fasta-36.3.8h March, 2019
- Translation table 1 (-t 1) now translates 'TGA'->'U' (selenocysteine).
- New script for extracting DNA sequences from genomes (scripts/get_genome_seq.py). Currently works with human (hg38), mouse (mm10), and rat (rn6).
Changes in fasta-36.3.8h January, 2019
- Bug fixes: fastx/tfastx searches done with the -t t option (which adds a * to protein sequences so that termination codons can be matched) did not work properly with the VT series of matrices, particularly VT10. This has been fixed.
- New features: Both query and library/subject sequences can be generated by specifying a program script, either by putting a ! at the start of the query/subject file name, or by specifying library type 9. Thus, fasta36 \\!../scripts/get_protein.py+P09488+P30711 /seqlib/swissprot.fa or fasta36 "../scripts/get_protein.py+P09488+P30711 9" /seqlib/swissprot.fa will compare two query sequences, P09488 and P30711, to SwissProt, by downloading them from Uniprot using the get_protein.py script (which can download sequences using either Uniprot or RefSeq protein accessions). Often, the leading ! must be escaped from shell interpretation with \\!.
New scripts that return FASTA sequences using accessions or genome coordinates are available in scripts/: get_protein.py, get_uniprot.py, get_up_prot_iso_sql.py, and get_refseq.py. get_refseq.py can download either protein or mRNA RefSeq entries. get_up_prot_iso_sql.py retrieves a protein and its isoforms from a MySQL database. get_genome_seq.py extracts genome sequences using coordinates from local reference genomes (hg38 and mm10 included by default).
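
The two ways of naming a script as the query (a leading ! or library type 9) can be sketched in Python as follows. The helper names script_spec and fasta36_argv are hypothetical and not part of the package; the script path, accessions, and library path are taken from the example above. When the argument list is passed to subprocess.run without a shell, no escaping of the leading ! is needed.

```python
from typing import List


def script_spec(script: str, *args: str, bang: bool = True) -> str:
    """Join a script and its arguments with '+', as in the example above.

    bang=True  -> '!script+arg1+arg2'   (leading '!' form)
    bang=False -> 'script+arg1+arg2 9'  (library type 9 form)
    """
    joined = "+".join((script,) + args)
    return "!" + joined if bang else joined + " 9"


def fasta36_argv(query_spec: str, library: str, program: str = "fasta36") -> List[str]:
    """Build an argument list suitable for subprocess.run(argv)."""
    return [program, query_spec, library]


bang_form = script_spec("../scripts/get_protein.py", "P09488", "P30711")
type9_form = script_spec("../scripts/get_protein.py", "P09488", "P30711", bang=False)
argv = fasta36_argv(bang_form, "/seqlib/swissprot.fa")
```

Building the argument list directly avoids the shell-quoting issue mentioned above, since the ! never passes through shell history expansion.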
Changes in fasta-36.3.8h December, 2018
The scripts/ann_exons_up_www.pl and ann_exons_up_sql.pl scripts now include the option --gen_coord, which provides the associated genome coordinate (including chromosome) as a feature, indicated by '<' (start of exon) and '>' (end of exon).
Changes in fasta-36.3.8h released November, 2018
fasta-36.3.8h provides new scripts and modifications to the fasta programs that normalize the process of merging sub-alignment scores and region information into both FASTA and BLAST results. To move BLASTP towards FASTA with respect to alignment annotation and sub-alignment scoring:
- The blastp_annot_cmd.sh script runs a blast search, finds and scores domain information for the alignments, and merges this information back into the blast output .html file. This script uses:
  - annot_blast_btab2.pl --query query.file --ann_script annot_script.pl --q_ann_script annot_script.pl blast.btab_file > blast.btab_file_ann (produces a blast tabular file with one or two new fields: an annotation field and, optionally with --dom_info, a raw domain content field).
  - merge_blast_btab.pl --btab blast.btab_file_ann blast.html > blast_ann.html (merges the annotations and domain content information in the blast.btab_file_ann file with the standard blast output file to produce annotated alignments).
  - In addition, rename_exons.py is available to rename exons (later other domains) in the subject sequences to match the exon labeling in the aligned query sequence. relabel_domains.py can be used to adjust color sets for homologous domains.
- There is also an equivalent fasta_annot_cmd.sh script that provides similar functionality for the FASTA programs. This script does not need to use annot_blast_btab2.pl to produce domain subalignment scores (that functionality is provided in FASTA), but it also can use merge_fasta_btab.pl and rename_exons.py to modify the names of the aligned exons/domains in the subject sequences.
- To support the independence of the blastp/fasta output from html annotation, the FASTA package includes some new options:
  - The -m 8CBL option includes query sequence length and subject sequence length in the blast tabular output. In addition, if domain annotations are available, the raw domain coordinates are provided in an additional field after the annotation/subalignment scoring field. -m 8CBl provides the sequence lengths, but does not add the raw domain coordinates.
  - The -Xa option prevents annotation information from being included in the html output -- it is only available in the -m 8CB (or -m 8CBL/l) output.
  - To reduce problems with spaces in script arguments, annotation scripts with spaces separating arguments can use '+' instead of ' '.
  - The fasta_annot_cmd.sh script produces both a conventional alignment on stdout and a -m 8CBL alignment, which is sent to a separate file. The file name is separated from the -m F8CBL option with a =, thus -m F8CBL=tmp_output.blast_tab.
Changes in fasta-36.3.8g released 23-Oct-2018
- (Oct. 2018) Improvements to scripts in the psisearch2/ directory: psisearch2/m89_btop_msa2.pl
  - the --clustal option produces a "CLUSTALW (1.8)" header, which is required for some downstream programs
  - the --trunc_acc option removes the database and accession from identifiers of the form: sp|P09488|GSTM1_HUMAN to produce GSTM1_HUMAN.
fasta_v36.3.8h_May2020
The FASTA package - protein and DNA sequence similarity searching and alignment programs
The FASTA (pronounced FAST-Aye, not FAST-Ah) programs are a comprehensive set of similarity searching and alignment programs for searching protein and DNA sequence databases. Like the BLAST programs blastp and blastn, the fasta program itself uses a rapid heuristic strategy for finding similar regions in protein and DNA sequences. But in addition to heuristic similarity searching, the FASTA package provides programs for rigorous local (ssearch) and global (ggsearch) similarity searching, as well as a program for finding non-overlapping sequence similarities (lalign). Like BLAST, the FASTA package also includes programs for aligning translated DNA sequences against proteins (fastx, fasty are equivalent to blastx, and tfastx, tfasty are similar to tblastn).
See doc/README_v36.3.8h.md and doc/readme.v36 for a more complete summary of changes.
May, 2020
- fix a bug that appeared when
multiple query sequences were searched against a large library
that would not fit in memory. In that case, the number of
library sequences and residues increased by the library size
with each new search.
- More consistent formats for *** ERROR and *** Warning messages.
- Corrections to code to address compiler warnings with gcc8/9.
Feb, 2020
The major update in this release is the change of the license terms
for the SSE2 accelerated versions of the Smith-Waterman and
global/glocal alignment algorithms. All of the FASTA package is now
distributed under open source licenses, either Apache (for the
majority of the code) or BSD (for the SSE2 accelerated code).
August, 2019
Bug fix to recover properly when memory mapped databases are too large.
Modifications to support makeblastdb format v5 databases. Currently,
only simple database reads have been tested.
March, 2019
An updated release of the FASTA package (fasta-36.3.8h) is
available. In addition to minor bug fixes, the latest version can
generate query and library sequences using program scripts.
December, 2018
The latest version of the FASTA package is fasta-36.3.8h, Dec. 2018.
See doc/README_v36.3.8h.md for a more complete summary of changes.
November, 2018
The current released version of the FASTA package is fasta-36.3.8h, Nov. 2018.
See doc/README_v36.3.8h.md for a more complete summary of changes.
October, 2018
The current version of the FASTA package is fasta-36.3.8g, Oct. 2018
See doc/README_v36.3.8h.md for a more complete summary of changes.
April, 2018
The current version of the FASTA package is fasta-36.3.8g, Apr. 2018
December, 2017
The current FASTA version is fasta-36.3.8g, Dec. 2017
The statistics routines for normally distributed scores (ggsearch36,
glsearch36) are more robust to very low E()-value thresholds.
Sept, 2017
The current FASTA version is fasta-36.3.8f, Sept. 2017
If the -S option is used and a query sequence has no upper case
letters, it is re-read with lower-case letters converted to upper-case.
May, 2017
The current FASTA version is fasta-36.3.8f, May 2017
Various bugs in sub-alignment scoring have been corrected, and support for the
EBI SP:GSTM1_HUMAN P09488 identifier style has been added. The $SRCH_URL and
$SRCH_URL2 format strings have changed to enable pairwise alignment.
September, 2016
The fasta-36.3.8e version includes a new directory, psisearch2, with
scripts to run iterative PSSM (PSI-BLAST or SSEARCH36) searches using
an improved strategy for reducing PSSM contamination due to alignment
over-extension.
As of November, 2014, the FASTA program code is available under the
Apache 2.0 open source license.
Up-to-date release notes are available in the file doc/readme.v36.
Documentation on the FASTA programs is available in the files:
dir/file | description |
---|---|
doc/fasta36.1 | (unix man page) |
doc/changes_v36.html | (short descriptions of enhancements to FASTA programs) |
doc/readme.v36 | (text descriptions of bug fixes and version history) |
doc/fasta_guide.tex | (LaTeX file which describes fasta36, and provides an introduction to the FASTA programs, their use and installation.) |
doc/fasta_guide.pdf | (printable/viewable description of fasta-36) |
fasta_guide.pdf provides background information on installing the
fasta programs (in particular, the FASTLIBS file), that new users of
the fasta3 package may find useful.
Parts of the FASTA package are distributed across several sub-directories
dir | description |
---|---|
bin/ | (pre-compiled binaries for some architectures) |
conf/ | example FASTLIBS files (files for finding libraries) |
data/ | scoring matrices |
doc/ | documentation files |
make/ | make files |
misc/ | perl scripts to reformat -m 9 output, convert -R search.res files for 'R', and embed domains in shuffled sequences |
psisearch2/ | perl/python scripts implementing the new psisearch2_msa iterative PSSM search |
scripts/ | perl scripts for -V (annotate alignments) and -E (expand library) options |
seq/ | test sequences |
src/ | source code |
sql/ | sql files and scripts for using the sql database access |
test/ | test scripts |
For some binary distributions, only the doc/, data/, seq/, and bin/
directories are provided.
To make the standard FASTA programs:
cd src
make -f ../make/Makefile.linux_sse2 all
where ../make/Makefile.linux_sse2 is the appropriate Makefile for your system.
The executable programs will then be found in ../bin (e.g. ../bin/fasta36, etc.)
For a simple test of a program, try (from the src directory)
../bin/fasta36 -q ../seq/mgstm1.aa ../seq/prot_test.lseg
fasta-v36.3.8g
The FASTA package - protein and DNA sequence similarity searching and alignment programs
Changes in fasta-36.3.8g released October, 2018
- psisearch2/m89_btop_msa2.pl
  - the --clustal option produces a "CLUSTALW (1.8)" header, which is required for some downstream programs
  - the --trunc_acc option removes the database and accession from identifiers of the form: sp|P09488|GSTM1_HUMAN to produce GSTM1_HUMAN.
  - the --min_align option specifies the fraction of the query sequence that must be aligned ((q_end-q_start+1)/q_length)

  Together, these changes make it possible for the output of m89_btop_msa2.pl to be used by the EMBOSS program fprotdist.
- A more general implementation of psisearch2_msa_iter.sh, which does psisearch2 one iteration at a time, and a new equivalent psisearch2_msa_iter_bl.sh, which uses psiblast to do the search.
- A small restructuring of the make/Makefiles to remove the -lz dependence for non-debugging scripts (and add it back when -DDEBUG is used).
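
The identifier truncation and aligned-fraction calculation described for m89_btop_msa2.pl can be sketched in Python. This is an illustration rather than the Perl script's code: the function names are hypothetical, and returning identifiers without a '|' unchanged is an assumption, since the notes only describe the sp|P09488|GSTM1_HUMAN form.

```python
def trunc_acc(identifier: str) -> str:
    """Drop the database and accession fields, keeping the last '|' field.

    Assumption: identifiers without '|' are returned unchanged.
    """
    return identifier.rsplit("|", 1)[-1]


def aligned_fraction(q_start: int, q_end: int, q_length: int) -> float:
    """Fraction of the query covered: (q_end - q_start + 1) / q_length."""
    if q_length <= 0:
        raise ValueError("q_length must be positive")
    return (q_end - q_start + 1) / q_length


short_id = trunc_acc("sp|P09488|GSTM1_HUMAN")
# Hypothetical coordinates: residues 11..110 of a 200-residue query
fraction = aligned_fraction(11, 110, 200)
```

A hit would then be kept when the fraction is at least the threshold given to --min_align; how the script treats a hit exactly at the threshold is not stated in these notes.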
Changes in fasta-36.3.8g released 5-Aug-2018
- (Apr 2018) incorporation of "-t t" termination codes ("*") in -m 8CB, -m 8CC, and -m9C so that aligned termination codons are indicated as "**" (-m8CB) or
"*1" (-m8CC, -m9C).
- (Mar 2018) Updates to scripts/annot_blast_btop2.pl to provide
subalignment scoring for blastp searches (BLOSUM62 only). (see
doc/readme.v36)
- (Feb. 2018) a new extended option, -XB, which causes percent
identity, percent similarity, and alignment length to be calculated
using the BLAST model, which does not count gaps in the alignment
length.
see readme.v36 for other bug fixes.
Changes in fasta-36.3.8g released 31-Dec-2017
- (December, 2017) -- Make statistical thresholds more robust for
small E()-values with normally distributed scores (ggsearch36,
glsearch36).
- (September, 2017) Treat all lower-case queries as uppercase with -S option.
- (May, 2017) Improvements/fixes to sub-alignment scoring strategies.
- Improvements/fixes to psisearch2 scripts.
For more detailed information, see doc/readme.v36.
fasta-36.3.8d
The FASTA package - protein and DNA sequence similarity searching and alignment programs
Changes in fasta-36.3.8d released 13-April-2016:
- Various bug fixes to pssm_asn_subs.c that avoid coredumps when
reading NCBI PSSM ASN.1 binary files. pssm_asn_subs.c can now read
IUPACAA sequences.
- Default gap penalties for VT40 (from -14/-2 to -13/-1), VT80 (from
-14/-2 to -11/-1), and VT120 (from -10/-1 to -11/-1) have changed
slightly.
- Introduction of scripts/m9B_btop_msa.pl and scripts/m8_btop_msa.pl,
which use the BTOP (-m 9B or -m 8CB) encoded alignment strings to
produce a query-driven multiple sequence alignment (MSA) in ClustalW
format. This MSA can be used as input to psiblast to produce an
ASN.1 PSSM.
- The scripts/annot_blast_btop2.pl script replaces
scripts/annot_blast_btop.pl and allows annotation of both the query
and subject sequences.
- Various domain annotation scripts have been renamed for clarity.
For example, ann_feats_up_sql.pl uses an SQL implementation of
Uniprot features tables to annotate domains. Likewise,
ann_pfam_www.pl gets domain information from the Pfam web site,
while ann_pfam27.pl gets the information from the downloaded
Pfam27 MySQL tables, and ann_pfam28.pl uses the Pfam28 MySQL
tables.
- Percent identity in sub-alignment scores is calculated like a BLAST
percent identity -- gaps are not included in the denominator.
For more detailed information, see doc/readme.v36.
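
One reading of the percent identity rule in the last item above (gaps excluded from the denominator) is sketched below in Python. The function name, the '-' gap character, and the toy alignment are assumptions for illustration; the FASTA programs compute this internally in C and may differ in detail.

```python
def percent_identity_ungapped(query_aln: str, subject_aln: str, gap: str = "-") -> float:
    """Percent identity over aligned columns, skipping any column with a gap."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have the same length")
    identical = 0
    aligned = 0
    for q, s in zip(query_aln, subject_aln):
        if q == gap or s == gap:
            # gapped column: not counted in the denominator
            continue
        aligned += 1
        if q.upper() == s.upper():
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


# Toy alignment: 6 columns, 1 gapped, 4 identities in the 5 ungapped columns
pid = percent_identity_ungapped("ACD-EF", "ACDGEY")
```

Counting the gapped column in the denominator instead would give 4/6 for the same toy alignment, which is why the choice of denominator matters when comparing reported identities.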