Formats of the various databases and flat files used by PaperBLAST and
related tools:
PaperBLAST
Metadata about proteins and papers is in a SQLite3 database: data/litsearch.db
The database schema is in bin/litsearch.sql
The non-redundant protein sequences are in a BLASTp database in data/uniq.faa
with identifiers like geneId from Gene or db::protId from CuratedGene.
See the SeqToDuplicate table (of litsearch.db) for the mapping of
redundant identifiers to the (arbitrarily selected) one in uniq.faa
SitesBLAST
Proteins with known functional sites are in the BLASTp database
data/hassites.faa. It includes entries of the form SwissProt::protId
or PDB:accession:chain. Information about the sites is in
litsearch.db.
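
The two identifier forms used in hassites.faa can be told apart with a few lines of Python. This is only a sketch: parse_site_id is a hypothetical helper (not part of PaperBLAST), and the accessions in the usage note are made up for illustration.

```python
def parse_site_id(seq_id):
    """Split a hassites.faa identifier into (db, accession, chain).

    chain is None for identifiers of the form db::protId.
    """
    if "::" in seq_id:
        # Form 1: SwissProt::protId
        db, prot_id = seq_id.split("::", 1)
        return db, prot_id, None
    parts = seq_id.split(":")
    if len(parts) == 3 and parts[0] == "PDB":
        # Form 2: PDB:accession:chain
        return "PDB", parts[1], parts[2]
    raise ValueError("unrecognized identifier: %r" % seq_id)
```

For example, parse_site_id("PDB:1abc:A") gives ("PDB", "1abc", "A"), while parse_site_id("SwissProt::P12345") gives ("SwissProt", "P12345", None).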
Curated BLAST for Genomes
Curated BLAST relies on uniq.faa and litsearch.db to get metadata and
sequences of characterized proteins.
Genomes are fetched using lib/FetchAssembly.pm (see below)
GapMind: information about pathways
Information about a pathway is in gaps/{set}, including:
The list of pathways in the set, in {set}.table
The rules files for each pathway, in *.steps (documented in
Steps.pm). Before being used as queries, these are compiled into
*.query files in tmp/path.{set}/
Dependencies between pathways, in requires.tsv
Known gaps in organisms that can perform pathways, in
{set}.known.gaps.tsv and {set}.known.gaps.markers.faa
Curated classifications of gaps for those pathways in some organisms,
in {set}.curated.gaps.tsv
GapMind: characterized proteins and queries
GapMind relies on various files in tmp/path.{set}/ to describe the
queries for a set of pathways. These are compiled into sqlite3
databases curated.db and steps.db (see lib/curated.sql and steps.sql
for the schemas). Some scripts also use curated.faa.udb (see below).
The files in tmp/path.{set} include:
{path}.query has the compiled queries for each step of each pathway.
This is tab-delimited with fields step, type of step (see
lib/Steps.pm), query, desc, file, and sequence. (file is for HMM
queries.)
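
As an illustration of the field order above, one line of a .query file could be split like this. The sketch assumes that trailing empty fields may be missing from a line and says nothing about whether the file has a header line; parse_query_line is a hypothetical helper, not something provided by lib/Steps.pm.

```python
QUERY_FIELDS = ("step", "type", "query", "desc", "file", "sequence")

def parse_query_line(line):
    """Map one tab-delimited line of a .query file onto the field names."""
    values = line.rstrip("\n").split("\t")
    if len(values) > len(QUERY_FIELDS):
        raise ValueError("too many fields: %d" % len(values))
    # Assumption: trailing empty fields (such as file) may be absent.
    values += [""] * (len(QUERY_FIELDS) - len(values))
    return dict(zip(QUERY_FIELDS, values))
```

The resulting dictionary has an empty string for file unless the step is an HMM query.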
*.hmm files store the relevant HMM models, if any.
A BLASTp database named curated.faa stores all the characterized
proteins. (This is a subset of what is in the CuratedGene table and is
stored separately because mismatches between the design of the rules
and the database of characterized proteins can cause good candidates
to be marked as moderate-confidence.) The headers are named
id1,...,idN where the ids are all identical sequences and of the form
db::protId. For instance, Q7XJ02 is described in both BRENDA and
SwissProt so there is a header line
">BRENDA::Q7XJ02,SwissProt::Q7XJ02".
This database is also stored as a usearch (ublast) database,
curated.faa.udb.
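
A header line of curated.faa can be split back into its (db, protId) pairs as follows. This is a minimal sketch that assumes the header holds only the comma-delimited ids, with no trailing description; parse_curated_header is a hypothetical name.

```python
def parse_curated_header(header):
    """Return the list of (db, protId) pairs named in a curated.faa header."""
    text = header.strip()
    if text.startswith(">"):
        text = text[1:]
    pairs = []
    for ident in text.split(","):
        db, sep, prot_id = ident.partition("::")
        if not sep:
            raise ValueError("identifier without '::': %r" % ident)
        pairs.append((db, prot_id))
    return pairs
```

Applied to the header line quoted above, this returns [("BRENDA", "Q7XJ02"), ("SwissProt", "Q7XJ02")].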
curated.faa.info is tab-delimited with ids, length, descs, and
optionally id2s and orgs. ids is the comma-delimited list of
identifiers used in the curated.faa file. length is the length of the
sequence (in amino acids). descs, id2s, and orgs are delimited by ";; "
and correspond to the desc, id2, and organism fields of CuratedGene.
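
The nested delimiters in curated.faa.info (tabs between fields, commas within ids, ";; " within descs, id2s, and orgs) can be unpacked as sketched here. parse_info_line is a hypothetical helper; it expects a data line, so a header line, if the file has one, would need to be skipped first.

```python
def parse_info_line(line):
    """Parse one data line of curated.faa.info into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 3:
        raise ValueError("expected at least ids, length, and descs")
    record = {
        "ids": cols[0].split(","),     # comma-delimited identifiers
        "length": int(cols[1]),        # sequence length in amino acids
        "descs": cols[2].split(";; "),
    }
    # id2s and orgs are optional columns.
    if len(cols) > 3:
        record["id2s"] = cols[3].split(";; ")
    if len(cols) > 4:
        record["orgs"] = cols[4].split(";; ")
    return record
```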
pfam.hits.tab is a tab-delimited file with all of the PFam hits for
the sequences in curated.faa. It has no header line and has the fields
curatedIds, pfam name, pfam accession, evalue, bits, seqFrom, seqTo,
seqLen, hmmFrom, hmmTo, hmmLen.
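
Because pfam.hits.tab has no header line, a reader has to supply the field names itself. The sketch below does so with a named tuple; the Python attribute names and the numeric types (floats for evalue and bits, integers for the coordinates and lengths) are assumptions, not taken from the PaperBLAST code.

```python
from typing import NamedTuple

class PfamHit(NamedTuple):
    curated_ids: str
    pfam_name: str
    pfam_accession: str
    evalue: float
    bits: float
    seq_from: int
    seq_to: int
    seq_len: int
    hmm_from: int
    hmm_to: int
    hmm_len: int

def parse_pfam_hit(line):
    """Parse one line of pfam.hits.tab (11 tab-delimited fields)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 11:
        raise ValueError("expected 11 fields, got %d" % len(cols))
    return PfamHit(cols[0], cols[1], cols[2],
                   float(cols[3]), float(cols[4]),
                   *[int(c) for c in cols[5:]])
```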
hetero.tab records which of these proteins is part of a heteromeric
complex. It is tab-delimited with fields db, protId, and comment. Even
if comment is empty, it implies that the protein is heteromeric. This
file is used by curatedClusters.cgi to highlight proteins that match a
rule and are heteromeric. It is not used by the main GapMind pages
(gapView.cgi).
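
The rule that a protein counts as heteromeric whenever it is listed, even with an empty comment, can be expressed as a dictionary lookup. This sketch assumes the lines passed in are data lines (any header line already skipped); read_hetero is a hypothetical name.

```python
def read_hetero(lines):
    """Map (db, protId) to its comment for every protein in hetero.tab."""
    hetero = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        cols = line.split("\t")
        comment = cols[2] if len(cols) > 2 else ""
        # Presence in the file is what matters; the comment may be empty.
        hetero[(cols[0], cols[1])] = comment
    return hetero
```

Membership is then tested with an expression such as ("SwissProt", "Q7XJ02") in hetero, without looking at the comment.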
orgSets.tsv records whether an analysis for a genome is already available
in one of the standard sets such as orgsFit. It is tab-delimited with
fields orgId, gdb, gid, genomeName, nProteins (as in a .org file, see
FetchAssembly), and also orgSet.
uniprot.tsv is a cache of sequences and descriptions for uniprot
entries that appear in step definitions. It is used to speed up
gapquery.pl; removing it should be harmless.
GapMind: assemblies and results
GapMind uses FetchAssembly.pm to download assemblies (see
below). For uploaded files, it uses AASeqToAssembly(). For
groups of assemblies, it uses directories like tmp/orgsFit (these are
created at the command line, not from the web site).
In either case, it uses a directory in tmp/ to store the analysis
results. For individual assemblies, these directories are named
{gdb}__{gid}. For uploaded assemblies, they have hex names like
9e004e282fd791a3bc03b92a56fbb6c8. By convention, the directories for
groups of assemblies have names beginning with tmp/orgs.
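
The three naming conventions for directories under tmp/ can be distinguished roughly as follows. The sketch assumes that upload names are always 32 lowercase hex digits, as in the example above, and that group directories are the only ones whose names start with "orgs"; classify_analysis_dir is a hypothetical helper, and the gdb value "NCBI" in the usage note is only an illustration.

```python
import re

# Assumption: uploaded assemblies get a 32-character lowercase hex name.
_HEX_NAME = re.compile(r"[0-9a-f]{32}")

def classify_analysis_dir(name):
    """Return (kind, gdb, gid) for a directory name found under tmp/."""
    if name.startswith("orgs"):
        return ("group", None, None)
    if _HEX_NAME.fullmatch(name):
        return ("upload", None, None)
    gdb, sep, gid = name.partition("__")
    if sep and gdb and gid:
        return ("assembly", gdb, gid)
    return ("unknown", None, None)
```

For example, classify_analysis_dir("NCBI__GCF_003173355.1") gives ("assembly", "NCBI", "GCF_003173355.1").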
The analysis results are in another sqlite3 database, {set}.sum.db
(see lib/gaps.sql for the schema). This is built from
{set}.sum.*, with suffixes rules (one line per pathway/rule), steps
(one line per pathway/step), cand (one line per
pathway/step/candidate), and warn (one line per violated
requirement). These are all tab-delimited. These are computed by
gapsummary.pl or (for the warn file) checkGapRequirements.pl.
The files {set}.hits and {set}.revhits are intermediate files from
gapsearch.pl and gaprevsearch.pl
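
To check what a {set}.sum.db file contains without relying on the schema in lib/gaps.sql, the tables can be listed through SQLite's built-in sqlite_master catalog. This is a generic sketch using Python's standard sqlite3 module; list_tables is a hypothetical helper. Note that sqlite3.connect() creates an empty database if the path does not exist.

```python
import sqlite3

def list_tables(db_path):
    """Return the names of the tables in a SQLite database, sorted."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```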
FetchAssembly.pm
CacheAssembly() caches NCBI assemblies in tmp/refseq_{gid}.* with
suffixes faa (for predicted proteins in fasta format), fna (for the
genome sequence in fasta format), and features.tab (for the gene
annotations). The gid is an assembly id such as "GCF_003173355.1".
CacheAssembly() caches MicrobesOnline genomes in tmp/mogenome_{gid}.*
with suffixes faa and fna. The gid is an NCBI taxonomy id such as
"272844".
CacheAssembly() caches JGI genomes in tmp/{gid}, where gid is the
identifier at the JGI portal, such as "EsccolW_FD". FetchAssembly.pm
requires a private key, usually in private/.JGI.info, for access to
the JGI portal. (The SetPrivDir() function sets the private
directory.)
CacheAssembly() caches fitness browser genomes in tmp/fbrowse_{gid}.*
with suffixes faa and fna. The gid is the orgId in the Fitness Browser
such as "Keio". FetchAssembly.pm requires the fitness browser database
to be available locally, as set by SetFitnessBrowserPath(). Usually,
fbrowse_data is a symbolic link to the cgi_data directory of the
fitness browser.
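
The cache file names described above follow a prefix-plus-suffix pattern, which can be tabulated as in this sketch. The source labels ("ncbi", "microbesonline", "fitnessbrowser") are hypothetical keys chosen for this example, not the gdb values that FetchAssembly.pm uses. JGI genomes are left out because they are cached in tmp/{gid} and no suffixes are listed for them.

```python
# Hypothetical source labels mapped to (file prefix, suffixes).
CACHE_LAYOUT = {
    "ncbi": ("refseq_", ("faa", "fna", "features.tab")),
    "microbesonline": ("mogenome_", ("faa", "fna")),
    "fitnessbrowser": ("fbrowse_", ("faa", "fna")),
}

def cached_files(source, gid, tmp_dir="tmp"):
    """List the cache files expected for one genome from one source."""
    prefix, suffixes = CACHE_LAYOUT[source]
    return ["%s/%s%s.%s" % (tmp_dir, prefix, gid, suffix)
            for suffix in suffixes]
```

For example, cached_files("fitnessbrowser", "Keio") gives ["tmp/fbrowse_Keio.faa", "tmp/fbrowse_Keio.fna"].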