Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metabolic reconstruction #1413

Merged
merged 408 commits into from
Apr 24, 2020
Merged
Show file tree
Hide file tree
Changes from 250 commits
Commits
Show all changes
408 commits
Select commit Hold shift + click to select a range
ba9e610
handle more complex orthology format
ivagljiva Feb 28, 2020
09e0e11
correction for incorrect orthology
ivagljiva Feb 28, 2020
9311f36
correction for incorrect pathway
ivagljiva Feb 28, 2020
76d14ed
correction for incorrect reaction
ivagljiva Feb 28, 2020
0843eba
update warning for corrected lines and raise error if we find uncorre…
ivagljiva Feb 28, 2020
8d92dac
fix module table structure
ivagljiva Feb 28, 2020
07bcdc2
change db store condition so we do not call store for just 1 entry
ivagljiva Feb 28, 2020
dfc5a3f
accept M nums for orthology. KEGG module db creation is finally working
ivagljiva Feb 28, 2020
384248a
end module DB creation output
ivagljiva Mar 4, 2020
8e19e68
now we actually put the corrected values in the db
ivagljiva Mar 4, 2020
7e0146b
fix for orthology that has commas within ()
ivagljiva Mar 4, 2020
9b0d47e
ignore lines that are completely blank
ivagljiva Mar 4, 2020
af7ed65
is_kegg_modules_db function
ivagljiva Mar 5, 2020
90bbf14
initialize an existing modules db
ivagljiva Mar 5, 2020
4f0ce5d
make modules dict optional param for modules db; existing db should n…
ivagljiva Mar 5, 2020
1d10502
update to get_some_rows_from_table_as_dict so that it works for DBs w…
ivagljiva Mar 5, 2020
5eb5ad0
we should use kegg_modules instead of just kegg or just modules where…
ivagljiva Mar 5, 2020
0868eb5
modules table accessor method for mnum + data_name
ivagljiva Mar 5, 2020
4bf9e2f
parsing and accessor functions for class of a module
ivagljiva Mar 5, 2020
bd286fa
keep track of current module being parsed
ivagljiva Mar 5, 2020
7aad6e7
keep track of parsing errors in dict for concise output
ivagljiva Mar 5, 2020
5a2b444
add --quiet param
ivagljiva Mar 5, 2020
57d95be
count multiple definition field "errors"
ivagljiva Mar 5, 2020
b93b302
add quiet option to parsing error counts
ivagljiva Mar 5, 2020
65d3979
I broke this somehow, but now I fixed it again
ivagljiva Mar 5, 2020
c561d91
argument convention
ivagljiva Mar 5, 2020
d628248
add just-do-it as acceptable param
ivagljiva Mar 5, 2020
06e4e88
get --reset the right way
ivagljiva Mar 5, 2020
e418eda
update requires and provides for kegg programs
ivagljiva Mar 5, 2020
451e1f2
argument groups for nicer -h
ivagljiva Mar 5, 2020
c469122
allow skipping uncorrectable errors with --just-do-it
ivagljiva Mar 6, 2020
074539c
fix --just-do-it to save the erroneous lines in the DB; this is now t…
ivagljiva Mar 6, 2020
4c417eb
removal of comment build-up
ivagljiva Mar 6, 2020
a9e8a6a
update docstring
ivagljiva Mar 6, 2020
2f605ea
access function for modules by knum
ivagljiva Mar 6, 2020
8c9a25a
access function for class by knum
ivagljiva Mar 6, 2020
6f0ae93
access functions for module names
ivagljiva Mar 6, 2020
0dac17b
module class dict is now keyed by module number
ivagljiva Mar 6, 2020
04a760d
module name dict now keyed by module number
ivagljiva Mar 6, 2020
2f49290
fix bug, remove undefined class reference
ivagljiva Mar 6, 2020
9afb8dc
rename accessor function to show it returns a dict
ivagljiva Mar 6, 2020
96a5120
accessor function to get classes as a list
ivagljiva Mar 6, 2020
9114847
store module name/class ontology as functional sources
ivagljiva Mar 7, 2020
a5568e0
update error message for when user tries to access a nonexistant modu…
ivagljiva Mar 8, 2020
954e4c0
I looked into the modules with multiple DEFINITION lines and decided …
ivagljiva Mar 9, 2020
5f97dbc
update requires and provides
ivagljiva Mar 10, 2020
c2f21b5
skeleton script for estimating metabolism
ivagljiva Mar 10, 2020
1e7dc40
change kofam-data-dir arg to the less confusing kegg-data-dir
ivagljiva Mar 10, 2020
9b29f64
kofam-data-dir -> kegg-data-dir
ivagljiva Mar 10, 2020
b7033b9
KOfam -> KEGG wherever necessary
ivagljiva Mar 10, 2020
64b5c29
metabolism estimator class with a basic init
ivagljiva Mar 10, 2020
0bb6cb7
make executable
ivagljiva Mar 10, 2020
319c9d2
bug fix
ivagljiva Mar 10, 2020
f3fb3d4
additional imports that were needed
ivagljiva Mar 10, 2020
1a6e5c7
sanity check on contigs db
ivagljiva Mar 10, 2020
660ed3d
initialization and the beginnings of an estimate (driver) function
ivagljiva Mar 10, 2020
bb99b46
now our script calls the driver function
ivagljiva Mar 10, 2020
a46662a
clean up gene calls that do not have kofam hits
ivagljiva Mar 11, 2020
2ffb4b6
print some initialization info
ivagljiva Mar 11, 2020
379689e
add quiet parameter
ivagljiva Mar 11, 2020
bffe30d
merge with master and resolve conflicts - hmmer no longer compresses …
ivagljiva Mar 11, 2020
560c297
add params for working with profiles
ivagljiva Mar 11, 2020
40aab01
remove quiet param, this should be global
ivagljiva Mar 11, 2020
83707e9
Merge branch 'master' into metabolic_reconstruction
ivagljiva Mar 11, 2020
13d11e4
remove quiet param, it is global
ivagljiva Mar 11, 2020
2f02eab
remove unprofiled splits
ivagljiva Mar 11, 2020
c947e07
import profile db
ivagljiva Mar 12, 2020
db03afc
Merge branch 'master' into metabolic_reconstruction
ivagljiva Mar 12, 2020
ff1341d
update run info
ivagljiva Mar 12, 2020
0023d70
profile db + collection param check
ivagljiva Mar 12, 2020
dc70e92
get metagenome mode param
ivagljiva Mar 12, 2020
ce6f854
skeleton control for estimating situation
ivagljiva Mar 12, 2020
89a9046
skeleton funcs for single genome estimation. we also now return hits …
ivagljiva Mar 13, 2020
f2e82e8
structure for single genome estimation in place
ivagljiva Mar 13, 2020
f381217
accessor functions for all modules
ivagljiva Mar 13, 2020
9867ca7
MAJOR BUG FIX add leftover db entries to modules DB after processing …
ivagljiva Mar 13, 2020
0f2b391
accessor for all modules in db
ivagljiva Mar 13, 2020
05659e4
get the modules list
ivagljiva Mar 13, 2020
710924c
Merge branch 'metabolic_reconstruction' of https://github.com/meren/a…
ivagljiva Mar 13, 2020
0693d2b
mark KOs present for list of splits
ivagljiva Mar 13, 2020
84901d3
add module completeness threshold param
ivagljiva Mar 15, 2020
246fe71
typo
ivagljiva Mar 15, 2020
9fdaf2f
one MASSIVE commit containing the code for estimating module complete…
ivagljiva Mar 16, 2020
8e38de6
add output file param
ivagljiva Mar 16, 2020
6c27cfe
change fraction to percent completeness
ivagljiva Mar 16, 2020
4b2848b
keep track of complete nonessential steps in module
ivagljiva Mar 16, 2020
651ca1a
fix bug in temporary return statment and prettify other returns
ivagljiva Mar 16, 2020
e0b5826
store dict as tab delimited output
ivagljiva Mar 16, 2020
a0a56d7
condense module warnings
ivagljiva Mar 16, 2020
a3ee25b
update function deflines
ivagljiva Mar 17, 2020
d563afc
controls for modules defined by other modules
ivagljiva Mar 17, 2020
5d7337c
use decimal for percent completion! making it into a percent got conf…
ivagljiva Mar 17, 2020
9c681fd
fix return to keep track of modules that need to be re-evaluated later
ivagljiva Mar 17, 2020
aa4e0e6
this is code to re-evaluate completeness of modules defined by other …
ivagljiva Mar 17, 2020
c83d996
skeleton function for re-evaluating module completeness
ivagljiva Mar 17, 2020
26b5781
move adjustment code into its own func
ivagljiva Mar 17, 2020
e790533
print complete modules
ivagljiva Mar 17, 2020
7bec6bc
rename func and update docstring
ivagljiva Mar 17, 2020
5ef7059
typo
ivagljiva Mar 17, 2020
f000a96
refactor function to modify dict in place. it is much cleaner this wa…
ivagljiva Mar 17, 2020
8e5f996
adjustment func now returns completeness status so we can update the …
ivagljiva Mar 17, 2020
7d6bf65
fix complete modules output
ivagljiva Mar 17, 2020
ca50a26
cosmetic updates
ivagljiva Mar 17, 2020
b85b329
make atomic function to estimate for list of splits, and convert geno…
ivagljiva Mar 17, 2020
e443137
import ccolections
ivagljiva Mar 17, 2020
0ca10c7
rename to superdict just to be clear on what is returned
ivagljiva Mar 17, 2020
bc05a15
estimator for bins, not tested yet
ivagljiva Mar 17, 2020
9947c4e
remove rogue self
ivagljiva Mar 18, 2020
9db61ec
conditional output
ivagljiva Mar 18, 2020
25c9e9c
clarify that number of genes refers to HMM model
ivagljiva Mar 18, 2020
59668ab
fix separator between module class and name annotations. it will not …
ivagljiva Mar 18, 2020
90d1ac2
bin output
ivagljiva Mar 18, 2020
8a2a125
output bin name
ivagljiva Mar 19, 2020
bb747d6
fetch genes in contigs
ivagljiva Mar 19, 2020
6ca72b8
modified init function to gather contig and gene call info as well as…
ivagljiva Mar 19, 2020
a5942a1
update function docstrings
ivagljiva Mar 19, 2020
fcc0617
update estimate function and bin-specific estimate function to use ne…
ivagljiva Mar 19, 2020
6a3c569
update genome estimate func to use new kofam hit tuple structure
ivagljiva Mar 19, 2020
0c139c2
now all functions used in estimation can handle the kofam hit tuples
ivagljiva Mar 19, 2020
40bde32
update comment
ivagljiva Mar 19, 2020
1992ffc
add genes, contigs, new style of kofam hits to module dictionaries
ivagljiva Mar 20, 2020
299b4a6
add todo for later
ivagljiva Mar 20, 2020
89b9b7b
deprecate old completeness function and start replacement. replacemen…
ivagljiva Mar 22, 2020
7fd4477
copy over code for DEFINITION parsing so it can be adapted for path u…
ivagljiva Mar 22, 2020
a2d0526
add vars for kegg pathway download
ivagljiva Mar 22, 2020
85d5ccf
funcs for processing and downloading pathways
ivagljiva Mar 22, 2020
251544d
ignore certain pathway files
ivagljiva Mar 22, 2020
a826278
pathway data dir and some sanity checks for pathway data
ivagljiva Mar 22, 2020
ede613a
update exclusion of certain identifiers when downloading
ivagljiva Mar 22, 2020
882e94c
exclude drug structure maps as well
ivagljiva Mar 22, 2020
4dfc704
So here is a partial path unrolling function that I have been working…
ivagljiva Mar 25, 2020
ce69eda
Revert "deprecate old completeness function and start replacement. re…
ivagljiva Mar 25, 2020
cfb9659
Revert "So here is a partial path unrolling function that I have been…
ivagljiva Mar 25, 2020
5ea36ad
Revert "Revert "deprecate old completeness function and start replace…
ivagljiva Mar 25, 2020
5d5e369
Revert "copy over code for DEFINITION parsing so it can be adapted fo…
ivagljiva Mar 25, 2020
9501464
starting over
ivagljiva Mar 25, 2020
b4ef56d
more ideas after brainstorming
ivagljiva Mar 25, 2020
1d7dfca
finally an unrolling strategy that works!
ivagljiva Mar 26, 2020
7e3063c
handle protein complexes in path unrolling
ivagljiva Mar 26, 2020
276927d
fix a little bug with extending the list
ivagljiva Mar 26, 2020
a418fad
path unrolling now works on complexes, still some bugs tho
ivagljiva Mar 26, 2020
c11eb89
now unrolling works for real with complexes, but not with multiple de…
ivagljiva Mar 26, 2020
d24f6fb
actually, the prior bug was caused by a space in the definition line.…
ivagljiva Mar 26, 2020
29276f1
clean up comments and update function docstrings
ivagljiva Mar 27, 2020
f1cd2c0
update function documentation
ivagljiva Mar 31, 2020
f59157c
change separator for kegg module annotation
ivagljiva Mar 31, 2020
9488a84
get rid of old present_kos list in module completeness dict
ivagljiva Mar 31, 2020
fd37815
bug fix to make interactive work with kegg module annotations. since …
ivagljiva Mar 31, 2020
5746262
a new function for estimating module completeness
ivagljiva Mar 31, 2020
eded043
updated function documentation
ivagljiva Mar 31, 2020
70ff494
fix complete paths output for no KO hits case
ivagljiva Apr 1, 2020
2bc13f0
increment number of complete modules when appropriate
ivagljiva Apr 1, 2020
77d39dd
updated module completeness adjustment function for pathway unrolling…
ivagljiva Apr 1, 2020
bbbd094
rename i var to unique_id just to be clear on where that id is coming…
ivagljiva Apr 1, 2020
1b61ad1
little syntax buggy bug
ivagljiva Apr 1, 2020
e41eb10
fix indentation bug
ivagljiva Apr 1, 2020
6997b28
fix string formatting bug
ivagljiva Apr 1, 2020
d0e0701
fix parsing bug in pathway unrolling
ivagljiva Apr 1, 2020
b447184
fix variable name
ivagljiva Apr 1, 2020
f82e2f2
multiple complete paths per module is now debug output
ivagljiva Apr 1, 2020
04efd7b
fix index bug in module completeness adjustment func
ivagljiva Apr 1, 2020
394bf62
delete deprecated completeness function hehehehehe
ivagljiva Apr 1, 2020
87b67d7
update docstring for printing function
ivagljiva Apr 1, 2020
096b085
change header for bin_name depending on context
ivagljiva Apr 2, 2020
3d3cd2a
reset progress bar after any warnings
ivagljiva Apr 2, 2020
f6f1186
update estimate_for_list_of_splits documentation
ivagljiva Apr 2, 2020
1859659
add metagenome mode. that was suspiciously easy
ivagljiva Apr 2, 2020
83eb61d
typo
ivagljiva Apr 2, 2020
320575a
fix kegg dir structure and outputs during kegg setup
ivagljiva Apr 2, 2020
eb46630
output is fixed :)
ivagljiva Apr 2, 2020
5ee9fa1
fix output headers
ivagljiva Apr 2, 2020
57af100
switch headers for bin and genome
ivagljiva Apr 2, 2020
9c5c2e1
merge pathway unrolling into main metabolic reconstruction branch
ivagljiva Apr 2, 2020
75cdf83
now we generate a summary file for complete modules in addition to th…
ivagljiva Apr 6, 2020
e7cf5ac
redundancy estimatiom skeleton, plus func for naive redundancy
ivagljiva Apr 8, 2020
788d3e7
call redundancy estimation for each module
ivagljiva Apr 8, 2020
f71a088
import stats
ivagljiva Apr 9, 2020
33e7f3e
function for copywise redundancy computation
ivagljiva Apr 9, 2020
f685e18
store redundancy outputs in module completeness dict
ivagljiva Apr 9, 2020
840aeab
add weighted sum aggregation measure
ivagljiva Apr 9, 2020
48ecf24
we only add completeness distribution once because it is the same reg…
ivagljiva Apr 9, 2020
92eaebf
add option to use hmmsearch
ivagljiva Apr 14, 2020
09c6329
merge with master after adding hmmsearch to see if that fixes errors …
ivagljiva Apr 14, 2020
40dce58
add hmmpress step to fix anvi-run-hmms, which broke after i removed h…
ivagljiva Apr 14, 2020
fb83f0d
parser handles hmmsearch output
ivagljiva Apr 15, 2020
65e480d
add option to choose which hmmer program to run
ivagljiva Apr 15, 2020
5e1bc9a
add parameter for hmmer program to kofams script
ivagljiva Apr 15, 2020
0029606
write metabolism dict to json format
ivagljiva Apr 16, 2020
b046683
more params for JSON data
ivagljiva Apr 16, 2020
4513de7
fix estimate from json flag to take a file path
ivagljiva Apr 16, 2020
f123b75
sanity check for json params
ivagljiva Apr 16, 2020
0227f85
change filename to file prefix for json output
ivagljiva Apr 16, 2020
33d9ad8
sanity check for profiles.db
ivagljiva Apr 16, 2020
9d0e7eb
json storage function now takes a file path
ivagljiva Apr 16, 2020
d336b8e
store json before running estimation
ivagljiva Apr 16, 2020
15ecb7b
do not run estimation when we do not want that info in the JSON output
ivagljiva Apr 16, 2020
78527e1
allow either contigs db or json file as input
ivagljiva Apr 16, 2020
22e82dd
add function for estimating from json file
ivagljiva Apr 16, 2020
00fa53a
sanity checks for keys in JSON file
ivagljiva Apr 17, 2020
5978ada
convert lists to sets
ivagljiva Apr 17, 2020
cac2483
change structure to separate kofam hit marking from module estimation
ivagljiva Apr 17, 2020
bf4d9d4
add function to estimate from json
ivagljiva Apr 17, 2020
6190810
Merge branch 'add_hmmsearch' into metabolic_reconstruction
ivagljiva Apr 20, 2020
93fca00
Merge branch 'master' into metabolic_reconstruction
ivagljiva Apr 20, 2020
fc80c8a
warning message if you try to use hmmsearch on a nucleotide alphabet …
ivagljiva Apr 20, 2020
482c5e5
add hmmsearch as default program to pfams
ivagljiva Apr 20, 2020
7d74e68
description update
ivagljiva Apr 20, 2020
ec7c34c
update provides
ivagljiva Apr 20, 2020
8022862
output correct program used for nucleotide alphabets
ivagljiva Apr 20, 2020
0bb6a90
allow choice of hmm program (hmmscan is still default)
ivagljiva Apr 20, 2020
82628b5
add creation date meta value and warning for old db
ivagljiva Apr 21, 2020
8a456e6
add accessor function for all knums in modules table
ivagljiva Apr 21, 2020
5e6f975
ignore modules in orthology lines when building modules table
ivagljiva Apr 22, 2020
36987e8
add hash of content to modules self table
ivagljiva Apr 22, 2020
00ea418
bug fix in hashing function
ivagljiva Apr 22, 2020
69b6237
bug fix in hashing function
ivagljiva Apr 22, 2020
93952d4
add modules db hash to contigs db self table before annotating with kegg
ivagljiva Apr 22, 2020
d159f2d
move setting hash to after annotation
ivagljiva Apr 22, 2020
fd8a330
sanity check for different modules db hashes before estimation
ivagljiva Apr 22, 2020
de630ac
change hashing function to sort lists and convert to string to ensure…
ivagljiva Apr 22, 2020
e827eb2
fix another bug in hashing function
ivagljiva Apr 22, 2020
53c473c
add geometric mean aggregation measure to copywise redundancy function
ivagljiva Apr 22, 2020
fa97bbf
:
ivagljiva Apr 22, 2020
209d3d1
use hashlib instead of hash to get consistent hashes between runs
ivagljiva Apr 22, 2020
e7372b9
only use first 12 digits of hash
ivagljiva Apr 22, 2020
197d601
calculate geometric mean using math because our stats package version…
ivagljiva Apr 22, 2020
87c794d
okay. fine. we will use scipy for geometric mean.
ivagljiva Apr 22, 2020
f712e54
remove in_place hmmpress option that I did not implement (yet)
ivagljiva Apr 22, 2020
2bd3efd
remove extra pfams line
ivagljiva Apr 23, 2020
eea5713
cosmetic updates to hmmer.py
ivagljiva Apr 23, 2020
68d56f8
update hmmer-program param help
ivagljiva Apr 23, 2020
1b292c1
capitalize KEGG
ivagljiva Apr 23, 2020
493606a
fix docstring format
ivagljiva Apr 23, 2020
a2533fe
explicit log file path printed per thread
ivagljiva Apr 23, 2020
c5ae538
reorder imports
ivagljiva Apr 23, 2020
49422a8
format docstring
ivagljiva Apr 24, 2020
924ba51
formatting, structure, and output clarification changes
ivagljiva Apr 24, 2020
10e8b56
sanity check to avoid overwriting user-specified directories with --r…
ivagljiva Apr 24, 2020
d348a0d
bug fix
ivagljiva Apr 24, 2020
f5d4cab
missing space
ivagljiva Apr 24, 2020
5b1c2e8
remove erroneous explanation
ivagljiva Apr 24, 2020
a408a5c
refactor variable names and error messages to be agnostic to kegg
ivagljiva Apr 24, 2020
a399268
cosmetic updates :)
ivagljiva Apr 24, 2020
99917d3
add safety check for writable pfams dir and avoid overwriting user-sp…
ivagljiva Apr 24, 2020
2d22ab3
move check for tarfile from utils to filesnpaths
ivagljiva Apr 24, 2020
d2475a9
add option to return none if a key is not in the self table
ivagljiva Apr 24, 2020
aa1855f
sanity check for already-annotated contigs db
ivagljiva Apr 24, 2020
a80b3e3
update provides/requires statements
ivagljiva Apr 24, 2020
4d6cc47
add kegg-related items to anvio_items
ivagljiva Apr 24, 2020
b4cc0e1
missing quote
ivagljiva Apr 24, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -20,3 +20,5 @@ diamond-log-file.txt
anvio/data/misc/SCG_TAXONOMY/GTDB/SCG_SEARCH_DATABASES/*.dmnd

anvio/tests/sandbox/test_visualize_split_coverages/TEST_OUTDIR
anvio/data/misc/KEGG/
anvio/data/misc/Pfam/
ivagljiva marked this conversation as resolved.
Show resolved Hide resolved
59 changes: 56 additions & 3 deletions anvio/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -689,6 +689,14 @@ def get_args(parser):
"program with superuser privileges. If you don't have superuser privileges, then you can "
"use this parameter to tell anvi'o the location you wish to use to setup your database."}
),
'kegg-data-dir': (
['--kegg-data-dir'],
{'default': None,
'type': str,
'help': "The directory path for your KEGG setup, which will include things like \
KOfam profiles and KEGG MODULE data. Anvi'o will try to use the default path\
if you do not specify anything."}
),
'hide-outlier-SNVs': (
['--hide-outlier-SNVs'],
{'default': False,
Expand All @@ -701,6 +709,12 @@ def get_args(parser):
"up using this flag) (plus, there may or may not be some historical data on this here: "
"https://github.com/meren/anvio/issues/309)."}
),
'hmmer-program': (
['--hmmer-program'],
{'type': str,
'required': False,
'help': "Which of the HMMER programs to use to run HMMs (ie, hmmscan, hmmsearch)"}
),
ivagljiva marked this conversation as resolved.
Show resolved Hide resolved
'hmm-source': (
['--hmm-source'],
{'metavar': 'SOURCE NAME',
Expand Down Expand Up @@ -2265,6 +2279,41 @@ def get_args(parser):
'help': "Provide if working with INSeq/Tn-Seq genomic data. With this, all gene level "
"coverage stats will be calculated using INSeq/Tn-Seq statistical methods."}
),
'module-completion-threshold': (
['--module-completion-threshold'],
{'default': 0.75,
'metavar': 'NUM',
'type': float,
'help': "This threshold defines the point at which we consider a KEGG module to be 'complete' or "
"'present' in a given genome or bin. It is the fraction of steps that must be complete in "
" in order for the entire module to be marked complete. The default is %(default)g."}
),
'get-raw-data-as-json': (
['--get-raw-data-as-json'],
{'default': None,
'metavar': 'FILENAME_PREFIX',
'type': str,
'help': "If you want the raw metabolism estimation data dictionary in JSON-format, provide a filename prefix to this argument."
"The program will then output a file with the .json extension containing this data."}
),
'store-json-without-estimation': (
['--store-json-without-estimation'],
{'default': False,
'action': 'store_true',
'help': "This flag is used to control what is stored in the JSON-formatted metabolism data dictionary. When this flag is provided alongside the "
"--get-raw-data-as-json flag, the JSON file will be created without running metabolism estimation, and "
"that file will consequently include only information about KOfam hits and gene calls. The idea is that you can "
"then modify this file as you like and re-run this program using the flag --estimate-from-json."}
),
'estimate-from-json': (
['--estimate-from-json'],
{'default': None,
'metavar': 'FILE_PATH',
'type': str,
'help': "If you have a JSON file containing KOfam hits and gene call information from your contigs database "
"(such as a file produced using the --get-raw-data-as-json flag), you can provide that file to this flag "
"and KEGG metabolism estimates will be computed from the information within instead of from a contigs database."}
),
}

# two functions that works with the dictionary above.
Expand Down Expand Up @@ -2300,7 +2349,8 @@ def set_version():
t.genes_db_version, \
t.auxiliary_data_version, \
t.genomes_storage_vesion, \
t.structure_db_version
t.structure_db_version, \
t.kegg_modules_db_version


def get_version_tuples():
Expand All @@ -2311,7 +2361,8 @@ def get_version_tuples():
("Auxiliary data storage version", __auxiliary_data_version__),
("Pan DB version", __pan__version__),
("Genome data storage version", __genomes_storage_version__),
("Structure DB version", __structure__version__)]
("Structure DB version", __structure__version__),
("Kegg Modules DB version", __kegg_modules_version__)]
ivagljiva marked this conversation as resolved.
Show resolved Hide resolved


def print_version():
Expand All @@ -2322,6 +2373,7 @@ def print_version():
run.info("Genome data storage version", __genomes_storage_version__)
run.info("Auxiliary data storage version", __auxiliary_data_version__)
run.info("Structure DB version", __structure__version__)
run.info("Kegg Modules DB version", __kegg_modules_version__)


__version__, \
Expand All @@ -2332,7 +2384,8 @@ def print_version():
__genes__version__, \
__auxiliary_data_version__, \
__genomes_storage_version__ , \
__structure__version__ = set_version()
__structure__version__, \
__kegg_modules_version__ = set_version()


if '-v' in sys.argv or '--version' in sys.argv:
Expand Down
10 changes: 7 additions & 3 deletions anvio/constants.py
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,7 @@

max_num_items_for_hierarchical_clustering = 20000

# max coverage depth to read from BAM files using pysam.
# max coverage depth to read from BAM files using pysam.
# this parameter also can be set later using command line parameters
# we use uint16 as dtype for numpy arrays when we work on & store coverages
# which has limit of 65536, so this constant needs to be smaller than that.
Expand Down Expand Up @@ -163,8 +163,8 @@
'Val': {"C":5, "H":11, "N":1, "O":2, "S":0}})

# taken from http://prowl.rockefeller.edu/aainfo/volume.htm
# volume reference: A.A. Zamyatin, Protein Volume in Solution, Prog. Biophys. Mol. Biol. 24(1972)107-123.
# surface area reference: C. Chotia, The Nature of the Accessible and Buried Surfaces in Proteins, J. Mol. Biol., 105(1975)1-14.
# volume reference: A.A. Zamyatin, Protein Volume in Solution, Prog. Biophys. Mol. Biol. 24(1972)107-123.
# surface area reference: C. Chotia, The Nature of the Accessible and Buried Surfaces in Proteins, J. Mol. Biol., 105(1975)1-14.
AA_geometry = Counter({'Ala': {"volume":88.6, "area":115},
'Arg': {"volume":173.4, "area":225},
'Asn': {"volume":111.1, "area":150},
Expand Down Expand Up @@ -369,3 +369,7 @@ def get_codon_to_num_lookup(reverse_complement=False):
nt_to_RC_num_lookup = get_nt_to_num_lookup({'A': 3, 'C': 2, 'G': 1, 'T': 0, 'N': 4})
codon_to_num_lookup = get_codon_to_num_lookup(reverse_complement=False)
codon_to_RC_num_lookup = get_codon_to_num_lookup(reverse_complement=True)


# KEGG setup constant - used to warn user that the KEGG MODULES.db data may need to be updated
KEGG_SETUP_INTERVAL = 90 # days since last MODULES.db creation
33 changes: 25 additions & 8 deletions anvio/db.py
Original file line number Diff line number Diff line change
Expand Up @@ -450,7 +450,7 @@ def get_table_as_dict(self, table_name, table_structure=None, string_the_key=Fal
# entry assigns a new `entry_id`, enters the data. it is all good when there is a single process doing it.
# but when there are multiple processes running in parallel, sometimes race conditions occur: two processes
# learn the max entry id about the same time, and when they finally enter the data to the db, some entries
# end up not being unique. this is a toughie because sometimes entry ids are used to connect distinct
# end up not being unique. this is a toughie because sometimes entry ids are used to connect distinct
# information from different tables, so they must be known before the data goes into the database, etc.
# when these race conditions occur, anvi'o gives an error telling the user kindly that they are fucked. but in
# some cases it is possible to recover from that (THE CODE BELOW TRIES TO DO THAT) by reassigning all ids on the
Expand Down Expand Up @@ -593,13 +593,17 @@ def get_table_as_dataframe(self, table_name, where_clause=None, columns_of_inter
return results_df[columns_of_interest]


def get_some_rows_from_table_as_dict(self, table_name, where_clause, error_if_no_data=True, string_the_key=False):
def get_some_rows_from_table_as_dict(self, table_name, where_clause, error_if_no_data=True, string_the_key=False, row_num_as_key=False):
"""This is similar to get_table_as_dict, but much less general.

get_table_as_dict can do a lot, but it first reads all data into the memory to operate on it.
In some cases the programmer may like to access to only a small fraction of entries in a table
by using `WHERE column = value` notation, which is not possible with the more generalized
function."""
function.

row_num_as_key bool added as parameter so this function works for KEGG MODULES.db, which does not have unique IDs in the
first column. If True, the returned dictionary will be keyed by integers from 0 to (# rows returned - 1)
ivagljiva marked this conversation as resolved.
Show resolved Hide resolved
"""

results_dict = {}

Expand All @@ -608,16 +612,29 @@ def get_some_rows_from_table_as_dict(self, table_name, where_clause, error_if_no

rows = self._exec('''SELECT * FROM %s WHERE %s''' % (table_name, where_clause)).fetchall()

row_num = 0
for row in rows:
entry = {}

for i in columns_to_return[1:]:
entry[table_structure[i]] = row[i]
if row_num_as_key:
entry[table_structure[0]] = row[0]
for i in columns_to_return[1:]:
entry[table_structure[i]] = row[i]

if string_the_key:
results_dict[str(row[0])] = entry
if string_the_key:
results_dict[str(row_num)] = entry
else:
results_dict[row_num] = entry
else:
results_dict[row[0]] = entry
for i in columns_to_return[1:]:
entry[table_structure[i]] = row[i]

if string_the_key:
results_dict[str(row[0])] = entry
else:
results_dict[row[0]] = entry

row_num += 1

if error_if_no_data and not len(results_dict):
raise ConfigError("Query on %s with the where clause of '%s' did not return anything." % (table_name, where_clause))
Expand Down
4 changes: 2 additions & 2 deletions anvio/dbops.py
Original file line number Diff line number Diff line change
Expand Up @@ -502,7 +502,7 @@ def init_functions(self, requested_sources=[], dont_panic=False):
if gene_callers_id not in self.gene_function_calls_dict:
self.gene_function_calls_dict[gene_callers_id] = dict([(s, None) for s in self.gene_function_call_sources])

if self.gene_function_calls_dict[gene_callers_id][source]:
if self.gene_function_calls_dict[gene_callers_id][source] and e_value:
if self.gene_function_calls_dict[gene_callers_id][source][2] < e_value:
# 'what we have:', self.gene_function_calls_dict[gene_callers_id][source]
# 'rejected :', ('%s :: %s' % (function if function else 'unknown', accession), e_value)
Expand Down Expand Up @@ -3959,7 +3959,7 @@ def create(self, args):
def compress_nt_position_info(self, contig_length, genes_in_contig, genes_in_contigs_dict):
"""This function compresses information regarding each nucleotide position in a given contig
into a small int. Every nucleotide position is represented by four bits depending on whether
they occur in a complete opoen reading frame, and which base they correspond to in a codon.
they occur in a complete open reading frame, and which base they correspond to in a codon.

0000
||||
Expand Down
92 changes: 57 additions & 35 deletions anvio/drivers/hmmer.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
import gzip
import shutil
from threading import Thread, Lock
import glob
ivagljiva marked this conversation as resolved.
Show resolved Hide resolved

import anvio
import anvio.utils as utils
Expand All @@ -30,26 +31,54 @@


class HMMer:
def __init__(self, target_files_dict, num_threads_to_use=1, progress=progress, run=run):
def __init__(self, target_files_dict, num_threads_to_use=1, program_to_use="hmmscan", progress=progress, run=run):
"""A class to streamline HMM runs."""
ivagljiva marked this conversation as resolved.
Show resolved Hide resolved
self.num_threads_to_use = num_threads_to_use
self.program_to_use = program_to_use
self.progress = progress
self.run = run

self.tmp_dirs = []
self.target_files_dict = {}

acceptable_programs = ["hmmscan", "hmmsearch"]
if self.program_to_use not in acceptable_programs:
raise ConfigError("HMMer class here. You are attemptimg to use the program %s to run HMMs, but we don't recognize it. The currently"
" supported programs are: %s" % (self.program_to_use, ", ".join(acceptable_programs)))
ivagljiva marked this conversation as resolved.
Show resolved Hide resolved

for source in target_files_dict:
tmp_dir = filesnpaths.get_temp_directory_path()
self.tmp_dirs.append(tmp_dir)

part_file_name = os.path.join(tmp_dir, os.path.basename(target_files_dict[source]))

# create splitted fasta files inside tmp directory
self.target_files_dict[source] = utils.split_fasta(target_files_dict[source],
self.target_files_dict[source] = utils.split_fasta(target_files_dict[source],
parts=self.num_threads_to_use,
prefix=part_file_name)

def verify_hmmpress_output(self, hmm_path):
"""This function verifies that the HMM profiles located at hmm_path have been successfully hmmpressed.

What this means is that every .hmm profile in the directory has an associated .h3f, .h3i, .h3m, and
.h3p file.

PARAMETERS
==========
hmm_path string, the path at which the HMM profiles are located

"""

for file_path in glob.glob(os.path.join(hmm_path, '*.hmm')):
base_path = file_path[:-3]
expected_extensions = ['h3f', 'h3i', 'h3m', 'h3p']
for ext in expected_extensions:
if not os.path.exists(base_path + ext):
raise ConfigError("It appears that hmmpress was not properly run on the hmm profiles at %s. The \
file %s does not exist. It is likely that you will have to set up your profiles \
again by running a program such as `anvi-setup-pfams` or `anvi-setup-kegg-kofams`. \
We are very sorry about this." % (hmm_path, base_path + ext))
ivagljiva marked this conversation as resolved.
Show resolved Hide resolved


def run_hmmscan(self, source, alphabet, context, kind, domain, num_genes_in_model, hmm, ref, noise_cutoff_terms):
target = ':'.join([alphabet, context])
Expand All @@ -70,40 +99,24 @@ def run_hmmscan(self, source, alphabet, context, kind, domain, num_genes_in_mode
self.run.info('Context', context)
self.run.info('Domain', domain if domain else 'N\\A')
self.run.info('HMM model path', hmm)
self.run.info('Number of genes', num_genes_in_model)
self.run.info('Number of genes in HMM model', num_genes_in_model)
self.run.info('Noise cutoff term(s)', noise_cutoff_terms)
self.run.info('Number of CPUs will be used for search', self.num_threads_to_use)
if alphabet in ['DNA', 'RNA']:
self.run.info('HMMer program used for search', 'nhmmscan')
else:
self.run.info('HMMer program used for search', self.program_to_use)

# we want to create hmm files in the same direcotry
tmp_dir = os.path.dirname(self.target_files_dict[target][0])
log_file_path = os.path.join(tmp_dir, '00_log.txt')
log_file_path = os.path.join(tmp_dir, '*_log')
ivagljiva marked this conversation as resolved.
Show resolved Hide resolved

self.run.info('Temporary work dir', tmp_dir)
self.run.info('Log file', log_file_path)

self.progress.new('Unpacking the model into temporary work directory')
self.progress.update('...')
hmm_file_path = os.path.join(tmp_dir, source + '_hmm.txt')
hmm_file = open(hmm_file_path, 'wb')
hmm_file.write(gzip.open(hmm, 'rb').read())
hmm_file.close()
self.progress.end()
self.run.info('Log files', log_file_path)

self.progress.new('Processing')
self.progress.update('Compressing the pfam model')

cmd_line = ['hmmpress', hmm_file_path]
ret_val = utils.run_command(cmd_line, log_file_path)

if ret_val:
raise ConfigError("The last call did not work quite well. Most probably the version of HMMER you have "
"installed is either not up-to-date enough, or too new :/ Just to make sure what went "
"wrong please take a look at the log file ('%s'). Please visit %s to see what "
"is the latest version availalbe if you think updating HMMER can resolve it. You can "
"learn which version of HMMER you have on your system by typing 'hmmpress -h'."\
% (log_file_path, 'http://hmmer.janelia.org/download.html'))
self.progress.end()

# check if all hmmpress files are in the HMM directory
self.verify_hmmpress_output(hmm)
# we may want to throw a more descriptive error *here* instead of failing in the verify function


workers = []
Expand All @@ -119,17 +132,28 @@ def run_hmmscan(self, source, alphabet, context, kind, domain, num_genes_in_mode
"Anvi'o will use %s process with %s cores each instead. I hope thats okay for you. " %
(str(self.num_threads_to_use), str(num_parts), target, str(num_parts), cores_per_process))

if alphabet in ['DNA', 'RNA'] and self.program_to_use == 'hmmsearch':
self.run.warning("You requested to use the program `%s`, but because you are working with %s sequences Anvi'o will use `nhmmscan` instead. "
"We hope that is alright." % (self.program_to_use, alphabet))


for part_file in self.target_files_dict[target]:
log_file = part_file + '_log'
output_file = part_file + '_output'
shitty_file = part_file + '_shitty'

cmd_line = ['nhmmscan' if alphabet in ['DNA', 'RNA'] else 'hmmscan',
'-o', output_file, *noise_cutoff_terms.split(),
'--cpu', cores_per_process,
'--tblout', shitty_file,
hmm_file_path, part_file]
if noise_cutoff_terms:
cmd_line = ['nhmmscan' if alphabet in ['DNA', 'RNA'] else self.program_to_use,
'-o', output_file, *noise_cutoff_terms.split(),
'--cpu', cores_per_process,
'--tblout', shitty_file,
hmm, part_file]
else: # if we didn't pass any noise cutoff terms, here we don't include them in the command line
cmd_line = ['nhmmscan' if alphabet in ['DNA', 'RNA'] else self.program_to_use,
ivagljiva marked this conversation as resolved.
Show resolved Hide resolved
'-o', output_file,
'--cpu', cores_per_process,
'--tblout', shitty_file,
hmm, part_file]

t = Thread(target=self.hmmscan_worker, args=(part_file,
cmd_line,
Expand Down Expand Up @@ -199,5 +223,3 @@ def hmmscan_worker(self, part_file, cmd_line, shitty_output_file, log_file, merg
def clean_tmp_dirs(self):
for tmp_dir in self.tmp_dirs:
shutil.rmtree(tmp_dir)


Loading