Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] update to sourmash 4.1.0 #171

Merged
merged 35 commits into from
May 24, 2021
Merged

[MRG] update to sourmash 4.1.0 #171

merged 35 commits into from
May 24, 2021

Conversation

taylorreiter
Copy link
Member

@taylorreiter taylorreiter commented May 18, 2021

Update to 4.1.0 version of sourmash, which should make the gather step ~faster. I left it as pip install so it's easy to update to other sourmash versions that may happen to not be in conda in the future
@ctb ready for review
Sorry...Not ready for review 😬

@taylorreiter taylorreiter changed the title [MRG] update to sourmash 4.1.0 [WIP] update to sourmash 4.1.0 May 18, 2021
@taylorreiter
Copy link
Member Author

taylorreiter commented May 19, 2021

One of the tests that is still failing (tests/test_compare_taxonomy.py::test_basic) is because the loomba results are supposed to look like this:

genome,filter_at,override_filter_at,total_bad_bp,superkingdom_bad_bp,phylum_bad_bp,class_bad_bp,order_bad_bp,family_bad_bp,genus_bad_bp,f_ident,f_major,lineage,comment
LoombaR_2017__SID1050_bax__bin.11.fa.gz,genus,,12351,0,0,0,7286,9347,12351,0.959,0.764,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Acutalibacteraceae;g__Anaeromassilibacillus,

and end up looking like this:

genome,filter_at,override_filter_at,total_bad_bp,superkingdom_bad_bp,phylum_bad_bp,class_bad_bp,order_bad_bp,family_bad_bp,genus_bad_bp,f_ident,f_major,lineage,comment
loomba,none,,0,0,0,0,0,0,0,0.000,0.000,,too few identifiable hashes; f_ident < 10%. provide a lineage for this genome.

But I don't really have any clue what change I would have made that would have led to this difference.

update -- fixed by cc4144e. Needed and ==0 so that gather didn't just break no matter what 😬

@taylorreiter
Copy link
Member Author

taylorreiter commented May 19, 2021

and the other test test_2_loomba is failing because all of the lineages are empty
update: with cc4144e lineages are no longer empty, but not all lineages match

charcoal/Snakefile Outdated Show resolved Hide resolved
charcoal/compare_taxonomy.py Outdated Show resolved Hide resolved
charcoal/compare_taxonomy.py Outdated Show resolved Hide resolved
charcoal/conf/env-sourmash.yml Outdated Show resolved Hide resolved
charcoal/utils.py Outdated Show resolved Hide resolved
environment.yml Outdated Show resolved Hide resolved
@taylorreiter
Copy link
Member Author

Ok -- I've updated the threshold_bp in prefetch to be scaled3. The test still fails...
I tried it with a threshold_bp == 0, and threshold_bp of scaled
2, and the test failed with those values as well. How should I proceed?

@ctb
Copy link
Member

ctb commented May 20, 2021 via email

@taylorreiter
Copy link
Member Author

yup! That's exactly what the test does.

Using this code from SO:

def compareJson(example_json, target_json, level=-1, show_variables=False):
  _different_variables = _parseJSON(example_json, target_json, level=level, show_variables=show_variables)
  return len(_different_variables) == 0, _different_variables

def _parseJSON(reference, target, path=[], level=-1, show_variables=False):  
  if level > 0 and len(path) == level:
    return []
  
  _different_variables = list()
  # the case that the inputs is a dict (i.e. json dict)  
  if isinstance(reference, dict):
    for _key in reference:      
      _path = path+[_key]
      try:
        _different_variables += _parseJSON(reference[_key], target[_key], _path, level, show_variables)
      except KeyError:
        _record = ''.join(['[%s]'%str(p) for p in _path])
        if show_variables:
          _record += ': %s <--> MISSING!!'%str(reference[_key])
        _different_variables.append(_record)
  # the case that the inputs is a list/tuple
  elif isinstance(reference, list) or isinstance(reference, tuple):
    for index, v in enumerate(reference):
      _path = path+[index]
      try:
        _target_v = target[index]
        _different_variables += _parseJSON(v, _target_v, _path, level, show_variables)
      except IndexError:
        _record = ''.join(['[%s]'%str(p) for p in _path])
        if show_variables:
          _record += ': %s <--> MISSING!!'%str(v)
        _different_variables.append(_record)
  # the actual comparison about the value, if they are not the same, record it
  elif reference != target:
    _record = ''.join(['[%s]'%str(p) for p in path])
    if show_variables:
      _record += ': %s <--> %s'%(str(reference), str(target))
    _different_variables.append(_record)  
  return _different_variables

There is one difference:

(False, ['[NODE_5616_length_2465_cov_28.0988][2][0][0][4][1]: f__Oscillospiraceae <--> f__Acutalibacteraceae', '[NODE_5616_length_2465_cov_28.0988][2][0][0][5][1]: g__Flavonifractor <--> g__Anaeromassilibacillus'])

which is weird if the db isn't changing, right?

charcoal/Snakefile Outdated Show resolved Hide resolved
charcoal/just_taxonomy.py Outdated Show resolved Hide resolved
charcoal/utils.py Outdated Show resolved Hide resolved
@ctb
Copy link
Member

ctb commented May 21, 2021

There is one difference:

(False, ['[NODE_5616_length_2465_cov_28.0988][2][0][0][4][1]: f__Oscillospiraceae <--> f__Acutalibacteraceae', '[NODE_5616_length_2465_cov_28.0988][2][0][0][5][1]: g__Flavonifractor <--> g__Anaeromassilibacillus'])

which is weird if the db isn't changing, right?

I check you -- I used

jq . < tests/test-data/loomba/LoombaR_2017__SID1050_bax__bin.11.fa.gz.contigs-tax.json > out.old

and did the same with a tmp copy of the new output, and then did

diff out.old out.new

and saw

2629c2629
<             "f__Acutalibacteraceae"
---
>             "f__Oscillospiraceae"
2633c2633
<             "g__Anaeromassilibacillus"
---
>             "g__Flavonifractor"

I ... have no idea what's going on here! Hmm.

@ctb
Copy link
Member

ctb commented May 21, 2021

There is one difference:

(False, ['[NODE_5616_length_2465_cov_28.0988][2][0][0][4][1]: f__Oscillospiraceae <--> f__Acutalibacteraceae', '[NODE_5616_length_2465_cov_28.0988][2][0][0][5][1]: g__Flavonifractor <--> g__Anaeromassilibacillus'])

which is weird if the db isn't changing, right?

I check you -- I used

jq . < tests/test-data/loomba/LoombaR_2017__SID1050_bax__bin.11.fa.gz.contigs-tax.json > out.old

and did the same with a tmp copy of the new output, and then did

diff out.old out.new

and saw

2629c2629
<             "f__Acutalibacteraceae"
---
>             "f__Oscillospiraceae"
2633c2633
<             "g__Anaeromassilibacillus"
---
>             "g__Flavonifractor"

I ... have no idea what's going on here! Hmm.

OK, digging into this a bit, my guess is this is because gather doesn't report ties, per sourmash-bio/sourmash#1366 and sourmash-bio/sourmash#278. It is slightly surprising in this case that the tie here is above the family level (!!) but these things happen.

So: this is a bigger issue, and I'm updating the file so that the tests pass, but we should also be sure to file a new issue on this. I think the idea is that gather_at_rank should detect (and handle?) such ties, and probably pull the taxonomic assignment up to the level above the tie. @bluegenes will probably be interested in this :).

@ctb ctb mentioned this pull request May 21, 2021
@ctb
Copy link
Member

ctb commented May 21, 2021

all tests pass with #172

@taylorreiter taylorreiter changed the title [WIP] update to sourmash 4.1.0 [MRG] update to sourmash 4.1.0 May 21, 2021
Copy link
Member

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge when ready!

@taylorreiter taylorreiter merged commit 06e1c35 into latest May 24, 2021
@taylorreiter taylorreiter deleted the tr-update-sourmash branch May 24, 2021 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants