Ambiguous Nucleotide handling question? #137

swamidass · 2017-02-24T17:28:14Z

How are ambiguous nucleotides (e.g. N) handled? Are the kmer's containing this filtered out?

swamidass · 2017-02-26T21:37:53Z

So it seems in the code that if an N is found anywhere in a read, it is thrown out. This is done on a read level, not on a kmer level. Is this really the right behavior? Shouldn't the kmers that do not have an "N" from a read that includes an "N" still be kept?

ctb · 2017-02-26T21:46:34Z

Sorry for the delayed reply to this issue; I wanted to check it for myself before answering. In sourmash 1.0 I believe it complained loudly about any sequence that didn't have only ACTG in it. In more recent code, we updated it. From a skim of the code and my memory that this is where we left it, the answer seems to be: * `kmer_min_hash.hh`, function `_checkdna` will force non-ACTG to N * `khmer_min_hash.hh`, function `_revcomp` will complain if asked to revcomp anything other than ACTGN. This is enforced on the code base in `test_dna.py::test_basic_dna_bad_force`. HTH! (We're going through a big frustrating refactor of this over in khmer, so I am prone to confusion on this front -- I'll re-confirm when I next get back to looking at sourmash!)

swamidass · 2017-02-26T22:03:31Z

No sweat. Thanks for responding this time.

I think the is that the current handling is not correct. Basically, the check has to take place at the kmer level, not the sequence level. Unfortunately, there is no easy way to work around this on a preprocessing level. It really needs to be handled in the code.

The easiest way to handle this is to move the the check in the code. For correctness, that should probably done right now. I'll check the code and see if it is easy enough to make a pull request for this.

However, there is also an opportunity to skip the counter ahead based on encountering non ATGC letters, which might improve efficiency (maybe). That would take a bit more work to do, because some new test code is required and there are some details here. That part could be postponed to later.

What do you think?

ctb · 2017-02-26T22:07:11Z

Could you explain in a bit more detail what you believe the behavior should be? thanks!

swamidass · 2017-02-26T22:14:52Z

I think that reads or fasta sequences that have ambiguous nucleotides should not be thrown out. However, the specific kmers in these sequences that include the ambiguous nucleotides should be thrown out.

For example, take the human genome. I think all the chromosomes in the reference assembly contain some N's. We do not want to throw them all out though, or we would be left with no sequences! Instead, we want to iterate over all the kmers in the sequences, keeping the kmers that are not ambiguous, but throwing out those that are ambiguous.

Does that make sense?

swamidass · 2017-02-26T22:16:07Z

And it looks easy to make this change (though it may not be the most efficient). So I can make a pull request if you like. Later on, if you want, you can try and make it more efficient.

ctb · 2017-02-26T22:16:43Z

Sorry for the confusion --

the current code forces non-ACTGN to N, and then consumes all those k-mers.

This is different from skipping the k-mers that are ambiguous, of course :).

swamidass · 2017-02-26T22:18:50Z

Exactly, and that is not (in my view) sensible behavior. It means that the degree of ambiguity in two datasets will influence their similarity. There is no good scientific reason (other than to study ambiguity in reads itself) to program this behavior in.

ctb · 2017-02-26T22:24:46Z

On Sun, Feb 26, 2017 at 02:18:50PM -0800, swamidass wrote: Exactly, and that is not (in my view) sensible behavior. It means that the degree of ambiguity in two datasets will influence their similarity. There is no good scientific reason (other than to study ambiguity in reads itself) to program this behavior in.

I don't strongly disagree with that statement :). I do think there are other considerations -- see dib-lab/khmer#1541 for a discussion on a different project with different (but overlapping) goals -- and I am wary of settling on The Answer without some more use cases in mind. Note that it is relatively easy to preprocess input data to achieve the desired result; you can split on [^ACTG] when feeding things in programmatically, and this will yield the same result as ignoring non-pure-ACGT k-mers.

swamidass · 2017-02-26T22:29:58Z

True, but that is a pretty messy solution and it leaves sourmash with a poorly documented (undocumented?) behavior that is far from ideal. It is one of the worst type of behaviors because it can cause sourmash to "fail silently" with non-obvious behavior that does not match the naive user's expectations.

I suggest that we make the change now to a more sensible default. Then, if khmer or others request a different solution, you can build that in as a documented option.

The only downside of making the change is that signatures need to be recomputed. For this reason, it might require a version increment to 1.2.

ctb · 2017-02-26T22:38:07Z

On Sun, Feb 26, 2017 at 02:29:58PM -0800, swamidass wrote: True, but that is a pretty messy solution and it leaves sourmash with a poorly documented (undocumented?) behavior that is far from ideal. It is one of the worst type of behaviors because it can cause sourmash to "fail silently" with non-obvious behavior that does not match the naive user's expectations.

I don't think it's a messy solution - the main thing I've found in this space is that there are no *clean* ones, TBH :)

I suggest that we make the change now to a more sensible default. Then, if khmer or others request a different solution, you can build that in as a documented option.

No objection to making the proposed change. khmer doesn't use code from sourmash so I was really just pointing out that life is complicated, and that your/my/our expectations of "sensible default" may not match others, and for good reasons! Re sensible defaults, it would be good not to do something too different from what 'mash' does - I don't recall what their code does, tho.

The only downside of making the change is that signatures need to be recomputed. For this reason, it might require a version increment to 1.2.

The next released version will have to be 2.0 in any case, as we have changed quite a few other things. If you submit a PR, we'll take a look at it soon; otherwise, we'll include this issue as we get to it ;).

swamidass · 2017-02-26T22:57:33Z

According to mash developers (marbl/Mash#46), they throw out the ambiguous kmers like I am requesting here. So this change brings sourmash inline with them Though their code is harder for me to quickly understand, so I cannot verify for myself in their code.

swamidass · 2017-02-26T23:02:40Z

Okay, made a pull request. I think it should work.

swamidass · 2017-02-28T06:08:35Z

#138 fixes this, and will hopefully be merged in.

swamidass added a commit to swamidass/sourmash that referenced this issue Feb 26, 2017

Fixed handing of ambiguous nucleotides, issue sourmash-bio#137

b6d0c7f

swamidass mentioned this issue Feb 27, 2017

Fixed handling of ambiguous nucleotides, issue #137 #138

Merged

5 tasks

ctb mentioned this issue Mar 27, 2017

Fix handling of ambiguous nucleotides #148

Merged

5 tasks

betatim closed this as completed in #138 Mar 27, 2017

swamidass mentioned this issue Nov 2, 2020

More thoughts on non-ACGT characters. dib-lab/khmer#1541

Open

apetkau mentioned this issue May 6, 2021

Look more into how sourmash handles sequences with ambiguous (N) characters apetkau/genomics-data-index#39

Open

ctb mentioned this issue Apr 1, 2024

revisit InvalidDNA error? #3104

Closed

bluegenes mentioned this issue Apr 2, 2024

manysketch: check ambiguous nucleotide handling sourmash-bio/sourmash_plugin_branchwater#297

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ambiguous Nucleotide handling question? #137

Ambiguous Nucleotide handling question? #137

swamidass commented Feb 24, 2017

swamidass commented Feb 26, 2017

ctb commented Feb 26, 2017 via email

swamidass commented Feb 26, 2017

ctb commented Feb 26, 2017 via email

swamidass commented Feb 26, 2017

swamidass commented Feb 26, 2017

ctb commented Feb 26, 2017

swamidass commented Feb 26, 2017

ctb commented Feb 26, 2017 via email

swamidass commented Feb 26, 2017

ctb commented Feb 26, 2017 via email

swamidass commented Feb 26, 2017

swamidass commented Feb 26, 2017

swamidass commented Feb 28, 2017

Ambiguous Nucleotide handling question? #137

Ambiguous Nucleotide handling question? #137

Comments

swamidass commented Feb 24, 2017

swamidass commented Feb 26, 2017

ctb commented Feb 26, 2017 via email

swamidass commented Feb 26, 2017

ctb commented Feb 26, 2017 via email

swamidass commented Feb 26, 2017

swamidass commented Feb 26, 2017

ctb commented Feb 26, 2017

swamidass commented Feb 26, 2017

ctb commented Feb 26, 2017 via email

swamidass commented Feb 26, 2017

ctb commented Feb 26, 2017 via email

swamidass commented Feb 26, 2017

swamidass commented Feb 26, 2017

swamidass commented Feb 28, 2017