-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ambiguous Nucleotide handling question? #137
Comments
So it seems in the code that if an N is found anywhere in a read, it is thrown out. This is done on a read level, not on a kmer level. Is this really the right behavior? Shouldn't the kmers that do not have an "N" from a read that includes an "N" still be kept? |
Sorry for the delayed reply to this issue; I wanted to check it for myself
before answering.
In sourmash 1.0 I believe it complained loudly about any sequence that didn't
have only ACTG in it.
In more recent code, we updated it. From a skim of the code and my memory that
this is where we left it, the answer seems to be:
* `kmer_min_hash.hh`, function `_checkdna` will force non-ACTG to N
* `khmer_min_hash.hh`, function `_revcomp` will complain if asked to
revcomp anything other than ACTGN.
This is enforced on the code base in `test_dna.py::test_basic_dna_bad_force`.
HTH!
(We're going through a big frustrating refactor of this over in khmer, so
I am prone to confusion on this front -- I'll re-confirm when I next get back
to looking at sourmash!)
|
No sweat. Thanks for responding this time. I think the is that the current handling is not correct. Basically, the check has to take place at the kmer level, not the sequence level. Unfortunately, there is no easy way to work around this on a preprocessing level. It really needs to be handled in the code. The easiest way to handle this is to move the the check in the code. For correctness, that should probably done right now. I'll check the code and see if it is easy enough to make a pull request for this. However, there is also an opportunity to skip the counter ahead based on encountering non ATGC letters, which might improve efficiency (maybe). That would take a bit more work to do, because some new test code is required and there are some details here. That part could be postponed to later. What do you think? |
Could you explain in a bit more detail what you believe the behavior should
be? thanks!
|
I think that reads or fasta sequences that have ambiguous nucleotides should not be thrown out. However, the specific kmers in these sequences that include the ambiguous nucleotides should be thrown out. For example, take the human genome. I think all the chromosomes in the reference assembly contain some N's. We do not want to throw them all out though, or we would be left with no sequences! Instead, we want to iterate over all the kmers in the sequences, keeping the kmers that are not ambiguous, but throwing out those that are ambiguous. Does that make sense? |
And it looks easy to make this change (though it may not be the most efficient). So I can make a pull request if you like. Later on, if you want, you can try and make it more efficient. |
Sorry for the confusion -- the current code forces non-ACTGN to N, and then consumes all those k-mers. This is different from skipping the k-mers that are ambiguous, of course :). |
Exactly, and that is not (in my view) sensible behavior. It means that the degree of ambiguity in two datasets will influence their similarity. There is no good scientific reason (other than to study ambiguity in reads itself) to program this behavior in. |
On Sun, Feb 26, 2017 at 02:18:50PM -0800, swamidass wrote:
Exactly, and that is not (in my view) sensible behavior. It means that the degree of ambiguity in two datasets will influence their similarity. There is no good scientific reason (other than to study ambiguity in reads itself) to program this behavior in.
I don't strongly disagree with that statement :).
I do think there are other considerations -- see
dib-lab/khmer#1541 for a discussion on a different
project with different (but overlapping) goals -- and I am wary of settling on
The Answer without some more use cases in mind.
Note that it is relatively easy to preprocess input data to achieve the desired
result; you can split on [^ACTG] when feeding things in programmatically,
and this will yield the same result as ignoring non-pure-ACGT k-mers.
|
True, but that is a pretty messy solution and it leaves sourmash with a poorly documented (undocumented?) behavior that is far from ideal. It is one of the worst type of behaviors because it can cause sourmash to "fail silently" with non-obvious behavior that does not match the naive user's expectations. I suggest that we make the change now to a more sensible default. Then, if khmer or others request a different solution, you can build that in as a documented option. The only downside of making the change is that signatures need to be recomputed. For this reason, it might require a version increment to 1.2. |
On Sun, Feb 26, 2017 at 02:29:58PM -0800, swamidass wrote:
True, but that is a pretty messy solution and it leaves sourmash with a poorly documented (undocumented?) behavior that is far from ideal. It is one of the worst type of behaviors because it can cause sourmash to "fail silently" with non-obvious behavior that does not match the naive user's expectations.
I don't think it's a messy solution - the main thing I've found in this space
is that there are no *clean* ones, TBH :)
I suggest that we make the change now to a more sensible default. Then, if khmer or others request a different solution, you can build that in as a documented option.
No objection to making the proposed change. khmer doesn't use code from
sourmash so I was really just pointing out that life is complicated, and that
your/my/our expectations of "sensible default" may not match others, and for
good reasons!
Re sensible defaults, it would be good not to do something too different from
what 'mash' does - I don't recall what their code does, tho.
The only downside of making the change is that signatures need to be recomputed. For this reason, it might require a version increment to 1.2.
The next released version will have to be 2.0 in any case, as we have changed
quite a few other things.
If you submit a PR, we'll take a look at it soon; otherwise, we'll include this
issue as we get to it ;).
|
According to mash developers (marbl/Mash#46), they throw out the ambiguous kmers like I am requesting here. So this change brings sourmash inline with them Though their code is harder for me to quickly understand, so I cannot verify for myself in their code. |
Okay, made a pull request. I think it should work. |
#138 fixes this, and will hopefully be merged in. |
How are ambiguous nucleotides (e.g. N) handled? Are the kmer's containing this filtered out?
The text was updated successfully, but these errors were encountered: