Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ambiguous Nucleotide handling question? #137

Closed
swamidass opened this issue Feb 24, 2017 · 14 comments
Closed

Ambiguous Nucleotide handling question? #137

swamidass opened this issue Feb 24, 2017 · 14 comments

Comments

@swamidass
Copy link
Contributor

How are ambiguous nucleotides (e.g. N) handled? Are the kmer's containing this filtered out?

@swamidass
Copy link
Contributor Author

So it seems in the code that if an N is found anywhere in a read, it is thrown out. This is done on a read level, not on a kmer level. Is this really the right behavior? Shouldn't the kmers that do not have an "N" from a read that includes an "N" still be kept?

@ctb
Copy link
Contributor

ctb commented Feb 26, 2017 via email

@swamidass
Copy link
Contributor Author

No sweat. Thanks for responding this time.

I think the is that the current handling is not correct. Basically, the check has to take place at the kmer level, not the sequence level. Unfortunately, there is no easy way to work around this on a preprocessing level. It really needs to be handled in the code.

The easiest way to handle this is to move the the check in the code. For correctness, that should probably done right now. I'll check the code and see if it is easy enough to make a pull request for this.

However, there is also an opportunity to skip the counter ahead based on encountering non ATGC letters, which might improve efficiency (maybe). That would take a bit more work to do, because some new test code is required and there are some details here. That part could be postponed to later.

What do you think?

@ctb
Copy link
Contributor

ctb commented Feb 26, 2017 via email

@swamidass
Copy link
Contributor Author

I think that reads or fasta sequences that have ambiguous nucleotides should not be thrown out. However, the specific kmers in these sequences that include the ambiguous nucleotides should be thrown out.

For example, take the human genome. I think all the chromosomes in the reference assembly contain some N's. We do not want to throw them all out though, or we would be left with no sequences! Instead, we want to iterate over all the kmers in the sequences, keeping the kmers that are not ambiguous, but throwing out those that are ambiguous.

Does that make sense?

@swamidass
Copy link
Contributor Author

And it looks easy to make this change (though it may not be the most efficient). So I can make a pull request if you like. Later on, if you want, you can try and make it more efficient.

@ctb
Copy link
Contributor

ctb commented Feb 26, 2017

Sorry for the confusion --

the current code forces non-ACTGN to N, and then consumes all those k-mers.

This is different from skipping the k-mers that are ambiguous, of course :).

@swamidass
Copy link
Contributor Author

Exactly, and that is not (in my view) sensible behavior. It means that the degree of ambiguity in two datasets will influence their similarity. There is no good scientific reason (other than to study ambiguity in reads itself) to program this behavior in.

@ctb
Copy link
Contributor

ctb commented Feb 26, 2017 via email

@swamidass
Copy link
Contributor Author

True, but that is a pretty messy solution and it leaves sourmash with a poorly documented (undocumented?) behavior that is far from ideal. It is one of the worst type of behaviors because it can cause sourmash to "fail silently" with non-obvious behavior that does not match the naive user's expectations.

I suggest that we make the change now to a more sensible default. Then, if khmer or others request a different solution, you can build that in as a documented option.

The only downside of making the change is that signatures need to be recomputed. For this reason, it might require a version increment to 1.2.

@ctb
Copy link
Contributor

ctb commented Feb 26, 2017 via email

swamidass added a commit to swamidass/sourmash that referenced this issue Feb 26, 2017
@swamidass
Copy link
Contributor Author

According to mash developers (marbl/Mash#46), they throw out the ambiguous kmers like I am requesting here. So this change brings sourmash inline with them Though their code is harder for me to quickly understand, so I cannot verify for myself in their code.

@swamidass
Copy link
Contributor Author

Okay, made a pull request. I think it should work.

@swamidass
Copy link
Contributor Author

#138 fixes this, and will hopefully be merged in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants