ValueError: basic_string::substr in MinHash.add_sequence #502
When run locally, I get a similar error:
✘ Wed 20 Jun - 21:05 ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison origin ☊ master 3☀ 6●
ubuntu@olgabot-reflow sourmash info
sourmash version 2.0.0a7
- loaded from path: /home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash
✘ Wed 20 Jun - 21:07 ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison origin ☊ master 3☀ 6●
ubuntu@olgabot-reflow sourmash compute --track-abundance --protein --dna --scaled 100 --ksizes 21,33,51 --merge 'G14-MAA001857-3_38_F-1-1|tissue:Pancreas|subtissue:Endocrine|cell_ontology_class:pancreatic_A_cell|free_annotation:pancreatic_A_cell' --output G14-MAA001857-3_38_F-1-1.signature G14-MAA001857-3_38_F-1-1.trimmed
setting num_hashes to 0 because --scaled is set
computing signatures for files: G14-MAA001857-3_38_F-1-1.trimmed
Computing signature for ksizes: [21, 33, 51]
Computing both DNA and protein signatures.
Computing a total of 6 signature(s).
Tracking abundance of input k-mers.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed
... G14-MAA001857-3_38_F-1-1.trimmed 900000
Traceback (most recent call last):
File "/home/ubuntu/anaconda/bin/sourmash", line 6, in <module>
sys.exit(sourmash.__main__.main())
File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/__main__.py", line 76, in main
cmd(sys.argv[2:])
File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/commands.py", line 287, in compute
args.input_is_protein, args.check_sequence)
File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/commands.py", line 197, in add_seq
E.add_sequence(seq, not check_sequence)
File "sourmash/_minhash.pyx", line 178, in sourmash._minhash.MinHash.add_sequence
ValueError: basic_string::substr
It seems interesting that with either file, the error happened after viewing kmer #
|
Hmm it seems that something is off with checking the sequence because adding
Wed 20 Jun - 21:19 ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison origin ☊ master 3☀ 6●
ubuntu@olgabot-reflow sourmash compute --track-abundance --protein --dna --scaled 100 --ksizes 21,33,51 --merge 'G14-MAA001857-3_38_F-1-1|tissue:Pancreas|subtissue:Endocrine|cell_ontology_class:pancreatic_A_cell|free_annotation:pancreatic_A_cell' --output G14-MAA001857-3_38_F-1-1.signature --check-sequence G14-MAA001857-3_38_F-1-1.trimmed
setting num_hashes to 0 because --scaled is set
computing signatures for files: G14-MAA001857-3_38_F-1-1.trimmed
Computing signature for ksizes: [21, 33, 51]
Computing both DNA and protein signatures.
Computing a total of 6 signature(s).
Tracking abundance of input k-mers.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed
Traceback (most recent call last):
File "/home/ubuntu/anaconda/bin/sourmash", line 6, in <module>
sys.exit(sourmash.__main__.main())
File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/__main__.py", line 76, in main
cmd(sys.argv[2:])
File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/commands.py", line 287, in compute
args.input_is_protein, args.check_sequence)
File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/commands.py", line 197, in add_seq
E.add_sequence(seq, not check_sequence)
File "sourmash/_minhash.pyx", line 178, in sourmash._minhash.MinHash.add_sequence
ValueError: invalid DNA character in input: C |
For all of these, here is the input, trimmed fastq file:
|
Well, that just seems bad. We did fix lowercase problems in #480, but that
doesn't seem to be the problem here.
|
On Wed, Jun 20, 2018 at 03:40:54PM -0700, Olga Botvinnik wrote:
> For all of these, here is the input, trimmed fastq file: `s3://olgabot-maca/temp/G14-MAA001857-3_38_F-1-1.trimmed` (716 MB)
I don't appear to have access to this file; could you drop me the public
download link? thx!
|
Sorry about that! The S3 prefix is public now. Here is an HTTP download link too, just in case. |
It looks like the problem is with this sequence:
which has an invalid character (an 'N') in it. (The error message is reporting the wrong character, and I'll fix it when I track it down.) |
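For anyone who wants to locate such reads in their own input, here is a minimal sketch of a scan for bases outside A, C, G and T. It is not part of sourmash: the function name `find_invalid_reads` is hypothetical, and the code assumes strict four-line FASTQ records (header, sequence, `+`, quality) with no line wrapping.

```python
import io

VALID_DNA = set("ACGT")


def find_invalid_reads(handle, valid=VALID_DNA):
    """Yield (read_name, position, base) for the first invalid base in each read.

    Hypothetical helper; assumes unwrapped four-line FASTQ records.
    """
    name = None
    for line_number, line in enumerate(handle):
        field = line_number % 4
        if field == 0:
            # Header line of the record, e.g. "@read1"
            name = line.rstrip("\n")
        elif field == 1:
            # Sequence line: report the first character outside the alphabet
            sequence = line.strip().upper()
            for position, base in enumerate(sequence):
                if base not in valid:
                    yield name, position, base
                    break


# Tiny in-memory example; a real file would be opened with open(path).
example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGNACGT\n+\nIIIIIIII\n"
)
for name, position, base in find_invalid_reads(example):
    print(name, position, base)
# prints: @read2 3 N
```

Running this over a trimmed FASTQ file should list each read that contains an ambiguous base, which makes it easier to compare against the character an error message reports.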
Fixed in #503. |
Ah okay great! For the offending
Should the order be:
Option A
Option B
|
I would recommend turning off --check-sequence, which will simply ignore
k-mers containing unknown characters.
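
To illustrate the difference between the two behaviours, here is a rough pure-Python sketch. It is not sourmash's actual implementation: `iter_kmers` and its `skip_invalid` parameter are hypothetical names, and the error text only mimics the message in the traceback above. In skip mode, any k-mer that overlaps an unknown character is dropped; in strict mode, the first such k-mer raises.

```python
VALID_DNA = set("ACGT")


def iter_kmers(sequence, ksize, skip_invalid=True):
    """Yield k-mers from a DNA sequence (hypothetical helper, not sourmash API).

    skip_invalid=True  -> silently drop k-mers containing non-ACGT characters
    skip_invalid=False -> raise ValueError on the first such k-mer
    """
    sequence = sequence.upper()
    for start in range(len(sequence) - ksize + 1):
        kmer = sequence[start:start + ksize]
        bad = [base for base in kmer if base not in VALID_DNA]
        if bad:
            if skip_invalid:
                continue
            raise ValueError("invalid DNA character in input: " + bad[0])
        yield kmer


# The single 'N' removes the three k-mers that overlap it.
print(list(iter_kmers("ACGNACGT", 3)))
# ['ACG', 'ACG', 'CGT']
```

With `skip_invalid=False`, the same call raises a `ValueError` naming `N`, which is the kind of failure the strict check produces on reads containing ambiguous bases.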
|
Got it. I'm still getting the Here's the install command:
And here's the
|
I split the file into 10k seq pieces (40k lines) and
After hunting down the error and
|
Note: this only happens with
|
Any suggestions? This problem still only occurs for me when I'm fine computing only DNA signatures for now but I'd like to do protein in the future, too. |
Hello! I'm using the container
docker pull quay.io/biocontainers/sourmash:2.0.0a7--py27_0
to run sourmash compute directly on trim-low-abund.py reads and am getting this strange error:
If I use the master branch and build with this dockerfile, then I get a different error but on the same line (178 in sourmash/_minhash.pyx):
For reference, here is the trimmed fastq file:
s3://olgabot-maca/temp/A10_B000419_S34.trimmed
Do you know what may be happening?