ValueError: basic_string::substr in MinHash.add_sequence #502

Closed
olgabot opened this issue Jun 20, 2018 · 14 comments · Fixed by #503

Comments

@olgabot
Collaborator

olgabot commented Jun 20, 2018

Hello! I'm using the container quay.io/biocontainers/sourmash:2.0.0a7--py27_0 (fetched with docker pull) to run sourmash compute directly on reads trimmed with trim-low-abund.py, and I'm getting this strange error:

 Wed 20 Jun - 19:47  ~/kmer-hashing/sourmash/lung_cancer_v3   origin ☊ master ✔ 1☀ 
 ubuntu@olgabot-reflow-v5  reflow run /home/ubuntu/reflow-workflows/sourmash_compute.rf -read1=s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B000419_S34/A10_B000419_S34_R1_001.fastq.gz -read2=s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B000419_S34/A10_B000419_S34_R2_001.fastq.gz -output=s3://olgabot-maca/lung_cancer/sourmash_v3/A10_B000419_S34.signature -trim_low_abundance_kmers=true -name=A10_B000419_S34
reflow: run ID: b159d381
reflow:  <-  sourmash_compute.ComputeTrimmed ea2af250 err    exec 19s 0B
        error exec sha256:ea2af2506cee09dbe6569f3bcdad4e04c7959aa994e0a6eef8985026afb18f4a: exited with code 1
        /home/ubuntu/reflow-workflows/sourmash_compute.rf:52:6
        ec2-54-186-77-104.us-west-2.compute.amazonaws.com:9000/b3a7e8b736190ef3/ea2af2506cee09dbe6569f3bcdad4e04c7959aa994e0a6eef8985026afb18f4a
        quay.io/biocontainers/sourmash:2.0.0a7--py27_0
        command:
                        /usr/local/bin/sourmash compute \
                        --track-abundance \
                        --protein \
                        --dna \
                        --scaled 500 \
                        --ksizes 21,33,51 \
                        --name A10_B000419_S34 \
                        --output {{signature}} \
                        {{trimmed}}
        where:
            {{trimmed}} =
                . sha256:ac4e50e4e375910f97ff7847ac27b6d778daf7ff3bad53e6eae293c21df676cb 484.3MiB
        stdout:
        stderr:
setting num_hashes to 0 because --scaled is set
computing signatures for files: /arg/1/0
Computing signature for ksizes: [21, 33, 51]
Computing both DNA and protein signatures.
Computing a total of 6 signature(s).
Tracking abundance of input k-mers.
... reading sequences from /arg/1/0
... /arg/1/0 90000Traceback (most recent call last):
              File "/usr/local/bin/sourmash", line 6, in <module>
                sys.exit(sourmash.__main__.main())
              File "/usr/local/lib/python2.7/site-packages/sourmash/__main__.py", line 76, in main
                cmd(sys.argv[2:])
              File "/usr/local/lib/python2.7/site-packages/sourmash/commands.py", line 287, in compute
                args.input_is_protein, args.check_sequence)
              File "/usr/local/lib/python2.7/site-packages/sourmash/commands.py", line 197, in add_seq
                E.add_sequence(seq, not check_sequence)
              File "sourmash/_minhash.pyx", line 178, in sourmash._minhash.MinHash.add_sequence
            ValueError: basic_string::substr
        profile:
            cpu mean=7.1 max=7.6
            mem mean=9.1MiB max=9.2MiB
            disk mean=0B max=0B
            tmp mean=0B max=0B
reflow: total n=2 time=23s
        ident                           n   ncache transfer runtime(m) cpu         mem(GiB)    disk(GiB)   tmp(GiB)
        sourmash_compute.ComputeTrimmed 1   0      484.3MiB 0/0/0      7.1/7.1/7.1 0.0/0.0/0.0 0.0/0.0/0.0 0.0/0.0/0.0
        sourmash_compute.Trim           1   1      0B

evaluation error:
        exec sha256:ea2af2506cee09dbe6569f3bcdad4e04c7959aa994e0a6eef8985026afb18f4a: exited with code 1

If I build from the master branch with this Dockerfile, I get a different error message, but on the same line (178 in sourmash/_minhash.pyx):

 ✘  Wed 20 Jun - 19:48  ~/kmer-hashing/sourmash/lung_cancer_v3   origin ☊ master ✔ 1☀ 
 ubuntu@olgabot-reflow-v5  reflow run /home/ubuntu/reflow-workflows/sourmash_compute.rf -read1=s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B000419_S34/A10_B000419_S34_R1_001.fastq.gz -read2=s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B000419_S34/A10_B000419_S34_R2_001.fastq.gz -output=s3://olgabot-maca/lung_cancer/sourmash_v3/A10_B000419_S34.signature -trim_low_abundance_kmers=true -name=A10_B000419_S34
reflow: run ID: 67278fd9
reflow:  <-  sourmash_compute.ComputeTrimmed c41f4510 err    exec 10s 0B
        error exec sha256:c41f4510455966412bf670496198c11ebd680a7bce554cdda54722a502ba6f8c: exited with code 1
        /home/ubuntu/reflow-workflows/sourmash_compute.rf:52:6
        ec2-18-237-107-120.us-west-2.compute.amazonaws.com:9000/b402fcc4a562023b/c41f4510455966412bf670496198c11ebd680a7bce554cdda54722a502ba6f8c
        czbiohub/kmer-hashing
        command:
                        /opt/conda/bin/sourmash compute \
                        --track-abundance \
                        --protein \
                        --dna \
                        --scaled 500 \
                        --ksizes 21,33,51 \
                        --name A10_B000419_S34 \
                        --output {{signature}} \
                        {{trimmed}}
        where:
            {{trimmed}} = 
                . sha256:ac4e50e4e375910f97ff7847ac27b6d778daf7ff3bad53e6eae293c21df676cb 484.3MiB
        stdout:
        stderr:
setting num_hashes to 0 because --scaled is set
computing signatures for files: /arg/1/0
Computing signature for ksizes: [21, 33, 51]
Computing both DNA and protein signatures.
Computing a total of 6 signature(s).
Tracking abundance of input k-mers.
... reading sequences from /arg/1/0
... /arg/1/0 90000Traceback (most recent call last):
              File "/opt/conda/bin/sourmash", line 11, in <module>
                load_entry_point('sourmash==2.0.0a7', 'console_scripts', 'sourmash')()
              File "/opt/conda/lib/python3.6/site-packages/sourmash-2.0.0a7-py3.6-linux-x86_64.egg/sourmash/__main__.py", line 76, in main
                cmd(sys.argv[2:])
              File "/opt/conda/lib/python3.6/site-packages/sourmash-2.0.0a7-py3.6-linux-x86_64.egg/sourmash/commands.py", line 287, in compute
                args.input_is_protein, args.check_sequence)
              File "/opt/conda/lib/python3.6/site-packages/sourmash-2.0.0a7-py3.6-linux-x86_64.egg/sourmash/commands.py", line 197, in add_seq
                E.add_sequence(seq, not check_sequence)
              File "sourmash/_minhash.pyx", line 178, in sourmash._minhash.MinHash.add_sequence
            ValueError: basic_string::substr: __pos (which is 16) > this->size() (which is 15)
        profile:
            cpu mean=6.7 max=7.6
            mem mean=16.7MiB max=16.8MiB
            disk mean=0B max=0B
            tmp mean=0B max=0B
reflow: total n=2 time=2m23s
        ident                           n   ncache transfer runtime(m) cpu         mem(GiB)    disk(GiB)   tmp(GiB)
        sourmash_compute.ComputeTrimmed 1   0      484.3MiB 0/0/0      6.7/6.7/6.7 0.0/0.0/0.0 0.0/0.0/0.0 0.0/0.0/0.0
        sourmash_compute.Trim           1   1      0B                                                      

evaluation error:
        exec sha256:c41f4510455966412bf670496198c11ebd680a7bce554cdda54722a502ba6f8c: exited with code 1

For reference, here is the trimmed fastq file: s3://olgabot-maca/temp/A10_B000419_S34.trimmed

Do you know what may be happening?

@olgabot
Collaborator Author

olgabot commented Jun 20, 2018

When run locally, I get a similar error:

 ✘  Wed 20 Jun - 21:05  ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison   origin ☊ master 3☀ 6● 
 ubuntu@olgabot-reflow  sourmash info
sourmash version 2.0.0a7
- loaded from path: /home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash
 ✘  Wed 20 Jun - 21:07  ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison   origin ☊ master 3☀ 6● 
 ubuntu@olgabot-reflow  sourmash compute --track-abundance --protein --dna --scaled 100 --ksizes 21,33,51 --merge 'G14-MAA001857-3_38_F-1-1|tissue:Pancreas|subtissue:Endocrine|cell_ontology_class:pancreatic_A_cell|free_annotation:pancreatic_A_cell' --output G14-MAA001857-3_38_F-1-1.signature G14-MAA001857-3_38_F-1-1.trimmed  
setting num_hashes to 0 because --scaled is set
computing signatures for files: G14-MAA001857-3_38_F-1-1.trimmed
Computing signature for ksizes: [21, 33, 51]
Computing both DNA and protein signatures.
Computing a total of 6 signature(s).
Tracking abundance of input k-mers.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed
... G14-MAA001857-3_38_F-1-1.trimmed 900000Traceback (most recent call last):
  File "/home/ubuntu/anaconda/bin/sourmash", line 6, in <module>
    sys.exit(sourmash.__main__.main())
  File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/__main__.py", line 76, in main
    cmd(sys.argv[2:])
  File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/commands.py", line 287, in compute
    args.input_is_protein, args.check_sequence)
  File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/commands.py", line 197, in add_seq
    E.add_sequence(seq, not check_sequence)
  File "sourmash/_minhash.pyx", line 178, in sourmash._minhash.MinHash.add_sequence
ValueError: basic_string::substr

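It's interesting that with either file the error appears right after a progress-counter update (90000 in the Reflow runs above, 900000 here). Note that the counter tracks sequences read, not k-mers. Could this be a memory issue?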
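One way to tell a memory problem apart from a single bad input record is to feed records one at a time and stop at the first one that raises. The sketch below is a minimal illustration, not part of sourmash: `read_fastq` and `first_failing_record` are hypothetical helper names, and `add_sequence` stands for any callable that raises `ValueError` on a bad sequence.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from plain 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it, "").strip()
        next(it, None)  # skip the '+' separator line
        next(it, None)  # skip the quality line
        yield header.strip().lstrip("@"), seq


def first_failing_record(
    records: Iterable[Tuple[str, str]],
    add_sequence: Callable[[str], None],
) -> Optional[Tuple[int, str, str, str]]:
    """Return (index, name, sequence, error message) for the first record
    on which add_sequence raises ValueError, or None if every record passes."""
    for index, (name, seq) in enumerate(records):
        try:
            add_sequence(seq)
        except ValueError as exc:
            return index, name, seq, str(exc)
    return None
```

With sourmash installed, `add_sequence` could wrap the `add_sequence` method seen in the traceback, on a MinHash object configured like the failing run. If the same record fails every time regardless of how much data came before it, memory is an unlikely culprit.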

@olgabot
Collaborator Author

olgabot commented Jun 20, 2018

Hmm it seems that something is off with checking the sequence because adding --check-sequence gives the error that C is an invalid DNA character!!

 Wed 20 Jun - 21:19  ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison   origin ☊ master 3☀ 6● 
 ubuntu@olgabot-reflow  sourmash compute --track-abundance --protein --dna --scaled 100 --ksizes 21,33,51 --merge 'G14-MAA001857-3_38_F-1-1|tissue:Pancreas|subtissue:Endocrine|cell_ontology_class:pancreatic_A_cell|free_annotation:pancreatic_A_cell' --output G14-MAA001857-3_38_F-1-1.signature --check-sequence G14-MAA001857-3_38_F-1-1.trimmed 
setting num_hashes to 0 because --scaled is set
computing signatures for files: G14-MAA001857-3_38_F-1-1.trimmed
Computing signature for ksizes: [21, 33, 51]
Computing both DNA and protein signatures.
Computing a total of 6 signature(s).
Tracking abundance of input k-mers.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed
Traceback (most recent call last):
  File "/home/ubuntu/anaconda/bin/sourmash", line 6, in <module>
    sys.exit(sourmash.__main__.main())
  File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/__main__.py", line 76, in main
    cmd(sys.argv[2:])
  File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/commands.py", line 287, in compute
    args.input_is_protein, args.check_sequence)
  File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/commands.py", line 197, in add_seq
    E.add_sequence(seq, not check_sequence)
  File "sourmash/_minhash.pyx", line 178, in sourmash._minhash.MinHash.add_sequence
ValueError: invalid DNA character in input: C

@olgabot
Collaborator Author

olgabot commented Jun 20, 2018

For all of these, here is the input, trimmed fastq file: s3://olgabot-maca/temp/G14-MAA001857-3_38_F-1-1.trimmed (716 MB)

--protein --no-dna --scaled 10000 --ksizes 51 --> ValueError: basic_string::substr

Digging further, I found that --protein --no-dna gives the ValueError: basic_string::substr error (note that --check-sequence is also set in this run):

 ✘  Wed 20 Jun - 21:31  ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison   origin ☊ master 3☀ 6● 
 ubuntu@olgabot-reflow  sourmash compute --track-abundance --protein --no-dna --scaled 10000 --ksizes 51 --merge 'G14-MAA001857-3_38_F-1-1|tissue:Pancreas|subtissue:Endocrine|cell_ontology_class:pancreatic_A_cell|free_annotation:pancreatic_A_cell' --output G14-MAA001857-3_38_F-1-1.signature --check-sequence G14-MAA001857-3_38_F-1-1.trimmed
setting num_hashes to 0 because --scaled is set
computing signatures for files: G14-MAA001857-3_38_F-1-1.trimmed
Computing signature for ksizes: [51]
Computing only protein (and not DNA) signatures.
Computing a total of 1 signature(s).
Tracking abundance of input k-mers.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed
... G14-MAA001857-3_38_F-1-1.trimmed 900000Traceback (most recent call last):
  File "/home/ubuntu/anaconda/bin/sourmash", line 6, in <module>
    sys.exit(sourmash.__main__.main())
  File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/__main__.py", line 76, in main
    cmd(sys.argv[2:])
  File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/commands.py", line 287, in compute
    args.input_is_protein, args.check_sequence)
  File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/commands.py", line 197, in add_seq
    E.add_sequence(seq, not check_sequence)
  File "sourmash/_minhash.pyx", line 178, in sourmash._minhash.MinHash.add_sequence
ValueError: basic_string::substr

--dna --scaled 10000 --ksizes 51, without --check-sequence --> no error

With --dna only, --scaled 10000 --ksizes 51, and no --check-sequence, the signature gets computed just fine:

 ✘  Wed 20 Jun - 21:32  ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison   origin ☊ master 3☀ 6● 
 ubuntu@olgabot-reflow  sourmash compute --track-abundance --dna  --scaled 10000 --ksizes 51 --merge 'G14-MAA001857-3_38_F-1-1|tissue:Pancreas|subtissue:Endocrine|cell_ontology_class:pancreatic_A_cell|free_annotation:pancreatic_A_cell' --output G14-MAA001857-3_38_F-1-1.signature G14-MAA001857-3_38_F-1-1.trimmed
setting num_hashes to 0 because --scaled is set
computing signatures for files: G14-MAA001857-3_38_F-1-1.trimmed
Computing signature for ksizes: [51]
Computing only DNA (and not protein) signatures.
Computing a total of 1 signature(s).
Tracking abundance of input k-mers.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed
... G14-MAA001857-3_38_F-1-1.trimmed 2839105 sequences
calculated 1 signatures for 2839105 sequences taken from 1 files
saved 1 signature(s). Note: signature license is CC0.

With --check-sequence --> ValueError: invalid DNA character in input: T

However, if --check-sequence is added, then suddenly T is not a valid DNA character (!!!!)

 Wed 20 Jun - 21:35  ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison   origin ☊ master 3☀ 6● 
 ubuntu@olgabot-reflow  sourmash compute --track-abundance --dna  --scaled 10000 --ksizes 51 --merge 'G14-MAA001857-3_38_F-1-1|tissue:Pancreas|subtissue:Endocrine|cell_ontology_class:pancreatic_A_cell|free_annotation:pancreatic_A_cell' --check-sequence --output G14-MAA001857-3_38_F-1-1.signature G14-MAA001857-3_38_F-1-1.trimmed 
setting num_hashes to 0 because --scaled is set
computing signatures for files: G14-MAA001857-3_38_F-1-1.trimmed
Computing signature for ksizes: [51]
Computing only DNA (and not protein) signatures.
Computing a total of 1 signature(s).
Tracking abundance of input k-mers.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed
Traceback (most recent call last):
  File "/home/ubuntu/anaconda/bin/sourmash", line 6, in <module>
    sys.exit(sourmash.__main__.main())
  File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/__main__.py", line 76, in main
    cmd(sys.argv[2:])
  File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/commands.py", line 287, in compute
    args.input_is_protein, args.check_sequence)
  File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/commands.py", line 197, in add_seq
    E.add_sequence(seq, not check_sequence)
  File "sourmash/_minhash.pyx", line 178, in sourmash._minhash.MinHash.add_sequence
ValueError: invalid DNA character in input: T

--dna --scaled 10000 --ksizes 21,33,51 --> no error

This is computed just fine:

 ✘  Wed 20 Jun - 21:37  ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison   origin ☊ master 3☀ 6● 
 ubuntu@olgabot-reflow  sourmash compute --track-abundance --dna  --scaled 10000 --ksizes 21,33,51 --merge 'G14-MAA001857-3_38_F-1-1|tissue:Pancreas|subtissue:Endocrine|cell_ontology_class:pancreatic_A_cell|free_annotation:pancreatic_A_cell' --output G14-MAA001857-3_38_F-1-1.signature G14-MAA001857-3_38_F-1-1.trimmed            
setting num_hashes to 0 because --scaled is set
computing signatures for files: G14-MAA001857-3_38_F-1-1.trimmed
Computing signature for ksizes: [21, 33, 51]
Computing only DNA (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed
... G14-MAA001857-3_38_F-1-1.trimmed 2839105 sequences
calculated 3 signatures for 2839105 sequences taken from 1 files
saved 3 signature(s). Note: signature license is CC0.

--dna --scaled=10 --ksizes 21,33,51 --> no error

To test memory issues, I also tried a dense signature of --scaled=10, which seems to be computing fine and has been running for the past hour.

 ✘  Wed 20 Jun - 21:43  ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison   origin ☊ master 3☀ 6● 
 ubuntu@olgabot-reflow  sourmash compute --track-abundance --dna  --scaled 10 --ksizes 21,33,51 --merge 'G14-MAA001857-3_38_F-1-1|tissue:Pancreas|subtissue:Endocrine|cell_ontology_class:pancreatic_A_cell|free_annotation:pancreatic_A_cell' --output G14-MAA001857-3_38_F-1-1.signature G14-MAA001857-3_38_F-1-1.trimmed
setting num_hashes to 0 because --scaled is set
computing signatures for files: G14-MAA001857-3_38_F-1-1.trimmed
Computing signature for ksizes: [21, 33, 51]
Computing only DNA (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed
... G14-MAA001857-3_38_F-1-1.trimmed 2839105 sequences
calculated 3 signatures for 2839105 sequences taken from 1 files
saved 3 signature(s). Note: signature license is CC0.

--dna --protein --scaled 10000 --ksizes 21,33,51 --> ValueError: basic_string::substr

 Wed 20 Jun - 21:30  ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison   origin ☊ master 3☀ 6● 
 ubuntu@olgabot-reflow  sourmash compute --track-abundance --dna --protein --scaled 10000  --ksizes 21,33,51 --merge 'G14-MAA001857-3_38_F-1-1|tissue:Pancreas|subtissue:Endocrine|cell_ontology_class:pancreatic_A_cell|free_annotation:pancreatic_A_cell' --output G14-MAA001857-3_38_F-1-1.signature G14-MAA001857-3_38_F-1-1.trimmed
setting num_hashes to 0 because --scaled is set
computing signatures for files: G14-MAA001857-3_38_F-1-1.trimmed
Computing signature for ksizes: [21, 33, 51]
Computing both DNA and protein signatures.
Computing a total of 6 signature(s).
Tracking abundance of input k-mers.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed
... G14-MAA001857-3_38_F-1-1.trimmed 900000Traceback (most recent call last):
  File "/home/ubuntu/anaconda/bin/sourmash", line 6, in <module>
    sys.exit(sourmash.__main__.main())
  File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/__main__.py", line 76, in main
    cmd(sys.argv[2:])
  File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/commands.py", line 287, in compute
    args.input_is_protein, args.check_sequence)
  File "/home/ubuntu/anaconda/lib/python3.6/site-packages/sourmash/commands.py", line 197, in add_seq
    E.add_sequence(seq, not check_sequence)
  File "sourmash/_minhash.pyx", line 178, in sourmash._minhash.MinHash.add_sequence
ValueError: basic_string::substr

@ctb
Contributor

ctb commented Jun 20, 2018 via email

@ctb
Contributor

ctb commented Jun 21, 2018 via email

@olgabot
Collaborator Author

olgabot commented Jun 21, 2018

Sorry about that! The S3 prefix is public now. Here is an HTTP download link too, just in case.

@ctb
Contributor

ctb commented Jun 21, 2018

It looks like the problem is with this sequence:

CNTTAAATCAGTTATGGTTCCTTTGGTCGCTCGCTCCTCTCCTACTTGGATAACTGTGGTAATTCTAGAGCTAATACATGCCGACGGGCGCTGACCCCCC

which has an invalid character (an 'N') in it. --check-sequence correctly (in my view :) flags this as a problem.

(The error message is reporting the wrong character, and I'll fix it when I track it down.)
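To see how many reads in a file carry characters like this, a small scan can help. The following is a sketch only: `invalid_reads` is a hypothetical helper, it assumes plain 4-line FASTQ records, and it uppercases sequences before checking (an assumption about how lowercase bases should be treated).

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

VALID_DNA = frozenset("ACGT")


def invalid_reads(lines: Iterable[str]) -> Iterator[Tuple[int, str, Counter]]:
    """Yield (record_number, header, Counter of invalid characters) for each
    FASTQ record whose sequence has characters outside A, C, G, T."""
    it = iter(lines)
    for record_number, header in enumerate(it):
        seq = next(it, "").strip().upper()
        next(it, None)  # skip the '+' separator line
        next(it, None)  # skip the quality line
        bad = Counter(ch for ch in seq if ch not in VALID_DNA)
        if bad:
            yield record_number, header.strip(), bad
```

Calling it on an open file handle, for example `for n, name, bad in invalid_reads(open(path)): print(n, name, dict(bad))`, lists each offending record with the characters that triggered it.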

@ctb
Contributor

ctb commented Jun 21, 2018

Fixed in #503.

@olgabot
Collaborator Author

olgabot commented Jun 21, 2018

Ah okay great! For the offending N characters, how would you recommend removing them? These reads were run through trim-low-abund.py, so I would have thought the Ns would have been removed at that step. It seems like the data should get run through a generic sequence quality trimmer first.

Should the order be:

Option A

  1. trimmomatic/trim_galore
  2. trim-low-abund.py
  3. sourmash compute

Option B

  1. trim-low-abund.py
  2. trimmomatic/trim_galore
  3. sourmash compute
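Whichever order is used, one simple way to get rid of Ns before computing signatures is to split each read at ambiguous bases and keep only fragments long enough to contain a k-mer. This is a rough sketch of that idea, not what trimmomatic, trim_galore, or trim-low-abund.py actually do; `split_on_ambiguous` and `min_length` are hypothetical names, and only sequences are handled (quality strings would need the same split for FASTQ output).

```python
import re
from typing import List

# Any run of characters other than A, C, G, T is treated as a break point.
NON_ACGT = re.compile(r"[^ACGT]+")


def split_on_ambiguous(seq: str, min_length: int) -> List[str]:
    """Split a read at runs of non-ACGT characters and keep only the
    fragments that are at least min_length bases long."""
    fragments = NON_ACGT.split(seq.upper())
    return [frag for frag in fragments if len(frag) >= min_length]
```

Setting `min_length` to the largest k-mer size in use (51 in the commands above) would drop fragments that cannot contribute a k-mer at that size.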

@ctb
Contributor

ctb commented Jun 21, 2018 via email

@olgabot
Collaborator Author

olgabot commented Jun 21, 2018

Got it. I'm still getting the ValueError: basic_string::substr with the latest master.

Here's the install command:

 Thu 21 Jun - 16:37  ~/sourmash   origin ☊ master ✔ 
 ubuntu@ip-172-31-42-179  pip install -e .
Obtaining file:///home/ubuntu/sourmash
Requirement already satisfied: screed>=0.9 in /mnt/data/anaconda/lib/python3.6/site-packages (from sourmash==2.0.0a7)
Requirement already satisfied: ijson in /mnt/data/anaconda/lib/python3.6/site-packages (from sourmash==2.0.0a7)
Requirement already satisfied: khmer>=2.1<3.0 in /mnt/data/anaconda/lib/python3.6/site-packages (from sourmash==2.0.0a7)
Requirement already satisfied: bz2file in /mnt/data/anaconda/lib/python3.6/site-packages (from screed>=0.9->sourmash==2.0.0a7)
Requirement already satisfied: Cython>=0.25.2 in /mnt/data/anaconda/lib/python3.6/site-packages (from khmer>=2.1<3.0->sourmash==2.0.0a7)
Installing collected packages: sourmash
  Found existing installation: sourmash 2.0.0a7
    Uninstalling sourmash-2.0.0a7:
      Successfully uninstalled sourmash-2.0.0a7
  Running setup.py develop for sourmash
Successfully installed sourmash
You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
 Thu 21 Jun - 16:38  ~/sourmash   origin ☊ master ✔ 
 ubuntu@ip-172-31-42-179  git log | head
commit a49456c78bf8608b389ef90fef9330531fa0a438
Author: C. Titus Brown <titus@idyll.org>
Date:   Thu Jun 21 09:22:56 2018 -0700

    report entire bad k-mer (#503)

commit b0acc34b4ac9a0861fce5cb822c6b5fc34daaeaa
Author: C. Titus Brown <titus@idyll.org>
Date:   Tue Jun 19 10:02:38 2018 -0700

And here's the sourmash compute command:

 Thu 21 Jun - 16:38  ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison   origin ☊ master 3☀ 6● 
 ubuntu@ip-172-31-42-179  sourmash compute --track-abundance --protein --dna --scaled 10000 --ksizes 21,33,51 --merge 'G14-MAA001857-3_38_F-1-1|tissue:Pancreas|subtissue:Endocrine|cell_ontology_class:pancreatic_A_cell|free_annotation:pancreatic_A_cell' --output G14-MAA001857-3_38_F-1-1.signature G14-MAA001857-3_38_F-1-1.trimmed
setting num_hashes to 0 because --scaled is set
computing signatures for files: G14-MAA001857-3_38_F-1-1.trimmed
Computing signature for ksizes: [21, 33, 51]
Computing both DNA and protein signatures.
Computing a total of 6 signature(s).
Tracking abundance of input k-mers.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed
... G14-MAA001857-3_38_F-1-1.trimmed 10000
... G14-MAA001857-3_38_F-1-1.trimmed 900000Traceback (most recent call last):
  File "/home/ubuntu/anaconda/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash', 'console_scripts', 'sourmash')()
  File "/home/ubuntu/sourmash/sourmash/__main__.py", line 76, in main
    cmd(sys.argv[2:])
  File "/home/ubuntu/sourmash/sourmash/commands.py", line 287, in compute
    args.input_is_protein, args.check_sequence)
  File "/home/ubuntu/sourmash/sourmash/commands.py", line 197, in add_seq
    E.add_sequence(seq, not check_sequence)
  File "sourmash/_minhash.pyx", line 178, in sourmash._minhash.MinHash.add_sequence
ValueError: basic_string::substr: __pos (which is 10) > this->size() (which is 9)
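
In both detailed messages the requested position is exactly one past the string size (16 > 15 earlier, 10 > 9 here), and the error only shows up in runs with --protein. One hypothesis, not confirmed anywhere in this thread, is that some trimmed reads translate to fewer amino acids than the protein k-mer size. The sketch below assumes that --protein uses ksize // 3 residues per k-mer (an assumption about this sourmash version; note that 15 < 51 // 3 = 17 and 9 < 33 // 3 = 11). `too_short_for_protein` is a hypothetical helper for flagging such reads.

```python
from typing import Iterable, List, Tuple


def too_short_for_protein(
    seqs: Iterable[str], dna_ksizes: Iterable[int]
) -> List[Tuple[int, int, int]]:
    """Return (read_index, read_length, dna_ksize) for every read whose
    frame-0 translation has fewer residues than dna_ksize // 3."""
    ksizes = sorted(dna_ksizes)
    short = []
    for index, seq in enumerate(seqs):
        n_residues = len(seq) // 3  # residues in the longest reading frame
        for k in ksizes:
            if n_residues < k // 3:
                short.append((index, len(seq), k))
    return short
```

If reads flagged this way line up with the record where the traceback appears, that would point at short reads rather than invalid characters as the trigger.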

@olgabot
Collaborator Author

olgabot commented Jun 21, 2018

I split the file into pieces of 10k sequences (40k lines each), and split_9000 (download) was the piece that caused the problem:

 Thu 21 Jun - 18:51  ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison   origin ☊ master 3☀ 6● 
 ubuntu@ip-172-31-42-179  sourmash compute --force --track-abundance --protein --dna --scaled 10000 --ksizes 21,33,51 G14-MAA001857-3_38_F-1-1.trimmed.split_*
setting num_hashes to 0 because --scaled is set
computing signatures for files: G14-MAA001857-3_38_F-1-1.trimmed.split_00, G14-MAA001857-3_38_F-1-1.trimmed.split_01, G14-MAA001857-3_38_F-1-1.trimmed.split_02, G14-MAA001857-3_38_F-1-1.trimmed.split_03, G14-MAA001857-3_38_F-1-1.trimmed.split_04, G14-MAA001857-3_38_F-1-1.trimmed.split_05, G14-MAA001857-3_38_F-1-1.trimmed.split_06, G14-MAA001857-3_38_F-1-1.trimmed.split_07, G14-MAA001857-3_38_F-1-1.trimmed.split_08, G14-MAA001857-3_38_F-1-1.trimmed.split_09, G14-MAA001857-3_38_F-1-1.trimmed.split_10, G14-MAA001857-3_38_F-1-1.trimmed.split_11, G14-MAA001857-3_38_F-1-1.trimmed.split_12, G14-MAA001857-3_38_F-1-1.trimmed.split_13, G14-MAA001857-3_38_F-1-1.trimmed.split_14, G14-MAA001857-3_38_F-1-1.trimmed.split_15, G14-MAA001857-3_38_F-1-1.trimmed.split_16, G14-MAA001857-3_38_F-1-1.trimmed.split_17, G14-MAA001857-3_38_F-1-1.trimmed.split_18, G14-MAA001857-3_38_F-1-1.trimmed.split_19, G14-MAA001857-3_38_F-1-1.trimmed.split_20, G14-MAA001857-3_38_F-1-1.trimmed.split_21, G14-MAA001857-3_38_F-1-1.trimmed.split_22, G14-MAA001857-3_38_F-1-1.trimmed.split_23, G14-MAA001857-3_38_F-1-1.trimmed.split_24, G14-MAA001857-3_38_F-1-1.trimmed.split_25, G14-MAA001857-3_38_F-1-1.trimmed.split_26, G14-MAA001857-3_38_F-1-1.trimmed.split_27, G14-MAA001857-3_38_F-1-1.trimmed.split_28, G14-MAA001857-3_38_F-1-1.trimmed.split_29, G14-MAA001857-3_38_F-1-1.trimmed.split_30, G14-MAA001857-3_38_F-1-1.trimmed.split_31, G14-MAA001857-3_38_F-1-1.trimmed.split_32, G14-MAA001857-3_38_F-1-1.trimmed.split_33, G14-MAA001857-3_38_F-1-1.trimmed.split_34, G14-MAA001857-3_38_F-1-1.trimmed.split_35, G14-MAA001857-3_38_F-1-1.trimmed.split_36, G14-MAA001857-3_38_F-1-1.trimmed.split_37, G14-MAA001857-3_38_F-1-1.trimmed.split_38, G14-MAA001857-3_38_F-1-1.trimmed.split_39, G14-MAA001857-3_38_F-1-1.trimmed.split_40, G14-MAA001857-3_38_F-1-1.trimmed.split_41, G14-MAA001857-3_38_F-1-1.trimmed.split_42, G14-MAA001857-3_38_F-1-1.trimmed.split_43, G14-MAA001857-3_38_F-1-1.trimmed.split_44, 
G14-MAA001857-3_38_F-1-1.trimmed.split_45, G14-MAA001857-3_38_F-1-1.trimmed.split_46, G14-MAA001857-3_38_F-1-1.trimmed.split_47, G14-MAA001857-3_38_F-1-1.trimmed.split_48, G14-MAA001857-3_38_F-1-1.trimmed.split_49, G14-MAA001857-3_38_F-1-1.trimmed.split_50, G14-MAA001857-3_38_F-1-1.trimmed.split_51, G14-MAA001857-3_38_F-1-1.trimmed.split_52, G14-MAA001857-3_38_F-1-1.trimmed.split_53, G14-MAA001857-3_38_F-1-1.trimmed.split_54, G14-MAA001857-3_38_F-1-1.trimmed.split_55, G14-MAA001857-3_38_F-1-1.trimmed.split_56, G14-MAA001857-3_38_F-1-1.trimmed.split_57, G14-MAA001857-3_38_F-1-1.trimmed.split_58, G14-MAA001857-3_38_F-1-1.trimmed.split_59, G14-MAA001857-3_38_F-1-1.trimmed.split_60, G14-MAA001857-3_38_F-1-1.trimmed.split_61, G14-MAA001857-3_38_F-1-1.trimmed.split_62, G14-MAA001857-3_38_F-1-1.trimmed.split_63, G14-MAA001857-3_38_F-1-1.trimmed.split_64, G14-MAA001857-3_38_F-1-1.trimmed.split_65, G14-MAA001857-3_38_F-1-1.trimmed.split_66, G14-MAA001857-3_38_F-1-1.trimmed.split_67, G14-MAA001857-3_38_F-1-1.trimmed.split_68, G14-MAA001857-3_38_F-1-1.trimmed.split_69, G14-MAA001857-3_38_F-1-1.trimmed.split_70, G14-MAA001857-3_38_F-1-1.trimmed.split_71, G14-MAA001857-3_38_F-1-1.trimmed.split_72, G14-MAA001857-3_38_F-1-1.trimmed.split_73, G14-MAA001857-3_38_F-1-1.trimmed.split_74, G14-MAA001857-3_38_F-1-1.trimmed.split_75, G14-MAA001857-3_38_F-1-1.trimmed.split_76, G14-MAA001857-3_38_F-1-1.trimmed.split_77, G14-MAA001857-3_38_F-1-1.trimmed.split_78, G14-MAA001857-3_38_F-1-1.trimmed.split_79, G14-MAA001857-3_38_F-1-1.trimmed.split_80, G14-MAA001857-3_38_F-1-1.trimmed.split_81, G14-MAA001857-3_38_F-1-1.trimmed.split_82, G14-MAA001857-3_38_F-1-1.trimmed.split_83, G14-MAA001857-3_38_F-1-1.trimmed.split_84, G14-MAA001857-3_38_F-1-1.trimmed.split_85, G14-MAA001857-3_38_F-1-1.trimmed.split_86, G14-MAA001857-3_38_F-1-1.trimmed.split_87, G14-MAA001857-3_38_F-1-1.trimmed.split_88, G14-MAA001857-3_38_F-1-1.trimmed.split_89, G14-MAA001857-3_38_F-1-1.trimmed.split_9000, 
G14-MAA001857-3_38_F-1-1.trimmed.split_9001, G14-MAA001857-3_38_F-1-1.trimmed.split_9002, G14-MAA001857-3_38_F-1-1.trimmed.split_9003, G14-MAA001857-3_38_F-1-1.trimmed.split_9004, G14-MAA001857-3_38_F-1-1.trimmed.split_9005, G14-MAA001857-3_38_F-1-1.trimmed.split_9006, G14-MAA001857-3_38_F-1-1.trimmed.split_9007, G14-MAA001857-3_38_F-1-1.trimmed.split_9008, G14-MAA001857-3_38_F-1-1.trimmed.split_9009, G14-MAA001857-3_38_F-1-1.trimmed.split_9010, G14-MAA001857-3_38_F-1-1.trimmed.split_9011, G14-MAA001857-3_38_F-1-1.trimmed.split_9012, G14-MAA001857-3_38_F-1-1.trimmed.split_9013, G14-MAA001857-3_38_F-1-1.trimmed.split_9014, G14-MAA001857-3_38_F-1-1.trimmed.split_9015, G14-MAA001857-3_38_F-1-1.trimmed.split_9016, G14-MAA001857-3_38_F-1-1.trimmed.split_9017, G14-MAA001857-3_38_F-1-1.trimmed.split_9018, G14-MAA001857-3_38_F-1-1.trimmed.split_9019, G14-MAA001857-3_38_F-1-1.trimmed.split_9020, G14-MAA001857-3_38_F-1-1.trimmed.split_9021, G14-MAA001857-3_38_F-1-1.trimmed.split_9022, G14-MAA001857-3_38_F-1-1.trimmed.split_9023, G14-MAA001857-3_38_F-1-1.trimmed.split_9024, G14-MAA001857-3_38_F-1-1.trimmed.split_9025, G14-MAA001857-3_38_F-1-1.trimmed.split_9026, G14-MAA001857-3_38_F-1-1.trimmed.split_9027, G14-MAA001857-3_38_F-1-1.trimmed.split_9028, G14-MAA001857-3_38_F-1-1.trimmed.split_9029, G14-MAA001857-3_38_F-1-1.trimmed.split_9030, G14-MAA001857-3_38_F-1-1.trimmed.split_9031, G14-MAA001857-3_38_F-1-1.trimmed.split_9032, G14-MAA001857-3_38_F-1-1.trimmed.split_9033, G14-MAA001857-3_38_F-1-1.trimmed.split_9034, G14-MAA001857-3_38_F-1-1.trimmed.split_9035, G14-MAA001857-3_38_F-1-1.trimmed.split_9036, G14-MAA001857-3_38_F-1-1.trimmed.split_9037, G14-MAA001857-3_38_F-1-1.trimmed.split_9038, G14-MAA001857-3_38_F-1-1.trimmed.split_9039, G14-MAA001857-3_38_F-1-1.trimmed.split_9040, G14-MAA001857-3_38_F-1-1.trimmed.split_9041, G14-MAA001857-3_38_F-1-1.trimmed.split_9042, G14-MAA001857-3_38_F-1-1.trimmed.split_9043, G14-MAA001857-3_38_F-1-1.trimmed.split_9044, 
G14-MAA001857-3_38_F-1-1.trimmed.split_9045, G14-MAA001857-3_38_F-1-1.trimmed.split_9046, G14-MAA001857-3_38_F-1-1.trimmed.split_9047, G14-MAA001857-3_38_F-1-1.trimmed.split_9048, G14-MAA001857-3_38_F-1-1.trimmed.split_9049, G14-MAA001857-3_38_F-1-1.trimmed.split_9050, G14-MAA001857-3_38_F-1-1.trimmed.split_9051, G14-MAA001857-3_38_F-1-1.trimmed.split_9052, G14-MAA001857-3_38_F-1-1.trimmed.split_9053, G14-MAA001857-3_38_F-1-1.trimmed.split_9054, G14-MAA001857-3_38_F-1-1.trimmed.split_9055, G14-MAA001857-3_38_F-1-1.trimmed.split_9056, G14-MAA001857-3_38_F-1-1.trimmed.split_9057, G14-MAA001857-3_38_F-1-1.trimmed.split_9058, G14-MAA001857-3_38_F-1-1.trimmed.split_9059, G14-MAA001857-3_38_F-1-1.trimmed.split_9060, G14-MAA001857-3_38_F-1-1.trimmed.split_9061, G14-MAA001857-3_38_F-1-1.trimmed.split_9062, G14-MAA001857-3_38_F-1-1.trimmed.split_9063, G14-MAA001857-3_38_F-1-1.trimmed.split_9064, G14-MAA001857-3_38_F-1-1.trimmed.split_9065, G14-MAA001857-3_38_F-1-1.trimmed.split_9066, G14-MAA001857-3_38_F-1-1.trimmed.split_9067, G14-MAA001857-3_38_F-1-1.trimmed.split_9068, G14-MAA001857-3_38_F-1-1.trimmed.split_9069, G14-MAA001857-3_38_F-1-1.trimmed.split_9070, G14-MAA001857-3_38_F-1-1.trimmed.split_9071, G14-MAA001857-3_38_F-1-1.trimmed.split_9072, G14-MAA001857-3_38_F-1-1.trimmed.split_9073, G14-MAA001857-3_38_F-1-1.trimmed.split_9074, G14-MAA001857-3_38_F-1-1.trimmed.split_9075, G14-MAA001857-3_38_F-1-1.trimmed.split_9076, G14-MAA001857-3_38_F-1-1.trimmed.split_9077, G14-MAA001857-3_38_F-1-1.trimmed.split_9078, G14-MAA001857-3_38_F-1-1.trimmed.split_9079, G14-MAA001857-3_38_F-1-1.trimmed.split_9080, G14-MAA001857-3_38_F-1-1.trimmed.split_9081, G14-MAA001857-3_38_F-1-1.trimmed.split_9082, G14-MAA001857-3_38_F-1-1.trimmed.split_9083, G14-MAA001857-3_38_F-1-1.trimmed.split_9084, G14-MAA001857-3_38_F-1-1.trimmed.split_9085, G14-MAA001857-3_38_F-1-1.trimmed.split_9086, G14-MAA001857-3_38_F-1-1.trimmed.split_9087, G14-MAA001857-3_38_F-1-1.trimmed.split_9088, 
G14-MAA001857-3_38_F-1-1.trimmed.split_9089, G14-MAA001857-3_38_F-1-1.trimmed.split_9090, G14-MAA001857-3_38_F-1-1.trimmed.split_9091, G14-MAA001857-3_38_F-1-1.trimmed.split_9092, G14-MAA001857-3_38_F-1-1.trimmed.split_9093, G14-MAA001857-3_38_F-1-1.trimmed.split_9094, G14-MAA001857-3_38_F-1-1.trimmed.split_9095, G14-MAA001857-3_38_F-1-1.trimmed.split_9096, G14-MAA001857-3_38_F-1-1.trimmed.split_9097, G14-MAA001857-3_38_F-1-1.trimmed.split_9098, G14-MAA001857-3_38_F-1-1.trimmed.split_9099, G14-MAA001857-3_38_F-1-1.trimmed.split_9100, G14-MAA001857-3_38_F-1-1.trimmed.split_9101, G14-MAA001857-3_38_F-1-1.trimmed.split_9102, G14-MAA001857-3_38_F-1-1.trimmed.split_9103, G14-MAA001857-3_38_F-1-1.trimmed.split_9104, G14-MAA001857-3_38_F-1-1.trimmed.split_9105, G14-MAA001857-3_38_F-1-1.trimmed.split_9106, G14-MAA001857-3_38_F-1-1.trimmed.split_9107, G14-MAA001857-3_38_F-1-1.trimmed.split_9108, G14-MAA001857-3_38_F-1-1.trimmed.split_9109, G14-MAA001857-3_38_F-1-1.trimmed.split_9110, G14-MAA001857-3_38_F-1-1.trimmed.split_9111, G14-MAA001857-3_38_F-1-1.trimmed.split_9112, G14-MAA001857-3_38_F-1-1.trimmed.split_9113, G14-MAA001857-3_38_F-1-1.trimmed.split_9114, G14-MAA001857-3_38_F-1-1.trimmed.split_9115, G14-MAA001857-3_38_F-1-1.trimmed.split_9116, G14-MAA001857-3_38_F-1-1.trimmed.split_9117, G14-MAA001857-3_38_F-1-1.trimmed.split_9118, G14-MAA001857-3_38_F-1-1.trimmed.split_9119, G14-MAA001857-3_38_F-1-1.trimmed.split_9120, G14-MAA001857-3_38_F-1-1.trimmed.split_9121, G14-MAA001857-3_38_F-1-1.trimmed.split_9122, G14-MAA001857-3_38_F-1-1.trimmed.split_9123, G14-MAA001857-3_38_F-1-1.trimmed.split_9124, G14-MAA001857-3_38_F-1-1.trimmed.split_9125, G14-MAA001857-3_38_F-1-1.trimmed.split_9126, G14-MAA001857-3_38_F-1-1.trimmed.split_9127, G14-MAA001857-3_38_F-1-1.trimmed.split_9128, G14-MAA001857-3_38_F-1-1.trimmed.split_9129, G14-MAA001857-3_38_F-1-1.trimmed.split_9130, G14-MAA001857-3_38_F-1-1.trimmed.split_9131, G14-MAA001857-3_38_F-1-1.trimmed.split_9132, 
G14-MAA001857-3_38_F-1-1.trimmed.split_9133, G14-MAA001857-3_38_F-1-1.trimmed.split_9134, G14-MAA001857-3_38_F-1-1.trimmed.split_9135, G14-MAA001857-3_38_F-1-1.trimmed.split_9136, G14-MAA001857-3_38_F-1-1.trimmed.split_9137, G14-MAA001857-3_38_F-1-1.trimmed.split_9138, G14-MAA001857-3_38_F-1-1.trimmed.split_9139, G14-MAA001857-3_38_F-1-1.trimmed.split_9140, G14-MAA001857-3_38_F-1-1.trimmed.split_9141, G14-MAA001857-3_38_F-1-1.trimmed.split_9142, G14-MAA001857-3_38_F-1-1.trimmed.split_9143, G14-MAA001857-3_38_F-1-1.trimmed.split_9144, G14-MAA001857-3_38_F-1-1.trimmed.split_9145, G14-MAA001857-3_38_F-1-1.trimmed.split_9146, G14-MAA001857-3_38_F-1-1.trimmed.split_9147, G14-MAA001857-3_38_F-1-1.trimmed.split_9148, G14-MAA001857-3_38_F-1-1.trimmed.split_9149, G14-MAA001857-3_38_F-1-1.trimmed.split_9150, G14-MAA001857-3_38_F-1-1.trimmed.split_9151, G14-MAA001857-3_38_F-1-1.trimmed.split_9152, G14-MAA001857-3_38_F-1-1.trimmed.split_9153, G14-MAA001857-3_38_F-1-1.trimmed.split_9154, G14-MAA001857-3_38_F-1-1.trimmed.split_9155, G14-MAA001857-3_38_F-1-1.trimmed.split_9156, G14-MAA001857-3_38_F-1-1.trimmed.split_9157, G14-MAA001857-3_38_F-1-1.trimmed.split_9158, G14-MAA001857-3_38_F-1-1.trimmed.split_9159, G14-MAA001857-3_38_F-1-1.trimmed.split_9160, G14-MAA001857-3_38_F-1-1.trimmed.split_9161, G14-MAA001857-3_38_F-1-1.trimmed.split_9162, G14-MAA001857-3_38_F-1-1.trimmed.split_9163, G14-MAA001857-3_38_F-1-1.trimmed.split_9164, G14-MAA001857-3_38_F-1-1.trimmed.split_9165, G14-MAA001857-3_38_F-1-1.trimmed.split_9166, G14-MAA001857-3_38_F-1-1.trimmed.split_9167, G14-MAA001857-3_38_F-1-1.trimmed.split_9168, G14-MAA001857-3_38_F-1-1.trimmed.split_9169, G14-MAA001857-3_38_F-1-1.trimmed.split_9170, G14-MAA001857-3_38_F-1-1.trimmed.split_9171, G14-MAA001857-3_38_F-1-1.trimmed.split_9172, G14-MAA001857-3_38_F-1-1.trimmed.split_9173, G14-MAA001857-3_38_F-1-1.trimmed.split_9174, G14-MAA001857-3_38_F-1-1.trimmed.split_9175, G14-MAA001857-3_38_F-1-1.trimmed.split_9176, 
G14-MAA001857-3_38_F-1-1.trimmed.split_9177, G14-MAA001857-3_38_F-1-1.trimmed.split_9178, G14-MAA001857-3_38_F-1-1.trimmed.split_9179, G14-MAA001857-3_38_F-1-1.trimmed.split_9180, G14-MAA001857-3_38_F-1-1.trimmed.split_9181, G14-MAA001857-3_38_F-1-1.trimmed.split_9182, G14-MAA001857-3_38_F-1-1.trimmed.split_9183, G14-MAA001857-3_38_F-1-1.trimmed.split_9184, G14-MAA001857-3_38_F-1-1.trimmed.split_9185, G14-MAA001857-3_38_F-1-1.trimmed.split_9186, G14-MAA001857-3_38_F-1-1.trimmed.split_9187, G14-MAA001857-3_38_F-1-1.trimmed.split_9188, G14-MAA001857-3_38_F-1-1.trimmed.split_9189, G14-MAA001857-3_38_F-1-1.trimmed.split_9190, G14-MAA001857-3_38_F-1-1.trimmed.split_9191, G14-MAA001857-3_38_F-1-1.trimmed.split_9192, G14-MAA001857-3_38_F-1-1.trimmed.split_9193
Computing signature for ksizes: [21, 33, 51]
Computing both DNA and protein signatures.
Computing a total of 6 signature(s).
Tracking abundance of input k-mers.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_00
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_00
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_01
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_01
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_02
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_02
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_03
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_03
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_04
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_04
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_05
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_05
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_06
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_06
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_07
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_07
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_08
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_08
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_09
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_09
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_10
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_10
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_11
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_11
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_12
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_12
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_13
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_13
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_14
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_14
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_15
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_15
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_16
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_16
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_17
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_17
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_18
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_18
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_19
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_19
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_20
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_20
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_21
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_21
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_22
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_22
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_23
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_23
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_24
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_24
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_25
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_25
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_26
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_26
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_27
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_27
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_28
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_28
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_29
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_29
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_30
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_30
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_31
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_31
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_32
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_32
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_33
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_33
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_34
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_34
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_35
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_35
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_36
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_36
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_37
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_37
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_38
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_38
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_39
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_39
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_40
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_40
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_41
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_41
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_42
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_42
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_43
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_43
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_44
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_44
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_45
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_45
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_46
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_46
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_47
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_47
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_48
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_48
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_49
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_49
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_50
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_50
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_51
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_51
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_52
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_52
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_53
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_53
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_54
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_54
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_55
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_55
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_56
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_56
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_57
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_57
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_58
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_58
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_59
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_59
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_60
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_60
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_61
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_61
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_62
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_62
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_63
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_63
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_64
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_64
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_65
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_65
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_66
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_66
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_67
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_67
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_68
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_68
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_69
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_69
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_70
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_70
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_71
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_71
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_72
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_72
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_73
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_73
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_74
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_74
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_75
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_75
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_76
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_76
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_77
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_77
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_78
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_78
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_79
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_79
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_80
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_80
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_81
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_81
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_82
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_82
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_83
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_83
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_84
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_84
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_85
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_85
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_86
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_86
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_87
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_87
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_88
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_88
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_89
calculated 6 signatures for 10000 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_89
saved 6 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_9000
Traceback (most recent call last):
  File "/home/ubuntu/anaconda/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash', 'console_scripts', 'sourmash')()
  File "/home/ubuntu/sourmash/sourmash/__main__.py", line 76, in main
    cmd(sys.argv[2:])
  File "/home/ubuntu/sourmash/sourmash/commands.py", line 255, in compute
    args.input_is_protein, args.check_sequence)
  File "/home/ubuntu/sourmash/sourmash/commands.py", line 197, in add_seq
    E.add_sequence(seq, not check_sequence)
  File "sourmash/_minhash.pyx", line 178, in sourmash._minhash.MinHash.add_sequence
ValueError: basic_string::substr: __pos (which is 10) > this->size() (which is 9)

After hunting down the error by repeatedly split-ing the file, I found that the offending sequence is GCAAATACCTGGACTCCCGCCGTGNCCAAGATT, which contains an N and a few other low-quality bases. I'm not sure where the 9 and 10 in the error message come from: this sequence has length 33, and I'm asking for DNA k-mers of length 21, 33, and 51, which correspond to protein k-mers of length 7, 11, and 17.

 Thu 21 Jun - 18:57  ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison   origin ☊ master 3☀ 6● 
 ubuntu@ip-172-31-42-179  sourmash compute --force --track-abundance --protein --dna --scaled 10000 --ksizes 21,33,51 G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_0*
setting num_hashes to 0 because --scaled is set
computing signatures for files: G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_00, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_01, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_02, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_03, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_04, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_05, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_06, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_07, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_08, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_09
Computing signature for ksizes: [21, 33, 51]
Computing both DNA and protein signatures.
Computing a total of 6 signature(s).
Tracking abundance of input k-mers.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_00
Traceback (most recent call last):
  File "/home/ubuntu/anaconda/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash', 'console_scripts', 'sourmash')()
  File "/home/ubuntu/sourmash/sourmash/__main__.py", line 76, in main
    cmd(sys.argv[2:])
  File "/home/ubuntu/sourmash/sourmash/commands.py", line 255, in compute
    args.input_is_protein, args.check_sequence)
  File "/home/ubuntu/sourmash/sourmash/commands.py", line 197, in add_seq
    E.add_sequence(seq, not check_sequence)
  File "sourmash/_minhash.pyx", line 178, in sourmash._minhash.MinHash.add_sequence
ValueError: basic_string::substr: __pos (which is 10) > this->size() (which is 9)

 ✘  Thu 21 Jun - 18:57  ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison   origin ☊ master 3☀ 6● 
 ubuntu@ip-172-31-42-179  cat G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_00
@A00111:62:H3FYHDMXX:2:1410:25590:37043 2:N:0:TTCCGGAG+GAACGTGA
GCAAATACCTGGACTCCCGCCGTGNCCAAGATT
+
FFFFFFFFFFFFFF88FFFFFFFF#FFFFF8FF
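
One possible reading of the 9 and 10 in the error above, offered as a hypothesis that nothing in this thread confirms: a 33-base read translates to 11, 10, and 10 residues in the three forward frames. If a codon containing N produced no residue at all, frames 1 and 2 would shrink to 9 residues, and an unsigned loop bound of `length - k + 1` with protein k = 11 would wrap around and let the index run up to 10. The Python sketch below models only that arithmetic; `translated_length`, `first_bad_substr`, `first_failure`, and the 64-bit width are illustrative assumptions, not sourmash code.

```python
# Hypothetical model of the failure; NOT taken from the sourmash source.
READ = "GCAAATACCTGGACTCCCGCCGTGNCCAAGATT"
KSIZES = (21, 33, 51)
SIZE_T = 2 ** 64  # assumed width of the unsigned length type


def translated_length(dna, frame, drop_ambiguous):
    """Number of residues produced from one forward reading frame.

    If drop_ambiguous is True, codons containing a non-ACGT base emit
    nothing (the assumption under test) instead of a placeholder residue.
    """
    codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
    if drop_ambiguous:
        codons = [c for c in codons if set(c) <= set("ACGT")]
    return len(codons)


def first_bad_substr(aa_len, aa_k):
    """Emulate `for (j = 0; j < aa_len - aa_k + 1; j++) substr(j, aa_k)`
    with unsigned arithmetic. Return the first j that a C++ substr call
    would reject (j > aa_len), or None if the loop ends normally."""
    bound = (aa_len - aa_k + 1) % SIZE_T  # wraps when aa_len < aa_k - 1
    for j in range(min(bound, aa_len + 2)):
        if j > aa_len:
            return j
    return None


def first_failure(read, ksizes):
    """Return (ksize, frame, aa_len, pos) for the first modelled failure."""
    for ksize in ksizes:
        aa_k = ksize // 3  # 21, 33, 51 -> 7, 11, 17
        for frame in range(3):
            aa_len = translated_length(read, frame, drop_ambiguous=True)
            pos = first_bad_substr(aa_len, aa_k)
            if pos is not None:
                return ksize, frame, aa_len, pos
    return None


for frame in range(3):
    print(frame,
          translated_length(READ, frame, drop_ambiguous=False),
          translated_length(READ, frame, drop_ambiguous=True))
print(first_failure(READ, KSIZES))  # (33, 1, 9, 10)
```

Under these assumptions the first modelled failure is at ksize 33 in frame 1, with a 9-residue string and position 10, which matches the numbers in the message. That match is suggestive only; the actual cause would need to be checked against the C++ code behind `MinHash.add_sequence`.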

Collaborator Author

olgabot commented Jun 21, 2018

Note: this only happens with --protein:

 Thu 21 Jun - 19:02  ~/kmer-hashing/sourmash/maca/facs_v4_1000cell_scaled_trim_comparison   origin ☊ master 3☀ 6● 
 ubuntu@ip-172-31-42-179  sourmash compute --force --track-abundance --dna --scaled 10000 --ksizes 21,33,51 G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_0*        
setting num_hashes to 0 because --scaled is set
computing signatures for files: G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_00, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_01, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_02, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_03, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_04, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_05, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_06, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_07, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_08, G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_09
Computing signature for ksizes: [21, 33, 51]
Computing only DNA (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_00
calculated 3 signatures for 1 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_00
saved 3 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_01
calculated 3 signatures for 1 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_01
saved 3 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_02
calculated 3 signatures for 1 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_02
saved 3 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_03
calculated 3 signatures for 1 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_03
saved 3 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_04
calculated 3 signatures for 1 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_04
saved 3 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_05
calculated 3 signatures for 1 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_05
saved 3 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_06
calculated 3 signatures for 1 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_06
saved 3 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_07
calculated 3 signatures for 1 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_07
saved 3 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_08
calculated 3 signatures for 1 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_08
saved 3 signature(s). Note: signature license is CC0.
... reading sequences from G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_09
calculated 3 signatures for 1 sequences in G14-MAA001857-3_38_F-1-1.trimmed.split_9000_split_01_split_08_split_03_split_09
saved 3 signature(s). Note: signature license is CC0.

luizirber reopened this Jun 21, 2018
Collaborator Author

olgabot commented Aug 8, 2018

Any suggestions? This problem still occurs for me only when --protein is set. In this case check_sequence is set to False (https://sourcegraph.com/github.com/dib-lab/sourmash/-/blob/sourmash/commands.py#L197), so somehow the "N"s in the sequence aren't getting skipped when they should be.

I'm fine computing only DNA signatures for now but I'd like to do protein in the future, too.
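
In the meantime, one possible workaround is to drop reads containing ambiguous bases before running sourmash compute with --protein. The sketch below is a minimal filter, assuming plain four-line FASTQ records; the function name `clean_fastq` and the script name in the usage note are hypothetical, and nothing here is part of sourmash.

```python
import sys


def clean_fastq(lines, alphabet="ACGT"):
    """Yield four-line FASTQ records whose sequence uses only `alphabet`.

    Assumes each record is exactly four lines (header, sequence, '+',
    qualities); wrapped multi-line FASTQ is not handled.
    """
    allowed = set(alphabet)
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # Keep the record only if every base is unambiguous.
            if set(record[1].upper()) <= allowed:
                yield record
            record = []


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python drop_n_reads.py input.fastq output.fastq
    with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
        for rec in clean_fastq(src):
            dst.write("\n".join(rec) + "\n")
```

Note the trade-off: this discards the whole read, so valid k-mers on either side of the N are lost as well, and the resulting DNA signatures can differ slightly from ones computed on the unfiltered file.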
