Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Explore addition of a sketch command. #1159

Merged
merged 32 commits into from
Aug 14, 2020
Merged

[MRG] Explore addition of a sketch command. #1159

merged 32 commits into from
Aug 14, 2020

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Aug 9, 2020

A new sourmash sketch command

In this experimental PR, I implement a new command-line submodule, sketch, which has three main subcommands:

sourmash sketch dna|rna <sketch params> [ files ]

sourmash sketch aa|prot|protein <sketch params> [ files]

sourmash sketch translate <sketch params> [ files ]

Note that:

  • sketch dna is a replacement for compute.
  • sketch protein is a replacement for compute --input-is-protein --no-dna.
  • sketch translate replaces compute --protein --no-dna.
  • sketch protein --dayhoff|--hp replaces compute --dayhoff|--hp --no-dna --input-is-protein.
  • sketch translate --dayhoff|--hp replaces compute --dayhoff|--hp --no-dna.
  • k-mer sizes for sketch protein and sketch translate are now 1/3rd of what they are for compute!!!
  • all of these have different, and appropriate, default parameters!!
    • dna: k=31,scaled=1000,noabund
    • protein: k=21,scaled=200,noabund
    • dayhoff: k=19,scaled=200,noabund
    • hp: k=30,scaled=200,noabund
  • you can pass multiple param strings with very different parameters and they should all work!

Sketch parameters

The <sketch params> arguments cover the common sketch config params: ksize, num/scaled, and track_abund. We use good defaults per #219, and use explicit per-MinHash notation, e.g. -p k=31,scaled=1000,abund to construct each MinHash (with support for multiple -p).

Examples of param strings that work :):

  • -p k=31,scaled=1000,abund
  • -p k=31,noabund
  • -p k=51 - default to a scaled and abund (based on moltype/command)
    • for dna, scaled=1000, noabund
    • for protein, scaled=200, noabund
  • -p k=31,k=51,k=21 - compute multiple ksizes, defaults otherwise
  • -p k=20,num=500,protein -p k=19,num=400,dayhoff,abund
    -p k=30,scaled=200,hp -p k=30,scaled=200,seed=58 - computes multiple ksizes, moltypes, scaled/num, and even uses a different seed.

Notes and brainstorming

  • we could provide additional subcommands, like sourmash sketch reads, sourmash sketch genome, and sourmash sketch ncbi, that choose good defaults for those kinds of inputs
    • e.g. for reads, scaled=1000,abund
  • could add sourmash sketch 10x too!
  • similarly, we could alias sourmash sketch dayhoff and sourmash sketch hp.
  • could add sourmash sketch 16s (see e.g. classification with full length 16S gene #1000)
  • sketch dna -p k=21,protein and sketch protein -p k=21,dna are errors.

Additional thoughts:

  • @luizirber proposed a passthrough option, where you could chain the inputs together:
    • sourmash sketch dna --passthrough <input> | sourmash sketch protein <input> - this would yield two signatures. Not sure how to do the output tho - where does the signature file go? do we need to specify multiple different -o options?!
  • translate should take options to specify top-strand/rc only, for RNA.
  • can we make the sketch module suitable for direct use by Python, per Idea: imports match CLI? #1112?
    • e.g. sourmash.sketch.dna(filenames=...)
  • the sketch module should auto-recognize signature files and not try to load them as FASTA... sourmash compute doesn't recognize signature files #814
  • might be good to add automated traverse functionality to discover all FASTA files, if possible, at least for specific commands...

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov
Copy link

codecov bot commented Aug 9, 2020

Codecov Report

Merging #1159 into latest will increase coverage by 25.15%.
The diff coverage is 94.52%.

Impacted file tree graph

@@             Coverage Diff             @@
##           latest    #1159       +/-   ##
===========================================
+ Coverage   67.86%   93.01%   +25.15%     
===========================================
  Files          24       76       +52     
  Lines        3314     5986     +2672     
===========================================
+ Hits         2249     5568     +3319     
+ Misses       1065      418      -647     
Flag Coverage Δ
#rusttests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sourmash/command_sketch.py 90.12% <90.12%> (ø)
sourmash/cli/__init__.py 98.92% <100.00%> (ø)
sourmash/cli/sketch/__init__.py 100.00% <100.00%> (ø)
sourmash/cli/sketch/dna.py 100.00% <100.00%> (ø)
sourmash/cli/sketch/protein.py 100.00% <100.00%> (ø)
sourmash/cli/sketch/translate.py 100.00% <100.00%> (ø)
sourmash/command_compute.py 97.70% <100.00%> (ø)
src/core/src/from.rs
src/core/src/index/storage.rs
src/core/src/index/mod.rs
... and 97 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90db4c4...ca14aa9. Read the comment docs.

@ctb

This comment has been minimized.

@ctb

This comment has been minimized.

@ctb
Copy link
Contributor Author

ctb commented Aug 10, 2020

OK. I think this addresses most of the UX problems with sourmash compute that were mentioned in #999. Would love feedback to see what's missing!

@olgabot
Copy link
Collaborator

olgabot commented Aug 11, 2020

This is really exciting!!

@olgabot
Copy link
Collaborator

olgabot commented Aug 11, 2020

Notes and brainstorming

  • we could provide additional subcommands, like sourmash sketch reads, sourmash sketch genome, and sourmash sketch ncbi, that choose good defaults for those kinds of inputs

    • e.g. for reads, scaled=1000,abund
  • could add sourmash sketch 10x too!

Interesting! We've found that for "realistic" 10x data that we need to do several steps before getting to the sketch part. Check out the kmermaid pipeline for implementation details, and @pranathivemuri feel free to correct me where I'm wrong.

  1. Extract aligned reads from the 10x bam file into a fastq, with per-cell barcode and per-molecular barcode (aka "UMI") tags --> one mega-fastq per bam file (via samtools fastq)
  2. Extract unaligned reads from the 10x bam file into a fastq, with per-cell barcode and per-molecular barcode (aka "UMI") tags --> one mega-fastq per bam file (via samtools fastq)
  3. Count number of reads/UMIs per cell barcode (via bam2fasta count_umis_percell) --> text file of number molecular barcodes per cell barcode
  4. Extract per-cell fastqs from the mega-fastq of aligned reads, using only the cell barcodes which have UMIs greater than a threshold (e.g. minimum 1000 UMIs) --> per-cell fastqs (via bam2fasta make_fastqs_percell) Must be parallelized or will take ~3-30 days/bam file!
  5. Extract per-cell fastqs from the mega-fastq of unaligned reads, using only the cell barcodes which have UMIs greater than a threshold (e.g. minimum 1000 UMIs) --> per-cell fastqs (via bam2fasta make_fastqs_percell) Must be parallelized or will take ~3-30 days/bam file!
  6. Trim adapters and poly-X from reads --> fastqs (via fastp)
  7. Remove ribosomal RNA sequences --> fastqs (via sortMeRNA)
  8. Translate both aligned/unaligned per-cell fastqs to protein --> per-cell protein fastas (via sencha translate)
  9. NOW we can do sourmash compute --input-is-protein :)

Skipping step (3) results in ~700,000 cell barcode files per bam, because that's the total number of theoretical cell barcodes per 10x run (even more with the v3 chemistry I think) so it is necessary. The "real" number of cell barcodes per run is closer to 2,000-5,000. If you allow all ~700,000 through, then most of your "cells" have ~1-10 reads which usually get removed from downstream analyses anyway. Min 1000 UMIs is a pretty reasonable threshold. I've seen pipelines that filter for min 1000 UMIs/cell and min 100 genes/cell.

  • similarly, we could alias sourmash sketch dayhoff and sourmash sketch hp.

This would be great!

Additional thinking/tests needed:

  • what should/does sourmash sketch dna -p k=21,protein do?

Maybe the same as now, where it does the 6-frame translation?

  • what should/does sourmash sketch protein -p k=21,dna do?

I think this should either:

  1. Throw an error
  2. Compute a protein sketch with protein ksize k=21, then back-convert the protein sequences to all possible DNA sequences that could have generated it, and create DNA sketches with DNA ksize = 3*21 = 63

Additional thoughts:

  • @luizirber proposed a passthrough option, where you could chain the inputs together:

    • sourmash sketch dna --passthrough <input> | sourmash sketch protein <input> - this would yield two signatures. Not sure how to do the output tho - where does the signature file go? do we need to specify multiple different -o options?!

Interesting! Maybe --output-signature and --output-reads ?

  • translate should take options to specify top-strand/rc only, for RNA.

Does this assume that the RNA library prep is strand-specific? If it's not strand-specific, would both sides need to be used?

This would be awesome!

👍

  • might be good to add automated traverse functionality to discover all FASTA files, if possible, at least for specific commands...

👍

EDIT: Forgot trimming & ribosomal RNA removal steps

@ctb
Copy link
Contributor Author

ctb commented Aug 13, 2020

OK, after talking to @bluegenes she's 👍 on this functionality and already started using the terminology in slides 😂

So, I'm planning to put this up for merge; remaining items:

@ctb

This comment has been minimized.

@ctb
Copy link
Contributor Author

ctb commented Aug 13, 2020

ok, so currently:

sourmash sketch dna -p k=63,protein

is identical to

sourmash sketch translate -p k=21

which makes sense!

I might have it throw an error anyway, but it's nice the code "just works"...


conveniently,

sourmash sketch protein -p k=21,dna tests/test-data/ecoli.faa -o foo.sig

throws an error already :). I'll have to make this a nicer error message, is all!

== This is sourmash version 3.5.0a0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

computing signatures for files: tests/test-data/ecoli.faa
Computing a total of 1 signature(s).
... reading sequences from tests/test-data/ecoli.faa
Traceback (most recent call last):
  File "/Users/t/miniconda3/envs/py37/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash', 'console_scripts', 'sourmash')()
  File "/Users/t/dev/sourmash/sourmash/__main__.py", line 13, in main
    return mainmethod(args)
  File "/Users/t/dev/sourmash/sourmash/cli/sketch/protein.py", line 69, in main
    return sourmash.command_sketch.protein(args)
  File "/Users/t/dev/sourmash/sourmash/command_sketch.py", line 220, in protein
    _execute_sketch(args, signatures_factory)
  File "/Users/t/dev/sourmash/sourmash/command_sketch.py", line 175, in _execute_sketch
    _compute_individual(args, signatures_factory)
  File "/Users/t/dev/sourmash/sourmash/command_compute.py", line 228, in _compute_individual
    args.input_is_protein, args.check_sequence)
  File "/Users/t/dev/sourmash/sourmash/command_compute.py", line 281, in add_seq
    sig.add_protein(seq)
  File "/Users/t/dev/sourmash/sourmash/signature.py", line 155, in add_protein
    self._methodcall(lib.signature_add_protein, to_bytes(sequence))
  File "/Users/t/dev/sourmash/sourmash/utils.py", line 25, in _methodcall
    return rustcall(func, self._get_objptr(), *args)
  File "/Users/t/dev/sourmash/sourmash/utils.py", line 78, in rustcall
    raise exc
sourmash.exceptions.Panic: sourmash panicked: thread 'unnamed' panicked with 'called `Result::unwrap()` on an `Err` value: InvalidHashFunction { function: "dna" }' at src/core/src/signature.rs:367

@luizirber
Copy link
Member

luizirber commented Aug 13, 2020

throws an error already :). I'll have to make this a nicer error message, is all!
sourmash.exceptions.Panic: sourmash panicked: thread 'unnamed' panicked with 'called Result::unwrap()on anErr value: InvalidHashFunction { function: "dna" }' at src/core/src/signature.rs:367

Yikes, a Panic... Oh, I see, this is the code where it is being thrown:

self.signatures 
    .iter_mut()  
    .for_each(|sketch| { 
        sketch.add_protein(&seq).unwrap(); }  
    );

I didn't know about try_for_each before, which allows errors, so the fix is

self.signatures 
    .iter_mut()  
    .try_for_each(|sketch| { 
        sketch.add_protein(&seq) }  
    )?;

This is going to raise a ValueError, which you can then capture and give a better error message

@ctb ctb changed the title [EXP] Explore addition of a sketch command. [MRG] Explore addition of a sketch command. Aug 14, 2020
@ctb
Copy link
Contributor Author

ctb commented Aug 14, 2020

Ready for review and maybe merge. See #1169 for "next steps" and additions, #1168 for "write docs before 4.0", and #1166 and #1167 for rust fixes/improvements related to issues discovered herein.

@ctb ctb merged commit d921fc3 into latest Aug 14, 2020
@ctb ctb deleted the explore_sketch branch August 14, 2020 21:37
@ctb
Copy link
Contributor Author

ctb commented Aug 14, 2020

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants