q2-usearch

editor_options

markdown

wrap
72

q2-usearch

QIIME2 plug-in for USEARCH integration

Introduction

For years, USEARCH had been the GOAT program for many researchers to process amplicon sequencing data (Including us🙈). We originally wrote this plug-in for internal use, now with USEARCH's conversion to open-source software, we decided to publish this plug-in for the community to use. Here are the commands/pipelines this plug-in integrates into QIIME2:

Denoise valid data into ZOTU table and ZOTUs using the uniose3 algorithm
Cluster valid data into OTU table and OTUs at 97% identity threshould using the uparse algorithm
Denoise then cluster valid data into OTU table and OTUs at an user-defined identity threshould using both the uniose3 and the uclust algorithm
Perform paired-end read merging

*The plug-in is still in early development, thus is subject to interface changes

Installation

Step 1: Clone this repository to your compute node

git clone https://github.com/magicprotoss/q2-usearch .

Step 2: Activate the QIIME2 conda enviroment you wish to install to

# conda activate qiime2-amplicon-2024.5
conda activate <replace-with-your-q2-conda-env-name>

In case your don't know your q2 env's name, please run the following command

conda env list | grep qiime2

The env's name should appear in your terminal

# qiime2-amplicon-2024.5     /home/navi/miniconda3/envs/qiime2-amplicon-2024.5

Step 3: Change directory to the project folder and execute the following command

cd q2-usearch && python ./setup.py install
qiime dev refresh-cache

Step 4: Install seqkit2 and usearch12 using mamba/conda

mamba install -c bioconda seqkit">=2.0.0" usearch
# conda install -c bioconda seqkit">=2.0.0" usearch

If every thing went smoothly, you should be seeing sth. like this printed on your terminal

usearch
# usearch v12.0 [b1d935b], 132Gb RAM, 24 cores
# (C) Copyright 2013-24 Robert C. Edgar.
# https://drive5.com/usearch

Step 5: Optional Clean Up

cd .. && rm -rf q2-usearch

Methods

1. denoise-no-primer-pooled

What's under the hood?
API document

2. cluster-no-primer-pooled

What's under the hood?
API document

3. denoise-then-cluster-no-primer-pooled

What's under the hood?
API document

Tutorials on zOTU Calling

Process 'Valid data' from sequencing centers

Identify and prepare your reads files

Nowadays the common way a sequencing center sends you data is by providing you a link to a shared folder. Let's take a look at the data structure:

What you need is the files in the "Valid data" folder. And before proceeding to the next step, we recommend you to backup the all content in said folder and then rename all the valid reads files by removing excess strings other than your Sample-ID. If you're on Windows, one of the easiest way to do it is to use the power rename tool in MS PowerToys:
Import reads files into a QIIME2 Artifact

Let's start by activating the QIIME2 environment first
```
conda activate qiime2-amplicon-2024.5
```
The easiest way to import your reads files into a QIIME2 Artifact is by using a manifest file, which serves the same purpose as a mapping file in QIIME1. Since we've already renamed our reads files with Sample-IDs, we can generate the manifest file easily using the file names. There's a couple of ways to do it, there are a lot of pure bash solutions on the internet, like the one in QIIME2's library, or use the utility script, which is our preferred method. This script requires one more dependency though, let's install it first.
```
# conda install xlsxwriter
mamba install xlsxwriter
```
With that out of the way, let's run the utility script
```
python generate_metadata.py --input_path <path-to-your-valid-data> --from_filename
```
The script generates two files in the current directory. The first file 'manifest.tsv' contains the Sample-IDs and absolute file-paths for all of our reads, this will be required when importing our reads.

The second file 'metadata.xlsx' is a pre-formatted QIIME2 metadata file which contains the Sample-IDs and an empty column 'default-group'.

If we want to use QIIME2 to perform downstream analysis, it's always worth spending a little more time filling the blanks Once we've uploaded the filled metadata file, we can convert it to tsv format for QIIME2 to use. See here on how to prepare QIIME2 compatible metadata.
```
python generate_metadata.py --to_tsv
```
Now everything's prepared, we can import our reads files into a QIIME2 Artifact. Note since we are using 'valid data', the input format and schematic type here are always 'SingleEndFastqManifestPhred33V2' and 'SampleData[SequenceWithQuality]', respectively, regardless of our sequencing strategy.
```
qiime tools import \
    --input-path manifest.tsv \
    --input-format 'SingleEndFastqManifestPhred33V2' \
    --type 'SampleData[SequencesWithQuality]' \
    --output-path fastq-seqs.qza
```

Denoise reads into zOTUs

Now you've got everything prepared, let's call the plug-in and finish the job.

qiime usearch denoise-no-primer-pooled \
    --i-demultiplexed-sequences fastq-seqs.qza \
    --p-min-size 4 \
    --o-representative-sequences rep-seqs-unoise3.qza \
    --o-table table-unoise3.qza \
    --o-denoising-stats stats-unoise3.qza

If you haven't received your sequencing data yet, We've prepared the data-set used in (Dong, Guo et al. 2021) study for you to try it out.

qiime usearch denoise-no-primer-pooled \
    --i-demultiplexed-sequences ddbj_dl.qza \
    --p-min-size 4 \
    --o-representative-sequences rep-seqs-unoise3.qza \
    --o-table table-unoise3.qza \
    --o-denoising-stats stats-unoise3.qza

General Use Cases

Usage on Single-End Runs

Let's use the "Moving Pictures" tutorial as a basis, please navigate to the 'Sequence quality control and feature table construction' section. Besides Option 1: DADA2 and Option 2: Deblur, we now have a third Option.

Option 3: UNOISE3

The unoise3 command uses the UNOISE algorithm to perform denoising (error-correction) of amplicon reads. After which chimeras are removed by performing denovo chimera identification using an imporved version of uchime2 algorithm. The q2-usearch plug-in have wrapped the valid-data processing pipeline described in (Yan, Lin et al. 2024) into the denoise-no-primer-pooled method.

For single-end runs, it's mandatory to perform global trimming to all your reads. Which means to trim all your reads to a fixed length, so that reads from the same template should have the same length, or they will be splited into separate zOTUs in later stage of the pipeline. For this specific data-set, global trimming is enforced by setting the parameter --p-trunc-len n, which truncates each sequence at position n and discards all sequences shorter than n.
```
qiime usearch denoise-no-primer-pooled \
    --i-demultiplexed-sequences demux.qza \
    --p-min-size 4 \
    --p-trunc-len 120 \
    --o-representative-sequences rep-seqs-unoise3.qza \
    --o-table table-unoise3.qza \
    --o-denoising-stats stats-unoise3.qza
```
```
qiime metadata tabulate \
    --m-input-file stats-unoise3.qza \
    --o-visualization stats-unoise3.qzv
```
If we'd like to continue the tutorial using this FeatureTable (opposed to the feature tables generated in Option 1 and Option 2), run the following commands.
```
mv rep-seqs-unoise3.qza rep-seqs.qza
mv table-unoise3.qza table.qza
```
Usage on Paired-End Runs

For paired-end reads, we need to merge them into joined reads prior to denoising. For this specific reason, we'll follow the "Atacama soil microbiome" tutorial. Please navigate to the section before running DADA2.

Merging Paired End Reads

The merge-pairs method was adopted from the plug-in q2-vsearch, with a few tweaks to the default parameters. For details on how to set parameters for reads merging, please check out the usearch manual.
```
qiime usearch merge-pairs \
    --i-demultiplexed-seqs demux.qza \
    --o-merged-sequences merged.qza \
    --o-unmerged-sequences unmerged.qza \
    --verbose
```
zOTU Calling
```
qiime usearch denoise-no-primer-pooled \
    --i-demultiplexed-sequences merged.qza \
    --p-min-size 4 \
    --o-representative-sequences rep-seqs-unoise3.qza \
    --o-table table-unoise3.qza \
    --o-denoising-stats stats-unoise3.qza
```

What's Planned for the future?

Several Methods and pipelines were planned for future releases:

Methods:

Classify FeatureData[Sequences] using sintax
Perform OTU cluster (uclust) by using raw PacBio CCS data and DADA2 outputs as inputs.

Pipelines:

Perform merging(PE) ➡️ primer-removal ➡️ denoise/OTU-cluster on demultiplexed raw illumina data in a single pipeline
Find exact matches of FeatureData[Sequences] against a given database using global search then classify unmatched reads using sintax (similar to q2-feature-classifier's classify-hybrid-vsearch-sklearn and dada2's assignTaxonomy() followed by addSpecies())

Let me know if you have any questions😉

Happy QIIMEing 🎉🎉🎉

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.github/workflows		.github/workflows
ci/recipe		ci/recipe
docs		docs
q2_usearch		q2_usearch
.coveragerc		.coveragerc
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
README_ZH_CN.md		README_ZH_CN.md
setup.cfg		setup.cfg
setup.py		setup.py
versioneer.py		versioneer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

q2-usearch

Introduction

Installation

Methods

1. denoise-no-primer-pooled

2. cluster-no-primer-pooled

3. denoise-then-cluster-no-primer-pooled

Tutorials on zOTU Calling

Process 'Valid data' from sequencing centers

General Use Cases

Option 3: UNOISE3

What's Planned for the future?

About

Releases

Packages

Languages

License

magicprotoss/q2-usearch

Folders and files

Latest commit

History

Repository files navigation

q2-usearch

Introduction

Installation

Methods

1. denoise-no-primer-pooled

2. cluster-no-primer-pooled

3. denoise-then-cluster-no-primer-pooled

Tutorials on zOTU Calling

Process 'Valid data' from sequencing centers

General Use Cases

Option 3: UNOISE3

What's Planned for the future?

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages