Skip to content

Commit

Permalink
VSEARCH 2.25.0: Added chimeras_diff_pct option
Browse files Browse the repository at this point in the history
  • Loading branch information
torognes committed Nov 10, 2023
1 parent 8db8277 commit ddf6073
Show file tree
Hide file tree
Showing 8 changed files with 180 additions and 66 deletions.
36 changes: 18 additions & 18 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ Most of the nucleotide based commands and options in USEARCH version 7 are suppo

## Getting Help

If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.

## Example

Expand All @@ -50,9 +50,9 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t
**Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:

```
wget https://github.com/torognes/vsearch/archive/v2.24.0.tar.gz
tar xzf v2.24.0.tar.gz
cd vsearch-2.24.0
wget https://github.com/torognes/vsearch/archive/v2.25.0.tar.gz
tar xzf v2.25.0.tar.gz
cd vsearch-2.25.0
./autogen.sh
./configure CFLAGS="-O3" CXXFLAGS="-O3"
make
Expand Down Expand Up @@ -81,48 +81,48 @@ Binary distributions are provided for x86-64 systems running GNU/Linux, macOS (v
Download the appropriate executable for your system using the following commands if you are using a Linux x86_64 system:

```sh
wget https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch-2.24.0-linux-x86_64.tar.gz
tar xzf vsearch-2.24.0-linux-x86_64.tar.gz
wget https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch-2.25.0-linux-x86_64.tar.gz
tar xzf vsearch-2.25.0-linux-x86_64.tar.gz
```

Or these commands if you are using a Linux ppc64le system:

```sh
wget https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch-2.24.0-linux-ppc64le.tar.gz
tar xzf vsearch-2.24.0-linux-ppc64le.tar.gz
wget https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch-2.25.0-linux-ppc64le.tar.gz
tar xzf vsearch-2.25.0-linux-ppc64le.tar.gz
```

Or these commands if you are using a Linux aarch64 (arm64) system:

```sh
wget https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch-2.24.0-linux-aarch64.tar.gz
tar xzf vsearch-2.24.0-linux-aarch64.tar.gz
wget https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch-2.25.0-linux-aarch64.tar.gz
tar xzf vsearch-2.25.0-linux-aarch64.tar.gz
```

Or these commands if you are using a Mac with an Apple Silicon CPU:

```sh
wget https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch-2.24.0-macos-aarch64.tar.gz
tar xzf vsearch-2.24.0-macos-aarch64.tar.gz
wget https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch-2.25.0-macos-aarch64.tar.gz
tar xzf vsearch-2.25.0-macos-aarch64.tar.gz
```

Or these commands if you are using a Mac with an Intel CPU:

```sh
wget https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch-2.24.0-macos-x86_64.tar.gz
tar xzf vsearch-2.24.0-macos-x86_64.tar.gz
wget https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch-2.25.0-macos-x86_64.tar.gz
tar xzf vsearch-2.25.0-macos-x86_64.tar.gz
```

Or if you are using Windows, download and extract (unzip) the contents of this file:

```
https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch-2.24.0-win-x86_64.zip
https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch-2.25.0-win-x86_64.zip
```

Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.24.0-linux-x86_64` or `vsearch-2.24.0-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`. Versions with statically compiled libraries are available for Linux systems. These have "-static" in their name, and could be used on systems that do not have all the necessary libraries installed.
Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.25.0-linux-x86_64` or `vsearch-2.25.0-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`. Versions with statically compiled libraries are available for Linux systems. These have "-static" in their name, and could be used on systems that do not have all the necessary libraries installed.

**Windows**: You will now have the binary distribution in a folder
called `vsearch-2.24.0-win-x86_64`. The vsearch executable is called
called `vsearch-2.25.0-win-x86_64`. The vsearch executable is called
`vsearch.exe`. The manual in PDF format is called
`vsearch_manual.pdf`. If you want to be able to call `vsearch.exe`
from any command prompt window, you can put the vsearch executable in
Expand All @@ -133,7 +133,7 @@ searching for it in the Start menu, `Edit` user variables, add
your changes.


**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.24.0/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.25.0/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).


## Packages, plugins, and wrappers
Expand Down
2 changes: 1 addition & 1 deletion configure.ac
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
# Process this file with autoconf to produce a configure script.

AC_PREREQ([2.63])
AC_INIT([vsearch], [2.24.0], [torognes@ifi.uio.no], [vsearch], [https://github.com/torognes/vsearch])
AC_INIT([vsearch], [2.25.0], [torognes@ifi.uio.no], [vsearch], [https://github.com/torognes/vsearch])
AC_CANONICAL_TARGET
AM_INIT_AUTOMAKE([subdir-objects])
AC_LANG([C++])
Expand Down
6 changes: 5 additions & 1 deletion man/vsearch.1
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
.\" ============================================================================
.TH vsearch 1 "October 26, 2023" "version 2.24.0" "USER COMMANDS"
.TH vsearch 1 "November 10, 2023" "version 2.25.0" "USER COMMANDS"
.\" ============================================================================
.SH NAME
vsearch \(em a versatile open-source tool for microbiome analysis,
Expand Down Expand Up @@ -4830,6 +4830,10 @@ with clustering. Add more references.
Update documentation. Improve code. Allow up to 20 parents for the
undocumented and experimental chimeras_denovo command. Fix compilation
warnings for sha1.c. Compile for release (not debug) by default.
.TP
.BR v2.25.0\~ "released November 10th, 2023"
Allow a given percentage of mismatches between chimeras and parents
for the experimental chimeras_denovo command.
.LP
.\" ============================================================================
.\" TODO:
Expand Down
158 changes: 126 additions & 32 deletions src/chimera.cc
Original file line number Diff line number Diff line change
Expand Up @@ -139,6 +139,9 @@ struct chimera_info_s
int * smooth;
int * maxsmooth;

double * scan_p;
double * scan_q;

int parents_found;
int best_parents[maxparents];
int best_start[maxparents];
Expand Down Expand Up @@ -213,6 +216,11 @@ void realloc_arrays(struct chimera_info_s * ci)
ci->smooth = (int*) xrealloc(ci->smooth,
maxcandidates * maxqlen * sizeof(int));

ci->scan_p = (double *) xrealloc(ci->scan_p,
(maxqlen + 1) * sizeof(double));
ci->scan_q = (double *) xrealloc(ci->scan_q,
(maxqlen + 1) * sizeof(double));

int maxalnlen = maxqlen + 2 * db_getlongestsequence();
for (int f = 0; f < maxparents ; f++)
{
Expand Down Expand Up @@ -306,10 +314,75 @@ int compare_positions(const void * a, const void * b)
return 0;
}

bool scan_matches(struct chimera_info_s * ci,
int * matches,
int len,
double percentage,
int * best_start,
int * best_len)
{
/*
Scan matches array of zeros and ones, and find the longest subsequence
having a match fraction above or equal to the given percentage (e.g. 2%).
Based on an idea of finding the longest positive sum substring:
https://stackoverflow.com/questions/28356453/longest-positive-sum-substring
If the percentage is 2%, matches are given a score of 2 and mismatches -98.
*/

double score_match = percentage;
double score_mismatch = percentage - 100.0;

double * p = ci->scan_p;
double * q = ci->scan_q;

p[0] = 0.0;
for (int i = 0; i < len; i++)
p[i + 1] = p[i] + (matches[i] ? score_match : score_mismatch);

q[len] = p[len];
for (int i = len - 1; i >= 0; i--)
q[i] = MAX(q[i + 1], p[i]);

int best_i = 0;
int best_d = -1;
double best_c = -1.0;
int i = 1;
int j = 1;
while (j <= len)
{
double c = q[j] - p[i - 1];
if (c >= 0.0)
{
int d = j - i + 1;
if (d > best_d)
{
best_i = i;
best_d = d;
best_c = c;
}
j += 1;
}
else
{
i += 1;
}
}

if (best_c >= 0.0)
{
* best_start = best_i - 1;
* best_len = best_d;
return true;
}
else
return false;
}

int find_best_parents_long(struct chimera_info_s * ci)
{
/* find parents with longest perfect match regions,
excluding regions matched by previously identified parents */
/* Find parents with longest matching regions, without indels, allowing
a given percentage of mismatches (specified with --chimeras_diff_pct),
and excluding regions matched by previously identified parents. */

find_matches(ci);

Expand Down Expand Up @@ -342,29 +415,38 @@ int find_best_parents_long(struct chimera_info_s * ci)
{
int start = 0;
int len = 0;

for (int j = 0; j < ci->query_len; j++)
int j = 0;
while (j < ci->query_len)
{
if ((position_used[j] == false) &&
(ci->match[i * ci->query_len + j] == 1) &&
((len == 0) || (ci->insert[i * ci->query_len + j] == 0)))
start = j;
len = 0;
while ((j < ci->query_len) &&
(! position_used[j]) &&
((len == 0) || (ci->insert[i * ci->query_len + j] == 0)))
{
if (len == 0)
{
start = j;
}
len++;
if (len > best_len)
{
best_cand = i;
best_start = start;
best_len = len;
}
j++;
}
else
if (len > best_len)
{
len = 0;
int scan_best_start = 0;
int scan_best_len = 0;
if (scan_matches(ci,
ci->match + i*ci->query_len + start,
len,
opt_chimeras_diff_pct,
& scan_best_start,
& scan_best_len))
{
if (scan_best_len > best_len)
{
best_cand = i;
best_start = start + scan_best_start;
best_len = scan_best_len;
}
}
}
j++;
}
}

Expand Down Expand Up @@ -727,7 +809,7 @@ int eval_parents_long(struct chimera_info_s * ci)
unsigned int qsym = chrmap_4bit[(int)(ci->qaln[i])];
unsigned int psym[maxparents];
for (int f = 0; f < maxparents; f++)
psym[f] = 0;
psym[f] = 0;
for (int f = 0; f < ci->parents_found; f++)
psym[f] = chrmap_4bit[(int)(ci->paln[f][i])];

Expand All @@ -748,13 +830,15 @@ int eval_parents_long(struct chimera_info_s * ci)

if (all_defined)
{
bool parents_equal = true;
for (int f = 1; f < ci->parents_found; f++)
if (psym[f] != psym[0])
parents_equal = false;

if (! parents_equal)
diff = ci->model[i];
int z = 0;
for (int f = 0; f < ci->parents_found; f++)
if (psym[f] == qsym)
{
diff = 'A' + f;
z++;
}
if (z > 1)
diff = ' ';
}

ci->diffs[i] = diff;
Expand All @@ -778,11 +862,11 @@ int eval_parents_long(struct chimera_info_s * ci)
char qsym = chrmap_4bit[(int)(ci->qaln[i])];

for(int f = 0; f < ci->parents_found; f++)
{
char psym = chrmap_4bit[(int)(ci->paln[f][i])];
if (qsym == psym)
match_QP[f]++;
}
{
char psym = chrmap_4bit[(int)(ci->paln[f][i])];
if (qsym == psym)
match_QP[f]++;
}
}


Expand Down Expand Up @@ -1632,6 +1716,8 @@ void chimera_thread_init(struct chimera_info_s * ci)
ci->votes = nullptr;
ci->model = nullptr;
ci->ignore = nullptr;
ci->scan_p = nullptr;
ci->scan_q = nullptr;

for (int f = 0; f < maxparents; f++)
{
Expand Down Expand Up @@ -1716,6 +1802,14 @@ void chimera_thread_exit(struct chimera_info_s * ci)
{
xfree(ci->query_head);
}
if (ci->scan_p)
{
xfree(ci->scan_p);
}
if (ci->scan_q)
{
xfree(ci->scan_q);
}

for (int f = 0; f < maxparents; f++)
if (ci->paln[f])
Expand Down
18 changes: 9 additions & 9 deletions src/fasta.cc
Original file line number Diff line number Diff line change
Expand Up @@ -426,23 +426,23 @@ void fasta_print_general(FILE * fp,
if ((opt_eeout || opt_fastq_eeout) && (ee >= 0.0))
{
if (ee < 0.000000001)
fprintf(fp, ";ee=%.13lf", ee);
fprintf(fp, ";ee=%.13lf", ee);
else if (ee < 0.00000001)
fprintf(fp, ";ee=%.12lf", ee);
fprintf(fp, ";ee=%.12lf", ee);
else if (ee < 0.0000001)
fprintf(fp, ";ee=%.11lf", ee);
fprintf(fp, ";ee=%.11lf", ee);
else if (ee < 0.000001)
fprintf(fp, ";ee=%.10lf", ee);
fprintf(fp, ";ee=%.10lf", ee);
else if (ee < 0.00001)
fprintf(fp, ";ee=%.9lf", ee);
fprintf(fp, ";ee=%.9lf", ee);
else if (ee < 0.0001)
fprintf(fp, ";ee=%.8lf", ee);
fprintf(fp, ";ee=%.8lf", ee);
else if (ee < 0.001)
fprintf(fp, ";ee=%.7lf", ee);
fprintf(fp, ";ee=%.7lf", ee);
else if (ee < 0.01)
fprintf(fp, ";ee=%.6lf", ee);
fprintf(fp, ";ee=%.6lf", ee);
else if (ee < 0.1)
fprintf(fp, ";ee=%.5lf", ee);
fprintf(fp, ";ee=%.5lf", ee);
else
fprintf(fp, ";ee=%.4lf", ee);
}
Expand Down
Loading

0 comments on commit ddf6073

Please sign in to comment.