Skip to content

Commit

Permalink
Merge pull request #6 from gbouras13/dev2
Browse files Browse the repository at this point in the history
add benchmarking [skip ci]
  • Loading branch information
gbouras13 authored Nov 5, 2023
2 parents de4f277 + 8707e2a commit a43fe60
Show file tree
Hide file tree
Showing 3 changed files with 52 additions and 19 deletions.
4 changes: 2 additions & 2 deletions HISTORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,13 @@
History
=======

0.2.0 (2023-11-04)
0.2.0 (2023-11-05)
------------------

* Fixes bug where `pypolca` would always warn it found 0 variants.
* Also would result in the polishing not working.
* Please upgrade if you have used v0.1.1!
* I have tested `pypolca` v0.2.0 vs POLCA more thoroughly.
* Adds benchmarking of `pypolca` v0.2.0 vs POLCA more thoroughly.

0.1.1 (2023-10-09)
------------------
Expand Down
15 changes: 3 additions & 12 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -52,25 +52,16 @@ It was written for a number of reasons:

Note: I neither guarantee nor desire that `pypolca` will give identical results to POLCA implemented in MaSuRCA. This is because of the different versions of [freebayes](https://github.com/freebayes/freebayes) that might be used as a dependency.

In testing, `pypolca` v0.2.0 (running Freebayes v1.3.6) was extremely similar, but not identical to POLCA (running Freebayes v1.3.1-dirty). Please see [benchmarking](benchmarking.md) for more details.
In testing, `pypolca` v0.2.0 (running Freebayes v1.3.6 and Samtools v1.18) was extremely similar, but not identical to POLCA (running Freebayes v1.3.1-dirty and Samtools v0.1.20). Please see [benchmarking](benchmarking.md) for more details.

I have decided to use the newest version of freebayes possible rather than the version installed with MaSuRCA.

Note if you really want to replicate POLCA, the latest versions of MaSuRCA uses freebayes `v1.3.1-dirty`.

To enforce:

```
mamba create -n pypolca_env polca freebayes==1.3.1
```
I have decided to use the newest versions of freebayes and Samtools possible rather than the version installed with MaSuRCA, for ease of maintenance and particularly because the version of Samtools used is a major version behind and the CLI has changed.

### Note of Caution for Large (e.g. Eukaryotic) Genomes

* I have implemeted `pypolca` predominantly for the use-case of polishing long-read bacterial genome assemblies with short reads. Therefore, I decided not to implement the batched multiprocessing of freebayes included in POLCA, because it was a lot of work for no benefit for most bacterial genomes.
* However, this is certainly not true for larger genomes such as eukaryotic organisms. `pypolca` should be a lot slower than POLCA for such organisms if you run both with more than 1 thread.
* I do not intend to implement multiprocessing but if someone wants to feel free to make a PR.


## Installation

Installation from conda is recommended as this will install all non-python dependencies.
Expand Down Expand Up @@ -150,7 +141,7 @@ The polished output FASTA will be `{prefix}_corrected.fasta` in the specified ou

# Benchmarking

Please see [benchmarking](benchmarking.md) for more details. As can be seen, `pypolca` v0.2.0 (running Freebayes v1.3.6) was extremely similar, but not identical to POLCA (running Freebayes v1.3.1-dirty).
Please see [benchmarking](benchmarking.md) for more details. As can be seen, `pypolca` v0.2.0 was extremely similar, but not identical to POLCA.


# Citation
Expand Down
52 changes: 47 additions & 5 deletions benchmarking.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,21 +4,31 @@ I benchmarked `pypolca` v0.2.0 against POLCA (in MaSuRCA v4.1.0) using a variety

Overall, there were 18 isolates: 12 from this [preprint](https://www.biorxiv.org/content/10.1101/2023.09.25.559359v1) by Lerminiaux et al, JKD6159 from Wick et al [here](https://doi.org/10.1128/mra.01129-22) and 5 ATCC strain FASTQs provided by Ryan Wick and Lousie Judd from [here](https://rrwick.github.io/2023/05/05/ont-only-accuracy-with-r10.4.1.html).

All output can be found at the following link on Zenodo [here](_).
All output and this script can be found at the following link on Zenodo [here](https://zenodo.org/doi/10.5281/zenodo.10072192). The FASTQs will be available separately (soon) with the forthcoming hybracter manuscript and benchmarking.

Benchmarking was conducted on an Intel® Core™ i7-10700K CPU @ 3.80 GHz on a machine running Ubuntu 20.04.6 LTS.

The full methodology in how to download the FASTQs and FASTAs and subsample the FASTQs for the following isolates can be found at the hybracter benchmarking repository [here](https://github.com/gbouras13/hybracter_benchmarking). They were all subsampled to 100x genome size.
The full methodology in how to download the FASTQs and FASTAs and subsample the FASTQs for the following isolates can be found at the hybracter benchmarking repository https://github.com/gbouras13/hybracter_benchmarking. They were all subsampled to 100x genome size.

The assemblies polished were intermediate assemblies from [Flye](https://github.com/fenderglass/Flye) v2.9.2 done within hybracter.

# Results
The dependencies were:

* `pypolca`: Freebayes v1.3.6, bwa v 0.7.17-r1188 and Samtools v1.18
* `POLCA`: Freebayes v1.3.1-dirty, bwa v 0.7.17-r1188 and Samtools v0.1.20

# Results & Conclusion

* There were 16/18 identical assemblies.
* ATCC_33560 had 2 SNPs between `pypolca` and POLCA genomes.
* Lerminiaux_isolateI had 2 SNPs and 1 Indel between `pypolca` and POLCA genomes.
* Having checked these differences are in fact in the output vcfs, the differing versions of Freebayes (v1.3.6 in `pypolca`, v1.3.1-dirty in POLCA) seem most likely to be responsible for these differences.
* Given `pypolca` is running (by default) a newer version of Freebayes and that there are very few differences, I am deciding to move forward with this implementation.
* I also ran those 2 isolates with `pypolca` specifying Freebayes v1.3.1-dirty, with identical results to v1.3.6.
* I also attempted to install Samtools v0.1.20 with `pypolca`, but this was not available on bioconda. While I could attempt a source install etc, I decided I wasn't going to for wider purposes (e.g. including in hybracter) as bioconda integration is essential.
* I also tried to install Samtools v1.18 with POLCA, but this throws an error and won't work without modifying the polca.sh script - as the parameters of `samtools sort` have changed (line 136 [here](https://github.com/alekseyzimin/PacBio/blob/1efa908ed3127226aed40448f8d6edd20f798e53/src_reconcile/polca.sh#L136C1-L136C1)).

* Therefore, the likely difference is most likely to be explained by the different Samtools versions.
* Given `pypolca` runs (by default) a newer version of Freebayes and Samtools and that there are very few differences compared to POLCA, I am deciding to move forward with the `pypolca` implementation.
* I have decided to use the newest versions of freebayes and Samtools possible rather than the version installed with MaSuRCA, for ease of maintenance and particularly because the version of Samtools used is a major version behind and the CLI has changed.

### Setting up the environments

Expand Down Expand Up @@ -64,6 +74,26 @@ done
conda deactivate
```

* To run `pypolca` on ATCC_33560 and Lerminiaux_isolateI with Freebayes v1.3.1 and Samtools

```
mamba create -n pypolcafreebayes1.3.1 freebayes==1.3.1 samtools==0.1.20 bwa pypolca==0.2.0
conda activate pypolcafreebayes1.3.1
samples=(
ATCC_33560
Lerminiaux_isolateI
)
for sample in "${samples[@]}"; do
pypolca run -a flye_assemblies/${sample}_flye.fasta -1 all_sr_fastqs/${sample}_100x_1.fastq.gz -2 all_sr_fastqs/${sample}_100x_2.fastq.gz -t 8 -o pypolca_outputs/${sample}_freebayes1.3.1 -p ${sample} -f
done
```

* To run POLCA
Expand Down Expand Up @@ -148,4 +178,16 @@ for sample in "${samples[@]}"; do
dnadiff -p dnadiff_results/$sample POLCA_outputs/${sample}_flye.fasta.PolcaCorrected.fa pypolca_outputs/$sample/${sample}_corrected.fasta
done
samples=(
ATCC_33560
Lerminiaux_isolateI
)
for sample in "${samples[@]}"; do
dnadiff -p dnadiff_results/${sample}_freebayes1.3.1 POLCA_outputs/${sample}_flye.fasta.PolcaCorrected.fa pypolca_outputs/${sample}_freebayes1.3.1/${sample}_corrected.fasta
done
```

0 comments on commit a43fe60

Please sign in to comment.