Releases: waveygang/wfmash
high sensitivity mapping by default
Buildable source tarball: wfmash-v0.21.0.tar.gz
Previously, the defaults were tuned to slightly improve runtime when aligning pangenomes, at the cost of performance in comparative genomics contexts. Updates to mashmap3 and to alignment have made wfmash much more robust under more sensitive defaults.
In this release, we're setting a bunch of defaults which have become standard in testing:
- Default minimum mapping identity reduced from 90% to 70%.
- Set maximum mapping length to 50k by default (previously unlimited).
- Changed block length default from 5x segment length to 3x segment length.
- Set default chain gap to 30k (previously 6x segment length, up to 30k).
- Reduced default segment length from 5k to 1k.
- Changed default kmer size from 19 to 15.
- Modified wflign to run on all fragments except very small ones (less than 1000 bp).
- Changed filtering logic to use Euclidean distance as an absolute cutoff instead of axis-weighted Euclidean distance, while still ranking based on axis-weighted distance.
These should tend to make wfmash more sensitive at the edges of its performance envelope with minimal costs for easy, low-divergence pangenome alignment problems.
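As a rough illustration of how these changes interact, the sketch below derives the block length and chain gap from the segment length under the old and new defaults listed above. The dictionary keys and the helper name are made up for this example and are not wfmash identifiers; only the numbers come from the list.

```python
# Illustrative only: key and function names are not wfmash identifiers.
OLD = {"min_identity": 0.90, "segment_length": 5_000, "block_factor": 5,
       "kmer_size": 19, "max_mapping_length": None}  # None = unlimited
NEW = {"min_identity": 0.70, "segment_length": 1_000, "block_factor": 3,
       "kmer_size": 15, "max_mapping_length": 50_000}


def derived(defaults, old_chain_gap_rule):
    """Compute the settings that depend on the segment length."""
    seg = defaults["segment_length"]
    block_length = defaults["block_factor"] * seg
    # Old rule: 6x segment length, capped at 30k. New rule: fixed 30k.
    chain_gap = min(6 * seg, 30_000) if old_chain_gap_rule else 30_000
    return {"block_length": block_length, "chain_gap": chain_gap}


print(derived(OLD, True))   # {'block_length': 25000, 'chain_gap': 30000}
print(derived(NEW, False))  # {'block_length': 3000, 'chain_gap': 30000}
```

The chain gap is numerically unchanged for the old 5k segment length, but the block length drops from 25k to 3k, which is one reason shorter homologies can now be reported.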
chunking and gliding while head tail global patching
Buildable source tarball: wfmash-v0.20.0.tar.gz
Major Changes
New Global Alignment Approach:
- Replaced the previous head and tail patching with a comprehensive global alignment strategy.
- Implemented erode_head and erode_tail functions to remove small, potentially spurious matches at alignment boundaries.
- The alignment now aims to include the entire query sequence, crucial when using the -P option for chunking mappings.
- This change ensures continuity across the entire sequence, especially important when mappings are broken into smaller pieces for easier alignment.
- Switched from a semi-global approach (pinned at one end) to a fully global alignment, improving accuracy across the entire sequence length.
Improved Chaining Algorithm:
- Introduced an axis-weighted Euclidean distance function for more accurate chaining of mappings.
- This new function helps break mappings when encountering large indels, which can be computationally expensive to align.
- Improves detection of large structural variations directly from the mapping stage.
- Reduces spurious chaining in satellite repetitive sequences by considering the diagonal nature of true matches.
- The weighting maintains the original chain gap threshold for on-diagonal matches while effectively shortening the allowed distance for off-diagonal matches.
Mapping and Alignment Improvements:
- Modified the logic for determining cuttable positions in long alignments to avoid breaking alignments in the middle of structural variations (SVs).
- Adjusted the merging of consecutive mappings to be more selective, prioritizing the preservation of potential SV signals.
- Enhanced the handling of complex genomic structures by improving coordination between mapping and alignment stages.
Performance Optimization:
- Temporarily disabled multithreaded FASTA input processing due to thread safety issues with the samtools faidx reader.
- This change addresses memory efficiency concerns and prevents potential errors in multi-threaded environments.
- Future updates may reintroduce multi-threaded processing with improved memory management.
- Optimized the mapping process when not splitting sequences.
- Improved efficiency of long mapping handling, particularly when max mapping length is set to infinity.
Default Changes:
- Changed the default maximum mapping length (-P/--max-mapping-length) to infinity, allowing for longer continuous alignments when appropriate.
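The axis-weighted Euclidean distance described under Improved Chaining Algorithm can be pictured with a small sketch. The exact weighting used by wfmash is not given in these notes, so the formula below is an assumed form chosen only to reproduce the stated behaviour: on-diagonal steps keep their plain Euclidean length, and off-diagonal steps are inflated so that they hit the chain gap threshold sooner. The function names and the weight w are hypothetical.

```python
import math


def axis_weighted_distance(dx, dy, w=0.5):
    """Euclidean distance inflated for off-diagonal displacements.

    Assumed form, for illustration only: the penalty factor is 0 when
    |dx| == |dy| (on the diagonal) and 1 when the step lies along one axis.
    """
    ax, ay = abs(dx), abs(dy)
    if ax + ay == 0:
        return 0.0
    euclid = math.hypot(ax, ay)
    off_diagonal = 1.0 - 2.0 * min(ax, ay) / (ax + ay)
    return euclid * (1.0 + w * off_diagonal)


def can_chain(dx, dy, chain_gap, w=0.5):
    """True if two mappings separated by (dx, dy) may be chained."""
    return axis_weighted_distance(dx, dy, w) <= chain_gap


# A collinear step is judged by its plain Euclidean length.
print(can_chain(20_000, 20_000, chain_gap=30_000))  # True
# A 28 kb step along one axis (a large indel) is rejected, even though
# its unweighted length is below the 30k threshold.
print(can_chain(28_000, 0, chain_gap=30_000))       # False
```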
Minor Improvements and Bug Fixes
- Enhanced error handling and validation throughout the alignment process.
- Improved coordinate calculations, especially in edge cases involving sequence boundaries and large structural variations.
- Added additional PAF output fields, including a chain identifier for merged mappings.
- Adjusted parameters for more robust alignment in complex regions.
This release significantly improves wfmash's efficiency when handling complex genomic structures (e.g. centromeres) and large-scale variations, particularly when using the -P option to chunk mappings for more efficient alignment. While this option has been left unset by default, we strongly recommend exploring it if you find your alignment times are very slow. A good setting in testing has been -P50k.
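To give a feel for what chunking with -P50k means, here is a minimal sketch that splits one long mapping interval into pieces no longer than a maximum length. It cuts at evenly spaced positions, which is a simplification: wfmash chooses cut points so as not to break alignments inside structural variations. The function name and interface are hypothetical.

```python
def chunk_mapping(start, end, max_len=50_000):
    """Split [start, end) into consecutive pieces no longer than max_len."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    length = end - start
    if length <= 0:
        return []
    n_pieces = -(-length // max_len)   # ceiling division
    step = -(-length // n_pieces)      # even piece size, rounded up
    return [(s, min(s + step, end)) for s in range(start, end, step)]


print(chunk_mapping(0, 120_000))
# [(0, 40000), (40000, 80000), (80000, 120000)]
```

Each piece can then be aligned independently, which keeps the cost of any single alignment problem bounded.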
Better broken mappings
Buildable source tarball: wfmash-v0.18.0.tar.gz
What's Changed
This release fixes a bunch of small issues with previous updates to the mapping merging and splitting logic.
The main update should improve mapping coverage by correctly calculating the block length of the mapping based on the pre-split mapping. We also correctly organize cuts to be in regions without SVs.
Full Changelog: v0.18.0...v0.19.0
Unfolding
Buildable source tarball: wfmash-v0.18.0.tar.gz
Improving mapping in complex regions, debugging recursive patching, and other fun.
Recursive Inversion Patching:
- Implemented recursive patching for inversions, completing the "multipatch" functionality.
- This allows for more accurate alignment of complex genomic regions with inversions.
SAM Output for Multipatch Alignments:
- Added support for SAM output format for multipatch alignments.
- Ensures consistent representation of complex alignments across different output formats.
Orientation-Consistent Alignments:
- Improved alignment consistency across all orientations of reference-query pairs.
- Enhances reliability and reproducibility of alignment results.
Optimized Inversion Patching:
- Implemented a bound on the maximum score for inverted patches.
- Allows for early termination of alignment when the inverted patch is worse than the forward alignment.
Dynamic Multi-Producer Alignment Module:
- Rewrote the alignment module to support multiple producers filling the work queue.
- Dynamically handles memory issues, improving efficiency and scalability.
Overlap Filtering in Plane Sweep Algorithm:
- Implemented an overlap filter to prevent keeping suboptimal mappings.
- New CLI option: -O, --overlap-threshold <F>
- Allows setting the fraction F for dropping mappings overlapping with higher scoring mappings.
- Default value is 0.5.
Long Mapping Fragmentation:
- Enabled breaking of long mappings into smaller fragments at junction points.
- Junctions are defined by four consecutive segments, allowing for more precise breakpoint detection around structural variations.
- New CLI option: -P, --max-mapping-length <N>
- Sets the maximum length of a single mapping before breaking.
- Default value is 1M (1 million bases).
Improved Handling of Satellite Sequences:
- The combination of overlap filtering, mapping fragmentation, and recursive patching significantly improves wfmash's ability to handle satellite sequences.
- These changes address common performance issues and mapping problems associated with highly repetitive regions.
- Users should expect better accuracy and efficiency when aligning genomes with abundant satellite sequences.
Performance Improvements:
- Various optimizations and code refactoring for better overall performance.
Bug Fixes and Minor Enhancements:
- Multiple bug fixes and small improvements throughout the codebase.
This release significantly enhances wfmash's ability to handle complex genomic structures, including challenging satellite sequences. It improves output consistency and optimizes performance for large-scale alignments. The new features and CLI options provide more accurate and detailed alignment information, particularly for regions with inversions, structural variations, and repetitive elements, while offering users greater control over the alignment process. These improvements make wfmash more robust and efficient for a wider range of genomic analyses, especially those involving highly repetitive or complex regions.
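The overlap filter controlled by -O, --overlap-threshold can be sketched as a greedy pass over mappings sorted by score. This is not the plane sweep implementation used in wfmash; it is a simple quadratic illustration. It also assumes that overlap is measured on the query axis as a fraction of the lower-scoring mapping's length, which these notes do not specify. The dictionary layout is invented for the example.

```python
def filter_overlaps(mappings, threshold=0.5):
    """Keep mappings greedily by score.

    A mapping is dropped if more than `threshold` of its query span is
    covered by a single higher-scoring mapping that was already kept.
    """
    kept = []
    for m in sorted(mappings, key=lambda m: m["score"], reverse=True):
        length = m["end"] - m["start"]
        dominated = False
        for k in kept:
            overlap = min(m["end"], k["end"]) - max(m["start"], k["start"])
            if length > 0 and overlap / length > threshold:
                dominated = True
                break
        if not dominated:
            kept.append(m)
    return kept


mappings = [
    {"name": "a", "start": 0, "end": 1000, "score": 0.95},
    {"name": "b", "start": 400, "end": 1200, "score": 0.90},  # 75% inside a
    {"name": "c", "start": 900, "end": 2000, "score": 0.80},  # ~9% inside a
]
print([m["name"] for m in filter_overlaps(mappings)])  # ['a', 'c']
```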
What's Changed
Full Changelog: v0.17.0...v0.18.0
Multipatch
Buildable source tarball: wfmash-v0.17.0.tar.gz
This release introduces multipatch alignment capabilities, significantly enhancing wfmash's ability to handle complex genomic structures, particularly inversions and other rearrangements. Multipatching refers to a process in which the initial wflign traceback is patched, we determine that an inverted orientation of the patch is preferable (as introduced in v0.16.0), and (in v0.17.0) we now attempt multiple patching steps to span the gap. Key improvements include:
Multipatch Alignment:
- Implemented a progressive alignment approach that can detect and align multiple patches, including inversions, within a single alignment region.
- Added a new tag patch:Z:true to indicate multipatch alignments in the output.
- Introduced an inv:Z:true/false tag to specify whether a patch is inverted.
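Downstream tools can pick out inverted patches by reading these tags. The sketch below assumes the tags appear as ordinary optional TAG:TYPE:VALUE fields after the 12 mandatory PAF columns; the example record is fabricated purely to exercise the parser and the helper names are hypothetical.

```python
def parse_tags(paf_line):
    """Return the optional TAG:TYPE:VALUE fields of a PAF record as a dict."""
    fields = paf_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[12:]:  # the first 12 PAF columns are mandatory
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags


def is_inverted_patch(paf_line):
    """True if the record is a multipatch alignment flagged as inverted."""
    tags = parse_tags(paf_line)
    return tags.get("patch") == "true" and tags.get("inv") == "true"


# Made-up record: 12 mandatory columns followed by the two tags.
record = "\t".join([
    "query1", "5000", "100", "400", "-", "target1", "8000", "2100", "2400",
    "290", "300", "60", "patch:Z:true", "inv:Z:true",
])
print(is_inverted_patch(record))  # True
```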
Alignment Refinements:
- Implemented trimming of alignments to remove leading and trailing indels, improving alignment quality.
- Added bounds detection for alignments to better handle partial matches.
- Increased the default chain gap to 6x segment length, up to 30k, allowing for detection of larger variants.
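Trimming leading and trailing indels can be illustrated on a CIGAR string. This is a minimal sketch, not the wfmash implementation: it assumes an extended CIGAR using the standard convention that I consumes query bases and D consumes target bases, and it reports how far the alignment start moves on each sequence. The function name is hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([=XMID])")


def trim_indels(cigar):
    """Strip leading and trailing I/D runs from a CIGAR string.

    Returns (trimmed_cigar, query_skip, target_skip), where the skips are
    the number of bases to add to the alignment start coordinates.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    query_skip = target_skip = 0
    while ops and ops[0][1] in "ID":
        n, op = ops.pop(0)
        if op == "I":
            query_skip += n    # an insertion consumes query bases
        else:
            target_skip += n   # a deletion consumes target bases
    while ops and ops[-1][1] in "ID":
        ops.pop()              # trailing indels only move the end coordinates
    return "".join(f"{n}{op}" for n, op in ops), query_skip, target_skip


print(trim_indels("5I3D20=1X10=4D"))  # ('20=1X10=', 5, 3)
```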
Output Enhancements:
- Modified the output format to clearly distinguish multipatch alignments.
- Improved logging and debugging output for better insight into the alignment process.
Code Improvements:
- Enhanced the alignment_t class with new accessors for query and target begin/end positions.
- Implemented pruning of overlapping patches to avoid redundant alignments.
- Refactored several core functions for better modularity and readability.
Build System:
- Added libdeflate as a dependency in the Guix build configuration.
This release significantly improves wfmash's ability to handle complex genomic alignments, particularly those involving local inversions and other structural variations. The multipatch approach allows for a more complete representation of genomic relationships in challenging regions than is available in other methods.
Happy aligning with enhanced structural variation breakpoint resolution! 🧬🔍🧮
What's Changed
- add deflate to guix.scm by @AndreaGuarracino in #258
- Multi-patch by @ekg in #259
Full Changelog: v0.16.0...v0.17.0
Inversion patching and mashmap3 index saving
Buildable source tarball: wfmash-v0.16.0.tar.gz
The primary enhancement in this release is the implementation of inversion detection during the alignment patching process. This feature significantly improves the alignment accuracy for sequences containing inversions.
How it works:
- Patching Process: During the wflign high-level trace patching, the algorithm identifies regions that do not align well in the forward orientation.
- Reverse Complement Alignment: For these poorly aligned regions, the algorithm attempts an alignment with the reverse complement of the sequence.
- Score Comparison: The algorithm compares the alignment scores of the forward and reverse complement alignments.
- Selection: If the reverse complement alignment produces a better score, it is selected for that region.
- Output: Reverse complement alignments are reported with an additional SAM tag rc:Z:true.
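The comparison and selection steps above can be sketched as follows. This is an illustration under stated assumptions, not the wfmash code: the two CIGAR strings are taken as given (no aligner is run), the penalties are placeholder gap-affine values, and a lower penalty is treated as the better score. Only the default minimum inverted patch length of 23 comes from this release; all function names are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def cigar_penalty(cigar, mismatch=2, gap_open=3, gap_extend=1):
    """Gap-affine penalty of an extended CIGAR (lower is better)."""
    penalty = 0
    for n, op in re.findall(r"(\d+)([=XID])", cigar):
        n = int(n)
        if op == "X":
            penalty += n * mismatch
        elif op in "ID":
            penalty += gap_open + n * gap_extend
    return penalty


def pick_orientation(forward_cigar, revcomp_cigar, patch_len, min_inv_len=23):
    """Return 'rc' only if the patch is long enough and strictly better."""
    if patch_len < min_inv_len:
        return "forward"
    if cigar_penalty(revcomp_cigar) < cigar_penalty(forward_cigar):
        return "rc"
    return "forward"


print(reverse_complement("AACG"))                   # CGTT
print(pick_orientation("20I20D", "38=2X", 40))      # rc
print(pick_orientation("20I20D", "38=2X", 10))      # forward
```

In the second call the forward patch is all indels (penalty 46) while the reverse complement matches with two mismatches (penalty 4), so the inverted orientation is selected.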
Key Components:
- New parameter wflign_min_inv_patch_len: Sets the minimum length of an inverted patch to be considered (default: 23).
- calculate_alignment_score function: Computes alignment scores based on the CIGAR string and penalties.
- Modified do_wfa_patch_alignment function: Now handles both forward and reverse complement alignments.
- Updated write_merged_alignment function: Processes and outputs reverse complement alignments.
This feature allows wfmash to accurately align sequences with inversions, improving its utility for complex genomic comparisons.
Other Significant Changes
MashMap Index Support:
- Implemented creation and usage of MashMap indexes for faster repeat mapping.
- New CLI options: --mm-index, --create-index-only, --overwrite-mm-index.
Memory Optimization:
- Improved memory usage in the Sketch class.
Kmer Size Calculation:
- Modified to handle edge cases with high-identity alignments.
Alignment Class Improvements:
- Enhanced alignment_t class with proper copy and move semantics.
Index File Handling:
- Improved reading and writing processes with parameter validation.
Detailed Log of Changes
src/align/include/align_parameters.hpp
- Added wflign_min_inv_patch_len parameter to Parameters struct.
src/align/include/computeAlignments.hpp
- Integrated wflign_min_inv_patch_len into WFlign constructor call.
src/common/wflign/src/wflign.cpp and wflign.hpp
- Added min_inversion_length to WFlign constructor and member variables.
- Modified minhash_kmer_size calculation for edge cases.
src/common/wflign/src/wflign_alignment.cpp and wflign_alignment.hpp
- Implemented copy/move constructors and assignment operators for alignment_t.
- Added calculate_alignment_score function.
src/common/wflign/src/wflign_patch.cpp and wflign_patch.hpp
- Modified do_wfa_patch_alignment for reverse complement handling.
- Updated write_merged_alignment for reverse complement output.
- Refined patching process for bidirectional alignment consideration.
src/interface/parse_args.hpp
- Added CLI options for MashMap indexing and wflign_min_inv_patch_len.
src/map/include/map_parameters.hpp
- Added parameters for MashMap indexing support.
src/map/include/parseCmdArgs.hpp
- Updated parsing for new MashMap indexing options.
src/map/include/winSketch.hpp
- Implemented MashMap index functions (create, read, write).
- Added CLI-index file parameter validation.
- Optimized Sketch class memory usage.
anything, anywhere, everywhere
Buildable Source Tarball: wfmash-v0.15.0.tar.gz
Initial experiments in our all-to-all alignment of the draft vertebrate genomes project demonstrated that we were not generating end-to-end alignments for many mashmap3 homology pairs at 70% ANI (wfmash -m -p 70). Exploration showed that our attempts at automatically tuning alignment parameters based on mashmap estimated identity simply didn't work. The parameter settings we used meant that optimal wflign alignments were often I*D*, or "fully indel-ed", leading to no insight into the homology between the pairs even when internally WFA segments did match.
To avoid this "gotcha" and ensure we obtain an alignment, we set the softest wflign parameters possible to maintain the inequality match < gap-extend < mismatch < gap-open: match=0, mismatch=2, gap-open=3, gap-extend=1. We also use 0,3,4,2,24,1 for our WFA patching parameters, matching minimap2's asm20 setting. These changes lead to a major improvement in runtime and memory usage during alignment. In WFA, where everything is order of score or score*score, smaller scores mean lower memory and faster runtime.
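The "fully indel-ed" gotcha can be checked with simple arithmetic. The sketch below compares the penalty of a mismatch-only alignment at 70% identity with the penalty of an I*D* alignment, first under the wflign parameters stated above and then under a harsher parameter set. The harsher values (mismatch=8, gap-open=10) are hypothetical and chosen only to show the failure mode; the mismatch-only model and the function names are simplifications for this example.

```python
def satisfies_ordering(match, mismatch, gap_open, gap_extend):
    """Check the inequality match < gap-extend < mismatch < gap-open."""
    return match < gap_extend < mismatch < gap_open


def mismatch_only_penalty(length, identity, mismatch):
    """Penalty when every non-matching column is a mismatch."""
    return round(length * (1 - identity)) * mismatch


def fully_indeled_penalty(query_len, target_len, gap_open, gap_extend):
    """Penalty of an I*D* alignment: one insertion run plus one deletion run."""
    return 2 * gap_open + (query_len + target_len) * gap_extend


n = 1000
# Parameters from this release: match=0, mismatch=2, gap-open=3, gap-extend=1.
print(satisfies_ordering(0, 2, 3, 1))       # True
print(mismatch_only_penalty(n, 0.70, 2))    # 600
print(fully_indeled_penalty(n, n, 3, 1))    # 2006

# Hypothetical harsher parameters: the I*D* alignment becomes the cheaper one.
print(mismatch_only_penalty(n, 0.70, 8))    # 2400
print(fully_indeled_penalty(n, n, 10, 1))   # 2020
```

With the soft parameters the real alignment scores 600 against 2006 for the fully indel-ed one, so the optimum keeps the homology. The lower optimal score is also what reduces WFA's runtime and memory.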
We also ran into portability issues. The biggest improvement was to bring back static builds with options to enable generic compatibility with many recent x86 systems. This will allow direct distribution of binaries in these releases.
We also hit some very strange software bugs that led us to drop jemalloc. It was causing problems such as IOT-like invalid instruction errors and signal 9 allocation errors at 5% RAM usage, and it offers no obvious performance advantage in wfmash's current setup. We mention it here because it was a very tricky bug to resolve.
New Features and Enhancements
Breaking Changes
wfmash now requires the query FASTA sequence to be bgzipped and samtools faidx indexed, as well as the target sequence. This gives us random access to the query, which improves performance in parallel and high-performance computing settings because we no longer have to spool through very large query files when aligning only a small part of them.
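A pipeline can check this requirement before launching jobs. The sketch below assumes the usual samtools behaviour of writing a .fai and a .gzi file next to a bgzipped FASTA when it is indexed with samtools faidx; the helper name is hypothetical and it only tests for the files, not their contents.

```python
from pathlib import Path


def missing_index_files(fasta_gz):
    """List the index files that are absent next to a bgzipped FASTA."""
    base = str(Path(fasta_gz))
    expected = [base + ".fai", base + ".gzi"]
    return [p for p in expected if not Path(p).exists()]


# An empty list means both indexes were found next to the FASTA.
for path in missing_index_files("queries.fa.gz"):
    print("missing:", path)
```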
Publications
- Added a new citation for the biWFA algorithm:
- Santiago Marco-Sola, Jordan M. Eizenga, Andrea Guarracino, Benedict Paten, Erik Garrison, and Miquel Moreto. "Optimal gap-affine alignment in O (s) space". Bioinformatics, 2023.
Build System
- Configurable Build Options: Introduced new CMake options to make the build process more flexible:
  - BUILD_STATIC: Option to build a static binary.
  - BUILD_DEPS: Option to build external dependencies (htslib, gsl, libdeflate) from source.
  - BUILD_RETARGETABLE: Option to build a retargetable binary without machine-specific optimizations.
- Static Compilation: Improved support for static compilation, including the ability to build static binaries and handle external dependencies more flexibly.
- OpenMP Support: Added OpenMP support for parallel processing.
- Improved Documentation: Updated the README to provide detailed instructions for building from source, including static and retargetable binaries.
Performance and Optimization
- Optimized Compilation Flags: Adjusted compilation flags for better performance and compatibility across different systems.
- Memory Management: Improved memory management by reducing the number of sketches kept in memory during large alignments.
- Query Sequence Handling: Enhanced the handling of query sequences to support random access, reducing memory usage and improving performance.
Bug Fixes
- Memory Access Errors: Fixed potential memory access errors by adding bounds checks for sequence indices.
- Thread Safety: Ensured thread safety by using a single faidx_t object for sequence fetching, shared among multiple threads.
- Alignment Filtering: Disabled low-identity filtering by default to ensure all alignments are kept for post-processing.
Miscellaneous
- Nix and Guix Support: Added support for building wfmash using Nix and Guix, including Docker image generation.
- Test Cases: Added a script to generate test cases for wflign, facilitating easier testing and validation.
Detailed Changes
Commit Highlights
- Commit 577c3de: Added biWFA citation to the README.
- Commit 1d142d9: Merged changes for Stampede3 build configuration.
- Commit d55cfe7: Made the build configurable and documented how to use the new options.
- Commit 18e33b0: Fixed the path for libdeflate in the CMake configuration.
- Commit 9ff0452: Merged updates for scoring parameter optimizations.
- Commit 609082b: Updated build to use Clang and removed jemalloc dependency.
- Commit e6f1824: Restored micromamba/anaconda support.
- Commit debeff7: Debugged build on TACC's Stampede3 cluster.
- Commit 75a6631: Improved build process for Stampede3 cluster.
- Commit 719381c: Avoided -march=native for broader compatibility.
- Commit 081213c: Fixed memory management issues in alignment code.
- Commit fb4c6d0: Used generic modern optimizations, avoiding processor-specific flags.
- Commit c04088e: Ensured zero-termination of sequence data fetched with faidx_fetch_seq64.
- Commit ea76722: Added validation for mashmap input rows.
- Commit fedad55: Reduced the number of sketches kept in memory during large alignments.
- Commit a7aa342: Improved queue behavior and memory management in alignment code.
- Commit 581b364: Disabled low-identity filtering by default.
- Commit 3f1f7af: Corrected documentation of queues.
- Commit acd7fdc: Updated atomic queue definition for better single-producer multi-consumer behavior.
- Commit 2b91145: Avoided deadlock on empty input files.
- Commit edcd281: Used a single faidx_t object for sequence fetching to save memory.
- Commit a04908b: Fixed scoring parameters for diverse alignment problems.
- Commit bb0d43d: Merged updates for forcibly using biWFA alignment.
- Commit 3e434f8: Added a script to create test cases for wflign.
- Commit cc89be6: Added option to force global biWFA alignment.
- Commit 35194d8: Merged updates for random access to queries during alignment.
- Commit c738c1d: Stopped sorting the input mapping file for better performance.
- Commit 70e896a: Removed redundant query sequence processing.
- Commit 6fcddc2: Enabled random access of query subsequences in alignment.
- Commit d9a0880: Limited to one query sequence file for simplicity.
- Commit 729d9d7: Merged updates for static build reimplementation.
- Commit 8477337: Corrected debugging build with PNG and TSV support.
- Commit efc8f04: Added libdeflate as a dependency.
- Commit deb1472: Updated minimum CMake versions.
- Commit d1588e6: Described static compilation options in the README.
- Commit 2713899: Defaulted to non-static builds.
- Commit a96919e: Reimplemented static builds.
- Commit 899e154: Bumped Nix build configuration.
- Commit 7376468: Reverted removal of flake.lock.
- Commit 526995a: Removed flake.lock.
- Commit f94b7ee: Updated Nix build configuration.
- Commit e2df9c8: Moved to Nix flake.
- Commit cbedc8f: Locked the Nix flake.
- Commit b0d0ada: Added Nix flake configuration.
Happy whole-genome-aligning! 🔬🧬📊
tackling the all-vs-all matrix
Buildable Source Tarball: wfmash-v0.14.0.tar.gz
This release adds support for subsetting the queries, in addition to the existing target subsetting. A list of queries can be supplied. (We still work with only a single target, though.) The idea is to make it possible to subdivide the all-versus-all alignment matrix and run many small jobs, each aligning several queries against a single target. Running all queries against one target in a single job would be computationally infeasible, because there might be many hundreds of thousands of queries. There are some other bug fixes and updates as well, but the main change that triggers a release is the change in the command line API.
changelog
Query filtering and specification improvements
- Added support for specifying a comma-delimited list of query name prefixes to filter queries with the -Q/--query-prefix option.
- Added -A/--query-list option to specify a file containing a list of query sequence names to use.
- Updated internal sequence iteration and counting logic to properly apply the new query filtering options.
Target filtering option name changes
- Renamed target prefix filtering option from -P/--target-prefix to -T/--target-prefix for consistency.
- Renamed target list filtering option from -A/--target-list to -R/--target-list.
All-to-all alignment script improvements
- Updated scripts/all2all_jobs.py to:
  - Support grouping by genome, haplotype, or contig.
  - Allow specifying different grouping levels for target and query sequences.
  - Directly generate wfmash command lines.
- Added scripts/make_source_targball.sh to generate a source tarball for releases.
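The idea of subdividing the all-versus-all matrix can be sketched as below. This is not a reimplementation of scripts/all2all_jobs.py: it assumes PanSN-style sequence names (sample#haplotype#contig), batches a fixed number of query groups per job, skips self-comparisons, and emits only the -T and -Q option strings described in this release rather than complete command lines. All function and parameter names are hypothetical.

```python
LEVELS = {"genome": 1, "haplotype": 2, "contig": 3}


def group_key(name, level):
    """Prefix of a PanSN-style name at the requested grouping level."""
    parts = name.split("#")[:LEVELS[level]]
    key = "#".join(parts)
    # Keep the trailing delimiter so that prefix "A#" does not match "AB#...".
    return key if level == "contig" else key + "#"


def all2all_options(names, target_level="genome", query_level="genome",
                    queries_per_job=2):
    """One option string per job: a single target group, a batch of queries."""
    targets = sorted({group_key(n, target_level) for n in names})
    queries = sorted({group_key(n, query_level) for n in names})
    jobs = []
    for t in targets:
        # Skip query groups nested in, or containing, the target group.
        others = [q for q in queries
                  if not q.startswith(t) and not t.startswith(q)]
        for i in range(0, len(others), queries_per_job):
            batch = ",".join(others[i:i + queries_per_job])
            jobs.append(f"-T {t} -Q {batch}")
    return jobs


names = ["A#1#chr1", "A#2#chr1", "B#1#chr1", "C#1#chr1"]
for options in all2all_options(names):
    print(options)
# -T A# -Q B#,C#
# -T B# -Q A#,C#
# -T C# -Q A#,B#
```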
Build and testing updates
- Added back rt library to CMake configuration.
- Updated CI tests to run on the main branch.
- Adjusted CI test cases for the subset of the LPA dataset.
Bug fixes
- Fixed a heap-use-after-free error in wflign_affine_wavefront().
v0.13.1
Buildable Source Tarball: wfmash-v0.13.1.tar.gz
What's Changed
- drop timing from the output by default by @AndreaGuarracino in #231
- fix "alignment block length" in PAF output by @AndreaGuarracino in #235
- improve MAC compatibility by @AndreaGuarracino in #236
- Make chain gap dynamic by @bkille in #234
- for the chain_gap, do not go above 20k by default by @AndreaGuarracino in #238
Full Changelog: v0.13.0...v0.13.1
v0.13.0
Buildable Source Tarball: wfmash-v0.13.0.tar.gz
What's Changed
- Do not allow a segment to chain with itself by @bkille in #225
- Convex penalties for the alignment patching by @AndreaGuarracino in #229
Full Changelog: v0.12.6...v0.13.0