Implement k-mer subsetting methods #510

padix-key · 2023-12-01T13:55:30Z

This PR adds modern k-mer subsetting methods (e.g. minimizers and syncmers) to allow more rapid sequence matching:

MinimizerSelector
SyncmerSelector
CachedSyncmerSelector
MincodeSelector
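To illustrate what a minimizer-style selector computes, here is a minimal pure-Python sketch. It does not use or reproduce Biotite's implementation. The function names `kmer_codes` and `select_minimizers` are hypothetical, and the window size is assumed to be counted in k-mers.

```python
def kmer_codes(seq, k, alphabet="ACGT"):
    """Encode each k-mer of `seq` as an integer (a base-len(alphabet) number)."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    base = len(alphabet)
    codes = []
    for start in range(len(seq) - k + 1):
        code = 0
        for symbol in seq[start:start + k]:
            code = code * base + index[symbol]
        codes.append(code)
    return codes


def select_minimizers(codes, window):
    """Return (position, code) of the smallest k-mer in each window.

    `window` is the number of consecutive k-mers per window (an assumption
    of this sketch). Windows that share the same minimizer yield one entry.
    """
    selected = []
    for start in range(len(codes) - window + 1):
        # Leftmost minimum within the current window
        pos = min(range(start, start + window), key=lambda i: codes[i])
        if not selected or selected[-1][0] != pos:
            selected.append((pos, codes[pos]))
    return selected


codes = kmer_codes("TACGTAGC", k=3)
minimizers = select_minimizers(codes, window=4)
print(minimizers)
```

Only the selected subset of k-mers needs to be indexed and matched, which is where the speedup over using every k-mer comes from.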

The implemented methods are based on finding the 'smallest' k-mer in a certain window. Since comparing the k-mer codes directly would amount to lexicographic ordering, different permutation schemes are implemented to define alternative orders:

Permutation
RandomPermutation
FrequencyPermutation
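The effect of a permutation on which k-mer counts as 'smallest' can be sketched as follows. This is an illustration under stated assumptions, not the code behind the classes above: `lcg_permute` uses a generic linear congruential mapping with well-known 64-bit constants, and `frequency_ranks` assumes the aim is to give rare k-mers small sort keys. Both names are hypothetical.

```python
MASK = (1 << 64) - 1


def lcg_permute(code, a=6364136223846793005, c=1442695040888963407):
    """Map a k-mer code to a pseudo-random 64-bit sort key.

    Because `a` is odd, the mapping is a bijection on 64-bit integers,
    so two different codes never receive the same key.
    """
    return (a * code + c) & MASK


def frequency_ranks(counts):
    """Map each k-mer code to a rank; rarer k-mers get smaller ranks.

    `counts` maps k-mer code -> number of occurrences.
    Ties are broken by the code itself.
    """
    ordered = sorted(counts, key=lambda code: (counts[code], code))
    return {code: rank for rank, code in enumerate(ordered)}


codes = [49, 6, 27, 44, 50, 9]

# Ordering by the raw code: the lexicographically smallest k-mer wins
lexicographic_pos = min(range(len(codes)), key=lambda i: codes[i])

# Ordering by frequency rank: the rarest k-mer wins instead
counts = {49: 1, 6: 5, 27: 2, 44: 1, 50: 3, 9: 4}
ranks = frequency_ranks(counts)
frequency_pos = min(range(len(codes)), key=lambda i: ranks[codes[i]])

# Ordering by a pseudo-random key
random_pos = min(range(len(codes)), key=lambda i: lcg_permute(codes[i]))
```

Swapping the sort key changes the selected position without touching the selection logic itself, which is why the ordering can be kept separate from the selector.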

To support the long k-mers that are typically used in modern read mapping methods, the BucketKmerTable is introduced as a memory-efficient twin of the KmerTable.
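The memory argument can be illustrated with a toy index. This sketch assumes the general bucketing idea: a table with one slot per possible k-mer grows exponentially with k, whereas a fixed number of buckets does not, at the cost of storing the k-mer code with each entry. The class `BucketTableSketch` and its methods are hypothetical and do not mirror the BucketKmerTable interface.

```python
class BucketTableSketch:
    """Toy index mapping a k-mer code to its (ref_id, position) entries."""

    def __init__(self, n_buckets):
        self.n_buckets = n_buckets
        self._buckets = [[] for _ in range(n_buckets)]

    def add(self, code, ref_id, position):
        # Distinct codes may share a bucket, so the code is stored as well
        self._buckets[code % self.n_buckets].append((code, ref_id, position))

    def lookup(self, code):
        # Filter out entries of other k-mers that fell into the same bucket
        bucket = self._buckets[code % self.n_buckets]
        return [(ref_id, pos) for c, ref_id, pos in bucket if c == code]


table = BucketTableSketch(n_buckets=7)
for position, code in [(1, 6), (5, 9), (8, 6), (9, 13)]:
    table.add(code, ref_id=0, position=position)

hits = table.lookup(6)
print(hits)
```

Here the codes 6 and 13 land in the same bucket, and the stored code is what keeps their entries apart at lookup time.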

Furthermore, an example is introduced that demonstrates gene counting using a combination of the new features.

padix-key and others added 4 commits December 1, 2023 14:07

Add submers

2672626

Add submer matching

22eef22

Fix memory leak of index C arrays

90e8f40

Rename 'submer' to 'subset'

007598b

padix-key force-pushed the minimizer branch 2 times, most recently from d25ca0a to dc0ee6b Compare December 1, 2023 14:21

padix-key added 18 commits December 1, 2023 16:23

Refactor and add permutation classes

72269de

Implement further subset rules

d4e5341

Fix wrong output dtype of 'split()'

c8c9d2c

Rename 'rule' to 'selector'

0654efe

Better test string representation

b1be5a7

Add docstrings

73640ca

Add BinnedKmerTable

2a3299b

Fix location identifiers with whitespace

9a49e95

Cap number of workers for lower RAM consumption

60ce645

Fix docstring errors

19243c1

Fix JSON syntax

f9ec6ac

Add example

3af0cfd

Remove unused imports

8cb5952

Implement get_kmers() for BinnedKmerTable

9765307

Add docstrings

2b3820b

Rename 'bins' to 'buckets' and determine them automatically

77505c9

Support compilation of C++ modules

8c9602a

Remove print statement

af7fa4b

padix-key force-pushed the minimizer branch from dc0ee6b to af7fa4b Compare December 1, 2023 15:25

padix-key merged commit 9445a25 into biotite-dev:master Dec 2, 2023
17 checks passed

padix-key deleted the minimizer branch December 17, 2023 16:35
