Implement k-mer subsetting methods #510

Merged: 22 commits, Dec 2, 2023

padix-key
Member

@padix-key padix-key commented Dec 1, 2023

This PR adds modern k-mer subsetting methods (e.g. minimizers and syncmers) to allow more rapid sequence matching:

  • MinimizerSelector
  • SyncmerSelector
  • CachedSyncmerSelector
  • MincodeSelector
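To illustrate the idea behind minimizer selection, here is a minimal pure-Python sketch. It is not Biotite's implementation and does not use its API: the helper names `kmer_codes` and `minimizers` are hypothetical, the window is assumed to be counted in k-mers, and ties are assumed to resolve to the leftmost k-mer.

```python
def kmer_codes(seq, k, alphabet="ACGT"):
    """Encode every k-mer of `seq` as an integer in base len(alphabet)."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    codes = []
    for start in range(len(seq) - k + 1):
        code = 0
        for symbol in seq[start:start + k]:
            code = code * len(alphabet) + index[symbol]
        codes.append(code)
    return codes


def minimizers(seq, k, window):
    """
    Return the sorted (position, code) pairs of the smallest k-mer
    in each run of `window` consecutive k-mers.
    Assumption: ties are resolved in favor of the leftmost k-mer.
    """
    codes = kmer_codes(seq, k)
    selected = set()
    for start in range(len(codes) - window + 1):
        win = codes[start:start + window]
        # min() over offsets returns the first offset with the lowest code
        offset = min(range(window), key=lambda j: win[j])
        selected.add((start + offset, win[offset]))
    return sorted(selected)


# The 3-mers of "TACGGTA" are TAC, ACG, CGG, GGT, GTA
print(minimizers("TACGGTA", k=3, window=3))  # [(1, 6), (2, 26)]
```

Adjacent windows often share their minimizer, so only two of the five k-mers are kept here. This reduction is what makes matching on the subset faster than matching on all k-mers.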

All implemented methods select the 'smallest' k-mer within a given window. Because ordering k-mers by their code alone would amount to lexicographic ordering, different permutation schemes are implemented to define alternative orderings:

  • Permutation
  • RandomPermutation
  • FrequencyPermutation
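The effect of a permutation on which k-mer counts as 'smallest' can be sketched as follows. This is an illustration only, not the code behind `RandomPermutation`: the function names are hypothetical, and the multiplier and increment are Knuth's MMIX LCG constants, chosen here as an assumption.

```python
def lcg_permute(code, a=6364136223846793005, c=1442695040888963407, bits=64):
    """
    Map a k-mer code to a pseudo-random sort key using one step of a
    linear congruential generator modulo 2**bits.
    With an odd multiplier this mapping is a bijection, so two different
    k-mer codes never receive the same key.
    """
    return (a * code + c) % (1 << bits)


def smallest_kmer(codes, key=lambda code: code):
    """Return the k-mer code that is smallest under the given ordering key."""
    return min(codes, key=key)


codes = [1, 2, 3]
# Ordering by the code itself is the lexicographic ordering
print(smallest_kmer(codes))                   # 1
# Ordering by the permuted key selects a different k-mer
print(smallest_kmer(codes, key=lcg_permute))  # 3
```

Breaking the lexicographic order this way avoids a systematic preference for k-mers that start with the first symbols of the alphabet, such as poly-A stretches.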

To support the long k-mers typically used in modern read mapping methods, the BucketKmerTable is introduced as a memory-efficient twin of the KmerTable.
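The memory argument can be illustrated with a small sketch: a table with one slot per possible k-mer needs 4^k slots for nucleotides, whereas a bucket table uses a fixed number of buckets and stores the k-mer code next to each position to resolve collisions. The class below is a hypothetical pure-Python illustration under these assumptions, not the actual `BucketKmerTable` implementation or its API.

```python
class BucketTableSketch:
    """Toy bucket-based k-mer index with a fixed number of buckets."""

    def __init__(self, n_buckets):
        self.n_buckets = n_buckets
        # Memory scales with n_buckets and the number of stored k-mers,
        # not with the number of possible k-mers
        self.buckets = [[] for _ in range(n_buckets)]

    def add(self, code, ref_id, position):
        """Store one k-mer occurrence in the bucket selected by its code."""
        bucket = self.buckets[code % self.n_buckets]
        bucket.append((code, ref_id, position))

    def lookup(self, code):
        """Return all (ref_id, position) pairs stored for exactly this code."""
        bucket = self.buckets[code % self.n_buckets]
        # Different codes may share a bucket, so filter by the stored code
        return [(ref_id, pos) for c, ref_id, pos in bucket if c == code]


table = BucketTableSketch(n_buckets=8)
table.add(code=3, ref_id=0, position=10)
table.add(code=11, ref_id=1, position=42)  # 11 % 8 == 3: same bucket as code 3
print(table.lookup(3))   # [(0, 10)]
print(table.lookup(11))  # [(1, 42)]
```

The trade-off in this sketch is an extra comparison per bucket entry at lookup time in exchange for a table size that no longer depends on k.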

Furthermore, an example is introduced that demonstrates gene counting using a combination of the new features.

@padix-key padix-key force-pushed the minimizer branch 2 times, most recently from d25ca0a to dc0ee6b Compare December 1, 2023 14:21
@padix-key padix-key merged commit 9445a25 into biotite-dev:master Dec 2, 2023
17 checks passed
@padix-key padix-key deleted the minimizer branch December 17, 2023 16:35