Table of Algorithms

Name	Description	Use case	delete?
edit_dist_t	2 row	Accurate distances in all cases	no
edit_dist_t_2d	2D, no trimming, no max	Benchmark and unit tests only	no
bounded_edit_dist_t	2 row, banded, max early exit	Compute distances less than k	no
min_edit_dist_t	2 row, banded, max early exit, tracks min	Find closest to const string	no
similarity_t	2 row, banded, max early exit	Similarity Score at least p	no
min_similarity_t	replace with variant of min_edit_dist_t	Find closest similarity	replace
bounded_edit_dist	1 row, banded, max early exit	Compute Edit Distance	no
bounded_edit_dist	1 row, aggressive banding	Benchmarks only	no
noop	does nothing	Benchmarks only	no

Algorithm renames

new	old
edit_dist	lev
edit_dist_t	damlev
bounded_edit_dist	levlim
bounded_edit_dist_t	damlevlim
min_edit_dist	levmin
min_edit_dist_t	damlevmin
similarity_t	damlevp
min_similarity_t	damlevminp

To Do

Change "policy" macros/variables to use string values so they are guaranteed to be mutually exclusive.
Use the CACHE STRING option for user configurable values in CMake.
Benchmark: make max edit distance configurable instead of hard coded.
One-off: make two strings configurable instead of hard coded.
Need to look at CMake files and decide how to clean them up a bit for linux and windows builds.
Test our include/mysql-8-0-39, put all mysql source code version in the include, this removes the need to use host system for build, allows for cross compiler, would also allow other verisons to be built. EDIT: This is tricky. We'll always be chasing new versions, different platforms. Using the system header is probably the right thing to do.
Policy for buffer exceeded is not implemented. We just return 0. However, policy is selectable in CMakeLists.txt.
In description of optimizations, discuss the condition start_j = std::max(1, i - (effective_max - minimum_within_row)) and start_j = std::max(1, i - effective_max).

Changes Made

MySQL no longer uses my_bool. Use int instead for all *_init functions.
We do all our arithmetic in int, but the functions return long long, because that is the C++ datatype corresponding to MySQL's INTEGER. This isn't a problem.
clang-tidy "unused" warnings are silences with [[maybe_unused]], which is part of the standard as of C++17. We no longer use __attribute__((unused)).
Factor out common "pre-algorithm" code that is the same across algorithms.

Questions

Why initialize DAMLEV_MAX_EDIT_DISTANCE to the buffer size? Maybe use inverse function of that used in buffer size check, or possibly something else.
How does benchmark differ from unittests? ANSWER: Benchmark more closely simulates our use case for correcting scientific names. Also, unittests does setup and teardown for every call.

Some results

Narrowing band in banded variant

The standard banded algorithm only computes cells within max+1 of the diagonal. More aggressive narrowing of the band is possible. For example, the condition start_j = std::max(1, i - (effective_max - minimum_within_row)) is superior to start_j = std::max(1, i - effective_max) in terms of computing fewer cells and exiting early more often. However, this condition apparently increases branch misprediction and consequently performs worse that the standard band.

Another band narrowing strategy is the following:

if(j > i && current_cell > effective_max && buffer[j] > effective_max) {
    // Don't bother computing the remainder of the band.
    buffer[j] = current_cell;
    if(j<=m) buffer[j+1] = max + 1; // Set sentinel.
    end_j = j;
    break;
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Notes.md

Notes.md

Table of Algorithms

To Do

Changes Made

Questions

Some results

Narrowing band in banded variant

Files

Notes.md

Latest commit

History

Notes.md

File metadata and controls

Table of Algorithms

To Do

Changes Made

Questions

Some results

Narrowing band in banded variant