
Couple of (very minor) issues with distance calculations #28

Open
boydgreenfield opened this issue Dec 23, 2015 · 11 comments

@boydgreenfield

  1. The manuscript refers to transposition as an issue, but Hamming distance only captures position-by-position differences, not true edit distance (i.e., Levenshtein). In practice this may not be a huge issue, but it does mean that a transposed or shifted ID could get confused (e.g., both "abcdef" and "bcdefg" would be allowed as valid IDs, and the two are easily confused if a letter is left off).
  2. The fix.py function doesn't use the same edit-distance calculation as mint.py. It would probably be best to harmonize these, but again, given the random nature of the generated IDs, this is likely a minor issue in practice.
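To make point 1 concrete, here is a minimal sketch (illustrative, not cual-id's code) contrasting the two distances on the "abcdef"/"bcdefg" example: a single dropped leading character shifts every position, so Hamming sees a large distance while Levenshtein correctly sees a small one.

```python
def hamming(a, b):
    """Positional mismatch count; only defined for equal-length strings."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

print(hamming("abcdef", "bcdefg"))      # 6: every position differs
print(levenshtein("abcdef", "bcdefg"))  # 2: delete 'a', append 'g'
```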

Opened a PR (#27) with one proposed fix. And thanks for the work – package + paper do a nice job outlining the benefits of the approach!

@johnchase
Owner

Thanks @boydgreenfield for taking the time to look through the code and put together these suggestions and pull request!

I agree that Levenshtein is the correct algorithm for minting IDs; we debated using it initially but chose Hamming due to run time. For instance, minting 1000 IDs with Hamming takes ~1 s, whereas Levenshtein takes over a minute. @ebolyen may have a way of implementing Levenshtein distance with Levenshtein automata that could reduce the runtime sufficiently; I will let him comment on that.
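One common speed-up for this particular use case — sketched here under the assumption that minting only needs to know whether a candidate is within the minimum distance, not the exact distance — is to bound the DP and bail out as soon as every remaining alignment already exceeds the threshold. This is not cual-id's implementation, just an illustration.

```python
def levenshtein_within(a, b, max_dist):
    """Return True iff edit_distance(a, b) <= max_dist, with early exit."""
    if abs(len(a) - len(b)) > max_dist:
        return False  # length difference alone exceeds the bound
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (x != y)))
        if min(curr) > max_dist:
            return False  # every path through this row already exceeds the bound
        prev = curr
    return prev[-1] <= max_dist
```

Because most candidate pairs in a large pool are far apart, the early exit typically fires after only a few DP rows.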

In terms of the second point, once again I agree this is likely the correct algorithm to use. I need to run the same evaluations we used for fixing IDs with Python's difflib (figure 3 in the paper) in order to be sure that we don't see an increase in false positives or false negatives using Levenshtein. My guess is that the implementation of Levenshtein in your pull request will work at least as well (if not better); I just want to be sure we don't see any decrease in prediction accuracy. I should have time to evaluate that in the next few days.

@boydgreenfield
Author

Hi @johnchase – sounds great. I realize the (implemented) Levenshtein algorithm is slow... again, very minor, but I thought I'd send it over. I'll also poke around a bit for a faster Levenshtein implementation...

And happy New Year's!

@johnchase
Owner

Just to give an update: based on some small tests, the Levenshtein distance implemented in #27 by @boydgreenfield seems to perform much better than difflib for resolving IDs with typos. While it is a bit slower than difflib, that shouldn't be a problem for fixing IDs.

Before merging I want to run it through a parameter sweep to account for as many different situations as possible. This includes different numbers of IDs, ID lengths, and numbers of errors, with multiple iterations of each combination.

Given this, I have run the analysis in parallel in the past; however, I am unable to get lru_cache to work with ipyparallel (SO post about it here). It may not be possible, in which case I might try to find a workaround (lru_cache should not cause problems for any normal usage of cual-id).

@boydgreenfield
Author

@johnchase thanks for the update – and sorry about the lru_cache issue (I hadn't realized it wasn't compatible).

One could also use one of the various Python libraries that wrap a C extension for Levenshtein distance, if you don't mind the extra dependency (or you could try/except the import). It's not ideal, but it should solve the speed issue (this one looks pretty good, is MIT-licensed, and is likely much faster than the code I included).

@johnchase
Owner

Thanks @boydgreenfield. I don't believe we can use GPL-licensed code with our software (BSD), though I may be mistaken. Another possibility is an implementation of Levenshtein from a package called leven; thanks @jairideout for finding that. It is about 10 times faster for fixing IDs, and slightly faster for minting, than what currently exists in cual-id. One drawback is that it requires two additional dependencies, cython and leven.
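The optional-dependency pattern mentioned earlier in the thread could keep cython/leven from becoming hard requirements. A hedged sketch (assuming leven exposes a `levenshtein` function, as its documentation describes; the pure-Python fallback is illustrative, not cual-id's actual code):

```python
try:
    from leven import levenshtein  # fast C-backed implementation, if installed
except ImportError:
    def levenshtein(a, b):
        # Pure-Python dynamic-programming edit distance fallback.
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            curr = [i]
            for j, y in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (x != y)))
            prev = curr
        return prev[-1]
```

Users who install leven get the speed-up transparently; everyone else still gets correct (if slower) behavior.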

Because Levenshtein is by definition an edit distance, it will not fix any IDs with more mutations than a given threshold. For instance, if we define the minimum distance for an ID to be kept as 3, any ID with more than 3 mutations will be discarded, and anything with 3 or fewer will be fixed – something I initially overlooked (difflib is a bit fuzzier here, as it is not defined by edit distance). What this means in practice is that there are fewer 'false negatives' (rejections) below the threshold, but more 'false positives' as well.
[figure: false positive rates]
[figure: false negative rates]

These figures are analogous to figure 3 from the paper. The top figure represents false positives, the bottom false negatives. I was hoping this would be a straightforward answer, where one algorithm performed better across the board, but I suppose it rarely is.

@gregcaporaso do you have any thoughts on using Levenshtein vs. difflib and Hamming?

@boydgreenfield
Author

@johnchase Hmm, I'm going to have to dive in and look at that in more detail. I think you should have no inaccurate corrections given that (1) you only mint IDs with edit distance > x apart; and (2) you only correct for edit distance <= x (that's what motivated my PR and using the same distance function).

At the same time, anything over edit distance x from the correct ID becomes uncorrectable (this is of course true now, but a bit fuzzy since x in the Hamming distance isn't the same x as difflib is implicitly using).

@johnchase
Owner

The reason we can see inaccurate corrections is that while IDs start out a distance greater than, say, 2 or 3 apart, there is no guarantee that the random mutations that are introduced will not bring them closer than that distance. It is unlikely, but we are generating 1000 IDs and mutating them, so we could theoretically have a scenario like this: start with 2 IDs, ['123456', 'abcdef'], then introduce three random mutations to each ID, and as a result end up with ['123def', '123def']. Each ID had exactly 3 mutations, and though they were a certain distance apart to begin with, there is no way we could accurately resolve them.

The reason Levenshtein had more false positives is that, as implemented, it takes the first ID that is within the defined threshold, whereas difflib as implemented takes the best of all IDs within a given threshold. It may be possible to tweak the Levenshtein implementation to reduce the number of false positives, but I haven't explored that yet.
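The collision scenario above can be checked directly: the mutated string '123def' sits exactly 3 edits from both originals, so with a threshold of 3 either valid ID "matches" and the observation is inherently unresolvable. The levenshtein function below is a plain DP implementation for illustration, not cual-id's code.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

valid_ids = ["123456", "abcdef"]
mutated = "123def"  # three substitutions away from *both* originals

distances = [levenshtein(mutated, v) for v in valid_ids]
print(distances)  # [3, 3] -> both within a threshold of 3; unresolvable
```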

@boydgreenfield
Author

@johnchase Ah of course, that makes sense (sorry I'd forgotten about this line from looking at fix.py).

Out of curiosity, what's the rationale for doing fixed_id = fixed_id[0] (taking the first item) rather than checking that only one possible match is found and otherwise setting fixed_id to null? The latter would seem to avoid all of these false-positive issues (and, Levenshtein speed issues notwithstanding, probably should have been in my PR).
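The unique-match policy proposed above can be sketched in a few lines. This is a hedged illustration, not cual-id's API: the names `fix_unique` and `hamming` are hypothetical, and a real version would use whichever distance function the minting step used.

```python
def hamming(a, b):
    # Positional mismatch count; fine here since all IDs share one length.
    return sum(x != y for x, y in zip(a, b))

def fix_unique(observed, valid_ids, threshold, distance=hamming):
    """Fix an ID only when exactly one valid ID lies within the threshold;
    return None to mark the observation as ambiguous otherwise."""
    matches = [v for v in valid_ids if distance(observed, v) <= threshold]
    return matches[0] if len(matches) == 1 else None

print(fix_unique("12345x", ["123456", "abcdef"], 3))  # '123456' (unique match)
print(fix_unique("123def", ["123456", "abcdef"], 3))  # None (ambiguous)
```

Compared to taking the first (or even the best) match, this trades a few extra rejections for the guarantee that no ambiguous observation is silently resolved to the wrong ID.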

@johnchase
Owner

I'll need to look through the code a bit more closely to remember exactly what everything is doing. I do wonder if we were to use Levenshtein for fixing where it took the best match of all matches above a given threshold if the false positive rate would go down a bit.

@johnchase johnchase reopened this Jan 26, 2016
@johnchase
Owner

Sorry, hit close by accident

@boydgreenfield
Author

No problem – it looks like it's simply a design question of whether one wants to resolve ambiguous errors (arbitrarily, in the case of my Levenshtein code, since it didn't order the matches – hence the jump in false positives vs. the sorted difflib code) or leave them unresolved / mark them as ambiguous.
