Improve approximate string search #507

Open
carlgieringer opened this issue Aug 10, 2023 · 1 comment
Labels
fact-checking Features or improvements supporting fact-checking

Comments

@carlgieringer
Contributor

carlgieringer commented Aug 10, 2023

Sometimes our approximate string search library returns only a partial match, even though a more complete match with fewer errors seems like it should be possible.

E.g. for the quotation

Lex Fridman

(00:21:33) And you think that kind of empathy that you referred to, that requires moral courage?

(from https://lexfridman.com/robert-f-kennedy-jr-transcript/)

approx-string-match-js (https://github.com/robertknight/approx-string-match-js#readme) returns {start: 19933, end: 20035, errors: 21}, which corresponds to "33) And you think that kind of empathy that you referred to, that requires moral courage?", when it should be possible for this quotation to match the page text with fewer errors:

                Lex Fridman
                (00:21:33)
                And you think that kind of empathy that you referred to, that requires moral courage?

At the very least, it could match more of the timestamp, giving a longer match with the same number of edits.

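The problem may be the leading whitespace on each line of the page text. We could probably strip it before passing the text to approx-string-match-js, but the match offsets would then refer to the stripped text rather than the DOM text, so we'd need to figure out how to accommodate that in dom-anchor-text-position's toRange.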
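To get a feel for how much the indentation could cost, here is a rough sketch that compares the minimum number of edits needed with and without stripping it. It uses a plain dynamic-programming substring edit distance (Sellers' algorithm) as a stand-in, not the library's own algorithm, and `minSubstringErrors`, `stripIndentation`, and the sample strings are all hypothetical names and data made up for illustration.

```typescript
// Minimum edit distance between `pattern` and ANY substring of `text`.
// Stand-in for an approximate matcher; assumption, not the library's code.
function minSubstringErrors(text: string, pattern: string): number {
  const m = pattern.length;
  // prev[i]: fewest edits to align pattern[0..i) with a substring of the
  // text ending at the previous text position.
  let prev: number[] = Array.from({ length: m + 1 }, (_, i) => i);
  let best = prev[m];
  for (let j = 0; j < text.length; j++) {
    const curr: number[] = new Array<number>(m + 1).fill(0);
    curr[0] = 0; // a match may start at any text position for free
    for (let i = 1; i <= m; i++) {
      const cost = pattern[i - 1] === text[j] ? 0 : 1;
      curr[i] = Math.min(
        prev[i - 1] + cost, // match or substitute
        prev[i] + 1, // extra character in the text
        curr[i - 1] + 1 // extra character in the pattern
      );
    }
    best = Math.min(best, curr[m]);
    prev = curr;
  }
  return best;
}

// Drop leading spaces/tabs on every line, then collapse 3+ newlines to 2.
function stripIndentation(s: string): string {
  return s.replace(/^[ \t]+/gm, "").replace(/\n{3,}/g, "\n\n");
}

// Hypothetical, shortened sample mimicking the indented transcript layout.
const indentedPage = "    Lex Fridman\n    (00:21:33)\n    And you think";
const quotation = "Lex Fridman\n(00:21:33) And you think";

console.log("errors against raw text:", minSubstringErrors(indentedPage, quotation));
console.log(
  "errors after stripping:",
  minSubstringErrors(stripIndentation(indentedPage), stripIndentation(quotation))
);
```

If the stripped comparison needs noticeably fewer edits on the real page text too, that would support the whitespace theory; it would not by itself explain why the library settles on the shorter match.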

@carlgieringer
Contributor Author

Possibly we could provide our own toRange that uses a custom dom-seek approach that will:

  • Remove leading whitespace
  • Collapse runs of more than two newlines into two newlines.

https://github.com/tilgovi/dom-anchor-text-position/blob/6502c48aff7f3f0ce3bb225d1c04ca8624c3b88f/src/index.js#L34-L58
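An alternative to a custom toRange might be to normalize the text ourselves while recording where each kept character came from, then translate the match back to original offsets before calling the existing toRange. Below is a minimal sketch of that idea, assuming the text being searched is the root element's text content; `normalizeWithOffsets` and `toOriginalOffsets` are hypothetical helpers, not part of either library.

```typescript
interface NormalizedText {
  normalized: string;
  // offsets[k] is the index in the original text of normalized[k].
  offsets: number[];
}

// Remove leading spaces/tabs on each line and collapse runs of more than
// two newlines into two, remembering the original index of each kept char.
function normalizeWithOffsets(text: string): NormalizedText {
  const chars: string[] = [];
  const offsets: number[] = [];
  let atLineStart = true;
  let newlineRun = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === "\n") {
      newlineRun++;
      atLineStart = true;
      if (newlineRun > 2) continue; // third and later newlines are dropped
    } else if (atLineStart && (ch === " " || ch === "\t")) {
      continue; // leading whitespace is dropped
    } else {
      newlineRun = 0;
      atLineStart = false;
    }
    chars.push(ch);
    offsets.push(i);
  }
  return { normalized: chars.join(""), offsets };
}

// Translate a non-empty match over the normalized text (end exclusive)
// into start/end offsets over the original text.
function toOriginalOffsets(
  match: { start: number; end: number },
  offsets: number[]
): { start: number; end: number } {
  if (match.end <= match.start || match.end > offsets.length) {
    throw new RangeError("match must be non-empty and within the text");
  }
  return {
    start: offsets[match.start],
    end: offsets[match.end - 1] + 1,
  };
}
```

The approximate search would then run against `normalized`, and the result of `toOriginalOffsets` could be handed to toRange as its `{start, end}` selector. The trade-off is an extra array the length of the normalized text, in exchange for leaving the DOM-seeking code untouched.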

@carlgieringer carlgieringer self-assigned this Aug 10, 2023
@carlgieringer carlgieringer added the fact-checking Features or improvements supporting fact-checking label Feb 18, 2024