Skip to content

Latest commit

 

History

History
50 lines (37 loc) · 1.9 KB

README.md

File metadata and controls

50 lines (37 loc) · 1.9 KB

old20

Calculate Yarkoni, Baloto & Yap's OLD20.

The OLD20 distance is defined as the average Orthographic Levenshtein Distance (OLD) to the 20 closest neighbors for a given word in a given corpus. This is a multi-threaded version of OLD20 which uses a fast, cythonized version of the Levenshtein distance, although we support any function that takes two strings and outputs a distance between them. It can therefore also be used with the Levenshtein distances in difflib and pyxdameraulevenshtein.

  • Warning this code is slower than the leven_c utility. If you just need OLD20 scores, use that one. As a comparison, leven_c takes about 14 seconds to process 10,000 words, while this implementation takes about 21 seconds on the same corpus. Our main speed bottleneck is the levenshtein distance calculation, which, even though we use a fast implementation from jellyfish, is still a lot slower than the Levenshtein calculation in leven_c.

If you use the code in this repository, please cite the following paper:

@article{yarkoni2008moving,
  title={Moving beyond Coltheart’s N: A new measure of orthographic similarity},
  author={Yarkoni, Tal and Balota, David and Yap, Melvin},
  journal={Psychonomic Bulletin \& Review},
  volume={15},
  number={5},
  pages={971--979},
  year={2008},
  publisher={Springer}
}

Example

import numpy as np
from old20 import old20, old_n
# Assumes some wordlist.
wordlist = open("mywords.txt").readlines()

words_neighborhood = old20(wordlist)

# Get multiple n
words_neighborhood = old_n(wordlist, n=[20, 30, 40, 50])

# Only calculate some words against a reference corpus.
words_subset = old_n(wordlist[:100], wordlist, n=20)

# Change the number of CPU jobs (the default is to use all cpus)
words_neighborhood = old_n(wordlist, n_jobs=2)

Author

Stéphan Tulkens