Match - Probabilistic Entity Detection and Matching in Python

Free software: MIT license
Documentation: https://match.readthedocs.io.

Match brings common-sense entity detection and matching to python. Match is:

Dead-simple to use
Fast
Lightweight (no heavy dependencies)
Magic!

Installation

TODO

Usage

Auto-detect common entity types

>>> import match


>>> match.detect_type('608-555-5555')
(1, 'phonenumber')

>>> match.detect_type('joe.van.gogh@example.com')
(1, 'email')

>>> match.detect_type('John R. Smith')
(.95, 'fullname')

>>> match.detect_type('Hi, how are you?')
(1, 'string')

>>> match.score_types('@squaredloss: match v0.2.0 is out!')
[(0, 'email'), (.05, 'fullname'), (0, 'phonenumber'), (1, 'string'), (0, 'datetime'), ...

Intelligently score similarity based on detected type

>>> match.score_similarity('Jonathan R. Smith', 'john r smith')
(.82, 'fullname') # Similar, but common name

>>> match.score_similarity('Jayden R. Smith', 'jayden r smith')
(.93, 'fullname') # Similar, but uncommon name, so higher match probability

>>> match.score_similarity('123 easy st, NY, NY', '123 Easy Street, New York City')
(.98, 'address')

>>> match.score_similarity('223 easy st, NY, NY', '123 easy st, NY, NY')
(.6, 'address') # Locations are close but unlikely to be the same physical place (barring a typo)

>>> match.score_similarity('Hi, how are you Joe?', 'hi how are you doing joe?')
(.81, 'string')

>>> match.score_similarity('608-555-5555', '608-555-5554', as_type='phonenumber')
.0

>>> match.score_similarity('608-555-5555', '608-555-5554', as_type='string')
.9

Parse normalized entity representations

# As string
>>> match.parse('(608) 555-5555')
('+1 608 555 5555', 'phonenumber')

>>> match.parse('6085555555')
('+1 608 555 5555', 'phonenumber')

# As object
>>> match.parse(' march 3rd, 1997', to_object=True)
(datetime.datetime(1997, 3, 3), 'datetime')

>>> match.parse_as(' march 3rd, 1997', 'email')
None

Probabilistic similarities, based on frequencies in a given corpus.

>>> from match import similarities
>>> import random


# Build similarity model from weighted random corpus of a's, b's, c's, and d's
>>> corpus = [''.join(random.sample('a'*10000 + ' '*10000 + 'b'*1000 + 'c'*100 + 'd'*10, k=10)) for _ in range(1000)]
>>> model = match.build_similarity_model(corpus, model_type='tfidf', tokenizer='2grams')
>>> model.similarity('ab ba c', 'ab ba d')
.6  # Lower similarity since 'a' is common

>>> model.similarity('db bd c', 'db bd a')
.8  # Higher similarity since 'd' is rare

# Use in high-level api
>>> match.score_similarity('db bd c', 'db bd a', similarity_measure=model)
.8


# Efficient similarity lookups with indexing (requires numpy and pandas, optional requirements)
>>> model.build_index() # Requires O(n*k) space, where n is number of docs and k is average doc length
>>> len(model.get_all_similar('db bd c', measure='overlap', threshold=.6))
48 # O(k) similarity search

Custom type detection and scoring

>>> from match.similarity import ProbabilisticDiceCoefficient


# Build similarity model from custom corpus
>>> corpus = ['cheddar', 'brie', 'guyere', 'mozzarella', 'parmesian', 'jack', 'colby']
>>> model = match.build_similarity_model(corpus, model_type='dice', tokenizer='3grams')
>>> match.add_type('cheese', similarity_model=model)
>>> match.detect_type('colby jack')
(.8, 'cheese')

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.github		.github
docs		docs
match		match
tests		tests
.editorconfig		.editorconfig
.gitignore		.gitignore
.travis.yml		.travis.yml
AUTHORS.rst		AUTHORS.rst
CONTRIBUTING.rst		CONTRIBUTING.rst
HISTORY.rst		HISTORY.rst
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.rst		README.rst
requirements.txt		requirements.txt
requirements_dev.txt		requirements_dev.txt
setup.cfg		setup.cfg
setup.py		setup.py
tox.ini		tox.ini
travis_pypi_setup.py		travis_pypi_setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Match - Probabilistic Entity Detection and Matching in Python

Installation

Usage

Credits

About

Releases

Packages

Languages

License

kvh/match

Folders and files

Latest commit

History

Repository files navigation

Match - Probabilistic Entity Detection and Matching in Python

Installation

Usage

Credits

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages