Entity Resolution Project README

Overview

This project performs entity resolution: it maps grants to doctors across multiple datasets. The goal is to build a classifier that predicts matches between grants and doctors using features such as the Jaro-Winkler distance between last names and word embeddings from Hugging Face models and fastText. The pipeline reads in the data, cleans and preprocesses it, builds and trains a classifier (which includes simulating the initial training data), sets up a database with tables that store the connections, and deploys the classifier on test data to find matches.

Tools Used

Python (programming language)
Pandas (data manipulation)
Scikit-learn (machine learning)
XGBoost (classifier)
SQLite (database)
Git (version control)
GitHub (code hosting and collaboration)

Data Cleaning

The data cleaning phase preprocesses both the grants and doctors datasets. Tasks include handling missing values, standardizing formats (e.g., names, dates), and extracting the relevant fields from the datasets for matching. Specifically, various dates were imputed and a subset of columns was selected for the classifier.
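
The cleaning steps above could look roughly like the following Pandas sketch. The column names (last_name, city, start_date) and the choice to impute missing dates with the earliest observed date are assumptions made for illustration; they are not taken from the project code.

```python
import pandas as pd


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Select, standardize, and impute a few columns (hypothetical schema)."""
    # Keep only the columns assumed to matter for matching.
    out = df[["last_name", "city", "start_date"]].copy()

    # Standardize text fields: missing -> empty string, trim, lowercase.
    for col in ("last_name", "city"):
        out[col] = out[col].fillna("").str.strip().str.lower()

    # Parse dates; unparseable values become NaT.
    out["start_date"] = pd.to_datetime(out["start_date"], errors="coerce")

    # Assumed imputation rule: fill missing dates with the earliest known date.
    out["start_date"] = out["start_date"].fillna(out["start_date"].min())
    return out


raw = pd.DataFrame(
    {
        "last_name": [" Smith ", None],
        "city": ["BOSTON", "Austin "],
        "start_date": ["2020-01-15", None],
        "extra": [1, 2],
    }
)
cleaned = clean_frame(raw)
```

The same function can be applied to both the grants and the doctors tables, so that the two sides are compared on identically formatted values.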

Classifier Building

We built a classifier using machine learning techniques to predict matches between grants and doctors. Features such as the Jaro-Winkler distance between last names, matching city names, and the degree of separation between embeddings were used. We instantiated an XGBoost classifier, simulated training data by sampling from common names and hand-labelling matches, then trained and evaluated our model.
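
To make the last-name and city features concrete, here is a minimal pure-Python sketch of Jaro-Winkler similarity (the distance is 1 minus this value) and of a feature vector for one grant/doctor pair. The function names and the record fields (last_name, city) are hypothetical, and the project may well compute the metric with a library instead.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if equal and within this window of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler_similarity(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1 - jaro)


def pair_features(grant: dict, doctor: dict) -> list:
    """Feature vector for one candidate pair (hypothetical field names)."""
    jw = jaro_winkler_similarity(
        grant["last_name"].lower(), doctor["last_name"].lower()
    )
    same_city = float(
        grant["city"].strip().lower() == doctor["city"].strip().lower()
    )
    return [jw, same_city]


features = pair_features(
    {"last_name": "Martha", "city": "Boston"},
    {"last_name": "Marhta", "city": "boston "},
)
```

Vectors like this, one per candidate pair, are the kind of input a classifier such as XGBoost would be trained on; an embedding-based feature would be appended as another column.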

Database Setup

We set up an SQLite database to store our data and establish connections between grants and doctors. This database allows for efficient querying and retrieval of matched entities. We also set up bridge tables to house potential matches (doctors and grants with the same "last name" feature, for example).
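
As an illustration of the bridge-table idea, the sketch below builds an in-memory SQLite database and fills a table of candidate pairs by joining on last name. The table names, column names, and sample rows are assumptions for this example and do not reflect the project's actual schema.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

conn.executescript(
    """
    CREATE TABLE grants (
        grant_id INTEGER PRIMARY KEY,
        last_name TEXT,
        city TEXT
    );
    CREATE TABLE doctors (
        doctor_id INTEGER PRIMARY KEY,
        last_name TEXT,
        city TEXT
    );
    -- Bridge table: one row per potential grant/doctor match.
    CREATE TABLE candidate_pairs (
        grant_id INTEGER REFERENCES grants(grant_id),
        doctor_id INTEGER REFERENCES doctors(doctor_id),
        PRIMARY KEY (grant_id, doctor_id)
    );
    """
)

conn.executemany(
    "INSERT INTO grants VALUES (?, ?, ?)",
    [(1, "smith", "boston"), (2, "lee", "austin")],
)
conn.executemany(
    "INSERT INTO doctors VALUES (?, ?, ?)",
    [(10, "smith", "boston"), (11, "smith", "denver"), (12, "jones", "austin")],
)

# Only pairs that share a last name become candidates for the classifier.
conn.execute(
    """
    INSERT INTO candidate_pairs (grant_id, doctor_id)
    SELECT g.grant_id, d.doctor_id
    FROM grants AS g
    JOIN doctors AS d ON g.last_name = d.last_name
    """
)
conn.commit()

pairs = conn.execute(
    "SELECT grant_id, doctor_id FROM candidate_pairs ORDER BY grant_id, doctor_id"
).fetchall()
```

Restricting the classifier to rows in the bridge table avoids scoring every grant against every doctor, which is the main reason for blocking on a shared feature.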

Training Data Simulation

To train our classifier, we simulated training data by generating positive and negative samples of matched and unmatched pairs of grants and doctors. This simulated data helps improve the classifier's accuracy and generalization. The data simulation process can be found in the data_simulator file within the program_files.distance_classifier directory.
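
One simple way to generate such pairs is sketched below. It is not the contents of the data_simulator file: the typo model (dropping one character) and the automatic labelling by construction are assumptions chosen to keep the example short.

```python
import random


def perturb(name: str, rng: random.Random) -> str:
    """Drop one random character to imitate a spelling variation."""
    if len(name) < 2:
        return name
    i = rng.randrange(len(name))
    return name[:i] + name[i + 1:]


def simulate_pairs(names: list, n_pairs: int, seed: int = 0) -> list:
    """Return (grant_name, doctor_name, label) rows; label 1 means match.

    Requires at least two distinct names so a negative can be drawn.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_pairs):
        name = rng.choice(names)
        # Positive sample: the same name with a small corruption on one side.
        rows.append((name, perturb(name, rng), 1))
        # Negative sample: a different name drawn from the same pool.
        other = rng.choice([n for n in names if n != name])
        rows.append((name, other, 0))
    return rows


rows = simulate_pairs(["smith", "johnson", "lee"], n_pairs=5)
```

Each call produces a balanced set of positives and negatives; features would then be computed for every row before training.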

Deployment

The trained classifier is deployed to perform real-time matching between grants and doctors. The resulting matches can be used to analyze how, and from whom, doctors are receiving money.
