Predict the distribution of race/ethnicity for a last name

appeler/colornumber

Color Number

Models that predict race/ethnicity from a name often formalize the task as a classification problem (see Sood and Laohaprapanon 2018, etc.). However, if the name is the only thing we know about a person, there is generally no unique mapping to a race/ethnicity. Instead, there is a distribution, e.g., XX% identify as White, YY% as Asian, etc. Posing what is really a regression problem as classification forces one of two choices: classifying to the mode, which discards data, or keeping the training data as is, so that there is no unique mapping between a name (string) and race (Sood and Laohaprapanon 2018 choose this option). (To get calibrated probability estimates, the training data needs to be a random sample of the population, though calibration for less popular names is likely to be poor given sampling variability.) A simpler (and better) approach is to formalize the task as a multi-output regression problem: predict the race/ethnicity distribution of each name. Using the Florida Voting Registration Data for 2022, we estimate a set of models that predict the distribution of race/ethnicity per name.
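To make the contrast between the two targets concrete, here is a minimal sketch in plain Python. The counts are made up for illustration and are not taken from the Florida data; the helper names `to_distribution` and `to_mode_label` are hypothetical and not part of this repository.

```python
# Hypothetical counts for one last name; the numbers are illustrative only.
counts = {"asian": 5, "white": 60, "black": 25, "hispanic": 8, "other": 2}


def to_distribution(counts):
    """Regression target: the share of each group among people with the name."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def to_mode_label(counts):
    """Classification-to-the-mode target: keep only the most common group."""
    return max(counts, key=counts.get)


distribution = to_distribution(counts)
mode_label = to_mode_label(counts)

# Share of people with this name whom the single mode label does not describe.
share_not_mode = 1.0 - distribution[mode_label]

print(distribution)
print(mode_label, round(share_not_mode, 2))
```

In this toy case the mode label keeps only the largest group and says nothing about the remaining share of people with the name, which is the information the regression target retains.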

Our y variable is multi-output:

p_asian, p_white, p_black, p_hispanic, p_other

Our input variable is the name string.

After estimation, we normalize the outputs so that they sum to 1.
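One way the normalization step could look is sketched below. Clipping negative outputs to zero and falling back to a uniform distribution when nothing is positive are assumptions made for this example; the source only says that the outputs are normalized to sum to 1. The function name `normalize` is hypothetical.

```python
def normalize(raw_outputs):
    """Rescale raw model outputs so they are non-negative and sum to 1."""
    # Assumption: negative raw outputs are clipped to zero before rescaling.
    clipped = [max(0.0, value) for value in raw_outputs]
    total = sum(clipped)
    if total == 0.0:
        # Assumption: fall back to a uniform distribution if nothing is positive.
        return [1.0 / len(clipped)] * len(clipped)
    return [value / total for value in clipped]


# Order follows the targets: p_asian, p_white, p_black, p_hispanic, p_other.
raw = [0.10, 0.70, 0.30, -0.05, 0.10]  # made-up raw outputs
probs = normalize(raw)
print(probs)
```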

We build the dataset by grouping the records by last_name and computing the share of each race/ethnicity group for each name. We then split the data into train/test sets at 80/20. Our main loss function is mean squared error. (We also try L1Loss and cross-entropy loss.) We fit an MLP, an LSTM, and a transformer model to predict the distribution of probabilities.
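The grouping and splitting steps can be sketched with the standard library as follows. The record layout (`(last_name, group)` tuples), the group labels, the seed, and the function names `build_targets` and `split_names` are all assumptions for illustration; the actual scripts may read the voter file and split the data differently.

```python
import random
from collections import Counter, defaultdict

# Assumed label order, matching p_asian, p_white, p_black, p_hispanic, p_other.
GROUPS = ["asian", "white", "black", "hispanic", "other"]


def build_targets(records):
    """Group (last_name, group) records by last name into per-name proportions."""
    counts = defaultdict(Counter)
    for last_name, group in records:
        counts[last_name][group] += 1
    targets = {}
    for last_name, group_counts in counts.items():
        total = sum(group_counts.values())
        targets[last_name] = [group_counts[g] / total for g in GROUPS]
    return targets


def split_names(names, test_fraction=0.2, seed=0):
    """Shuffle the names reproducibly and hold out a fraction for testing."""
    names = sorted(names)
    random.Random(seed).shuffle(names)
    n_test = round(len(names) * test_fraction)
    return names[n_test:], names[:n_test]


# Made-up records, one per registered voter.
records = [
    ("smith", "white"), ("smith", "white"), ("smith", "white"), ("smith", "black"),
    ("lee", "asian"), ("lee", "white"),
    ("garcia", "hispanic"), ("nguyen", "asian"), ("jones", "black"),
]
targets = build_targets(records)
train_names, test_names = split_names(targets)
print(targets["smith"], len(train_names), len(test_names))
```

Because the grouped data has one row per last name, splitting rows here is the same as splitting names, so no name appears in both train and test.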

Todo

Compare how treating the same problem as a classification problem changes performance on the metric of interest: the mean absolute difference from the underlying probability distribution. (We also try cross-entropy.)
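A rough sketch of how the two metrics could be computed for a single name is given below. The predictions are made up, and representing the classifier by a one-hot vector on its mode label is an assumption for illustration; a classifier that outputs class probabilities would be scored on those instead. The function names are hypothetical.

```python
import math


def mean_absolute_difference(predicted, actual):
    """Average absolute gap between predicted and actual group shares."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def cross_entropy(predicted, actual, eps=1e-12):
    """Cross-entropy of the prediction against the actual distribution."""
    # eps guards against log(0) when a predicted share is exactly zero.
    return -sum(a * math.log(max(p, eps)) for p, a in zip(predicted, actual))


# Order: p_asian, p_white, p_black, p_hispanic, p_other (illustrative values).
actual = [0.05, 0.60, 0.25, 0.08, 0.02]
regression_pred = [0.06, 0.58, 0.26, 0.08, 0.02]  # hypothetical regression output
mode_pred = [0.0, 1.0, 0.0, 0.0, 0.0]  # hard mode label as a one-hot vector

mad_regression = mean_absolute_difference(regression_pred, actual)
mad_mode = mean_absolute_difference(mode_pred, actual)
print(round(mad_regression, 3), round(mad_mode, 3))
```

Averaging these per-name values over the test set would give one number per model for the comparison.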

Scripts

Authors

Gaurav Sood
