NamSor Tools V2

Welcome to the namsor-tools-v2 wiki!

In scientific papers, please indicate the software version (NamSorAPIv2.X.YY) and the date of data retrieval.

Release Notes

NamSorAPIv2.0.29 (2023-12-03)

Improvements for Albanian and Kosovo Albanian Diaspora Mapping
Created a new API for Community Engagement (this option requires a specific licence)
Added full name classification for Origin and Diaspora
Added first/last name classification for Country
Differentiated between regionStat, religionStatAlt and religionStatSynthetic

NamSorAPIv2.0.28 (2023-10-08)

Added Filipino ethnicity to Diaspora
Fixed typo in ethnicity Belarusian
Improvements on names in Cyrillic

NamSorAPIv2.0.27 (2023-07-16)

Added India endpoints to classify first/last names by Religion/Caste/Castegroup/Indian State
India caste group General was split into General and General/High Caste
Added a finer grained classification by detailed caste
Fixed a minor issue on admin CreditAPI
Further improved free account abuse/spam detection

NamSorAPIv2.0.26 (2023-06-18)

Added an option to return country religion statistics for taxonomies Origin/Country/Diaspora, with header X-OPTION-RELIGION-STATS=True
Improved AI explainability for Enterprise users (API Key should be set to Explainability=True and API queries with header X-OPTION-EXPLAINABILITY=True)
Replaced JNBC with gotyai-java
Further improved free account abuse/spam detection

NamSorAPIv2.0.25 (2023-05-20)

Fixed Italian names issue, ex. Andrea/Rossini (https://github.com/namsor/namsor-tools-v2/issues/23)
Added AI explainability for Enterprise users (API Key should be set to Explainability=True and API queries with header X-OPTION-EXPLAINABILITY=True)
Improved some logging features as well as other internal services (free account abuse/spam detection)

NamSorAPIv2.0.24 (2023-03-12)

Added specific endpoints for Indian names : Indian State subclassification, Religion (Hindu, Muslim, Jain, Christian), Caste Category (General, ST, SC, OST)
Other improvements on Latin American countries, Portuguese and Spanish names

NamSorAPIv2.0.23 (2023-01-15)

Improvements for Indian names sub-classification
New taxonomies for Indian names : Religion, Caste Category (General, ST, SC, OST)

NamSorAPIv2.0.22 (2022-12-17)

Improvements on names in ARABIC (gender, origin, country)
Improvements for Indian names sub-classification
Other improvements : Malaysia, Indonesia, Brazil/Portugal, Spain/LatAm

NamSorAPIv2.0.21 (2022-09-25)

Added mandatory email verification before activation of API Key due to abuse
Improvements for Indian names gender classification
Improvements for classification of Italian names with US "Race"/Ethnicity (issue not fully resolved, current recommendation is to combine with the Diaspora model)

NamSorAPIv2.0.20 (2022-07-28)

Diaspora model enhancements for 7 European countries : Italy (IT), France (FR), Germany (DE), Ireland (IE), Netherlands (NL), Belgium (BE), Spain (ES).
Added an open endpoint to query the list of countries/regions
Minor back-end changes to support features of the new front-end version namsor.app
Added OPTIONS to CORS for ethnicity-estimate.com

NamSorAPIv2.0.19 (2022-05-08)

Fixed rare issue with NaN score/probability
Improvements for Minorities/Diversity Analytics in Italy
No major change to API (still at v2.namsor.com)
Added OPTIONS to CORS for gender-guesser.com

NamSorAPIv2.0.18 (2022-01-16)

Second batch of improvements on CYRILLIC : Diaspora
No major change to API (still at v2.namsor.com), but we're now redirecting namsor.com front-end to the new version namsor.app
Added a gender endpoint for just given names (defaulting to 'US' local context)
Fixed Stripe redirect
Fixed ParseName issue when not trimmed()
Added X-OPTION-USRACEETHNICITY-TAXONOMY to CORS

NamSorAPIv2.0.17 (2021-12-05)

First batch of improvements on CYRILLIC : Origin, Country and Gender

NamSorAPIv2.0.16 (2021-09-26)

Diaspora model now has calibratedProbability/calibratedProbabilityAlt as the other models (Gender, Country, Origin, US 'Race'/Ethnicity) based on ability to predict either (A) the Diaspora country of birth or (B) the name country of Origin (ie. consistency with NamSor Origin model).
Various admin API enhancements to support the new Website and CSV tool
[BugFix] Slightly negative scores were causing a Python lib error https://github.com/namsor/namsor-tools-v2/issues/17

NamSorAPIv2.0.15 (2021-07-18)

US 'Race'/Ethnicity : Optionally add header X-OPTION-USRACEETHNICITY-TAXONOMY:

USRACEETHNICITY-4CLASSES is a new classifier compatible with prior version, but trained using a combination of US data and non-US data (ex. international names of sub-Saharan Africa are classified as B_NL; international names of East Asia are classified as A) in alignment with https://www.census.gov/topics/population/race/about.html
USRACEETHNICITY-4CLASSES-CLASSIC for the classic US 'Race'/Ethnicity classifier (pre-version 2.0.15) which has 4 classes : W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino) purely trained on US data.
USRACEETHNICITY-6CLASSES for two additional classes, AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander). With this option, classifier has 6 classes : W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino), AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander) purely trained on US data.

General improvements to gender / country / origin models across all countries
Specific improvements to better classify names of : NG (Nigeria), BD (Bangladesh), ZA (South Africa), AF (Afghanistan), IR (Iran).

NamSorAPIv2.0.14 (2021-04-11)

US 'Race'/Ethnicity : Optionally add header X-OPTION-USRACEETHNICITY-TAXONOMY: USRACEETHNICITY-6CLASSES for two additional classes, AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander). With this option, classifier has 6 classes : W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino), AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander).
[BugFix] Regression on parsing Spanish names without context ES https://github.com/namsor/namsor-tools-v2/issues/16

NamSorAPIv2.0.13 (2021-03-14)

Japanese names : improvements for gender classification (LATIN, HAN / Kanji, KATAKANA) ; translation LATIN->HAN / Kanji and back.
Improvements on US 'Race'/Ethnicity model, Diaspora Model
UI : added an online CSV tool to process files from JavaScript client, append gender, origin, country, diaspora or US 'race'/ethnicity to a list of names in Excel/CSV format.
[Known Issue] Regression on parsing Spanish names without context ES (please specify country code ES as a workaround)

NamSorAPIv2.0.12 (2021-01-31)

Improvements for gender classification of full names
Split Diaspora taxonomy classes Irish,British -> Irish,English,Scottish,Welsh (British remains as first/second best alternative for now)
Improve Diaspora classification with non LATIN names
Added Corridor API for classifying names in cross-border contexts (relevant for : diaspora remittances, international travel, foreign direct investment, crowdfunding etc.)
Added a general name classification API (nameType), accuracy in range 90-95%
[BETA] UI : added an online CSV tool to process files from JavaScript client, append gender, origin, country, diaspora or US 'race'/ethnicity to a list of names in Excel/CSV format.

NamSorAPIv2.0.11 (2020-10-31)

(Infrastructure) SSL Certificates updated
Improvements for names in Brazil, Pakistan, Indonesia
Improvements for gender classification of full names in Pakistan, Indonesia
[BETA] added a general name classification API (nameType)
added a new SDK for JavaScript

NamSorAPIv2.0.10 (2020-06-08)

Gender, Origin, Country improvements for NON-LATIN Scripts (CYRILLIC, HAN, ARABIC, KATAKANA, HANGUL, GREEK, BENGALI, ARMENIAN, DEVANAGARI, TAMIL, GEORGIAN, TELUGU, ORIYA, ...)
Gender for parsed names with only initials (ex. J. Smith) now return a probability close to 0.5 https://github.com/namsor/namsor-tools-v2/issues/10
Prepared a specific API for translating Japanese names (not active yet)
Other bug fixes, https://github.com/namsor/namsor-tools-v2/issues/9 https://github.com/namsor/namsor-tools-v2/issues/8

NamSorAPIv2.0.9 (2020-03-15)

Diaspora API improvements for US, FR

NamSorAPIv2.0.8 (2020-01-04)

Updated Naive Bayes Classifier library to refactored JNBC (v2.0.4)
Diaspora API : Fix bias towards classifying Eastern European and some Middle Eastern names to Jewish https://github.com/namsor/namsor-tools-v2/issues/4

NamSorAPIv2.0.7 (2019-11-24)

The probability calibration is no longer based on the Score, but based on the probability estimates.

NamSorAPIv2.0.6 (2019-10-25)

Gender API : Fix issue where probability could be between 0.33 and 0.5; with a low score, the probability should be 0.5 (corresponding to randomly choosing Male/Female). https://github.com/namsor/namsor-tools-v2/issues/6

NamSorAPIv2.0.5 (2019-07-21)

Added a calibrated probability based on the Score and a validation set

NamSorAPIv2.0.4 (2019-06-30)

Gender API : Improvements on Chinese names
Chinese API : Specific API end-points for https://chinese-names.app/

classification

NamSor V2 uses Naive Bayes, a class of algorithms which is excellent at classification. Each classifier will output a SCORE, which is based on the relative probability between the predicted value and the other alternatives.

For most classifiers, we also use a validation dataset to calibrate the probability estimates with actual precision / recall, then return calibratedProbability that can directly be read as a probability. The calibratedProbabilityAlt corresponds to getting the first choice OR the best alternative right.

For example, to determine Gender of names in the US, with score>0, we have a 95% precision and 100% recall. By filtering score>1, you can exclude ambiguous names and increase the precision (at the cost of reducing the recall).

Mapping Rounded Gender Score to Precision and Recall :
======================================================
SCORE	PREC.	RECALL
0	95%	100%
1	96%	98%
2	97%	93%
3	97%	86%
4	98%	73%
5	99%	56%
6	99%	40%
7	99%	25%
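The trade-off in the table above can be applied as a simple threshold filter. The following is a minimal Python sketch, assuming the rows have already been parsed into dictionaries with a `likelyGenderScore` key (as in the pipe-delimited output shown on this page). `PRECISION_RECALL` is copied from the table; `filter_by_score` and `expected_quality` are hypothetical helpers, not part of any NamSor SDK.

```python
# Precision/recall per rounded gender score, copied from the table above.
PRECISION_RECALL = {
    0: (0.95, 1.00),
    1: (0.96, 0.98),
    2: (0.97, 0.93),
    3: (0.97, 0.86),
    4: (0.98, 0.73),
    5: (0.99, 0.56),
    6: (0.99, 0.40),
    7: (0.99, 0.25),
}


def filter_by_score(rows, min_score):
    """Keep only rows whose score is strictly greater than min_score."""
    return [r for r in rows if float(r["likelyGenderScore"]) > min_score]


def expected_quality(min_score):
    """Look up (precision, recall) for an integer threshold, clamped to the table range."""
    key = max(0, min(7, int(min_score)))
    return PRECISION_RECALL[key]


# Sample rows taken from the gender output example on this page.
rows = [
    {"firstName": "Elena", "likelyGenderScore": "6.053053604522956"},
    {"firstName": "Robert", "likelyGenderScore": "7.339173503361962"},
    {"firstName": "John W.", "likelyGenderScore": "8.436845363426315"},
    {"firstName": "Mary", "likelyGenderScore": "4.349891771969253"},
]

kept = filter_by_score(rows, 5)
precision, recall = expected_quality(5)
print(len(kept), "rows kept; expected precision", precision, "recall", recall)
```

A higher threshold keeps fewer names but makes the remaining predictions more reliable; the right value depends on how costly a wrong classification is for the study at hand.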

country ISO2 codes

Several classifiers take contextual geographic information as input, or return a country ISO2 code as output. The list of ISO2 country codes is available at https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv and https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2

classifiers

NamSor V2 provides several classifiers : gender, origin, diaspora, US 'race'/ethnicity ... Each classifier learns from the others' input and outputs a classification according to a specific taxonomy.

gender classifier

NamSor aims to offer the best accuracy on predicting likely gender from names on a global scale : not just for US and European names, but also Asian, African ... in all languages and alphabets.

The taxonomy classes are : https://v2.namsor.com/NamSorAPIv2/api2/json/taxonomyClasses/personalname_gender This is a binary classifier, but the results are probability estimates. Non-binary genders exist and should be accounted for using other methods (ex. survey).

inferring gender from names, with no geographic context

If you append gender to a simple list of first and last names (ex. John|Smith), without any geographic context, the software will try to detect automatically the geographic context from the last name.

Input :

John W.|Smith
Mary|Smith
Elena|Rossi
Robert|Durieux

Output :

#uid|firstName|lastName|likelyGender|likelyGenderScore|genderScale|rowId
uid2|Elena|Rossi|female|6.053053604522956|1.0|0
uid3|Robert|Durieux|male|7.339173503361962|-1.0|1
uid0|John W.|Smith|male|8.436845363426315|-1.0|2
uid1|Mary|Smith|female|4.349891771969253|1.0|3
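Note that the output rows are not in the same order as the input. The following Python sketch reads the pipe-delimited output and restores the input order. It assumes the header line starts with `#` and that the generated identifiers follow the `uidN` pattern seen in the sample; `read_pipe_output` is a hypothetical helper.

```python
import csv
import io

SAMPLE_OUTPUT = """#uid|firstName|lastName|likelyGender|likelyGenderScore|genderScale|rowId
uid2|Elena|Rossi|female|6.053053604522956|1.0|0
uid3|Robert|Durieux|male|7.339173503361962|-1.0|1
uid0|John W.|Smith|male|8.436845363426315|-1.0|2
uid1|Mary|Smith|female|4.349891771969253|1.0|3
"""


def read_pipe_output(text):
    """Parse pipe-delimited output whose first line is a '#'-prefixed header."""
    lines = text.splitlines()
    header = lines[0].lstrip("#").split("|")
    reader = csv.DictReader(
        io.StringIO("\n".join(lines[1:])),
        fieldnames=header,
        delimiter="|",
        quoting=csv.QUOTE_NONE,  # names may contain quote characters
    )
    rows = list(reader)
    for row in rows:
        row["likelyGenderScore"] = float(row["likelyGenderScore"])
        row["genderScale"] = float(row["genderScale"])
    return rows


# Assumption: generated ids look like 'uid<N>' where N is the input position.
rows = sorted(read_pipe_output(SAMPLE_OUTPUT), key=lambda r: int(r["uid"][3:]))
for row in rows:
    print(row["uid"], row["firstName"], row["lastName"], row["likelyGender"])
```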

inferring gender from names, with a geographic context

The recommended input format is to specify a unique ID and a geographic context (if known) as a countryIso2 code.

Input :

id12|John W.|Smith|US
id13|Mary|Smith|GB
id14|Elena|Rossi|IT
id15|Robert|Durieux|FR

Output :

#uid|firstName|lastName|countryIso2|likelyGender|likelyGenderScore|genderScale|rowId
id13|Mary|Smith|GB|female|4.164040354303551|1.0|0
id12|John W.|Smith|US|male|8.436845363426315|-1.0|1
id15|Robert|Durieux|FR|male|7.162120388463375|-1.0|2
id14|Elena|Rossi|IT|female|5.555580235429088|1.0|3

gender data output

{ "id": null, "firstName": "John", "lastName": "Smith", "likelyGender": "male", "genderScale": -0.9918105205926329, "score": 41.11285807293116, "probabilityCalibrated": 0.9959052602963164 }

Field	Example	Description
id	ref12315	The input identifier
firstName	John	The input given name / firstName
lastName	Smith	The input family name / surname / lastName
likelyGender	male	The likely gender : male or female
probabilityCalibrated	0.99	The calibrated probability : 0.5 is Unknown, +1 is sure
genderScale	-0.99	The scale is -1..0..+1 and is based on the probability (Probability = 0.5 -> Scale = 0; Gender = Male & Probability = 1 -> Scale = -1; Gender = Female & Probability = 1 -> Scale = +1)
score	41	A non calibrated Score (use Probability instead) : score = Math.log(getProbaFirst() / getProbaNotFirst()) maxed to 100
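The relationship between `probabilityCalibrated` and `genderScale` can be sketched as follows. This assumes a linear mapping between the three anchor points given in the table, which is consistent with the JSON example above but is not a documented formula; `gender_scale` is a hypothetical helper.

```python
def gender_scale(likely_gender, probability):
    """Map (gender, calibrated probability) to the -1..0..+1 scale.

    Assumption: linear interpolation between the anchor points in the table
    (0.5 -> 0, male with 1.0 -> -1, female with 1.0 -> +1).
    """
    if likely_gender not in ("male", "female"):
        raise ValueError("likely_gender must be 'male' or 'female'")
    if not 0.5 <= probability <= 1.0:
        raise ValueError("probability is expected in the range 0.5..1.0")
    magnitude = 2.0 * probability - 1.0
    return -magnitude if likely_gender == "male" else magnitude


# Values taken from the JSON example above.
scale = gender_scale("male", 0.9959052602963164)
print(scale)
```

With the values from the JSON example, the result agrees with the reported `genderScale` up to floating-point rounding.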

country classifier

This classification model infers the likely country of residence, based on the full name alone. The taxonomy classes are https://v2.namsor.com/NamSorAPIv2/api2/json/taxonomyClasses/personalfullname_country

country data output

{ "id": null, "name": "Jing Cao", "score": 33.88839357879743, "country": "CN", "countryAlt": "TW", "region": "Asia", "topRegion": "Asia", "subRegion": "Eastern Asia", "countriesTop": [ "CN", "TW", "HK", "SG", "KR", "PH", "MO", "VN", "KH", "AU" ], "probabilityCalibrated": 0.8966946013357358, "probabilityAltCalibrated": 0.9205811403508772 }

Field	Example	Description
id	ref12315	The input identifier
name	Jing Cao	The input full name
country	CN	The likely residence country ISO2 code, which CAN include melting-pot countries
countryAlt	TW	The best alternative residence country
region	Asia	An arbitrary grouping of countries by topRegion/Region/subRegion
topRegion	Asia	An arbitrary grouping of countries by topRegion/Region/subRegion
subRegion	Eastern Asia	An arbitrary grouping of countries by topRegion/Region/subRegion
countriesTop	CN, TW, HK...	The top 10 likely residence country ISO2 codes
probabilityCalibrated	0.89	The calibrated probability of having guessed right the country of residence (CN)
probabilityAltCalibrated	0.92	The calibrated probability of having guessed right the country of residence as either CN or TW.
score	33	A non calibrated Score (use Probability instead) : score = Math.log(getProbaFirst() / getProbaNotFirst()) maxed to 100
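One way to use the two calibrated probabilities is to accept the first choice when it is confident enough, and otherwise fall back to the pair (first choice or best alternative). The Python sketch below works on the JSON example above; `best_guess` and the thresholds are illustrative assumptions, not a NamSor recommendation.

```python
import json

COUNTRY_JSON = """
{ "id": null, "name": "Jing Cao", "score": 33.88839357879743,
  "country": "CN", "countryAlt": "TW", "region": "Asia", "topRegion": "Asia",
  "subRegion": "Eastern Asia",
  "countriesTop": [ "CN", "TW", "HK", "SG", "KR", "PH", "MO", "VN", "KH", "AU" ],
  "probabilityCalibrated": 0.8966946013357358,
  "probabilityAltCalibrated": 0.9205811403508772 }
"""


def best_guess(record, min_probability):
    """Return the smallest set of countries that reaches min_probability.

    [country] if the first choice is confident enough, [country, countryAlt]
    if only the pair is, and an empty list otherwise.
    """
    if record["probabilityCalibrated"] >= min_probability:
        return [record["country"]]
    if record["probabilityAltCalibrated"] >= min_probability:
        return [record["country"], record["countryAlt"]]
    return []


record = json.loads(COUNTRY_JSON)
for threshold in (0.85, 0.90, 0.95):
    print(threshold, best_guess(record, threshold))
```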

origin classifier

This classification model infers the likely country of origin from a name, based on how the name appears in the country of origin. This classifier doesn't attempt to classify to any of the melting-pot countries (US, CA, etc.) but would recognize a French, Italian, British, Japanese name etc. as they appear in France, Italy, Great Britain, Japan. The taxonomy classes are : https://v2.namsor.com/NamSorAPIv2/api2/json/taxonomyClasses/personalname_origin_country

Input :

John W.|Smith
Mary|Smith
Elena|Rossi
Robert|Durieux

Output :

#uid|firstName|lastName|countryOrigin|countryOriginAlt|countryOriginScore|rowId
uid2|Elena|Rossi|IT|FR|14.848086484203032|0
uid3|Robert|Durieux|FR|BE|39.63483415843564|1
uid0|John W.|Smith|GB|IE|21.09482904145537|2
uid1|Mary|Smith|GB|IE|12.87667003646059|3

origin data output

{ "id": null, "firstName": "Jing", "lastName": "Cao", "countryOrigin": "CN", "countryOriginAlt": "TW", "countriesOriginTop": [ "CN", "TW", "HK", "VN", "KR", "MY", "KH", "ID", "DK", "CM" ], "score": 25.613603655787934, "regionOrigin": "Asia", "topRegionOrigin": "Asia", "subRegionOrigin": "Eastern Asia", "probabilityCalibrated": 0.9092268352804216, "probabilityAltCalibrated": 0.9883173013909269 }

Field	Example	Description
id	ref12315	The input identifier
firstName	Jing	The input first name / given name
lastName	Cao	The input last name / surname
countryOrigin	CN	The likely country of origin (ISO2 code)
countryOriginAlt	TW	The best alternative country of origin (ISO2 code)
regionOrigin	Asia	An arbitrary grouping of countries by topRegion/Region/subRegion
topRegionOrigin	Asia	An arbitrary grouping of countries by topRegion/Region/subRegion
subRegionOrigin	Eastern Asia	An arbitrary grouping of countries by topRegion/Region/subRegion
countriesOriginTop	CN, TW, HK...	The top 10 likely countries of origin (ISO2 code)
probabilityCalibrated	0.90	The calibrated probability of having guessed right the country of origin (CN)
probabilityAltCalibrated	0.98	The calibrated probability of having guessed right the country of origin as either CN or TW.
score	25	A non calibrated Score (use Probability instead) : score = Math.log(getProbaFirst() / getProbaNotFirst()) maxed to 100
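The score formula quoted in the table can be sketched in Python as a capped log-ratio. How NamSor derives the two probabilities (`getProbaFirst()` and `getProbaNotFirst()`) is not described on this page, so the inputs below are purely illustrative and `raw_score` is a hypothetical helper.

```python
import math

MAX_SCORE = 100.0  # "maxed to 100" in the field description


def raw_score(proba_first, proba_not_first):
    """Natural log of the ratio between the first choice and the rest, capped at 100."""
    if proba_first <= 0.0:
        raise ValueError("proba_first must be positive")
    if proba_not_first <= 0.0:
        # The ratio is unbounded, so the cap applies.
        return MAX_SCORE
    return min(MAX_SCORE, math.log(proba_first / proba_not_first))


print(raw_score(0.5, 0.5))   # equal probabilities give a score of 0
print(raw_score(0.9, 0.1))   # a clear first choice gives a positive score
print(raw_score(0.4, 0.6))   # a weak first choice gives a negative score
```

Because the score is a log-ratio, it is not bounded to 0..1 and can be slightly negative, which is why the calibrated probability is the recommended value to read.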

diaspora classifier

This classification model infers the ethnicity or likely diaspora from a name, given a geographic context (ex. US, CA, ...). It attempts to recognize French, Italian, British, Japanese names etc. both as they appear in France, Italy, Great Britain, Japan and as members of the French, Italian, British, Japanese diaspora would be named in the United States, for example. From v2.0.16, the Diaspora model has calibratedProbability/calibratedProbabilityAlt like the other models (Gender, Country, Origin, US 'Race'/Ethnicity), based on its ability to predict either (A) the Diaspora country of birth or (B) the name's country of Origin (i.e. consistency with the NamSor Origin model). The taxonomy classes are : https://v2.namsor.com/NamSorAPIv2/api2/json/taxonomyClasses/personalname_country_diaspora

Input :

id12|John W.|Smith|US
id13|Mary|Smith|GB
id14|Elena|Rossi|IT
id15|Robert|Durieux|FR

Output :

#uid|firstName|lastName|countryIso2|ethnicity|ethnicityAlt|ethnicityScore|rowId
id13|Mary|Smith|GB|British|Irish|12.348217311566847|0
id12|John W.|Smith|US|British|Irish|27.307137947726286|1
id15|Robert|Durieux|FR|French|Jewish|75.65330992570755|2
id14|Elena|Rossi|IT|Italian|Portuguese|46.084654576433834|3

diaspora data output

{ "id": null, "firstName": "Mary", "lastName": "Cao", "score": 12.163977377279767, "ethnicityAlt": "Vietnamese", "ethnicity": "Chinese", "lifted": false, "countryIso2": "US", "ethnicitiesTop": [ "Chinese", "Vietnamese", "NativeHawaiian", "HispanoLatino", "Portuguese", "Cambodian", "Italian", "Malays", "Jewish", "Hispanic" ] }

Field	Example	Description
id	ref12315	The input identifier
firstName	Mary	The input first name / given name
lastName	Cao	The input last name / surname
countryIso2	US	The country of residence, the host country (ex. US, CA, NZ, GB)
ethnicity	Chinese	The likely ethnicity
ethnicityAlt	Vietnamese	The best alternative ethnicity
ethnicitiesTop	Chinese, Vietnamese, NativeHawaiian ...	The top 10 likely ethnicities
score	12	A non calibrated Score : score = Math.log(getProbaFirst() / getProbaNotFirst()) maxed to 100 ; NB: the example above has no calibrated probabilities, which were added to the Diaspora model in v2.0.16
lifted	false	Some classifications are 'lifted' by a dictionary rule, instead of the machine learning

US 'race'/ethnicity classifier

This classification model infers the US 'race' / ethnicity from a US name. The geographic context HAS TO BE 'US', or the model will fail. This model outputs race/ethnicity according to US Census taxonomy W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino). The taxonomy classes are : https://v2.namsor.com/NamSorAPIv2/api2/json/taxonomyClasses/personalname_us_race_ethnicity

This is an independent assessment of the model's accuracy provided by ResearchDone.com : https://www.dropbox.com/s/xkfll1nswqjwdn1/Race%20Classification%20Results.txt

From NamSorAPIv2.0.14, it is possible to adjust the taxonomy using a header parameter, X-OPTION-USRACEETHNICITY-TAXONOMY

X-OPTION-USRACEETHNICITY-TAXONOMY: USRACEETHNICITY-4CLASSES (the default) returns 4 classes : W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino).
X-OPTION-USRACEETHNICITY-TAXONOMY: USRACEETHNICITY-6CLASSES returns 6 classes W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino), AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander).
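The header can be prepared on the client side as in the Python sketch below. The header name, the taxonomy values and the class codes are taken from this page (the CLASSIC value is listed in the v2.0.15 release notes); `taxonomy_headers` is a hypothetical helper, and authentication headers and the request itself are deliberately left out.

```python
TAXONOMY_HEADER = "X-OPTION-USRACEETHNICITY-TAXONOMY"

# Taxonomy value -> classes it can return, as listed on this page.
TAXONOMIES = {
    "USRACEETHNICITY-4CLASSES": ["W_NL", "HL", "A", "B_NL"],
    "USRACEETHNICITY-4CLASSES-CLASSIC": ["W_NL", "HL", "A", "B_NL"],
    "USRACEETHNICITY-6CLASSES": ["W_NL", "HL", "A", "B_NL", "AI_AN", "PI"],
}


def taxonomy_headers(taxonomy="USRACEETHNICITY-4CLASSES"):
    """Build the optional request header selecting a US 'race'/ethnicity taxonomy."""
    if taxonomy not in TAXONOMIES:
        raise ValueError(f"unknown taxonomy: {taxonomy!r}")
    return {TAXONOMY_HEADER: taxonomy}


headers = taxonomy_headers("USRACEETHNICITY-6CLASSES")
print(headers)
print("possible classes:", TAXONOMIES["USRACEETHNICITY-6CLASSES"])
```

The resulting dictionary can be merged into the headers of whichever HTTP client is used to call the API.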

Input :

id12|John W.|Smith|US
id15|Robert|Durieux|US
id16|Jordan|Jackson|US
id17|Carmen|Garcia|US

Output :

#uid|firstName|lastName|countryIso2|raceEthnicity|raceEthnicityAlt|raceEthnicityScore|rowId
id17|Carmen|Garcia|US|HL|A|10.32374080384995|0
id16|Jordan|Jackson|US|B_NL|W_NL|1.9105209599712982|1
id12|John W.|Smith|US|W_NL|B_NL|2.783278508661135|2
id15|Robert|Durieux|US|W_NL|B_NL|1.8889062776993453|3

US 'race'/ethnicity data output

{ "id": null, "firstName": "Mary", "lastName": "Cao", "raceEthnicityAlt": "W_NL", "raceEthnicity": "A", "score": 27.341640697082248, "raceEthnicitiesTop": [ "A", "W_NL", "HL", "B_NL" ], "probabilityCalibrated": 0.9104267920103436, "probabilityAltCalibrated": 0.954264449825495 }

Field	Example	Description
id	ref12315	The input identifier
firstName	Mary	The input first name / given name
lastName	Cao	The input last name / surname
countryIso2	US	The country of residence, the host country (ex. US, CA, NZ, GB)
raceEthnicity	A	The likely 'race'/ethnicity : W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino)
raceEthnicityAlt	W_NL	The best alternative 'race'/ethnicity
raceEthnicitiesTop	A, W_NL, ...	The likely 'race'/ethnicities
probabilityCalibrated	0.91	The calibrated probability of having guessed right the 'race'/ethnicity as A (Asian)
probabilityAltCalibrated	0.95	The calibrated probability of having guessed right the 'race'/ethnicity as either A or W_NL (White Non Latino)
score	27	A non calibrated Score (use Probability instead) : score = Math.log(getProbaFirst() / getProbaNotFirst()) maxed to 100

parse classifier

This classification model is a utility for parsing full names (ex. John Smith or Smith, John) into the first and last name components. The system will detect which part is more likely a given name or a family name, and decide where to split in complex cases (such as aristocratic names, composed names, etc.)

Input :

John W. Smith
Mary Smith
Elena Rossi
Robert Durieux
Durieux Robert
Smith Mary

Output :

#uid|fullName|firstNameParsed|lastNameParsed|nameParserType|nameParserTypeAlt|nameParserTypeScore|rowId
uid4|Durieux Robert|Robert|Durieux|LN1FN1|null|8.984422928615022|0
uid5|Smith Mary|Mary|Smith|LN1FN1|null|8.313637255008238|1
uid2|Elena Rossi|Elena|Rossi|FN1LN1|null|7.534662622281973|2
uid3|Robert Durieux|Robert|Durieux|FN1LN1|null|8.429672146707018|3
uid0|John W. Smith|John W.|Smith|FN2LN1|null|16.30796669777909|4
uid1|Mary Smith|Mary|Smith|FN1LN1|null|7.758738464551846|5
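The `nameParserType` codes in the sample output appear to describe the token layout of the full name: for instance FN2LN1 would mean two given-name tokens followed by one family-name token. This reading is inferred from the samples above, not from documented semantics. Under that assumption, a minimal Python sketch (`split_by_parser_type` is a hypothetical helper) reproduces the parsed columns:

```python
import re


def split_by_parser_type(full_name, parser_type):
    """Split a full name into (first, last) following a code such as 'FN2LN1'.

    Assumption: the code is a sequence of FN<n>/LN<n> groups, each covering
    n whitespace-separated tokens, in the order they appear in the name.
    """
    groups = re.findall(r"(FN|LN)(\d+)", parser_type)
    if not groups or "".join(kind + count for kind, count in groups) != parser_type:
        raise ValueError(f"unsupported parser type: {parser_type!r}")
    tokens = full_name.split()
    if sum(int(count) for _, count in groups) != len(tokens):
        raise ValueError("token count does not match the parser type")
    parts = {"FN": [], "LN": []}
    position = 0
    for kind, count in groups:
        parts[kind].extend(tokens[position:position + int(count)])
        position += int(count)
    return " ".join(parts["FN"]), " ".join(parts["LN"])


# Samples taken from the output above.
for name, code in [
    ("John W. Smith", "FN2LN1"),
    ("Durieux Robert", "LN1FN1"),
    ("Elena Rossi", "FN1LN1"),
]:
    print(name, "->", split_by_parser_type(name, code))
```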
