NamSor Tools V2

NamSor edited this page Dec 4, 2023 · 53 revisions

Welcome to the namsor-tools-v2 wiki!

In scientific papers, please indicate software version (NamSorAPIv2.X.YY) and date of data retrieval.

Release Notes

NamSorAPIv2.0.29 (2023-12-03)

  • Improvements for Albanian and Kosovo Albanian Diaspora Mapping
  • Created a new API for Community Engagement (this option requires a specific licence)
  • Added full name classification for Origin and Diaspora
  • Added first/last name classification for Country
  • Differentiated between regionStat, religionStatAlt and religionStatSynthetic

NamSorAPIv2.0.28 (2023-10-08)

  • Added Filipino ethnicity to Diaspora
  • Fixed typo in ethnicity Belarusian
  • Improvements on names in Cyrillic

NamSorAPIv2.0.27 (2023-07-16)

  • Added India endpoints to classify first/last names by Religion/Caste/Castegroup/Indian State
  • India caste group General was split as General and General/High Caste
  • Added a finer grained classification by detailed caste
  • Fixed smallish issue on admin CreditAPI
  • further improved free account abuse/spam detection

NamSorAPIv2.0.26 (2023-06-18)

  • added option to return country religion statistics for taxonomies Origin/Country/Diaspora, with header X-OPTION-RELIGION-STATS=True
  • improved AI explainability for Enterprise users (API Key should be set to Explainability=True and API queries with header X-OPTION-EXPLAINABILITY=True)
  • replaced JNBC with gotyai-java
  • further improved free account abuse/spam detection

NamSorAPIv2.0.25 (2023-05-20)

  • fixed Italian names issue ex Andrea/Rossini (https://github.com/namsor/namsor-tools-v2/issues/23)
  • added AI explainability for Enterprise users (API Key should be set to Explainability=True and API queries with header X-OPTION-EXPLAINABILITY=True)
  • improved some logging features as well as other internal services (free account abuse/spam detection)

NamSorAPIv2.0.24 (2023-03-12)

  • Added specific endpoints for Indian names : Indian State subclassification, Religion (Hindu, Muslim, Jain, Christian), Caste Category (General, ST, SC, OST)
  • Other improvements on Latin American countries, Portuguese and Spanish names

NamSorAPIv2.0.23 (2023-01-15)

  • Improvements for Indian names sub-classification
  • New taxonomies for Indian names : Religion, Caste Category (General, ST, SC, OST)

NamSorAPIv2.0.22 (2022-12-17)

  • Improvements on names in ARABIC (gender, origin, country)
  • Improvements for Indian names sub-classification
  • Other improvements : Malaysia, Indonesia, Brasil/Portugal, Spain/LatAm

NamSorAPIv2.0.21 (2022-09-25)

  • Added mandatory email verification before activation of API Key due to abuse
  • Improvements for Indian names gender classification
  • Improvements for classification of Italian names with US "Race"/Ethnicity (issue not fully resolved; the current recommendation is to combine with the Diaspora model)

NamSorAPIv2.0.20 (2022-07-28)

  • Diaspora model enhancements for 7 European countries : Italy (IT), France (FR), Germany (DE), Ireland (IE), Netherlands (NL), Belgium (BE), Spain (ES).
  • Added an open endpoint to query the list of countries/regions
  • Minor back-end changes to support features of the new front-end version namsor.app
  • Added OPTIONS to CORS for ethnicity-estimate.com

NamSorAPIv2.0.19 (2022-05-08)

  • Fixed rare issue with NaN score/probability
  • Improvements for Minorities/Diversity Analytics in Italy
  • No major change to API (still at v2.namsor.com)
  • Added OPTIONS to CORS for gender-guesser.com

NamSorAPIv2.0.18 (2022-01-16)

  • Second batch of improvements on CYRILLIC : Diaspora
  • No major change to API (still at v2.namsor.com), but we're now redirecting namsor.com front-end to the new version namsor.app
  • Added a gender endpoint for just given names (defaulting to 'US' local context)
  • Fixed Stripe redirect
  • Fixed ParseName issue when not trimmed()
  • Added X-OPTION-USRACEETHNICITY-TAXONOMY to CORS

NamSorAPIv2.0.17 (2021-12-05)

  • First batch of improvements on CYRILLIC : Origin, Country and Gender

NamSorAPIv2.0.16 (2021-09-26)

  • Diaspora model now has calibratedProbability/calibratedProbabilityAlt as the other models (Gender, Country, Origin, US 'Race'/Ethnicity) based on ability to predict either (A) the Diaspora country of birth or (B) the name country of Origin (ie. consistency with NamSor Origin model).
  • Various admin API enhancements to support the new Website and CSV tool
  • [BugFix] Slightly negative scores were causing a Python lib error https://github.com/namsor/namsor-tools-v2/issues/17

NamSorAPIv2.0.15 (2021-07-18)

  1. US 'Race'/Ethnicity : Optionally add header X-OPTION-USRACEETHNICITY-TAXONOMY:
      • USRACEETHNICITY-4CLASSES is a new classifier compatible with the prior version, but trained using a combination of US data and non-US data (ex. international names of sub-Saharan Africa are classified as B_NL; international names of East Asia are classified as A) in alignment with https://www.census.gov/topics/population/race/about.html
      • USRACEETHNICITY-4CLASSES-CLASSIC for the classic US 'Race'/Ethnicity classifier (pre-version 2.0.15), which has 4 classes : W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino) purely trained on US data.
      • USRACEETHNICITY-6CLASSES for two additional classes, AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander). With this option, the classifier has 6 classes : W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino), AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander) purely trained on US data.
  2. General improvements to gender / country / origin models across all countries
  3. Specific improvements to better classify names of : NG (Nigeria), BD (Bangladesh), ZA (South Africa), AF (Afghanistan), IR (Iran).

NamSorAPIv2.0.14 (2021-04-11)

  • US 'Race'/Ethnicity : Optionally add header X-OPTION-USRACEETHNICITY-TAXONOMY: USRACEETHNICITY-6CLASSES for two additional classes, AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander). With this option, classifier has 6 classes : W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino), AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander).
  • [BugFix] Regression on parsing Spanish names without context ES https://github.com/namsor/namsor-tools-v2/issues/16

NamSorAPIv2.0.13 (2021-03-14)

  • Japanese names : improvements for gender classification (LATIN, HAN / Kanji, KATAKANA) ; translation LATIN->HAN / Kanji and back.
  • Improvements on US 'Race'/Ethnicity model, Diaspora Model
  • UI : added an online CSV tool to process files from JavaScript client, append gender, origin, country, diaspora or US 'race'/ethnicity to a list of names in Excel/CSV format.
  • [Known Issue] Regression on parsing Spanish names without context ES (pls specify country code ES as a workaround)

NamSorAPIv2.0.12 (2021-01-31)

  • Improvements for gender classification of full names
  • Split Diaspora taxonomy classes Irish,British -> Irish,English,Scottish,Welsh (British remains as first/second best alternative for now)
  • Improve Diaspora classification with non LATIN names
  • Added Corridor API for classifying names in cross-border contexts (relevant for : diaspora remittances, international travel, foreign direct investment, crowdfunding etc.)
  • Added a general name classification API (nameType), accuracy in range 90-95%
  • [BETA] UI : added an online CSV tool to process files from JavaScript client, append gender, origin, country, diaspora or US 'race'/ethnicity to a list of names in Excel/CSV format.

NamSorAPIv2.0.11 (2020-10-31)

  • (Infrastructure) SSL Certificates updated
  • Improvements for names in Brazil, Pakistan, Indonesia
  • Improvements for gender classification of full names in Pakistan, Indonesia
  • [BETA] added a general name classification API (nameType)
  • added a new SDK for JavaScript

NamSorAPIv2.0.10 (2020-06-08)

NamSorAPIv2.0.9 (2020-03-15)

  • Diaspora API improvements for US, FR

NamSorAPIv2.0.8 (2020-01-04)

NamSorAPIv2.0.7 (2019-11-24)

  • The probability calibration is no longer based on the Score, but based on the probability estimates.

NamSorAPIv2.0.6 (2019-10-25)

NamSorAPIv2.0.5 (2019-07-21)

  • Added a calibrated probability based on the Score and a validation set

NamSorAPIv2.0.4 (2019-06-30)

classification

NamSor V2 uses Naive Bayes, a class of algorithms which is excellent at classification. Each classifier will output a SCORE, which is based on the relative probability between the predicted value and the other alternatives.

For most classifiers, we also use a validation dataset to calibrate the probability estimates against actual precision / recall, then return a calibrated probability (the probabilityCalibrated field in the JSON outputs below) that can be read directly as a probability. The alternative value (probabilityAltCalibrated) is the probability of getting either the first choice OR the best alternative right.

For example, to determine Gender of names in the US, with score>0, we have a 95% precision and 100% recall. By filtering score>1, you can exclude ambiguous names and increase the precision (at the cost of reducing the recall).

Mapping Rounded Gender Score to Precision and Recall :
======================================================
SCORE	PREC.	RECALL
0	95%	100%
1	96%	98%
2	97%	93%
3	97%	86%
4	98%	73%
5	99%	56%
6	99%	40%
7	99%	25%
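
The table above can be turned into a small lookup for choosing a score cut-off. This is a minimal sketch, not part of the NamSor SDKs: the function names are hypothetical, the figures are copied from the table, and the strict comparison (score > threshold) is an assumption taken from the "score>1" wording above.

```python
# Precision and recall per minimum rounded gender score, copied from the table above.
SCORE_TABLE = {
    0: (0.95, 1.00),
    1: (0.96, 0.98),
    2: (0.97, 0.93),
    3: (0.97, 0.86),
    4: (0.98, 0.73),
    5: (0.99, 0.56),
    6: (0.99, 0.40),
    7: (0.99, 0.25),
}


def min_score_for_precision(target_precision):
    """Return the lowest tabulated threshold that reaches the target precision, or None."""
    for score in sorted(SCORE_TABLE):
        precision, _recall = SCORE_TABLE[score]
        if precision >= target_precision:
            return score
    return None


def filter_by_score(rows, threshold):
    """Keep rows (dicts with a numeric 'score' key) whose score is strictly above the threshold."""
    return [row for row in rows if row["score"] > threshold]


threshold = min_score_for_precision(0.97)
kept = filter_by_score([{"score": 0.5}, {"score": 4.3}], threshold)
```

Raising the threshold trades recall for precision, so the expected recall at the chosen threshold (SCORE_TABLE[threshold][1]) is worth checking before discarding names.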

country ISO2 codes

Several classifiers take contextual geographic information as input, or return a country ISO2 code as output. The list of ISO2 country codes is available at https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv and https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2

classifiers

NamSor V2 provides several different classifiers : gender, origin, diaspora, US 'race'/ethnicity ... The classifiers learn from one another's input, and each one outputs a classification according to a specific taxonomy.

gender classifier

NamSor aims to offer the best accuracy on predicting likely gender from names on a global scale : not just for US and European names, but also Asian, African ... in all languages and alphabets.

The taxonomy classes are https://v2.namsor.com/NamSorAPIv2/api2/json/taxonomyClasses/personalname_gender . This is a binary classifier, but the results are probability estimates. Non-binary genders are not covered by this taxonomy and should be accounted for using other methods (ex. survey).

inferring gender from names, with no geographic context

If you append gender to a simple list of first and last names (ex. John|Smith), without any geographic context, the software will try to detect the geographic context automatically from the last name.

Input :

John W.|Smith
Mary|Smith
Elena|Rossi
Robert|Durieux

Output :

#uid|firstName|lastName|likelyGender|likelyGenderScore|genderScale|rowId
uid2|Elena|Rossi|female|6.053053604522956|1.0|0
uid3|Robert|Durieux|male|7.339173503361962|-1.0|1
uid0|John W.|Smith|male|8.436845363426315|-1.0|2
uid1|Mary|Smith|female|4.349891771969253|1.0|3

inferring gender from names, with a geographic context

The recommended input format is to specify a unique ID and a geographic context (if known) as a countryIso2 code.

Input :

id12|John W.|Smith|US
id13|Mary|Smith|GB
id14|Elena|Rossi|IT
id15|Robert|Durieux|FR

Output :

#uid|firstName|lastName|countryIso2|likelyGender|likelyGenderScore|genderScale|rowId
id13|Mary|Smith|GB|female|4.164040354303551|1.0|0
id12|John W.|Smith|US|male|8.436845363426315|-1.0|1
id15|Robert|Durieux|FR|male|7.162120388463375|-1.0|2
id14|Elena|Rossi|IT|female|5.555580235429088|1.0|3
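
The output rows do not come back in input order, so it helps to index them by uid. Below is a minimal sketch that reads the pipe-delimited output shown above with Python's standard csv module. The function name parse_output is hypothetical, and the column names are taken from the sample header line.

```python
import csv
import io

SAMPLE_OUTPUT = """#uid|firstName|lastName|countryIso2|likelyGender|likelyGenderScore|genderScale|rowId
id13|Mary|Smith|GB|female|4.164040354303551|1.0|0
id12|John W.|Smith|US|male|8.436845363426315|-1.0|1
id15|Robert|Durieux|FR|male|7.162120388463375|-1.0|2
id14|Elena|Rossi|IT|female|5.555580235429088|1.0|3
"""


def parse_output(text):
    """Parse pipe-delimited output into a dict keyed by uid."""
    lines = text.splitlines()
    # The header line starts with '#', which is not part of the first column name.
    header = lines[0].lstrip("#").split("|")
    reader = csv.DictReader(
        io.StringIO("\n".join(lines[1:])),
        fieldnames=header,
        delimiter="|",
        quoting=csv.QUOTE_NONE,
    )
    rows = {}
    for row in reader:
        # Convert the numeric columns so they can be compared and filtered.
        row["likelyGenderScore"] = float(row["likelyGenderScore"])
        row["genderScale"] = float(row["genderScale"])
        rows[row["uid"]] = row
    return rows


rows = parse_output(SAMPLE_OUTPUT)
# Restore the original input order by looking rows up by their uid.
in_input_order = [rows[uid]["firstName"] for uid in ["id12", "id13", "id14", "id15"]]
```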

gender data output

{ "id": null, "firstName": "John", "lastName": "Smith", "likelyGender": "male", "genderScale": -0.9918105205926329, "score": 41.11285807293116, "probabilityCalibrated": 0.9959052602963164 }

Field	Example	Description
id	ref12315	The input identifier
firstName	John	The input given name / firstName
lastName	Smith	The input family name / surname / lastName
likelyGender	male	The likely gender : male or female
probabilityCalibrated	0.99	The calibrated probability : 0.5 is unknown, 1 is certain
genderScale	-0.99	The scale is -1..0..+1 and is based on the probability (Probability = 0.5 -> Scale = 0; Gender = Male & Probability = 1 -> Scale = -1; Gender = Female & Probability = 1 -> Scale = +1)
score	41	A non calibrated Score (use Probability instead) : score = Math.log(getProbaFirst() / getProbaNotFirst()) capped at 100
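
The genderScale and score descriptions above can be written out as two small functions. This is a sketch under stated assumptions, not NamSor's implementation: the linear mapping scale = ±(2 × probability − 1) is inferred from the three anchor points in the table and agrees with the sample JSON (0.9959... gives -0.9918...), the natural logarithm is assumed because the formula uses Math.log, and the two raw probabilities are internal values that the API does not return.

```python
import math


def gender_scale(likely_gender, probability):
    """Map a gender and its calibrated probability (0.5..1) onto the -1..0..+1 scale."""
    magnitude = 2.0 * probability - 1.0
    # Male is the negative end of the scale, female the positive end.
    return -magnitude if likely_gender == "male" else magnitude


def raw_score(proba_first, proba_not_first, cap=100.0):
    """Log-ratio of the top class against the rest, capped at 100."""
    if proba_not_first <= 0.0:
        # Avoid dividing by zero: an unopposed prediction gets the cap.
        return cap
    return min(math.log(proba_first / proba_not_first), cap)


scale = gender_scale("male", 0.9959052602963164)
```

The scale is convenient for averaging over a population (values near 0 mean a balanced or ambiguous group), while the calibrated probability is the better choice for filtering individual rows.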

country classifier

This classification model infers the likely country of residence, based on the full name alone. The taxonomy classes are https://v2.namsor.com/NamSorAPIv2/api2/json/taxonomyClasses/personalfullname_country

country data output

{ "id": null, "name": "Jing Cao", "score": 33.88839357879743, "country": "CN", "countryAlt": "TW", "region": "Asia", "topRegion": "Asia", "subRegion": "Eastern Asia", "countriesTop": [ "CN", "TW", "HK", "SG", "KR", "PH", "MO", "VN", "KH", "AU" ], "probabilityCalibrated": 0.8966946013357358, "probabilityAltCalibrated": 0.9205811403508772 }

Field	Example	Description
id	ref12315	The input identifier
name	Jing Cao	The input full name
country	CN	The likely residence country ISO2 code, which CAN include melting-pot countries
countryAlt	TW	The best alternative residence country
region	Asia	An arbitrary grouping of countries by topRegion/Region/subRegion
topRegion	Asia	An arbitrary grouping of countries by topRegion/Region/subRegion
subRegion	Eastern Asia	An arbitrary grouping of countries by topRegion/Region/subRegion
countriesTop	CN, TW, HK...	The top 10 likely residence country ISO2 codes
probabilityCalibrated	0.89	The calibrated probability of having guessed right the country of residence (CN)
probabilityAltCalibrated	0.92	The calibrated probability of having guessed right the country of residence as either CN or TW.
score	33	A non calibrated Score (use Probability instead) : score = Math.log(getProbaFirst() / getProbaNotFirst()) capped at 100
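
One way to use the two calibrated probabilities together is to return the smallest set of candidate countries that meets a confidence bar. The sketch below works on the sample JSON above; the function name candidate_countries and the 0.9 default are hypothetical choices for illustration.

```python
import json

SAMPLE = """
{ "id": null, "name": "Jing Cao", "score": 33.88839357879743, "country": "CN",
  "countryAlt": "TW", "region": "Asia", "topRegion": "Asia", "subRegion": "Eastern Asia",
  "countriesTop": [ "CN", "TW", "HK", "SG", "KR", "PH", "MO", "VN", "KH", "AU" ],
  "probabilityCalibrated": 0.8966946013357358,
  "probabilityAltCalibrated": 0.9205811403508772 }
"""


def candidate_countries(result, min_probability=0.9):
    """Return the smallest candidate set whose calibrated probability meets the bar."""
    if result["probabilityCalibrated"] >= min_probability:
        # The first choice alone is confident enough.
        return [result["country"]]
    if result["probabilityAltCalibrated"] >= min_probability:
        # Only "first choice or best alternative" is confident enough.
        return [result["country"], result["countryAlt"]]
    # Neither probability reaches the bar: treat the result as undecided.
    return []


result = json.loads(SAMPLE)
candidates = candidate_countries(result)
```

With the sample above and the 0.9 default, the first choice alone falls just short, so both CN and TW are returned.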

origin classifier

This classification model infers the likely country of origin from a name, based on how the name appears in the country of origin. This classifier doesn't attempt to classify to any of the melting-pot countries (US, CA, etc.) but would recognize a French, Italian, British, Japanese name etc. as it appears in France, Italy, Great-Britain, Japan. The taxonomy classes are : https://v2.namsor.com/NamSorAPIv2/api2/json/taxonomyClasses/personalname_origin_country

Input :

John W.|Smith
Mary|Smith
Elena|Rossi
Robert|Durieux

Output :

#uid|firstName|lastName|countryOrigin|countryOriginAlt|countryOriginScore|rowId
uid2|Elena|Rossi|IT|FR|14.848086484203032|0
uid3|Robert|Durieux|FR|BE|39.63483415843564|1
uid0|John W.|Smith|GB|IE|21.09482904145537|2
uid1|Mary|Smith|GB|IE|12.87667003646059|3

origin data output

{ "id": null, "firstName": "Jing", "lastName": "Cao", "countryOrigin": "CN", "countryOriginAlt": "TW", "countriesOriginTop": [ "CN", "TW", "HK", "VN", "KR", "MY", "KH", "ID", "DK", "CM" ], "score": 25.613603655787934, "regionOrigin": "Asia", "topRegionOrigin": "Asia", "subRegionOrigin": "Eastern Asia", "probabilityCalibrated": 0.9092268352804216, "probabilityAltCalibrated": 0.9883173013909269 }

Field	Example	Description
id	ref12315	The input identifier
firstName	Jing	The input first name / given name
lastName	Cao	The input last name / surname
countryOrigin	CN	The likely country of origin (ISO2 code)
countryOriginAlt	TW	The best alternative country of origin (ISO2 code)
regionOrigin	Asia	An arbitrary grouping of countries by topRegion/Region/subRegion
topRegionOrigin	Asia	An arbitrary grouping of countries by topRegion/Region/subRegion
subRegionOrigin	Eastern Asia	An arbitrary grouping of countries by topRegion/Region/subRegion
countriesOriginTop	CN, TW, HK...	The top 10 likely countries of origin (ISO2 code)
probabilityCalibrated	0.90	The calibrated probability of having guessed right the country of origin (CN)
probabilityAltCalibrated	0.98	The calibrated probability of having guessed right the country of origin as either CN or TW.
score	25	A non calibrated Score (use Probability instead) : score = Math.log(getProbaFirst() / getProbaNotFirst()) capped at 100

diaspora classifier

This classification model infers the ethnicity or likely diaspora from a name, given a geographic context (ex. US, CA, ...). The model attempts to recognize French, Italian, British, Japanese names etc. both as they appear in France, Italy, Great-Britain, Japan and as people of the French, Italian, British, Japanese diasporas would be named in the United States, for example. From v2.0.16, the Diaspora model has calibratedProbability/calibratedProbabilityAlt like the other models (Gender, Country, Origin, US 'Race'/Ethnicity), based on the ability to predict either (A) the Diaspora country of birth or (B) the name's country of Origin (i.e. consistency with the NamSor Origin model). The taxonomy classes are : https://v2.namsor.com/NamSorAPIv2/api2/json/taxonomyClasses/personalname_country_diaspora

Input :

id12|John W.|Smith|US
id13|Mary|Smith|GB
id14|Elena|Rossi|IT
id15|Robert|Durieux|FR

Output :

#uid|firstName|lastName|countryIso2|ethnicity|ethnicityAlt|ethnicityScore|rowId
id13|Mary|Smith|GB|British|Irish|12.348217311566847|0
id12|John W.|Smith|US|British|Irish|27.307137947726286|1
id15|Robert|Durieux|FR|French|Jewish|75.65330992570755|2
id14|Elena|Rossi|IT|Italian|Portuguese|46.084654576433834|3

diaspora data output

{ "id": null, "firstName": "Mary", "lastName": "Cao", "score": 12.163977377279767, "ethnicityAlt": "Vietnamese", "ethnicity": "Chinese", "lifted": false, "countryIso2": "US", "ethnicitiesTop": [ "Chinese", "Vietnamese", "NativeHawaiian", "HispanoLatino", "Portuguese", "Cambodian", "Italian", "Malays", "Jewish", "Hispanic" ] }

Field	Example	Description
id	ref12315	The input identifier
firstName	Mary	The input first name / given name
lastName	Cao	The input last name / surname
countryIso2	US	The country of residence, the host country (ex. US, CA, NZ, GB)
ethnicity	Chinese	The likely ethnicity
ethnicityAlt	Vietnamese	The best alternative ethnicity
ethnicitiesTop	Chinese, Vietnamese, ...	The top 10 likely ethnicities
score	12	A non calibrated Score : score = Math.log(getProbaFirst() / getProbaNotFirst()) capped at 100 ; NB: the sample output above does not include calibrated probabilities, which were added to the Diaspora model in v2.0.16
lifted	false	Some classifications are 'lifted' by a dictionary rule, instead of the machine learning
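
Because the lifted flag marks results that come from a dictionary rule rather than the machine learning model, it can be useful to separate the two groups before analysing scores. A minimal sketch on the sample JSON above follows; the function name split_by_lifted is hypothetical.

```python
import json

SAMPLE = """
{ "id": null, "firstName": "Mary", "lastName": "Cao", "score": 12.163977377279767,
  "ethnicityAlt": "Vietnamese", "ethnicity": "Chinese", "lifted": false,
  "countryIso2": "US",
  "ethnicitiesTop": [ "Chinese", "Vietnamese", "NativeHawaiian", "HispanoLatino",
                      "Portuguese", "Cambodian", "Italian", "Malays", "Jewish", "Hispanic" ] }
"""


def split_by_lifted(results):
    """Split results into (machine-learning results, dictionary-rule results)."""
    from_model, from_dictionary = [], []
    for result in results:
        # A missing 'lifted' key is treated as a regular model result.
        if result.get("lifted"):
            from_dictionary.append(result)
        else:
            from_model.append(result)
    return from_model, from_dictionary


sample = json.loads(SAMPLE)
from_model, from_dictionary = split_by_lifted([sample, dict(sample, lifted=True)])
```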

US 'race'/ethnicity classifier

This classification model infers the US 'race' / ethnicity from a US name. The geographic context HAS TO BE 'US', or the model will fail. This model outputs race/ethnicity according to US Census taxonomy W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino). The taxonomy classes are : https://v2.namsor.com/NamSorAPIv2/api2/json/taxonomyClasses/personalname_us_race_ethnicity

This is an independent assessment of the model's accuracy provided by ResearchDone.com : https://www.dropbox.com/s/xkfll1nswqjwdn1/Race%20Classification%20Results.txt

From NamSorAPIv2.0.14, it is possible to adjust the taxonomy using a header parameter, X-OPTION-USRACEETHNICITY-TAXONOMY

  • X-OPTION-USRACEETHNICITY-TAXONOMY: USRACEETHNICITY-4CLASSES returns 4 classes W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino), by default.
  • X-OPTION-USRACEETHNICITY-TAXONOMY: USRACEETHNICITY-6CLASSES returns 6 classes W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino), AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander).
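
The taxonomy option is an ordinary HTTP request header, so it can be set alongside the other headers before calling the API. The sketch below only builds and validates the header dictionary and makes no network call. The option values and class lists are copied from this page (USRACEETHNICITY-4CLASSES-CLASSIC comes from the v2.0.15 release notes); the function name build_headers and the X-API-KEY header name are assumptions to check against the API documentation.

```python
TAXONOMY_HEADER = "X-OPTION-USRACEETHNICITY-TAXONOMY"

# Classes returned for each taxonomy option, as listed on this page.
TAXONOMIES = {
    "USRACEETHNICITY-4CLASSES": ["W_NL", "HL", "A", "B_NL"],
    "USRACEETHNICITY-4CLASSES-CLASSIC": ["W_NL", "HL", "A", "B_NL"],
    "USRACEETHNICITY-6CLASSES": ["W_NL", "HL", "A", "B_NL", "AI_AN", "PI"],
}


def build_headers(api_key, taxonomy="USRACEETHNICITY-4CLASSES"):
    """Build request headers that select a US 'race'/ethnicity taxonomy."""
    if taxonomy not in TAXONOMIES:
        raise ValueError("unknown taxonomy: " + taxonomy)
    return {
        "X-API-KEY": api_key,  # assumed name of the API key header
        "Accept": "application/json",
        TAXONOMY_HEADER: taxonomy,
    }


headers = build_headers("my-api-key", "USRACEETHNICITY-6CLASSES")
```

Validating the option locally makes a misspelled taxonomy fail early, before any request is sent.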

Input :

id12|John W.|Smith|US
id15|Robert|Durieux|US
id16|Jordan|Jackson|US
id17|Carmen|Garcia|US

Output :

#uid|firstName|lastName|countryIso2|raceEthnicity|raceEthnicityAlt|raceEthnicityScore|rowId
id17|Carmen|Garcia|US|HL|A|10.32374080384995|0
id16|Jordan|Jackson|US|B_NL|W_NL|1.9105209599712982|1
id12|John W.|Smith|US|W_NL|B_NL|2.783278508661135|2
id15|Robert|Durieux|US|W_NL|B_NL|1.8889062776993453|3

US 'race'/ethnicity data output

{ "id": null, "firstName": "Mary", "lastName": "Cao", "raceEthnicityAlt": "W_NL", "raceEthnicity": "A", "score": 27.341640697082248, "raceEthnicitiesTop": [ "A", "W_NL", "HL", "B_NL" ], "probabilityCalibrated": 0.9104267920103436, "probabilityAltCalibrated": 0.954264449825495 }

Field	Example	Description
id	ref12315	The input identifier
firstName	Mary	The input first name / given name
lastName	Cao	The input last name / surname
countryIso2	US	The country of residence, the host country ; it has to be US for this model
raceEthnicity	A	The likely 'race'/ethnicity : W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino)
raceEthnicityAlt	W_NL	The best alternative 'race'/ethnicity
raceEthnicitiesTop	A, W_NL, ...	The likely 'race'/ethnicities, most likely first
probabilityCalibrated	0.91	The calibrated probability of having guessed right the 'race'/ethnicity as A (Asian)
probabilityAltCalibrated	0.95	The calibrated probability of having guessed right the 'race'/ethnicity as either A or W_NL (White Non Latino)
score	27	A non calibrated Score (use Probability instead) : score = Math.log(getProbaFirst() / getProbaNotFirst()) capped at 100

parse classifier

This classification model is a utility for parsing full names (ex. John Smith or Smith, John) into their first and last name components. The system will detect which part is more likely a given name or a family name, and decide where to split in complex cases (such as aristocratic names, compound names, etc.)

Input :

John W. Smith
Mary Smith
Elena Rossi
Robert Durieux
Durieux Robert
Smith Mary

Output :

#uid|fullName|firstNameParsed|lastNameParsed|nameParserType|nameParserTypeAlt|nameParserTypeScore|rowId
uid4|Durieux Robert|Robert|Durieux|LN1FN1|null|8.984422928615022|0
uid5|Smith Mary|Mary|Smith|LN1FN1|null|8.313637255008238|1
uid2|Elena Rossi|Elena|Rossi|FN1LN1|null|7.534662622281973|2
uid3|Robert Durieux|Robert|Durieux|FN1LN1|null|8.429672146707018|3
uid0|John W. Smith|John W.|Smith|FN2LN1|null|16.30796669777909|4
uid1|Mary Smith|Mary|Smith|FN1LN1|null|7.758738464551846|5
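
The nameParserType codes in the output above appear to encode the order and number of name tokens: FN2LN1 would mean two first-name tokens followed by one last-name token, and LN1FN1 one last-name token followed by one first-name token. That reading is inferred from the sample rows and is not documented here. The sketch below decodes a code under that assumption and reproduces the parsed columns from the full name; both function names are hypothetical.

```python
import re


def decode_parser_type(parser_type):
    """Decode a code such as 'FN2LN1' into [('FN', 2), ('LN', 1)]."""
    return [(kind, int(count)) for kind, count in re.findall(r"(FN|LN)(\d+)", parser_type)]


def split_full_name(full_name, parser_type):
    """Split a full name into (first, last) following the decoded token counts."""
    tokens = full_name.split()
    parts = {"FN": [], "LN": []}
    position = 0
    for kind, count in decode_parser_type(parser_type):
        # Take the next `count` whitespace-separated tokens for this name part.
        parts[kind].extend(tokens[position:position + count])
        position += count
    return " ".join(parts["FN"]), " ".join(parts["LN"])


first, last = split_full_name("John W. Smith", "FN2LN1")
```

This only helps interpret the returned code; deciding which code applies to a given name is the classifier's job.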