how to add new names #908

Open
vbchinnam opened this issue Jun 25, 2021 · 10 comments

@vbchinnam

Hello,
I am trying to add new names. I have 1,000 first names and surnames that need to be added, and I need to generate 10k patients.
How do I provide the names to randomize over?
Do I need to change the language and ethnicity as well? If that is mandatory, please let me know how to change them or how to add a new language and ethnicity.

I appreciate your response and time.

Thank you

@jawalonoski
Member

Replace or edit the src/main/resources/names.yml file -- it is that straightforward.

Just replace the names under spanish and english if you want to override the names without regard for language.

If you want to add names for a new language (i.e. not Spanish or English) without overwriting the current names, add a language section to names.yml (e.g. french or chinese or whatever language you want to add). You then also need to edit the src/main/java/org/mitre/synthea/world/concepts/Names.java file, since it does not pick up new language sections automatically.
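
As a rough illustration of the kind of change Names.java needs, here is a minimal, self-contained sketch that looks up the list for any language section and falls back to English when the section is missing. It is not the actual Synthea code: the class `NameLookupSketch`, the `NAMES` map, the `choicesFor` helper, the gender keys `M`/`family`, and the sample names are all assumptions made for this example. Only the `"<language>.<key>"` lookup pattern is taken from this thread.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class NameLookupSketch {
  // Hypothetical stand-in for the parsed names.yml, keyed like "french.M" or "english.family".
  // The sample names below are invented for the example.
  static final Map<String, List<String>> NAMES = new HashMap<>();

  static {
    NAMES.put("english.M", List.of("Alan", "Brian"));
    NAMES.put("english.family", List.of("Smith", "Jones"));
    NAMES.put("french.M", List.of("Alain", "Bruno"));
    NAMES.put("french.family", List.of("Martin", "Bernard"));
  }

  /** Returns the list stored under "<language>.<suffix>", or the English list if there is none. */
  public static List<String> choicesFor(String language, String suffix) {
    String lang = (language == null) ? "english" : language.toLowerCase(Locale.ROOT);
    List<String> choices = NAMES.get(lang + "." + suffix);
    if (choices == null || choices.isEmpty()) {
      // Language section not present in the map: fall back to English.
      choices = NAMES.get("english." + suffix);
    }
    return choices;
  }

  public static void main(String[] args) {
    System.out.println(choicesFor("French", "M"));        // found under "french.M"
    System.out.println(choicesFor("klingon", "family"));  // no such section, English fallback
  }
}
```

With a lookup like this, adding a language would only require a new section in the YAML file rather than a new `else if` branch per language.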

@citizenrich

Hi. I've tried out adding new names this morning and modified the names.yml and Names.java files. The generation outputs almost entirely English names, despite there being the same number of names in all languages (elements from the periodic table translated into all UN languages). Is there a tweak I need to make to evenly pick out names from each language?

Edits to Names.java in case I'm making a mistake (likely)...
pastebin of names.yml: https://pastebin.com/K18R3rG2

  public static String fakeFirstName(String gender, String language, Person person) {
    List<String> choices;
    if ("spanish".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("spanish." + gender);
    } else if ("french".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("french." + gender);
    } else if ("arabic".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("arabic." + gender);
    } else if ("chinese".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("chinese." + gender);
    } else if ("russian".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("russian." + gender);
    } else {
      choices = (List<String>) names.get("english." + gender);
    }
    // ... rest of the method omitted from this excerpt
  }

  public static String fakeLastName(String language, Person person) {
    List<String> choices;
    if ("spanish".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("spanish.family");
    } else if ("french".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("french.family");
    } else if ("arabic".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("arabic.family");
    } else if ("chinese".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("chinese.family");
    } else if ("russian".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("russian.family");
    } else {
      choices = (List<String>) names.get("english.family");
    }
    // ... rest of the method omitted from this excerpt
  }

@jawalonoski
Member

If you want all names to have equal probability, the easiest solution is to just shove all the names under english.

What you have done is fine. However, the names do not all have equal probability because only a small minority of patients are assigned a primary language other than English. If you want to change that, you need to edit the primary language code (honestly, this should probably be in a configuration file somewhere):

  /**
   * Selects a language based on race and ethnicity.
   * For those of Hispanic ethnicity, language statistics are pulled from the national distribution
   * of spoken languages. For non-Hispanic, national distributions by race are used.
   * @param race US Census race
   * @param ethnicity "hispanic" or "nonhispanic"
   * @param random random to use
   * @return the language spoken
   */
  public String languageFromRaceAndEthnicity(String race, String ethnicity, Random random) {
    if (ethnicity.equals("hispanic")) {
      RandomCollection<String> hispanicLanguageUsage = new RandomCollection<>();
      // https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_B16006&prodType=table
      // Of the estimated 51,375,831 people with Hispanic ethnicity in the US:
      // - 13,957,749 speak only English (27.1%)
      // - 27,902,879 speak Spanish and English very well or well (54.3%)
      // - 9,278,993 speak Spanish and English not well or not at all (18%)
      // - 0.4% speak another language, which we will ignore to simplify things
      // 48.85% will speak English (only English + half of bilingual) the rest will speak Spanish
      hispanicLanguageUsage.add(48.85, "english");
      hispanicLanguageUsage.add(51.15, "spanish");
      return hispanicLanguageUsage.next(random);
    } else {
      switch (race) {
        // For the people who are of nonhispanic ethnicity, use the national distribution of
        // languages spoken:
        // http://www2.census.gov/library/data/tables/2008/demo/language-use/2009-2013-acs-lang-tables-nation.xls?#
        //
        // While the census does not provide a breakdown of language usage by
        // race, previously Synthea would associate languages to race through ethnicity. This
        // code "flattens" out that older relationship.
        case "white":
          // Only 1.5% of people who report a race of white alone speak English less than very well.
          // Given the previous categorization of languages by Synthea, the numbers line up closely.
          // https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_B16005H&prodType=table
          RandomCollection<String> whiteLanguageUsage = new RandomCollection();
          whiteLanguageUsage.add(0.002, "italian");
          whiteLanguageUsage.add(0.004, "french");
          whiteLanguageUsage.add(0.003, "german");
          whiteLanguageUsage.add(0.001, "polish");
          whiteLanguageUsage.add(0.002, "portuguese");
          whiteLanguageUsage.add(0.003, "russian");
          whiteLanguageUsage.add(0.001, "greek");
          whiteLanguageUsage.add(0.984, "english");
          return whiteLanguageUsage.next(random);
        case "black":
          // Only 3% of people who report a race of black or African American alone speak English
          // less than very well.
          // https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_B16005B&prodType=table
          RandomCollection<String> blackLanguageUsage = new RandomCollection();
          blackLanguageUsage.add(0.004, "french");
          blackLanguageUsage.add(0.026, "spanish");
          blackLanguageUsage.add(0.97, "english");
          return blackLanguageUsage.next(random);
        case "asian":
          // 33% of people who report a race of Asian alone speak English less than very well
          // https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_B16005D&prodType=table
          // From the national language numbers:
          // - 2,896,766 Chinese speakers
          // - 449,475 Japanese speakers
          // - 1,117,343 Korean speakers
          // - 1,399,936 Vietnamese speakers
          // - 643,337 Hindi speakers
          // So, 44.5% of the selected Asian language speakers use Chinese, which accounts for 14.7%
          // of the overall population of people who report a race of Asian. This is repeated for
          // the rest of the languages.
          RandomCollection<String> asianLanguageUsage = new RandomCollection();
          asianLanguageUsage.add(0.147, "chinese");
          asianLanguageUsage.add(0.022, "japanese");
          asianLanguageUsage.add(0.056, "korean");
          asianLanguageUsage.add(0.07, "vietnamese");
          asianLanguageUsage.add(0.033, "hindi");
          asianLanguageUsage.add(0.67, "english");
          return asianLanguageUsage.next(random);
        case "native":
          // TODO: This is overly simplistic, 7% of people who report a race of American Indian and
          // Alaska Native speak English less than well.
          // https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_B16005C&prodType=table
          return "english";
        case "hawaiian":
          // https://files.hawaii.gov/dbedt/economic/data_reports/Non_English_Speaking_Population_in_Hawaii_April_2016.pdf
          RandomCollection<String> hawaiianLanguageUsage = new RandomCollection();
          hawaiianLanguageUsage.add(0.891, "english");
          hawaiianLanguageUsage.add(0.109, "hawaiian");
          return hawaiianLanguageUsage.next(random);
        case "other":
          // 36% of people who report a race of something else speak English less than well
          // https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_B16005F&prodType=table
          // There are 924,374 Arabic speakers estimated nationally. Since there are 14,270,613
          // people report some other race, we'll give people in this race category a 6.5% chance
          // of speaking Arabic.
          // TODO: Figure out what languages to assign to the missing 30%
          RandomCollection<String> otherLanguageUsage = new RandomCollection();
          otherLanguageUsage.add(0.065, "arabic");
          otherLanguageUsage.add(0.935, "english");
          return otherLanguageUsage.next(random);
        default:
          // Should never happen
          return "english";
      }
    }
  }
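
To make the effect of those weights concrete, the following self-contained sketch draws 10,000 languages from the "white" weights quoted above and, for comparison, from equal weights. `WeightedPickSketch` is a simplified stand-in written for this example; it is assumed to behave like a weighted random collection but is not Synthea's `RandomCollection`, and the helper names (`countOf`, `whiteWeights`, `equalWeights`) are hypothetical.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.Random;
import java.util.TreeMap;

public class WeightedPickSketch {
  // Maps the running (cumulative) weight to the item that ends at that weight.
  private final NavigableMap<Double, String> cumulative = new TreeMap<>();
  private double total = 0.0;

  /** Adds an item with the given positive weight; non-positive weights are ignored. */
  public WeightedPickSketch add(double weight, String item) {
    if (weight > 0) {
      total += weight;
      cumulative.put(total, item);
    }
    return this;
  }

  /** Picks an item with probability proportional to its weight. */
  public String next(Random random) {
    double roll = random.nextDouble() * total;
    Map.Entry<Double, String> hit = cumulative.higherEntry(roll);
    return (hit != null ? hit : cumulative.lastEntry()).getValue();
  }

  /** Counts how often the given item comes up in a number of seeded draws. */
  public static int countOf(WeightedPickSketch picker, String item, int draws, long seed) {
    Random random = new Random(seed);
    int count = 0;
    for (int i = 0; i < draws; i++) {
      if (item.equals(picker.next(random))) {
        count++;
      }
    }
    return count;
  }

  /** The weights quoted in the "white" case above. */
  public static WeightedPickSketch whiteWeights() {
    return new WeightedPickSketch()
        .add(0.002, "italian").add(0.004, "french").add(0.003, "german")
        .add(0.001, "polish").add(0.002, "portuguese").add(0.003, "russian")
        .add(0.001, "greek").add(0.984, "english");
  }

  /** Equal weights, for comparison (the choice of three languages is arbitrary). */
  public static WeightedPickSketch equalWeights() {
    return new WeightedPickSketch().add(1, "english").add(1, "french").add(1, "russian");
  }

  public static void main(String[] args) {
    System.out.println("census weights, english draws: "
        + countOf(whiteWeights(), "english", 10_000, 42L));
    System.out.println("equal weights, english draws:  "
        + countOf(equalWeights(), "english", 10_000, 42L));
  }
}
```

With the quoted weights, roughly 98% of draws come out as English, which matches the behaviour reported earlier in the thread; evening out the weights is what evens out the names.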

@citizenrich

Thanks @jawalonoski I think I understand. When using Other Areas is the class languageFromRaceAndEthnicity still used? If so, is there a way to override it when using international or other locations?

@jawalonoski
Member

> Thanks @jawalonoski I think I understand. When using Other Areas is the class languageFromRaceAndEthnicity still used? If so, is there a way to override it when using international or other locations?

Yes, it is still used. No, there is currently no way to override it except through code. As I said though, it really should be in a configuration file. We'd be happy to take that as a pull request if you or anyone else wants to make that contribution.
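
For anyone considering that contribution, here is one possible shape for it, purely as a sketch: read per-race language weights from properties-style entries instead of hard-coding them. The key format `language.<race>.<language>`, the class `LanguageWeightsSketch`, and its methods are all hypothetical and invented for this example; nothing like this exists in Synthea as far as this thread indicates.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

public class LanguageWeightsSketch {
  /** Parses properties-format text; the text would normally come from a config file. */
  public static Properties parse(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return props;
  }

  /** Collects every hypothetical "language.<race>.<language> = weight" entry for one race. */
  public static Map<String, Double> weightsFor(Properties props, String race) {
    String prefix = "language." + race + ".";
    Map<String, Double> weights = new TreeMap<>();
    for (String key : props.stringPropertyNames()) {
      if (key.startsWith(prefix)) {
        String language = key.substring(prefix.length());
        weights.put(language, Double.parseDouble(props.getProperty(key).trim()));
      }
    }
    return weights;
  }

  public static void main(String[] args) {
    // Example values only: an even split for the "other" race category.
    String text = String.join("\n",
        "language.other.arabic = 0.5",
        "language.other.english = 0.5",
        "language.white.english = 1.0");
    System.out.println(weightsFor(parse(text), "other"));
  }
}
```

The resulting map could then be fed into whatever weighted collection the language selection uses, so that international or custom locations can supply their own distribution without a code change.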

@vbchinnam
Author

The generated patients have numbers attached to their names. Is it possible to generate the patient names without those numbers?
(screenshot of generated patient names with numbers appended)

@jawalonoski
Member

Set the following property to false (it is shown here with its current value of true):

# If true, person names have numbers appended to them to make them more obviously fake
generate.append_numbers_to_person_names = true

That being said, I do not recommend that you do this, since the numbers are a good indicator that these people are fake.

@vbchinnam
Author

Got it.
Thank you for the suggestion.
Much appreciated.

@citizenrich

I hope a small follow-up is OK. For names, it looks like the project must be rebuilt after replacing names.yml. Is that also true for demographics/Other Areas, and is there some Gradle/Java trick to load the new files for testing without rebuilding?

@jawalonoski
Member

Gradle has a feature called compile avoidance, so it only rebuilds the things that have changed.

If you use the ./run_synthea command, it should just pick up the new files (YAML or other configuration settings) without rebuilding everything.
