how to add new names #908

vbchinnam · 2021-06-25T09:10:53Z

Hello,
I am trying to add new names. I have 1000 names and surnames that need to be added, and I need to generate 10k patients.
How do I provide the names to randomize?
Do I need to change the language and ethnicity also? If it is mandatory, please let me know how to change it or add a new language and ethnicity.

Appreciate your response and time

Thank you

jawalonoski · 2021-06-25T12:25:48Z

Replace or edit the src/main/resources/names.yml file -- it is that straightforward.

Just replace the names under spanish and english if you want to override the names without regard for language.

If you want to add names for a new language (i.e. not Spanish or English) without overwriting the current names, just add a language section (e.g. french or chinese or whatever language you want to add). You then need to edit the src/main/java/org/mitre/synthea/world/concepts/Names.java file, since it does not pick up new languages automatically.
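
To illustrate the kind of lookup that Names.java needs for a new language, here is a minimal, self-contained sketch. It is not the actual Synthea code: the `NameLookupSketch` class, the `NAMES` map, the `choicesFor` helper, and the sample surnames are all hypothetical stand-ins, and the only thing taken from the thread is the key layout ("language.family", "language." + gender). It assumes Java 9 or later for `List.of`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class NameLookupSketch {
  // Hypothetical stand-in for the parsed names.yml content,
  // keyed the same way as in the thread, e.g. "french.family".
  static final Map<String, List<String>> NAMES = new HashMap<>();

  static {
    // Placeholder data for illustration only.
    NAMES.put("english.family", List.of("Smith", "Jones"));
    NAMES.put("french.family", List.of("Martin", "Bernard"));
  }

  /**
   * Returns the list stored under language + "." + key.
   * Falls back to the English list when the language is null,
   * unknown, or has no entries for that key.
   */
  static List<String> choicesFor(String language, String key) {
    String lang = (language == null) ? "english" : language.toLowerCase(Locale.ROOT);
    List<String> choices = NAMES.get(lang + "." + key);
    if (choices == null || choices.isEmpty()) {
      choices = NAMES.get("english." + key);
    }
    return choices;
  }

  public static void main(String[] args) {
    // A language with its own section uses that section.
    System.out.println(choicesFor("French", "family"));
    // A language without a section falls back to English.
    System.out.println(choicesFor("german", "family"));
  }
}
```

With a lookup of this shape, adding a language would only require a new section in the YAML file, rather than one more branch per language in the Java code.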

citizenrich · 2021-06-25T16:57:34Z

Hi. I've tried out adding new names this morning and modified the names.yml and Names.java files. The generation outputs almost entirely English names, despite there being the same number of names in all languages (elements from the periodic table translated into all UN languages). Is there a tweak I need to make to evenly pick out names from each language?

Edits to Names.java in case I'm making a mistake (likely)...
pastebin of names.yml: https://pastebin.com/K18R3rG2

  public static String fakeFirstName(String gender, String language, Person person) {
    List<String> choices;
    // Pick the name list for the person's language; default to English.
    if ("spanish".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("spanish." + gender);
    } else if ("french".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("french." + gender);
    } else if ("arabic".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("arabic." + gender);
    } else if ("chinese".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("chinese." + gender);
    } else if ("russian".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("russian." + gender);
    } else {
      choices = (List<String>) names.get("english." + gender);
    }
    // ... (remainder of the method omitted)
  }

  public static String fakeLastName(String language, Person person) {
    List<String> choices;
    // Same selection as above, using the "family" list of each language.
    if ("spanish".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("spanish.family");
    } else if ("french".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("french.family");
    } else if ("arabic".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("arabic.family");
    } else if ("chinese".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("chinese.family");
    } else if ("russian".equalsIgnoreCase(language)) {
      choices = (List<String>) names.get("russian.family");
    } else {
      choices = (List<String>) names.get("english.family");
    }
    // ... (remainder of the method omitted)
  }

jawalonoski · 2021-06-25T17:44:18Z

If you want all names to have equal probability, the easiest solution is to just shove all the names under english.

What you have done is fine. However, the names do not have equal probability because patients whose primary language is not English are a minority of the generated population. If you want to change that, you need to edit the primary language code (honestly, this should probably be in a configuration file somewhere):

synthea/src/main/java/org/mitre/synthea/world/geography/Demographics.java

Lines 130 to 231 in 6ed19ab

  /**
   * Selects a language based on race and ethnicity.
   * For those of Hispanic ethnicity, language statistics are pulled from the national distribution
   * of spoken languages. For non-Hispanic, national distributions by race are used.
   * @param race US Census race
   * @param ethnicity "hispanic" or "nonhispanic"
   * @param random random to use
   * @return the language spoken
   */
  public String languageFromRaceAndEthnicity(String race, String ethnicity, Random random) {
    if (ethnicity.equals("hispanic")) {
      RandomCollection<String> hispanicLanguageUsage = new RandomCollection<>();
      // https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_B16006&prodType=table
      // Of the estimated 51,375,831 people with Hispanic ethnicity in the US:
      // - 13,957,749 speak only English (27.1%)
      // - 27,902,879 speak Spanish and English very well or well (54.3%)
      // - 9,278,993 speak Spanish and English not well or not at all (18%)
      // - 0.4% speak another language, which we will ignore to simplify things
      // 48.85% will speak English (only English + half of bilingual) the rest will speak Spanish
      hispanicLanguageUsage.add(48.85, "english");
      hispanicLanguageUsage.add(51.15, "spanish");
      return hispanicLanguageUsage.next(random);
    } else {
      switch (race) {
        // For the people who are of nonhispanic ethnicity, use the national distribution of
        // languages spoken:
        // http://www2.census.gov/library/data/tables/2008/demo/language-use/2009-2013-acs-lang-tables-nation.xls?#
        //
        // While the census does not provide a breakdown of language usage by
        // race, previously Synthea would associate languages to race through ethnicity. This
        // code "flattens" out that older relationship.
        case "white":
          // Only 1.5% of people who report a race of white alone speak English less than very well.
          // Given the previous categorization of languages by Synthea, the numbers line up closely.
          // https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_B16005H&prodType=table
          RandomCollection<String> whiteLanguageUsage = new RandomCollection();
          whiteLanguageUsage.add(0.002, "italian");
          whiteLanguageUsage.add(0.004, "french");
          whiteLanguageUsage.add(0.003, "german");
          whiteLanguageUsage.add(0.001, "polish");
          whiteLanguageUsage.add(0.002, "portuguese");
          whiteLanguageUsage.add(0.003, "russian");
          whiteLanguageUsage.add(0.001, "greek");
          whiteLanguageUsage.add(0.984, "english");
          return whiteLanguageUsage.next(random);
        case "black":
          // Only 3% of people who report a race of black or African American alone speak English
          // less than very well.
          // https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_B16005B&prodType=table
          RandomCollection<String> blackLanguageUsage = new RandomCollection();
          blackLanguageUsage.add(0.004, "french");
          blackLanguageUsage.add(0.026, "spanish");
          blackLanguageUsage.add(0.97, "english");
          return blackLanguageUsage.next(random);
        case "asian":
          // 33% of people who report a race of Asian alone speak English less than very well
          // https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_B16005D&prodType=table
          // From the national language numbers:
          // - 2,896,766 Chinese speakers
          // - 449,475 Japanese speakers
          // - 1,117,343 Korean speakers
          // - 1,399,936 Vietnamese speakers
          // - 643,337 Hindi speakers
          // So, 44.5% of the selected Asian language speakers use Chinese, which accounts for 14.7%
          // of the overall population of people who report a race of Asian. This is repeated for
          // the rest of the languages.
          RandomCollection<String> asianLanguageUsage = new RandomCollection();
          asianLanguageUsage.add(0.147, "chinese");
          asianLanguageUsage.add(0.022, "japanese");
          asianLanguageUsage.add(0.056, "korean");
          asianLanguageUsage.add(0.07, "vietnamese");
          asianLanguageUsage.add(0.033, "hindi");
          asianLanguageUsage.add(0.67, "english");
          return asianLanguageUsage.next(random);
        case "native":
          // TODO: This is overly simplistic, 7% of people who report a race of American Indian and
          // Alaska Native speak English less than well.
          // https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_B16005C&prodType=table
          return "english";
        case "hawaiian":
          // https://files.hawaii.gov/dbedt/economic/data_reports/Non_English_Speaking_Population_in_Hawaii_April_2016.pdf
          RandomCollection<String> hawaiianLanguageUsage = new RandomCollection();
          hawaiianLanguageUsage.add(0.891, "english");
          hawaiianLanguageUsage.add(0.109, "hawaiian");
          return hawaiianLanguageUsage.next(random);
        case "other":
          // 36% of people who report a race of something else speak English less than well
          // https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_B16005F&prodType=table
          // There are 924,374 Arabic speakers estimated nationally. Since there are 14,270,613
          // people report some other race, we'll give people in this race category a 6.5% chance
          // of speaking Arabic.
          // TODO: Figure out what languages to assign to the missing 30%
          RandomCollection<String> otherLanguageUsage = new RandomCollection();
          otherLanguageUsage.add(0.065, "arabic");
          otherLanguageUsage.add(0.935, "english");
          return otherLanguageUsage.next(random);
        default:
          // Should never happen
          return "english";
      }
    }
  }
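
To make the effect of those weights concrete, here is a minimal, self-contained sketch of weighted selection next to a uniform alternative. It is an illustration only: `LanguagePickSketch`, `weightedPick`, and `uniformPick` are hypothetical names, and the picker is a simplified stand-in for Synthea's `RandomCollection`, not its actual implementation. The weights in `main` are copied from the "white" case above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Random;
import java.util.TreeMap;

public class LanguagePickSketch {
  /** Picks a key with probability proportional to its (positive) weight. */
  static String weightedPick(Map<String, Double> weights, Random random) {
    NavigableMap<Double, String> cumulative = new TreeMap<>();
    double total = 0.0;
    for (Map.Entry<String, Double> e : weights.entrySet()) {
      if (e.getValue() <= 0) {
        continue; // zero or negative weights can never be chosen
      }
      total += e.getValue();
      cumulative.put(total, e.getKey());
    }
    if (cumulative.isEmpty()) {
      throw new IllegalArgumentException("no positive weights");
    }
    // Find the first bucket whose upper bound exceeds the draw.
    Map.Entry<Double, String> hit = cumulative.higherEntry(random.nextDouble() * total);
    return (hit != null ? hit : cumulative.lastEntry()).getValue();
  }

  /** Uniform alternative: every listed language is equally likely. */
  static String uniformPick(List<String> languages, Random random) {
    return languages.get(random.nextInt(languages.size()));
  }

  public static void main(String[] args) {
    Map<String, Double> white = new LinkedHashMap<>();
    white.put("italian", 0.002);
    white.put("french", 0.004);
    white.put("german", 0.003);
    white.put("polish", 0.001);
    white.put("portuguese", 0.002);
    white.put("russian", 0.003);
    white.put("greek", 0.001);
    white.put("english", 0.984);

    Random random = new Random(42L);
    Map<String, Integer> counts = new TreeMap<>();
    for (int i = 0; i < 10_000; i++) {
      counts.merge(weightedPick(white, random), 1, Integer::sum);
    }
    // English dominates because it carries 98.4% of the weight.
    System.out.println("weighted: " + counts);

    counts.clear();
    List<String> languages = List.of("english", "french", "arabic", "chinese", "russian", "spanish");
    for (int i = 0; i < 10_000; i++) {
      counts.merge(uniformPick(languages, random), 1, Integer::sum);
    }
    System.out.println("uniform:  " + counts);
  }
}
```

Giving every language the same weight, or selecting uniformly as in `uniformPick`, is one way the name lists could be sampled evenly; where such a change would best live in Synthea itself is left open here.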

citizenrich · 2021-06-25T18:44:06Z

Thanks @jawalonoski, I think I understand. When using Other Areas, is the languageFromRaceAndEthnicity method still used? If so, is there a way to override it when using international or other locations?

jawalonoski · 2021-06-25T18:51:13Z

Yes, it is still used. No, there is currently no way to override that except through code. As I said though, it really should be in a configuration file. We'd be happy to take that as a pull request if you or anyone else wants to make that contribution.

vbchinnam · 2021-06-28T18:17:17Z

The generated patients have numbers attached to their names. Is it possible to generate the patient names without those numbers?

jawalonoski · 2021-06-28T18:28:02Z

Edit the following property to false:

# If true, person names have numbers appended to them to make them more obviously fake
generate.append_numbers_to_person_names = true

That being said, I do not recommend that you do this, since the numbers are a good indicator that these people are fake.

vbchinnam · 2021-06-28T18:36:12Z

Got it.
Thank you for the suggestion.
Much appreciated.

citizenrich · 2021-06-29T20:48:38Z

I hope a small follow-up is ok. For names, it looks like the project must be rebuilt after replacing names.yml. Is that also true for demographics/Other Areas, and is there some gradle/Java trick to not rebuild but still load the new files for testing?

jawalonoski · 2021-06-29T20:58:38Z

Gradle has a feature called compile avoidance, so it only rebuilds the things that have changed.

If you use the ./run_synthea command, it should just pick up the new files (YAML or other configuration settings) without rebuilding everything.
