## STAR Methods

### Key resources table

| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---------------------|--------|------------|
| **Software and algorithms** | | |
| Data and code Zenodo deposit | This paper | https://doi.org/10.5281/zenodo.5014756 |
| Source code on GitHub | This paper | https://github.com/greenelab/iscb-diversity |
| Browsable Python and R notebooks | This paper | https://greenelab.github.io/iscb-diversity/ |
| pubmedpy | This paper | https://github.com/dhimmel/pubmedpy |
| Manubot | Himmelstein et al., 2019 | https://doi.org/10.1371/journal.pcbi.1007128 |
| Name origin prediction (Wiki2019-LSTM) | This paper | https://github.com/greenelab/wiki-nationality-estimate |
| geotext | Yaser Martinez Palenzuela (yaser.martinez@gmail.com) | https://github.com/elyase/geotext |
| geopy (Nominatim) | https://github.com/geopy/geopy | https://geopy.readthedocs.io/en/stable/#nominatim |
| Genderize.io | https://genderize.io/ | https://genderize.io/ |
| **Other** | | |
| Manuscript repository | This paper | https://github.com/greenelab/iscb-diversity-manuscript |

### Resource Availability

#### Lead Contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Casey Greene (casey.s.greene@cuanschutz.edu).

#### Materials Availability

This study did not generate new materials.

#### Data and Code Availability

All data have been deposited at https://github.com/greenelab/iscb-diversity and are publicly available as of the date of publication. DOIs are listed in the key resources table. Our Wikipedia name dataset is dedicated to the public domain under the CC0 license at https://github.com/greenelab/wiki-nationality-estimate, with the source code to construct the dataset available under a BSD 3-Clause License.

All original code has been deposited at Zenodo and is publicly available as of the date of publication. DOIs are listed in the key resources table. Specifically, our analysis of authors and ISCB-associated honorees is available under CC BY 4.0 at https://github.com/greenelab/iscb-diversity, with source code also distributed under a BSD 3-Clause License [@doi:10.5281/zenodo.5014756]. Rendered Python and R notebooks from this repository are browsable at greenelab.github.io/iscb-diversity. Our analysis of PubMed, PubMed Central, and author names relies on the Python pubmedpy package, developed as part of this project and available under a Blue Oak Model License 1.0 at https://github.com/dhimmel/pubmedpy and on PyPI. No additional information is required to reanalyze the data reported in this paper.

This manuscript was written openly on GitHub at github.com/greenelab/iscb-diversity-manuscript using Manubot [@doi:10.1371/journal.pcbi.1007128]. The Manubot HTML version is available under a Creative Commons Attribution (CC BY 4.0) License at greenelab.github.io/iscb-diversity-manuscript.

### Method details

#### Honoree Curation

From ISCB's webpage listing ISCB Distinguished Fellows, we found recipients listed by their full names for the years 2009--2019. We gleaned the full name of the Fellow as well as the year in which they received the honor. We identified major ISCB-associated conferences as those designated flagship (ISMB) or those that had been held on many continents (RECOMB). To identify ISMB Keynote Speakers, we examined the webpage for each ISMB meeting. The invited speakers at ISMB before 2001 were listed in the Preface pages of each year's proceedings, which were archived in the ISMB collection of the AAAI digital library. We found full names of all keynote speakers for the years 1993--2019.

For the RECOMB meeting, we found conference webpages with keynote speakers for 1999, 2000, 2001, 2004, 2007, 2008, and 2010--2019. We were able to fill in the missing years using information from the RECOMB 2016 proceedings, which summarizes the first 20 years of the RECOMB conference [@doi:10.1007/978-3-319-31957-5]. This volume has two tables of keynote speakers from 1997--2006 (Table 14, page XXVII) and 2007--2016 (Table 4, page 8). Using these tables to verify the conference speaker lists, we arrived at two special instances of inclusion/exclusion. Although Jun Wang was not included in these tables, we were able to confirm that he was a keynote speaker in 2011 with the RECOMB 2011 proceedings [@doi:10.1007/978-3-642-20036-6], and thus we included this speaker in the dataset. Marian Walhout was invited as a keynote speaker but had to cancel the talk due to other obligations. Because her name was neither mentioned in the 2015 proceedings [@doi:10.1007/978-3-319-16706-0] nor in the above-mentioned tables, we excluded this speaker from our dataset.

#### Name processing

When extracting honoree names, we began with the full name as provided on the site. Because our prediction methods required separate first and last names, we took the first non-initial name as the first name and the final name as the last name. We did not consider a hyphen to be a name separator: for hyphenated names, all components were retained. For metadata from PubMed and PMC, where first (fore) and last names are coded separately, we applied the same cleaning steps. We implemented these name-simplification functions in the pubmedpy Python package to support standardized fore name and last name processing.
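
The following minimal sketch illustrates these splitting rules; `split_full_name` is a hypothetical helper written for illustration, not pubmedpy's actual API.

```python
# A minimal sketch of the name-splitting rules above (illustrative only;
# split_full_name is a hypothetical helper, not pubmedpy's actual API).
def split_full_name(full_name: str) -> tuple[str, str]:
    """Return (first, last): the first non-initial token and the final token."""
    tokens = full_name.split()

    def is_initial(token: str) -> bool:
        # treat short all-caps tokens like "J." or "WJ" as initials
        stripped = token.replace(".", "")
        return len(stripped) <= 2 and stripped.isupper()

    first = next((t for t in tokens[:-1] if not is_initial(t)), tokens[0])
    last = tokens[-1]  # hyphenated names are kept whole (no hyphen splitting)
    return first, last

assert split_full_name("J. Craig Venter") == ("Craig", "Venter")
assert split_full_name("Anne-Marie Smith") == ("Anne-Marie", "Smith")
```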

#### Last author extraction

We assumed that, in the list of authors for a specific paper, last authors (often research advisors) would be most likely to be invited for keynotes or honored as fellows. Therefore, we used PubMed to retrieve last author names to assess the composition of the field. PubMed is a search engine provided by the US National Library of Medicine that indexes scholarly articles. PubMed contains a record for every article published in the journals it indexes (30 million records total circa 2020), from which we extracted author first and last names and their order using the E-utilities APIs. To automate and generalize these tasks, we created the pubmedpy Python package.

From PubMed, we compiled a catalog of 176,773 journal articles published from 1993 through 2019 that were written in English and tagged with the MeSH term "computational biology", which is equivalent to "bioinformatics" and includes categories such as genomics and systems biology (via PubMed's term explosion to include subterms). Excluding 663 articles with no author information and years with fewer than 200 articles, we analyzed 176,110 articles from 1998--2019. We extracted the number of times each article had been cited by PubMed Central articles from the PmcRefCount field of the PubMed DocSum XML records.
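
As a rough illustration, the query below reproduces the article search against the public E-utilities API directly; it is a hedged sketch, not pubmedpy's internal implementation, and the exact count returned will drift as PubMed re-indexes records.

```python
# Hedged sketch of the PubMed search described above, issued directly against
# the E-utilities API rather than through pubmedpy's wrappers.
import requests

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pubmed",
    # MeSH term explosion automatically includes subterms such as genomics
    "term": '"computational biology"[MeSH Terms] AND english[Language] '
            "AND 1993:2019[Date - Publication]",
    "retmax": 0,          # we only need the total count here
    "retmode": "json",
}
response = requests.get(ESEARCH_URL, params=params, timeout=30)
count = int(response.json()["esearchresult"]["count"])
print(f"Matching PubMed records: {count}")
```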

#### Countries of Affiliations

Publications often provide affiliation lists for authors, which generally associate authors with research organizations and their corresponding physical addresses. We implemented affiliation extraction in the pubmedpy Python package for both PubMed and PMC XML records. These methods extract a sequence of textual affiliations for each author.

We relied on two Python utilities to extract countries from text: geotext and geopy.geocoders.Nominatim. The first, geotext, uses regular expressions to find mentions of places from the GeoNames database. To avoid mislabeling, we only mapped an affiliation to a country if geotext identified two or more mentions of that country. For example, in the affiliation string "Laboratory of Computational and Quantitative Biology, Paris, France", geotext detects two mentions of places in France ("Paris" and "France"), so we assigned France to this affiliation.
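
A minimal sketch of this rule follows, assuming the geotext package is installed; the two-mention threshold mirrors the rule described above.

```python
# Minimal sketch of the geotext-based country assignment (two-mention rule).
from geotext import GeoText

def countries_from_affiliation(affiliation: str, min_mentions: int = 2) -> list[str]:
    # country_mentions maps ISO country codes to their number of place mentions
    mentions = GeoText(affiliation).country_mentions
    return [country for country, n in mentions.items() if n >= min_mentions]

print(countries_from_affiliation(
    "Laboratory of Computational and Quantitative Biology, Paris, France"
))  # ['FR'] -- 'Paris' and 'France' both count as mentions of France
```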

This country extraction method accommodates multiple countries. Although ideally each affiliation record would refer to one and only one research organization, sometimes journals deposit multiple affiliations in a single structured affiliation. In these cases, we assigned multiple countries to the article. For more details on this approach, please consult the accompanying 07.affiliations-to-countries notebook and label dataset.

When geotext did not return results, we used the geopy approach, which returns a single country for an affiliation when successful. Its geocoders.Nominatim class converts names and addresses to geographic coordinates using OpenStreetMap's Nominatim service. With this method, we split a textual affiliation by punctuation into a list of strings and iterated backward through this list until we found a Nominatim search result. For the above affiliation, the search order would be "France", "Paris", and "Laboratory of Computational and Quantitative Biology". Because Nominatim returns a match for the first term, "France", the search stops before reaching "Paris", and France is assigned to this affiliation.
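
A sketch of this fallback is below, following geopy's documented Nominatim interface; reading the country out of the raw address fields is an assumption about the service's response format rather than the project's exact code.

```python
# Hedged sketch of the Nominatim fallback: split on punctuation and search
# backward through the affiliation until a match is found.
import re
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="affiliation-country-sketch")

def country_via_nominatim(affiliation: str):
    parts = [p.strip() for p in re.split(r"[,;]", affiliation) if p.strip()]
    for part in reversed(parts):  # e.g. 'France', then 'Paris', then the lab name
        location = geolocator.geocode(part, addressdetails=True)
        if location is not None:
            return location.raw.get("address", {}).get("country")
    return None
```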

Our ability to assign countries to authors was largely driven by the availability of affiliations. The country-assignment rate for last authors from PubMed records was approximately 47%, reflecting the varying availability of affiliation metadata by journal.

For ISCB honorees, if an honoree was listed with their affiliation at the time, we recorded this affiliation for analysis during the curation process. For ISCB Fellows, we used the affiliation listed on the ISCB page. Because we could not find affiliations for the 1997 and 1998 RECOMB keynote speakers, these were left blank. If an author or speaker had more than one affiliation, each was inversely weighted by the number of affiliations that individual had.
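
For concreteness, the inverse weighting amounts to the following one-liner (the affiliation values are illustrative):

```python
# An individual with k affiliations contributes weight 1/k per country.
affiliations = ["France", "Germany"]  # illustrative example
weights = {country: 1 / len(affiliations) for country in affiliations}
print(weights)  # {'France': 0.5, 'Germany': 0.5}
```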

#### Estimation of Gender

We predicted the gender of honorees and authors using the https://genderize.io API, which was trained on over 100 million name-gender pairings collected from the web and is one of three widely used gender inference services [@doi:10.7717/peerj-cs.156]. We used author and honoree first names to retrieve predictions from genderize.io. The predictions represent the probability of an honoree or author being male or female; we used these estimated probabilities directly and did not convert them to hard group assignments. For example, a query to https://genderize.io on January 26, 2020 for "Casey" returned a probability of male of 0.74 and a probability of female of 0.26, which we would record for an author with this first name. Because of technical limitations, our analysis only considered two binary gender categories, and we used "male" and "female" to refer to the gender of the scientists. We recognize the limitation of not accounting for non-binary gender categories, and we only considered predictions in aggregate, never as individual values for specific scientists.
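
A minimal sketch of such a lookup against the public genderize.io API follows; the response fields are those documented by the service, and the returned probabilities drift over time as the service's data grows.

```python
# Minimal sketch of a genderize.io lookup for a single fore name.
import requests

response = requests.get("https://api.genderize.io",
                        params={"name": "Casey"}, timeout=10)
result = response.json()
# e.g. {'name': 'Casey', 'gender': 'male', 'probability': 0.74, 'count': ...}
p_male = result["probability"] if result["gender"] == "male" else 1 - result["probability"]
print(f"P(male) = {p_male:.2f}, P(female) = {1 - p_male:.2f}")
```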

Of 412 ISCB honorees, genderize.io failed to provide a gender prediction for one name. Of 176,110 last authors, 1,014 were missing a fore name in the raw paper metadata and 11,498 had fore names consisting only of initials. Specifically, the metadata for most papers before 2002 (2,566 of 2,601 papers) contain only initials for first and/or middle author names. Without gender predictions for these names, we considered only articles from 2002 onward when comparing gender compositions between two groups. Of the remaining authors, genderize.io failed to predict gender for 10,003 fore names. We note that approximately 42% of these missing predictions were for hyphenated names, likely because such names are rarer and thus predictions for them are harder to find.

This bias of missing predictions toward non-English names has been previously observed [@doi:10.32614/RJ-2016-002] and may have a minor influence on the final estimates of gender composition.

#### Estimation of Name Origin Groups

We developed a model to predict the geographical origins of names. The existing Python package ethnicolr [@arxiv:1805.02109] produces reasonable predictions, but the international representation of its data, curated from Wikipedia in 2009 [@doi:10.1145/1557019.1557032], is limited. For instance, 76% of the names in ethnicolr's Wikipedia dataset are European in origin.

To address these limitations, we built a similar classifier, a Long Short-Term Memory (LSTM) neural network, to infer the region of origin from patterns in the sequences of letters in full names, and trained it on an updated dataset, approximately 4.5 times larger, called Wiki2019 (described below). We tested multiple character sequence lengths and, based on this comparison, selected tri-characters for the primary results described in this work. We trained our prediction model on 80% of the Wiki2019 dataset and evaluated its performance on the remaining 20%. This model, which we term Wiki2019-LSTM, is available in the online file LSTM.h5.
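
The sketch below shows the general shape of such a classifier in Keras; the layer sizes, vocabulary handling, and embedding dimension are illustrative assumptions, not the exact architecture stored in LSTM.h5.

```python
# Hedged sketch of a character-trigram LSTM name origin classifier.
from tensorflow.keras import layers, models

VOCAB_SIZE = 5000   # assumed number of distinct character trigrams
MAX_LEN = 25        # assumed maximum trigrams per (padded) name
N_GROUPS = 10       # the ten name origin groups

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 64),              # embed each trigram id
    layers.LSTM(128),                              # read the trigram sequence
    layers.Dense(N_GROUPS, activation="softmax"),  # per-group probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```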

To generate a training dataset for name origin prediction that reflects the modern naming landscape, we scraped the English Wikipedia category of Living People. This category, which contained approximately 930,000 pages at the time of processing in November 2019, is regularly curated and allowed us to avoid pages about non-persons. For each Wikipedia page, we used two strategies to find a full birth name and a location context for that person. First, we looked for a nationality mention in the first sentence of the body text. Most English-language biographical Wikipedia pages begin with a sentence like "John Edward Smith (born 1 January 1970) is an American novelist known for ...". This structure comes from editor guidance on biography articles and is designed to capture:

> ... the country of which the person is a citizen, national or permanent resident, or if the person is notable mainly for past events, the country where the person was a citizen, national or permanent resident when the person became notable.

Second, if this information was not available in the first sentence of the main text, we used information from the personal details sidebar; the information in this sidebar varied widely but often contained a full name and a place of birth.

We used regular expressions to parse the person's name out of this structure and checked that the expression after "is a" matched a list of nationalities. Using the union of these strategies, we were able to define a name and nationality for 708,493 people. This process produced country labels that were more fine-grained than the broader patterns we sought to examine among honorees and authors. We initially grouped names by continent but later modeled our categorization after the hierarchical taxonomy used by NamePrism [@doi:10.1145/3132847.3133008]. The NamePrism taxonomy was derived from name-country pairs by embedding names according to their Twitter contact patterns and then grouping countries by the similarity of names from those countries. The countries associated with each grouping are shown in supplementary figure S1. NamePrism excluded the US, Canada, and Australia because these countries have been populated by a mix of immigrant groups [@doi:10.1145/3132847.3133008].
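
The sketch below illustrates the first-sentence strategy; the regular expression and the nationality list are simplified stand-ins for the project's actual patterns.

```python
# Hedged sketch of parsing a name and nationality from a biography's
# first sentence (simplified regex and truncated nationality list).
import re

NATIONALITIES = {"American", "British", "French", "Vietnamese"}  # truncated

pattern = re.compile(
    r"^(?P<name>[^(]+?)\s*\((?:born )?[^)]*\)\s+is an?\s+(?P<nationality>\w+)"
)

sentence = ("John Edward Smith (born 1 January 1970) is an American "
            "novelist known for ...")
match = pattern.search(sentence)
if match and match.group("nationality") in NATIONALITIES:
    print(match.group("name").strip(), "->", match.group("nationality"))
    # John Edward Smith -> American
```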

In an earlier version of this manuscript, we used category names derived from NamePrism, but a reader pointed out that the titles of the groupings were problematic; in this version, we therefore renamed the groupings to reflect that the NamePrism approach primarily identifies groups based on linguistic patterns from name etymology rather than religious or racial similarities. We note that our mapping from nationality to name origin was not without error; for example, a scientist of Israeli nationality may not bear a Hebrew name. These mismatches were assessed via the heatmap of the model performance (supplementary figure S2) and complemented by the affiliation analysis below. An alternative approach would be to assign arbitrary names to these groups, such as letter codes (e.g., A, B, C), but we did not choose this strategy because ten arbitrary letters for ten groups would greatly reduce the paper's readability.

#### Predicting Name Origin Groups with LSTM Neural Networks and Wikipedia

Table @tbl:example_names shows the size of the training set for each of the name origin groups as well as a few examples of PubMed author names that had at least 90% prediction probability in that group. We refer to this dataset as Wiki2019 (available online in annotated_names.tsv).

| Group | Training Size | Example Names |
|-------|---------------|---------------|
| Celtic/English names | 154,890 | Adam O Hebb, Oliver G Pybus, David W Ritchie, James WJ Anderson, James W MacDonald, Robert Clarke |
| European names | 78,157 | Tracey M Filzen, Jos H Beijnen, Caroline Louis-Jeune, Christian Lorenzi, Boris Vassilev, Verena Heinrich |
| Hispanic names | 66,931 | Ramón Latorre, Antonio J Jimeno-Yepes, Felipe A Simão, Paulo S L de Oliveira, Juan Carlos Rodríguez-Manzaneque, Natalia Acevedo-Luna |
| East Asian names | 54,588 | Heejoon Chae, Wenchao Jiang, Haizhou Liu, Miho Uchida, Wenxuan Zhang, Jiali Feng |
| Arabic/Turkic/Persian names | 31,418 | Hamidreza Chitsaz, Farzad Sangi, Habib Motieghader, Berke Ç Toptas, Ali Aliyari, Bülent Arman Aksoy |
| Nordic names | 28,978 | Cecilia M Lindgren, Ellen Larsen, Jesper R Gådin, Janne H Korhonen, Johan Åqvist, Jens Nilsson |
| South Asian names | 20,025 | Amitabh Chak, Matthew G Seetin, Matrika Gupta, Sumudu P Leelananda, VS Kumar Kolli, Swanand Gore |
| African names | 17,826 | Timothy Kinyanjui, Jammbe Z Musoro, Nyaradzo M Mgodi, Magambo Phillip Kimuda, Probhonjon Baruah, Adaoha E C Ihekwaba |
| Hebrew names | 4,549 | Alexander J Sadovsky, Boris Shraiman, Gil Goldshlager, Eytan Adar, Aviva Peleg, Nir Esterman |
| Greek names | 4,138 | Gianni Panagiotou, Themis Lazaridis, Eleni Mijalis, Nikolaos Tsiantis, Konstantinos A Kyritsis, Dimitris E Messinis |

Table: Predicting name origin groups of names trained on Wikipedia's living people. The table lists the ten groups and the number of living people for each group that the LSTM was trained on. The Example Names column shows actual author names that received a high prediction for each group. Full information about which countries comprise each group can be found in the online dataset country_to_region.tsv. {#tbl:example_names}

We next aimed to predict the name origin groups of honorees and authors. We constructed a training dataset of more than 700,000 name-nationality pairs by parsing the English-language Wikipedia. We trained an LSTM neural network on n-grams to predict name groups. We found similar performance across 1-, 2-, and 3-grams; however, because the classifier required fewer epochs to train with 3-grams, we used this length in the model that we term Wiki2019-LSTM. For each given name, Wiki2019-LSTM returns a probability of that name originating from each of the ten specified groups. We observed a multiclass area under the receiver operating characteristic curve (AUC) of 95.9% for the classifier, indicating that it can recapitulate name origins with high sensitivity and specificity. The high AUC for each individual group (above 92%) suggests that our classifier was sufficient for a broad-scale examination of disparities. We also observed that the model was well calibrated. Finally, we examined potential systematic errors between pairs of name origin groupings with a confusion heatmap and did not find off-diagonal enrichment for any pairing (supplementary figure S2).
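
For illustration, trigram extraction from a name can be as simple as the following sketch (our preprocessing details may differ):

```python
# Tiny illustration of character trigram extraction from a full name.
def char_trigrams(name: str) -> list[str]:
    s = name.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]

print(char_trigrams("Jun Wang"))
# ['jun', 'un ', 'n w', ' wa', 'wan', 'ang']
```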

Applying Wiki2019-LSTM to the author and honoree datasets, we obtained name origin estimates for all honorees' and authors' names except the 12,512 that did not have fore names (see the breakdown in the Estimation of Gender section above). Once again, because the large majority of author fore names before 2002 were recorded as initials only, predictions were not possible for those names, and we excluded 1998--2001 when comparing name origin compositions between two groups.

#### Affiliation Analysis

For each country, we computed the expected number of honorees by multiplying the proportion of authors whose affiliations were in that country by the total number of honorees. We then performed an enrichment analysis to examine the difference in country affiliation proportions between ISCB honorees and field-specific last authors, calculating each country's enrichment by dividing the observed proportion of honorees by the expected proportion. The 95% confidence interval of the log2 enrichment was estimated using the Poisson model method [@isbn:978-0849394447, pages 17-18].
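
The per-country computation reduces to the sketch below, with illustrative counts; the interval uses the standard normal approximation to the Poisson variance on the log scale, which is our reading of the cited method rather than a transcription of it.

```python
# Hedged sketch of the per-country enrichment and its approximate 95% CI.
import math

observed_honorees = 18      # honorees affiliated with the country (illustrative)
author_proportion = 0.02    # proportion of last authors in that country (illustrative)
total_honorees = 412

expected_honorees = author_proportion * total_honorees
log2_enrichment = math.log2(observed_honorees / expected_honorees)
# Poisson variance of log(X) is approximately 1/X; convert to the log2 scale
se_log2 = 1 / (math.sqrt(observed_honorees) * math.log(2))
ci_low = log2_enrichment - 1.96 * se_log2
ci_high = log2_enrichment + 1.96 * se_log2
print(f"log2 enrichment = {log2_enrichment:.2f} [{ci_low:.2f}, {ci_high:.2f}]")
```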

#### Statistical Analysis

We estimated the levels of representation by performing the following logistic regression of the group of scientists on each name's prediction probability while controlling for year: $$g = \beta_0 + \beta_1 prob + \beta_2 y + \epsilon.$$ The variable $prob$ is the prediction probability of a demographic variable (gender or name origin) for the names of scientists in group $g$ (honoree or author) during year $y$, and $\epsilon \sim N(0, \sigma^2)$ accounts for random variation. An effect was deemed statistically significant if its corresponding p-value was below $\alpha = 0.01$.
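
A sketch of this regression using statsmodels is shown below on synthetic data; the column names and generated values are illustrative, not our analysis data.

```python
# Hedged sketch of the logistic regression on synthetic, illustrative data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "prob": rng.uniform(0, 1, n),         # classifier probability for one category
    "year": rng.integers(2002, 2020, n),  # year of authorship or honor
})
# synthetic group indicator loosely tied to prob
df["is_honoree"] = rng.binomial(1, 0.2 + 0.3 * df["prob"])

model = smf.logit("is_honoree ~ prob + year", data=df).fit(disp=0)
print(model.params)          # beta_0, beta_1 (prob), beta_2 (year)
print(model.pvalues < 0.01)  # significance at alpha = 0.01
```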

We emphasize that the units in this analysis are honors and authorships: each row of the input data frame represents either an honor or an authorship, paired with the probability value ($prob$) from the gender or name origin classifier for the scientist's name. We also performed an alternative analysis in which each fore name-last name pair served as the unit of measurement, with citations totaled across the papers whose last author had that name. Controlling for each name's citations, this method tested the demographic effects but did not account for names honored more than once and may be affected by name collisions.

To reiterate, we only considered the prediction probabilities in aggregate and never as individual values for specific scientists. Moreover, although the average of the probabilities is not exactly a proportion, we use the phrase "estimated proportion" for readability. For example, the average of the probabilities of authors having an East Asian name origin is our estimate of the proportion of authors with East Asian names.

#### Iterative Research Process

In our first version of the analysis pipeline, we sought to characterize the distribution of authorships in the field using field-specific journals, which resulted in an analysis set of 29,755 authorships. We also examined corresponding authors, as we considered that senior authors may occasionally occupy a different position [@doi:10.1177/0306312716650046], and fell back to last authors only in cases where corresponding author annotations were unavailable. Parallel analyses for the other versions are available in supplementary figure S3.

In the next version of the analysis, we extended the analysis to all 176,110 computational biology PubMed articles, substantially increasing the sample size, and extracted the names of last authors rather than other author positions to better capture the honoree population. Our assumption was that, among the authors of a specific paper, the last author (often a research advisor) would be most likely to be invited for a keynote or honored as a fellow. Moreover, because the availability of information on corresponding authors was limited, extracting the last author was the more consistent approach.

In version 3, instead of weighting all articles equally as in the earlier versions, we weighted articles by citation count to control for the differential impact of research contributions. Using citation counts has key limitations: female scientists across disciplines are cited less often than their male counterparts [@arxiv:2001.01002; @doi:10.1002/ece3.4993; @doi:10.1038/s41550-017-0141]. Furthermore, the act of being honored, particularly with a keynote at an international meeting, could lead to work being more recognized and cited, which would reverse the arrow of causality. In version 4, we therefore returned to equal weights for all articles.

Finally, in all versions of the analysis, rather than applying a hard assignment for each prediction, we analyzed the raw prediction probabilities to capture the uncertainty of the prediction model. Although we expect our estimates of disparities from citation-weighted analyses to be conservative, the overall findings remained consistent across analyses. Examining the literature with and without citation weighting, we learned that disparities exist and that these disparities are large enough to overcome existing disparities in citation patterns.