oa_fetch () missing additional author's institutions? #270

Closed
yhan818 opened this issue Aug 28, 2024 · 8 comments · Fixed by #294

Comments

@yhan818
Contributor

yhan818 commented Aug 28, 2024

I am conducting institutional-level citation analysis.

In some cases an author has multiple affiliations. A parent organization may have multiple child organizations. For example, the University of Arizona (ROR: https://ror.org/03m2x1q45) has multiple units, including the Lunar and Planetary Institute (https://ror.org/01r4eh644).

For certain works, an author has multiple institutions/affiliations associated with the work's metadata in OpenAlex.

  1. If I fetch the work's data using openalexR:
    oa_fetch_test1 <- oa_fetch(entity = "works", id = "https://openalex.org/W4401226694")
    view(oa_fetch_test1[[4]][[1]])

The second author has only " 2 https://openalex.org/I58286723 Lunar and Planetary Institute https://ror.org/01r4eh644 ".

  2. If I go back to OpenAlex's API, https://api.openalex.org/works/W4401226694,
    the same author has both "Lunar and Planetary Institute" and "University of Arizona".

So is oa_fetch() for "works" missing the additional institutions from OpenAlex's API data?

Screenshot from 2024-08-28 15-48-21

@yjunechoe
Collaborator

yjunechoe commented Aug 29, 2024

If the author has multiple institutions, we track only the first in $institution_id but still track all in a flat (comma-separated string) structure in $institution_lineage:

oa_fetch_test1$author[[1]][2,]$institution_id
#> [1] "https://openalex.org/I58286723"

oa_fetch_test1$author[[1]][2,]$institution_lineage
#> [1] "https://openalex.org/I1329765538, https://openalex.org/I58286723"

So fetching those 2 institution IDs from $institution_lineage gets back what you observed:

oa_fetch_test1$author[[1]][2,]$institution_lineage |> 
  strsplit(", ") |> 
  el(1) |> 
  oa_fetch(entity = "institutions") |> 
  subset(, c("id", "display_name"))
#> # A tibble: 2 × 2
#>   id                               display_name                           
#>   <chr>                            <chr>                                  
#> 1 https://openalex.org/I1329765538 Universities Space Research Association
#> 2 https://openalex.org/I58286723   Lunar and Planetary Institute

Ref: #155


Actually sorry that's not quite right. I still don't see "University of Arizona". I'm not sure whether the data structure allowed multiple institutions back when we first implemented this - @trangdata do you recall?

The structure for this "Malhotra" author is:

#> 'data.frame':    1 obs. of  12 variables:
#>  $ au_id                   : chr "https://openalex.org/A5003933592"
#>  $ au_display_name         : chr "Renu Malhotra"
#>  $ au_orcid                : chr "https://orcid.org/0000-0002-1226-3305"
#>  $ author_position         : chr "middle"
#>  $ is_corresponding        : logi FALSE
#>  $ au_affiliation_raw      : chr "Lunar and Planetary Laboratory, The University of Arizona, USA"
#>  $ institution_id          : chr "https://openalex.org/I58286723"
#>  $ institution_display_name: chr "Lunar and Planetary Institute"
#>  $ institution_ror         : chr "https://ror.org/01r4eh644"
#>  $ institution_country_code: chr "US"
#>  $ institution_type        : chr "facility"
#>  $ institution_lineage     : chr "https://openalex.org/I1329765538, https://openalex.org/I58286723"

@yhan818
Contributor Author

yhan818 commented Aug 30, 2024

Thank you. It would be nice to have all the institutions available, given the number of cases like the one above. In my data, about 10% of works are affected.

There are multiple ways to implement this, such as a list() column or an additional field.
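The two options mentioned here can be sketched on a small hand-built table. This is only an illustration in base R, not openalexR's actual output: the column names `institution_ids` and `institution_ids_flat` and the placeholder author name are assumptions, while the institution IDs are the ones quoted in this thread.

```r
# Hypothetical author table; "Author A" is a placeholder name.
authors <- data.frame(
  au_display_name = c("Author A", "Renu Malhotra"),
  stringsAsFactors = FALSE
)

# Option 1: a list column holding one character vector of IDs per author.
authors$institution_ids <- list(
  "https://openalex.org/I185261750",
  c("https://openalex.org/I58286723", "https://openalex.org/I138006243")
)

# Option 2: an additional flat field, comma-separated like institution_lineage.
authors$institution_ids_flat <- vapply(
  authors$institution_ids, paste, character(1), collapse = ", "
)

# Number of institutions recorded for each author.
n_inst <- lengths(authors$institution_ids)
```

The list column keeps each ID separately addressable, while the flat field stays compatible with code that expects plain character columns.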

@trangdata
Collaborator

Thank you for this conversation @yhan818 and @yjunechoe. I think OpenAlex used to provide only one affiliation per author, and when they introduced more affiliations/institutions, we stuck with exporting only the first one for simplicity. But you're right, we could make these list columns.

openalexR/R/oa2df.R

Lines 222 to 236 in 774aff7

    if (length(inst_idx) > 0 && any(inst_idx)) {
      first_inst <- l_inst[inst_idx][[1]]
      first_inst$lineage <- paste(first_inst$lineage, collapse = ", ")
    } else {
      first_inst <- empty_inst
    }
    first_inst <- prepend(first_inst, "institution")
    aff_raw <- list(
      au_affiliation_raw =
        if (length(l$raw_affiliation_strings)) {
          l$raw_affiliation_strings[[1]]
        } else {
          NA_character_
        }
    )
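For contrast with the excerpt above, which keeps only the first institution, here is a minimal sketch of keeping every institution for one authorship. The helper `all_institutions()` is hypothetical and not part of openalexR. The sample input reuses the IDs and names from this thread, and the lineage given for the University of Arizona is an assumption.

```r
# Hypothetical helper: one row per institution instead of the first one only.
all_institutions <- function(l_inst) {
  if (length(l_inst) == 0) {
    # No institutions matched: return an empty table with the same columns.
    return(data.frame(
      id = character(0), display_name = character(0), lineage = character(0),
      stringsAsFactors = FALSE
    ))
  }
  rows <- lapply(l_inst, function(inst) {
    data.frame(
      id = inst$id,
      display_name = inst$display_name,
      # Flatten the lineage the same way the excerpt above does.
      lineage = paste(unlist(inst$lineage), collapse = ", "),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Sample input shaped like a parsed list of institutions for one author.
l_inst <- list(
  list(
    id = "https://openalex.org/I58286723",
    display_name = "Lunar and Planetary Institute",
    lineage = list("https://openalex.org/I1329765538", "https://openalex.org/I58286723")
  ),
  list(
    id = "https://openalex.org/I138006243",
    display_name = "University of Arizona",
    lineage = list("https://openalex.org/I138006243")  # assumed lineage
  )
)
insts <- all_institutions(l_inst)
```

The resulting data frame could then be stored per author in a list column, at the cost of building one small table per authorship.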

@trangdata
Collaborator

trangdata commented Sep 8, 2024

OK so currently, we have the following columns for author, where institution_* refers to the first institution reported by OpenAlex.

oa_fetch_test1 <- openalexR::oa_fetch(entity = "works", id = "https://openalex.org/W4401226694")
oa_fetch_test1$author[[1]] |> 
  dplyr::select(au_affiliation_raw, starts_with("institution"))
#>                                                                                                                                                                      au_affiliation_raw
#> 1                                                                                                                 Department of Astronomy & Astrophysics, University of Toronto, Canada
#> 2                                                                                                                        Lunar and Planetary Laboratory, The University of Arizona, USA
#> 3 Dept. of Physics and Astronomy, Northwestern University, 2145 Sheridan Rd., Evanston, IL 60208 and Center for Interdisciplinary Exploration and Research in Astrophysics (CIERA), USA
#>                    institution_id      institution_display_name
#> 1 https://openalex.org/I185261750         University of Toronto
#> 2  https://openalex.org/I58286723 Lunar and Planetary Institute
#> 3 https://openalex.org/I111979921       Northwestern University
#>             institution_ror institution_country_code institution_type
#> 1 https://ror.org/03dbr7087                       CA        education
#> 2 https://ror.org/01r4eh644                       US         facility
#> 3 https://ror.org/000e0be47                       US        education
#>                                                institution_lineage
#> 1                                  https://openalex.org/I185261750
#> 2 https://openalex.org/I1329765538, https://openalex.org/I58286723
#> 3                                  https://openalex.org/I111979921

Created on 2024-09-08 with reprex v2.0.2

The question is, do we want to include affiliations and/or institutions as a list column, such that:

oa_fetch_test1$author[[1]]$affiliations
# [[1]]
# [[1]]$raw_affiliation_string
# [1] "Department of Astronomy & Astrophysics, University of Toronto, Canada"
# 
# [[1]]$institution_ids
# [[1]]$institution_ids[[1]]
# [1] "https://openalex.org/I185261750"
# 
# 
# 
# [[2]]
# [[2]]$raw_affiliation_string
# [1] "Lunar and Planetary Laboratory, The University of Arizona, USA"
# 
# [[2]]$institution_ids
# [[2]]$institution_ids[[1]]
# [1] "https://openalex.org/I58286723"
# 
# [[2]]$institution_ids[[2]]
# [1] "https://openalex.org/I138006243"
# 
# 
# 
# [[3]]
# [[3]]$raw_affiliation_string
# [1] "Dept. of Physics and Astronomy, Northwestern University, 2145 Sheridan Rd., Evanston, IL 60208 and Center for Interdisciplinary Exploration and Research in Astrophysics (CIERA), USA"
# 
# [[3]]$institution_ids
# [[3]]$institution_ids[[1]]
# [1] "https://openalex.org/I111979921"

oa_fetch_test1$author[[1]]$institutions
#> [[1]]
#> # A tibble: 1 × 6
#> id                              display_name          ror                       country_code type      lineage     
#> <chr>                           <chr>                 <chr>                     <chr>        <chr>     <named list>
#>   1 https://openalex.org/I185261750 University of Toronto https://ror.org/03dbr7087 CA           education <list [1]>  
#>   
#>   [[2]]
#> # A tibble: 2 × 6
#> id                              display_name                  ror                       country_code type      lineage     
#> <chr>                           <chr>                         <chr>                     <chr>        <chr>     <named list>
#>   1 https://openalex.org/I58286723  Lunar and Planetary Institute https://ror.org/01r4eh644 US           facility  <list [2]>  
#>   2 https://openalex.org/I138006243 University of Arizona         https://ror.org/03m2x1q45 US           education <list [1]>  
#>   
#>   [[3]]
#> # A tibble: 1 × 6
#> id                              display_name            ror                       country_code type      lineage     
#> <chr>                           <chr>                   <chr>                     <chr>        <chr>     <named list>
#>   1 https://openalex.org/I111979921 Northwestern University https://ror.org/000e0be47 US           education <list [1]>  

What do we think? @yjunechoe @yhan818 What do we want to keep for backward compatibility? (again, I think it's good to keep in mind this change from one institution to more was from OpenAlex, so maybe a breaking change is necessary). Also note that there may be a cost in performance to do all this concatenation when we include everything like the lineage list column above.

According to the documentation:

Each institutional affiliation that this author has claimed will be listed here: the raw affiliation string that we found, along with the OpenAlex Institution ID or IDs that we matched it to. [affiliations] is redundant with [institutions], but is useful if you need to know about what we used to match institutions.
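To make the relationship between a raw affiliation string and its matched institution IDs concrete, the sketch below flattens affiliation records into a long table with one row per (string, ID) pair. `flatten_affiliations()` is a hypothetical helper, and the input is hand-built to resemble the elements printed earlier in this comment rather than fetched from the API.

```r
# Hand-built affiliation records; values are taken from this thread.
affiliations <- list(
  list(
    raw_affiliation_string = "Department of Astronomy & Astrophysics, University of Toronto, Canada",
    institution_ids = list("https://openalex.org/I185261750")
  ),
  list(
    raw_affiliation_string = "Lunar and Planetary Laboratory, The University of Arizona, USA",
    institution_ids = list("https://openalex.org/I58286723", "https://openalex.org/I138006243")
  )
)

# Hypothetical helper: one row per raw string and matched institution ID.
flatten_affiliations <- function(affs) {
  rows <- lapply(affs, function(a) {
    ids <- unlist(a$institution_ids)
    # Keep unmatched affiliation strings by recording a missing ID.
    if (length(ids) == 0) ids <- NA_character_
    data.frame(
      raw_affiliation_string = a$raw_affiliation_string,
      institution_id = ids,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

aff_long <- flatten_affiliations(affiliations)
```

In this form, a raw string that was matched to more than one institution simply appears on several rows.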

@yhan818
Contributor Author

yhan818 commented Sep 9, 2024

OpenAlex has changed some outputs quite heavily in 2024. It has a new data model and has added new entities (e.g. grants).

In general, maintaining backward compatibility is good practice. For example, it avoids breaking code developed against the current openalexR.

Shall we add a new field (e.g. author's affiliations) and leave the old one untouched?

@trangdata
Collaborator

@yhan818 I agree generally it's good practice to maintain backward compatibility, but we do have to balance that out with other factors like cost of maintenance, computation, complexity, etc. I have shared this view before. To sum up, as a third-party package, I think it's important we try to mirror how OpenAlex changes.

@rkrug

rkrug commented Sep 10, 2024

Keeping up with OpenAlex changes is a moving target, and openalexR will always be running behind. But one could do the following to offer both:

  1. Make the default format of fetch a list, as that is essentially the response coming from OpenAlex, or as an alternative the raw JSON as returned per page, saved to files (see Update regex's for performance #271, Update regex's for performance #271 (comment)). This would always be backward compatible.
  2. Offer functions that convert the list or JSON into a tibble and that need to be called separately. This makes it possible to keep backward-compatible functions as well as follow new approaches at a later stage. Also, using the JSON with e.g. DuckDB would not require any conversion.

The problem would be step one, i.e. changing a default value, which will break compatibility, but this could be introduced over a few versions with a deprecation warning.
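The two-step idea above can be sketched as follows, assuming the raw record is kept as a plain list and conversion happens in separate functions. `to_df_first()` and `to_df_all()` are hypothetical names, not openalexR functions, and the hand-built record only stands in for what a list-style fetch (for example via the `output = "list"` argument of `oa_fetch()`, not run here) might return.

```r
# Stand-in for one raw authorship record; IDs are the ones from this thread.
raw_authorship <- list(
  author = list(display_name = "Renu Malhotra"),
  institutions = list(
    list(id = "https://openalex.org/I58286723"),
    list(id = "https://openalex.org/I138006243")
  )
)

# Converter that mimics the existing behaviour: first institution only.
to_df_first <- function(x) {
  ids <- vapply(x$institutions, function(i) i$id, character(1))
  data.frame(
    au_display_name = x$author$display_name,
    institution_id = if (length(ids)) ids[[1]] else NA_character_,
    stringsAsFactors = FALSE
  )
}

# Converter for a newer layout: all institutions in a list column.
to_df_all <- function(x) {
  ids <- vapply(x$institutions, function(i) i$id, character(1))
  out <- data.frame(
    au_display_name = x$author$display_name,
    stringsAsFactors = FALSE
  )
  out$institution_ids <- list(ids)
  out
}

old_style <- to_df_first(raw_authorship)
new_style <- to_df_all(raw_authorship)
```

Because both converters read the same untouched list, existing code could keep calling the older one while new code opts into the list-column layout.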

@yhan818
Copy link
Contributor Author

yhan818 commented Sep 11, 2024

I agree with both of you in principle. Given how much OpenAlex is changing, it is not yet mature, so backward compatibility may not be that important. I am fine with either approach.


4 participants