Update regexes for performance #271

Closed
ds-jim opened this issue Sep 6, 2024 · 22 comments

Comments

@ds-jim

ds-jim commented Sep 6, 2024

Performance when converting large datasets to data.frame is very slow. I doubt this change will transform matters, but every little helps.

In utils.R there are two functions that would run faster as an explicit find-and-replace rather than a regex replace.

Original

shorten_oaid <- function(id) {
  gsub("^https://openalex.org/", "", id)
}

shorten_orcid <- function(id) {
  gsub("^https://orcid.org/", "", id)
}

Suggestion

shorten_oaid <- function(id) {
  gsub("https://openalex.org/", "", id, fixed = TRUE)
}

shorten_orcid <- function(id) {
  gsub("https://orcid.org/", "", id, fixed = TRUE)
}
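
For anyone who wants to check the gain locally, a quick comparison along these lines should do (this uses {bench}, which is not a dependency of the package; numbers will vary by machine):

# Illustrative benchmark: anchored regex vs. fixed string replacement
ids <- rep("https://openalex.org/W2741809807", 1e5)
bench::mark(
  regex = gsub("^https://openalex.org/", "", ids),
  fixed = gsub("https://openalex.org/", "", ids, fixed = TRUE)
)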
@trangdata
Collaborator

@ds-jim Thanks James, I will incorporate your suggestion. How large are your datasets? Could you provide some benchmarks?

@yjunechoe
Collaborator

Just to reference the last time we talked seriously about performance - #129

@ds-jim
Author

ds-jim commented Sep 6, 2024

@trangdata I'm looking at publication lists for entire conditions from 2000 onwards. I tried diabetes (~600k works) and it was still running 12+ hours later, so I stopped and went smaller. I next tried a title-only search for malaria, which has 76,000 publications, and after 30 minutes it hadn't completed, but I spotted the message about truncating authors, so I ran the search again with output = "list" and it completed in 18 minutes. I suspect the memory requirement is the issue in trying to "flatten" all that data. I'm using a 16 GB M1 MacBook.

@yjunechoe
Collaborator

yjunechoe commented Sep 7, 2024

First, some general points:

  • IMO output= "list" is indeed the way to go if you're fetching thousands of works in a single query. You can then incrementally process the output afterwards, and at that point you have the options to e.g., parallelize.
  • The function that handles the data frame conversion for publications is works2df(). This function is indeed a bit expensive to run, as it reconstructs the output = "list" result into a specific "tidy" data frame. We put a lot of thought into how the various pieces of information scattered around the JSON should be organized into columns, and the complexity of works2df() reflects this fact

That said, I agree that things could be faster, especially for works2df(). See the full profiling results here: https://rpubs.com/yjunechoe/works2df_profvis1


I benchmarked converting 1,000 works from output = "list" using works2df():

res1_list <- oa_fetch(identifier = "W2741809807", output = "list")
res1000_list <- rep(list(res1_list), 1000)

profvis::profvis({
  works2df(res1000_list)
})

Overall, it takes 4.5 seconds to process 1,000 works, which isn't too bad, but note that if you're processing tens or hundreds of thousands at a time, performance will be compounded by pressure on memory.

Digging closer into profvis, I see:

[screenshot: profvis flame graph of works2df()]

The top three most expensive calls are process_topics(), mapply(..., subs_na), and subs_na(). The performance costs of the 2nd and 3rd (accounting for ~25% of total execution time) are both driven by subs_na() and, in turn, by replace_w_na(). This was already a target of optimization back in #132, but I think we're seeing problems again here as OpenAlex is returning larger and more nested JSONs (this matters because, IIRC, we're visiting every node to replace NULL with NA). I feel like this is a very low-level optimization problem for which base R tools are just inadequate - I'll just float the idea of using {rrapply} for this again: #129 (comment)
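
For readers unfamiliar with that traversal, a generic recursive sketch of the NULL-to-NA idea looks something like this (an illustration only, not the package's actual replace_w_na()):

# Illustration only -- not openalexR's replace_w_na()
null_to_na <- function(x) {
  if (is.null(x)) return(NA)
  if (is.list(x)) lapply(x, null_to_na) else x
}

Every element of every nested list gets visited, which is why deeply nested JSON responses make this step expensive.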

But the greatest offender is process_topics() (~half of total execution time), which I'm kind of surprised by. Zooming into process_topics(), we actually see subs_na() show up again, taking up ~10% of execution time. tibble::rownames_to_column() stands out to me as the obvious low-hanging fruit here (~25% of total execution time) - we can just write our own util that does the same in base R.

[screenshot: profvis zoom into process_topics()]

The rest seem too trivial or scattered to act on. The next low hanging fruit would probably be optimizing this line of cbind() code in process_topics(), but it already only takes up <1% of execution time.

cbind(i = i, topic[extra], relev_df)


I think a good first step would be factoring out tibble::rownames_to_column(). Then we can see if we can squeeze more performance out of subs_na() (though I'm kind of out of ideas here).
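
A base R stand-in could look something like this (a sketch of the idea only, not necessarily what the eventual PR does):

# Hypothetical helper mirroring tibble::rownames_to_column() in base R
rownames_to_column_base <- function(df, var = "rowname") {
  out <- data.frame(rownames(df), df, check.names = FALSE, stringsAsFactors = FALSE)
  names(out)[1] <- var
  rownames(out) <- NULL
  out
}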

@trangdata
Collaborator

Thank you, June, for the thorough analysis! You're right: process_topics() is a function we recently added to accommodate the new topics entity in OpenAlex, and I hadn't profiled it. So this is super helpful! Do you want to submit a PR to grab the low-hanging fruit first? 🙏🏽

yjunechoe added a commit to yjunechoe/openalexR that referenced this issue Sep 8, 2024
@rkrug

rkrug commented Sep 8, 2024

I just saw https://www.r-bloggers.com/2024/09/json-null-values-and-as_tibble/ - possibly you saw it as well. I do not know if it helps, just wanted to drop it here.

@yjunechoe
Collaborator

Thanks @rkrug for the reference - the post you shared is a good approach in principle but from a quick glance I see that it involves a nested iteration of purrr::map_dfr() which in turn calls purrr::map(). I'm cautious of solutions which don't explicitly and intentionally optimize for performance - a quick benchmark shows that it's about 10 times slower than what we have currently via subs_na():

data_json <- '
[
  {
    "name": "Tim",
    "age": 34,
    "hobby": "footbal"
  },
  {
    "name": "Tom",
    "age": null,
    "hobby": "baseball"
  },    
  {
    "name": "Shelly",
    "age": 21,
    "hobby": null
  }  
]
'
parsed_json <- rjson::fromJSON(data_json)
library(purrr)

fix_NULL <- function(variables) {
  variables <- variables %>%
    map(~ ifelse(is.null(.x), NA, .x))
  as.data.frame(variables)
}

bench::mark(
  current = openalexR:::subs_na(parsed_json, type = "rbind_df")[[1]],
  purrr = map_dfr(parsed_json, ~ fix_NULL(.x)),
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 current     227.4µs  318.5µs     2750.   93.26KB    12.8 
#> 2 purrr         2.9ms   3.82ms      247.    7.28MB     6.28

I think addressing the performance of subs_na() is going to be difficult and we'd need to think very seriously about this, as what we have is already very optimized as far as base R goes. We may have to reach into C code, but I don't have good intuitions for that at the moment.

@rkrug

rkrug commented Sep 8, 2024 via email

@yjunechoe
Collaborator

Yes, there are some other packages for faster reading of JSON based on C/C++, like yyjsonr and rcppsimdjson. I'm a little hesitant to jump straight into integrating these tools, though - I think we can get a lot more mileage, performance-wise, from just promoting better practices for making large queries (e.g., chunk them up, save results as they are returned, separate the task of getting query results from converting them into data frames, etc.).
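
To make the "chunk them up" point concrete, a rough sketch of that workflow could look like the following (chunking by publication year is just an assumption for illustration; from_publication_date and to_publication_date are standard OpenAlex works filters):

# Sketch: fetch a big query in yearly chunks as lists, convert at the end
library(openalexR)

years <- 2000:2005
res_by_year <- lapply(years, function(y) {
  oa_fetch(
    entity = "works",
    title.search = "malaria",
    from_publication_date = paste0(y, "-01-01"),
    to_publication_date = paste0(y, "-12-31"),
    output = "list"
  )
})

# Downloading and converting are now separate steps; the conversion could be parallelized
res_df <- do.call(rbind, lapply(res_by_year, works2df))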

And just for completeness, back on the note of subs_na() - previously I'd floated a suggestion for using {rrapply}, which is implemented in C. I'm not yet 100% sold on adding this extra dependency for maintenance reasons, but an obvious performance improvement like this would be ideal:

parsed_json_many <- rep(parsed_json, 1000)
bench::mark(
  cur = openalexR:::subs_na(parsed_json_many, type = "rbind_df")[[1]],
  purrr = map_dfr(parsed_json_many, ~ fix_NULL(.x)),
  rrapply = rrapply(rrapply(parsed_json_many, is.null, \(...) NA), how = "bind")
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cur         57.26ms  85.26ms    13.2       353KB     7.54
#> 2 purrr         1.27s    1.27s     0.790     484KB     3.16
#> 3 rrapply      6.12ms   6.71ms   131.        486KB     3.98

trangdata added a commit that referenced this issue Sep 8, 2024
* fix test

* use fixed for regex; #271

* factor out tibble::rownames_to_column()

* Revert "closes #265"

This reverts commit 12dde58.

---------

Co-authored-by: let20 <trang.le@bms.com>
@trangdata
Collaborator

trangdata commented Sep 9, 2024

@trangdata I'm looking at publication lists for entire conditions from 2000 onwards. I tried diabetes (~600k works) and it was still running 12+ hours later, so I stopped and went smaller. I next tried a title-only search for malaria, which has 76,000 publications, and after 30 minutes it hadn't completed, but I spotted the message about truncating authors, so I ran the search again with output = "list" and it completed in 18 minutes. I suspect the memory requirement is the issue in trying to "flatten" all that data. I'm using a 16 GB M1 MacBook.

@ds-jim What do you plan to analyze with these records? A few things that might help speed up your call:

  • The options = list(select = ...) argument of oa_fetch() lets you specify the fields you're interested in getting. For example, if you only need publication_year and publication_date, you can specify just those in your call (my example); see the short sketch after this list.
  • Like June suggested, output = "list" might be the way to go if you don't need the data.frame output, or if you plan to convert it to a data.frame afterwards.
  • This might be overkill, but oa_generate() allows you to build the query and get one record at a time, if you're worried about memory.
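
As a minimal illustration of the select option above (the chosen fields here are just an example):

res <- oa_fetch(
  entity = "works",
  title.search = "malaria",
  options = list(select = c("id", "publication_year", "publication_date")),
  output = "list"
)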

@rkrug

rkrug commented Sep 9, 2024

I have modified oa_request() so that it saves each page after it is retrieved. This accomplishes the same thing as @trangdata's suggested oa_generate(), but in an, imho, easier way. I have used it several times, downloading up to 4.5 million records. You can find it at https://github.com/IPBES-Data/IPBES.R/blob/dev/R/oa_request.R. Please ask if you have questions or want to follow that approach.

@yjunechoe
Collaborator

I wanna emphasize the options = list(select = ...) + output = "list" combo even further. The performance of large queries benefits greatly from being careful and intentional about what kinds of information you care about, and (safely) making assumptions that let you take shortcuts. The lesson is: optimizations in code (like gsub with fixed vs. perl) can only get you so far - more mileage can be gained from planning ahead.

For example, if you only care about scalar fields like id, cited_by_count, etc., it suffices to specify them in select, fetch using output = "list", and combine into data frame afterwards with something more performant, like data.table::rbindlist().

The following code fetches and processes 10,000 works objects in 40 seconds.

library(dplyr)
malaria_topic <- oa_fetch(entity = "topics", search = "malaria") %>% 
  filter(display_name == "Malaria") %>% 
  pull(id)
malaria_topic
#> [1] "https://openalex.org/T10091"

# Scalar fields
select_fields <- c("id", "cited_by_count", "display_name")
system.time({
  res <- oa_fetch(
    topics.id = malaria_topic,
    entity = "works",
    verbose = TRUE,
    options = list(sample = 10000, seed = 1,
                   select = select_fields),
    output = "list"
  )
})
#> Requesting url: https://api.openalex.org/works?filter=topics.id%3Ahttps%3A%2F%2Fopenalex.org%2FT10091&sample=10000&seed=1&select=id%2Ccited_by_count%2Cdisplay_name
#> Getting 50 pages of results with a total of 10000 records...
#>   OpenAlex downloading [=====================] 100% eta:  0s
#>    user  system elapsed 
#>    0.45    0.04   41.39 

system.time({
  res_df <- data.table::rbindlist(res)
})
#>    user  system elapsed 
#>    0.00    0.00    0.02

res_df
#> Rows: 10,000
#> Columns: 3
#> $ id             <chr> "https://openalex.org/W2331399312", "https://openalex.org/W3213308635", …
#> $ cited_by_count <int> 1, 0, 3, 28, 0, 155, 0, 15, 1, 2, 15, 76, 8, 1, 5, 0, 68, 1, 3, 0, 57, 4…
#> $ display_name   <chr> "Antimalarial Activity of Zincke’s Salts", "1104 Update on the study of …

@rkrug

rkrug commented Sep 9, 2024 via email

@ds-jim
Author

ds-jim commented Sep 9, 2024

Hi @trangdata,

That's super helpful, thank you. I'd completely missed the select option within the function (sorry all); there's tons of data coming back from OpenAlex that I don't need. As @yjunechoe points out, "select" with "list" is the ticket for larger data sources.

@rkrug
I think your solution might help for intermediate sizes, but I have quite large areas I'm wanting to assess. I have all the results I need saved in list format as RDS. Even so, extracting these is proving difficult, so my next stop was going to be converting the lists back to JSON and loading them into Postgres. I've previously found that the JSON field type there is handled extremely swiftly, although I'll give DuckDB a little try as there's less overhead required!

I think an option for output = "json" would be helpful for those of us one step shy of importing the whole snapshot into BigQuery. That said, the conversion from list to JSON should be fairly trivial.
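
Indeed, with {jsonlite} the list-to-JSON conversion is essentially a one-liner (a sketch, assuming res holds the output = "list" result):

# Sketch: serialize the list result to a JSON file for Postgres/DuckDB ingestion
jsonlite::write_json(res, "malaria_works.json", auto_unbox = TRUE)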

@yjunechoe
Collaborator

yjunechoe commented Sep 9, 2024

@rkrug Is there a context where you'd want raw JSON as opposed to a list? I can't imagine wanting JSON instead of an R list unless you want to consume the data elsewhere, like in JavaScript. Sorry, I somehow completely missed the point about databases. I think that's reasonable, but more thoughts from me below:

Also, I raised this before in #205 (comment), but one thing to be careful of is that for large results that come in batches via pages, a raw JSON output will actually not be a single JSON but a character vector of chunked JSON strings, and there's no way to merge them without parsing them first. Given that, I think output = "json" would be misleading at best. IMO output = "list" is as "raw" as we can get while returning a single query result.

Lastly, for parsing we use {jsonlite}, which is round-trippable, so if the JSON format is important users can always recover it post-hoc (or, just do a simple GET on the output of oa_query()).
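
For completeness, that "simple GET" route might look like this (a sketch using {httr}, which the reader would have to bring in themselves; paging would still need to be handled manually):

library(httr)
url <- oa_query(entity = "works", search = "malaria")
raw_json <- content(GET(url), as = "text", encoding = "UTF-8")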

@rkrug

rkrug commented Sep 10, 2024

Also, I raised this before in #205 (comment), but one thing to be careful of is that for large results that come in batches via pages, a raw JSON output will actually not be a single JSON but a character vector of chunked JSON strings, and there's no way to merge them without parsing them first. Given that, I think output = "json" would be misleading at best. IMO output = "list" is as "raw" as we can get while returning a single query result.

Just for clarification: I would expect OpenAlex to return valid JSON when called to return a certain page. In my modified implementation of oa_request() I save each page as a list (https://github.com/IPBES-Data/IPBES.R/blob/72d354e348ad225a7ddffe7a583ba708624afa0c/R/oa_request.R#L120-L130), which works. In a second step I convert these to parquet format (with several data standardisations in between). So I assume that the returned JSON must be valid, and saving it as JSON should also produce valid JSON.

If you are referring to individual records: isn't the paging done per record by OpenAlex? And looking at the code, I would say each page is returned as an individual entry in data (https://github.com/IPBES-Data/IPBES.R/blob/72d354e348ad225a7ddffe7a583ba708624afa0c/R/oa_request.R#L128)?

The nice thing about openalexR is the implementation of the paging; when using oa_query() one has to do this manually, which is quite a bit of additional work that can be avoided.

Concerning databases: offering the option of saving the JSON per page, plus a function which uses DuckDB to read the JSON files (returning a database object which can be fed into a dplyr pipeline and collected at the end), should really be the easiest and relatively straightforward way to make all conversions unnecessary, as they are done by DuckDB (in theory...).
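
To sketch what that could look like on the user side (purely hypothetical and not part of {openalexR}; assumes one JSON file per saved page containing just that page's results array, a DuckDB build with the json extension, and {dbplyr} installed):

library(DBI)
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())
# Lazily expose the saved pages as a single table
works <- tbl(con, sql("SELECT * FROM read_json_auto('pages/*.json')"))

works %>%
  select(id, publication_year, cited_by_count) %>%
  filter(publication_year >= 2020) %>%
  collect()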

@yjunechoe
Collaborator

We should really move this to a separate issue and not clutter this one up so our discussion on this doesn't get more scattered than it already is, but just to quickly respond:

I'm not opposed to an output = "raw"/"json" in theory, and I don't have any concerns about OpenAlex returning "invalid" JSON. It's just that the exact structure of the output and how to best ingest the JSON(s) depends on how the query is constructed, which is an implementation detail that's better hidden from users entirely - this is the primary goal of {openalexR} as an API package that returns R data frames.

My main thought is that between output = "list" and oa_query(), {openalexR} already provides enough of a lower-level interface for power users (amongst API packages, we already go above and beyond in providing these interfaces). The power power-users who'd like the raw JSON can simply recruit an additional, more powerful tool like {httr} to hit the output of oa_query() directly and grab its JSON contents from there - and then these power power-users can also add their other preferred strategies on top, like retrying, or perhaps the database language provides remote streaming of JSON and you'd prefer that instead. All in all, we would rather encourage power users to take control of their own advanced strategies using what we already offer (namely, via oa_query()), rather than bake individually preferred strategies into the package itself. As I've said before, I prefer not to chase after ...

... a small net-positive for power users (who, in their diversity, may not even agree with the kind of [low-level features we implement] requiring them to write their own custom behaviors anyways)

I have to admit that the database feature is new to me, so maybe I could be convinced that the JSON file format could be of interest to a broader audience. But I wonder - how useful is the data after such an automatic ingestion into databases? The best parts of {openalexR} are data-processing functions like works2df() - why would one opt out of this (or data processing in R entirely)? And if a user wants to opt out of JSON processing, why not opt out a little earlier at oa_query() and take control from there? As a developer, these questions about the certainty and generality of a feature's usefulness are what I'm most concerned about, not so much their technical feasibility and theoretical usecases. As I've stated before, my philosophy is always that development should be driven by the needs of the average user, especially because of our limited resources.

@rkrug

rkrug commented Sep 10, 2024

Thanks for your patience.

I will draw up an example of what I am thinking about using my modified version of oa_request() and also add how to process it further - but it will likely not be this week. I will post it in a new thread.

@yjunechoe
Collaborator

For what it's worth, I think you're doing a great service by digging into these issues very deeply! My concerns about "should the ability to do X be integrated into the package" are entirely separate from "should users get to know that they can do X" - your own implementation serves as a great example for power users who want to get more out of {openalexR}. If you want to share your approach here and link to your implementation for other power users to tweak and adopt, I'd be very happy with that.

@rkrug

rkrug commented Sep 10, 2024

Thanks. I will keep you posted.

@yhan818
Contributor

yhan818 commented Sep 20, 2024

Hi, I ran your code to fetch all the fields.

malaria_topic <- oa_fetch(entity = "topics", search = "malaria") %>%
  filter(display_name == "Malaria") %>%
  pull(id)
malaria_topic
#> [1] "https://openalex.org/T10091"

system.time({
  res <- oa_fetch(
    topics.id = malaria_topic,
    entity = "works",
    verbose = TRUE,
    options = list(sample = 1000, seed = 1),
    output = "list"
  )
})

It took 76 seconds to pull all the fields (40 seconds to pull only the select fields).

I used the following code to fetch 10,000 records; it took 260 seconds. The only add-on is the batch_identifiers array. Any thoughts?

system.time({
  batch_identifiers <- org_ref_works_cited[1:10000]
  res <- oa_fetch(
    identifier = batch_identifiers,
    entity = "works",
    options = list(sample = 10000, seed = 1),
    output = "list"
  )
})

@trangdata
Collaborator

Hi Yan — I've moved your question to #278.

Hi all — I'm closing this issue because the discussion is getting too fragmented. Please feel free to open new specific issues if you think there are points we have not fully resolved. 🙏🏽
