Update regexes for performance #271
@ds-jim Thanks James, I will incorporate your suggestion. How large are your datasets? Could you provide a benchmark? |
Just to reference the last time we talked seriously about performance - #129 |
@trangdata I'm looking at publication lists for entire conditions from 2000 onwards. I tried diabetes (~600k works) and it was still running 12+ hours later, so I stopped and went smaller. I next tried a title-only search for malaria, which has 76,000 publications; after 30 minutes it hadn't completed, but I spotted the message about truncating authors so ran the search again but with |
First, some general points:
That said, I agree that things could be faster, especially for the conversion step. I benchmarked converting 1000 works:

```r
res1_list <- oa_fetch(identifier = "W2741809807", output = "list")
res1000_list <- rep(list(res1_list), 1000)

profvis::profvis({
  works2df(res1000_list)
})
```

Overall, it takes 4.5 seconds to process 1000 works, which isn't too bad, but note that if you're doing tens and hundreds of thousands at a time, performance will be compounded by pressure on memory.

Digging closer into profvis, I see that the top three most expensive calls are … But the greatest offender is … The rest seem too trivial or scattered to act on. The next low-hanging fruit would probably be optimizing this line:

Line 166 in 774aff7
I think a good first step would be factoring out |
Thank you, June, for the thorough analysis! You're right, and |
I just saw https://www.r-bloggers.com/2024/09/json-null-values-and-as_tibble/ - possibly you saw it as well. I do not know if it helps, just wanted to drop it here. |
Thanks @rkrug for the reference - the post you shared is a good approach in principle, but from a quick glance I see that it involves a nested iteration of purrr::map_dfr(), which in turn calls purrr::map(). I'm cautious of solutions which don't explicitly and intentionally optimize for performance - a quick benchmark shows that it's about 10 times slower than what we have currently via subs_na():

```r
data_json <- '
[
  {
    "name": "Tim",
    "age": 34,
    "hobby": "footbal"
  },
  {
    "name": "Tom",
    "age": null,
    "hobby": "baseball"
  },
  {
    "name": "Shelly",
    "age": 21,
    "hobby": null
  }
]
'
parsed_json <- rjson::fromJSON(data_json)

library(purrr)
fix_NULL <- function(variables) {
  variables <- variables %>%
    map(~ ifelse(is.null(.x), NA, .x))
  as.data.frame(variables)
}

bench::mark(
  current = openalexR:::subs_na(parsed_json, type = "rbind_df")[[1]],
  purrr = map_dfr(parsed_json, ~ fix_NULL(.x))
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 current     227.4µs  318.5µs     2750.   93.26KB    12.8
#> 2 purrr         2.9ms   3.82ms      247.    7.28MB     6.28
```

I think addressing the performance of subs_na() is going to be difficult and we'd need to think very seriously about this, as what we have is already very optimized as far as base R goes. We may have to reach into C code, but I don't have good intuitions for that at the moment. |
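Both subs_na() and the fix_NULL() helper above boil down to the same NULL-to-NA substitution before row-binding. For readers skimming the thread, a stripped-down base R sketch of that idea (illustrative only - this is not openalexR's actual subs_na() implementation):

```r
# Replace NULL fields with NA so each record can become a 1-row data frame
null_to_na <- function(rec) {
  rec[vapply(rec, is.null, logical(1))] <- NA
  rec
}

# `parsed_json` is the parsed example from the benchmark above
do.call(rbind, lapply(parsed_json, function(rec) as.data.frame(null_to_na(rec))))
```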
I did not read through it in detail - so this is likely the case. I remember seeing a JSON package using a C library for handling JSON - maybe that would be an approach, starting at the returned JSON level? Again, I haven't looked at it in detail.
|
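To put a rough number on that idea: one of the C-backed parsers that comes up in the next reply, yyjsonr, can be benchmarked against rjson on the small data_json example above. This is only a sketch - the read_json_str() call assumes yyjsonr's documented API, and the two parsers may return differently shaped results, hence check = FALSE:

```r
library(yyjsonr)

bench::mark(
  rjson   = rjson::fromJSON(data_json),
  yyjsonr = yyjsonr::read_json_str(data_json),
  check = FALSE  # the parsers differ in how they represent nulls/records
)
```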
Yes, there are some other packages for faster reading of JSON based on C/C++, like yyjsonr and rcppsimdjson. I'm a little hesitant to jump straight into integrating these tools, though - I think we can squeeze a lot more mileage in performance from just promoting better practices for making large queries (e.g., chunk them up, download as results are returned, separate the task of getting query results from converting them into data frames, etc.).

And just for completeness, back on the note of the earlier benchmark, here it is scaled up to 1000 copies of the records, with an rrapply-based alternative added:

```r
library(rrapply)

parsed_json_many <- rep(parsed_json, 1000)

bench::mark(
  cur = openalexR:::subs_na(parsed_json_many, type = "rbind_df")[[1]],
  purrr = map_dfr(parsed_json_many, ~ fix_NULL(.x)),
  rrapply = rrapply(rrapply(parsed_json_many, is.null, \(...) NA), how = "bind")
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cur         57.26ms  85.26ms    13.2       353KB     7.54
#> 2 purrr         1.27s    1.27s     0.790     484KB     3.16
#> 3 rrapply      6.12ms   6.71ms   131.        486KB     3.98
```
|
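To make the "chunk it up and convert as you go" advice concrete, here is a minimal sketch. The year range, the title.search and publication_year filters, and the selected fields are illustrative assumptions, not a prescription from this thread:

```r
library(openalexR)

# Fetch one publication year at a time and convert each chunk immediately,
# so the full result never has to sit in memory as one huge nested list.
years <- 2000:2005

chunks <- lapply(years, function(y) {
  res <- oa_fetch(
    entity = "works",
    title.search = "malaria",
    publication_year = y,
    options = list(select = c("id", "display_name", "cited_by_count")),
    output = "list"
  )
  data.table::rbindlist(res)  # convert right away, then drop the raw list
})

malaria_df <- data.table::rbindlist(chunks)
```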
@ds-jim What do you plan to analyze with these records? A few things that might help speed up your call:
|
I have modified |
I wanna emphasize the options = list(select = ...) + output = "list" combo even further. The performance of large queries benefits greatly from being careful and intentional about what kinds of information you care about, and (safely) making assumptions that let you take shortcuts. The lesson is: optimizations in code (like gsub with fixed vs. perl) can only get you so far - more mileage can be gained from planning ahead.

For example, if you only care about scalar fields like id, cited_by_count, etc., it suffices to specify them in select, fetch using output = "list", and combine into a data frame afterwards with something more performant, like data.table::rbindlist().

The following code fetches and processes 10,000 works objects in 40 seconds.

```r
library(dplyr)

malaria_topic <- oa_fetch(entity = "topics", search = "malaria") %>%
  filter(display_name == "Malaria") %>%
  pull(id)
malaria_topic
#> [1] "https://openalex.org/T10091"

# Scalar fields
select_fields <- c("id", "cited_by_count", "display_name")

system.time({
  res <- oa_fetch(
    topics.id = malaria_topic,
    entity = "works",
    verbose = TRUE,
    options = list(sample = 10000, seed = 1,
                   select = select_fields),
    output = "list"
  )
})
#> Requesting url: https://api.openalex.org/works?filter=topics.id%3Ahttps%3A%2F%2Fopenalex.org%2FT10091&sample=10000&seed=1&select=id%2Ccited_by_count%2Cdisplay_name
#> Getting 50 pages of results with a total of 10000 records...
#> OpenAlex downloading [=====================] 100% eta: 0s
#>    user  system elapsed
#>    0.45    0.04   41.39

system.time({
  res_df <- data.table::rbindlist(res)
})
#>    user  system elapsed
#>    0.00    0.00    0.02

res_df
#> Rows: 10,000
#> Columns: 3
#> $ id             <chr> "https://openalex.org/W2331399312", "https://openalex.org/W3213308635", …
#> $ cited_by_count <int> 1, 0, 3, 28, 0, 155, 0, 15, 1, 2, 15, 76, 8, 1, 5, 0, 68, 1, 3, 0, 57, 4…
#> $ display_name   <chr> "Antimalarial Activity of Zincke’s Salts", "1104 Update on the study of …
```
|
I completely agree.
One question: to make that whole process even more flexible, it could be an option to return not the list, but the JSON. This could be stored, and would open many possibilities, even accessing it via DuckDB, which can read collections of JSON files as a dataset. This would also make the conversion easier, as DuckDB could do the job.
Easiest option: specifying JSON as the output format saves each page into a directory. This would catch many use cases, increase speed, and be a useful feature for power users.
|
Hi @trangdata, that's super helpful - thank you. I'd completely missed the … @rkrug I think an option for |
@rkrug Also, I raised this before in #205 (comment), but one thing to be careful of is that for large records that come in batches via pages, a raw JSON output will actually not be a single JSON but a character vector of chunked JSON strings, and there's no way to merge them without parsing them first. Given that, I think … Lastly, for parsing we use |
Just for clarification: I would expect OpenAlex to return a valid JSON when called to return a certain page; that is what I save in my modified implementation of … If you are referring to individual records - isn't the paging done per record by OpenAlex? And when looking at the code, I would say each page is returned as an individual entry in …

The nice thing about openalexR is the implementation of the paging - by using …

Concerning databases: offering the option of saving the JSON per page, and a function which uses DuckDB to read the JSON files (returning a database object which can be fed into a dplyr pipeline and read by using |
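A rough sketch of what the save-to-JSON-then-DuckDB route could look like today, entirely outside of openalexR. The file layout, chunk size, and query below are assumptions for illustration - this is not an existing openalexR feature:

```r
library(openalexR)
library(jsonlite)
library(DBI)
library(duckdb)

# Hypothetical workflow: fetch as a list, write each chunk of records to its
# own JSON file, then let DuckDB read the whole directory as one table.
dir.create("pages", showWarnings = FALSE)

res <- oa_fetch(
  entity = "works",
  title.search = "malaria",
  options = list(sample = 1000, seed = 1,
                 select = c("id", "display_name", "cited_by_count")),
  output = "list"
)

chunks <- split(res, ceiling(seq_along(res) / 200))
for (i in seq_along(chunks)) {
  jsonlite::write_json(chunks[[i]], sprintf("pages/page-%03d.json", i),
                       auto_unbox = TRUE)
}

# Requires a DuckDB build with the JSON extension available (bundled in
# recent releases); read_json_auto() treats the glob as a single table.
con <- dbConnect(duckdb())
works <- dbGetQuery(con, "SELECT * FROM read_json_auto('pages/*.json')")
head(works)
dbDisconnect(con, shutdown = TRUE)
```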
We should really move this to a separate issue and not clutter this one up, so our discussion on this doesn't get more scattered than it already is, but just to quickly respond: I'm not opposed to a … My main thought is that between …

I have to admit that the database feature is new to me, so maybe I could be convinced that the JSON file format could be of interest to a broader audience. But I wonder - how useful is the data after such an automatic ingestion into databases? The best parts of |
Thanks for your patience. I will draw up an example of what I am thinking about using my modified version of |
For what it's worth, I think you're doing a great service by digging into these issues so deeply! My concerns about "should the ability to do X be integrated into the package" are entirely separate from "should users get to know that they can do X" - your own implementation serves as a great example for the power users who want to get more out of |
Thanks. I will keep you posted. |
Hi, I ran your code to fetch all the fields:

```r
malaria_topic <- oa_fetch(entity = "topics", search = "malaria") %>%
  ...

system.time({
  ...
```

It took 76 seconds to pull out all the fields (40 seconds to pull out only the selected fields). I then used the following code to fetch 10,000 records; it took 260 seconds. The only add-on is the batch_identifiers array. Any thoughts?

```r
system.time({
  ...
```
|
Hi Yan — I've moved your question to #278. Hi all — I'm closing this issue because the discussion is getting too fragmented. Please feel free to open new specific issues if you think there are points we have not fully resolved. 🙏🏽 |
Performance converting large datasets to data.frame is very slow. I doubt this will transform matters but every little helps.
In utils.R you have two functions which would run faster as explicit find and replace rather than a regex replace.
Original
Suggestion
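In the spirit of the proposal - swapping a regex substitution for a literal find-and-replace via fixed = TRUE - a hypothetical before/after might look like the following. This illustrates the general pattern only; it is not the actual utils.R code:

```r
ids <- rep("https://openalex.org/W2741809807", 1e5)

# "Original" style: the pattern is interpreted as a regular expression
strip_regex <- function(x) gsub("https://openalex.org/", "", x)

# "Suggestion" style: the pattern is treated as a literal string,
# bypassing the regex engine entirely
strip_fixed <- function(x) gsub("https://openalex.org/", "", x, fixed = TRUE)

bench::mark(
  regex = strip_regex(ids),
  fixed = strip_fixed(ids)
)
```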