speed up unpack_nested_data #42 #51
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master      #51      +/-   ##
==========================================
+ Coverage   70.93%   71.75%   +0.82%
==========================================
  Files           3        3
  Lines         516      531      +15
==========================================
+ Hits          366      381      +15
  Misses        150      150
Continue to review full report at Codecov.
This looks great, @wdearden208. Please see the comments I left for some things to address. I will review again when you've addressed those.
Also, @austin3dickey, I'm going to let you be the final judge on this one, since this function is your masterpiece and you know the nuances of dealing with nested ES results.
R/elasticsearch_parsers.R
Outdated
# Create the unpacked data.table by replicating the originally unpacked
# columns by the number of rows in each entry in the original unpacked column
group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
n <- pmax(purrr::map_int(listDT, NROW), 1)
TIL `pmax`
R/elasticsearch_parsers.R
Outdated
@@ -391,7 +391,8 @@ unpack_nested_data <- function(chomped_df, col_to_unpack) {
    futile.logger::flog.fatal(msg)
    stop(msg)
  }
- if (!("character" %in% class(col_to_unpack)) || length(col_to_unpack) != 1) {
+ if (!("character" %in% class(col_to_unpack)) || length(col_to_unpack) !=
+   1) {
Please put this `1) {` back on the line above, so the line ends with the opening `{`.
R/elasticsearch_parsers.R
Outdated
- # If we tried to unpack an empty column, fail
- if (nrow(newDT) == 0) {
+ # Check for empty column
+ if (all(lengths(listDT) == 0)) {
- TIL `lengths()`
- Out of curiosity, what does this new version catch that `nrow(newDT)` didn't? It's not obvious to me.
This fails faster, I like it
Actually, I don't think this will fail if all the data.tables have 0 rows, just if they all have 0 columns. Was this intended?
Good point. I fixed it.
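For context (this illustration is not from the thread): `length()` of a data frame counts its columns, so a check on `lengths(listDT)` only flags entries with zero columns, while `NROW()` is what catches entries with zero rows.

```r
library(data.table)

listDT <- list(
    data.table::data.table(x = integer(0))   # zero rows, one column
    , data.table::data.table()               # zero rows, zero columns
)

lengths(listDT)                    # 1 0 -> only the second entry looks "empty"
vapply(listDT, NROW, integer(1))   # 0 0 -> both entries have zero rows

all(lengths(listDT) == 0)                    # FALSE
all(vapply(listDT, NROW, integer(1)) == 0)   # TRUE
```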
R/elasticsearch_parsers.R
Outdated
msg <- "The column given to unpack_nested_data had no data in it." | ||
futile.logger::flog.fatal(msg) | ||
stop(msg) | ||
} | ||
|
||
# Fix the ID because we may have removed some empty elements due to that bug | ||
newDT[, .id := oldIDs[.id]] | ||
listDT[lengths(listDT) == 0] <- NA |
Oh, maybe I'm getting this. Will this catch the cases where you have a cell with value `list()`? I like this.
Yes it will
R/elasticsearch_parsers.R
Outdated
} else if (all(is_atomic)) {
  newDT <- data.table::as.data.table(unlist(listDT))
} else {
  msg <- "For unpack_nested_data, col_to_unpack must be all atomic vectors or all data frames"
I'd like it if we gave a more informative error message here. If I got this error as a user, I wouldn't know what to do next.
Please do two things:
- write a unit test using `expect_error()` that reaches this line (see the sketch below)
- propose an error message format that might give a user a bit more guidance on where to go next
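A minimal sketch of such a test with `testthat`; it assumes the package providing `unpack_nested_data()` is loaded, and the data and expectations are purely illustrative:

```r
library(data.table)
library(testthat)

test_that("unpack_nested_data errors on a column mixing atomic vectors and data frames", {
    # a list column whose entries mix an atomic vector and a data frame
    mixedDT <- data.table::data.table(
        id = 1:2
        , nested = list(c(1, 2), data.frame(x = 1:3))
    )
    # once the wording is settled, a regexp on the message could be added here
    expect_error(unpack_nested_data(chomped_df = mixedDT, col_to_unpack = "nested"))
})
```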
R/elasticsearch_parsers.R
Outdated
# Create the unpacked data.table by replicating the originally unpacked
# columns by the number of rows in each entry in the original unpacked column
group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
n <- pmax(purrr::map_int(listDT, NROW), 1)
Can you please use a more descriptive variable name than `n`?
R/elasticsearch_parsers.R
Outdated
# columns by the number of rows in each entry in the original unpacked column
group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
n <- pmax(purrr::map_int(listDT, NROW), 1)
rest <- chomped_df[rep(1:nrow(chomped_df), n), ..group_vars, drop = FALSE]
It's not obvious to me, staring at this line, what it actually does. Can you please add a comment and choose a more informative name than `rest`?
What does `..group_vars` do?
Also, what does `drop = FALSE` do?
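For reference (not an answer from the thread): the `..` prefix tells `[.data.table` to look `group_vars` up as a variable in the calling scope and use its value as the set of column names to select, and `drop = FALSE` mirrors the base `data.frame` argument that keeps a one-column selection as a table rather than simplifying it to a vector (for a data.table it is arguably redundant). A small illustration with made-up data:

```r
library(data.table)

DT <- data.table::data.table(a = 1:3, b = letters[1:3], c = runif(3))
group_vars <- c("a", "b")

# '..' means: treat group_vars as a variable holding column names to select
DT[, ..group_vars]

# Without the prefix, group_vars is evaluated as an ordinary j expression,
# so this just returns the character vector c("a", "b") instead of selecting
DT[, group_vars]
```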
@wdearden208 I am going to approve this PR, looks great! Appreciate all the effort you put into this. However, I'm not going to merge. @austin3dickey you have the final 👍 / 👎 on this one.
@wdearden208 now that we merged that logging PR, you need to rebase this onto
Force-pushed c14c972 to 4f6cfbc
handle mixed atomic/data frame column and better explanations in comments
Force-pushed 4f6cfbc to ada870c
I just updated this! It put me as a non-author committer, so you will still get credit for the contribution.
R/elasticsearch_parsers.R
Outdated
# Merge
outDT[, .id := .I]
outDT <- newDT[outDT, on = ".id"]
is_df <- purrr::map_lgl(listDT, is.data.frame)
Could you please add `#' @importFrom purrr map_lgl` to the Roxygen of this function to be explicit?
Done
R/elasticsearch_parsers.R
Outdated
} else {
  msg <- paste0("Each row in column ", col_to_unpack, " must be a data frame or a vector.")
  futile.logger::flog.fatal(msg)
  stop(msg)
You should use `log_fatal` here instead of these two lines.
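A sketch of that refactor; `log_fatal()` below is a hypothetical stand-in for the package-internal helper the reviewer is referring to, assumed to log the message at FATAL level and then raise it as an error:

```r
# Hypothetical stand-in for the package's internal helper (an assumption,
# not the package's actual definition)
log_fatal <- function(msg) {
    futile.logger::flog.fatal(msg)
    stop(msg)
}

# The two-line log-then-stop pattern from the diff above then collapses to:
col_to_unpack <- "some_column"  # illustrative value
msg <- paste0("Each row in column ", col_to_unpack, " must be a data frame or a vector.")
log_fatal(msg)
```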
Done
R/elasticsearch_parsers.R
Outdated
# Find column name to use for NA vectors
first_df <- min(which(is_df))
col_name <- names(listDT[[first_df]])[1]
Hmm, using the first column name here seems pretty arbitrary. I wonder if we should surface "which column should I put atomic data in, if it's a mixture of atomic and data.tables" to the user.
To that point, have you ever seen a payload that would result in `listDT` being a mixture of atomic vectors and data.tables? I wonder if this whole section is too much complexity for a use case I've never seen.
OK, I dropped the logic for handling mixed atomic/data.table columns, but I still needed to grab the column name to give `NA` entries. I got an error if I didn't, because in the "handle NAs" test it created a second `x` column with all `NA`.
R/elasticsearch_parsers.R
Outdated
outDT <- outDT[, !c(".id", col_to_unpack), with = FALSE]
# Create the unpacked data.table by replicating the originally unpacked
# columns by the number of rows in each entry in the original unpacked column
times_to_replicate <- pmax(purrr::map_int(listDT, NROW), 1)
Why are you using `pmax` here? I thought `purrr::map_int()` is guaranteed to return one integer vector.
I need to use `pmax` for the cases where the entry in `col_to_unpack` is empty, so the number of rows is 0. I need to `rep` it one time because I don't want to drop that row.
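A tiny illustration of the distinction (values made up): `max()` collapses everything to a single number, while `pmax()` is element-wise, so empty entries get clamped up to 1 and keep their single replicated row.

```r
# NROW() of each nested entry; the second entry is empty
row_counts <- c(2L, 0L, 3L)

max(row_counts, 1)    # 3      -- a single value, not what we want
pmax(row_counts, 1)   # 2 1 3  -- element-wise maximum against 1

# The clamped counts then drive the row replication:
rep(seq_along(row_counts), pmax(row_counts, 1))  # 1 1 2 3 3 3
```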
R/elasticsearch_parsers.R
Outdated
# columns by the number of rows in each entry in the original unpacked column
times_to_replicate <- pmax(purrr::map_int(listDT, NROW), 1)
# Replicate the rows of the data.table by entries of times_to_replicate but drop col_to_unpack
replicatedDT <- chomped_df[rep(1:nrow(chomped_df), times_to_replicate)]
Makes sense, though perhaps you should test out my former strategy of using the built-in `idcol` argument of `rbindlist`. That's essentially replicating the logic of `rep(1:nrow(chomped_df), times_to_replicate)`, but I suspect it will be faster since it's data.table haha
So I tried benchmarking a few versions of this function with a large `listDT` that was full of data.tables (no atomic):
1. the old function
2. the old function with `lapply(listDT, data.table::as.data.table)` commented out
3. this version
(3) was twice as fast as (1), but (2) was twice as fast as (3). I suspect the difference is all the copies you're making in this chunk.
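For reference, a sketch of the kind of benchmark being described; the test data here is invented, and it assumes the package providing `unpack_nested_data()` is loaded:

```r
library(data.table)
library(microbenchmark)

# Build a data.table with a list column of 1000 small data frames
test_df <- data.table::data.table(
    id = seq_len(1000)
    , nested = lapply(seq_len(1000), function(i) data.frame(x = rnorm(5), y = rnorm(5)))
)

# Median timing over repeated runs; swap in the different implementations
# being compared to reproduce the kind of numbers discussed in this thread
microbenchmark::microbenchmark(
    unpack_nested_data(chomped_df = test_df, col_to_unpack = "nested")
    , times = 50L
)
```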
Yep, I think these changes lost the speedup. The main speedup was from dropping `data.table::as.data.table`.
Try out this update's speed. For what it's worth, I tried out this function using the same merge as the original function instead of `rep`, and it ended up being about 20% slower. I think that makes sense because I'm still using a data.table `[`, so it shouldn't copy the data any more than a data.table merge would. It's basically going through the same algorithm as merging in the data.table, with `.id` being the same as `times_to_replicate`. Furthermore, I think it makes sense this way since we don't have to deal with creating a `.id` column, which could cause a problem if `chomped_df` has a `.id` column.
Hey, thanks for your patience here. No idea why your build is failing. It may just be transient.
Just tested this update's speed on my test dataset again (which has 1000 data frames in the `col_to_unpack`, no atomic vectors, lists, or NAs). The numbers below are the median times each function took on my computer.
- Old function: 153 ms
- Old function with the `lapply(listDT, data.table::as.data.table)` commented out: 69 ms
- Your first version: 97 ms
- Your second version: 89 ms
- A hybrid: 17 ms
For the hybrid, I built off your second version. I changed the top to:
inDT <- data.table::copy(chomped_df)
listDT <- inDT[[col_to_unpack]]
inDT[, (col_to_unpack) := NULL]
inDT[, .id := .I]
then added the `idcol = TRUE` back to `rbindlist` and merged after `rbindlist` like so:
outDT <- inDT[newDT, on = ".id"]
outDT[, .id := NULL]
return(outDT)
So I still think there's value in researching `idcol` as an option. If you're worried about the `.id` column restriction, we could have that join column be a random UUID or something for safety. You can, for example, use `idcol = myUUIDvar` instead of `idcol = TRUE` to automatically name that ID column.
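Putting those pieces together, a self-contained sketch of the hybrid approach under discussion; this is illustrative only (not the code that was merged) and assumes every entry of the list column is a non-empty data frame:

```r
library(data.table)

unpack_nested_data_hybrid <- function(chomped_df, col_to_unpack) {
    inDT <- data.table::copy(chomped_df)
    listDT <- inDT[[col_to_unpack]]
    inDT[, (col_to_unpack) := NULL]
    inDT[, .id := .I]

    # Stack the nested data frames; idcol records which outer row each
    # stacked row came from
    newDT <- data.table::rbindlist(listDT, fill = TRUE, idcol = ".id")

    # Join the outer columns back on, then drop the helper column
    outDT <- inDT[newDT, on = ".id"]
    outDT[, .id := NULL]
    return(outDT)
}

# Tiny usage example
DT <- data.table::data.table(
    id = c("a", "b")
    , nested = list(data.frame(x = 1:2), data.frame(x = 3:5))
)
unpack_nested_data_hybrid(DT, "nested")
```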
Lookin' good; I love the initiative to get rid of `lapply(listDT, data.table::as.data.table)`. But I don't think we need to cater for the case where there's a mixture of atomic vectors and data.frames in the list (feel free to disagree here). And there are a few more speedups you can achieve. Take a look whenever you can!
Force-pushed bbd0997 to 20410ca
Force-pushed 20410ca to f9b4516
Whoops, I commented on an outdated file just now.
For those keeping track at home, this final version outperforms the original function by 9x. MAJOR props to @wdearden. Thanks for the idea and execution.
Reference #42