new with optimization to allow avoiding [ overhead #4488

Draft
wants to merge 3 commits into master

Conversation

@jangorecki (Member) commented May 25, 2020

Closes #3735 and #4485.
Together with #4484 the timing below goes from 27s down to 5s; this PR alone takes it from 13s down to 5s.

library(data.table)
# baseline: row-by-row subsetting of a plain data.frame
allIterations <- data.frame(v1 = runif(1e5), v2 = runif(1e5))
DoSomething <- function(row) someCalculation <- row[["v1"]] + 1
system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, ]))
#   user  system elapsed 
#  3.384   0.007   3.392
# same loop on a data.table, using the new low-overhead with=c(i=FALSE) interface
allIterations <- as.data.table(allIterations)
setDTthreads(1)
system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, , with=c(i=FALSE)]))
#   user  system elapsed 
#  5.432   0.121   5.554 

Once the general idea and API are approved, the following items should be added:

  • extra escapes of with and j processing in downstream code in [.data.table
  • manual
  • more tests
  • news

@jangorecki added the WIP label May 25, 2020
@jangorecki requested a review from mattdowle May 25, 2020 14:09
@jangorecki linked an issue May 25, 2020 that may be closed by this pull request
codecov bot commented May 25, 2020

Codecov Report

Merging #4488 into master will decrease coverage by 0.11%.
The diff coverage is 75.71%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4488      +/-   ##
==========================================
- Coverage   99.60%   99.48%   -0.12%     
==========================================
  Files          72       73       +1     
  Lines       13918    13988      +70     
==========================================
+ Hits        13863    13916      +53     
- Misses         55       72      +17     
Impacted Files Coverage Δ
R/with.R 74.50% <74.50%> (ø)
R/data.table.R 99.78% <78.94%> (-0.22%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

nc = length(x)
if (isFALSE(w[["i"]]) && !missing(i)) i = with_i(i, len=nr, verbose=verbose)
if (isFALSE(w[["j"]]) && !missing(j)) j = with_j(j, len=nc, x=x, verbose=verbose)
if ((isFALSE(w[["i"]]) && missing(j)) || (isFALSE(w[["j"]]) && missing(i)) || (isFALSE(w[["i"]]) && isFALSE(w[["j"]]))) {
Member

a bit hard to figure out what this branch is for

Member Author

from the body of this branch one can see that it is for early return, escaping the rest of [ processing
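
For readers skimming the thread, a rough, hypothetical stand-in for what such an early-return branch does conceptually; this is not the PR's code, and the internal subsetting call is approximated with plain list indexing rather than the real CsubsetDT:

library(data.table)
# hypothetical illustration only -- not [.data.table's actual branch body
early_return_subset = function(x, i, j) {
  if (missing(i)) i = seq_len(nrow(x))   # all rows (see the later review note about NULL)
  if (missing(j)) j = seq_along(x)       # all columns
  # the real branch would use an internal C subset (CsubsetDT); as a rough
  # approximation, treat the table as a plain list of columns and index directly,
  # skipping all of [.data.table's j-evaluation, grouping and join machinery
  setDT(lapply(unclass(x)[j], `[`, i))[]
}
DT = data.table(v1 = runif(5), v2 = runif(5))
early_return_subset(DT, i = 2L)     # one row, all columns
early_return_subset(DT, j = "v1")   # all rows, one column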

@MichaelChirico (Member)

Basic API looks good. Do you know if with=FALSE is now faster than an equivalent with=TRUE query? I worry that if we build this as "faster row selection" and with=FALSE is not "faster column selection", it would be confusing.

@jangorecki (Member Author) commented May 25, 2020

Do you know if with=FALSE now is faster than an equivalent with=TRUE query?

If it is identical(with, FALSE) then the new optimization is escaped entirely, to stay backward compatible, so it is not faster. For with=c(j=FALSE) it will be faster, but it won't be a replacement for with=FALSE because it will not handle !"a" or a:b. It will be truly non-NSE.

I worry if we build this as "faster row selection" and with=FALSE is not "faster column selection" it would be confusing.

The "faster" part is not that relevant here; I would call it low overhead, which can add up to a real speed-up when looping many times. Otherwise the speed difference is insignificant.
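
To make the distinction concrete, a small usage sketch based on the behaviour described above; with=c(j=FALSE) is the interface proposed in this draft PR, so it is not available in released data.table, while the with=FALSE calls show the existing NSE-friendly behaviour that stays unchanged:

library(data.table)
DT = data.table(a = 1:3, b = 4:6, c = 7:9)

# existing behaviour, unchanged by this PR: with=FALSE still supports
# NSE-style column selection such as !"a" or a:b
DT[, !"a", with=FALSE]
DT[, a:b, with=FALSE]

# proposed low-overhead interface: strictly non-NSE, so j must already be
# a plain character or integer vector
cols = c("a", "b")
DT[, cols, with=c(j=FALSE)]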

@MichaelChirico (Member)

OK. Still, if the "low-overhead-ness" differs between with=FALSE and with=c(i=FALSE), it may be confusing.

@jangorecki (Member Author)

with=FALSE is not really low-overhead; with=c(j=FALSE) is. Agreed it deserves good documentation. The good thing is that the change is fully backward compatible, so existing use of with=FALSE won't be affected.

@ColeMiller1 (Contributor)

Most options are lists, and the discussion of nomatch = NULL also considers using a list. For consistency, I wonder if with = .(i = TRUE, j = FALSE) would be more data.table-ish.

For some parts, maybe we could include `.` = function(...) list(...) so we do not have to rely on NSE.
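
As a plain-R illustration of this suggestion (data.table does not export `.`, so it is defined locally here just for the sketch):

# local stand-in for the proposed alias; only list() semantics, nothing data.table-specific
`.` = function(...) list(...)

w = .(i = TRUE, j = FALSE)   # evaluates to list(i = TRUE, j = FALSE)
str(w)
# List of 2
#  $ i: logi TRUE
#  $ j: logi FALSE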

@jangorecki
Copy link
Member Author

jangorecki commented May 26, 2020

@ColeMiller1 to make this non-NSE (so the argument can be evaluated on a separate line and the result passed to with) we would need to export ., which is not a good idea. Such "consistency" doesn't make much sense; if we go that way we would end up using allow.cartesian=.(TRUE) as well. Let's use that only when it solves a new problem or overcomes an existing limitation.
Eventually accepting with=list(i=FALSE, j=FALSE) could make sense, to express more explicitly the difference vs the non-optimized with interface.

@ColeMiller1

This comment has been minimized.

if (isFALSE(w[["i"]]) && !missing(i)) i = with_i(i, len=nr, verbose=verbose)
if (isFALSE(w[["j"]]) && !missing(j)) j = with_j(j, len=nc, x=x, verbose=verbose)
if ((isFALSE(w[["i"]]) && missing(j)) || (isFALSE(w[["j"]]) && missing(i)) || (isFALSE(w[["i"]]) && isFALSE(w[["j"]]))) {
if (missing(i)) i = seq_len(nr)
Contributor

This seq_len(nr) is only needed on the which logic branch. Otherwise, we could do i = NULL.

Member Author

Yes, but eventually subsetDT could avoid copying when NULL is provided, although it does copy now.

Contributor

The main point is that if missing(i), this would realize an integer vector seq_len(nr) even though it is only needed for the which() branch.

Regarding allowing shallow copies in CsubsetDT, that would be a nice feature. A data.frame does not appear to make a copy when selecting columns. At least that's what memory profiling using bench::mark suggests.
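
A small check along the lines of that memory profiling (illustrative sketch; bench::mark is from the bench package and the exact allocation sizes will vary):

library(bench)
df = data.frame(v1 = runif(1e6), v2 = runif(1e6))
# column selection: a small mem_alloc suggests the column vectors are not copied
mark(df[, "v1", drop = FALSE])[, c("expression", "mem_alloc")]
# row selection, by contrast, allocates new vectors for the subset
mark(df[1:5e5, ])[, c("expression", "mem_alloc")]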

Development

Successfully merging this pull request may close these issues.

  • i argument could get with=FALSE
  • Selecting from data.table by row is very slow
3 participants