new with optimization to allow avoiding [ overhead #4488

Draft
wants to merge 3 commits into master

Conversation

@jangorecki (Member) commented May 25, 2020

Closes #3735 and #4485.
Together with #4484 the timing below goes from 27s down to 5s; this PR alone takes it from 13s down to 5s.

library(data.table)
# baseline: row-by-row subsetting of a plain data.frame
allIterations <- data.frame(v1 = runif(1e5), v2 = runif(1e5))
DoSomething <- function(row) someCalculation <- row[["v1"]] + 1
system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, ]))
#   user  system elapsed 
#  3.384   0.007   3.392
# same loop on a data.table, using the new low-overhead with=c(i=FALSE) interface
allIterations <- as.data.table(allIterations)
setDTthreads(1)
system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, , with=c(i=FALSE)]))
#   user  system elapsed 
#  5.432   0.121   5.554 

Once the general idea and API are approved, the following items should be added:

  • extra escapes of with and j processing in downstream code in [.data.table
  • manual
  • more tests
  • news

@jangorecki added the WIP label May 25, 2020
@jangorecki requested a review from mattdowle May 25, 2020 14:09
@jangorecki linked an issue May 25, 2020 that may be closed by this pull request
codecov bot commented May 25, 2020

Codecov Report

Merging #4488 into master will decrease coverage by 0.11%.
The diff coverage is 75.71%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4488      +/-   ##
==========================================
- Coverage   99.60%   99.48%   -0.12%     
==========================================
  Files          72       73       +1     
  Lines       13918    13988      +70     
==========================================
+ Hits        13863    13916      +53     
- Misses         55       72      +17     
Impacted Files Coverage Δ
R/with.R 74.50% <74.50%> (ø)
R/data.table.R 99.78% <78.94%> (-0.22%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

nc = length(x)
if (isFALSE(w[["i"]]) && !missing(i)) i = with_i(i, len=nr, verbose=verbose)
if (isFALSE(w[["j"]]) && !missing(j)) j = with_j(j, len=nc, x=x, verbose=verbose)
if ((isFALSE(w[["i"]]) && missing(j)) || (isFALSE(w[["j"]]) && missing(i)) || (isFALSE(w[["i"]]) && isFALSE(w[["j"]]))) {
Member

a bit hard to figure out what this branch is for

Member Author

from the body of this branch one can see that it is for early return, escaping the rest of [ processing
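
For readers skimming the thread, a rough, hypothetical stand-in for what such an early-return branch does conceptually; this is not the PR's code, and the internal subsetting call is approximated with plain list indexing rather than the real CsubsetDT:

library(data.table)
# hypothetical illustration only -- not [.data.table's actual branch body
early_return_subset = function(x, i, j) {
  if (missing(i)) i = seq_len(nrow(x))   # all rows (see the later review note about NULL)
  if (missing(j)) j = seq_along(x)       # all columns
  # the real branch would use an internal C subset (CsubsetDT); as a rough
  # approximation, treat the table as a plain list of columns and index directly,
  # skipping all of [.data.table's j-evaluation, grouping and join machinery
  setDT(lapply(unclass(x)[j], `[`, i))[]
}
DT = data.table(v1 = runif(5), v2 = runif(5))
early_return_subset(DT, i = 2L)     # one row, all columns
early_return_subset(DT, j = "v1")   # all rows, one column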

@MichaelChirico (Member)

Basic API looks good. Do you know if with=FALSE is now faster than an equivalent with=TRUE query? I worry that if we build this as "faster row selection" and with=FALSE is not "faster column selection", it would be confusing.

@jangorecki (Member Author) commented May 25, 2020

Do you know if with=FALSE now is faster than an equivalent with=TRUE query?

If it is identical(with, FALSE) then the new optimization is escaped entirely, to stay backward compatible, so it is not faster. For with=c(j=FALSE) it will be faster, but it won't be a replacement for with=FALSE because it will not handle !"a" or a:b. It will be truly non-NSE.

I worry if we build this as "faster row selection" and with=FALSE is not "faster column selection" it would be confusing.

The "faster" part is not that relevant here; I would call it low overhead, which can add up to a real speed-up when looping many times. Otherwise the speed difference is insignificant.
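
To make the distinction concrete, a small usage sketch based on the behaviour described above; with=c(j=FALSE) is the interface proposed in this draft PR, so it is not available in released data.table, while the with=FALSE calls show the existing NSE-friendly behaviour that stays unchanged:

library(data.table)
DT = data.table(a = 1:3, b = 4:6, c = 7:9)

# existing behaviour, unchanged by this PR: with=FALSE still supports
# NSE-style column selection such as !"a" or a:b
DT[, !"a", with=FALSE]
DT[, a:b, with=FALSE]

# proposed low-overhead interface: strictly non-NSE, so j must already be
# a plain character or integer vector
cols = c("a", "b")
DT[, cols, with=c(j=FALSE)]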

@MichaelChirico (Member)

OK. Still, if the "low-overhead-ness" differs between with=FALSE and with=c(i=FALSE), it may be confusing.

@jangorecki (Member Author)

with=FALSE is not really low-overhead; with=c(j=FALSE) is. Agreed it deserves good documentation. The good thing is that the change is fully backward compatible, so existing use of with=FALSE won't be affected.

@ColeMiller1 (Contributor)

Most options are lists, and the discussion of nomatch = NULL also considers using a list. For consistency, I wonder if with = .(i = TRUE, j = FALSE) would be more data.table-ish.

For some parts, maybe we could include `.` = function(...) list(...) so we do not have to rely on NSE.
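
As a plain-R illustration of this suggestion (data.table does not export `.`, so it is defined locally here just for the sketch):

# local stand-in for the proposed alias; only list() semantics, nothing data.table-specific
`.` = function(...) list(...)

w = .(i = TRUE, j = FALSE)   # evaluates to list(i = TRUE, j = FALSE)
str(w)
# List of 2
#  $ i: logi TRUE
#  $ j: logi FALSE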

@jangorecki
Copy link
Member Author

jangorecki commented May 26, 2020

@ColeMiller1 to make this non-NSE (so the argument can be evaluated on a separate line and the result passed to with) we would need to export ., which is not a good idea. Such "consistency" doesn't make much sense; if we go that way we would end up using allow.cartesian=.(TRUE) as well. Let's use that only when it solves a new problem or overcomes an existing limitation.
Eventually accepting with=list(i=FALSE, j=FALSE) could make sense, to express more explicitly the difference vs the non-optimized with interface.

@ColeMiller1

This comment has been minimized.

if (isFALSE(w[["i"]]) && !missing(i)) i = with_i(i, len=nr, verbose=verbose)
if (isFALSE(w[["j"]]) && !missing(j)) j = with_j(j, len=nc, x=x, verbose=verbose)
if ((isFALSE(w[["i"]]) && missing(j)) || (isFALSE(w[["j"]]) && missing(i)) || (isFALSE(w[["i"]]) && isFALSE(w[["j"]]))) {
if (missing(i)) i = seq_len(nr)
Contributor

This seq_len(nr) is only needed on the which logic branch. Otherwise, we could do i = NULL.

Member Author

Yes, but eventually subsetDT could avoid copying when NULL is provided, although it does copy now.

Contributor

The main point is that if missing(i), this would realize an integer vector seq_len(nr) even though it is only needed for the which() branch.

Regarding allowing shallow copies in CsubsetDT, that would be a nice feature. A data.frame does not appear to make a copy when selecting columns. At least that's what memory profiling using bench::mark suggests.
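
A small check along the lines of that memory profiling (illustrative sketch; bench::mark is from the bench package and the exact allocation sizes will vary):

library(bench)
df = data.frame(v1 = runif(1e6), v2 = runif(1e6))
# column selection: a small mem_alloc suggests the column vectors are not copied
mark(df[, "v1", drop = FALSE])[, c("expression", "mem_alloc")]
# row selection, by contrast, allocates new vectors for the subset
mark(df[1:5e5, ])[, c("expression", "mem_alloc")]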

Development

Successfully merging this pull request may close these issues.

  • i argument could get with=FALSE
  • Selecting from data.table by row is very slow
3 participants