
throttle threads for iterated small data tasks #4484

Merged: 10 commits merged into master on Jun 18, 2020
Conversation

@jangorecki (Member) commented May 24, 2020

Closes #3175, see #3175 (comment)


Closes #3205 (the 2nd part which was outstanding)


Closes #3438: rerun repeated uniqueN test

before patch

# 20 threads
   user  system elapsed 
478.054   1.334  24.300 

# 1 thread
   user  system elapsed 
  7.968   0.348   8.317 

# 40 threads
    user   system  elapsed 
1353.752    1.711   34.369 

after patch

# 20 threads
   user  system elapsed 
 10.021   0.382   8.518 

# 1 thread
   user  system elapsed 
  8.008   0.323   8.331 

# 40 threads
   user  system elapsed 
 12.542   0.376   8.279 

Partially addresses #3735:

library(data.table)
allIterations <- data.frame(v1 = runif(1e5), v2 = runif(1e5))
DoSomething <- function(row) someCalculation <- row[["v1"]] + 1
system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, ]))
allIterations <- as.data.table(allIterations)
setDTthreads(1)
system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, ]))
setDTthreads(40)
system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, ]))

output before patch

# DF
   user  system elapsed 
  3.143   0.005   3.150 
# DT: setDTthreads(1)
   user  system elapsed 
 12.958   0.134  13.093 
# DT: setDTthreads(40)
    user   system  elapsed 
1098.068    0.747   27.851 

after patch

# DF
   user  system elapsed 
  3.129   0.020   3.149
# DT: setDTthreads(1)
   user  system elapsed 
 12.505   0.103  12.610 
# DT: setDTthreads(40)
   user  system elapsed 
 12.663   0.088  12.751 

partially address "slow uniqueN" #3739

library(data.table)
set.seed(1000)
# small vector, uniqueN() is slower
x <- sample(1:1e5, size = 1e2, replace = TRUE)
microbenchmark::microbenchmark(
  uniqueN(x),
  length(unique(x))
)

before patch

Unit: microseconds
              expr    min      lq     mean  median      uq     max neval
        uniqueN(x) 42.426 45.1665 55.85915 46.0855 47.1960 953.592   100
 length(unique(x))  6.498  7.7030  9.03348  8.6160  9.9365  28.824   100

after patch

Unit: microseconds
              expr    min      lq     mean  median      uq    max neval
        uniqueN(x) 14.978 15.7665 17.01930 16.2505 16.7810 77.220   100
 length(unique(x))  5.011  6.0760  6.84729  6.7140  7.3435 22.196   100

slow uniqueN by group #1120

library(data.table)
irisdt <- setDT(iris[sample(1:150, size = 10000, replace = TRUE), ])
irisdt[, Sepal.Width := Sepal.Width + sample(0:50, size = 10000, replace = TRUE)]
irisdt[, Sepal.Length := Sepal.Width + sample(0:5000, size = 10000, replace = TRUE)]
microbenchmark::microbenchmark(
  irisdt[, uniqueN(Sepal.Width), Sepal.Length],
  irisdt[, length(unique(Sepal.Width)), Sepal.Length],
  times = 2
)

before patch

Unit: milliseconds
                                                expr      min       lq     mean   median       uq      max neval
        irisdt[, uniqueN(Sepal.Width), Sepal.Length] 343.0611 343.0611 391.6741 391.6741 440.2871 440.2871     2
 irisdt[, length(unique(Sepal.Width)), Sepal.Length] 127.9050 127.9050 128.5702 128.5702 129.2355 129.2355     2

after patch

Unit: milliseconds
                                                expr       min        lq      mean    median        uq       max neval
        irisdt[, uniqueN(Sepal.Width), Sepal.Length] 103.63772 103.63772 107.19231 107.19231 110.74689 110.74689     2
 irisdt[, length(unique(Sepal.Width)), Sepal.Length]  49.09571  49.09571  52.49689  52.49689  55.89808  55.89808     2

codecov bot commented May 24, 2020

Codecov Report

Merging #4484 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #4484   +/-   ##
=======================================
  Coverage   99.61%   99.61%           
=======================================
  Files          73       73           
  Lines       14103    14108    +5     
=======================================
+ Hits        14048    14053    +5     
  Misses         55       55           
Impacted Files Coverage Δ
src/between.c 100.00% <ø> (ø)
src/cj.c 100.00% <ø> (ø)
src/coalesce.c 100.00% <ø> (ø)
src/fifelse.c 100.00% <ø> (ø)
src/froll.c 100.00% <ø> (ø)
src/frollR.c 100.00% <ø> (ø)
src/frolladaptive.c 100.00% <ø> (ø)
src/nafill.c 100.00% <ø> (ø)
src/reorder.c 100.00% <ø> (ø)
src/subset.c 100.00% <ø> (ø)
... and 6 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 12586af...5485d59.

(Resolved review comment on src/subset.c)
@jangorecki changed the title from "subsetDT escape openmp, #3735" to "escape openmp: subsetDT, between, forder" on May 25, 2020
@jangorecki jangorecki added this to the 1.12.9 milestone May 26, 2020
@mattdowle (Member) commented:

Related: #3205 (comment)


@jangorecki changed the title from "escape openmp: subsetDT, between, forder" to "escape openmp" on Jun 2, 2020
(Resolved review comments on R/openmp-utils.R, src/data.table.h, src/openmp-utils.c)
@mattdowle changed the title from "escape openmp" to "throttle threads for iterated small data tasks" on Jun 13, 2020
@mattdowle (Member) commented Jun 13, 2020

I just couldn't grasp the types, so I changed this PR to what clicks in my head, at least; see what you think. The names OMP_ROWS and OMP_BATCH didn't click because in some cases that are not directly to do with rows, you do still want to throttle. I just find it easier to see a bool flag (true/false) telling me directly whether the throttle should be applied in that region, without the mental hop via the type codes.
While I was changing the getDTthreads() calls, I also changed some of them to throttle=false where Jan had OMP_ROWS; i.e., I'm aware I didn't just change the API. Those ones seemed more like batches to me.
The default throttle of 1024 I just guesstimated. The throttle value means one thread for nrow<=1024, two threads for nrow<=2048, etc., subject to the same limits as before (percent of num_procs, etc.).

@jangorecki (Member, Author) left a review comment:

nice! much simpler approach than mine

(Resolved review comments on src/openmp-utils.c, src/types.c, src/frollR.c, src/nafill.c)
@jangorecki (Member, Author) commented:

The PR in action, just today: https://stackoverflow.com/questions/62347788/multi-threaded-data-table-in-r-much-slower-than-one-using-single-thread?noredirect=1#comment110287890_62347788

@jangorecki (Member, Author) commented Jun 13, 2020

@mattdowle I updated the timings in the first post. It is interesting that length(unique(...)) in the last example sped up as well; probably the forder on Sepal.Length to find the groups escapes some overhead too.
One issue is definitely resolved: #3438.
The three others I marked as "partially addressed"; I think we should define further actions to be taken in each, and if we have no ideas for those, we could eventually close them with the status "not perfect but much better ~ good enough".
Does this PR already cover #3205? If not, maybe avoiding the stress-test threading/batchSize could give more speed?

@jangorecki jangorecki linked an issue Jun 13, 2020 that may be closed by this pull request
@jangorecki (Member, Author) commented Jun 17, 2020

Just to confirm that this PR addresses the problem on the Windows platform as well.

before patch

# 10 threads
   user  system elapsed
 264.71  257.06  291.49

# 1 thread
   user  system elapsed
  25.14   23.88   49.01

# 20 threads
   user  system elapsed
 609.39  609.09  621.81

after patch

# 10 threads
   user  system elapsed
  22.77   23.63   46.50

# 1 thread
   user  system elapsed
  25.35   24.16   49.50

# 20 threads
   user  system elapsed
  25.81   26.43   52.75

Note that issue #4527 is about single-threaded overhead, which on Windows seems to be much bigger than on Linux. The problem here is the overhead caused by many threads.

@mattdowle mattdowle merged commit 89830e9 into master Jun 18, 2020
@mattdowle mattdowle deleted the subsetDT1 branch June 18, 2020 08:56
@jangorecki mentioned this pull request Aug 25, 2020