Windows 10 is faster with -fno-openmp than setDTthreads(1) on many repeated calls #4527

Closed
ColeMiller1 opened this issue Jun 5, 2020 · 32 comments · Fixed by #4558

@ColeMiller1 (Contributor)

OpenMP support may cause issues with Windows performance, especially with many repeated calls.

library(data.table)
setDTthreads(1L)
allIterations <- data.table(v1 = runif(1e5), v2 = runif(1e5))
DoSomething <- function(row) someCalculation <- row[["v1"]] + 1

system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, ]))
## 1.12.8
##   user  system elapsed 
##  17.09    4.48   22.14 

## master
##   user  system elapsed 
##  17.50    4.53   22.51 

## master w/ OpenMP lines deleted from subset.c
##   user  system elapsed 
##  12.37    0.01   12.85 

Note: I would expect setDTthreads(1L) to minimize any performance impact, but that does not appear to be the case on Windows 10.

@ColeMiller1 (Contributor Author)

Still working on this - I tried the throttle-threads PR, but it did not help.

What is an example where OpenMP should shine? With 1.12.8 and x = sample(1e7L), order(x) is still about 3x faster on 1 thread than forder(x). Using multiple threads helps, but forder(x) still does not catch up to order(x).

library(data.table)

x = sample(1e7L)

setDTthreads(1L)

bench::mark(data.table:::forder(x),
            order(x),
            min_iterations = 10L)

## A tibble: 2 x 13
##   expression               min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
##   <bch:expr>             <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>
## 1 data.table:::forder(x) 923ms  929ms      1.07    38.1MB    0.459     7     3
## 2 order(x)               354ms  355ms      2.79    38.1MB    1.19      7     3

setDTthreads(2L)

##   expression               min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
##   <bch:expr>             <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>
## 1 data.table:::forder(x) 614ms  657ms      1.50    38.1MB    0.376     8     2
## 2 order(x)               359ms  368ms      2.73    38.1MB    1.17      7     3

@ColeMiller1 (Contributor Author)

And I just tried replacing $(SHLIB_OPENMP_CFLAGS) / @openmp_cflags@ with -fno-openmp. This worked, and there is no performance penalty in subsetDT.

@jangorecki (Member)

jangorecki commented Jun 13, 2020

Could you please try this? My Makevars uses -O3 -mtune=native so OpenMP will work. It should not require more than 1.2 GB of memory.

remotes::install_gitlab("jangorecki/data.table@omp-overhead")
library(data.table)
setDTthreads(0L)
getDTthreads()
fcopy = data.table:::fcopy
x = sample.int(1e8L)
system.time(a<-fcopy(x, 1L)) ## schedule(static) num_threads(1)
system.time(a<-fcopy(x, 2L)) ## if (1<0) schedule(static) num_threads(getDTthreads())
system.time(a<-fcopy(x, 3L)) ## schedule(dynamic) num_threads(1)
system.time(a<-fcopy(x, 4L)) ## if (1<0) schedule(dynamic) num_threads(getDTthreads())

Timings on Linux:

> getDTthreads()
[1] 40
> system.time(a<-fcopy(x, 1L))                                                                                                                       
fcopy: integer length 100000000 took 0.200s
   user  system elapsed 
  0.068   0.132   0.201 
> system.time(a<-fcopy(x, 2L))                                                                                                                       
fcopy: integer length 100000000 took 0.210s
   user  system elapsed 
  0.131   0.095   0.226 
> system.time(a<-fcopy(x, 3L))                                                                                                                       
fcopy: integer length 100000000 took 1.659s
   user  system elapsed 
  1.480   0.180   1.659 
> system.time(a<-fcopy(x, 4L))                                                                                                                       
fcopy: integer length 100000000 took 1.722s
   user  system elapsed 
  1.574   0.147   1.722 

@ColeMiller1 (Contributor Author)

I copied your Makevars suggestions, but it still compiled with -O2 and without -mtune=native. Here is the performance:

> setDTthreads(0L)
> getDTthreads()
[1] 4
> fcopy = data.table:::fcopy
> x = sample.int(1e8L)
> system.time(a<-fcopy(x, 1L)) ## schedule(static) num_threads(1)
fcopy: integer length 100000000 took 0.145s
   user  system elapsed 
   0.11    0.05    0.16 
> system.time(a<-fcopy(x, 2L)) ## if (1<0) schedule(static) num_threads(getDTthreads())
fcopy: integer length 100000000 took 0.151s
   user  system elapsed 
   0.15    0.08    0.24 
> system.time(a<-fcopy(x, 3L)) ## schedule(dynamic) num_threads(1)
fcopy: integer length 100000000 took 3.523s
   user  system elapsed 
   3.42    0.01    3.52 
> system.time(a<-fcopy(x, 4L)) ## if (1<0) schedule(dynamic) num_threads(getDTthreads())
fcopy: integer length 100000000 took 3.546s
   user  system elapsed 
   3.47    0.03    3.53 

@jangorecki (Member)

And just to double-check:

system.time(for (i in 1:10) a<-fcopy(x, 1L))
system.time(for (i in 1:10) a<-fcopy(x, 2L))

@ColeMiller1 (Contributor Author)

> system.time(for (i in 1:10) a<-fcopy(x, 1L))
fcopy: integer length 100000000 took 0.207s
fcopy: integer length 100000000 took 0.256s
fcopy: integer length 100000000 took 0.160s
fcopy: integer length 100000000 took 0.160s
fcopy: integer length 100000000 took 0.146s
fcopy: integer length 100000000 took 0.142s
fcopy: integer length 100000000 took 0.148s
fcopy: integer length 100000000 took 0.141s
fcopy: integer length 100000000 took 0.138s
fcopy: integer length 100000000 took 0.156s
   user  system elapsed 
   1.06    1.03    2.22 
> system.time(for (i in 1:10) a<-fcopy(x, 2L))
fcopy: integer length 100000000 took 0.136s
fcopy: integer length 100000000 took 0.148s
fcopy: integer length 100000000 took 0.144s
fcopy: integer length 100000000 took 0.153s
fcopy: integer length 100000000 took 0.185s
fcopy: integer length 100000000 took 0.145s
fcopy: integer length 100000000 took 0.148s
fcopy: integer length 100000000 took 0.157s
fcopy: integer length 100000000 took 0.146s
fcopy: integer length 100000000 took 0.145s
   user  system elapsed 
   1.14    0.89    2.04 

@jangorecki (Member)

jangorecki commented Jun 14, 2020

I would say that there is no extra overhead on Windows caused by the way we escape OpenMP. Any performance degradation must be caused by some other factor. You can try running these last two calls with -fno-openmp. If a difference is visible, then I think we should just document it.

On Linux, -fno-openmp:

   user  system elapsed 
  1.240   1.836   3.077 
   user  system elapsed 
  1.140   1.728   2.869 

-fopenmp:

   user  system elapsed 
  1.049   1.681   2.730 
   user  system elapsed 
  0.797   1.587   2.383 

@jangorecki (Member)

I just replaced @openmp_cflags@ with -fno-openmp in Makevars.in. For Windows, replace $(SHLIB_OPENMP_CFLAGS) in Makevars.win.
I just checked, and putting -fno-openmp in ~/.R/Makevars is ignored, so the above seems to be the most straightforward way.
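
A minimal sketch of that substitution (illustrative only - the real PKG_CFLAGS lines carry other flags that should be kept as they are):

# Makevars.in, before:
PKG_CFLAGS = @openmp_cflags@
# Makevars.in, after:
PKG_CFLAGS = -fno-openmp
# Makevars.win: same idea, replacing $(SHLIB_OPENMP_CFLAGS) with -fno-openmp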

@ColeMiller1 (Contributor Author)

ColeMiller1 commented Jun 14, 2020

It took some time to figure out how to clone your GitLab repo:

On Windows -fno-openmp

   user  system elapsed 
   1.10    0.77    1.90 
   user  system elapsed 
   1.11    0.94    2.13 

On Windows -fopenmp

   user  system elapsed 
   0.96    0.87    1.89
   user  system elapsed 
   1.04    0.89    2.07

I agree for this use case that there is no appreciable difference.

Here is a case where there is a difference. The implication is that while no one will be subsetting a row at a time, the ideal API for subsetting by group is DT[, .SD[1:2], grp], and this per-call overhead can really add up as the number of groups increases (a sketch of that grouped pattern follows the timings below).

library(data.table)
mat = matrix(0L, nrow = 1000L, ncol = 100L)

DT = as.data.table(mat)
DF = as.data.frame(mat)

##-fopenmp
setDTthreads(1L)
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
##   user  system elapsed 
##   0.59    0.38    1.00 
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
##   user  system elapsed 
##   1.12    0.02    1.19 

##-fno-openmp
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
##   user  system elapsed 
##   0.19    0.01    0.22 
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
##   user  system elapsed 
##   1.01    0.00    1.03 

edit: using mat = matrix(0L, nrow = 1e5L, ncol = 10L)

##-fopenmp
setDTthreads(1L)
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
##   user  system elapsed 
##  19.96    4.00   24.56 
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
##   user  system elapsed 
##  12.09    0.00   12.21 


##-fno-openmp
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
##   user  system elapsed 
##  14.64    0.01   14.77 
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
##   user  system elapsed 
##  12.10    0.00   12.26 
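
For reference, a sketch of the grouped pattern mentioned above (hypothetical data and group count; timings will vary). Each group triggers its own internal subset call, so any fixed per-call cost is multiplied by the number of groups:

library(data.table)
setDTthreads(1L)
n_groups = 5e4L  # hypothetical group count
DT3 = data.table(grp = rep(seq_len(n_groups), each = 4L),
                 v   = runif(4L * n_groups))
system.time(DT3[, .SD[1:2], by = grp])  # one subset per group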

@jangorecki (Member)

jangorecki commented Jun 14, 2020

I see it does have an impact then. This is on Linux, with nrow scaled up to 1e5 and ncol reduced to 10. So it seems to be Windows-specific overhead.

library(data.table)
mat = matrix(0L, nrow = 1e5L, ncol = 10L)
DT = as.data.table(mat)
DF = as.data.frame(mat)
##-fopenmp
setDTthreads(1L)
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
#   user  system elapsed 
# 11.864   0.286  12.150 
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
#   user  system elapsed 
#   6.82    0.00    6.82 
##-fno-openmp
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
#   user  system elapsed 
# 12.148   0.008  12.157 
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
#   user  system elapsed 
#  7.292   0.000   7.293 

@jangorecki jangorecki changed the title OpenMP Performance with Windows 10 Windows 10 is faster with -fno-openmp than setDTthreads(1) Jun 14, 2020
@jangorecki jangorecki changed the title Windows 10 is faster with -fno-openmp than setDTthreads(1) Windows 10 is faster with -fno-openmp than setDTthreads(1) on many repeated calls Jun 14, 2020
@mattdowle (Member)

mattdowle commented Jun 15, 2020

Great info @ColeMiller1, many thanks for investigating here. It seems like the new throttle works on Linux but not as well on Windows. So that's something that needs further work then, agreed.

Can I clear up a few sentences on this one?

OpenMP support may cause issues with Windows performance, especially with many repeated calls.

Why the "especially"? This issue is just about repeated calls, and repeated calls on small data, right?

What is an example where OpenMP should shine?

Any of the examples in https://h2oai.github.io/db-benchmark/. They are single calls on large data that take more than a few seconds, and in some cases minutes, to run.

In general, [.data.table doesn't like being repeated (it has overhead). That's why we have the low-overhead set() alternative, and why we use [[ and $ with data.table for ordinary operations. We'd like to get to a state one day where the user doesn't need to know (for example by reworking [.data.table), but that hasn't happened yet.
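
A sketch of that pattern on a fresh example table:

library(data.table)
d = data.table(v1 = runif(1e5))
# reading one value: [[ avoids [.data.table dispatch and its overhead
val = d[["v1"]][10L]                                # instead of d[10L, v1]
# writing in a loop: set() instead of repeated d[i, v1 := ...]
for (i in 1:100) set(d, i = i, j = "v1", value = 0)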

And in general, data.table doesn't like doing things one row at a time, which is true of R in general, and Python too. There is always work to do to make it easier for the user, but one-row-at-a-time is just not recommended practice in high-level languages like R.
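
For example, the row-at-a-time loop in the original report collapses to one vectorised call:

# equivalent of calling DoSomething() on every row of allIterations, in one pass
someCalculation = allIterations[["v1"]] + 1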

So yes, we do want to work on one-row-at-a-time benchmarks like this one, but I wonder how high it should be on the list.

@ColeMiller1 (Contributor Author)

I agree that single-row subsetting itself is not a large concern - the original example is related to #3735. But that's where I found another Windows user who had slower subsets than Linux users for many repeated calls.

Why should we care about this? Operations by group can be affected, especially when we subset by group (e.g. dt[, .SD[1:2], grp]). For Windows users, there can be negative impacts even with setDTthreads(1L).

I hope my choice of words doesn't make us overlook that there are opportunities to make data.table the best tool for both large data and small data, regardless of platform. In working towards #852, I have a branch that takes the 14s down to 5.3s for matrix(0L, nrow = 1e5L, ncol = 10L) without OpenMP. With OpenMP, it goes from 24.5s to 13.6s. So 80µs at a time can add up with optimizations.

FWIW, this may be centered on where the OpenMP loop is. While my attempt is extremely buggy, putting the #pragma omp parallel for ... in subsetDT instead of subsetVectorRaw changes the time on this branch to 8.3s. The main difference is that whereas Linux shows no overhead, Windows seems to incur overhead every time a thread team is forked and joined. As subsetDT is currently written, every column gets a new team of threads to go out and subset the vector.

@MichaelChirico (Member)

Just hit a work issue where it was pretty natural to do a rowwise [:

The basic idea is that I need to collapse some columns into a metadata column, e.g.

NN = 1e5
DT = data.table(ID = 1:NN, V1 = rnorm(NN), V2=rnorm(NN))

I tried:

DT[ , metadata := lapply(seq_len(.N), function(ii) .SD[ii]), .SDcols = c('V1', 'V2')]

And it's indeed rather slow.

transpose essentially works in this case because V1 and V2 are the same type, but in general the columns could be different types (-> coercion) or include non-transpose-compatible types (e.g. list).

@mattdowle mattdowle added this to the 1.12.11 milestone Jun 18, 2020
@ColeMiller1 (Contributor Author)

ColeMiller1 commented Jun 18, 2020

@MichaelChirico thanks for the example. Using by would help speed this up some (maybe a dedicated nest() feature would be better):

library(data.table) ##1.12.8
setDTthreads(1L)
NN = 1e4
DT = data.table(ID = 1:NN, V1 = rnorm(NN), V2=rnorm(NN))

system.time(DT[ , metadata := lapply(seq_len(.N), function(ii) .SD[ii]), .SDcols = c('V1', 'V2')])
#>    user  system elapsed 
#>    1.86    0.12    1.98
system.time(DT[ , metadata2 := .(.(copy(.SD))), by = ID, .SDcols = c('V1', 'V2')])
#>    user  system elapsed 
#>    0.26    0.02    0.28
all.equal(DT$metadata, DT$metadata2)
#> [1] TRUE

I had to use copy() because otherwise the result has the locked attribute.

@MichaelChirico (Member)

The copy() part should be fixed in dev, right?

@ColeMiller1 (Contributor Author)

@MichaelChirico I just downloaded master - it does not appear to be fixed. TBH, I'm not aware of an existing issue about nesting .SD by group:

DT[, metadata3 := .(.(.SD)), by = ID, .SDcols = c("V1", "V2")]

attributes(DT$metadata3[[1L]])
##$row.names
##[1] 1

##$class
##[1] "data.table" "data.frame"

##$.internal.selfref
##<pointer: 0x05772498>

##$names
##[1] "V1" "V2"

##$.data.table.locked
##[1] TRUE

all.equal(DT$metadata2[[1L]], DT$metadata3[[1L]])
##[1] "Datasets has different number of (non-excluded) attributes: target 2, current 3"

@jangorecki (Member)

This can be observed on Linux as well:
this branch removes overhead: https://github.com/Rdatatable/data.table/compare/dogroups-overhead?expand=1
while this branch still has the overhead: https://github.com/Rdatatable/data.table/compare/dogroups-overhead2?expand=1

@jangorecki (Member)

@ColeMiller1 could you confirm whether #4558 resolves this issue?

@ColeMiller1 (Contributor Author)

@jangorecki sorry for the delay. Yes! (I even like that I do not have to add setDTthreads(1L) at the top!)

library(data.table)
mat = matrix(0L, nrow = 1e5L, ncol = 10L)
DT = as.data.table(mat)
DF = as.data.frame(mat)

system.time(for (i in seq_len(nrow(DT))) {DT[i]})
#>    user  system elapsed 
#>   13.42    0.43   14.15
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
#>    user  system elapsed 
#>   12.66    0.03   12.97

The system.time is still a little higher - it would be great if we could add an argument to subsetDT indicating whether the indices need to be checked. In [.data.table we have already checked them once, so there's no need to convert zeros and negative IDs again.
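
To illustrate the re-checking (an R-level sketch of the zero/negative normalisation - the skip-check argument itself is hypothetical and does not exist):

i = c(-1L, 0L, -2L)    # negative/zero form: "drop rows 1 and 2"; zeros are ignored
n = 5L
i_pos = seq_len(n)[i]  # normalised to positive row numbers: 3 4 5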

But that's small overall and could be addressed in a follow-up issue. The subsetting performance was the only place I noticed the OpenMP issue. Please feel free to close this when the PR is merged.

Thanks for all of your help!

@jangorecki jangorecki linked a pull request Jun 21, 2020 that will close this issue
@jangorecki jangorecki modified the milestones: 1.12.11, 1.12.9 Jul 1, 2020