insanely slow printing for very large dataframe #422

Closed
randomgambit opened this issue Jun 12, 2018 · 28 comments

Comments

@randomgambit

Hello the tibble team

I have a dataset that takes about 1GB on disk; it contains approximately 900k rows and 300 columns. The columns contain text and numbers.

Loading and processing the data is easy, as I have about 128GB of RAM on my machine. However, unexpectedly, just running head(2) and asking to show all the columns takes forever (I am still waiting after 15 minutes).

options(tibble.width = Inf)
options(tibble.print_max = 1000, tibble.print_min =20)
options("digits.secs"=3)
options(pillar.sigfig = 7)

Doing the same thing in python pandas takes a few seconds. What is the bottleneck here? Are there any fixes?

Thanks!

@krlmlr
Member

krlmlr commented Jun 12, 2018

Thanks. I couldn't replicate this when calling format() with a simple data frame of numbers. Is it format() or print() that takes so long? How long does it take with 200, 100, 50, 20 columns?
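
One way to answer the column-count question is to time format() on progressively narrower subsets of the same data. The sketch below is only an illustration: time_format_by_ncol() is a hypothetical helper, not part of tibble, and it uses a base data frame as stand-in data so that it runs without extra packages. To test the case in this thread, pass a tibble instead and set options(tibble.width = Inf) first.

```r
# Hypothetical helper (not part of tibble): time a formatting function on the
# first k columns of `df`, for each k in `ncols`.
time_format_by_ncol <- function(df, ncols, fmt = format) {
  elapsed <- vapply(ncols, function(k) {
    sub <- df[seq_len(min(k, ncol(df)))]  # keep only the first k columns
    system.time(fmt(sub))[["elapsed"]]
  }, numeric(1))
  setNames(elapsed, ncols)
}

# Stand-in data: 1,000 rows x 200 integer columns.
demo <- as.data.frame(matrix(seq_len(1000 * 200), nrow = 1000))
timings <- time_format_by_ncol(demo, c(200, 100, 50, 20))
print(timings)
```

If the elapsed time grows much faster than the column count, that points at the layout step rather than at the number of rows.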

@randomgambit
Author

hey @krlmlr have you tried with a random dataframe similar to mine?

@krlmlr
Member

krlmlr commented Jun 12, 2018

Here's my reprex. What am I doing wrong?

library(tibble)
N <- 300
list <- setNames(seq_len(N), paste0("X", seq_len(N)))
df <- tibble(!!!list)

withr::with_options(
  list(tibble.width = Inf),
  system.time(format(df))
)
#>    user  system elapsed 
#>   1.501   0.000   1.512

Created on 2018-06-12 by the reprex package (v0.2.0).

@randomgambit
Author

This is what I get

> withr::with_options(
+   list(tibble.width = Inf),
+   system.time(format(df))
+ )
   user  system elapsed 
   4.71    0.00    4.72 

But unless I am wrong, you don't have about 1 million rows here. I guess this is what is slowing down the printing?

@randomgambit
Author

happy to run other tests on my original data (I cannot share it tho)

@krlmlr
Member

krlmlr commented Jun 12, 2018

head(2) still has 1 million rows?

@randomgambit
Author

nope, of course. but I don't know what happens internally. maybe some dark magic where the printing goes crazy because it somehow loads the full data. I have no clue

@randomgambit
Author

running dataframe %>% select(NAME) %>% head(2) works instantly.

running dataframe %>% head(2) takes forever

> object.size(dataframe)
2666556856 bytes

@krlmlr
Member

krlmlr commented Jun 12, 2018

I'm at a loss, your data must have something that mine's lacking. Not even setting the "tibble.print_max" option to 1000 changed anything.

library(tibble)
N <- 300
M <- 1e6
list <- setNames(rep(list(seq_len(M)), N), paste0("X", seq_len(N)))
df <- tibble(!!!list)

df
#> # A tibble: 1,000,000 x 300
#>       X1    X2    X3    X4    X5    X6    X7    X8    X9   X10   X11   X12
#>    <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#>  1     1     1     1     1     1     1     1     1     1     1     1     1
#>  2     2     2     2     2     2     2     2     2     2     2     2     2
#>  3     3     3     3     3     3     3     3     3     3     3     3     3
#>  4     4     4     4     4     4     4     4     4     4     4     4     4
#>  5     5     5     5     5     5     5     5     5     5     5     5     5
#>  6     6     6     6     6     6     6     6     6     6     6     6     6
#>  7     7     7     7     7     7     7     7     7     7     7     7     7
#>  8     8     8     8     8     8     8     8     8     8     8     8     8
#>  9     9     9     9     9     9     9     9     9     9     9     9     9
#> 10    10    10    10    10    10    10    10    10    10    10    10    10
#> # ... with 999,990 more rows, and 288 more variables: X13 <int>,
#> #   X14 <int>, X15 <int>, X16 <int>, X17 <int>, X18 <int>, X19 <int>,
#> #   X20 <int>, X21 <int>, X22 <int>, X23 <int>, X24 <int>, X25 <int>,
#> #   X26 <int>, X27 <int>, X28 <int>, X29 <int>, X30 <int>, X31 <int>,
#> #   X32 <int>, X33 <int>, X34 <int>, X35 <int>, X36 <int>, X37 <int>,
#> #   X38 <int>, X39 <int>, X40 <int>, X41 <int>, X42 <int>, X43 <int>,
#> #   X44 <int>, X45 <int>, X46 <int>, X47 <int>, X48 <int>, X49 <int>,
#> #   X50 <int>, X51 <int>, X52 <int>, X53 <int>, X54 <int>, X55 <int>,
#> #   X56 <int>, X57 <int>, X58 <int>, X59 <int>, X60 <int>, X61 <int>,
#> #   X62 <int>, X63 <int>, X64 <int>, X65 <int>, X66 <int>, X67 <int>,
#> #   X68 <int>, X69 <int>, X70 <int>, X71 <int>, X72 <int>, X73 <int>,
#> #   X74 <int>, X75 <int>, X76 <int>, X77 <int>, X78 <int>, X79 <int>,
#> #   X80 <int>, X81 <int>, X82 <int>, X83 <int>, X84 <int>, X85 <int>,
#> #   X86 <int>, X87 <int>, X88 <int>, X89 <int>, X90 <int>, X91 <int>,
#> #   X92 <int>, X93 <int>, X94 <int>, X95 <int>, X96 <int>, X97 <int>,
#> #   X98 <int>, X99 <int>, X100 <int>, X101 <int>, X102 <int>, X103 <int>,
#> #   X104 <int>, X105 <int>, X106 <int>, X107 <int>, X108 <int>,
#> #   X109 <int>, X110 <int>, X111 <int>, X112 <int>, …

withr::with_options(
  list(tibble.width = Inf, tibble.print_max = 1000),
  system.time(format(df))
)
#>    user  system elapsed 
#>   3.189   0.000   3.189

Created on 2018-06-12 by the reprex package (v0.2.0).

@randomgambit
Author

damn. what can I do to help you out here? My data has a LOT of NAs, for instance. Could that cause the issue? Are there any functions I can run that would give you some insight into the data I have? On a side note, I remember having a similar issue with another large dataset of prices. So there must be a broader reason why this is failing...

@krlmlr
Member

krlmlr commented Jun 12, 2018

Thanks. Can you give me the typical mean, standard deviation and NA share for the columns? It would be easiest if you could share code that generates a data frame that shows this behavior.

@randomgambit
Author

i can try. however most of the columns are of chr type, not numeric

@krlmlr
Member

krlmlr commented Jun 12, 2018

That sure helps. I'll try myself too, but it might take some time.

@randomgambit
Author

hello @krlmlr so I ran some summary stats, and one definite feature of the dataframe is that many of the columns (both character and numeric) contain a lot of missing values.

Is there any easy way to create a random tibble with character and numeric columns with a lot of NAs? I can try to generate a few to reproduce the error
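
A minimal sketch of one way to build such data in base R follows. The helper name make_sparse_df() and its parameters are made up for illustration; it alternates character and numeric columns and blanks out roughly na_share of the values in each. Wrapping the result in tibble::as_tibble() should give a tibble with the same shape.

```r
# Hypothetical generator: alternating character/numeric columns in which
# roughly `na_share` of the values are NA.
make_sparse_df <- function(n_rows, n_cols, na_share = 0.8, seed = 1) {
  set.seed(seed)
  # Small pool of random 8-letter strings to sample character values from.
  pool <- replicate(50, paste(sample(letters, 8, replace = TRUE), collapse = ""))
  cols <- lapply(seq_len(n_cols), function(j) {
    x <- if (j %% 2 == 0) {
      rnorm(n_rows)                          # even columns: numeric
    } else {
      sample(pool, n_rows, replace = TRUE)   # odd columns: character
    }
    x[runif(n_rows) < na_share] <- NA        # knock out a random share of values
    x
  })
  names(cols) <- paste0("X", seq_len(n_cols))
  as.data.frame(cols, stringsAsFactors = FALSE)
}

sparse <- make_sparse_df(n_rows = 1000, n_cols = 6, na_share = 0.8)
str(sparse)
```

Scaling n_rows and n_cols up toward the real data's dimensions gives a shareable stand-in without exposing the original values.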

@randomgambit
Author

@krlmlr I was able to reproduce the error!!! seems to be related to the NAs
Try that on your machine


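library(tibble)
library(dplyr)

N <- 350
M <- 1e6

# list1: N all-NA columns; list2: N columns of 32-character strings
list1 <- setNames(rep(list(rep(NA, M)), N), paste0("X", seq_len(N)))
list2 <- setNames(rep(list(rep('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', M)), N), paste0("X", seq_len(N)))

df1 <- tibble(!!!list1)
df2 <- tibble(!!!list2)

# Stack the two so every column is half strings, half NA
df <- bind_rows(df2, df1)

df %>% head()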

@randomgambit
Author

@krlmlr have you tried yet?

@randomgambit
Author

@krlmlr are you still there? :)

@randomgambit
Author

you have to set

> options(tibble.width = Inf)
> options(tibble.print_max = 1000, tibble.print_min =20)
> options("digits.secs"=3)
> options(pillar.sigfig = 7)

to make your computer implode

@hadley
Member

hadley commented Jun 27, 2018

@randomgambit Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that we can easily re-run it in a local session.

@krlmlr you should be asking for a reprex first in these situations to avoid a lot of back and forth

@randomgambit
Author

Hello @hadley @krlmlr here is the reprex

library(tibble)
#> Warning: package 'tibble' was built under R version 3.4.3
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.4.4
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

options(tibble.width = Inf)
options(tibble.print_max = 1000, tibble.print_min =20)
options("digits.secs"=3)
options(pillar.sigfig = 7)

N <- 350
M <- 1e6

list1 <- setNames(rep(list(rep(NA, M)), N), paste0("X", seq_len(N)))
list2 <- setNames(rep(list(rep('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', M)), N), paste0("X", seq_len(N)))


df1 <- tibble(!!!list1)
df2 <- tibble(!!!list2)


df <- bind_rows(df2, df1)

Created on 2018-06-27 by the reprex package (v0.2.0).

@randomgambit
Author

if I dare try to add the df %>% head() to the reprex, it never renders. This is some crazy bug my dear fellow R gurus :)

@krlmlr
Member

krlmlr commented Jul 2, 2018

@hadley: The routine that assigns pillars to tiers gets very slow if too many tiers are permitted (options(tibble.width = Inf)). I could rewrite the routine to make it faster, but for now I suggest limiting the output to 6 tiers or so (the example then prints after ~0.6 seconds).

@hadley
Member

hadley commented Jul 2, 2018

How hard would it be to switch to simple greedy layout if >2 tiers?
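
For readers unfamiliar with the term, a greedy layout fills one tier from left to right and starts a new tier only when the next column no longer fits. The sketch below illustrates that idea only; greedy_tiers() is a hypothetical function and is not the code used in pillar. It assumes each column has a fixed width and that adjacent columns are separated by one space.

```r
# Hypothetical illustration of a greedy tier layout (not pillar's code).
# `widths`: column widths in characters; `tier_width`: characters per tier.
# Returns the tier index for each column, NA for columns beyond `max_tiers`.
greedy_tiers <- function(widths, tier_width, max_tiers = Inf) {
  tier <- rep(NA_integer_, length(widths))
  current <- 1L
  used <- 0
  for (i in seq_along(widths)) {
    sep <- if (used > 0) 1 else 0          # one space between columns
    need <- widths[[i]] + sep
    if (used > 0 && used + need > tier_width) {
      current <- current + 1L              # column does not fit: open a new tier
      used <- 0
      need <- widths[[i]]
    }
    if (current > max_tiers) break         # remaining columns stay NA
    tier[[i]] <- current
    used <- used + need
  }
  tier
}

print(greedy_tiers(c(5, 5, 5, 5), tier_width = 11))                 # 1 1 2 2
print(greedy_tiers(c(5, 5, 5, 5), tier_width = 11, max_tiers = 1))  # 1 1 NA NA
```

Each column is visited once, so the cost grows linearly with the number of columns regardless of how many tiers are allowed.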

@randomgambit
Author

guys do i get a medal for finding this nice bug??? kidding :)

@krlmlr
Member

krlmlr commented Jul 3, 2018

I reimplemented the code that assigns pillars to tiers; I found it difficult to understand despite the comments.

A tibble with 500 columns now prints in ~4 seconds on my system (with tibble.print_min = 10). Now the coloring and formatting is the bottleneck. You need to install the latest development version of pillar.

Because print() is implemented as cat(format()), no output is seen until all output is ready to be printed.
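
The cat(format()) pattern can be seen with a toy S3 class. This is only a sketch: the class name demo_tbl and both methods are invented for illustration and are not tibble's actual code. The point is that format() has to build every line before cat() writes the first one.

```r
# Toy S3 class (hypothetical, not tibble's code) showing print() as cat(format()).
format.demo_tbl <- function(x, ...) {
  # All lines are built here, before anything is written to the console.
  vapply(seq_along(x), function(i) paste0("col ", i, ": ", x[[i]]), character(1))
}

print.demo_tbl <- function(x, ...) {
  cat(format(x, ...), sep = "\n")  # a single write once formatting has finished
  invisible(x)
}

obj <- structure(list("a", "b"), class = "demo_tbl")
print(obj)
```

If formatting is slow, the console therefore stays empty for the whole duration rather than filling in line by line.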

Bottom line, we currently can't really recommend options(tibble.width = Inf, tibble.print_max = 1000, tibble.print_min = 20) for performance reasons. I suggest limiting tibble.width and perhaps tibble.print_max.
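
One possible way to follow this advice without changing global settings is to apply the limits only around a single call. The sketch below uses base R; with_print_limits() is a hypothetical helper and the default values are arbitrary choices, not recommendations from the tibble team. The option names are the ones discussed in this thread.

```r
# Hypothetical helper: evaluate `expr` with bounded tibble print options,
# then restore whatever was set before.
with_print_limits <- function(expr, width = 200, print_max = 50, print_min = 10) {
  old <- options(
    tibble.width = width,
    tibble.print_max = print_max,
    tibble.print_min = print_min
  )
  on.exit(options(old), add = TRUE)  # restore the previous values on exit
  force(expr)                        # `expr` is lazy, so it is evaluated here
}

options(tibble.width = Inf)  # simulate the global setting from this issue
inside <- with_print_limits(getOption("tibble.width"), width = 120)
after <- getOption("tibble.width")
print(c(inside = inside, after = after))
```

In practice the expression would be something like print(df), so a wide tibble is laid out with the bounded width while the rest of the session keeps its own settings.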

@randomgambit
Author

thanks!

@krlmlr
Member

krlmlr commented Jul 3, 2018

Thank you for reporting and helping trace the problem!

@github-actions
Contributor

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 10, 2020