insanely slow printing for very large dataframe #422
Thanks. I couldn't replicate this when calling format().
Hey @krlmlr, have you tried with a random data frame similar to mine?
Here's my reprex. What am I doing wrong?

library(tibble)
N <- 300
list <- setNames(seq_len(N), paste0("X", seq_len(N)))
df <- tibble(!!!list)
withr::with_options(
list(tibble.width = Inf),
system.time(format(df))
)
#> user system elapsed
#> 1.501 0.000 1.512
This is what I get. But unless I am wrong, you don't have about 1 million rows here. I guess that is what is slowing down the printing?
Happy to run other tests on my original data (I cannot share it, though).
Nope, of course. But I don't know what happens internally; maybe some dark magic where the printing becomes crazy because it loads the full data somehow. I have no clue.
I'm at a loss; your data must have something that mine's lacking. Not even setting the options used below reproduces it:

library(tibble)
N <- 300
M <- 1e6
list <- setNames(rep(list(seq_len(M)), N), paste0("X", seq_len(N)))
df <- tibble(!!!list)
df
#> # A tibble: 1,000,000 x 300
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
#> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 2 2 2 2 2 2 2 2 2 2 2 2 2
#> 3 3 3 3 3 3 3 3 3 3 3 3 3
#> 4 4 4 4 4 4 4 4 4 4 4 4 4
#> 5 5 5 5 5 5 5 5 5 5 5 5 5
#> 6 6 6 6 6 6 6 6 6 6 6 6 6
#> 7 7 7 7 7 7 7 7 7 7 7 7 7
#> 8 8 8 8 8 8 8 8 8 8 8 8 8
#> 9 9 9 9 9 9 9 9 9 9 9 9 9
#> 10 10 10 10 10 10 10 10 10 10 10 10 10
#> # ... with 999,990 more rows, and 288 more variables: X13 <int>,
#> # X14 <int>, X15 <int>, X16 <int>, X17 <int>, X18 <int>, X19 <int>,
#> # X20 <int>, X21 <int>, X22 <int>, X23 <int>, X24 <int>, X25 <int>,
#> # X26 <int>, X27 <int>, X28 <int>, X29 <int>, X30 <int>, X31 <int>,
#> # X32 <int>, X33 <int>, X34 <int>, X35 <int>, X36 <int>, X37 <int>,
#> # X38 <int>, X39 <int>, X40 <int>, X41 <int>, X42 <int>, X43 <int>,
#> # X44 <int>, X45 <int>, X46 <int>, X47 <int>, X48 <int>, X49 <int>,
#> # X50 <int>, X51 <int>, X52 <int>, X53 <int>, X54 <int>, X55 <int>,
#> # X56 <int>, X57 <int>, X58 <int>, X59 <int>, X60 <int>, X61 <int>,
#> # X62 <int>, X63 <int>, X64 <int>, X65 <int>, X66 <int>, X67 <int>,
#> # X68 <int>, X69 <int>, X70 <int>, X71 <int>, X72 <int>, X73 <int>,
#> # X74 <int>, X75 <int>, X76 <int>, X77 <int>, X78 <int>, X79 <int>,
#> # X80 <int>, X81 <int>, X82 <int>, X83 <int>, X84 <int>, X85 <int>,
#> # X86 <int>, X87 <int>, X88 <int>, X89 <int>, X90 <int>, X91 <int>,
#> # X92 <int>, X93 <int>, X94 <int>, X95 <int>, X96 <int>, X97 <int>,
#> # X98 <int>, X99 <int>, X100 <int>, X101 <int>, X102 <int>, X103 <int>,
#> # X104 <int>, X105 <int>, X106 <int>, X107 <int>, X108 <int>,
#> # X109 <int>, X110 <int>, X111 <int>, X112 <int>, …
withr::with_options(
list(tibble.width = Inf, tibble.print_max = 1000),
system.time(format(df))
)
#> user system elapsed
#> 3.189 0.000 3.189

Created on 2018-06-12 by the reprex package (v0.2.0).
Damn. What can I do to help you out here? My data has a LOT of NAs, for instance; could that cause the issue? Are there any functions I can run that would give you some insight into the data I have? On a side note, I remember having a similar issue with another large dataset of prices, so there must be a broader reason why this is failing...
Thanks. Can you give me the typical mean, SD, and NA share for the columns? It would be easiest if you could share code that generates a data frame that shows this behavior.
I can try. However, most of the columns are of chr type, not numeric.
That sure helps. I'll try myself too, but it might take some time.
Hello @krlmlr, so I ran some summary stats, and a defining feature of the data frame is that it contains a lot of missing values in many of the (either character or numeric) columns. Is there an easy way to create a random tibble with character and numeric columns with a lot of NAs? I can try to generate a few to reproduce the error.
@krlmlr I was able to reproduce the error!!! It seems to be related to the missing values.
@krlmlr have you tried yet?
@krlmlr are you still there? :)
You have to set the options shown in the reprex below to make your computer implode.
@randomgambit Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that we can easily re-run it in a local session. @krlmlr, you should ask for a reprex first in these situations to avoid a lot of back and forth.
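For anyone following along, a minimal sketch of that workflow (assuming the reprex package is installed; the snippet inside the call is made up for illustration):

```r
# reprex::reprex() renders a self-contained code snippet to markdown,
# interleaving the code with its output so others can re-run it verbatim.
# Passing an expression directly avoids relying on the clipboard.
reprex::reprex({
  library(tibble)
  df <- tibble(x = 1:3, y = c("a", "b", "c"))
  df
})
```

The rendered markdown is copied to the clipboard (and previewed in the viewer in RStudio), ready to paste into a GitHub comment.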
Hello @hadley @krlmlr, here is the reprex:

library(tibble)
#> Warning: package 'tibble' was built under R version 3.4.3
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.4.4
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
options(tibble.width = Inf)
options(tibble.print_max = 1000, tibble.print_min = 20)
options("digits.secs" = 3)
options(pillar.sigfig = 7)
N <- 350
M <- 1e6
list1 <- setNames(rep(list(rep(NA, M)), N), paste0("X", seq_len(N)))
list2 <- setNames(rep(list(rep('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', M)), N), paste0("X", seq_len(N)))
df1 <- tibble(!!!list1)
df2 <- tibble(!!!list2)
df <- bind_rows(df2, df1)

Created on 2018-06-27 by the reprex package (v0.2.0).
If I dare try to add the ...
@hadley: The routine that assigns pillars to tiers gets very slow if too many tiers are permitted (e.g., with tibble.width = Inf).
How hard would it be to switch to a simple greedy layout if >2 tiers are needed?
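For illustration, a greedy tier assignment along those lines could look like the sketch below. The function and argument names are hypothetical, not tibble's internals; it just shows the idea that a greedy pass is linear in the number of pillars, unlike an exhaustive layout search.

```r
# Greedy tier layout sketch: walk the pillars (columns) left to right,
# filling the current tier until the screen width is exhausted, then
# start a new tier. One pass, O(n) in the number of pillars.
greedy_tiers <- function(widths, screen_width) {
  tier <- integer(length(widths))
  current <- 1L
  used <- 0L
  for (i in seq_along(widths)) {
    if (used > 0L && used + widths[[i]] > screen_width) {
      current <- current + 1L  # start a new tier
      used <- 0L
    }
    tier[[i]] <- current
    used <- used + widths[[i]] + 1L  # +1 for the separating space
  }
  tier
}

greedy_tiers(c(10, 20, 15, 30, 5), screen_width = 40)
# 1 1 2 3 3
```

The trade-off is that a greedy pass may waste some horizontal space compared to an optimal assignment, but it degrades gracefully for hundreds of columns.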
Guys, do I get a medal for finding this nice bug??? Kidding :)
I reimplemented the code that assigns pillars to tiers; I found the original difficult to understand despite the comments. A tibble with 500 columns now prints in ~4 seconds on my system (with the options from the reprex above). Bottom line: we currently can't really recommend these settings for very wide data frames.
Thanks!
Thank you for reporting and helping trace the problem!
This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary. |
Hello the tibble team,

I have this dataset that takes about 1 GB in size (on disk); it contains approximately 900k rows and 300 columns. The columns contain text and numbers. Loading and processing the data is easy, as I have about 128 GB of RAM on my machine. However, unexpectedly, just running head(2) and asking to show all the columns takes forever on my machine (I am still waiting, 15 minutes later). Doing the same thing in Python with pandas takes a few seconds. What is the bottleneck here? Are there any fixes?

Thanks!
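As a stopgap, a quick peek at such a data frame can bypass the slow wide-tibble layout entirely. The following sketch uses only base R plus tibble; the toy data frame stands in for the real data:

```r
library(tibble)

# Toy stand-in for the real 900k x 300 data set.
df <- as_tibble(as.data.frame(matrix(1, nrow = 5, ncol = 300)))

# str() gives a compact structural overview without any column layout work.
str(df, list.len = 5)

# Converting the head to a plain data.frame uses base R's printer,
# which does no per-column tier layout.
print(as.data.frame(head(df, 2)))
```

This sidesteps the pillar formatting that is slow for hundreds of columns, at the cost of tibble's nicer type annotations.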