insanely slow printing for very large dataframe #422
Thanks. I couldn't replicate this when calling format().
Hey @krlmlr, have you tried with a random data frame similar to mine?
Here's my reprex. What am I doing wrong?

library(tibble)
N <- 300
list <- setNames(seq_len(N), paste0("X", seq_len(N)))
df <- tibble(!!!list)
withr::with_options(
list(tibble.width = Inf),
system.time(format(df))
)
#> user system elapsed
#> 1.501 0.000 1.512
This is what I get. But unless I am wrong, you don't have about 1 million rows here. I guess that is what is slowing down the printing?
Happy to run other tests on my original data (I cannot share it, though).
Nope, of course. But I don't know what happens internally; maybe some dark magic where the printing becomes crazy because it loads the full data somehow. I have no clue.
I'm at a loss; your data must have something that mine's lacking. Not even setting the options used below reproduces it:

library(tibble)
N <- 300
M <- 1e6
list <- setNames(rep(list(seq_len(M)), N), paste0("X", seq_len(N)))
df <- tibble(!!!list)
df
#> # A tibble: 1,000,000 x 300
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
#> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 2 2 2 2 2 2 2 2 2 2 2 2 2
#> 3 3 3 3 3 3 3 3 3 3 3 3 3
#> 4 4 4 4 4 4 4 4 4 4 4 4 4
#> 5 5 5 5 5 5 5 5 5 5 5 5 5
#> 6 6 6 6 6 6 6 6 6 6 6 6 6
#> 7 7 7 7 7 7 7 7 7 7 7 7 7
#> 8 8 8 8 8 8 8 8 8 8 8 8 8
#> 9 9 9 9 9 9 9 9 9 9 9 9 9
#> 10 10 10 10 10 10 10 10 10 10 10 10 10
#> # ... with 999,990 more rows, and 288 more variables: X13 <int>,
#> # X14 <int>, X15 <int>, X16 <int>, X17 <int>, X18 <int>, X19 <int>,
#> # X20 <int>, X21 <int>, X22 <int>, X23 <int>, X24 <int>, X25 <int>,
#> # X26 <int>, X27 <int>, X28 <int>, X29 <int>, X30 <int>, X31 <int>,
#> # X32 <int>, X33 <int>, X34 <int>, X35 <int>, X36 <int>, X37 <int>,
#> # X38 <int>, X39 <int>, X40 <int>, X41 <int>, X42 <int>, X43 <int>,
#> # X44 <int>, X45 <int>, X46 <int>, X47 <int>, X48 <int>, X49 <int>,
#> # X50 <int>, X51 <int>, X52 <int>, X53 <int>, X54 <int>, X55 <int>,
#> # X56 <int>, X57 <int>, X58 <int>, X59 <int>, X60 <int>, X61 <int>,
#> # X62 <int>, X63 <int>, X64 <int>, X65 <int>, X66 <int>, X67 <int>,
#> # X68 <int>, X69 <int>, X70 <int>, X71 <int>, X72 <int>, X73 <int>,
#> # X74 <int>, X75 <int>, X76 <int>, X77 <int>, X78 <int>, X79 <int>,
#> # X80 <int>, X81 <int>, X82 <int>, X83 <int>, X84 <int>, X85 <int>,
#> # X86 <int>, X87 <int>, X88 <int>, X89 <int>, X90 <int>, X91 <int>,
#> # X92 <int>, X93 <int>, X94 <int>, X95 <int>, X96 <int>, X97 <int>,
#> # X98 <int>, X99 <int>, X100 <int>, X101 <int>, X102 <int>, X103 <int>,
#> # X104 <int>, X105 <int>, X106 <int>, X107 <int>, X108 <int>,
#> # X109 <int>, X110 <int>, X111 <int>, X112 <int>, …
withr::with_options(
list(tibble.width = Inf, tibble.print_max = 1000),
system.time(format(df))
)
#> user system elapsed
#> 3.189 0.000 3.189

Created on 2018-06-12 by the reprex package (v0.2.0).
Damn. What can I do to help you out here? My data has a LOT of NAs, for instance; could that cause the issue? Are there any functions I can run that would give you some insight into the data I have? On a side note, I remember having a similar issue with another large dataset of prices, so there must be a broader reason why this is failing...
Thanks. Can you give me the typical mean, SD, and NA share for the columns? It would be easiest if you could share code that generates a data frame that shows this behavior.
I can try. However, most of the columns are of chr type, not numeric.
That sure helps. I'll try myself too, but it might take some time.
Hello @krlmlr, so I ran some summary stats, and a defining feature of the data frame is that it contains a lot of missing values in many of the (either character or numeric) columns. Is there an easy way to create a random tibble with character and numeric columns with a lot of NAs? I can try to generate a few to reproduce the error.
@krlmlr I was able to reproduce the error!!! It seems to be related to the missing values.
@krlmlr have you tried yet?
@krlmlr are you still there? :)
You have to set the options shown in the reprex below to make your computer implode.
@randomgambit Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that we can easily re-run it in a local session. @krlmlr, you should ask for a reprex first in these situations to avoid a lot of back and forth.
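For anyone following along, a minimal sketch of that workflow (assuming the reprex package is installed; the snippet inside the call is made up for illustration):

```r
# reprex::reprex() renders a self-contained code snippet to markdown,
# interleaving the code with its output so others can re-run it verbatim.
# Passing an expression directly avoids relying on the clipboard.
reprex::reprex({
  library(tibble)
  df <- tibble(x = 1:3, y = c("a", "b", "c"))
  df
})
```

The rendered markdown is copied to the clipboard (and previewed in the viewer in RStudio), ready to paste into a GitHub comment.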
Hello @hadley @krlmlr, here is the reprex:

library(tibble)
#> Warning: package 'tibble' was built under R version 3.4.3
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.4.4
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
options(tibble.width = Inf)
options(tibble.print_max = 1000, tibble.print_min = 20)
options("digits.secs" = 3)
options(pillar.sigfig = 7)
N <- 350
M <- 1e6
list1 <- setNames(rep(list(rep(NA, M)), N), paste0("X", seq_len(N)))
list2 <- setNames(rep(list(rep('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', M)), N), paste0("X", seq_len(N)))
df1 <- tibble(!!!list1)
df2 <- tibble(!!!list2)
df <- bind_rows(df2, df1)

Created on 2018-06-27 by the reprex package (v0.2.0).
If I dare try to add the ...
@hadley: The routine that assigns pillars to tiers gets very slow if too many tiers are permitted (e.g., with tibble.width = Inf).
How hard would it be to switch to a simple greedy layout if >2 tiers are needed?
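For illustration, a greedy tier assignment along those lines could look like the sketch below. The function and argument names are hypothetical, not tibble's internals; it just shows the idea that a greedy pass is linear in the number of pillars, unlike an exhaustive layout search.

```r
# Greedy tier layout sketch: walk the pillars (columns) left to right,
# filling the current tier until the screen width is exhausted, then
# start a new tier. One pass, O(n) in the number of pillars.
greedy_tiers <- function(widths, screen_width) {
  tier <- integer(length(widths))
  current <- 1L
  used <- 0L
  for (i in seq_along(widths)) {
    if (used > 0L && used + widths[[i]] > screen_width) {
      current <- current + 1L  # start a new tier
      used <- 0L
    }
    tier[[i]] <- current
    used <- used + widths[[i]] + 1L  # +1 for the separating space
  }
  tier
}

greedy_tiers(c(10, 20, 15, 30, 5), screen_width = 40)
# 1 1 2 3 3
```

The trade-off is that a greedy pass may waste some horizontal space compared to an optimal assignment, but it degrades gracefully for hundreds of columns.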
Guys, do I get a medal for finding this nice bug??? Kidding :)
I reimplemented the code that assigns pillars to tiers; I found the original difficult to understand despite the comments. A tibble with 500 columns now prints in ~4 seconds on my system (with the options from the reprex above). Bottom line: we currently can't really recommend these settings for very wide data frames.
Thanks!
Thank you for reporting and helping trace the problem!
This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary. |
Hello the tibble team,

I have this dataset that takes about 1 GB in size (on disk); it contains approximately 900k rows and 300 columns. The columns contain text and numbers. Loading and processing the data is easy, as I have about 128 GB of RAM on my machine. However, unexpectedly, just running head(2) and asking to show all the columns takes forever on my machine (I am still waiting, 15 minutes later). Doing the same thing in Python with pandas takes a few seconds. What is the bottleneck here? Are there any fixes?

Thanks!
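As a stopgap, a quick peek at such a data frame can bypass the slow wide-tibble layout entirely. The following sketch uses only base R plus tibble; the toy data frame stands in for the real data:

```r
library(tibble)

# Toy stand-in for the real 900k x 300 data set.
df <- as_tibble(as.data.frame(matrix(1, nrow = 5, ncol = 300)))

# str() gives a compact structural overview without any column layout work.
str(df, list.len = 5)

# Converting the head to a plain data.frame uses base R's printer,
# which does no per-column tier layout.
print(as.data.frame(head(df, 2)))
```

This sidesteps the pillar formatting that is slow for hundreds of columns, at the cost of tibble's nicer type annotations.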