feat: Generate descriptive statistics for a ibis table #8459

jitingxu1 · 2024-02-26T18:26:14Z

Is your feature request related to a problem?

I was a pandas user and used the pandas DataFrame describe() function a lot to get a sense of the data. I found that Ibis has the info() function, but it does not return enough information.

Describe the solution you'd like

Option 1:
pandas DataFrame.describe()

It analyzes both numeric and object series.

Numeric columns:

count, mean, std, min, max
user-defined percentiles, like [0.1, 0.5, 0.9]

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

import pandas as pd

df = pd.DataFrame({'categorical': pd.Categorical(['d', 'e', 'f']),
                   'numeric': [1, 2, 3],
                   'object': ['a', 'b', 'c']
                   })
df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
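
For reference, here is a minimal sketch of the two behaviours described above: user-defined percentiles, and the extra rows reported for non-numeric columns. The column names and values are made up for illustration; `percentiles` and `include` are the pandas `describe()` parameters involved.

```python
import pandas as pd

# Illustrative data only; not taken from a real table.
df = pd.DataFrame({
    "numeric": [1, 2, 3],
    "object": ["a", "b", "b"],
})

# User-defined percentiles replace the default 25%/50%/75% rows.
numeric_summary = df.describe(percentiles=[0.1, 0.5, 0.9])

# include="all" summarizes every column; non-numeric columns get
# count/unique/top/freq, and statistics that do not apply are NaN.
full_summary = df.describe(include="all", percentiles=[0.1, 0.5, 0.9])

print(numeric_summary)
print(full_summary)
```

Something with roughly this shape (numeric stats plus count/unique/top/freq for strings) is what I would like to get from an Ibis table.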

What version of ibis are you running?

8.0.0

What backend(s) are you using, if any?

DuckDB

Code of Conduct

I agree to follow this project's Code of Conduct


cpcloud · 2024-02-26T23:20:42Z

I would look at implementing this for a few backends.

We explicitly chose not to add the quantiles because most of our backends do not support a quantile reduction.

cpcloud · 2024-02-28T16:48:42Z

@jitingxu1 It might be helpful if you explore the specifics here.

Since Ibis doesn't have anything like an object type, there are open questions that need to be answered to achieve something like describe:

How can we store the min/max of a timestamp along with numeric data? Converting timestamps to numeric values seems really odd there.
What should we do about percentiles?
How about strings? Technically min and max can be computed for those, but backend support varies a lot for that.
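
One possible way around the first and third questions, sketched in plain Python rather than Ibis: keep a separate summary per column category instead of forcing everything into one numeric table, so timestamp min/max stay timestamps. `describe_by_type` and the category names are hypothetical, not an Ibis API, and percentiles are deliberately left out.

```python
from datetime import date
from statistics import mean, stdev


def describe_by_type(columns):
    """Summarize each column with statistics that fit its value type.

    `columns` maps a column name to a list of values (None means null).
    This is an illustrative helper, not part of Ibis or pandas.
    """
    summary = {"numeric": {}, "temporal": {}, "string": {}}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        stats = {"count": len(present), "nulls": len(values) - len(present)}
        if present and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        ):
            stats["mean"] = mean(present)
            # The sample standard deviation needs at least two values.
            stats["std"] = stdev(present) if len(present) > 1 else None
            stats["min"], stats["max"] = min(present), max(present)
            summary["numeric"][name] = stats
        elif present and all(isinstance(v, date) for v in present):
            # min/max keep their temporal type; no numeric conversion.
            stats["min"], stats["max"] = min(present), max(present)
            summary["temporal"][name] = stats
        else:
            # Everything else (including all-null columns) is treated
            # as string-like and only gets a distinct count.
            stats["unique"] = len(set(present))
            summary["string"][name] = stats
    return summary


result = describe_by_type({
    "numbers": [1, 2, 3, None],
    "letters": ["a", "b", "b"],
    "dates": [date(1970, 1, 1), date(2024, 1, 1)],
})
print(result)
```

The trade-off is that the result is several small tables rather than one, which is less convenient to return from a single expression.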

amoeba · 2024-03-01T01:48:20Z

Sorry for being the "well, in R..." person but, well, in R we have a neat package called skimr that produces some nice output and does some things that I think are smart. Here's an example:

> tbl(con, "example_df") |> 
  skimr::skim()

── Data Summary ────────────────────────
                           Values                
Name                       tbl(con, "example_df")
Number of rows             1000                  
Number of columns          3                     
_______________________                          
Column type frequency:                           
  character                1                     
  numeric                  2                     
________________________                         
Group variables            None                  

── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 letters               0             1   1   1     0       26          0

── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate    mean      sd p0   p25    p50    p75  p100 hist 
1 numbers               0             1    12.9    7.43  1    6     12     19     26 ▇▆▆▆▅
2 dates                 0             1 10034.  5838.    0 5325. 10256. 14989. 19723 ▇▇▇▇▆

What I like about it is that it:

Groups columns into categories and shows a summary that's appropriate for that type
Shows higher-level details like column type frequency
Uses sparklines :)

However, it does seem to convert complex types like dates into numbers, which is still useful but harder to interpret. It's also not possible to see what it'll do with list columns because

Code you can run to produce the above output

# install.packages(c("dplyr", "dbplyr", "RSQLite", "skimr"))

library(dplyr, warn.conflicts = FALSE)
library(dbplyr)

nrows <- 1000
example_df <- data.frame(
  numbers = sample(seq_along(LETTERS), nrows, replace = TRUE),
  letters = sample(LETTERS, nrows, replace = TRUE),
  dates = sample(seq(as.Date("1970-01-01"), as.Date("2024-01-01"), length.out = length(LETTERS)), nrows, replace = TRUE))

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, example_df)
tbl(con, "example_df") |> 
  skimr::skim()

Example with list columns (duckdb)

> tbl(con, "starwars") |>
  collect() |> skimr::skim()
── Data Summary ────────────────────────
                           Values                      
Name                       collect(tbl(con, "starwar...
Number of rows             87                          
Number of columns          14                          
_______________________                                
Column type frequency:                                 
  character                8                           
  list                     3                           
  numeric                  3                           
________________________                               
Group variables            None                        

── Variable type: character ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 name                  0         1       3  21     0       87          0
2 hair_color            5         0.943   4  13     0       11          0
3 skin_color            0         1       3  19     0       31          0
4 eye_color             0         1       3  13     0       15          0
5 sex                   4         0.954   4  14     0        4          0
6 gender                4         0.954   8   9     0        2          0
7 homeworld            10         0.885   4  14     0       48          0
8 species               4         0.954   3  14     0       37          0

── Variable type: list ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate n_unique min_length max_length
1 films                 0             1       24          1          7
2 vehicles              0             1       11          0          2
3 starships             0             1       16          0          5

── Variable type: numeric ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate  mean    sd p0   p25 p50   p75 p100 hist 
1 height                6         0.931 175.   34.8 66 167   180 191    264 ▂▁▇▅▁
2 mass                 28         0.678  97.3 169.  15  55.6  79  84.5 1358 ▇▁▁▁▁
3 birth_year           44         0.494  87.6 155.   8  35    52  72    896 ▇▁▁▁▁

lostmygithubaccount · 2024-03-01T13:25:09Z

this could also be useful with #8369 -- I'd personally love to see a package like ydata-profiling that uses Ibis to generate a bunch of statistics on the table + columns and visualizes them nicely, with the key differentiation being that it can run on 20+ backends!

jitingxu1 added the feature Features or general enhancements label Feb 26, 2024

github-project-automation bot added this to Ibis planning and roadmap Feb 26, 2024

github-project-automation bot moved this to backlog in Ibis planning and roadmap Feb 26, 2024

deepyaman mentioned this issue Mar 1, 2024

feat: support more parts of end-to-end ML workflow ibis-project/ibis-ml#19

Closed

13 tasks

jitingxu1 self-assigned this Mar 7, 2024

jitingxu1 moved this from backlog to todo in Ibis planning and roadmap Mar 7, 2024

jitingxu1 moved this from todo to cooking in Ibis planning and roadmap Mar 21, 2024

jitingxu1 mentioned this issue Mar 22, 2024

feat(api): add describe method to compute summary stats of table expressions #8739

Merged

jitingxu1 moved this from cooking to review in Ibis planning and roadmap Mar 29, 2024

cpcloud closed this as completed in #8739 Apr 15, 2024

github-project-automation bot moved this from review to done in Ibis planning and roadmap Apr 15, 2024
