Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Generate descriptive statistics for a ibis table #8459

Closed
1 task done
jitingxu1 opened this issue Feb 26, 2024 · 4 comments · Fixed by #8739
Closed
1 task done

feat: Generate descriptive statistics for a ibis table #8459

jitingxu1 opened this issue Feb 26, 2024 · 4 comments · Fixed by #8739
Assignees
Labels
feature Features or general enhancements

Comments

@jitingxu1
Copy link
Contributor

Is your feature request related to a problem?

I was a panda user, I used pandas dataframe describe() function a lot to get a sense of the data. I found ibis have the info() function, but it does not return enough information.

Describe the solution you'd like

Option 1:
Pandas dataframe describe -describe()

Analyzes both numeric and object series,
Numeric columns:

  • count, mean, std, min, max
  • user defined percentiles, like [0.1, 0.5, 0.9]

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

df = pd.DataFrame({'categorical': pd.Categorical(['d', 'e', 'f']),
                   'numeric': [1, 2, 3],
                   'object': ['a', 'b', 'c']
                   })
df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

What version of ibis are you running?

8.0.0

What backend(s) are you using, if any?

DuckDB

Code of Conduct

  • I agree to follow this project's Code of Conduct
@jitingxu1 jitingxu1 added the feature Features or general enhancements label Feb 26, 2024
@cpcloud
Copy link
Member

cpcloud commented Feb 26, 2024

I would look at implementing this for a few backends.

We explicitly chose not to add the quantiles because most of our backends do not support a quantile reduction.

@cpcloud
Copy link
Member

cpcloud commented Feb 28, 2024

@jitingxu1 It might be helpful if you explore the specifics here.

Since Ibis doesn't have anything like an object type, there are unanswered questions that need to be answered to achieve something like describe:

  1. How can we store the min/max of a timestamp along with numeric data? Converting timestamps to numeric values seems really odd there.
  2. What should we do about percentiles?
  3. How about strings? Technically min and max can be computed for those, but backend support varies a lot for that.

@amoeba
Copy link
Contributor

amoeba commented Mar 1, 2024

Sorry for being the "well, in R..." but, well, in R we have a neat package called skimr that produces some nice output and does some things that I think are smart. Here's an example:

> tbl(con, "example_df") |> 
  skimr::skim()

── Data Summary ────────────────────────
                           Values                
Name                       tbl(con, "example_df")
Number of rows             1000                  
Number of columns          3                     
_______________________                          
Column type frequency:                           
  character                1                     
  numeric                  2                     
________________________                         
Group variables            None                  

── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 letters               0             1   1   1     0       26          0

── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate    mean      sd p0   p25    p50    p75  p100 hist 
1 numbers               0             1    12.9    7.43  1    6     12     19     26 ▇▆▆▆▅
2 dates                 0             1 10034.  5838.    0 5325. 10256. 14989. 19723 ▇▇▇▇▆

What I like about it is that it:

  • Groups columns into categories and shows a summary that's appropriate for that type
  • Shows higher-level details like column type frequency
  • Uses sparklines :)

However it does seem to convert complex types like dates into numbers which is still useful but harder to interpret. It's also not possible to see what it'll do with list columns because

Code you can run to produce the above output
# install.packages(c("dplyr", "dbplyr", "RSQLite", "skimr"))

library(dplyr, warn.conflicts = FALSE)
library(dbplyr)

nrows <- 1000
example_df <- data.frame(
  numbers = sample(seq_along(LETTERS), nrows, replace = TRUE),
  letters = sample(LETTERS, nrows, replace = TRUE),
  dates = sample(seq(as.Date("1970-01-01"), as.Date("2024-01-01"), length.out = length(LETTERS)), nrows, replace = TRUE))

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, example_df)
tbl(con, "example_df") |> 
  skimr::skim()
Example with list columns (duckdb)
 tbl(con, "starwars") |>
+   collect() |> skimr::skim()
── Data Summary ────────────────────────
                           Values                      
Name                       collect(tbl(con, "starwar...
Number of rows             87                          
Number of columns          14                          
_______________________                                
Column type frequency:                                 
  character                8                           
  list                     3                           
  numeric                  3                           
________________________                               
Group variables            None                        

── Variable type: character ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 name                  0         1       3  21     0       87          0
2 hair_color            5         0.943   4  13     0       11          0
3 skin_color            0         1       3  19     0       31          0
4 eye_color             0         1       3  13     0       15          0
5 sex                   4         0.954   4  14     0        4          0
6 gender                4         0.954   8   9     0        2          0
7 homeworld            10         0.885   4  14     0       48          0
8 species               4         0.954   3  14     0       37          0

── Variable type: list ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate n_unique min_length max_length
1 films                 0             1       24          1          7
2 vehicles              0             1       11          0          2
3 starships             0             1       16          0          5

── Variable type: numeric ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate  mean    sd p0   p25 p50   p75 p100 hist 
1 height                6         0.931 175.   34.8 66 167   180 191    264 ▂▁▇▅▁
2 mass                 28         0.678  97.3 169.  15  55.6  79  84.5 1358 ▇▁▁▁▁
3 birth_year           44         0.494  87.6 155.   8  35    52  72    896 ▇▁▁▁▁

@lostmygithubaccount
Copy link
Member

this could also be useful with #8369 -- I'd personally love to see a package like ydata-profiling that uses Ibis to generate a bunch of statistics on the table + columns and visualizes them nicely, with the key differentiation being that it can run on 20+ backends!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants