feat: sampling from table expressions #7139

NickCrews · 2023-09-12T20:49:46Z

Is your feature request related to a problem?

I want to make my workflows reproducible. In particular, I am sampling some data points with table.order_by(ibis.random()).head(100) (which is beautiful; I'm very pleased that sampling is so concise).

But this isn't reproducible between runs.

Describe the solution you'd like

There are some significant API-impedance-matching challenges:

- DuckDB has a global setseed() function
- Snowflake uses a random(seed) API
- BigQuery doesn't appear to have anything like this

So I'm not sure if this should be a Connection.set_seed() function (and an ibis.set_seed() that delegates to the default backend) or a per-call ibis.random(seed) API. Perhaps we support both:

- If someone calls Connection.set_seed(), that creates a global random state.
- For each ibis.random(seed=None) call:
  - If seed is passed in:
    - If the backend doesn't support seeding, error.
    - Otherwise, use that seed.
  - Else:
    - If a global seed has been set and the backend supports it, use that (updating it at the same time?).
    - Otherwise, use no seed.

What might be tricky here is that the seed for each random call is determined at Operation translation time, rather than at execution time, so perhaps the calls will happen in different orders? Or if you translate an expression multiple times, then you are modifying the global state? I don't know; I haven't really thought this through.
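The resolution order proposed above can be sketched in plain Python. This is only an illustration: SeedState, UnsupportedSeedError, resolve(), and the "advance by one" update rule are hypothetical names and choices made for this sketch, not part of ibis.

```python
from typing import Optional


class UnsupportedSeedError(Exception):
    """Raised when a seed is requested on a backend that cannot honor it."""


class SeedState:
    """Hypothetical per-connection seed state (not an ibis class)."""

    def __init__(self, supports_seed: bool) -> None:
        self.supports_seed = supports_seed
        self.global_seed: Optional[int] = None

    def set_seed(self, seed: int) -> None:
        # Stand-in for the proposed Connection.set_seed().
        self.global_seed = seed

    def resolve(self, seed: Optional[int] = None) -> Optional[int]:
        """Return the seed one random() call should use, or None for no seed."""
        if seed is not None:
            if not self.supports_seed:
                raise UnsupportedSeedError("backend cannot seed random()")
            return seed
        if self.global_seed is not None and self.supports_seed:
            current = self.global_seed
            # Assumption: advance the global state so successive calls differ.
            self.global_seed = current + 1
            return current
        return None
```

Note that calling resolve() twice for the same expression returns two different seeds once a global seed is set, which is exactly the translation-time concern raised above.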

What version of ibis are you running?

NA

What backend(s) are you using, if any?

personally, duckdb

Code of Conduct

I agree to follow this project's Code of Conduct

cpcloud · 2023-09-14T13:04:33Z

Thanks for the issue!

It sounds like the thing you really want is reproducible sampling, regardless of how that's implemented. I understand you're using ORDER BY RANDOM() to achieve that at the moment.

I think instead of implementing support for seeding random number generators, we should add support for a table_expr.sample(...) method. In cases where a backend doesn't support TABLESAMPLE in any form, we fall back to ORDER BY RANDOM().

DuckDB (and Snowflake) have a REPEATABLE (<seed>) syntax that you can use per table expression to get repeatable result sets:

D create or replace table t as select x from range(10) _(x);
D select * from t tablesample reservoir(20%) repeatable (3);
┌───────┐
│   x   │
│ int64 │
├───────┤
│     5 │
│     9 │
└───────┘
D select * from t tablesample reservoir(20%) repeatable (3);
┌───────┐
│   x   │
│ int64 │
├───────┤
│     5 │
│     9 │
└───────┘
D select * from t tablesample reservoir(20%) repeatable (3);
┌───────┐
│   x   │
│ int64 │
├───────┤
│     5 │
│     9 │
└───────┘
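As a rough illustration of the fallback idea, the choice between TABLESAMPLE ... REPEATABLE and ORDER BY RANDOM() could look like the string builder below. The helper sample_sql and its parameters (supports_tablesample, fallback_limit) are hypothetical and are not how ibis compiles expressions; the TABLESAMPLE syntax is the DuckDB form shown above, and the fallback assumes the caller supplies an explicit row limit.

```python
from typing import Optional


def sample_sql(
    table: str,
    percent: float,
    *,
    seed: Optional[int] = None,
    supports_tablesample: bool = True,
    fallback_limit: Optional[int] = None,
) -> str:
    """Build a sampling query string (illustrative helper, not ibis's compiler)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if supports_tablesample:
        sql = f"SELECT * FROM {table} TABLESAMPLE RESERVOIR({percent:g}%)"
        if seed is not None:
            # REPEATABLE makes the sample deterministic for a given seed.
            sql += f" REPEATABLE ({seed})"
        return sql
    # Fallback path: shuffle the rows and keep the first few.
    if seed is not None:
        raise ValueError("the ORDER BY RANDOM() fallback cannot honor a seed")
    if fallback_limit is None:
        raise ValueError("fallback_limit is required without TABLESAMPLE")
    return f"SELECT * FROM {table} ORDER BY RANDOM() LIMIT {fallback_limit}"
```

For example, sample_sql("t", 20, seed=3) produces the same kind of statement as the DuckDB session above, while passing supports_tablesample=False with fallback_limit=2 produces the ORDER BY RANDOM() form.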

NickCrews · 2023-09-14T18:35:50Z

Yes, that is the more specific task I'm trying to do. I think it makes sense to begin with a more limited Table.sample(n: int | float, *, method: str | None, seed: int | None = None) -> Table API. Maybe add the more general "set the random seed" API later.
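To make the proposed signature concrete, here is a minimal in-memory sketch over a plain Python sequence. It is not the ibis implementation. Two assumptions are mine rather than stated in the thread: an int n is read as a row count and a float n as a fraction between 0 and 1, and only method=None is handled.

```python
import random
from typing import List, Optional, Sequence, TypeVar, Union

T = TypeVar("T")


def sample(
    rows: Sequence[T],
    n: Union[int, float],
    *,
    method: Optional[str] = None,
    seed: Optional[int] = None,
) -> List[T]:
    """Return a random subset of rows; the same seed gives the same subset."""
    if method is not None:
        # Assumption: sampling methods are out of scope for this sketch.
        raise NotImplementedError("only method=None is sketched here")
    if isinstance(n, float):
        # Assumption: a float is a fraction of the rows.
        if not 0.0 <= n <= 1.0:
            raise ValueError("a float n must be between 0 and 1")
        k = round(n * len(rows))
    else:
        # Assumption: an int is a row count, capped at the table size.
        if n < 0:
            raise ValueError("an int n must be non-negative")
        k = min(n, len(rows))
    rng = random.Random(seed)  # private generator; global state is untouched
    return rng.sample(list(rows), k)
```

Because the generator is local to the call, two calls such as sample(rows, 10, seed=42) return identical results without any global seed being set.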

NickCrews added the "feature" label (Features or general enhancements) on Sep 12, 2023

cpcloud changed the title from "feat: Set random seed" to "feat: sampling from table expressions" on Oct 11, 2023

cpcloud mentioned this issue Oct 17, 2023

feat(api): Add Table.sample #7377

Merged

jcrist closed this as completed in #7377 Oct 17, 2023

NickCrews mentioned this issue Jan 21, 2024

feat: make it possible to pass a seed parameter to ibis.random #8054

Open

1 task
