Is your feature request related to a problem?
I want to make my workflows reproducible. In particular, I am sampling some data points with table.order_by(ibis.random()).head(100) (which is beautiful; I'm very pleased that sampling is so concise).
But this isn't reproducible between runs.
Describe the solution you'd like
There are some significant API-impedance-matching challenges: backends differ between a setseed() function and a random(seed) API.
So I'm not sure if this should be a Connection.set_seed() function (and an ibis.set_seed() that delegates to the default backend) or a per-call ibis.random(seed) API. Perhaps we support both:
- if someone calls Connection.set_seed(), then that creates a global random state
- for each ibis.random(seed=None) call:
  - if seed is passed in:
    - if the backend doesn't support it, error
    - otherwise, use that
  - else:
    - if a global seed has been set and the backend supports it, use that (updating it at the same time?)
    - otherwise, use no seed
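To make the rules above concrete, here is a minimal sketch of the seed resolution in plain Python. The Connection class, the supports_seed flag, SeedNotSupportedError, and resolve_seed() are all hypothetical names used for illustration; none of this is existing ibis API. The open question of whether the global seed should be updated on each use is left out: this sketch leaves it unchanged.

```python
from typing import Optional


class SeedNotSupportedError(Exception):
    """Raised when a seed is requested on a backend that cannot honor it."""


class Connection:
    """Hypothetical stand-in for a backend connection (not the ibis API)."""

    def __init__(self, supports_seed: bool) -> None:
        self.supports_seed = supports_seed
        self.global_seed: Optional[int] = None

    def set_seed(self, seed: int) -> None:
        # Creates the connection-wide random state described in the list above.
        self.global_seed = seed


def resolve_seed(con: Connection, seed: Optional[int] = None) -> Optional[int]:
    """Pick the seed for a single random() call, following the proposed rules."""
    if seed is not None:
        # An explicit per-call seed wins, but only if the backend can honor it.
        if not con.supports_seed:
            raise SeedNotSupportedError("backend does not support seeded random()")
        return seed
    if con.global_seed is not None and con.supports_seed:
        # Fall back to the global seed (left unchanged in this sketch).
        return con.global_seed
    # No usable seed: the call stays unseeded.
    return None
```

With this shape, a backend without seed support silently ignores a global seed but raises on an explicit per-call seed, which matches the list above.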
What might be tricky here is the seed for each random call is determined at Operation translation time, rather than at execution time, so perhaps they will happen in different orders? Or if you translate an expression multiple times then you are modifying the global state? IDK haven't really thought this through.
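The concern about translation time can be shown with a toy model. In the sketch below, GlobalSeedState and translate() are hypothetical, and random(<seed>) is placeholder SQL text rather than any particular backend's syntax. The point is only that if the global state advances whenever an expression is translated, translating the same expression twice bakes in two different seeds.

```python
import itertools


class GlobalSeedState:
    """Hypothetical global random state that advances each time it is consulted."""

    def __init__(self, seed: int) -> None:
        self._counter = itertools.count(seed)

    def next_seed(self) -> int:
        return next(self._counter)


def translate(expr: str, state: GlobalSeedState) -> str:
    """Toy translation step: bake a seed into the SQL text at compile time."""
    return expr.replace("random()", f"random({state.next_seed()})")


state = GlobalSeedState(seed=42)
query = "SELECT * FROM t ORDER BY random()"

# Translating the same expression twice consumes two seeds from the global state.
first = translate(query, state)
second = translate(query, state)
```

Here first contains random(42) and second contains random(43), so merely compiling an expression a second time (for example, to inspect its SQL) would change the results of a later execution.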
What version of ibis are you running?
NA
What backend(s) are you using, if any?
personally, duckdb
Code of Conduct
I agree to follow this project's Code of Conduct
It sounds like the thing you really want is reproducible sampling, regardless of how that's implemented. I understand you're using ORDER BY RANDOM() to achieve that at the moment.
I think instead of implementing support for seeding random number generators, we should add support for a table_expr.sample(...) method. In cases where a backend doesn't support TABLESAMPLE in any form, we fall back to ORDER BY RANDOM().
DuckDB (and Snowflake) have a REPEATABLE (<seed>) syntax that you can use per table expression to get repeatable result sets:
D create or replace table t as select x from range(10) _(x);
D select * from t tablesample reservoir(20%) repeatable (3);
┌───────┐
│ x │
│ int64 │
├───────┤
│ 5 │
│ 9 │
└───────┘
D select * from t tablesample reservoir(20%) repeatable (3);
┌───────┐
│ x │
│ int64 │
├───────┤
│ 5 │
│ 9 │
└───────┘
D select * from t tablesample reservoir(20%) repeatable (3);
┌───────┐
│ x │
│ int64 │
├───────┤
│ 5 │
│ 9 │
└───────┘
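For intuition about why a fixed seed gives the same rows on every run, here is a small pure-Python reservoir sampler (the classic Algorithm R) driven by a seeded random.Random. This is not DuckDB's implementation, and it will not reproduce the specific rows shown above; reservoir_sample() is a name made up for this sketch.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return k rows chosen uniformly at random in a single pass (Algorithm R)."""
    rng = random.Random(seed)  # a private generator, so the seed fully determines the draws
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


# Same input and same seed: the two samples are identical.
a = reservoir_sample(range(10), 2, seed=3)
b = reservoir_sample(range(10), 2, seed=3)
```

Because every random draw comes from a generator initialized with the same seed, a and b always match, which is the same guarantee REPEATABLE (<seed>) provides per table expression.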
Yes, that is the more specific task I'm trying to do. I think it makes sense to begin with a more limited Table.sample(n: int | float, *, method: str | None, seed: int | None = None) -> Table API. Maybe add the more general set-the-random-seed API later.
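As a rough sketch of how such a sample() call might compile, the helper below builds SQL text with a TABLESAMPLE path and an ORDER BY RANDOM() fallback. sample_sql() and supports_tablesample are hypothetical, not ibis API, and the example only handles a row count. The RESERVOIR(<n> ROWS) form is an assumption modeled on DuckDB's sampling syntax; other backends may spell it differently.

```python
from typing import Optional


def sample_sql(
    table: str,
    n: int,
    *,
    seed: Optional[int] = None,
    supports_tablesample: bool = True,
) -> str:
    """Build a query that samples n rows from table (illustrative only)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if supports_tablesample:
        # Assumed DuckDB-style syntax; REPEATABLE makes the sample reproducible.
        sql = f"SELECT * FROM {table} TABLESAMPLE RESERVOIR({n} ROWS)"
        if seed is not None:
            sql += f" REPEATABLE ({seed})"
        return sql
    if seed is not None:
        # The fallback has no per-query seed, so refuse instead of silently ignoring it.
        raise ValueError("seed is not supported by the ORDER BY RANDOM() fallback")
    return f"SELECT * FROM {table} ORDER BY RANDOM() LIMIT {n}"
```

Raising on a seed in the fallback path is one possible design choice; it keeps a reproducibility request from being dropped without notice.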
cpcloud changed the title from "feat: Set random seed" to "feat: sampling from table expressions" on Oct 11, 2023