Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: sampling from table expressions #7139

Closed
1 task done
NickCrews opened this issue Sep 12, 2023 · 2 comments · Fixed by #7377
Closed
1 task done

feat: sampling from table expressions #7139

NickCrews opened this issue Sep 12, 2023 · 2 comments · Fixed by #7377
Labels
feature Features or general enhancements

Comments

@NickCrews
Copy link
Contributor

NickCrews commented Sep 12, 2023

Is your feature request related to a problem?

I want to make my workflows reproducible. Particularly, I am sampling some data points with table.order_by(ibis.random()).head(100) (which is a beautiful, very pleased sampling is so concise.)

But this isn't reproducible between runs.

Describe the solution you'd like

There are some significant API-impedance-matching challenges:

So I'm not sure if this should be a Connection.set_seed() function (and a ibis.set_seed() that delegates to the default backend) or a per-call ibis.random(seed) API. Perhaps we

  • support both
  • if someone calls Connection.set_seed(), then that creates a global random state
  • for each ibis.random(seed=None) call:
    • if seed is passed in
      • If the backend doesn't support, error.
      • use that.
    • else
      • if global seed has been set and the backend supports it, use that (updating it at the same time?)
      • otherwise, use no seed

What might be tricky here is the seed for each random call is determined at Operation translation time, rather than at execution time, so perhaps they will happen in different orders? Or if you translate an expression multiple times then you are modifying the global state? IDK haven't really thought this through.

What version of ibis are you running?

NA

What backend(s) are you using, if any?

personally, duckdb

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the feature Features or general enhancements label Sep 12, 2023
@cpcloud
Copy link
Member

cpcloud commented Sep 14, 2023

Thanks for the issue!

It sounds the thing you really want is reproducible sampling, regardless of how that's implemented. I understand you're using ORDER BY RANDOM() to achieve that at the moment.

I think instead of implementing support for seeding random number generators, we should add support for a table_expr.sample(...) method. In cases where a backend doesn't support TABLESAMPLE in any form, we fall back to ORDER BY RANDOM().

DuckDB (and Snowflake) have a REPEATABLE (<seed>) syntax that you can use per table expression to get repeatable result sets:

D create or replace table t as select x from range(10) _(x);
D select * from t tablesample reservoir(20%) repeatable (3);
┌───────┐
│   x   │
│ int64 │
├───────┤
│     5 │
│     9 │
└───────┘
D select * from t tablesample reservoir(20%) repeatable (3);
┌───────┐
│   x   │
│ int64 │
├───────┤
│     5 │
│     9 │
└───────┘
D select * from t tablesample reservoir(20%) repeatable (3);
┌───────┐
│   x   │
│ int64 │
├───────┤
│     5 │
│     9 │
└───────┘

@NickCrews
Copy link
Contributor Author

Yes, that is the more specific task I'm trying to do. I think that makes sense to begin with a more limited Table.sample(n: int | float, *, method: str | None, seed: int | None = None) -> Table API. Maybe add in the more general set the random seed API later.

@cpcloud cpcloud changed the title feat: Set random seed feat: sampling from table expressions Oct 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants