Row-wise random sampling in LazyFrames #3933
Comments
Could you clarify what this request would add on top of the existing `sample` method?

Would be good to highlight that, on the distributions side, there is an open request for adding weights: #2661
Let me see if I can help. Since it may help to demonstrate using data, let's start with this dataset of 100 records.

```python
import polars as pl

nbr_obs = 100
df = pl.DataFrame({
    'row_nr': pl.arange(0, nbr_obs, eager=True),
})
df
```
Hi @zundertj and @cbilot, thanks for the prompt replies! I am specifically talking about 1). I was only looking at the documentation for `LazyFrame`.
I had not realized that in Polars `sample` is not available on `LazyFrame`. Unfortunately, I cannot find a way to achieve this. The closest I have got without constructing the series in advance is this (using `apply`):

```python
from random import random

# need to wrap, as apply supplies the value of the column
# whilst random.random does not take one
def random2(x):
    return random()

# this adds an additional column with roughly 70% True values
df.lazy().with_column((pl.first().apply(random2) < 0.7).alias("_sample")).collect()

# but the same expression cannot be used by filter
df.lazy().filter(pl.first().apply(random2) < 0.7).collect()
```
I am also interested in this (row-wise sampling on lazy frames). @zundertj I was wondering why `df.lazy().with_column((pl.first().apply(random2) < 0.7).alias("_sample")).filter(pl.col('_sample')).collect()` would not suffice.

Edit: I didn't realize that the same expression cannot be used in `filter`.
Please separate the API for uniform sampling and weighted sampling. In Spark, if you have 1,000,000 rows in S3 and sample 100 rows, it downloads the entire dataset.
I am interested in this feature.
Me too, I was going to implement it with pyarrow, but then thought I'd try DuckDB or Polars. DuckDB has it, but it's slow. Going coding, I guess.
Still takes the whole data set; see #8664.
I would like to have it too!
If you are comfortable with using hash functions for randomness, a workaround is to hash a key column and filter on the hash value to keep roughly 10% of rows, etc.; you can change the seed to get a different sample. Obviously this is kind of ugly.
I am also looking forward to having this implemented!
A potential way to stream this is by pushing a tuple of a random uniform value and the row index into a heap, where the heap is sorted on the random value and constrained to a maximum size. It's not exactly the same as Bernoulli sampling, since it yields a fixed-size sample rather than keeping each row independently. An example is here. It does still require scanning the full column, though, so that each item gets a chance to be included in the resulting set. However, in the case where the number of elements is known in advance (or easily obtainable), the set of row indices to retain could be computed prior to fetching the column, at the expense of storing those indices in advance.
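The linked example did not survive the scrape; a rough stdlib sketch of that bounded-heap approach (the function name is my own) could be:

```python
import heapq
import random

def heap_sample(rows, k, seed=None):
    """One streaming pass: assign each row a random key, keep the k
    largest keys in a bounded min-heap, and return the surviving
    row indices. The result is a uniform sample of up to k rows."""
    rng = random.Random(seed)
    heap = []  # holds (random_key, index); root has the smallest key
    for i, _ in enumerate(rows):
        key = rng.random()
        if len(heap) < k:
            heapq.heappush(heap, (key, i))
        elif key > heap[0][0]:
            # evict the current smallest key to keep the heap bounded
            heapq.heapreplace(heap, (key, i))
    return sorted(i for _, i in heap)

print(heap_sample(range(100), 5, seed=1))
```

Note the contrast with Bernoulli sampling: the output size is fixed at `min(k, n)` rather than being random, which is closer to SQL's `LIMIT`-style reservoir sampling.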
This would be great |
Row-wise random sampling
Suppose `lf` is a `pl.LazyFrame`. One popular transformation on the lazy frame is to select rows with some preset probability, parametrized by a Bernoulli distribution with parameter `p`. In SQL this is expressed with `TABLESAMPLE BERNOULLI (10)`, which corresponds to a random sample with probability 10%. I would like a similar API at the row level for `LazyFrame`s, and I imagine the API to look much the same. Optional arguments may include `distribution="bernoulli"`, but I don't see any other type of `TABLESAMPLE` distribution out there except for `SYSTEM` from Trino, which is storage-aware.
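The issue body's code snippets were lost from the scrape, but the Bernoulli semantics it describes, keeping each row independently with probability `p`, can be sketched in plain Python (the function name here is my own):

```python
import random

def bernoulli_sample(rows, p, seed=None):
    """Keep each row independently with probability p, which is the
    behaviour of SQL's TABLESAMPLE BERNOULLI (100 * p)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p]

kept = bernoulli_sample(range(1_000), 0.1, seed=0)
print(len(kept))  # typically close to 100; the exact count varies with the seed
```

Unlike fixed-size sampling, the number of rows returned is itself random (binomially distributed), which is exactly what makes it streamable: each row can be decided independently as it arrives.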