
Bump Polars 0.37 #861

Merged: 4 commits into elixir-explorer:main on Feb 25, 2024

Conversation

lkarthee (Member):

No description provided.

categories = categories |> distinct() |> cast(:category)
apply_series(series, :categorise, [categories])
end

lkarthee (Member Author):

Moved the string-series and list-of-strings categorise handling to here.

  • instead of throwing errors, it applies distinct to the categories series
  • the nil check is missing - should we add a nil_count check?

Member:

> instead of throwing errors, it applies distinct to categories series

I don't think we should do this because we are mapping indexes into the list. If you remove duplicates, the indexes are shifted, and the result changes. Is there a reason we removed the Rust code responsible for this?
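
A small sketch of the concern, assuming Explorer.Series.categorise/2 maps integer codes to positions in the given categories (the values here are made up for illustration):

codes = Explorer.Series.from_list([0, 1, 2])
categories = ["a", "a", "b"]

# with the duplicates kept, code 2 refers to "b" (position 2 in the list)
Explorer.Series.categorise(codes, categories)

# applying distinct/1 first would shrink the list to ["a", "b"], so code 1
# would now mean "b" instead of "a" and code 2 would point past the end -
# silently deduplicating changes the result instead of surfacing the problem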

lkarthee (Member Author):

Ok.

https://pola.rs/posts/polars-string-type/

This caused errors in Series.categorise and in the encoding of strings/binaries.

lkarthee (Member Author):

Hit a wall getting RevMapping to work in Series.categorise, so I tried whether it can be fixed in Elixir.

Member:

We can do it in Elixir, but we need to get the unique_count and raise if it differs from the size, and get the nil_count and raise if it is non-zero.
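
A minimal sketch of that validation (validate_categories!/1 is a hypothetical helper name; it assumes Explorer.Series.size/1, n_distinct/1, and nil_count/1):

defp validate_categories!(categories) do
  size = Explorer.Series.size(categories)

  # raise instead of silently applying distinct, so duplicate categories
  # cannot shift the index mapping
  if Explorer.Series.n_distinct(categories) != size do
    raise ArgumentError, "categories must be unique"
  end

  # raise when any category is nil
  if Explorer.Series.nil_count(categories) > 0 do
    raise ArgumentError, "categories cannot contain nils"
  end

  categories
end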

s.str()?.into_iter().map(|option| option.encode(env))
))
}

lkarthee (Member Author):

I am not confident about generic_string_series_to_list and generic_binary_series_to_list. Feel free to make them better; I have a feeling that I messed something up here.

lkarthee (Member Author):

@josevalim is this code OK?

Member:

I have no idea, we will need to wait for @philss' review. :)

lkarthee (Member Author), Feb 17, 2024:

I am not sure about one thing: performance. Earlier, binaries and strings were created from the buffers using binaries and sub-binaries. Now, because of the new Arrow implementation, they expose a BinaryArrayView that yields the values, so I am not sure whether there is a way to get the earlier implementation working, or whether the new one is as good as the old one.

Member:

I think code-wise it looks good! But I think we need a benchmark to be sure about performance.

Member:

No problem! I will take care of it. Thanks, and safe travels!

philss (Member):

Hey! Sorry for the delay. I found something interesting about this.

For the test, I'm comparing how much time it takes to encode a binary series and how many MB of RAM it uses. I'm using a custom allocator called PeakAlloc for this.

The patch with the diffs I made to measure this is at https://gist.github.com/philss/3f519ce2587461aa61d7fd9f77e3ea1f (only the Cargo.lock is left out to avoid conflicts).

Timing and memory usage - main branch

Running on main without calling GC:

$ MIX_ENV=prod mix run -e 'for i <- [:medium, :medium, :big, :medium, :medium], do: :timer.tc(fn -> Explorer.DataFrame.from_parquet!("./tmp/#{i}.parquet")["bins"] |> Explorer.Series.to_list() end) |> tap(fn {time, _val} -> IO.puts("done in: #{time / 1_000_000}") end)'

Results:

Begin: 30.768875 MB of RAM.
End: 30.769852 MB of RAM.
done in: 0.099455

Begin: 61.288383 MB of RAM.
End: 61.288383 MB of RAM.
done in: 0.101318

Begin: 3082.6296 MB of RAM.
End: 3082.6296 MB of RAM.
done in: 7.900625

Begin: 3082.6296 MB of RAM.
End: 3082.6296 MB of RAM.
done in: 2.738504

Begin: 61.288383 MB of RAM.
End: 61.288383 MB of RAM.
done in: 0.083659

We can see above that the memory grows, but soon it drops again when we try with a smaller file. It looks like the GC takes a while to run and drop the encoded content.

Timing and memory usage - this PR branch

$ MIX_ENV=prod mix run -e 'for i <- [:medium, :medium, :big, :medium, :medium, :small, :small], do: :timer.tc(fn -> Explorer.DataFrame.from_parquet!("./tmp/#{i}.parquet")["bins"] |> Explorer.Series.to_list() end) |> tap(fn {time, _val} -> IO.puts("done in: #{time / 1_000_000}"); :erlang.garbage_collect(); Process.sleep(1_000) end)'

Even with explicit GC runs (I had tried without them before), the memory does not drop after a while:

Begin: 39.60874 MB of RAM.
End: 39.60972 MB of RAM.
done in: 0.106566

Begin:  78.96812 MB of RAM.
End:  78.96812 MB of RAM.
done in: 0.130126

Begin:  4662.26 MB of RAM.
End:  4662.26 MB of RAM.
done in: 9.069385

Begin:  4701.618 MB of RAM.
End:  4701.618 MB of RAM.
done in: 0.042475

Begin:  4740.9766 MB of RAM.
End:  4740.9766 MB of RAM.
done in: 0.049214

Begin:  4741.016 MB of RAM.
End:  4741.016 MB of RAM.
done in: 0.323967

Begin:  4741.0557 MB of RAM.
End:  4741.0557 MB of RAM.
done in: 0.327083

And if we load the DF without encoding the series, the memory is dropped after a while. See below:

MIX_ENV=prod mix run -e 'for i <- [:medium, :medium, :big, :medium, :medium, :small, :small], do: :timer.tc(fn -> Explorer.DataFrame.from_parquet!("./tmp/#{i}.parquet")["bins"]; Explorer.Series.from_list([1, 2]) |> Explorer.Series.to_list() end) |> tap(fn {time, _val} -> IO.puts("done in: #{time / 1_000_000}"); Process.sleep(1_000) end)'
Begin: 39.609077 MB of RAM.
End: 39.610054 MB of RAM.
done in: 0.026808

Begin: 78.96885 MB of RAM.
End: 78.96885 MB of RAM.
done in: 0.02043

Begin: 4662.2607 MB of RAM.
End: 4662.2607 MB of RAM.
done in: 0.453696

Begin: 4701.6196 MB of RAM.
End: 4701.6196 MB of RAM.
done in: 0.012698

Begin: 4740.9785 MB of RAM.
End: 4740.9785 MB of RAM.
done in: 0.015301

Begin: 4741.0186 MB of RAM.
End: 4741.0186 MB of RAM.
done in: 3.74e-4

Begin: 0.2911501 MB of RAM.
End: 0.2911501 MB of RAM.
done in: 0.27713

So I suspect the new to_list/1 implementation for binary series is leaking something.
My plan is to investigate the new APIs further and see whether the problem is on the Polars side or on ours.

PS: The parquet files used in this experiment were made by creating a DF with a column called bins that contains random bytes (using :crypto.strong_rand_bytes(24)). The file sizes are as follows:

-rw-r--r--. 1 philip philip 2.7G Feb 22 16:07 tmp/big.parquet (100_000_000 rows)
-rw-r--r--. 1 philip philip  27M Feb 22 15:49 tmp/medium.parquet (100_000 rows)
-rw-r--r--. 1 philip philip  28K Feb 22 15:48 tmp/small.parquet (1_000 rows)
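
For reference, a sketch of how such a file could be generated (the row count and output path here match the medium case above; DataFrame.new/1, Series.from_list/2, and to_parquet!/2 are assumed to be the relevant Explorer calls):

rows = 100_000

# one column named "bins" holding 24 random bytes per row
bins = for _ <- 1..rows, do: :crypto.strong_rand_bytes(24)

df = Explorer.DataFrame.new(bins: Explorer.Series.from_list(bins, dtype: :binary))
Explorer.DataFrame.to_parquet!(df, "tmp/medium.parquet")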

Member:

OK, I'm pretty sure this is the expected behavior, since Polars changed its internal string/binary series representation. According to the blog post "Why we have rewritten the string data type", binaries may stay in memory for longer (even requiring GC), and they may require more space as well. So, as far as I understand, what we had previously is not achievable in this new model.

lkarthee (Member Author):

So that resolves the last pending item in the review?

Member:

Yeah, I would say it does. We are just deciding if we are going to release a version before merging this one.

@@ -602,7 +602,7 @@ pub fn expr_last(expr: ExExpr) -> ExExpr {

 #[rustler::nif]
 pub fn expr_format(exprs: Vec<ExExpr>) -> ExExpr {
-    ExExpr::new(concat_str(ex_expr_to_exprs(exprs), ""))
+    ExExpr::new(concat_str(ex_expr_to_exprs(exprs), "", true)) // TODO: ignore_nulls
lkarthee (Member Author), Feb 16, 2024:

My reading of the docs is that ignore_nulls needs to be true. Let me know if it should be false.
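
A hedged illustration of what the flag controls on the Elixir side (assuming Explorer.Series.format/1 is what ends up calling expr_format, and that ignore_nulls = true skips nil entries instead of turning the whole row into nil):

a = Explorer.Series.from_list(["x", nil, "z"])
b = Explorer.Series.from_list(["1", "2", "3"])

Explorer.Series.format([a, "-", b])
# with ignore_nulls = true the second row would be "-2";
# with ignore_nulls = false it would be nil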

Member:

I would like @cigrainger's and @billylanchantin's opinions on this one (and we should write tests).

lkarthee (Member Author), Feb 17, 2024:

Also, should we add a new ignore_nulls param, just like Polars?


billylanchantin (Contributor):

> also should we add new param ignore_nulls just like polars ?

I would be fine leaving this as a TODO for this PR. After reading:

I can see the value in having it. But the additional testing/documentation would necessitate a lot more work which we could instead do in a follow-up.

Member:

I agree with @billylanchantin. We can do it in a follow-up.

@@ -149,4 +149,3 @@ sepal_length,sepal_width,petal_length,petal_width,species
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica

lkarthee (Member Author):

The trailing empty line is causing an extra row to be added to the data frame, and its related tests are failing.
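
A sketch of the failure mode (assuming Explorer.DataFrame.load_csv!/1 parses an in-memory CSV string; the data is made up):

csv = "a,b\n1,2\n\n"

Explorer.DataFrame.load_csv!(csv)
# the trailing blank line can be read as an extra all-nil row,
# which is what made the fixture-based tests fail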

// }
// DataType::Binary => {
// generic_binary_series_to_list(&s.resource, s.binary()?.downcast_iter(), env)
// }
lkarthee (Member Author):

I will delete the commented-out code after review.

@@ -1051,7 +1055,7 @@ pub fn expr_second(expr: ExExpr) -> ExExpr {
 pub fn expr_join(expr: ExExpr, sep: String) -> ExExpr {
     let expr = expr.clone_inner();

-    ExExpr::new(expr.list().join(sep.lit()))
+    ExExpr::new(expr.list().join(sep.lit(), true)) // TODO: ignore_nulls
Member:

I would like @cigrainger's and @billylanchantin's opinions on this one (and we should write tests). Should join discard nulls? And what does it mean to ignore them: are they removed, or are they treated as empty strings?

In Elixir:

iex(2)> Enum.join [1, nil, 3], ","
"1,,3"

lkarthee (Member Author), Feb 17, 2024:

https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.list.join.html

If there is a null value, the whole output becomes null.

> Ignore null values (default).
> If set to False, null values will be propagated. If the sub-list contains any null values, the output is None.

Context:
pola-rs/polars#13701
pola-rs/polars#13877

pola-rs/polars#13877 (comment)
pola-rs/polars#13701 (comment)

Member:

Thank you. I definitely prefer to ignore nulls then. Having the whole thing return null is surprising and hard to debug. So let's add tests and remove the comment and we are good to go on this one. :)
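
A sketch of the kind of test this suggests (assuming Explorer.Series.join/2 is the Elixir entry point for expr_join, and that ignore_nulls = true means nils are skipped rather than propagated):

series = Explorer.Series.from_list([["a", nil, "c"], ["x", "y", nil]])

Explorer.Series.join(series, ",")
# expected with ignore_nulls = true: a series containing "a,c" and "x,y";
# with ignore_nulls = false both rows would instead be nil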

Contributor:

Agree 100%

Member:

Yep 100%

@@ -259,18 +259,6 @@ pub fn s_slice(series: ExSeries, offset: i64, length: usize) -> Result<ExSeries,
Ok(ExSeries::new(series.slice(offset, length)))
}

#[rustler::nif(schedule = "DirtyCpu")]
lkarthee (Member Author), Feb 18, 2024:

Shifted to the Elixir side for eager series, to align with ignore_nulls.

cigrainger mentioned this pull request on Feb 19, 2024

lib/explorer/series.ex (outdated review comment, resolved)
philss (Member) left a review:

After applying José's suggestions, 🚢

Co-authored-by: José Valim <jose.valim@gmail.com>
josevalim merged commit 7f168f5 into elixir-explorer:main on Feb 25, 2024; 4 checks passed.
josevalim (Member):
💚 💙 💜 💛 ❤️

lkarthee deleted the bump_polars_v0_37 branch on February 25, 2024 at 15:35.
5 participants