Add DataFrame.with_row_count/2 #833

iurimateus · 2024-01-20T14:53:32Z

Do note that with_row_count was renamed in polars/main to with_row_index (ref pola-rs/polars#13494), so perhaps we should wait for a new polars release?

iurimateus · 2024-01-20T14:54:19Z

lib/explorer/data_frame.ex

+  """
+  @doc type: :single
+  @spec with_row_count(df :: DataFrame.t(), opts :: Keyword.t()) :: DataFrame.t()
+  def with_row_count(%DataFrame{} = df, opts \\ []) do


Not sure if this should be with_row_count(%DataFrame{} = df, name, index)

iurimateus · 2024-01-20T14:55:25Z

test/explorer/data_frame_test.exs

@@ -4234,4 +4234,42 @@ defmodule Explorer.DataFrameTest do
             }
    end
  end
+
+  describe "with_row_count/2" do


seems redundant with doctests

josevalim

I agree that with_row_index is better, but we can use this name in Explorer and fallback to with_row_count in Polars. Although I’d like @cigrainger’s opinion on the name. We really don’t have functions starting with “with_”.

billylanchantin · 2024-01-20T15:54:36Z

I'm open to other names, but I'm personally good with with_row_index. It's what Polars is using and it's reminiscent of Enum.with_index.

cigrainger · 2024-01-20T16:53:50Z

I'm not a huge fan of with_ either. It's a bit of an interesting middle ground between the crazy indices in pandas and how in dplyr you just mutate(id = row_number()). Could we just call it row_index?

josevalim · 2024-01-20T16:57:19Z

I'm open to other names, but I'm personally good with with_row_index. It's what Polars is using and it's reminiscent of Enum.with_index.

This convinced me on going with with_row_index.

However...

in dplyr you just mutate(id = row_number())

Ideally we would implement it with the above, and then with_row_index is just a shortcut function. Do we have something like row_number() for lazy expressions in Polars?

iurimateus · 2024-01-20T17:15:13Z

in dplyr you just mutate(id = row_number())

Do we have something like row_number() for lazy expressions in Polars?

See pola-rs/polars#12420

There's int_range /pl.int_range(pl.len()) and the docs offers an alternative to with_row_index:

df.select(
    pl.int_range(pl.len(), dtype=pl.UInt32).alias("index"),
    pl.all(),
)

lib/explorer/data_frame.ex

josevalim · 2024-01-20T18:10:46Z

There's int_range /pl.int_range(pl.len()) and the docs offers an alternative to with_row_index:

Interesting. So what if we:

Add a Series.row_index()
Make it raise for the default backend saying it is only supported in dataframes
Implement it properly for expressions

Would that work? Perhaps we try that as a separate PR so we can compare them side by side?

cigrainger · 2024-01-20T19:53:01Z

I like that idea a lot @josevalim

iurimateus · 2024-01-24T10:11:48Z

I'll make a proper PR later. This works, but I don't know if it's the correct implementation.

I need some guidance for

Make it raise for the default backend saying it is only supported in dataframes

diff

commit 3d2a8bda28f8072e081b2fb248f72bad0bb76a34
Author: Iuri Mateus <20242530+iurimateus@users.noreply.github.com>
Date:   Tue Jan 23 09:17:16 2024 -0300

    wip

diff --git a/lib/explorer/backend/lazy_series.ex b/lib/explorer/backend/lazy_series.ex
index e6461a6..e99204a 100644
--- a/lib/explorer/backend/lazy_series.ex
+++ b/lib/explorer/backend/lazy_series.ex
@@ -141,7 +141,9 @@ defmodule Explorer.Backend.LazySeries do
     # List functions
     join: 2,
     lengths: 1,
-    member: 3
+    member: 3,
+    # ?
+    len: 0
   ]
 
   @comparison_operations [:equal, :not_equal, :greater, :greater_equal, :less, :less_equal]
@@ -1074,6 +1076,47 @@ defmodule Explorer.Backend.LazySeries do
     Backend.Series.new(data, :boolean)
   end
 
+  def len() do
+    data = new(:len, [], {:u, 32})
+
+    Backend.Series.new(data, {:u, 32})
+  end
+
+  def row_index() do
+    data = new(:row_index, [], {:u, 32})
+
+    Backend.Series.new(data, {:u, 32})
+  end
+
+  def int_range(start, end_, step \\ 1) do
+    # TODO check boundaries?
+
+    # ** (RuntimeError) Polars Error: lengths don't match: unable to add a column of length 2 to
+    # a DataFrame of height 3
+
+    # TODO Should we also accept a non lazy Series?
+    {start, end_} =
+      case {start, end_} do
+        {%Series{}, %Series{}} ->
+          {lazy_series!(start), lazy_series!(end_)}
+
+        {%Series{}, _} ->
+          {lazy_series!(start), end_}
+
+        {_, %Series{}} ->
+          {start, lazy_series!(end_)}
+
+        {v1, v2} when is_integer(v1) and is_integer(v2) ->
+          {v1, v2}
+
+        {_, _} ->
+          raise ArgumentError, "invalid arguments"
+      end
+
+    data = new(:int_range, [start, end_, step], {:s, 64})
+    Backend.Series.new(data, {:s, 64})
+  end
+
   @remaining_non_lazy_operations [
     at: 2,
     at_every: 2,
diff --git a/lib/explorer/polars_backend/expression.ex b/lib/explorer/polars_backend/expression.ex
index e2581a8..1dd50b2 100644
--- a/lib/explorer/polars_backend/expression.ex
+++ b/lib/explorer/polars_backend/expression.ex
@@ -155,6 +155,9 @@ defmodule Explorer.PolarsBackend.Expression do
     shift: 3,
     slice: 2,
     slice: 3,
+    row_index: 0,
+    len: 0,
+    int_range: 4,
     concat: 1,
     column: 1,
     correlation: 4,
@@ -290,6 +293,22 @@ defmodule Explorer.PolarsBackend.Expression do
     end
   end
 
+  def to_expr(%LazySeries{op: :len, args: []}) do
+    Native.expr_len()
+  end
+
+  def to_expr(%LazySeries{op: :row_index, args: []}) do
+    Native.expr_int_range(to_expr(0), Native.expr_len(), 1, {:u, 32})
+  end
+
+  def to_expr(%LazySeries{
+        op: :int_range,
+        args: [start, end_, step],
+        dtype: dtype
+      }) do
+    Native.expr_int_range(to_expr(start), to_expr(end_), step, dtype)
+  end
+
   for {op, arity} <- @all_expressions do
     args = Macro.generate_arguments(arity, __MODULE__)
 
diff --git a/lib/explorer/polars_backend/native.ex b/lib/explorer/polars_backend/native.ex
index 8557778..1f4b8c6 100644
--- a/lib/explorer/polars_backend/native.ex
+++ b/lib/explorer/polars_backend/native.ex
@@ -213,6 +213,9 @@ defmodule Explorer.PolarsBackend.Native do
   def expr_series(_series), do: err()
   def expr_string(_string), do: err()
 
+  # def expr_len(), do: err()
+  def expr_int_range(_start, _end, _step, _dtype), do: err()
+
   # LazyFrame
   def lf_collect(_df), do: err()
   def lf_describe_plan(_df, _optimized), do: err()
diff --git a/lib/explorer/series.ex b/lib/explorer/series.ex
index ca9bc66..c5b3121 100644
--- a/lib/explorer/series.ex
+++ b/lib/explorer/series.ex
@@ -5985,6 +5985,18 @@ defmodule Explorer.Series do
   def member?(%Series{dtype: dtype}, _value),
     do: dtype_error("member?/2", dtype, [{:list, :_}])
 
+  def len() do
+    Explorer.Backend.LazySeries.len()
+  end
+
+  def row_index() do
+    Explorer.Backend.LazySeries.row_index()
+  end
+
+  def int_range(start, end_, step \\ 1) do
+    Explorer.Backend.LazySeries.int_range(start, end_, step)
+  end
+
   # Escape hatch
 
   @doc """
diff --git a/native/explorer/Cargo.toml b/native/explorer/Cargo.toml
index 96fbc89..8765393 100644
--- a/native/explorer/Cargo.toml
+++ b/native/explorer/Cargo.toml
@@ -77,6 +77,7 @@ features = [
   "product",
   "peaks",
   "moment",
+  "range",
   "rank",
   "propagate_nans",
 ]
diff --git a/native/explorer/src/expressions.rs b/native/explorer/src/expressions.rs
index b433d8c..c731fe0 100644
--- a/native/explorer/src/expressions.rs
+++ b/native/explorer/src/expressions.rs
@@ -4,6 +4,7 @@
 // or an expression and returns an expression that is
 // wrapped in an Elixir struct.
 
+use polars::lazy::dsl;
 use polars::prelude::{
     col, concat_str, cov, pearson_corr, spearman_rank_corr, when, IntoLazy, LiteralValue,
     SortOptions,
@@ -1042,6 +1043,22 @@ pub fn expr_join(expr: ExExpr, sep: String) -> ExExpr {
     ExExpr::new(expr.list().join(sep.lit()))
 }
 
+#[rustler::nif]
+pub fn expr_int_range(start: ExExpr, end: ExExpr, step: i64, dtype: ExSeriesDtype) -> ExExpr {
+    let start = start.clone_inner();
+    let end = end.clone_inner();
+    let dtype = DataType::try_from(&dtype).unwrap();
+    let expr = dsl::int_range(start, end, step, dtype);
+
+    ExExpr::new(expr)
+}
+
+#[rustler::nif]
+pub fn expr_len() -> ExExpr {
+    let expr = dsl::count();
+    ExExpr::new(expr)
+}
+
 #[rustler::nif]
 pub fn expr_lengths(expr: ExExpr) -> ExExpr {
     let expr = expr.clone_inner();
diff --git a/native/explorer/src/lib.rs b/native/explorer/src/lib.rs
index 02f7c21..8409c46 100644
--- a/native/explorer/src/lib.rs
+++ b/native/explorer/src/lib.rs
@@ -166,6 +166,8 @@ rustler::init!(
         expr_float,
         expr_head,
         expr_integer,
+        expr_int_range,
+        expr_len,
         expr_peaks,
         expr_rank,
         expr_unary_not,
diff --git a/test/explorer/data_frame_test.exs b/test/explorer/data_frame_test.exs
index 8a52c9a..533d0ed 100644
--- a/test/explorer/data_frame_test.exs
+++ b/test/explorer/data_frame_test.exs
@@ -4234,4 +4234,45 @@ defmodule Explorer.DataFrameTest do
              }
     end
   end
+
+  describe "row_index/2" do
+    test "int_range" do
+      df = DF.new(a: [1, 3, 5], b: [2, 4, 6])
+
+      df1 = DF.mutate(df, x: len())
+
+      assert DF.to_columns(df1, atom_keys: true) == %{
+               a: [1, 3, 5],
+               b: [2, 4, 6],
+               x: [3, 3, 3]
+             }
+
+      df2 = DF.mutate(df, id: row_index()) |> DF.collect()
+
+      assert DF.to_columns(df2, atom_keys: true) == %{
+               a: [1, 3, 5],
+               b: [2, 4, 6],
+               id: [0, 1, 2]
+             }
+
+      df3 = DF.mutate(df, id: int_range(0, len()))
+      assert DF.to_columns(df3) == DF.to_columns(df2)
+
+      df4 = DF.mutate_with(df, fn _ -> [id: Series.row_index()] end)
+
+      assert DF.to_columns(df4, atom_keys: true) == %{
+               a: [1, 3, 5],
+               b: [2, 4, 6],
+               id: [0, 1, 2]
+             }
+
+      df5 = DF.mutate(df, x: int_range(0, 15, 5))
+
+      assert DF.to_columns(df5, atom_keys: true) == %{
+               a: [1, 3, 5],
+               b: [2, 4, 6],
+               x: [0, 5, 10]
+             }
+    end
+  end
 end

iurimateus · 2024-02-19T20:52:02Z

Superseded by #862

Add DataFrame.with_row_count/2

4005804

iurimateus commented Jan 20, 2024

View reviewed changes

josevalim reviewed Jan 20, 2024

View reviewed changes

iurimateus force-pushed the add-with-row-count branch from 85b1ed8 to 9b86a4f Compare January 20, 2024 15:44

fixup! Add DataFrame.with_row_count/2

61cde22

iurimateus force-pushed the add-with-row-count branch from 9b86a4f to 61cde22 Compare January 20, 2024 17:53

josevalim reviewed Jan 20, 2024

View reviewed changes

lib/explorer/data_frame.ex Outdated Show resolved Hide resolved

fixup! fixup! Add DataFrame.with_row_count/2

091f739

cigrainger mentioned this pull request Jan 22, 2024

Support Series.map/2 with lists #835

Closed

fixup! fixup! fixup! Add DataFrame.with_row_count/2

6cb6cd2

iurimateus mentioned this pull request Feb 18, 2024

Series.row_index/1 #862

Merged

iurimateus closed this Feb 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add DataFrame.with_row_count/2 #833

Add DataFrame.with_row_count/2 #833

iurimateus commented Jan 20, 2024 •

edited

Loading

iurimateus Jan 20, 2024 •

edited

Loading

iurimateus Jan 20, 2024

josevalim left a comment

billylanchantin commented Jan 20, 2024

cigrainger commented Jan 20, 2024

josevalim commented Jan 20, 2024

iurimateus commented Jan 20, 2024 •

edited

Loading

josevalim commented Jan 20, 2024

cigrainger commented Jan 20, 2024

iurimateus commented Jan 24, 2024 •

edited

Loading

iurimateus commented Feb 19, 2024

Add DataFrame.with_row_count/2 #833

Add DataFrame.with_row_count/2 #833

Conversation

iurimateus commented Jan 20, 2024 • edited Loading

iurimateus Jan 20, 2024 • edited Loading

Choose a reason for hiding this comment

iurimateus Jan 20, 2024

Choose a reason for hiding this comment

josevalim left a comment

Choose a reason for hiding this comment

billylanchantin commented Jan 20, 2024

cigrainger commented Jan 20, 2024

josevalim commented Jan 20, 2024

iurimateus commented Jan 20, 2024 • edited Loading

josevalim commented Jan 20, 2024

cigrainger commented Jan 20, 2024

iurimateus commented Jan 24, 2024 • edited Loading

iurimateus commented Feb 19, 2024

iurimateus commented Jan 20, 2024 •

edited

Loading

iurimateus Jan 20, 2024 •

edited

Loading

iurimateus commented Jan 20, 2024 •

edited

Loading

iurimateus commented Jan 24, 2024 •

edited

Loading