-
Notifications
You must be signed in to change notification settings - Fork 123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Additional temporal arithmetic #696
Additional temporal arithmetic #696
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some specific call outs:
lib/explorer/series.ex
Outdated
defp maybe_swap_args_add([left, right]) do | ||
case {dtype(left), dtype(right)} do | ||
# `duration + date` is not supported by polars for some reason. | ||
{{:duration, _}, :date} -> [right, left] | ||
_ -> [left, right] | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surprisingly, polars doesn't support duration + date
even though it does support duration + datetime
and date + duration
. I'm thinking this is an oversight? Regardless, I chose to swap the arguments rather than disallow because add/2
should definitely be commutative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any backend specific logic should be in the backend, not here. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lib/explorer/series.ex
Outdated
# Polars currently returns inconsistent dtypes, e.g.: | ||
# * `integer * duration -> duration` when `integer` is a scalar | ||
# * `integer * duration -> integer` when `integer` is a series | ||
# We need to return duration in these cases, so we need an additional cast. | ||
|> cast(dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still found the need to post-cast in these specific scenarios (same with divide). The root cause is the inconsistency I noted here: #683 (comment).
This means that there is currently a discrepancy between what the lazy operations return and this one. Example (continuing from above):
Explorer.Series.multiply(2, df["diff"])
#=> #Explorer.Series<
#=> Polars[1]
#=> duration[ms] [-2d]
#=> >
DF.mutate(df, mul: 2 * diff)
#=> #Explorer.DataFrame<
#=> Polars[1 x 4]
#=> aug_20 date [2023-08-20]
#=> aug_21 date [2023-08-21]
#=> diff duration[ms] [-1d]
#=> mul integer [-172800000]
#=> >
Thoughts about how to handle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd need to handle this in the backend. We also need to make sure we will handle this for polars lazy queries.
Quick question, does it help if we swap the arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it help if we swap the arguments?
No, I checked :(
From #683:
let diff = Series::new("diff", &dt2 - &dt1);
let ten = Series::new("diff", [10]);
println!("{:?}", Series::new("prod3", &ten * &diff)); // [864000000]
println!("{:?}", Series::new("prod4", &diff * &ten)); // [864000000]
We'd need to handle this in the backend. We also need to make sure we will handle this for polars lazy queries.
👍
lib/explorer/shared.ex
Outdated
@doc """ | ||
The return dtype for the basic arithmetic operations: add, subtract, multiply, and divide. | ||
|
||
This function assumes that the inputs have already had the highest precision enforced. | ||
""" | ||
def dtype_for_basic_arithmetic(operation, left_dtype, right_dtype) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function seemed maybe out of place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. My suggestion is to move the cast_to_*
functions here and keep this one where it is used. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed by: 093e9ce
test "fractional parts of floats work (roughly) as expected" do | ||
# This test is not exhaustive. Rather, its purpose is to give us a reasonable confidence that | ||
# multiplying durations by floats is fairly accurate. | ||
# | ||
# The exact answers we see here are subject to implementation details outside our control. | ||
# If we find that this test breaks unexpectedly (e.g. from a dependency update), then we may | ||
# wish to remove it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I included this test mostly because I wanted to see what we could expect from float * duration
. It is quite brittle and I'm happy to remove it.
lib/explorer/series.ex
Outdated
|
||
def subtract(left, right), do: basic_numeric_operation(:subtract, left, right) | ||
@doc false | ||
def cast_to_subtract(:integer, :integer), do: :integer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should avoid @doc false
functions because if the user does import Explorer.Series
, they will be imported into their namespace. One option is to underscore them, but my suggestion is to move them to Explorer.Shared
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed by: 093e9ce
# Arg-swapping does NOT work. | ||
# [src/dataframe.rs:507] &mutations = [ | ||
# [(col("sub")) + ([(col("aug_20")) + (col("sub"))])].alias("add2"), | ||
# ] | ||
df2 = DF.mutate(df, add2: sub + (aug_20 + sub)) | ||
assert df2["add2"].dtype == :date | ||
assert Series.to_list(df2["add2"]) == [~D[2023-08-22]] | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey so I'm a little stuck at this point.
My workaround for Explorer.Series.add(duration, date)
was to swap the arguments. I tried that same thing with expressions, but now I don't think that will work. There's no way to "see through" the parentheses in expressions like:
[(col("sub")) + ([(col("aug_20")) + (col("sub"))])]
meaning I miss situations where I ought to swap.
Perhaps there's some way to salvage this approach? Or perhaps there's another way to achieve what I'm going for? Any advice would be appreciated!
P.S., here's a pseudo-call-stack when calling DF.mutate/2
:
Click to show/hide
Explorer.DataFrame.mutate(df, mutations) # macro
Explorer.DataFrame.mutate_with(unquote(df), Explorer.Query.query(unquote(mutations)))
Explorer.Shared.apply_impl(df, :mutate_with, [df_out, column_pairs])
Explorer.PolarsBackend.DataFrame.mutate_with(df, out_df, column_pairs)
Explorer.PolarsBackend.Shared.apply_dataframe(df, out_df, :df_mutate_with_exprs, [exprs, df.groups])
Explorer.PolarsBackend.Native.df_mutate_with_exprs(df, exprs, groups)
native/explorer/src/dataframe.rs::df_mutate_with_exprs
I found it helpful when trying to figure out a good place to intervene.
args | ||
end | ||
|
||
{%Explorer.Backend.LazySeries{op: :column, args: [col_name]} = last, rest} -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps the best place to swap args is in expression.ex. In there we convert all lazy frame operations into Polars expressions, so you can override add
, multiply
, and friends in there to both swap and post-cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried there initially, but lazy series didn't seem to contain the dtype information I needed to make decisions about whether to swap, cast, etc.
E.g. if we add this custom to_expr
clause for add:
def to_expr(%LazySeries{op: :add, args: [left, right]}) do
left |> IO.inspect(label: :left, width: 120)
right |> IO.inspect(label: :right, width: 120)
Native.expr_add(to_expr(left), to_expr(right))
end
We see the following after running df1 = DF.mutate(df, add1: sub + aug_20)
:
left: %Explorer.Backend.LazySeries{op: :column, args: ["aug_20"], aggregation: false}
right: %Explorer.Backend.LazySeries{op: :column, args: ["sub"], aggregation: false}
I don't know the dtypes of those args, so I don't know if I should swap or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GAH. Does it work in expression.rs? Can we fetch the dtype there and swap there?
If not, we would need to start storing the dtype then... it is more work though, so it probably makes sense to break this apart, add the dtype field to lazy series, and then come back to this. If we indeed can't do it in Rust, then we should merge a subset of this PR that doesn't require the arg swapping, and work on the other fix later.
I guess there is always the option of fixing Polars too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it work in expression.rs? Can we fetch the dtype there and swap there?
I don't think so? I'll try to check more thoroughly later today tho. Polars expressions weren't super well documented so I'll have to read some source code.
If we indeed can't do it in Rust, then we should merge a subset of this PR that doesn't require the arg swapping, and work on the other fix later.
Do you mean supporting Explorer.Series.add(duration, date)
a letting a lazy expression possibly panic? Or dropping duration + date
support altogether until we can address the issue?
I guess there is always the option of fixing Polars too.
You read my mind...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean supporting Explorer.Series.add(duration, date) a letting a lazy expression possibly panic?
I meant adding a field called dtype to Explorer.Backend.LazySeries
. If we do this though, I would make it so we change all Explorer.Backends.Series callbacks to receive the output dtype as argument (or an out_series
, which has the correct result type, like we do for dataframes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I meant what should we do for this PR? Definitely coming back later and adding the dtype info we need to make decisions sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this PR, let's only support the dtype pairs we can implement today (without changing polars or without changing lazy series). Or do you mean we cannot support any of them? :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, let's keep the arg swap in Explorer.Series
for now. It is not ideal but it is semantically fine. We can add a TODO to them.
For the integer vs duration, which requires the post cast, we can pre-cast the integer to duration, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, let's keep the arg swap in Explorer.Series for now. It is not ideal but it is semantically fine. We can add a TODO to them.
👍
For the integer vs duration, which requires the post cast, we can pre-cast the integer to duration, right?
No, duration * duration
isn't supported. I think only way to do it is post-cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly think opening up a PR in Polars might be the least painful way accomplish this 😅
@josevalim Recap: Some of the new operations have fixes in the
We're gonna come back and add that dtype information in a 2nd pass. |
💚 💙 💜 💛 ❤️ |
@billylanchantin thank you so much. I will take a quick stab at the problem and try some ideas, I will let you know how it goes. :) |
Hi @billylanchantin, I tried a couple ideas, none of them worked, but I have a new insight. The LazySeries actually has the dtype indirectly, because the lazy series receives and returns
to
And then we unpack it when building the expression, which would also get access to the dtypes, allowing us to hopefully fix the problem. Then we could later change Series to compute the dtype and pass it as argument to callbacks, just so we don't have to compute it again in LazySeries :) |
Hi @josevalim, Sounds like a plan! One consideration: I ran into an issue where I needed to know both the input and output dtypes of the operations because of expressions like %LazySeries{
op: :add,
args: [
:duration,
%LazySeries{ # <-- this evaluates to :date
op: :add,
args: [:date, :duration]
}
]
} This never matched the Even if we didn't, though, we could annotate the expression tree with the output dtypes we know and leave the others as |
The goal is that such tree would look like this: %Series{
dtype: :foo,
data: %LazySeries{
op: :add,
args: [
%Series{
dtype: :bar,
data: %LazySeries{op: ...},
},
%Series{
dtype: :foo,
data: %LazySeries{op: ...},
}
]
}
} In fact, that's already how it looks like for the root node, but we unpack the inner nodes as we compose them. |
Description
This PR attempts to round off work started in #683 by adding support for the following operations:
date - date
date + duration
date - duration
duration + date
duration * integer
duration * float
duration / integer
duration / float
integer * duration
float * duration
Example
Notes
I handled both of those items as best I could, but I don't feel too strongly about my choices. Happy to run with any feedback :)