fix(python): Ensure the cs.temporal() selector uses wildcard time zone matching for Datetime #13683

Merged
stinodego merged 3 commits into pola-rs:main from wildcard-temporal-tz on Mar 21, 2024

Conversation

alexander-beedie
Collaborator

@alexander-beedie alexander-beedie commented Jan 12, 2024

Closes #13665.

Datetime dtypes with time zones were not being matched by the cs.temporal() selector (or by the DATETIME_DTYPES and TEMPORAL_DTYPES dtype sets). Adding the time_zone="*" wildcard to the match fixes this.

Example

from datetime import datetime

import polars as pl
import polars.selectors as cs

# One UTC column, plus naive and converted-time-zone variants derived from it.
df = pl.DataFrame(
    data={"utc": [datetime(1950, 7, 5), datetime(2099, 12, 31)]},
    schema={"utc": pl.Datetime(time_zone="UTC")},
).with_columns(
    idx=pl.int_range(0, 2),
    naive=pl.col("utc").dt.replace_time_zone(None),
    tokyo=pl.col("utc").dt.convert_time_zone("Asia/Tokyo"),
    hawaii=pl.col("utc").dt.convert_time_zone("US/Hawaii"),
)

Before: (missing datetime dtypes that have timezones)

df.select(cs.temporal())
# shape: (2, 1)
# ┌─────────────────────┐
# │ naive               │
# │ ---                 │
# │ datetime[μs]        │
# ╞═════════════════════╡
# │ 1950-07-05 00:00:00 │
# │ 2099-12-31 00:00:00 │
# └─────────────────────┘

After: (all datetime dtypes selected)

df.select(cs.temporal())
# shape: (2, 3)
# ┌─────────────────────┬──────────────────────────┬─────────────────────────┐
# │ naive               ┆ tokyo                    ┆ hawaii                  │
# │ ---                 ┆ ---                      ┆ ---                     │
# │ datetime[μs]        ┆ datetime[μs, Asia/Tokyo] ┆ datetime[μs, US/Hawaii] │
# ╞═════════════════════╪══════════════════════════╪═════════════════════════╡
# │ 1950-07-05 00:00:00 ┆ 1950-07-05 10:00:00 JDT  ┆ 1950-07-04 14:00:00 HST │
# │ 2099-12-31 00:00:00 ┆ 2099-12-31 09:00:00 JST  ┆ 2099-12-30 14:00:00 HST │
# └─────────────────────┴──────────────────────────┴─────────────────────────┘

@github-actions github-actions bot added the "fix" (Bug fix) and "python" (Related to Python Polars) labels on Jan 12, 2024
@alexander-beedie alexander-beedie changed the title from "fix(python): ensure the cs.temporal() selector wildcards Datetime the time_zone match" to "fix(python): ensure the cs.temporal() selector wildcards Datetime time zone matches" on Jan 12, 2024
@alexander-beedie alexander-beedie force-pushed the wildcard-temporal-tz branch 6 times, most recently from 138f672 to f3e2b94, on January 12, 2024 19:14
@alexander-beedie alexander-beedie changed the title from "fix(python): ensure the cs.temporal() selector wildcards Datetime time zone matches" to "fix(python): ensure the cs.temporal() selector uses wildcard time zone matching for Datetime" on Jan 12, 2024
@alexander-beedie alexander-beedie force-pushed the wildcard-temporal-tz branch 4 times, most recently from 2df8f7d to 1d6f359, on January 12, 2024 20:07
@stinodego
Member

stinodego commented Jan 12, 2024

I don't know about this as Datetime(time_zone="*") is not a valid data type. You can't initialize a Series with that. So it shouldn't be part of our 'sets of types'.

@alexander-beedie
Collaborator Author

alexander-beedie commented Jan 12, 2024

> I don't know about this as Datetime(time_zone="*") is not a valid data type.

I did think about this, but if you're using a DataTypeGroup the chances are high that you're matching, in which case you don't want to fail to match valid dtypes. And you can't directly assign a group as part of a schema (or series). In case you do want only valid dtypes I added "_NO_WILDCARDS" variants for that case 👍

Having said all that, ideally you'd be using a selector for such a match, which has more control/precision, so I could have the wildcard logic live only there instead if you prefer?

@stinodego
Member

stinodego commented Jan 12, 2024

I don't really know what you mean by matching - pl.Datetime("us", "UTC") in pl.TEMPORAL_DTYPES already 'matches' as it returns True by grace of the special __eq__ implementation.

If you mean matching by using the special is_ method then we should update that method with a special case for the time zone wildcards.

@alexander-beedie
Collaborator Author

alexander-beedie commented Jan 12, 2024

> I don't really know what you mean by matching - pl.Datetime("us", "UTC") in pl.TEMPORAL_DTYPES already 'matches' as it returns True by grace of the special __eq__ implementation.

E.g. pl.col(TEMPORAL_DTYPES) ← without the fix this misses datetimes with time zones.

But I don't mind having the wildcard logic live only in selectors if there's too much potential for other issues. As you can tell from the number of force-pushes I repeatedly changed my mind about how best to tackle it 🤣
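
The difference between the two notions of "matching" discussed above can be sketched in plain Python. This is only an illustration, not the Polars implementation: `SketchDatetime`, `matches`, and `select` are hypothetical names, and the rule that "*" accepts any zone-aware dtype but not a naive one is an assumption based on the PR description.

```python
from dataclasses import dataclass
from typing import Optional

WILDCARD = "*"  # assumption: stands for "any time zone", as in the PR description


@dataclass(frozen=True)
class SketchDatetime:
    """Hypothetical stand-in for a Datetime dtype; not the Polars class."""

    time_unit: str = "us"
    time_zone: Optional[str] = None


def matches(pattern: SketchDatetime, dtype: SketchDatetime) -> bool:
    """Return True if `dtype` satisfies `pattern`, honouring the "*" wildcard."""
    if pattern.time_unit != dtype.time_unit:
        return False
    if pattern.time_zone == WILDCARD:
        # The wildcard accepts any zone-aware dtype, but not a naive one.
        return dtype.time_zone is not None
    return pattern.time_zone == dtype.time_zone


def select(schema: dict, group: frozenset) -> list:
    """Names of columns whose dtype matches at least one entry in `group`."""
    return [
        name
        for name, dtype in schema.items()
        if any(matches(pattern, dtype) for pattern in group)
    ]


schema = {
    "naive": SketchDatetime("us"),
    "tokyo": SketchDatetime("us", "Asia/Tokyo"),
    "hawaii": SketchDatetime("us", "US/Hawaii"),
}

naive_only = frozenset({SketchDatetime("us")})
with_wildcard = naive_only | {SketchDatetime("us", WILDCARD)}

print(select(schema, naive_only))     # ['naive']
print(select(schema, with_wildcard))  # ['naive', 'tokyo', 'hawaii']
```

A group that lists only the naive dtype cannot select zone-aware columns under exact comparison; adding one wildcard entry covers them without enumerating every time zone.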

@stinodego
Member

If those data type groups were internal I would be fine with it, but they are part of the public API (which I am not too sure about actually). So they shouldn't contain invalid types just because we need that in our internal logic. No user can really do anything with a Datetime(time_zone="*"), right?

@alexander-beedie
Collaborator Author

alexander-beedie commented Jan 12, 2024

> If those data type groups were internal I would be fine with it, but they are part of the public API (which I am not too sure about actually). So they shouldn't contain invalid types just because we need that in our internal logic. No user can really do anything with a Datetime(time_zone="*"), right?

Indeed; but my contention is that the majority use-case for DataTypeGroup is probably still pl.col(<group>) matching rather than iterating out dtypes that you could use to init a Series, and for this matching case you need the wildcard so you don't miss datetime columns. If you want the other, you can iterate over DATETIME_DTYPES_NO_WILDCARDS or TEMPORAL_DTYPES_NO_WILDCARDS instead.

Though... perhaps we should be pushing people towards selectors for this use-case, and eventually deprecate type-matching inside col 🤔

@mcrumiller
Contributor

mcrumiller commented Jan 12, 2024

This might be a somewhat wonky suggestion, but what if we were to distinguish a generic pl.Datetime from a pl.Datetime(time_zone=None)? The former would resolve to the latter when setting the dtype of a particular column (i.e. pl.Series(..., dtype=pl.Datetime) creates a dtype with no time zone), but for selection/matching, pl.Datetime by itself is the equivalent of Alex's "*", and acts as a supertype for all Datetimes. This can be done pretty easily with the (albeit somewhat ugly) marker default parameters:

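class Datetime:
    # Sentinel that distinguishes "time_zone not given" from an explicit time_zone=None.
    __marker = object()

    def __init__(self, time_unit=None, time_zone=__marker):
        # True only when the caller passed time_zone explicitly (including None).
        self._tz_supplied = time_zone is not Datetime.__marker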

...or something of the sort.
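
To make the sentinel idea concrete, here is a minimal runnable sketch. It assumes a hypothetical `SketchDatetime` class with a `matches` method (neither exists in Polars) and uses a module-level `_UNSET` sentinel rather than a class-private one, purely to sidestep name mangling.

```python
_UNSET = object()  # module-level sentinel: "time_zone was not passed at all"


class SketchDatetime:
    """Hypothetical dtype illustrating the sentinel idea; not the Polars class."""

    def __init__(self, time_unit="us", time_zone=_UNSET):
        self.time_unit = time_unit
        # Remember whether the caller said anything about the time zone.
        self.tz_supplied = time_zone is not _UNSET
        # For construction purposes, an unspecified zone resolves to naive (None).
        self.time_zone = None if time_zone is _UNSET else time_zone

    def matches(self, other):
        """Treat `self` as a pattern and test `other` against it."""
        if self.time_unit != other.time_unit:
            return False
        if not self.tz_supplied:
            # Generic Datetime: acts as a supertype of every time zone.
            return True
        return self.time_zone == other.time_zone


generic = SketchDatetime()                     # time zone left unspecified
naive = SketchDatetime(time_zone=None)         # explicitly naive
tokyo = SketchDatetime(time_zone="Asia/Tokyo")

print(generic.matches(tokyo))  # True
print(naive.matches(tokyo))    # False
print(generic.time_zone)       # None
```

Under this scheme the generic form plays the role of the "*" wildcard when matching, while still resolving to a naive Datetime when used to construct a column.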

@MarcoGorelli
Collaborator

I think a wildcard is quite nice here - '*' isn't the name of any time zone (https://en.wikipedia.org/wiki/List_of_tz_database_time_zones), so there's no ambiguity
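
The no-ambiguity claim can be checked against the local tz database with the standard-library `zoneinfo` module (Python 3.9+). This is a sketch; the set returned by `available_timezones()` depends on the tz data installed on the system and may be empty.

```python
from zoneinfo import available_timezones

# available_timezones() returns the IANA keys visible on this system; the set
# can be empty if no tz database is installed.
names = available_timezones()

print("*" in names)  # False
# No key contains an asterisk at all, so "*" cannot collide with one.
print(any("*" in name for name in names))  # False
```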

@knl
Contributor

knl commented Feb 14, 2024

This change would help me a lot in some queries, as most of my datetime columns have TZ, thanks for fixing the issue! Is there any blocker to get this merged?


codecov bot commented Feb 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.06%. Comparing base (97eff07) to head (0fc5009).
Report is 1 commit behind head on main.

❗ Current head 0fc5009 differs from pull request most recent head a4762c2. Consider uploading reports for the commit a4762c2 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13683      +/-   ##
==========================================
- Coverage   81.24%   81.06%   -0.19%     
==========================================
  Files        1348     1322      -26     
  Lines      175304   171366    -3938     
  Branches     2509     2461      -48     
==========================================
- Hits       142425   138912    -3513     
+ Misses      32399    31984     -415     
+ Partials      480      470      -10     


Member

@stinodego stinodego left a comment

I really want to avoid adding these "NO_WILDCARD" constants to the public API. I feel like we have to rethink things a bit more thoroughly and I don't want to add more stuff that we might have to deprecate later.

Let's just restore the DATETIME_DTYPES variable and explicitly specify the dtypes in that one parametric test, and then we can revisit this later.

@stinodego stinodego changed the title from "fix(python): ensure the cs.temporal() selector uses wildcard time zone matching for Datetime" to "fix(python): Ensure the cs.temporal() selector uses wildcard time zone matching for Datetime" on Mar 21, 2024
@stinodego stinodego merged commit 8e92452 into pola-rs:main Mar 21, 2024
13 checks passed
@alexander-beedie alexander-beedie deleted the wildcard-temporal-tz branch March 21, 2024 10:33
@alexander-beedie alexander-beedie added the "A-selectors" (Area: column selectors) label on May 14, 2024
Labels
A-selectors (Area: column selectors), fix (Bug fix), python (Related to Python Polars)
Development

Successfully merging this pull request may close these issues.

polars.selectors.temporal() doesn't include datetime columns
5 participants