polars.selectors.temporal() doesn't include datetime columns #13665

Closed
2 tasks done
knl opened this issue Jan 12, 2024 · 5 comments · Fixed by #13683
Assignees
Labels
A-temporal Area: date/time functionality bug Something isn't working P-medium Priority: medium python Related to Python Polars

Comments

@knl
Contributor

knl commented Jan 12, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import polars.selectors as cs

xdf = (
    pl.DataFrame(data={'date_id': [20231212, 20240111, 20240112], 'value': [3, 7, 1]})
    .with_columns(
        date_id_datetime=pl.col('date_id').cast(pl.Utf8).str.to_datetime("%Y%m%d", time_zone="UTC", time_unit="ns").dt.offset_by('12h'),
        date_id_date=pl.col('date_id').cast(pl.Utf8).str.to_date("%Y%m%d"),
    )
)
print(xdf)
print(xdf.select(cs.temporal()))
shape: (3, 4)
┌──────────┬───────┬─────────────────────────┬──────────────┐
│ date_id  ┆ value ┆ date_id_datetime        ┆ date_id_date │
│ ---      ┆ ---   ┆ ---                     ┆ ---          │
│ i64      ┆ i64   ┆ datetime[ns, UTC]       ┆ date         │
╞══════════╪═══════╪═════════════════════════╪══════════════╡
│ 20231212 ┆ 3     ┆ 2023-12-12 12:00:00 UTC ┆ 2023-12-12   │
│ 20240111 ┆ 7     ┆ 2024-01-11 12:00:00 UTC ┆ 2024-01-11   │
│ 20240112 ┆ 1     ┆ 2024-01-12 12:00:00 UTC ┆ 2024-01-12   │
└──────────┴───────┴─────────────────────────┴──────────────┘
shape: (3, 1)
┌──────────────┐
│ date_id_date │
│ ---          │
│ date         │
╞══════════════╡
│ 2023-12-12   │
│ 2024-01-11   │
│ 2024-01-12   │
└──────────────┘

Log output

No response

Issue description

polars.selectors.temporal() doesn't include datetime columns, as the example shows. It does include date columns, though, so it is not clear what the criteria are for a column to be considered temporal.

Expected behavior

I would expect that date_id_datetime is also included, as with version 0.19.13.

shape: (3, 4)
┌──────────┬───────┬─────────────────────────┬──────────────┐
│ date_id  ┆ value ┆ date_id_datetime        ┆ date_id_date │
│ ---      ┆ ---   ┆ ---                     ┆ ---          │
│ i64      ┆ i64   ┆ datetime[ns, UTC]       ┆ date         │
╞══════════╪═══════╪═════════════════════════╪══════════════╡
│ 20231212 ┆ 3     ┆ 2023-12-12 12:00:00 UTC ┆ 2023-12-12   │
│ 20240111 ┆ 7     ┆ 2024-01-11 12:00:00 UTC ┆ 2024-01-11   │
│ 20240112 ┆ 1     ┆ 2024-01-12 12:00:00 UTC ┆ 2024-01-12   │
└──────────┴───────┴─────────────────────────┴──────────────┘
>>> print(xdf.select(cs.temporal()))
shape: (3, 2)
┌─────────────────────────┬──────────────┐
│ date_id_datetime        ┆ date_id_date │
│ ---                     ┆ ---          │
│ datetime[ns, UTC]       ┆ date         │
╞═════════════════════════╪══════════════╡
│ 2023-12-12 12:00:00 UTC ┆ 2023-12-12   │
│ 2024-01-11 12:00:00 UTC ┆ 2024-01-11   │
│ 2024-01-12 12:00:00 UTC ┆ 2024-01-12   │
└─────────────────────────┴──────────────┘

Installed versions

--------Version info---------
Polars:               0.20.3
Index type:           UInt32
Platform:             Linux-5.10.154-1.base.x86_64-x86_64-with-glibc2.17
Python:               3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.6.0
gevent:               23.7.0
hvplot:               <not installed>
matplotlib:           3.7.2
numpy:                1.24.4
openpyxl:             <not installed>
pandas:               1.5.3
pyarrow:              10.0.1
pydantic:             1.10.11
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.19
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@knl knl added bug Something isn't working python Related to Python Polars labels Jan 12, 2024
@MarcoGorelli
Collaborator

@alexander-beedie fancy taking a look?

@MarcoGorelli MarcoGorelli added the A-temporal Area: date/time functionality label Jan 12, 2024
@alexander-beedie alexander-beedie self-assigned this Jan 12, 2024
@mcrumiller
Contributor

Looks like it's not selector specific: df.select(pl.col(pl.Datetime)) doesn't superset datetimes with time zones:

from datetime import datetime
import polars as pl

df = pl.DataFrame({
    "a": pl.Series([datetime(2024, 1, 1)], dtype=pl.Datetime("us", "UTC")),
})
print(df.select(pl.col(pl.Datetime)))
shape: (0, 0)
┌┐
╞╡
└┘

@alexander-beedie
Collaborator

alexander-beedie commented Jan 12, 2024

Looks like it's not selector specific: df.select(pl.col(pl.Datetime)) doesn't superset datetimes with time zones:

That one shouldn't, as it's equivalent to pl.Datetime(time_zone=None), so it won't match something that has a time zone. In that case you'd want pl.col(pl.Datetime(time_zone="*")) to match any non-null timezone.

About to commit a fix that covers DATETIME_DTYPES, TEMPORAL_DTYPES, and the cs.temporal() selector 👌

@alexander-beedie alexander-beedie added the accepted Ready for implementation label Jan 12, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jan 12, 2024
@mcrumiller
Contributor

mcrumiller commented Jan 12, 2024

@alexander-beedie time_zone="*" doesn't catch None, should it?

from datetime import datetime
import polars as pl

df = pl.DataFrame({
    "a": pl.Series([datetime(2024, 1, 1)], dtype=pl.Datetime),
}).select(
    pl.col(pl.Datetime(time_zone="*"))
)
shape: (0, 0)
┌┐
╞╡
└┘

Also, shouldn't we enable * for time_unit as well? It looks like there's no way to select out generic Datetime columns without knowing the precise time unit:

df = pl.DataFrame({
    "a": pl.Series([datetime(2024, 1, 1)], dtype=pl.Datetime("ms")),
}).select(
    pl.col(pl.Datetime)
)
shape: (0, 0)
┌┐
╞╡
└┘

I know we can use polars.DATETIME_DTYPES, but that's a bit less obvious.

@alexander-beedie
Collaborator

alexander-beedie commented Jan 12, 2024

@mcrumiller: selectors help insulate you from the lower-level matching logic:

  • cs.datetime() → any timeunit, any (or no) timezone
  • cs.datetime( time_zone="*" ) → any timeunit, any timezone (but must have a timezone)
  • cs.datetime( time_zone=None ) → any timeunit, no timezone (cannot have a timezone)
  • cs.datetime( ["ms","ns"], time_zone="UTC" ) → any col with UTC timezone and ns or ms precision
  • cs.datetime( time_zone=["UTC","Asia/Tokyo","Europe/London"] ) → any timeunit, one of the given timezones

Note: all of the expressions above can also be negated using ~.

@alexander-beedie alexander-beedie changed the title polars.selector.temporal() doesn't include datetime columns polars.selectors.temporal() doesn't include datetime columns Jan 12, 2024
@stinodego stinodego added P-high Priority: high and removed accepted Ready for implementation labels Jan 12, 2024
@ritchie46 ritchie46 added P-medium Priority: medium and removed P-high Priority: high labels Feb 29, 2024
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Mar 21, 2024
Projects
Archived in project
6 participants