upsample not working if arguments are only sorted within group #18229

Closed
2 tasks done
thomascamminady opened this issue Aug 16, 2024 · 2 comments
Labels
A-timeseries Area: date/time functionality bug Something isn't working P-medium Priority: medium python Related to Python Polars

Comments

@thomascamminady
Contributor

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from datetime import datetime

import polars as pl

df = pl.DataFrame(
    {
        "time": [
            datetime(2021, 2, 1),
            datetime(2021, 1, 1),  # this is 2021-04-01 in the docs, i.e. sorted
            datetime(2021, 5, 1),
            datetime(2021, 6, 1),
        ],
        "groups": ["A", "B", "A", "B"],
        "values": [0, 1, 2, 3],
    }
)


df_upsampled = df.upsample(
    time_column="time", every="1mo", group_by="groups", maintain_order=True
)

Log output

---------------------------------------------------------------------------
InvalidOperationError                     Traceback (most recent call last)
/var/folders/1c/6_s1_dhd2xngnxyrz3vnpqfr0000gq/T/ipykernel_51814/920196386.py in ?()
     15     }
     16 )#.sort("time")
     17 
     18 
---> 19 df_upsampled = df.upsample(
     20     time_column="time", every="1mo", group_by="groups", maintain_order=True
     21 )
     22 

~/Dev/performance_management_chart/.venv/lib/python3.11/site-packages/polars/_utils/deprecation.py in ?(*args, **kwargs)
     87         def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88             _rename_keyword_argument(
     89                 old_name, new_name, kwargs, function.__qualname__, version
     90             )
---> 91             return function(*args, **kwargs)

~/Dev/performance_management_chart/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py in ?(self, time_column, every, group_by, maintain_order)
   6419 
   6420         every = parse_as_duration_string(every)
   6421 
   6422         return self._from_pydf(
-> 6423             self._df.upsample(group_by, time_column, every, maintain_order)
   6424         )

InvalidOperationError: argument in operation 'upsample' is not sorted, please sort the 'expr/series/column' first

Issue description

I'm not sure whether this is a bug or intended behavior, but I found it unintuitive.
I would like to upsample my data while grouping by another column (groups).
My assumption was that if I call sort('groups', 'time') and then upsample(...., group_by='groups'), the input counts as sorted, because it is sorted within each group.

Quoting from the upsample doc:

Result will be sorted by time_column (but note that if group_by columns are passed, it will only be sorted within each group).

So I would assume that the same rule applies to the input. In the MWE, the time column isn't sorted globally, but it is sorted within each group.
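To make the distinction concrete, here is a minimal standard-library sketch that runs both checks on the rows from the reproducible example. The helpers `is_sorted` and `is_sorted_within_groups` are hypothetical names used only for illustration. They are not part of Polars and do not reflect how Polars performs its own sortedness check.

```python
from collections import defaultdict
from datetime import datetime

# Same rows as the reproducible example above.
times = [
    datetime(2021, 2, 1),
    datetime(2021, 1, 1),
    datetime(2021, 5, 1),
    datetime(2021, 6, 1),
]
groups = ["A", "B", "A", "B"]


def is_sorted(xs):
    """Return True if xs is in non-decreasing order (hypothetical helper)."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def is_sorted_within_groups(keys, xs):
    """Return True if xs is non-decreasing inside every group of keys."""
    per_group = defaultdict(list)
    for key, x in zip(keys, xs):
        per_group[key].append(x)
    return all(is_sorted(vals) for vals in per_group.values())


globally_sorted = is_sorted(times)
sorted_per_group = is_sorted_within_groups(groups, times)
print(globally_sorted, sorted_per_group)  # prints: False True
```

The global check fails because 2021-02-01 precedes 2021-01-01 in row order, while each group on its own (A: Feb, May; B: Jan, Jun) is in order.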

Expected behavior

I would expect something like this to always work:

df = pl.DataFrame(.....).sort("groups","time")
df.upsample(time_column="time", every="1mo", group_by="groups")
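For reference, this is roughly the result I would expect from a per-group monthly upsample, written as a plain-Python sketch rather than with Polars. It assumes every timestamp falls on the first day of a month, and `month_range` and `upsample_monthly_by_group` are hypothetical helpers, not Polars functions. It is a conceptual illustration of the expected output only, not a description of how `upsample` is implemented.

```python
from datetime import datetime


def month_range(start, end):
    """Yield the first of each month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield datetime(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def upsample_monthly_by_group(rows):
    """Fill monthly gaps per group.

    rows: (group, time, value) tuples. Only the order of groups in the
    output depends on the input order; times are filled min to max.
    """
    by_group = {}
    for group, time, value in rows:
        by_group.setdefault(group, {})[time] = value
    out = []
    for group, known in by_group.items():
        for t in month_range(min(known), max(known)):
            # Months without an original row get None, like a null value.
            out.append((group, t, known.get(t)))
    return out


# Rows from the reproducible example, sorted by (group, time).
rows = sorted(
    [
        ("A", datetime(2021, 2, 1), 0),
        ("B", datetime(2021, 1, 1), 1),
        ("A", datetime(2021, 5, 1), 2),
        ("B", datetime(2021, 6, 1), 3),
    ]
)
result = upsample_monthly_by_group(rows)
for row in result:
    print(row)
```

With this data, group A spans February to May (4 rows) and group B spans January to June (6 rows), each sorted by time within its group.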

Installed versions

--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.11.9 (main, Apr  2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                2.0.1
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

@thomascamminady thomascamminady added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Aug 16, 2024
@MarcoGorelli
Collaborator

thanks @thomascamminady for the report, looks like a bug

@MarcoGorelli MarcoGorelli added A-timeseries Area: date/time functionality P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer labels Aug 16, 2024
@thomascamminady
Contributor Author

thomascamminady commented Aug 21, 2024

Closed as per #18264
