feat(python): infer ISO8601 datetimes #6357

Merged
merged 1 commit into from
Jan 23, 2023

MarcoGorelli
Collaborator

@MarcoGorelli MarcoGorelli commented Jan 21, 2023

closes #6356

This keeps every format already on master and adds many more. No format has been removed (though I suggest removing a couple in #6378).

Where I ran the benchmarks: https://www.kaggle.com/code/marcogorelli/polars-timing?scriptVersionId=117053341

[will fill out below when it completes]

parsing a single element:

from IPython import get_ipython
ipython = get_ipython()
import polars as pl

ipython.run_line_magic("timeit", "pl.Series(['1900/01/01 12:00:00 AM']).str.strptime(pl.Datetime)")

On this branch (with make build-release):

192 µs ± 7.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
192 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
189 µs ± 8.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

On master (with make build-release, after having deleted all non-tracked files):

161 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
160 µs ± 7.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
173 µs ± 7.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

parsing multiple elements (with consistent format):

from IPython import get_ipython
from datetime import datetime
ipython = get_ipython()
import polars as pl

date_range = pl.date_range(low=datetime(1300, 1, 1), high=datetime(2300, 1, 1), interval="12h").dt.strftime('%Y/%m/%d %I:%M:%S %p')

ipython.run_line_magic("timeit", "date_range.str.strptime(pl.Datetime)")

On this branch (with make build-release):

295 ms ± 808 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
296 ms ± 820 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
302 ms ± 7.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On master (with make build-release, after having deleted non-tracked files):

295 ms ± 985 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
301 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
295 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@github-actions bot added the enhancement (New feature or an improvement of an existing feature) and python (Related to Python Polars) labels Jan 21, 2023
@MarcoGorelli MarcoGorelli force-pushed the infer-iso8601 branch 5 times, most recently from 658b260 to 1a3b7b7 Compare January 22, 2023 16:57
@MarcoGorelli MarcoGorelli marked this pull request as ready for review January 22, 2023 19:38
@MarcoGorelli MarcoGorelli marked this pull request as draft January 22, 2023 19:39
@MarcoGorelli MarcoGorelli marked this pull request as ready for review January 22, 2023 19:49
@MarcoGorelli MarcoGorelli marked this pull request as draft January 22, 2023 20:52
@ritchie46
Member

I think the slowdown might be in the CSV parsing, but I think it is worth it for now. We can likely speed that up by caching formats in an LRU manner. Let me know when you think it is good to go.
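
One possible reading of "caching formats in an LRU manner" is a move-to-front list: whenever a format parses a value successfully, it is promoted to the head of the candidate list so that it is tried first next time. The sketch below is only an illustration in plain Python using `datetime.strptime`; the class name `RecencyOrderedFormats` and the format list are hypothetical, and this is not how Polars implements (or later implemented) the idea.

```python
from datetime import datetime


class RecencyOrderedFormats:
    """Keeps candidate formats ordered by how recently each one matched."""

    def __init__(self, formats):
        # Hypothetical candidate formats, tried from front to back.
        self.formats = list(formats)

    def parse(self, value):
        for i, fmt in enumerate(self.formats):
            try:
                parsed = datetime.strptime(value, fmt)
            except ValueError:
                continue  # this format does not match, try the next one
            # Promote the matching format so it is tried first next time.
            self.formats.insert(0, self.formats.pop(i))
            return parsed
        return None  # no candidate format matched


cache = RecencyOrderedFormats(["%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%dT%H:%M:%S"])
first = cache.parse("2023-01-21T10:30:00")
second = cache.parse("21/01/2023")
```

After these two calls, the most recently matched format sits at the front of `cache.formats`, so a run of values sharing one format pays for the full scan only once.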

@MarcoGorelli MarcoGorelli marked this pull request as ready for review January 23, 2023 08:25
@MarcoGorelli
Collaborator Author

Sure, ready for review, thanks!

IIUC, parsing a single element is a bit slower because it has to try the YMD formats one by one until it finds one that works, in

} else if patterns::DATETIME_Y_M_D.iter().any(|fmt| {
    NaiveDateTime::parse_from_str(val, fmt).is_ok()
        || NaiveDate::parse_from_str(val, fmt).is_ok()
}) {

but when parsing multiple elements with a consistent format, it reuses the most recent successful guess, stored in the latest field, and so stays fast

Pattern::DateYMD => Ok(DatetimeInfer {
    patterns: patterns::DATE_Y_M_D,
    latest: patterns::DATE_Y_M_D[0],
    transform: transform_date,
    transform_bytes: transform_date_bytes,
    fmt_len: 0,
    logical_type: DataType::Date,
}),

and that's why the first benchmark above shows a slight slowdown (roughly 190 µs on this branch versus 160-173 µs on master), whereas the second one (multiple consistently formatted elements) shows no noticeable difference (about 295-302 ms on both)
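
The trade-off described above can be sketched in plain Python. This is a simplified illustration, not the Polars code: `LatestGuessParser`, the `PATTERNS` list, and the `attempts` counter are all hypothetical, and `datetime.strptime` stands in for the chrono parsing calls. The idea is to try the last format that worked first, and fall back to scanning the full list only on a miss.

```python
from datetime import datetime

# Hypothetical candidate list; the real pattern tables in Polars are longer.
PATTERNS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y/%m/%d %H:%M:%S",
    "%Y/%m/%d %I:%M:%S %p",
]


class LatestGuessParser:
    """Tries the last successful format first, then scans the other patterns."""

    def __init__(self, patterns):
        self.patterns = list(patterns)
        self.latest = self.patterns[0]  # initial guess, like DATE_Y_M_D[0]
        self.attempts = 0  # counts strptime calls, for illustration only

    def _try(self, value, fmt):
        self.attempts += 1
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            return None

    def parse(self, value):
        # Fast path: reuse the most recent successful guess.
        parsed = self._try(value, self.latest)
        if parsed is not None:
            return parsed
        # Slow path: scan the remaining patterns and remember the one that works.
        for fmt in self.patterns:
            if fmt == self.latest:
                continue
            parsed = self._try(value, fmt)
            if parsed is not None:
                self.latest = fmt
                return parsed
        return None


parser = LatestGuessParser(PATTERNS)
values = ["1900/01/01 12:00:00 AM", "1900/01/01 12:00:00 PM", "1900/01/02 12:00:00 AM"]
results = [parser.parse(v) for v in values]
```

In this sketch the first value costs three parse attempts, while each later value with the same format costs one. That matches the shape of the benchmarks: the one-off scan is visible when parsing a single element and is amortised away over a long, consistently formatted column.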

@ritchie46
Member

That's indeed the tradeoff. Seems like a reasonable concession for the added ergonomics.

@ritchie46 ritchie46 merged commit c032584 into pola-rs:master Jan 23, 2023

Successfully merging this pull request may close these issues.

Infer ISO8601 datetimes