feat[rust, python]: if datetime format not specified for "write_csv", infer the required precision #4724
Conversation
```rust
// if datetime format not specified, infer the maximum required precision
if options.datetime_format.is_none() {
    for col in df.get_columns() {
        match col.dtype() {
            DataType::Datetime(TimeUnit::Microseconds, _)
                if options.datetime_format.is_none() =>
            {
                options.datetime_format = Some("%FT%H:%M:%S.%6f".to_string());
            }
            DataType::Datetime(TimeUnit::Nanoseconds, _) => {
                options.datetime_format = Some("%FT%H:%M:%S.%9f".to_string());
                break; // highest precision; no need to check further
            }
            _ => {}
        }
    }
    // if still not set, no cols require higher precision than "ms" (or no datetime cols)
    if options.datetime_format.is_none() {
        options.datetime_format = Some("%FT%H:%M:%S.%3f".to_string());
    }
}
```
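For reference, a rough sketch of the intended effect from the Python side (my own illustration, not part of the diff; the column names and values are hypothetical):

```python
from datetime import datetime

import polars as pl

# hypothetical frame mixing "ms" and "us" datetime columns
df = pl.DataFrame({
    "ts_ms": [datetime(2077, 1, 1, 12, 30, 45, 123456)],
    "ts_us": [datetime(2077, 1, 1, 12, 30, 45, 123456)],
}).with_columns([
    pl.col("ts_ms").cast(pl.Datetime("ms")),
    pl.col("ts_us").cast(pl.Datetime("us")),
])

# the highest precision present is "us", so the inferred format should be
# "%FT%H:%M:%S.%6f", applied to every datetime column in the file
print(df.write_csv())
# ts_ms,ts_us
# 2077-01-01T12:30:45.123000,2077-01-01T12:30:45.123456
```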
@alexander-beedie you mention that your approach is a compromise since per-column formatting was too heavyweight. Accommodating per-column formatting would be pretty surgical, though -- you could simply change these lines to
```rust
None => {
    let dt = match tu {
        TimeUnit::Nanoseconds => temporal_conversions::timestamp_ns_to_datetime(v),
        TimeUnit::Microseconds => temporal_conversions::timestamp_us_to_datetime(v),
        TimeUnit::Milliseconds => temporal_conversions::timestamp_ms_to_datetime(v),
    };
    let fmt = match &options.datetime_format {
        None => match tu {
            TimeUnit::Nanoseconds => "%FT%H:%M:%S.%9f".to_string(),
            TimeUnit::Microseconds => "%FT%H:%M:%S.%6f".to_string(),
            TimeUnit::Milliseconds => "%FT%H:%M:%S.%3f".to_string(),
        },
        Some(fmt) => fmt,
    };
    write!(f, "{}", dt.format(fmt))
}
```
Are you afraid of a performance penalty this way?
Actually, I think this would be faster than your way -- primarily because it requires no additional looping.
This would be a heap allocation to a string for every value written, which would be slow. I don't think we need the string alloc, but I also don't think we can make these performance claims, as this path is hit for every written value.
Ah right, this is on a per-value basis, not per-column. Fair enough.
You are right -- the `.to_string()` bit is unnecessary.
There isn't much of a performance penalty with my approach:
```python
df = pl.Series(pd.date_range("2000-01-01", "2100-01-01")).to_frame()

%timeit df.write_csv()
# master
46.5 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# my approach
46.5 ms ± 777 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
This is not really a good benchmark. When writing such a small DataFrame, we are mostly benchmarking virtual dispatch and, likely, Python interop. Try measuring different row sizes, up to a million rows or so, so that we benchmark the hot loop rather than the setup code. Varying the row size also tells us more about how the approach deals with caches.
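A sketch of such a sweep (reusing the `pd.date_range` construction from the snippet above; the row counts and timing harness are illustrative, not from this thread):

```python
import timeit

import pandas as pd
import polars as pl

# illustrative row counts; at the larger sizes the per-value formatting loop
# should dominate over setup and Python interop
for n_rows in (10_000, 100_000, 1_000_000):
    df = pl.Series(pd.date_range("2000-01-01", periods=n_rows, freq="s")).to_frame()
    secs = timeit.timeit(df.write_csv, number=5) / 5
    print(f"{n_rows:>9} rows: {secs * 1000:.1f} ms per write_csv")
```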
> you mention that your approach is a compromise since per-column formatting was too heavyweight
Mostly because in my first approach I found myself making it unnecessarily complicated (I was allowing a different user-specified format per time unit and it became... unruly ;). So I pulled back to a cleaner implementation that slides into the existing code with minimal fuss. It would be nice to do per-time-unit formatting though... (I added something similar for `cast(str)` on datetimes last week).
> try measuring different row sizes up to a million rows or so, such that we benchmark the hot loop
Done; the per-time-unit approach above becomes ~5-7% slower on larger data (I benchmarked on ~63,000,000 rows, as created by `df = pl.Series(pl.date_range(date(2000,1,1), date(2001,12,31), "1s")).to_frame()`). So not much slower, but not faster either - at least, not on my home setup (vanilla M1).
Yep, I'd expect something like that.
I think we should go for your approach; it is simple.
We can explore something more fine-grained later, but that will require some effort to not lose performance.
@alexander-beedie thanks for checking!
Thanks @alexander-beedie. That will save a lot of future bytes written. :)
If the `datetime_format` parameter is not set by the caller, the `CSVWriter` defaults to nanosecond precision, even when not required. This patch allows the `write_csv` call to infer the maximum required output precision so that unnecessary decimals aren't written and the precision of the output matches the precision of the frame. (If the `datetime_format` parameter is set, it is of course respected, and no inference is done.)

Example
Before (forces "ns" precision on output)
After (infers actual "ms" precision: 20% more compact)
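For a frame with a single millisecond-precision datetime column, the before/after difference looks roughly like this (illustrative values, not the original example output):

```
# before: the default format pads every value out to nanoseconds
ts
2022-01-01T00:00:00.000000000
2022-01-01T00:00:01.000000000

# after: the inferred "%FT%H:%M:%S.%3f" matches the frame's "ms" unit
ts
2022-01-01T00:00:00.000
2022-01-01T00:00:01.000
```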
FYI: I did consider trying per-column precision on output, but it felt like an unnecessarily heavyweight approach, and was more intrusive in the code; given that most frames are likely to have consistent time units this seemed more reasonable.