feat[rust, python]: if datetime format not specified for "write_csv", infer the required precision #4724

Merged · 4 commits · Sep 4, 2022
Conversation

alexander-beedie (Collaborator) commented Sep 4, 2022

If the datetime_format parameter is not set by the caller, the CSVWriter defaults to nanosecond precision, even when not required. This patch allows the write_csv call to infer the maximum required output precision so that unnecessary decimals aren't written, and the precision of the output matches the precision of the frame.

(If the datetime_format parameter is set, it is of course respected, and no inference is done).

Example

from datetime import datetime
import polars as pl

# frame with millisecond-precision datetime cols
df = pl.DataFrame(
    data = {
        "colx": [datetime(2022,9,4,10,30,45,123000)],
        "coly": [datetime(2022,9,4,10,30,45,123000)],
    },
    columns = [
        ("colx", pl.Datetime("ms")),
        ("coly", pl.Datetime("ms")),
    ],
)
print(df.write_csv())

Before (forces "ns" precision on output)

colx,coly
2022-09-04T10:30:45.123000000,2022-09-04T10:30:45.123000000

After (infers actual "ms" precision: 20% more compact)

colx,coly
2022-09-04T10:30:45.123,2022-09-04T10:30:45.123
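
And if datetime_format is passed explicitly, inference is skipped entirely; a minimal sketch reusing the frame above (the format string here is an arbitrary illustrative choice):

print(df.write_csv(datetime_format="%Y-%m-%d %H:%M:%S"))
# colx,coly
# 2022-09-04 10:30:45,2022-09-04 10:30:45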

FYI: I did consider trying per-column precision on output, but it felt like an unnecessarily heavyweight approach and was more intrusive in the code; given that most frames are likely to have consistent time units, this seemed more reasonable.
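
To make the frame-wide rule concrete, a hedged sketch (column names here are illustrative): if a frame mixes "ms" and "ns" columns, the single inferred format follows the highest precision present, so both columns are written with "ns" precision.

from datetime import datetime
import polars as pl

# one "ms" column and one "ns" column: the inferred frame-wide format
# follows the highest-precision unit present ("%FT%H:%M:%S.%9f")
df = pl.DataFrame(
    data = {
        "ms_col": [datetime(2022,9,4,10,30,45,123000)],
        "ns_col": [datetime(2022,9,4,10,30,45,123000)],
    },
    columns = [
        ("ms_col", pl.Datetime("ms")),
        ("ns_col", pl.Datetime("ns")),
    ],
)
print(df.write_csv())
# ms_col,ns_col
# 2022-09-04T10:30:45.123000000,2022-09-04T10:30:45.123000000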

@github-actions github-actions bot added the python Related to Python Polars label Sep 4, 2022
@github-actions github-actions bot added the rust Related to Rust Polars label Sep 4, 2022
Comment on lines +175 to +195
// if datetime format not specified, infer the maximum required precision
if options.datetime_format.is_none() {
    for col in df.get_columns() {
        match col.dtype() {
            DataType::Datetime(TimeUnit::Microseconds, _)
                if options.datetime_format.is_none() =>
            {
                options.datetime_format = Some("%FT%H:%M:%S.%6f".to_string());
            }
            DataType::Datetime(TimeUnit::Nanoseconds, _) => {
                options.datetime_format = Some("%FT%H:%M:%S.%9f".to_string());
                break; // highest precision; no need to check further
            }
            _ => {}
        }
    }
    // if still not set, no cols require higher precision than "ms" (or no datetime cols)
    if options.datetime_format.is_none() {
        options.datetime_format = Some("%FT%H:%M:%S.%3f".to_string());
    }
}
matteosantama (Contributor) commented:

@alexander-beedie you mention that your approach is a compromise since per-column formatting was too heavyweight. Accommodating per-column formatting would be pretty surgical, though -- you could simply change these lines to

            None => {
                let dt = match tu {
                    TimeUnit::Nanoseconds => temporal_conversions::timestamp_ns_to_datetime(v),
                    TimeUnit::Microseconds => temporal_conversions::timestamp_us_to_datetime(v),
                    TimeUnit::Milliseconds => temporal_conversions::timestamp_ms_to_datetime(v),
                };
                let fmt = match &options.datetime_format {
                    None => match tu {
                        TimeUnit::Nanoseconds => "%FT%H:%M:%S.%9f".to_string(),
                        TimeUnit::Microseconds => "%FT%H:%M:%S.%6f".to_string(),
                        TimeUnit::Milliseconds => "%FT%H:%M:%S.%3f".to_string(),
                    },
                    Some(fmt) => fmt.clone(),
                };
                write!(f, "{}", dt.format(&fmt))
            }

Are you afraid of a performance penalty this way?

matteosantama (Contributor) commented:

Actually, I think this would be faster than your way -- primarily because it requires no additional looping.

ritchie46 (Member) commented:

This would be a heap allocation to a string for every value written. That would be slow.

I don't think we need the string alloc. But I don't think we can make these performance claims, as this hits every written value.

matteosantama (Contributor) commented:

Ah right, this is on a per-value basis, not per-column. Fair enough.

matteosantama (Contributor) commented:

You are right -- the .to_string() bit is unnecessary.

There isn't much of a performance penalty to my approach:

import pandas as pd
import polars as pl

df = pl.Series(pd.date_range("2000-01-01", "2100-01-01")).to_frame()

%timeit df.write_csv()

# master
46.5 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# my approach
46.5 ms ± 777 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

ritchie46 (Member) commented:

This is not really a good benchmark. When writing such a small DataFrame, we are only benchmarking virtual dispatch, and likely Python interop. Try measuring different row sizes, up to a million rows or so, so that we benchmark the hot loop rather than the setup code.

Benchmarking different row sizes also tells us more about how it deals with caches.

alexander-beedie (Collaborator, Author) commented Sep 4, 2022:

@matteosantama

you mention that your approach is a compromise since per-column formatting was too heavyweight

Mostly because in my first approach I found myself making it unnecessarily complicated (I was allowing a different user-specified format per time unit and it became... unruly ;). So I pulled back to a cleaner implementation that would slide into the existing code with minimal fuss. It would be nice to do per-time-unit formatting eventually though... (I added something similar for cast(str) datetimes last week).

@ritchie46

try measuring different row sizes up to a millions rows or so, such that we benchmark the hot loop

Done; the per-time-unit approach above becomes ~5-7% slower on larger data (I benched on ~63,000,000 rows, as created by: df = pl.Series(pl.date_range(date(2000,1,1), date(2001,12,31), "1s")).to_frame()). So, not much slower, but not faster - at least, not on my home setup (vanilla M1).
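
For reference, that benchmark setup as a runnable sketch (assuming an IPython session for %timeit, and the pl.date_range API as of this release):

from datetime import date
import polars as pl

# ~63,000,000 rows of second-resolution datetimes
df = pl.Series(pl.date_range(date(2000,1,1), date(2001,12,31), "1s")).to_frame()

%timeit df.write_csv()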

ritchie46 (Member) commented:

Yep, I'd expect something like that.

I think we should go for your approach. It is simple.

We can explore something more fine-grained later, but I think that will require some effort to not lose performance.

matteosantama (Contributor) commented:

@alexander-beedie thanks for checking!

ritchie46 (Member) commented:

Thanks @alexander-beedie. That will save a lot of future bytes written. :)
