feat[rust, python]: if datetime format not specified for "write_csv", infer the required precision #4724
Conversation
```rust
// if datetime format not specified, infer the maximum required precision
if options.datetime_format.is_none() {
    for col in df.get_columns() {
        match col.dtype() {
            DataType::Datetime(TimeUnit::Microseconds, _)
                if options.datetime_format.is_none() =>
            {
                options.datetime_format = Some("%FT%H:%M:%S.%6f".to_string());
            }
            DataType::Datetime(TimeUnit::Nanoseconds, _) => {
                options.datetime_format = Some("%FT%H:%M:%S.%9f".to_string());
                break; // highest precision; no need to check further
            }
            _ => {}
        }
    }
    // if still not set, no cols require higher precision than "ms" (or no datetime cols)
    if options.datetime_format.is_none() {
        options.datetime_format = Some("%FT%H:%M:%S.%3f".to_string());
    }
}
```
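For reference, a rough sketch of the intended effect from the Python side (my own illustration, not part of the diff; the column names and values are hypothetical):

```python
from datetime import datetime

import polars as pl

# hypothetical frame mixing "ms" and "us" datetime columns
df = pl.DataFrame({
    "ts_ms": [datetime(2077, 1, 1, 12, 30, 45, 123456)],
    "ts_us": [datetime(2077, 1, 1, 12, 30, 45, 123456)],
}).with_columns([
    pl.col("ts_ms").cast(pl.Datetime("ms")),
    pl.col("ts_us").cast(pl.Datetime("us")),
])

# the highest precision present is "us", so the inferred format should be
# "%FT%H:%M:%S.%6f", applied to every datetime column in the file
print(df.write_csv())
# ts_ms,ts_us
# 2077-01-01T12:30:45.123000,2077-01-01T12:30:45.123456
```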
@alexander-beedie you mention that your approach is a compromise since per-column formatting was too heavyweight. Accommodating per-column formatting would be pretty surgical, though -- you could simply change these lines to
```rust
None => {
    let dt = match tu {
        TimeUnit::Nanoseconds => temporal_conversions::timestamp_ns_to_datetime(v),
        TimeUnit::Microseconds => temporal_conversions::timestamp_us_to_datetime(v),
        TimeUnit::Milliseconds => temporal_conversions::timestamp_ms_to_datetime(v),
    };
    let fmt = match &options.datetime_format {
        None => match tu {
            TimeUnit::Nanoseconds => "%FT%H:%M:%S.%9f".to_string(),
            TimeUnit::Microseconds => "%FT%H:%M:%S.%6f".to_string(),
            TimeUnit::Milliseconds => "%FT%H:%M:%S.%3f".to_string(),
        },
        Some(fmt) => fmt,
    };
    write!(f, "{}", dt.format(fmt))
}
```
Are you afraid of a performance penalty this way?
Actually, I think this would be faster than your way -- primarily because it requires no additional looping.
This would be a heap allocation to a string for every value written, which would be slow. I don't think we need the string alloc, but I also don't think we can make these performance claims, as this path is hit for every written value.
Ah right, this is on a per-value basis, not per-column. Fair enough.
You are right -- the `.to_string()` bit is unnecessary.
There isn't much of a performance penalty with my approach:
```python
df = pl.Series(pd.date_range("2000-01-01", "2100-01-01")).to_frame()

%timeit df.write_csv()
# master
46.5 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# my approach
46.5 ms ± 777 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
This is not really a good benchmark. When writing such a small DataFrame, we are mostly benchmarking virtual dispatch and, likely, Python interop. Try measuring different row sizes, up to a million rows or so, so that we benchmark the hot loop rather than the setup code. Varying the row size also tells us more about how the approach deals with caches.
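A sketch of such a sweep (reusing the `pd.date_range` construction from the snippet above; the row counts and timing harness are illustrative, not from this thread):

```python
import timeit

import pandas as pd
import polars as pl

# illustrative row counts; at the larger sizes the per-value formatting loop
# should dominate over setup and Python interop
for n_rows in (10_000, 100_000, 1_000_000):
    df = pl.Series(pd.date_range("2000-01-01", periods=n_rows, freq="s")).to_frame()
    secs = timeit.timeit(df.write_csv, number=5) / 5
    print(f"{n_rows:>9} rows: {secs * 1000:.1f} ms per write_csv")
```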
> you mention that your approach is a compromise since per-column formatting was too heavyweight
Mostly because in my first approach I found myself making it unnecessarily complicated (I was allowing a different user-specified format per time unit and it became... unruly ;). So I pulled back to a cleaner implementation that slides into the existing code with minimal fuss. It would be nice to do per-time-unit formatting though... (I added something similar for `cast(str)` on datetimes last week).
> try measuring different row sizes up to a million rows or so, such that we benchmark the hot loop
Done; the per-time-unit approach above becomes ~5-7% slower on larger data (I benchmarked on ~63,000,000 rows, as created by `df = pl.Series(pl.date_range(date(2000,1,1), date(2001,12,31), "1s")).to_frame()`). So not much slower, but not faster either - at least, not on my home setup (vanilla M1).
Yep, I'd expect something like that.
I think we should go for your approach; it is simple.
We can explore something more fine-grained later, but that will require some effort to not lose performance.
@alexander-beedie thanks for checking!
Thanks @alexander-beedie. That will save a lot of future bytes written. :)
If the `datetime_format` parameter is not set by the caller, the `CSVWriter` defaults to nanosecond precision, even when not required. This patch allows the `write_csv` call to infer the maximum required output precision so that unnecessary decimals aren't written and the precision of the output matches the precision of the frame. (If the `datetime_format` parameter is set, it is of course respected, and no inference is done.)

Example
Before (forces "ns" precision on output)
After (infers actual "ms" precision: 20% more compact)
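For a frame with a single millisecond-precision datetime column, the before/after difference looks roughly like this (illustrative values, not the original example output):

```
# before: the default format pads every value out to nanoseconds
ts
2022-01-01T00:00:00.000000000
2022-01-01T00:00:01.000000000

# after: the inferred "%FT%H:%M:%S.%3f" matches the frame's "ms" unit
ts
2022-01-01T00:00:00.000
2022-01-01T00:00:01.000
```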
FYI: I did consider trying per-column precision on output, but it felt like an unnecessarily heavyweight approach, and was more intrusive in the code; given that most frames are likely to have consistent time units this seemed more reasonable.