-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Precise text file parsing #4081
Conversation
@jameslamb The benchmark result doesn't show |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution. Just two comments that address my concerns.
@@ -330,6 +331,26 @@ inline static const char* Atof(const char* p, double* out) { | |||
return p; | |||
} | |||
|
|||
// Use fast_double_parse and strtod (if parse failed) to parse double. | |||
inline static const char* AtofPrecise(const char* p, double* out) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that mean we can replace
LightGBM/include/LightGBM/utils/common.h
Lines 1079 to 1102 in 1d2f3e1
T operator()(const std::string& str) const { | |
double tmp; | |
// Fast (common) path: For numeric inputs in RFC 7159 format: | |
const bool fast_parse_succeeded = fast_double_parser::parse_number(str.c_str(), &tmp); | |
// Rare path: Not in RFC 7159 format. Possible "inf", "nan", etc. | |
if (!fast_parse_succeeded) { | |
std::string strlower(str); | |
std::transform(strlower.begin(), strlower.end(), strlower.begin(), [](int c) -> char { return static_cast<char>(::tolower(c)); }); | |
if (strlower == std::string("inf")) | |
tmp = std::numeric_limits<double>::infinity(); | |
else if (strlower == std::string("-inf")) | |
tmp = -std::numeric_limits<double>::infinity(); | |
else if (strlower == std::string("nan")) | |
tmp = std::numeric_limits<double>::quiet_NaN(); | |
else if (strlower == std::string("-nan")) | |
tmp = -std::numeric_limits<double>::quiet_NaN(); | |
else | |
Log::Fatal("Failed to parse double: %s", str.c_str()); | |
} | |
return static_cast<T>(tmp); | |
} |
which was added in #3942 with a call to this
AtofPrecise
? Do they behave exactly the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not an expert on floating number. So I write a test program https://gist.github.com/cyfdecyf/63f4e7339bbe5a5a23474fda66375742
The only difference is that strtod
would not return negative NaN. So the Atof
function in this gist handles this special case.
Please help take a look at the gist and check if there's any problem. I'll update Common::Atof
and replace the change added in #3492.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The special handling for -NaN in gist (revision 5) the has one problem though: it's incorrect to handle input like " -nan" (note beginning space). But I'm wondering if there's need for the special handling of this case? From this mailling list thread, it seems like different C library have different treatment for parsing "-nan".
I suggest just leave Common::AtofPrecise
as is and don't add special handling for "-nan" and the like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! LGTM.
@@ -0,0 +1,18 @@ | |||
cmake_minimum_required(VERSION 3.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether it is appropriate to add these benchmark scripts to the master branch. Why do you think they are necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not necessary to merge into the master branch. I'm just curious about the actual performance of Common::Atof
and fast_double_parser
.
And maybe this benchmark can be helpful for people who want to improve the performance of text parser.
I can remove this commit if you decide to not include it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmark commit is now reverted. It's better to create microbenchmarks for this.
CMakeLists.txt
Outdated
@@ -1,3 +1,4 @@ | |||
OPTION(USE_PRECISE_TEXT_PARSER "Use precise double parser for text input file" OFF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need new compilation option only for one function? Why not simply use AtofPrecise
instead of Atof
by default?
Having a lot of functions doing the same work is very confusing, greatly increases maintenance burden and hurts overall development process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- If
AtofPrecise
is faster (or not much slower) thanAtof
, I'd like to use it by default. But it's actually much slower in my simple benchmark- When I see
fast_double_parser
mentioned in the commit log, I thought it's been used for text file parsing too. But it's actually not which confused me at first. I guess someone might have done the performance test and thus not using it for text parsing - With this performance difference, I'd choose precise version only when precision is required
- When I see
AtofPrecise
does not behave exactly the same withAtof
. For some of my test models, prediction results for csv input can differ starting from the 5th non-zero digit- As comment in
utils/common.h
says bothStringToHelperFast
andStringToHelper
are kept to maintain bit-for-bit legacy LightGBM behavior for precision, I guess you'd prefer to keep the old behavior by default - I noticed this when working on [python-package] Create Dataset from multiple data files #4089 and some followups. Result verification shows this difference which makes me upset about the correctness
- As comment in
I've also been biten by ClickHouse's choice of non-precise float parsing by default. They choose to sacrifice 1 or 2 bits of precision to keep good float parsing performance ClickHouse/ClickHouse#1665 (3 years ago, not sure whether still true now). For me, I choose to change the float parsing function to precise and the compile the package myself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see now that the main reason is that AtofPrecise
is much slower than current solution.
But I'm still strongly against adding new compilation option for this because of maintenance burden and not many users will compile the library on their own to get precise file parsing.
However, I believe that new config param will be a good workaround for this situation. Just like recently added (see #3494 and #3578) deterministic
param. Users don't have to re-compile the library every time they are switching "performance/accuracy" scenarios but do it in the runtime.
Is it possible to add new param for precise parsing? WDYT? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback. I'll add a config parameter and remove the compile time option to do this.
BTW, deterministic
parameter is very useful for verifying result when change code in LightGBM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have the config parameter for precise float parsing, shall we remove this compile option?
@@ -330,6 +331,26 @@ inline static const char* Atof(const char* p, double* out) { | |||
return p; | |||
} | |||
|
|||
// Use fast_double_parse and strtod (if parse failed) to parse double. | |||
inline static const char* AtofPrecise(const char* p, double* out) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! LGTM.
I've been busy with other features these days. I'll finish adding new option to enable precise float parsing soon. |
Sorry, I didn't noticed that @StrikerRUS 's comment hasn't been addressed before approving this. |
The compilation option should be changed into a config option.
498090d
to
9db490a
Compare
I rebased this PR to latest master and made a force push. The latest commit adds new option |
9db490a
to
724872b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost done. Just a question about keeping the compilation option for precise float parsing.
CMakeLists.txt
Outdated
@@ -1,3 +1,4 @@ | |||
OPTION(USE_PRECISE_TEXT_PARSER "Use precise double parser for text input file" OFF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have the config parameter for precise float parsing, shall we remove this compile option?
a60d94f
to
4a658e0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except one nit below to keep all params descriptions in consistent style.
As this parser is not used for model files but only for datasets, I believe there will be no any inconsistency issues with default parser, right?
#3463 (comment)
LightGBM/include/LightGBM/utils/common.h
Lines 314 to 321 in d517ba1
if (tmp_str == std::string("na") || tmp_str == std::string("nan") || | |
tmp_str == std::string("null")) { | |
*out = NAN; | |
} else if (tmp_str == std::string("inf") || tmp_str == std::string("infinity")) { | |
*out = sign * 1e308; | |
} else { | |
Log::Fatal("Unknown token %s in data file", tmp_str.c_str()); | |
} |
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@StrikerRUS Thank you for taking time to review this PR. The added corner test cases indeed found one problem which is not setting Regarding your question about loading model files, The latest commit in master branch does not handle LightGBM/include/LightGBM/utils/common.h Lines 1086 to 1099 in 5014f19
|
BTW, why does model loading uses both precise and non-precise version of floating point number parsing function? For example: Line 686 in 5014f19
Line 710 in 5014f19
Is this for keeping backward compatibility? |
The second one is not doing floating point number parsing. It just parses an integer. Using |
I thought that Lines 751 to 761 in d517ba1
, https://github.com/microsoft/LightGBM/pull/3938/files. Was I wrong? |
@StrikerRUS Sorry, I did not made it clear. I mean because here the value is an integer, so both methods won't cause information loss. BTW, can we merge this PR? |
Thanks, got it! But for the linked case with the
I don't have any objections. I think we can merge if you don't have any comments for code changed after your previous review. |
Exactly @StrikerRUS, and we should be very careful if we were to switch methods when parsing doubles or we can end up with subtly different model scores just by upgrading LightGBM. Hence the decision at the time when the big model read/write was done, to keep the old bit-for-bit behaviour, independently of the speed of the parsing when switching to the fast_double_parser. |
I linked to the wrong example in the first place. In fact I was planing to include an example like @StrikerRUS has showed. |
@StrikerRUS Yes, that's correct. @AlbertoEAF thanks for your explanation. @cyfdecyf That's OK. We can merge this PR now. Thanks for your contribution. |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
When comparing prediction result using command line version and Python API, I noticed some prediction values differs starting at the 5th non-zero digits. I suspect the difference is caused by differnet float number parsing algorithms used in LightGBM and pandas. (For reference, I used
pandas.read_csv(fname, float_precision="round_trip")
to load csv file in my Python code.)So I added precise text file parsing with
fast_double_parser
and the result confirms my guess.This patch also contains a simple benchmark which shows
Common::Atof
is much faster than usingfast_double_parser
.