Perfect accuracy float parsing #536
@dtolnay Any update on this PR? I just ran into this issue today.
A stupid question: why not use the standard library's `parse`?
Is this what is already available under the `arbitrary_precision` feature? Just in case it is open for debate, I would vote for making the exact version the default and making the approximate/optimized logic opt-in. Defaulting to lossy serialization/deserialization round-trips is a source of real headaches.

As an example: rust-geo-booleanop is modeled after an algorithm written in JS, and currently we are trying to fix some issues in both implementations. Test cases are specified as JSON. In theory the implementations should behave exactly the same, but I was debugging weird differences for days. Considering that Rust strives for correctness in general, I suspected the differences were caused on the JS side. I really didn't expect the test case input data to get modified during parsing on the Rust side.
If the mechanism for changing this option is feature flags, then it makes more sense to opt in to the higher-precision option, even if you think that should be the default. Otherwise, any dependency could trigger lossy precision and unintentionally break your application.
Would it be possible to have the best of both worlds, by falling back to 100% correct parsing when an inaccuracy is detected? Perhaps by reserializing with ryu? This feature is important to my use case.
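A minimal sketch of that fallback idea (my own illustration, not code from this thread; `fast_parse` is a hypothetical stand-in, and the string comparison only detects inaccuracy when the input is already in ryu's shortest form):

// Accept the fast result only if ryu's shortest formatting reproduces
// the input exactly; otherwise fall back to a known-correct parser.
// std's FromStr for f64 is correctly rounded, so it serves as the
// fallback here.
fn parse_checked(input: &str) -> f64 {
    let fast = fast_parse(input);
    let mut buffer = ryu::Buffer::new();
    if buffer.format_finite(fast) == input {
        fast
    } else {
        input.parse().unwrap()
    }
}

// Hypothetical fast path: a real one would use the approximate
// significand/power-of-ten method described in the issue body.
fn fast_parse(input: &str) -> f64 {
    input.parse().unwrap()
}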
Adding to @bluenote10's comment: it seems that the standard library's parse function is straightforward and probably good for this use case. In fact, @dtolnay has committed to the stdlib file containing that implementation. Edit: and the JSON parser in Rust's libserialize uses it as well.
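For reference, a small demonstration that the standard library's parse is correctly rounded, even on the near-halfway subnormal inputs discussed later in this thread (the values are taken from the examples below):

// std's FromStr for f64 rounds correctly even at the subnormal boundary.
let just_above_half: f64 = "2.4703282292062328e-324".parse().unwrap();
assert_eq!(just_above_half, 5e-324); // rounds up to the smallest subnormal

let just_below_half: f64 = "2.4703282292062327e-324".parse().unwrap();
assert_eq!(just_below_half, 0.0); // rounds down to zero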
Adding some more detail about how important this is for me: what I'm working on will be the source of truth for our users' data. As a result, we need to know that the parsing is lossless. I'm happy to help out with a PR if there's consensus on how to approach this.
Did you see this WIP PR #541?
@sjackman IMHO that PR is problematic precisely because it takes the feature-flag approach. I think correct parsing should be the default behavior.
I agree.
As the author of rust-lexical, I would highly recommend not implementing your own correct float parser without exhaustive testing, since it's very easy to get wrong. I see one of three general approaches: use lexical as-is, use a byte-at-a-time iterator API, or implement a correct parser from scratch. Any correct float parser requires extended-precision arithmetic for the moderate path and arbitrary-precision (big-integer) arithmetic for the slow path. It's easy to get this wrong, so I will gladly help others do this, if I can be of service.

To use the lexical approach, you would use the following code:

let format = lexical_core::NumberFormat::JSON;
let result = lexical_core::parse_format::<f64>(b"3.0e7", format);
@Alexhuszagh Would you consider submitting a PR to use lexical?
@Alexhuszagh Are there differences between the byte-at-a-time approach and what #541 is doing? IIUC, using lexical as-is means keeping intermediate buffers around.
@rw I haven't checked exactly, but the byte-at-a-time approach would still require an intermediate buffer; it just wouldn't be visible, and it would allow us to implement the parser in terms of input iterators.

Iterator Types

Using C++-like nomenclature, I'm going to briefly introduce iterator types: an input iterator can only be traversed once, in a single forward pass, while a forward iterator can be traversed multiple times. Examples of input iterators in Rust include `std::io::Bytes`, which yields one byte at a time from a reader.

Algorithm Approach

All correct algorithms start by extracting the significant digits and the exponent, and then proceed through up to three tiers:

- Fast Algorithm: the significant digits and exponent are small enough that native floating-point operations are exact, so a simple conversion suffices.
- Moderate Algorithm: an extended-precision representation (a 64-bit significand with a separate binary exponent) handles most remaining cases, with explicit error bounds.
- Slow Algorithms: arbitrary-precision (big-integer) arithmetic resolves close-to-halfway cases exactly.

Summary

As you can see, there is no way to avoid iterating over the input twice in the worst case, so we have to have an intermediate buffer at some level. This also coincides with how readers typically work: they almost always use buffered reads; the buffer just isn't user-visible (imagine making syscalls for every byte read, or a network request per byte). That means that at some level, a buffer needs to be present to implement a correct and efficient parser. In order to avoid iterating twice over the input, we would need a forward (multi-pass) iterator rather than an input iterator.

We do, however, have one advantage: we know that only 767 decimal digits can contribute to a binary number with 53 bits of significant digits. For the theory behind this claim, please see Section 2.7 on radix conversions. This means we have an upper limit on the number of bytes required for both the input buffer and the big-integer, allowing us to stack-allocate both.
In short, any approach accepting an input iterator or a similar object like `std::io::Read` will still need an intermediate buffer at some level.
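A sketch of what that stack-allocated digit buffer can look like (illustrative names, not lexical's actual types): since at most 767 significant digits can affect the result, extra digits can be dropped, as long as we record that truncation happened.

// At most 767 significant decimal digits can affect the rounding of an
// f64, so the digit buffer never needs the heap.
const MAX_DIGITS: usize = 767;

struct DigitBuffer {
    digits: [u8; MAX_DIGITS],
    len: usize,
    truncated: bool, // digits beyond the limit were dropped
}

impl DigitBuffer {
    fn new() -> Self {
        DigitBuffer { digits: [0; MAX_DIGITS], len: 0, truncated: false }
    }

    fn push(&mut self, digit: u8) {
        if self.len < MAX_DIGITS {
            self.digits[self.len] = digit;
            self.len += 1;
        } else {
            // A dropped digit can still matter for halfway cases, so
            // the slow path must be told the input was truncated.
            self.truncated = true;
        }
    }
}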
@Alexhuszagh thanks for the link and the great explanation! However, I think your statement about 767 decimal digits and buffer sizes is misleading. The handbook says that 767 digits may be required to perfectly represent a 64-bit float in base 10, but parsing algorithms are doing the opposite conversion, and far fewer digits than that are required to unambiguously identify a 64-bit float. A rough estimate suggests that on the order of 17 significant digits should be enough.
@Diggsey You're doing the conversion the wrong way around, although you're getting fairly close (you have very good intuition for this). Float to string only requires 17 digits to uniquely identify a float, but going string to float requires up to 767 digits, because you can get close-to-halfway cases. Since 767 is still a hard upper bound, though, we can use a stack-allocated buffer with no dynamic memory allocation. See test cases C126, C127, and C128 for examples.
Here are practical Python examples for those test cases, @Diggsey. As you can see, we need a max of 767 significant digits to parse a float, but only 17 to write one.

>>> x = "7.4109846876186981626485318930233205854758970392148714663837852375101326090531312779794975454245398856969484704316857659638998506553390969459816219401617281718945106978546710679176872575177347315553307795408549809608457500958111373034747658096871009590975442271004757307809711118935784838675653998783503015228055934046593739791790738723868299395818481660169122019456499931289798411362062484498678713572180352209017023903285791732520220528974020802906854021606612375549983402671300035812486479041385743401875520901590172592547146296175134159774938718574737870961645638908718119841271673056017045493004705269590165763776884908267986972573366521765567941072508764337560846003984904972149117463085539556354188641513168478436313080237596295773983001708984374999e-324"
>>> float(x)
5e-324
>>> y = "7.4109846876186981626485318930233205854758970392148714663837852375101326090531312779794975454245398856969484704316857659638998506553390969459816219401617281718945106978546710679176872575177347315553307795408549809608457500958111373034747658096871009590975442271004757307809711118935784838675653998783503015228055934046593739791790738723868299395818481660169122019456499931289798411362062484498678713572180352209017023903285791732520220528974020802906854021606612375549983402671300035812486479041385743401875520901590172592547146296175134159774938718574737870961645638908718119841271673056017045493004705269590165763776884908267986972573366521765567941072508764337560846003984904972149117463085539556354188641513168478436313080237596295773983001708984375e-324"
>>> float(y)
1e-323
>>> z = "7.4109846876186981626485318930233205854758970392148714663837852375101326090531312779794975454245398856969484704316857659638998506553390969459816219401617281718945106978546710679176872575177347315553307795408549809608457500958111373034747658096871009590975442271004757307809711118935784838675653998783503015228055934046593739791790738723868299395818481660169122019456499931289798411362062484498678713572180352209017023903285791732520220528974020802906854021606612375549983402671300035812486479041385743401875520901590172592547146296175134159774938718574737870961645638908718119841271673056017045493004705269590165763776884908267986972573366521765567941072508764337560846003984904972149117463085539556354188641513168478436313080237596295773983001708984375001e-324"
>>> float(z)
1e-323
Ah, I was forgetting about such edge cases. I would be glad if serde_json would just losslessly round-trip all 64-bit floats, which seems like a much easier problem.
I agree with @Diggsey: I am more interested in round-tripping correctly by default, and a lot less interested in perfect parsing of arbitrarily long decimal representations by default, as I don't see the latter as a common practical requirement that would be worth paying for in compile time by default. We can have an option, but it would be off by default if it doesn't come for free with perfect round-tripping. In particular, I don't think I am on the same page as @Alexhuszagh right now; somewhere along the work in the perfect_float PR, it appears to have gotten further away from correctly round-tripping things (#541 (comment)).
@dtolnay See the comments addressing that; I'm not entirely sure what went on in that example, but it doesn't reproduce for me anymore. As for round-tripping 17-digit floats, it's possible in most cases using only the fast and moderate paths described above (although there are some edge cases with exponents that will cause this to fail). Unfortunately, arbitrary-precision arithmetic is pretty much required to fully round-trip every value, which also means supporting arbitrarily long inputs.
For example, the following three tests are good test cases for why you necessarily need support for arbitrarily long inputs:

>>> float("2.4703282292062326e-324")
0.0
>>> float("2.4703282292062327e-324")
0.0
>>> float("2.4703282292062328e-324")
5e-324

This is because although only 17 digits are present, which is enough to uniquely identify a float, the exponent being base 10 means we get approximation errors for close-to-halfway values. A side effect is that we internally need arbitrary-precision arithmetic, which also lets us support arbitrarily long inputs. The two, sadly, are linked.
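To make the arbitrary-precision step concrete, here is a sketch of the core slow-path comparison (my own illustration, assuming the num-bigint crate, not code from lexical): deciding exactly whether the decimal input lies above the halfway point between a candidate float and its successor.

use num_bigint::BigUint;

// Exactly compare the decimal input d * 10^d_exp against the halfway
// point between candidate float `b` and the next float up. Returns true
// if the input is strictly above the halfway point (round up); exact
// equality would instead trigger ties-to-even. Assumes `b` is finite,
// non-negative, and not f64::MAX.
fn exceeds_halfway(d: &BigUint, d_exp: i32, b: f64) -> bool {
    let bits = b.to_bits();
    let biased = ((bits >> 52) & 0x7ff) as i32;
    let frac = bits & ((1u64 << 52) - 1);
    // Significand m and binary exponent e with b = m * 2^e,
    // handling subnormals (biased exponent of zero).
    let (m, e) = if biased == 0 {
        (frac, -1074)
    } else {
        (frac | (1u64 << 52), biased - 1075)
    };
    // The halfway point is exactly (2m + 1) * 2^(e-1).
    let mut lhs = d.clone();
    let mut rhs = BigUint::from(2 * m + 1);
    // Multiply out the negative exponents so both sides are integers.
    if d_exp >= 0 {
        lhs *= BigUint::from(10u32).pow(d_exp as u32);
    } else {
        rhs *= BigUint::from(10u32).pow((-d_exp) as u32);
    }
    if e >= 1 {
        rhs <<= (e - 1) as usize;
    } else {
        lhs <<= (1 - e) as usize;
    }
    lhs > rhs
}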
Replying to @dtolnay's point above about compile-time cost: personally, I'd be happy to have a longer compile time if this were available. I guess it would be like a restricted version of the arbitrary-precision arithmetic feature?
I do have code for a 128-bit mantissa using the moderate path that should be able to handle round-trip conversions; however, it provided a fairly meagre performance benefit when used, and was a lot slower when the slow path was required. Still, it should fully round-trip any case of 17-19 digits (likely anything 30 or fewer).
I think we are still not on the same page. :( I don't see how the examples in #536 (comment) are relevant to round-tripping f64 -> JSON -> f64.
If that's your goal, I can gift you code that will do this. The moderate path will be more than sufficient for it. I won't implement it personally, however.
I can submit a PR that would do this. It will take me some time, but it's stable code, and it will round-trip clear representations of values. I don't personally like the approach (it will get certain inputs wrong, which may be present from decimals or other representations), so I'm not willing to modify lexical itself to do this. Let me know if that works.
Thanks! That would be great. The things I would like to know to move this issue forward, from you or someone else, are:

- What compile-time cost would the correct algorithm add?
- What run-time performance cost would it add?
@dtolnay Compile-time cost should be fairly minimal. Run-time cost should be fairly minimal as well; almost all of the overhead comes from the arbitrary-precision work. Assuming you already have the code to parse the significant digits and the exponent of a float (which I'm assuming you do), then the remaining requirements are basically these:

1. You need an extended-precision float class. This is a small amount of code.
2. You need arbitrary-precision (big-integer) arithmetic for the slow path. This requires a lot, lot more code, especially for parsing complex inputs.

Lexical is not optimized for compile times, and for good reason. Let me get you a simple proof of concept, and I can show you that performance is good in pretty much all cases. I can give you numbers right now, but it would likely be better done in context.
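As an illustration of the first requirement, a minimal extended-precision float might look like this (my own sketch, not lexical's actual type): a 64-bit significand paired with a separate binary exponent, so that multiplying by precomputed powers of ten keeps far more precision than an f64 would.

// Extended-precision float: value = mant * 2^exp, with the significand
// kept normalized (high bit set) for maximum precision.
#[derive(Clone, Copy)]
struct ExtendedFloat {
    mant: u64,
    exp: i32,
}

impl ExtendedFloat {
    // Multiply, keeping the top 64 bits of the 128-bit product. A real
    // implementation also rounds the discarded low half, renormalizes,
    // and tracks the accumulated error bound.
    fn mul(self, other: ExtendedFloat) -> ExtendedFloat {
        let product = (self.mant as u128) * (other.mant as u128);
        ExtendedFloat {
            mant: (product >> 64) as u64,
            exp: self.exp + other.exp + 64,
        }
    }
}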
Thanks, this was my hunch, and why I insisted on distinguishing between the two requirements even if lexical isn't interested in supporting the simpler one. I know that people who care about accuracy at all costs are overrepresented among those who choose to follow this thread, but in the broader serde_json user base, compile time matters a whole lot.
@dtolnay Fair. I'll have a working example shortly; it's not optimized for speed yet (I've stripped out every single `#[inline]` and removed almost every trait and assumption). It will be published here. I'll license this code however you wish, as long as you give me appropriate credit (just a blurb in the source file).
Sounds good, I'd be happy to give credit in the source file. Thanks for your help!
@dtolnay I've pushed the initial commit. Everything is done through the `parse_float` method, but it should give you a pretty good idea of how minimal the requirements are: it's under 800 lines of code, with no macros (except for trait impls). Let me add documentation so it makes some sense, then a README and a few working examples, and I'll submit a PR to integrate it.
Ok, it's live. Let me know if everything looks good, and if so, I'll submit a PR with this as the default float parser. I'll add more comprehensive tests in the actual integration, but every piece is working identically to its behavior in rust-lexical, which makes it unlikely any errors are present. Please note this doesn't actually handle parsing the significant digits or the exponent from the value, but those require fairly trivial string-to-integer conversions.
@Alexhuszagh Nice! I ran the following loop as a smoke test. Is it fair to expect that this loop should run forever? I believe termination would indicate a bug in either ryu or minimal-lexical.

// Assumes the rand_core and rand_xorshift crates for the RNG.
use rand_core::{RngCore, SeedableRng};
use rand_xorshift::XorShiftRng;

let mut rng = XorShiftRng::from_seed([0; 16]);
let mut buffer = ryu::Buffer::new();
loop {
let input = f64::from_bits(rng.next_u64());
if input.is_finite() {
let printed = buffer.format_finite(input);
let (output, rest) = parse_float::<f64>(printed.as_bytes());
assert_eq!(output, input);
assert_eq!(rest, b"");
}
}

However, it fails on the following value, and the failure reproduces using std::fmt instead of ryu, which suggests this is more likely a bug in minimal-lexical. Any idea what could be going on?

let expected = -0.04628372940652459_f64;
println!("expected: {}", expected);
let mantissa = 4628372940652459_u64;
let exponent = -17_i32;
let truncated = false;
let actual = -minimal_lexical::create_float::<f64>(mantissa, exponent, truncated);
println!("actual: {}", actual);
assert_eq!(actual, expected);
Ok, this is patched, thank you. I had forgotten one of my own hard-earned lessons here.
@dtolnay I am finding one real-world example where this round-trip fails, and it won't be solvable without the slow-path algorithm. The failing test case is:

let expected = 2.638344616030823e-256_f64;
let mantissa = 2638344616030823_u64;
let exponent = -271_i32;
let truncated = false;
let actual = minimal_lexical::create_float::<f64>(mantissa, exponent, truncated); // no negation: the value is positive
assert_eq!(actual, expected);

This value is actually a halfway point, which is why we can't differentiate it. Here are the values printed with numerous extra digits, just to show the problem:

b = 2.638344616030822852118090731545e-256
b1 = 2.638344616030823147881137273239e-256
# Estimated, since this cannot be a real float
bh = 2.638344616030822999999614002392e-256

A quick check shows the moderate path in rust-lexical fails here, because the input is very close to a halfway value. To fix this, we would need a slow-path algorithm. Ultimately, this is your call. It works with the longer formatting printed out to 17-19 digits:

let mantissa = 2638344616030823147_u64;
let exponent = -274_i32;
let truncated = false;
let actual = minimal_lexical::create_float::<f64>(mantissa, exponent, truncated);

However, any modern float-formatting algorithm (like Ryu, Grisu3, etc.) will likely format it almost exactly on the halfway point, which means this is a real problem in practice.
What is the minimum compile time required to handle cases like 2.638344616030823e-256? Is it effectively equivalent to compiling lexical-core with the "correct" feature enabled? On my machine that corresponds to a 45% increase in CPU-seconds to compile serde_json, which is not something we would do by default.
I can provide a much smaller subset, so no, it won't be that large. A majority of the compile-time bloat in lexical comes from features for parsing non-decimal strings, non-float conversions, and support for arbitrary formats including decimal separators, etc. I can add in the slow-path algorithms, and the compile-time overhead will be much lower than lexical itself; however, the actual extraction of the significant float digits and validation of the float format will have to be done by the client. Also, I don't have to use all of my slow-path algorithms, which will cut compile time down even further. I currently have three slow-path algorithms, two of which are used for inputs <= 1000 characters, and one of which is used for inputs >= 1000 characters (non-decimal floats only). Since we only need some of these, that cuts back even further. A sample signature would be:

fn parse_float<'a, F, Iter1, Iter2>(integer: Iter1, fraction: Iter2, exponent: i32) -> F
where F: Float,
Iter1: Iterator<Item=&'a u8>,
Iter2: Iterator<Item=&'a u8>;

This way, all the parsing logic and validation of the float format can be handled with minimal overhead by the library using the minimal version. If this is the case, I would likely spin minimal-lexical out into a published crate, rather than a stub of one, because the amount of code will be a fair bit larger, and I would want to ensure any regressions get fixed immediately.
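Assuming that signature, a call site might look roughly like this (hypothetical glue code; a real JSON parser would track digit positions while scanning rather than using `find`):

// The caller validates the JSON number grammar and splits out the
// integer digits, fraction digits, and decimal exponent itself.
let input = "10.9876543210987654";
let dot = input.find('.').unwrap();
let (integer, fraction) = (&input[..dot], &input[dot + 1..]);
let value: f64 = parse_float(
    integer.as_bytes().iter(),
    fraction.as_bytes().iter(),
    0, // exponent from an explicit `e` suffix, if any
);
assert_eq!(value, 10.9876543210987654);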
Ok, @dtolnay, if you want to test minimal-lexical and the parse times, the updated version is now up. The API has changed somewhat, but see the provided examples for how to use it. I've also integrated your smoke test, which seems to be passing now. It's <3,000 lines of code (the full version of lexical is ~40,000, because supporting different float formats is a complete and utter nightmare, so compile times are slow even though the binary is fairly small and fast), it compiles quickly, and all my simple stress tests pass. I'm going to add in Rust's tests for comprehensive float parsing, plus well-known edge cases in float parsers. If all that passes, and the smoke test runs for more than an hour, I'll submit a PR to integrate this.
@Alexhuszagh Would it be useful to exhaustively test all f32 values?
@rw If you want, sure. The unit tests are quite comprehensive, and I'm mostly looking for regressions introduced in extracting the code from the greater lexical project, but more coverage is always good.
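Such an exhaustive test could mirror the earlier smoke test (a sketch, reusing the `parse_float` API from that comment; there are only 2^32 f32 bit patterns, so this is feasible):

// Every finite f32, printed shortest with ryu, must parse back to the
// identical bit pattern.
let mut buffer = ryu::Buffer::new();
for bits in 0u32..=u32::MAX {
    let input = f32::from_bits(bits);
    if input.is_finite() {
        let printed = buffer.format_finite(input);
        let (output, rest) = parse_float::<f32>(printed.as_bytes());
        assert_eq!(output.to_bits(), bits);
        assert_eq!(rest, b"");
    }
}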
Ok, everything works. I'll wait for @dtolnay to play with it a bit, see if he can break it, and see if the compile times are fast enough, and then I'll submit a PR. Finally, the build times are quite nice now.
Could you see if it's possible to strip this down further to bring down the compile time? I know you said the crate this mostly comes from was never optimized for compile time, but now would be the time to do that. In particular, I am not excited about bringing in new transitive dependencies on nodrop, arrayvec, and cfg-if.
Would you be fine with dynamic allocation for the big-integer backing store? Otherwise, arrayvec will be a requirement (it provides the stack-allocated vector). cfg_if can be trivially replaced; it's lightweight, so I doubt removing it will significantly change compile times, but from a dependency-management standpoint I get it, and I can remove it. As for nodrop, it's a requirement for arrayvec until v0.5, which unfortunately only supports Rustc 1.37+.
Ok, I've removed both dependencies by default. Arrayvec can be re-enabled via a feature flag.
@Alexhuszagh @dtolnay Is there anything I can do to help?
I have opened a new PR (#671) that uses the latest version of minimal-lexical.
The new float parser landed behind a feature in 1.0.54 -- https://github.com/serde-rs/json/blob/v1.0.54/Cargo.toml#L58-L65. It's possible it will become the default in a future version.

serde_json = { version = "1.0.54", features = ["float_roundtrip"] }
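With that feature enabled, the hard case from earlier in this thread round-trips exactly (a small check; the value is the near-halfway example discussed above):

// f64 -> JSON -> f64 preserves the exact bits with float_roundtrip on.
let original = 2.638344616030823e-256_f64;
let json = serde_json::to_string(&original).unwrap();
let parsed: f64 = serde_json::from_str(&json).unwrap();
assert_eq!(parsed.to_bits(), original.to_bits());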
Float parsing is currently implemented by calculating the significand as a u64, casting to f64, and then multiplying or dividing by a nonnegative power of 10. For example, the input

10.9876543210987654

would be parsed into the value

109876543210987654_u64 as f64 / 10e15

This algorithm is sometimes correct, or else usually close to correct in practical usage. It matches how JSON parsers are implemented in other languages.
However, it can happen that the result from this algorithm is not the mathematically nearest 64-bit floating-point value to the exact value of the input. A "correct" algorithm would always produce the mathematically nearest answer. This requires high-precision big-integer arithmetic in the general case, so there would be a large performance cost; if implemented, we would likely want this behind a cfg that is off by default, with the current approximate behavior as the default. This way programs can opt in to the more expensive algorithm as required.
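For illustration, a sketch of the approximate algorithm described above (my own rendering, not serde_json's actual code):

// Accumulate the significand as a u64, then scale by a power of ten.
// Both the cast and the power of ten can round, which is where the
// error comes from.
fn approx_parse(digits: &[u8], exp10: i32) -> f64 {
    let mut significand: u64 = 0;
    for &d in digits {
        significand = significand * 10 + (d - b'0') as u64;
    }
    let value = significand as f64;
    if exp10 >= 0 {
        value * 10f64.powi(exp10)
    } else {
        value / 10f64.powi(-exp10)
    }
}

// The example from above: approx_parse(b"109876543210987654", -16)
// computes 109876543210987654 as f64 / 1e16, i.e. ~10.9876543210987654.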