Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Floating-point numbers are uglified, a.k.a. write shortest floating-point representation with round-trip guarantee #1289

Open
Anton3 opened this issue Jun 23, 2024 · 23 comments · May be fixed by #1293

Comments

@Anton3
Copy link

Anton3 commented Jun 23, 2024

In the last few years, libraries like fmt have mastered printing of floating-point numbers. They use shortest representation with round-trip guarantee.

Meanwhile, yaml-cpp, after #649, started to uglify numbers in my configs (I use yaml-cpp to patch them).

For example, before:

latitude: 34.34
longitude: 56.56
altitude: 12.12
heading: 78.78

After:

latitude: 34.340000000000003
longitude: 56.560000000000002
altitude: 12.119999999999999
heading: 78.780000000000001
@davidzchen
Copy link

davidzchen commented Jun 29, 2024

+1 I am running into this issue too. Emitting a float 678.9 results in 678.900024, and emitting a double 890.1 results in 890.10000000000002.

@jbeder Can you take a look?

@jbeder
Copy link
Owner

jbeder commented Jun 29, 2024

This sounds reasonable, open to PRs. But you'll have to be careful about the API here, you don't want unintended consequences.

@Anton3
Copy link
Author

Anton3 commented Jul 2, 2024

The default should be "use shortest representation with round-trip guarantee" (like fmt's {}), while SetFloatPrecision and SetDoublePrecision should produce the current behavior of resulting in the given number of decimal significant figures (like printf's "%.Ng")

There should also probably be member functions for reverting to shortest, like SetShortestFloatPrecision() and SetShortestDoublePrecision()

@davidzchen
Copy link

davidzchen commented Jul 4, 2024

Is there a workaround for this issue in the meantime to get the behavior of using the shortest representation but not setting a fixed precision such that floating point values with more decimal points do not get truncated?

@SGSSGene since you were the author of #649, can you please take a look?

@SGSSGene
Copy link
Contributor

SGSSGene commented Jul 4, 2024

Ah, yes, #649 uglifies a lot.
I also think it is best to use "shortest representation with round-trip guarantee.", especially for a format like YAML.

A quick look, fmt uses internally an adjusted copy* of dragonbox.

I see following options:

  • Make a dependency of yaml-cpp to dragonbox (ugly, we than have to manage dependency, that yaml-cpp currently doesn't have)
  • Put an adjusted copy* of dragonbox into the yaml-cpp (disadvantage: dragonbox is Apache2 licensed, yaml-cpp MIT, we suddenly have a license mix)
  • Take the code copy of dragonbox of fmt (which is weirdly MIT licensed, disvadvantage, code copy of a code copy)

"adjusted copy": a copy of the orginal code, but wrapped into an additional namespace, such that users of yaml-cpp still can depend on there own version of dragenbox.

Implementing custom solution is out of scope of my time resources and abilities.

  • Setting a dependency between yaml-cpp and dragonbox via CPM I could do.
  • Copying an adjusted copy of dragonbox (two files) directly into yaml-cpp I could also do.

@davidzchen
Copy link

Thanks for taking a look, @SGSSGene.

Looking at the relevant part of the changelog of fmt, it looks like @jk-jeon, the author of jk-jeon/dragonbox themselves, implemented the Dragonbox algorithm in fmt (fmtlib/fmt#1882, fmtlib/fmt#1887, fmtlib/fmt#1894). I think this is how fmt avoids the license mix issue.

This might be a long-shot but @jk-jeon, is there any chance you could contribute a similar change to yaml-cpp?

If not, I think that the second or third option may be the best options. I would like to note that whichever option we take, we should ensure that it does not break compatibility with the Bazel build system set up in this repository that is in parallel with the CMake build scripts, since yaml-cpp is both used within Google and by other Bazel projects.

@SGSSGene
Copy link
Contributor

SGSSGene commented Jul 5, 2024

I will make a minimal draft PR as a discussion base, in the next few days.

@jk-jeon
Copy link

jk-jeon commented Jul 5, 2024

@davidzchen @SGSSGene @Anton3 First of all, thank you for having interests in my work!
As you pointed out, I directly contributed to fmt and the main motivation indeed was to add no trouble about licensing/dependency shenanigans into fmt whatsoever.

I recommend you to do any of the three options. Unfortunately I have no time to actually make a PR, but I can participate in maintaining the copy if you get one in your codebase. I can make PR's if I come up with some upgrades, or also you can poke me if you find some issues in it.

@SGSSGene SGSSGene linked a pull request Jul 6, 2024 that will close this issue
5 tasks
@SGSSGene
Copy link
Contributor

SGSSGene commented Jul 6, 2024

@jk-jeon: I am a big fan of your work 😅 I just read your post about fixed precision floats: https://www.reddit.com/r/cpp/comments/1dv2nu1/the_complete_solution_for_floornxy_computation/
Very cool!
@jk-jeon: Does this mean you also consent for dragonbox being published under MIT License in yaml-cpp? If yes, I will be willing to add the necessary files and adjust them.

I have created a PR #1293 that utilizes dragonbox. It might not do it in the most performant way. Open to any suggestion.
@Anton3 maybe you could try it, and check if this would solve your issue.

@jk-jeon
Copy link

jk-jeon commented Jul 6, 2024

@SGSSGene Thanks 😊

I do not mind publishing code under MIT. I think maybe a short note about my agreement in a comment would be nice.

I briefly looked at the PR, but I think it can cause some subtle rounding error.
For example, consider float x = 1.5474251e+26f. Its exact value is $1.54742504910672534362390528​ \times 10^{26}$, so if you round it after 8 decimal digits, then it becomes 1.547250e+26 which is incorrect (i.e. doesn't roundtrip), because if you feed it to a correct parser, what you get is the neighboring float whose exact value is $1.5474249568730049750761472​\times 10^{26}$.

The problem here is that the result of the nearest-rounding of x is actually closer to a different instance of float than to x. This counterintuitive shit show happens because the interval of real numbers around x that gets converted into x by a correct parser is not symmetric around x, which is because x is located precisely at the boundary where the binary exponent bumps to the next number.

I don't think there is a simple way to work around it. You probably should just give up std::stringstream and write a custom formatter. I think printing out actual digits will not be that hard, rather the most confusing part will be to choosing between fixed-point vs scientific notations.

Also, adding some more precision when the exponent is positive is questionable as well. I didn't investigate it closely, but I wouldn't be surprised if that breaks roundtrippability. Maybe not? I don't know honestly.

What fmt currently does is to just add trailing zeros if fixed-point format is used and the number of digits Dragonbox provides is smaller than the number of digits in the integer part. This at least doesn't deteriorate roundtrippability, but now problem is that the answer is not correctly rounded. Fixing this is maybe not a breeze. See fmtlib/fmt#3649. The title says "not shortest" but later someone else raised this correct rounding issue.

@SGSSGene
Copy link
Contributor

SGSSGene commented Jul 6, 2024

@jk-jeon: yes I am aware that there might be rounding errors if not full precision is used.
But in the case that 'std::numcric_limits::max_digits10` is used, it should produce the correct result with roundtrip guarantee, wouldn't it?
I believe what we really want is floff.

So for everyone, I think there are a few discussion points:

  • If set precision is not the maximum precision, do we want roundtrip guarantee? (Currently it doesn't)
  • We could only use dragonbox, if max precision is set. In every other case we just do what ever std::stringstream does. (This would only lead to pretty output if max precision is choosen).
  • More a side note dragonbox currently only has support for float and double. long double will just go through std::stringstream.
  • How important is scientific notation? Currently I use dragonbox to calculate the precision, and set std::stringstream accordingly.

I believe that writing a complete custom formatter is out of scope, time-wise on my side.

@jk-jeon
Copy link

jk-jeon commented Jul 6, 2024

@SGSSGene That is not correct, because this shortest roundtrip business is a bit different from rounding. What it does is not trying to find the minimum number of digits such that the result of rounding at that digit will roundtrip. What it does is to find whatever number, which in principle can have nothing to do with rounding, which will roundtrip when fed to a correct parser.

So, let's look at the example I brought: x = 1.5474251e+26f. If you feed it into Dragonbox, it would spit out 1.5474251e+26, i.e., 8 decimal digit number.

However, rounding the original floating-point number x after 8 decimal digits produces a different result: 1.5474250e+26. Note that the last digit is different. You can confirm from the exact value I wrote above that the digit after 15474250 is 4, so that is indeed the correctly rounded answer.

Yet, if you feed this into a correct parser, then it will interpret this number into not x, rather the one right below it, say y, because the nominal value of 1.5474250e+26 is actually closer to y than x.

This sounds tricky, but it is not something super esoteric if you think about it: 1.5474250e+26 is the decimal that is closest to x, but that doesn't mean that x is the binary that is closest to it. The relation is not symmetric.

The roundtrip business is not supposed to find the closest decimal to x, rather, it finds the decimal whose closest binary is x. In contrast, rounding of x finds the decimal closest to x.

@SGSSGene
Copy link
Contributor

SGSSGene commented Jul 7, 2024

@jk-jeon: Ah! yes, you are right. Ok. In this case I will write a simple custom formatter, that also supports scientific notation. How annoying 😅.

@Anton3
Copy link
Author

Anton3 commented Jul 8, 2024

  • If set precision is not the maximum precision, do we want roundtrip guarantee? (Currently it doesn't)

I believe not having roundtrip guarantee by default after #649 is a bug.

  • We could only use dragonbox, if max precision is set. In every other case we just do what ever std::stringstream does. (This would only lead to pretty output if max precision is choosen).

This is a non-obvious API, but could work.

In principle, some code could rely on there being exactly N digits after Set*Precision, where max_digits10 is one of the possible values.

  • How important is scientific notation? Currently I use dragonbox to calculate the precision, and set std::stringstream accordingly.

It is important to switch to scientific notation for very small / very large numbers.

What is not critical:

  1. Not always writing shortest representation is OK (this would result in some incremental decrease in uglification)
  2. Switching to the scientific notation at imprecise boundaries

@jk-jeon
Copy link

jk-jeon commented Jul 8, 2024

How annoying 😅.

Welcome to floating-point hell! This monstrosity never lets me down 😂

So my understanding is that you want to allow the users to provide the precision argument, or the users also can ignore it and let it to be the default.

My recommendation is, actually make it an overload set, so when the precision argument is absent, redirect the input to Dragonbox with a custom formatter (which simply adds trailing zeros if fixed-point format is chosen but it has insufficient amount of digits to fill out the integer part), and if the precision argument is present, redirect the input to std::stringstream or std::printf or things like that. I want to emphasize that Dragonbox is pretty useless for precision-based printing.

As @SGSSGene pointed out, floff is supposed to do this precision-based printing, but (1) its implementation has never really polished, (2) if you use it along with Dragonbox you are pointlessly duplicating the same (but slightly different) table twice, and more importantly (3) it only does scientific printing and modifying it for other formats is very nontrivial. Boost.CharConv already did this modification (and also table deduplication IIRC), but if you don't need absolute performance and are okay with its locale-dependence, then std::stringstream should work fine.

And just to be sure: you can't rely on std::to_chars, right?

@SGSSGene
Copy link
Contributor

SGSSGene commented Jul 9, 2024

Thank you for all your great input.

I believe not having roundtrip guarantee by default after #649 is a bug.

#649 fixes the roundtrip guarantee, but is not producing the shortest string representation possible. Creating very ugly outputs.

In principle, some code could rely on there being exactly N digits after Set*Precision, where max_digits10 is one of the possible values.

Set*Precision does not produce exactly N digits. It produces maximum of N digits. It matches the behavior of std::defaultfloat

I updated #1293.
I will try to describe how it works, I tried to mimic std::stringstream as much as possible: (long double is still using std::stringstream and produces ugly output).
It has two thresholds, which decides when to switch to scientific notation.

  • The lower limit is hard coded '1e-4'. Everything strict smaller will use scientific notation. (So everything with 10^-5).
  • The upper limit depends on the precision. By default Everything larger than 1e9 will use scientific notation, the '1e9' seems to reflect the precision. So a precision of '3' will switch to scientific notation at '1e3'.

Scientific notation will always use exactly 4 characters of the form e+04. The sign might be a -. And the integer will always be two digits, leading with a zero if required.

To stay as close as possible to std::stringstream I also introduced rounding. This was a little trickier than expected.
So if you want to print 0.1999 with precision 2, it will now print 0.2.
Even harder was printing 0.9999 with precision 2. This will now print 1.

@jk-jeon: currently, I am assuming the exponent is always a value between -99 and 99 (only two digits), is this a valid assumption for double and float? (Will this still be valid for long double?)

std::to_chars is only available since c++17, but yaml-cpp tests are currently being compiled with c++11, so I am assuming it is not an option.

@jk-jeon
Copy link

jk-jeon commented Jul 9, 2024

@SGSSGene For binary32 (float) it's true. But binary64 (double) does have numbers with three-digits exponents.

https://en.wikipedia.org/wiki/IEEE_754 (see the table)

Encoding of long double varies over platforms. MSVC interprets as binary64 (so same as double), and IIRC GCC/Clang usually interpret it as Intel's non-IEEE "binary80" extension:

https://en.wikipedia.org/wiki/Extended_precision

For both this format and for binary128, the exponents are at most 4-digits long.

SGSSGene added a commit to SGSSGene/yaml-cpp-1 that referenced this issue Jul 14, 2024
Add dragonbox to compute the required precision to print floating point
numbers. This avoids uglification of floating point numbers that
happen by default via std::stringstream.

Numbers like 34.34 used to be converted to '34.340000000000003' as strings.
With this version they will be converted to the string '34.34'.

This fixes issue jbeder#1289
SGSSGene added a commit to SGSSGene/yaml-cpp-1 that referenced this issue Jul 14, 2024
Add dragonbox to compute the required precision to print floating point
numbers. This avoids uglification of floating point numbers that
happen by default via std::stringstream.

Numbers like 34.34 used to be converted to '34.340000000000003' as strings.
With this version they will be converted to the string '34.34'.

This fixes issue jbeder#1289
SGSSGene added a commit to SGSSGene/yaml-cpp-1 that referenced this issue Jul 14, 2024
Add dragonbox to compute the required precision to print floating point
numbers. This avoids uglification of floating point numbers that
happen by default via std::stringstream.

Numbers like 34.34 used to be converted to '34.340000000000003' as strings.
With this version they will be converted to the string '34.34'.

This fixes issue jbeder#1289
SGSSGene added a commit to SGSSGene/yaml-cpp-1 that referenced this issue Jul 14, 2024
Add dragonbox to compute the required precision to print floating point
numbers. This avoids uglification of floating point numbers that
happen by default via std::stringstream.

Numbers like 34.34 used to be converted to '34.340000000000003' as strings.
With this version they will be converted to the string '34.34'.

This fixes issue jbeder#1289
SGSSGene added a commit to SGSSGene/yaml-cpp-1 that referenced this issue Jul 14, 2024
Add dragonbox to compute the required precision to print floating point
numbers. This avoids uglification of floating point numbers that
happen by default via std::stringstream.

Numbers like 34.34 used to be converted to '34.340000000000003' as strings.
With this version they will be converted to the string '34.34'.

This fixes issue jbeder#1289
SGSSGene added a commit to SGSSGene/yaml-cpp-1 that referenced this issue Jul 18, 2024
Add dragonbox to compute the required precision to print floating point
numbers. This avoids uglification of floating point numbers that
happen by default via std::stringstream.

Numbers like 34.34 used to be converted to '34.340000000000003' as strings.
With this version they will be converted to the string '34.34'.

This fixes issue jbeder#1289
@SGSSGene
Copy link
Contributor

@Anton3 any chance you can checkout #1293 ? If not, a quick message would also be welcome :-)

@Anton3
Copy link
Author

Anton3 commented Jul 29, 2024

@SGSSGene I might get around to check the PR on my code later this week. Initial comments:

  1. Huge dragonbox include leaking to external includes is unacceptable. In general, move FpToString implementation to source files
  2. Do not forget about Node, which converts floating-point numbers to strings immediately, dropping their origin. We should use the new shortest and correct representation there as well

@Anton3
Copy link
Author

Anton3 commented Jul 29, 2024

I'm not a repo maintainer, but a common practice is to put a patch/diff of your modifications of dragonbox.h somewhere, so that yaml-cpp maintainers are able to update this library. Hopefully they will help to determine the best location for this file. Also write somewhere (in the main Readme?) a small description of the dependency, where you've took the file, what's the license and how to update the dependency version using the patch file

@SGSSGene
Copy link
Contributor

SGSSGene commented Jul 30, 2024

@Anton3: thank you for your feedback. Looking forward to your results.

As a reaction to your feedback I did the following:

  • Split fp_to_string.h into .h/.cpp to have minimal headers
  • Renamed fp_to_string.{h, cpp} to fptostring.{h, cpp}` to fit more the rest of the file naming scheme in yaml-cpp
  • Moved dragonbox.h to src/contrib
  • Added a little more information into the dragonbox.h file which adjustments where made

@SGSSGene
Copy link
Contributor

@Anton3 any chance you take another look if #1293 fixes your issues?

@Anton3
Copy link
Author

Anton3 commented Aug 26, 2024

@SGSSGene Finally got around to looking at the PR. Everything is OK now. Running it through my codebase, everything works, but I found some broken tests like:

EXPECT_EQ(ToString(yaml), "... 0.00200000009 ...");

The error message here says that the value on the left is actually ... 0.002 ...

So I'll have to fix those tests manually, but this is exactly the improvement we want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants