Options for better-looking outputs of to_chars #35

Open
alugowski opened this issue Jan 6, 2023 · 6 comments

@alugowski

Great package! Thank you!

I've noticed that to_chars sometimes emits an extra E0 suffix when it's not needed. For example, the number 1.0 is emitted as 1E0.

Is this intentional?

@jk-jeon
Owner

jk-jeon commented Jan 6, 2023

Thank you so much for your kind words!

> Is this intentional?

Yes. The intention is to match what the Ryu library spits out (which is always in the scientific form with E as the exponent marker).

I know this is probably not what people usually want. It might be more useful if to_chars output the string with the smallest number of characters (preferring the fixed-point form in case of a tie), but there are some reasons why I didn't do so.

  1. One of the main purposes of this library is to demonstrate the algorithm I developed, and for that purpose I need to compare it to existing competitors. At the time of writing this (and also right now), Ryu has been considered by many people to be the "state-of-the-art", although there already was a better algorithm (Schubfach) which for whatever reason didn't get much attention compared to Ryu. So I wanted to compare my algorithm's performance to Ryu's, and to make a fair comparison I need to match the output format details with Ryu. Another big bonus of doing so is that it makes testing against Ryu a lot easier. Ryu is very extensively tested for correctness, so testing against Ryu is a very effective way to prove the correctness of my algorithm and implementation. In fact, initially I just copied the to_chars implementation from the Ryu repo, as it was faster than the one I could write by myself at that time, and also by doing so I could more fairly compare the two algorithms. At some point, though, I completely rewrote the to_chars implementation from scratch and it's now very different from Ryu's.
  2. The main focus of the algorithm and also of this library is on to_decimal rather than on to_chars. The reason is that I don't believe there is a single right answer to what the output string should look like. For instance, you suggested that 1.0 is a better output than 1E0, but why not just 1? I'm pretty sure some people will prefer 1.0, while others will prefer 1. Maybe some people want 1,0, or maybe 1e0, or maybe 1.0e0. There are just too many possible ways to do it, and I don't have the ability to accommodate and optimize every scenario I can imagine. The main purpose of providing to_chars is merely to prove that a fast to_chars implementation is possible with the provided to_decimal, and also to demonstrate how that might be done. My goal was to provide a to_decimal implementation, so that anyone who needs a fast to_chars can leverage it to write their own to_chars optimized for their own use. To my understanding, that is indeed the way this library is being used by several other projects.
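To illustrate how many formatting choices sit on top of a single decimal result, here is a minimal sketch of a scientific-notation writer with two style knobs. It is not Dragonbox's to_chars: the `Decimal` struct is a hypothetical stand-in for the significand/exponent/sign triple that to_decimal produces, and `SciStyle` and `format_scientific` are names made up for this example. It also assumes the significand carries no trailing zeros.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for a to_decimal result:
// value = (-1)^is_negative * significand * 10^exponent.
struct Decimal {
    std::uint64_t significand;
    int exponent;
    bool is_negative;
};

// Illustrative formatting knobs (assumptions, not part of Dragonbox).
struct SciStyle {
    char exponent_marker;      // 'E' or 'e'
    bool force_decimal_point;  // emit "1.0E0" instead of "1E0"
};

std::string format_scientific(const Decimal& d, SciStyle style) {
    const std::string digits = std::to_string(d.significand);
    // Exponent after moving the decimal point right behind the first digit.
    const int sci_exp = d.exponent + static_cast<int>(digits.size()) - 1;

    std::string out;
    if (d.is_negative) out += '-';
    out += digits[0];
    if (digits.size() > 1) {
        out += '.';
        out.append(digits, 1, std::string::npos);
    } else if (style.force_decimal_point) {
        out += ".0";
    }
    out += style.exponent_marker;
    out += std::to_string(sci_exp);
    return out;
}
```

With `Decimal{1, 0, false}`, the style `{'E', false}` gives the Ryu-like `1E0`, while `{'e', true}` gives `1.0e0`; a locale-specific decimal comma or a fixed-point form would need further options.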

I also want to say that writing your own to_chars is not a devastatingly difficult job, as long as your goal is not to deliver absolutely amazing performance. I can also help you if you want to write one. FYI, the implementation here is based on the idea explained here. A more refined analysis is done in the appendix of another post.

Nevertheless, it is very much welcome if anybody comes up with a generic mechanism for specifying the formatting details, and I would be even happier if anyone opens a PR for that, but right now I have no plan for doing so by myself.

@ecorm

ecorm commented Jan 6, 2023

@alugowski If it helps, the C++ fmt library uses the Dragonbox algorithm under the hood, and you can control its output exactly the way you want.

@jk-jeon
Owner

jk-jeon commented Jan 6, 2023

Ah right. I forgot to mention that. Thanks @ecorm.

@alugowski
Author

Thank you for the detailed explanation!

My question came from the sentence in the README that says the "output is of the shortest length". So I was expecting to see "1" instead of "1E0".

I actually found this project while looking for alternatives to std::to_chars. As I'm sure you're aware, compiler support for the floating-point versions of both std::from_chars and std::to_chars is spotty at best. AFAIK Visual Studio has it, but the current versions of GCC and Clang do not. The alternative for std::from_chars is the excellent fast_float library. They have implemented a fast parser and exposed it as something that implements the C++17 spec (in another namespace of course). Looks like the GCC 12 version of std::from_chars will simply be fast_float.

Since I need both writing and reading, I found Dragonbox as a usable analog of fast_float on the writing side. I don't have any ambitions of writing my own methods. fmt would work, though it's larger than my entire project for only this one method.

I don't know what your ambitions are, but I think if you're interested then both Clang and GCC could use your work.

Regarding your "everyone wants something else" point, that's true. The std::to_chars standard has options for everything you've mentioned. If you're interested I think it would be easy to implement on top of what you already have, and would make Dragonbox a proper substitute for the large number of folks looking for alternatives not for performance reasons but just for basic support.

https://en.cppreference.com/w/cpp/utility/to_chars
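
For reference, a small sketch of what the standard interface looks like for floating-point values. It only compiles against a standard library that actually ships the floating-point overloads of std::to_chars (C++17); the wrapper name `to_shortest` is made up for this example.

```cpp
#include <charconv>
#include <string>
#include <system_error>

// Plain overload: shortest round-trip output, fixed or scientific,
// whichever is shorter (fixed wins a tie).
std::string to_shortest(double x) {
    char buf[64];
    const auto r = std::to_chars(buf, buf + sizeof buf, x);
    return r.ec == std::errc{} ? std::string(buf, r.ptr) : std::string{};
}

// Format overload: still shortest round-trip digits, but the notation is forced.
std::string to_shortest(double x, std::chars_format fmt) {
    char buf[512];  // fixed notation of a large double can need 300+ characters
    const auto r = std::to_chars(buf, buf + sizeof buf, x, fmt);
    return r.ec == std::errc{} ? std::string(buf, r.ptr) : std::string{};
}
```

Here `to_shortest(1.0)` yields `1`, and `to_shortest(1.0, std::chars_format::scientific)` yields `1e+00`, since the scientific form always carries a signed exponent of at least two digits.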

Now don't get me started on long double support :P You're largely stuck with the C methods if you need that.

@jk-jeon
Owner

jk-jeon commented Jan 7, 2023

> My question came from the sentence in the README that says the "output is of the shortest length". So I was expecting to see "1" instead of "1E0".

In fact, if you look carefully at what it says:

  1. The output is of the shortest length; that is, no other output strings that are interpreted as the input number can contain less number of significand digits than the output of Dragonbox.

So the shortness is in terms of the number of (significand) digits, not in terms of the number of characters. I mean, it's confusing I admit, but the exact number of characters is not the most interesting detail from the point of view of developing a conversion algorithm.
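
The difference between "fewest significand digits" and "fewest characters" can be made concrete with a rough sketch. Given the same shortest digits, it renders both a fixed and a scientific string and keeps whichever has fewer characters, preferring fixed on a tie. The `Decimal` struct is a hypothetical stand-in for a to_decimal result, all function names are invented for this example, and the significand is assumed to have no trailing zeros.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a to_decimal result:
// value = (-1)^is_negative * significand * 10^exponent.
struct Decimal {
    std::uint64_t significand;
    int exponent;
    bool is_negative;
};

std::string format_scientific(const Decimal& d) {
    const std::string digits = std::to_string(d.significand);
    const int sci_exp = d.exponent + static_cast<int>(digits.size()) - 1;
    std::string out = d.is_negative ? "-" : "";
    out += digits[0];
    if (digits.size() > 1) {
        out += '.';
        out.append(digits, 1, std::string::npos);
    }
    out += 'E';
    out += std::to_string(sci_exp);
    return out;
}

std::string format_fixed(const Decimal& d) {
    if (d.significand == 0) return d.is_negative ? "-0" : "0";
    const std::string digits = std::to_string(d.significand);
    std::string out = d.is_negative ? "-" : "";
    // Number of digits that end up in front of the decimal point.
    const int point_pos = static_cast<int>(digits.size()) + d.exponent;
    if (d.exponent >= 0) {
        out += digits;
        out.append(static_cast<std::size_t>(d.exponent), '0');
    } else if (point_pos > 0) {
        out.append(digits, 0, point_pos);
        out += '.';
        out.append(digits, point_pos, std::string::npos);
    } else {
        out += "0.";
        out.append(static_cast<std::size_t>(-point_pos), '0');
        out += digits;
    }
    return out;
}

// Fewest characters wins; fixed-point is preferred on a tie.
std::string shortest_by_chars(const Decimal& d) {
    const std::string fixed = format_fixed(d);
    const std::string sci = format_scientific(d);
    return fixed.size() <= sci.size() ? fixed : sci;
}
```

For `Decimal{1, 0, false}` both candidates use one significand digit, but `1` has one character and `1E0` has three, so the fixed form is returned; for `Decimal{1, 30, false}` the scientific form `1E30` wins.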

> Since I need both writing and reading, I found Dragonbox as a usable analog of fast_float on the writing side. I don't have any ambitions of writing my own methods. fmt would work, though it's larger than my entire project for only this one method.

If fmt is too bulky, then there is nanofmt which also uses Dragonbox under the hood. I presume this one is probably lighter than fmt. Also there is Alexander Bolz' implementation (https://github.com/abolz/Drachennest) which IIRC produces prettier outputs. A small problem of these two is that their Dragonbox is a bit outdated because it has been improved since they copied the implementation from this repo. But that should not be a serious issue if your goal is not to win a competitive benchmark. (Also IIRC Alexander's implementation only supports double; but he has a Schubfach implementation for float instead.)

> I don't know what your ambitions are, but I think if you're interested then both Clang and GCC could use your work.

As far as I know, Dragonbox has been considered for libc++ implementation of std::to_chars, but eventually dropped in favor of Ryu, which already had a working adoption thanks to Mr. STL's hard work.

Adoption into the standard library is something I ultimately want, and this was an attempt to prepare for that. But I was way too ambitious and could not really afford the required amount of time and effort, so that project is "dead" at this point. I'm very slowly making some progress (e.g. developing this) though.

> Now don't get me started on long double support :P You're largely stuck with the C methods if you need that.

This is also something in my TODO list. May take long to be realized.

> Regarding your "everyone wants something else" point, that's true. The std::to_chars standard has options for everything you've mentioned.

IIRC std::to_chars is not that versatile. I don't think it has, e.g., an option for the mandatory trailing zero .0. But yeah, probably your point is that std::to_chars would be enough for many people who are not happy about the ugly 1E0.

> If you're interested I think it would be easy to implement on top of what you already have, and would make Dragonbox a proper substitute for the large number of folks looking for alternatives not for performance reasons but just for basic support.

Simply put, the reason why I'm sort of hesitant to write a more useful version of to_chars is that I'm not comfortable with providing a very suboptimal implementation in this "supposed to be fast" library. Optimizing and testing require a lot of effort and currently I don't have enough resources to put into them.

I would say this again: it's not super daunting to implement your own to_chars given that to_decimal already does all the hard work. A not-so-fast-but-working implementation might not take more than 80 lines, I guess. You may refer to https://en.cppreference.com/w/cpp/io/c/fprintf, the row for the general format (g and G), to get an idea of how to mimic that behavior. (It doesn't give the shortest output, I guess, but I think you may not need the absolutely shortest string either.)
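
A rough sketch of that idea, under stated assumptions: `Decimal` is a hypothetical stand-in for the to_decimal result, `format_general` is a made-up name, and the significand is assumed to have no trailing zeros. It borrows the decision rule of the g conversion (fixed notation when the scientific exponent X satisfies P > X >= -4, scientific otherwise), with P treated as a tunable threshold (17 here, an arbitrary choice) instead of a printf precision, since the digits are already the shortest ones.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a to_decimal result:
// value = (-1)^is_negative * significand * 10^exponent.
struct Decimal {
    std::uint64_t significand;
    int exponent;
    bool is_negative;
};

std::string format_general(const Decimal& d, int threshold = 17) {
    if (d.significand == 0) return d.is_negative ? "-0" : "0";
    const std::string digits = std::to_string(d.significand);
    const int n = static_cast<int>(digits.size());
    const int X = d.exponent + n - 1;  // exponent in scientific notation
    std::string out = d.is_negative ? "-" : "";

    if (X >= -4 && X < threshold) {
        // Fixed notation.
        if (X < 0) {
            out += "0.";
            out.append(static_cast<std::size_t>(-X - 1), '0');
            out += digits;
        } else if (X + 1 >= n) {
            out += digits;
            out.append(static_cast<std::size_t>(X + 1 - n), '0');
        } else {
            out.append(digits, 0, X + 1);
            out += '.';
            out.append(digits, X + 1, std::string::npos);
        }
    } else {
        // Scientific notation with a printf-like exponent: sign, at least two digits.
        out += digits[0];
        if (n > 1) {
            out += '.';
            out.append(digits, 1, std::string::npos);
        }
        out += 'e';
        out += (X < 0) ? '-' : '+';
        const int magnitude = X < 0 ? -X : X;
        if (magnitude < 10) out += '0';
        out += std::to_string(magnitude);
    }
    return out;
}
```

With this rule `Decimal{1, 0, false}` prints as `1`, `Decimal{1, -4, false}` as `0.0001`, and `Decimal{1, -5, false}` as `1e-05`. No attempt is made at performance; the point is only that the formatting layer is short once the digits are known.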

Well I don't know. I may write a shitty one if I get some time and post it here.

@jk-jeon
Owner

jk-jeon commented Jan 7, 2023

Hmm. I guess I somewhat sounded like a jerk 😅 Sorry about that.

I think your suggestion about providing an alternative interface doing std::to_chars-like formatting would be a nice addition. I'll consider that for the next release. Thanks for the input!

By the way, it won't be a drop-in replacement for std::to_chars, because this repo has no plan to support printf-style fixed-precision formatting or hexfloat formatting. Both have nothing to do with what Dragonbox does afaict, so they are out of scope, and the first one in particular is very difficult to do correctly.

alugowski added a commit to alugowski/fast_matrix_market that referenced this issue Jan 10, 2023
1.0 is rendered as '1E0' instead of just '1'. Drop such suffixes, as Dragonbox does not offer an option to do so.

See jk-jeon/dragonbox#35
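
A minimal sketch of that kind of post-processing, assuming the string has the `<digits>E<exponent>` shape described in this thread. This is not the code from the referenced commit, and `drop_e0_suffix` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Remove a trailing "E0" (or "e0") so that "1E0" becomes "1".
// Other exponents, such as "E10" or "E-10", are left untouched.
std::string drop_e0_suffix(std::string s) {
    const std::size_t n = s.size();
    if (n >= 2 && s[n - 1] == '0' && (s[n - 2] == 'E' || s[n - 2] == 'e')) {
        s.resize(n - 2);
    }
    return s;
}
```

For example, `drop_e0_suffix("1.5E0")` returns `1.5`, while `drop_e0_suffix("1E10")` returns the input unchanged.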
jk-jeon changed the title from "to_chars unnecessary 'E0'" to "Options for better-looking outputs of to_chars" on Jan 10, 2023

3 participants