Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

look into using Ryu instead of Errol3 for floating point printing #1299

Closed
andrewrk opened this issue Jul 28, 2018 · 21 comments · Fixed by #19229
Closed

look into using Ryu instead of Errol3 for floating point printing #1299

andrewrk opened this issue Jul 28, 2018 · 21 comments · Fixed by #19229
Labels
contributor friendly This issue is limited in scope and/or knowledge of Zig internals. standard library This issue involves writing Zig code for the standard library.
Milestone

Comments

@andrewrk
Copy link
Member

Bold claim:

simpler and approximately three times faster than the previously fastest implementation.

paper: https://dl.acm.org/citation.cfm?id=3192369
C implementation: https://github.com/ulfjack/ryu

things to measure:

  • performance
  • code size
  • if it's possible to work without ever using floating point registers
@andrewrk andrewrk added the standard library This issue involves writing Zig code for the standard library. label Jul 28, 2018
@andrewrk andrewrk added this to the 0.4.0 milestone Jul 28, 2018
@andrewrk
Copy link
Member Author

Note if porting the reference C implementation: we should port the -DRYU_OPTIMIZE_SIZE stuff by checking builtin.Mode.ReleaseSmall.

@tiehuis
Copy link
Member

tiehuis commented Jul 28, 2018

I'll likely do an initial implementation of this in the next few days.

@tiehuis
Copy link
Member

tiehuis commented Aug 1, 2018

Will be committing here for the moment. https://github.com/tiehuis/zig-ryu

Also, it would be nice to implement proper f16 printing support and f128 (#1181) alongside so I'll keep these in mind.

@tiehuis
Copy link
Member

tiehuis commented Aug 9, 2018

Quick update:

  • Preliminary support was added upstream for f128 and f80 values. I've got a local port nearly done.
  • f16 as well I've pretty much done as we can reuse f32 formatting.

This leaves the formatted printing of values remaining before I'd be happy to make a PR. There is an upstream issue for that as well: ulfjack/ryu#27.

@tiehuis
Copy link
Member

tiehuis commented Aug 10, 2018

f16 and f128 printing support is now present using the 128-bit conversion backend. We should also have proper c_longdouble (80-bit float) printing as well when #1188 is done.

@andrewrk andrewrk modified the milestones: 0.4.0, 0.5.0 Feb 15, 2019
@andrewrk
Copy link
Member Author

@tiehuis were you blocking on #1188 for this? I can prioritize that if you like.

@tiehuis
Copy link
Member

tiehuis commented Feb 15, 2019

Not really. Main requirement here is to put the time and effort in getting the formatted float output modes modes done for alignment/precision, scientific/decimal modes etc. It'll probably tie in with #1358 since they both touch a lot of the formatting code.

@tiehuis
Copy link
Member

tiehuis commented Mar 18, 2019

This is a response to #1290 (comment). Related moreso to this issue so bringing the discussion here.

Here are some actual numbers in regards to the size savings. First, we can construct two simple examples:

format-std.zig

const std = @import("std");

var buf: [32]u8 = undefined;

export fn print(value: f32) [*]const u8 {
    const slice = std.fmt.bufPrint(buf[0..], "{}", value) catch unreachable;
    return slice.ptr;
}

format-ryu.zig

const ryu = @import("index.zig");

var buf: [32]u8 = undefined;

export fn print(value: f32) [*]const u8 {
    const slice = ryu.floatToBuffer(value, buf[0..]);
    return slice.ptr;
}

Compiling both of these under --release-small with build-obj we get two object files. The output of readelf -s is as follows (omitting irrelevant noise).

format-std.o

Total size is 29163 bytes (excluding buf symbol).

Symbol table '.symtab' contains 483 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS format-std

   ...

   460: 0000000000000000    32 OBJECT  LOCAL  DEFAULT    6 buf
   461: 0000000000004668   200 OBJECT  LOCAL  DEFAULT    7 c_digits_lut
   462: 0000000000000000  3456 OBJECT  LOCAL  DEFAULT    7 enum3
   463: 0000000000000000 10368 OBJECT  LOCAL  DEFAULT    8 enum3_data
   464: 00000000000020e8  9600 OBJECT  LOCAL  DEFAULT    7 lookup_table
   465: 00000000000011f0  1788 FUNC    LOCAL  DEFAULT    2 std.fmt.errol.u64toa
   466: 00000000000018f0   168 FUNC    LOCAL  DEFAULT    2 std.math.frexp.frexp64

   ...

   479: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __fixunsdfti
   480: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __udivti3
   481: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __umodti3
   482: 0000000000000000  4583 FUNC    GLOBAL DEFAULT    2 print

format-ryu.o

Total size is 3007 bytes. Note also that ryu only requires integer operations while errol operates on floats directly.

Symbol table '.symtab' contains 19 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS format-ryu
     2: 0000000000000000   200 OBJECT  LOCAL  DEFAULT    5 DIGIT_TABLE

    ...

     5: 0000000000000000    32 OBJECT  LOCAL  DEFAULT    4 buf
     6: 00000000000000c8   248 OBJECT  LOCAL  DEFAULT    5 float_pow5_inv_split
     7: 00000000000001c0   376 OBJECT  LOCAL  DEFAULT    5 float_pow5_split
 
        ...

    18: 0000000000000000  2183 FUNC    GLOBAL DEFAULT    2 print

An important thing to note is that internally our errol implementation only works on f64 values. Even when printing an f32 we cast up to an f32 and back down once down. Likewise for f128, in which we lose precision during printing. The current ryu implementation has a 32, 64 and 128 bit backend for each of the float types which any narrower type can use without loss of precision.

That being said, the above example is slightly apples to oranges, since we are comparing a 32-bit printing backend to a 64-bit. Modifying our original examples we get the following:

format-std.o

Same size as previously. Slightly smaller by a few bytes since casting between f32 to f64 is now not done.

format-ryu.o

Total size is 4176 bytes. Slightly larger that the f32 but not a huge amount. The increase in size is not linear as the types increase since the precomputed values are different between the engines. Storing lots of tables is only required for increasing runtime performance. We can precompute smaller amounts and compute based on these instead of doing them all to keep the table sizes reasonable. This is done as per the original implementation.

Symbol table '.symtab' contains 50 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS format-ryu
     2: 0000000000000000   200 OBJECT  LOCAL  DEFAULT    5 DIGIT_TABLE

    ...

    31: 0000000000000000    32 OBJECT  LOCAL  DEFAULT    4 buf
    32: 0000000000000000   208 OBJECT  LOCAL  DEFAULT    7 double_pow5_inv_split2
    33: 00000000000000d0   208 OBJECT  LOCAL  DEFAULT    7 double_pow5_split2
    34: 00000000000000c8   208 OBJECT  LOCAL  DEFAULT    5 double_pow5_table
    35: 0000000000000198    80 OBJECT  LOCAL  DEFAULT    5 pow5_inv_offsets
    36: 00000000000001e8    52 OBJECT  LOCAL  DEFAULT    5 pow5_offsets

    ...

    49: 0000000000000000  3220 FUNC    GLOBAL DEFAULT    2 print

Summary

  • 5-10 times smaller table sizes
  • Accurate floating point printing
  • No floating point operations required (valuable for soft-float architectures)
  • 2-3 times fast the errol
  • Supports f16, f32 and f128 without messy casts to f64

@andrewrk andrewrk added the contributor friendly This issue is limited in scope and/or knowledge of Zig internals. label Mar 18, 2019
@andrewrk
Copy link
Member Author

Wow, huge improvements!

@tiehuis
Copy link
Member

tiehuis commented Mar 19, 2019

An additional note but there is some preliminary support now upstream for formatting to a specific precision with ryu. This is done for the f64 backend. This is a separate implementation distinct from the existing (which is focused on shortest representable unique output). I haven't yet looked at it in depth but this was the main remaining issue on the "backend" side of things.

@shawnl
Copy link
Contributor

shawnl commented Mar 22, 2019

parse_float.zig Is not round-trip accurate for all values

So does that need to be re-written to, to work specifically with ryu's output, so we can be round-trip accurate for all values (including subnormals)?

Note that the Ryu paper specifies that differn't rounding modes are available, so parse_float.zig must match the rounding mode of the ryu implementation.

@daurnimator
Copy link
Contributor

@tiehuis is there something ready to merge?

@tiehuis
Copy link
Member

tiehuis commented Dec 11, 2019

No. I haven't done anything on this since my last message here so it is still in the same state. You could technically merge right now (updating the code for latest zig), but you would only have the default format provided by ryu and would be missing precision and different notation options.

@marler8997
Copy link
Contributor

I ported the 32-bit version to D if anyone wants to see it for reference. It's all in one file in D (tests as well): https://github.com/dragon-lang/mar/blob/master/src/mar/ryu.d

@data-man
Copy link
Contributor

updating the code for latest zig

tiehuis/zig-ryu#3

@andrewrk andrewrk modified the milestones: 0.6.0, 0.7.0 Jan 3, 2020
@tiehuis
Copy link
Member

tiehuis commented Jan 18, 2020

I've implemented the fixed precision modes for scientific/fixed in a new branch: https://github.com/tiehuis/zig-ryu/tree/dev

I'll likely make a PR soon once I finish up some comparisons and do some testing. Do note that we need more tables to handle this formatting so I'll need to benchmark the increase compared to just preferring shortest unique.

Also, since I only have a 64-bit variant for these, the plan is to upcast f16 and f32 to f64 for these modes. This is fine since it will preserve accuracy, however we will need to downcast f128 until a dedicated implementation is added. I don't consider this a blocker since errol currently does this now so it remains in line at least with the status quo.

@data-man
Copy link
Contributor

Just for info: dragonbox was merged to fmtlib.

@andrewrk andrewrk modified the milestones: 0.7.0, 0.8.0 Oct 10, 2020
@andrewrk andrewrk modified the milestones: 0.8.0, 0.9.0 Nov 6, 2020
@andrewrk andrewrk modified the milestones: 0.9.0, 0.10.0 May 19, 2021
ehaas added a commit to ehaas/zig that referenced this issue Apr 3, 2022
I consider this an interim workaround/hack until ziglang#1299 is finished.

There is a bug in the original C implementation of the errol3 (and errol4)
algorithm that can result in undefined behavior or an obviously incorrect
result (leading ':' in the output)

This change checks for those two problems and uses a slower fallback
path if they occur. I can't guarantee that this will always produce
the correct result, but since the workaround is only used if the original
algorithm is guaranteed to fail, it should never turn a previously-correct
result into an incorrect one.

Fixes ziglang#11283
andrewrk pushed a commit that referenced this issue Apr 4, 2022
I consider this an interim workaround/hack until #1299 is finished.

There is a bug in the original C implementation of the errol3 (and errol4)
algorithm that can result in undefined behavior or an obviously incorrect
result (leading ':' in the output)

This change checks for those two problems and uses a slower fallback
path if they occur. I can't guarantee that this will always produce
the correct result, but since the workaround is only used if the original
algorithm is guaranteed to fail, it should never turn a previously-correct
result into an incorrect one.

Fixes #11283
ikrima pushed a commit to ikrima/zig that referenced this issue Apr 14, 2022
I consider this an interim workaround/hack until ziglang#1299 is finished.

There is a bug in the original C implementation of the errol3 (and errol4)
algorithm that can result in undefined behavior or an obviously incorrect
result (leading ':' in the output)

This change checks for those two problems and uses a slower fallback
path if they occur. I can't guarantee that this will always produce
the correct result, but since the workaround is only used if the original
algorithm is guaranteed to fail, it should never turn a previously-correct
result into an incorrect one.

Fixes ziglang#11283
@frmdstryr
Copy link
Contributor

frmdstryr commented Aug 3, 2022

@tiehuis Thanks for sharing your port.

I moved everything into a single folder ( tiehuis/zig-ryu@master...frmdstryr:zig-ryu:single-folder ) updated the tests and used it to format a f32 on a cortex-m4f (@192mhz).

With ryu:

const t = mcu.system.time(.s);
var buf: [20]u8 = undefined;
const result = ryu(f32).printFixed(&buf, t, 1);
try lcd.writeLine(3, result);

Size is ~116k (a lot larger) but it only takes about 9us to format!

   text    data     bss     dec     hex filename
 115828      20     152  116000   1c520 firmware.elf

With zig's current fmt

const t = mcu.system.time(.s);
var buf: [20]u8 = undefined;
const result = std.fmt.bufPrint(&buf, "{d:.1}", .{t}) catch unreachable;
try lcd.writeLine(3, result);

Size is ~46k (a lot smaller) but it takes about ~120us to format the time. (Oddly for small numbers it is almost 500us while ryu seems to be faster in that case).

   text    data     bss     dec     hex filename
  46124      20     152   46296    b4d8 firmware.elf

Both output the correct value and are compiled with release small.

It is unfortunate that the tables are so large... as that makes it unusable for a lot of mcus with smaller flash sizes... for PC's the size is negligible.

Edit: There also seems to be no difference in speed when I disable the FPU with ryu. With std.fmt its about 10% slower.

@andrewrk
Copy link
Member Author

andrewrk commented Aug 3, 2022

I wonder if table values could be generated lazily for -OReleaseSmall builds 🤔

@tiehuis
Copy link
Member

tiehuis commented Aug 3, 2022

The tables for the fixed printing variant are pretty large. Something to note as well is the fixed print for f32 up-casts internally to f64 so uses larger tables than is strictly required. There isn't an upstream f32 fixed print however so I didn't implement one in my original code.

As for smaller tables or lazy computation, I do know that the computation of accurate values for these tables uses BigInteger's and needs higher precision than native values so the performance hit could be significant. I am unsure however if you could compute partial tables and get a better trade-off.

On your edit, Ryu doesn't use floating point values internally to compute the string so this matches my expectation.

@jk-jeon
Copy link

jk-jeon commented Jan 24, 2023

It is in fact possible to do fixed-precision formatting with a much smaller table: https://jk-jeon.github.io/posts/2022/12/fixed-precision-formatting/

For small enough precisions I think it will be faster than Ryu-printf. Actually, the algorithm operates in different ways for small precisions and large precisions, and the portion for the small precisions is recently adapted into fmtlib (fmtlib/fmt#3269). That portion reuses the table used for the shortest roundtrip formatting (which for the case of Dragonbox is of size 9~10KB, or 500~600 bytes with a bit of runtime performance cost).

Large precisions case (which should be very rare in practice) still can be done with an additional table of size 580 bytes with 2x-3x slower performance than Ryu-printf, or with a table of size 3~4 KB with similar performance to Ryu-printf, or something in between those two, according to the benchmark I've done.

@andrewrk andrewrk modified the milestones: 0.15.0, 0.12.0 Mar 12, 2024
RossComputerGuy pushed a commit to ExpidusOS-archive/zig that referenced this issue Mar 20, 2024
This replaces the errol backend with one based on ryu. The 128-bit
backend only is implemented. This supports all floating-point types and
does not use fp logic to print.

Closes ziglang#1181.
Closes ziglang#1299.
Closes ziglang#3612.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contributor friendly This issue is limited in scope and/or knowledge of Zig internals. standard library This issue involves writing Zig code for the standard library.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants