look into using Ryu instead of Errol3 for floating point printing #1299

andrewrk · 2018-07-28T20:12:46Z

Bold claim:

simpler and approximately three times faster than the previously fastest implementation.

paper: https://dl.acm.org/citation.cfm?id=3192369
C implementation: https://github.com/ulfjack/ryu

things to measure:

performance
code size
if it's possible to work without ever using floating point registers

andrewrk · 2018-07-28T20:15:33Z

Note if porting the reference C implementation: we should port the -DRYU_OPTIMIZE_SIZE stuff by checking builtin.Mode.ReleaseSmall.

tiehuis · 2018-07-28T23:28:00Z

I'll likely do an initial implementation of this in the next few days.

tiehuis · 2018-08-01T10:16:06Z

Will be committing here for the moment. https://github.com/tiehuis/zig-ryu

Also, it would be nice to implement proper f16 printing support and f128 (#1181) alongside so I'll keep these in mind.

tiehuis · 2018-08-09T07:03:25Z

Quick update:

Preliminary support was added upstream for f128 and f80 values. I've got a local port nearly done.
f16 as well I've pretty much done as we can reuse f32 formatting.

This leaves the formatted printing of values remaining before I'd be happy to make a PR. There is an upstream issue for that as well: ulfjack/ryu#27.

tiehuis · 2018-08-10T12:09:25Z

f16 and f128 printing support is now present using the 128-bit conversion backend. We should also have proper c_longdouble (80-bit float) printing as well when #1188 is done.

andrewrk · 2019-02-15T06:42:34Z

@tiehuis were you blocking on #1188 for this? I can prioritize that if you like.

tiehuis · 2019-02-15T07:40:41Z

Not really. Main requirement here is to put the time and effort in getting the formatted float output modes modes done for alignment/precision, scientific/decimal modes etc. It'll probably tie in with #1358 since they both touch a lot of the formatting code.

tiehuis · 2019-03-18T22:01:37Z

This is a response to #1290 (comment). Related moreso to this issue so bringing the discussion here.

Here are some actual numbers in regards to the size savings. First, we can construct two simple examples:

format-std.zig

const std = @import("std");

var buf: [32]u8 = undefined;

export fn print(value: f32) [*]const u8 {
    const slice = std.fmt.bufPrint(buf[0..], "{}", value) catch unreachable;
    return slice.ptr;
}

format-ryu.zig

const ryu = @import("index.zig");

var buf: [32]u8 = undefined;

export fn print(value: f32) [*]const u8 {
    const slice = ryu.floatToBuffer(value, buf[0..]);
    return slice.ptr;
}

Compiling both of these under --release-small with build-obj we get two object files. The output of readelf -s is as follows (omitting irrelevant noise).

format-std.o

Total size is 29163 bytes (excluding buf symbol).

Symbol table '.symtab' contains 483 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS format-std

   ...

   460: 0000000000000000    32 OBJECT  LOCAL  DEFAULT    6 buf
   461: 0000000000004668   200 OBJECT  LOCAL  DEFAULT    7 c_digits_lut
   462: 0000000000000000  3456 OBJECT  LOCAL  DEFAULT    7 enum3
   463: 0000000000000000 10368 OBJECT  LOCAL  DEFAULT    8 enum3_data
   464: 00000000000020e8  9600 OBJECT  LOCAL  DEFAULT    7 lookup_table
   465: 00000000000011f0  1788 FUNC    LOCAL  DEFAULT    2 std.fmt.errol.u64toa
   466: 00000000000018f0   168 FUNC    LOCAL  DEFAULT    2 std.math.frexp.frexp64

   ...

   479: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __fixunsdfti
   480: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __udivti3
   481: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __umodti3
   482: 0000000000000000  4583 FUNC    GLOBAL DEFAULT    2 print

format-ryu.o

Total size is 3007 bytes. Note also that ryu only requires integer operations while errol operates on floats directly.

Symbol table '.symtab' contains 19 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS format-ryu
     2: 0000000000000000   200 OBJECT  LOCAL  DEFAULT    5 DIGIT_TABLE

    ...

     5: 0000000000000000    32 OBJECT  LOCAL  DEFAULT    4 buf
     6: 00000000000000c8   248 OBJECT  LOCAL  DEFAULT    5 float_pow5_inv_split
     7: 00000000000001c0   376 OBJECT  LOCAL  DEFAULT    5 float_pow5_split
 
        ...

    18: 0000000000000000  2183 FUNC    GLOBAL DEFAULT    2 print

An important thing to note is that internally our errol implementation only works on f64 values. Even when printing an f32 we cast up to an f32 and back down once down. Likewise for f128, in which we lose precision during printing. The current ryu implementation has a 32, 64 and 128 bit backend for each of the float types which any narrower type can use without loss of precision.

That being said, the above example is slightly apples to oranges, since we are comparing a 32-bit printing backend to a 64-bit. Modifying our original examples we get the following:

format-std.o

Same size as previously. Slightly smaller by a few bytes since casting between f32 to f64 is now not done.

format-ryu.o

Total size is 4176 bytes. Slightly larger that the f32 but not a huge amount. The increase in size is not linear as the types increase since the precomputed values are different between the engines. Storing lots of tables is only required for increasing runtime performance. We can precompute smaller amounts and compute based on these instead of doing them all to keep the table sizes reasonable. This is done as per the original implementation.

Symbol table '.symtab' contains 50 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS format-ryu
     2: 0000000000000000   200 OBJECT  LOCAL  DEFAULT    5 DIGIT_TABLE

    ...

    31: 0000000000000000    32 OBJECT  LOCAL  DEFAULT    4 buf
    32: 0000000000000000   208 OBJECT  LOCAL  DEFAULT    7 double_pow5_inv_split2
    33: 00000000000000d0   208 OBJECT  LOCAL  DEFAULT    7 double_pow5_split2
    34: 00000000000000c8   208 OBJECT  LOCAL  DEFAULT    5 double_pow5_table
    35: 0000000000000198    80 OBJECT  LOCAL  DEFAULT    5 pow5_inv_offsets
    36: 00000000000001e8    52 OBJECT  LOCAL  DEFAULT    5 pow5_offsets

    ...

    49: 0000000000000000  3220 FUNC    GLOBAL DEFAULT    2 print

Summary

5-10 times smaller table sizes
Accurate floating point printing
No floating point operations required (valuable for soft-float architectures)
2-3 times fast the errol
Supports f16, f32 and f128 without messy casts to f64

andrewrk · 2019-03-18T22:10:53Z

Wow, huge improvements!

tiehuis · 2019-03-19T02:22:46Z

An additional note but there is some preliminary support now upstream for formatting to a specific precision with ryu. This is done for the f64 backend. This is a separate implementation distinct from the existing (which is focused on shortest representable unique output). I haven't yet looked at it in depth but this was the main remaining issue on the "backend" side of things.

shawnl · 2019-03-22T00:46:53Z

parse_float.zig Is not round-trip accurate for all values

So does that need to be re-written to, to work specifically with ryu's output, so we can be round-trip accurate for all values (including subnormals)?

Note that the Ryu paper specifies that differn't rounding modes are available, so parse_float.zig must match the rounding mode of the ryu implementation.

daurnimator · 2019-12-11T14:17:13Z

@tiehuis is there something ready to merge?

tiehuis · 2019-12-11T19:43:54Z

No. I haven't done anything on this since my last message here so it is still in the same state. You could technically merge right now (updating the code for latest zig), but you would only have the default format provided by ryu and would be missing precision and different notation options.

marler8997 · 2019-12-12T05:56:25Z

I ported the 32-bit version to D if anyone wants to see it for reference. It's all in one file in D (tests as well): https://github.com/dragon-lang/mar/blob/master/src/mar/ryu.d

data-man · 2019-12-18T22:03:44Z

updating the code for latest zig

tiehuis/zig-ryu#3

tiehuis · 2020-01-18T09:25:45Z

I've implemented the fixed precision modes for scientific/fixed in a new branch: https://github.com/tiehuis/zig-ryu/tree/dev

I'll likely make a PR soon once I finish up some comparisons and do some testing. Do note that we need more tables to handle this formatting so I'll need to benchmark the increase compared to just preferring shortest unique.

Also, since I only have a 64-bit variant for these, the plan is to upcast f16 and f32 to f64 for these modes. This is fine since it will preserve accuracy, however we will need to downcast f128 until a dedicated implementation is added. I don't consider this a blocker since errol currently does this now so it remains in line at least with the status quo.

data-man · 2020-09-22T13:41:20Z

Just for info: dragonbox was merged to fmtlib.

I consider this an interim workaround/hack until ziglang#1299 is finished. There is a bug in the original C implementation of the errol3 (and errol4) algorithm that can result in undefined behavior or an obviously incorrect result (leading ':' in the output) This change checks for those two problems and uses a slower fallback path if they occur. I can't guarantee that this will always produce the correct result, but since the workaround is only used if the original algorithm is guaranteed to fail, it should never turn a previously-correct result into an incorrect one. Fixes ziglang#11283

I consider this an interim workaround/hack until #1299 is finished. There is a bug in the original C implementation of the errol3 (and errol4) algorithm that can result in undefined behavior or an obviously incorrect result (leading ':' in the output) This change checks for those two problems and uses a slower fallback path if they occur. I can't guarantee that this will always produce the correct result, but since the workaround is only used if the original algorithm is guaranteed to fail, it should never turn a previously-correct result into an incorrect one. Fixes #11283

I consider this an interim workaround/hack until ziglang#1299 is finished. There is a bug in the original C implementation of the errol3 (and errol4) algorithm that can result in undefined behavior or an obviously incorrect result (leading ':' in the output) This change checks for those two problems and uses a slower fallback path if they occur. I can't guarantee that this will always produce the correct result, but since the workaround is only used if the original algorithm is guaranteed to fail, it should never turn a previously-correct result into an incorrect one. Fixes ziglang#11283

frmdstryr · 2022-08-03T17:34:19Z

@tiehuis Thanks for sharing your port.

I moved everything into a single folder ( tiehuis/zig-ryu@master...frmdstryr:zig-ryu:single-folder ) updated the tests and used it to format a f32 on a cortex-m4f (@192mhz).

With ryu:

const t = mcu.system.time(.s);
var buf: [20]u8 = undefined;
const result = ryu(f32).printFixed(&buf, t, 1);
try lcd.writeLine(3, result);

Size is ~116k (a lot larger) but it only takes about 9us to format!

   text    data     bss     dec     hex filename
 115828      20     152  116000   1c520 firmware.elf

With zig's current fmt

const t = mcu.system.time(.s);
var buf: [20]u8 = undefined;
const result = std.fmt.bufPrint(&buf, "{d:.1}", .{t}) catch unreachable;
try lcd.writeLine(3, result);

Size is ~46k (a lot smaller) but it takes about ~120us to format the time. (Oddly for small numbers it is almost 500us while ryu seems to be faster in that case).

   text    data     bss     dec     hex filename
  46124      20     152   46296    b4d8 firmware.elf

Both output the correct value and are compiled with release small.

It is unfortunate that the tables are so large... as that makes it unusable for a lot of mcus with smaller flash sizes... for PC's the size is negligible.

Edit: There also seems to be no difference in speed when I disable the FPU with ryu. With std.fmt its about 10% slower.

andrewrk · 2022-08-03T21:09:14Z

I wonder if table values could be generated lazily for -OReleaseSmall builds 🤔

tiehuis · 2022-08-03T23:50:56Z

The tables for the fixed printing variant are pretty large. Something to note as well is the fixed print for f32 up-casts internally to f64 so uses larger tables than is strictly required. There isn't an upstream f32 fixed print however so I didn't implement one in my original code.

As for smaller tables or lazy computation, I do know that the computation of accurate values for these tables uses BigInteger's and needs higher precision than native values so the performance hit could be significant. I am unsure however if you could compute partial tables and get a better trade-off.

On your edit, Ryu doesn't use floating point values internally to compute the string so this matches my expectation.

jk-jeon · 2023-01-24T06:36:40Z

It is in fact possible to do fixed-precision formatting with a much smaller table: https://jk-jeon.github.io/posts/2022/12/fixed-precision-formatting/

For small enough precisions I think it will be faster than Ryu-printf. Actually, the algorithm operates in different ways for small precisions and large precisions, and the portion for the small precisions is recently adapted into fmtlib (fmtlib/fmt#3269). That portion reuses the table used for the shortest roundtrip formatting (which for the case of Dragonbox is of size 9~10KB, or 500~600 bytes with a bit of runtime performance cost).

Large precisions case (which should be very rare in practice) still can be done with an additional table of size 580 bytes with 2x-3x slower performance than Ryu-printf, or with a table of size 3~4 KB with similar performance to Ryu-printf, or something in between those two, according to the benchmark I've done.

This replaces the errol backend with one based on ryu. The 128-bit backend only is implemented. This supports all floating-point types and does not use fp logic to print. Closes ziglang#1181. Closes ziglang#1299. Closes ziglang#3612.

andrewrk added the standard library This issue involves writing Zig code for the standard library. label Jul 28, 2018

andrewrk added this to the 0.4.0 milestone Jul 28, 2018

andrewrk modified the milestones: 0.4.0, 0.5.0 Feb 15, 2019

andrewrk mentioned this issue Mar 18, 2019

Missing compiler_rt functions #1290

Closed

andrewrk added the contributor friendly This issue is limited in scope and/or knowledge of Zig internals. label Mar 18, 2019

andrewrk mentioned this issue Jun 27, 2019

unnecessary usage of soft float library routines causing linking errors #2771

Closed

andrewrk modified the milestones: 0.5.0, 0.6.0 Sep 20, 2019

LakeByTheWoods mentioned this issue Nov 16, 2019

Possible bug: f128 constant only has 16 decimals of precision when printed #3612

Closed

andrewrk modified the milestones: 0.6.0, 0.7.0 Jan 3, 2020

LemonBoy mentioned this issue Feb 23, 2020

Format decimal 0.0 with no precision as just 0 #4520

Merged

andrewrk modified the milestones: 0.7.0, 0.8.0 Oct 10, 2020

andrewrk modified the milestones: 0.8.0, 0.9.0 Nov 6, 2020

andrewrk modified the milestones: 0.9.0, 0.10.0 May 19, 2021

andrewrk mentioned this issue Mar 24, 2022

printing float value causes fmt to panic #11283

Closed

ehaas mentioned this issue Apr 3, 2022

std.fmt: Fix incorrect behavior with large floating point integers. #11380

Merged

Vexu mentioned this issue Jul 9, 2022

fstat + fmtIntSizeBin causes huge substantial .rodata section #12061

Closed

batiati mentioned this issue Aug 4, 2022

Decimal Interpolation batiati/mustache-zig#7

Open

tiehuis mentioned this issue Mar 9, 2024

std.fmt: add ryu floating-point formatting implementation #19229

Merged

andrewrk closed this as completed in c6ad551 Mar 12, 2024

andrewrk closed this as completed in #19229 Mar 12, 2024

andrewrk modified the milestones: 0.15.0, 0.12.0 Mar 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

look into using Ryu instead of Errol3 for floating point printing #1299

look into using Ryu instead of Errol3 for floating point printing #1299

andrewrk commented Jul 28, 2018

andrewrk commented Jul 28, 2018

tiehuis commented Jul 28, 2018

tiehuis commented Aug 1, 2018

tiehuis commented Aug 9, 2018

tiehuis commented Aug 10, 2018

andrewrk commented Feb 15, 2019

tiehuis commented Feb 15, 2019

tiehuis commented Mar 18, 2019

andrewrk commented Mar 18, 2019

tiehuis commented Mar 19, 2019

shawnl commented Mar 22, 2019 •

edited

Loading

daurnimator commented Dec 11, 2019

tiehuis commented Dec 11, 2019 •

edited

Loading

marler8997 commented Dec 12, 2019

data-man commented Dec 18, 2019

tiehuis commented Jan 18, 2020 •

edited

Loading

data-man commented Sep 22, 2020

frmdstryr commented Aug 3, 2022 •

edited

Loading

andrewrk commented Aug 3, 2022

tiehuis commented Aug 3, 2022

jk-jeon commented Jan 24, 2023

look into using Ryu instead of Errol3 for floating point printing #1299

look into using Ryu instead of Errol3 for floating point printing #1299

Comments

andrewrk commented Jul 28, 2018

andrewrk commented Jul 28, 2018

tiehuis commented Jul 28, 2018

tiehuis commented Aug 1, 2018

tiehuis commented Aug 9, 2018

tiehuis commented Aug 10, 2018

andrewrk commented Feb 15, 2019

tiehuis commented Feb 15, 2019

tiehuis commented Mar 18, 2019

format-std.zig

format-ryu.zig

format-std.o

format-ryu.o

format-std.o

format-ryu.o

Summary

andrewrk commented Mar 18, 2019

tiehuis commented Mar 19, 2019

shawnl commented Mar 22, 2019 • edited Loading

daurnimator commented Dec 11, 2019

tiehuis commented Dec 11, 2019 • edited Loading

marler8997 commented Dec 12, 2019

data-man commented Dec 18, 2019

tiehuis commented Jan 18, 2020 • edited Loading

data-man commented Sep 22, 2020

frmdstryr commented Aug 3, 2022 • edited Loading

andrewrk commented Aug 3, 2022

tiehuis commented Aug 3, 2022

jk-jeon commented Jan 24, 2023

shawnl commented Mar 22, 2019 •

edited

Loading

tiehuis commented Dec 11, 2019 •

edited

Loading

tiehuis commented Jan 18, 2020 •

edited

Loading

frmdstryr commented Aug 3, 2022 •

edited

Loading