
std.json: Unify stringify and writeStream #16405

Merged
thejoshwolfe merged 24 commits into master from json-write-stream on Jul 21, 2023

Conversation

@thejoshwolfe (Contributor) commented Jul 14, 2023

Inspired by this comment:

zig/src/Autodoc.zig

Lines 946 to 953 in 3ec3374

try std.json.stringify(@field(self, case.name), opts, w);
jsw.state_index -= 1;
// TODO: we should not reach into the state of the
// json writer, but alas, this is what's
// necessary with the current api.
// would be nice to have a proper integration
// between the json writer and the generic
// std.json.stringify implementation

and my own experience adding a fifth copy-paste of code that looks like this:

                try out_stream.writeByte(':');
                if (child_options.whitespace.separator) {
                    try out_stream.writeByte(' ');
                }

I figured it was time to implement stringify() using writeStream(), and that's what this PR does. The user-supplied pub fn jsonStringify method that plugs into this mechanism now takes a json.WriteStream instead of a regular out_stream plus options. This makes implementations of custom stringifiers dramatically simpler:

// Before:
        pub fn jsonStringify(self: @This(), options: StringifyOptions, out_stream: anytype) !void {
            try out_stream.writeByte('{');
            var field_output = false;
            var child_options = options;
            child_options.whitespace.indent_level += 1;
            var it = self.map.iterator();
            while (it.next()) |kv| {
                if (!field_output) {
                    field_output = true;
                } else {
                    try out_stream.writeByte(',');
                }
                try child_options.whitespace.outputIndent(out_stream);
                try encodeJsonString(kv.key_ptr.*, options, out_stream);
                try out_stream.writeByte(':');
                if (child_options.whitespace.separator) {
                    try out_stream.writeByte(' ');
                }
                try stringify(kv.value_ptr.*, child_options, out_stream);
            }
            if (field_output) {
                try options.whitespace.outputIndent(out_stream);
            }
            try out_stream.writeByte('}');
        }

// After:
        pub fn jsonStringify(self: @This(), jws: anytype) !void {
            try jws.beginObject();
            var it = self.map.iterator();
            while (it.next()) |kv| {
                try jws.objectField(kv.key_ptr.*);
                try jws.write(kv.value_ptr.*);
            }
            try jws.endObject();
        }

This PR almost entirely rewrites json.WriteStream.

The new API does away with the finely specified arrayElem, emitNumber, emitString, etc. Instead, you call objectField() for object keys and write(x) for every other value. You still call beginArray/endArray for arrays and beginObject/endObject for objects, and write(x) handles everything that was previously supported only by the stringify() entrypoint, such as structs and unions. There is a grammar in the WriteStream docs that specifies how you must call these functions.
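
For example, here is a minimal sketch of driving the new API by hand, following that grammar (the writeStream() entry point and its options argument are my assumption of how the helper is invoked; the method names are the ones described above):

const std = @import("std");

pub fn main() !void {
    var buffer = std.ArrayList(u8).init(std.heap.page_allocator);
    defer buffer.deinit();

    // Assumed entry point: default safety checks, no allocator required.
    var jws = std.json.writeStream(buffer.writer(), .{ .whitespace = .indent_2 });
    defer jws.deinit();

    try jws.beginObject();
    try jws.objectField("name");
    try jws.write("std.json"); // strings, numbers, structs, unions, ... all go through write()
    try jws.objectField("depths");
    try jws.beginArray();
    try jws.write(1);
    try jws.write(2);
    try jws.endArray();
    try jws.endObject();

    std.debug.print("{s}\n", .{buffer.items});
}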

This comment is also realized in this PR:

/// TODO A future iteration of this API will allow passing `null` for this value,
/// and disable safety checks in release builds.

That said, the design has some questionable details. There are two different functions to call depending on whether you want the safety, which is primarily because you only need an allocator for the safe version. Should this instead be implemented by checking the build mode or something? Please advise.

Breaking changes to StringifyOptions

The structure of StringifyOptions is overhauled to be more flat.

  • escape_unicode moved to the top level (it applies to object keys and is still relevant in strings-as-arrays mode, so it needs to be accessible outside the .String string mode; see the sketch after this list). TODO: rename to ensure_ascii, like the Python option; since ASCII is a subset of Unicode, the name escape_unicode isn't precisely what it's doing. Also TODO: treat 0x7f as a non-ASCII character, which would slightly change the behavior.
  • escape_solidus removed. I'm pretty sure this option exists in Zig only because the JSON spec allows / to be escaped, which in turn exists so that server-generated JSON inside <script> tags can escape <\/script>; that is very far removed from anything Zig cares about. A proper solution for escaping dangerous web characters would look a lot more like Go's implementation, which escapes <, >, & and some obscure whitespace characters instead of /. For now, this can be done as a post-processing pass over the string, and if we want a more performant single-pass solution integrated with std.json, we can add that later.
  • string = .Array replaced by .emit_strings_as_arrays = true. It's unclear why we even have this option, but I didn't have a compelling reason to remove it.
  • whitespace.indent_level removed. The internal bookkeeping is now done with a field inside the WriteStream rather than a mutable field in the options object. Technically, this removes the ability for a caller to add extra indentation levels; that could easily be reintroduced, but I'd like to see a use case before we add it back. (And if we do add it back, should it be a slice instead of a number?)
  • whitespace.separator and whitespace.indent combined into .whitespace = .minified vs .whitespace = .indent_2 etc. I'm guessing we don't need to support a use case with indentation but no separator after the :, or a separator after the : but no indentation; it seems like you either want minified or pretty-printed output. Please let me know if there's another use case I'm not considering. We've got 1, 2, 3, 4, and 8 space indentation and 1 tab indentation. Do we need anything else? This is probably all we need, but it's easy to add more.
  • emit_null_optional_fields is unchanged. ... However, I think we should change the behavior to omit optional fields that are null only when their default value is actually null. That's TODO for another time though.
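
Taken together, the flattened options read roughly like this (a sketch using the field names listed above; the helper function and writer wiring are illustrative):

const std = @import("std");

// Passing the reworked options to stringify; every knob is now a top-level field.
fn dumpJson(value: anytype, writer: anytype) !void {
    try std.json.stringify(value, .{
        .escape_unicode = true, // now top level, so it also applies to object keys
        .emit_strings_as_arrays = false, // replaces string = .Array
        .emit_null_optional_fields = true, // unchanged
        .whitespace = .indent_4, // replaces whitespace.indent + whitespace.separator
    }, writer);
}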

The default whitespace formatting across the various APIs is now unified and consistent. Previously, the default StringifyOptions{} meant minified whitespace, but StringifyOptions{ .whitespace = .{} } gave 4-space pretty printing, and WriteStream defaulted to 1-space pretty printing. Now the default everywhere is minified output, and switching to pretty printing is as simple as changing .{} to .{ .whitespace = .indent_2 }.
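
A small sketch of the unified default (the exact expected output is my reading of the description above, not copied from the test suite):

const std = @import("std");

test "minified by default, one option for pretty printing" {
    const Point = struct { x: i32, y: i32 };
    const p: Point = .{ .x = 1, .y = 2 };

    var out = std.ArrayList(u8).init(std.testing.allocator);
    defer out.deinit();

    // Default StringifyOptions{}: minified, no spaces or newlines.
    try std.json.stringify(p, .{}, out.writer());
    try std.testing.expectEqualStrings("{\"x\":1,\"y\":2}", out.items);

    // Pretty printing is one field away.
    out.clearRetainingCapacity();
    try std.json.stringify(p, .{ .whitespace = .indent_2 }, out.writer());
}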

@16hournaps

I am basically completely uninvolved with Zig development, BUT I use JSON a lot and I agree with you. When working with C structs it is sometimes desirable to output some fields manually via emitNumber/emitString etc. and to just json.stringify the rest. I would only say that merging objectField and the emitXXXXXX calls into one call is confusing, since it does not clearly differentiate (for the user) which call supplies the field name and which supplies the field value. Maybe these should stay different. But other than that, thumbs up for this one!

@thejoshwolfe marked this pull request as ready for review on July 16, 2023 at 23:39
Comment on lines -930 to -939
if (self.int.negated) try w.writeAll("-");
try jsw.emitNumber(self.int.value);
@thejoshwolfe (Contributor, Author) Jul 16, 2023

I'm pretty sure this was broken, because the old emitNumber() would render large integers in quotes, resulting in -"12345678901234567890". The new code will render it as "-12345678901234567890".

@andrewrk (Member)

There are two different functions to call depending on whether you want the safety, which is primarily because you only need an allocator for the safe version. Should this instead be implemented by checking the build mode or something? Please advise.

The null hypothesis is to implement something with minimal resource usage, and a debug version that requires an allocator and tracks more state to provide debugging tools is extra. So I suggest that the unqualified function name be the one that does not have an allocator parameter, since it's not actually needed.

If you can provide assertions without an allocator parameter then you can base it on the optimization mode and not require the user to make a choice.
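
For illustration, a minimal sketch of keying such a default off the build mode (this is my own sketch, not code from this PR; default_max_depth is a hypothetical name):

const builtin = @import("builtin");

// Keep the nesting-depth assertions in the safe build modes and skip the
// bookkeeping entirely in the unsafe release modes.
const default_max_depth: ?usize = switch (builtin.mode) {
    .Debug, .ReleaseSafe => 256,
    .ReleaseFast, .ReleaseSmall => null,
};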

"safe" and "unsafe" are not really the concepts we use in Zig. Instead, the concepts are:

Therefore I think stringifyChecked would be a better name than stringifyUnsafe.

The use case of writing JSON text with a compile-time known upper bound nesting depth, with assertions to prevent mistakes, is regressed in this PR since it requires an allocator now.

In summary, there is not really a way to ask for an allocator that is "for safety checks", and these changes are problematic because, as you suspected, there is no simple way to automatically do the correct thing for each build mode. So, very likely, in practice this change would result in many users generating invalid JSON (via the unchecked function) with the Zig standard library and thinking Zig is a silly language that can't even figure out text processing, while another set of users (using the checked function) would notice that Zig's JSON processing is slower than its competitors and think Zig is a silly language that can't even do simple text processing at high performance.

@thejoshwolfe (Contributor, Author)

@andrewrk I believe I've addressed all your concerns. The default now does safety checks up to 256 nesting levels (which I believe has about the same memory footprint as an empty ArrayList(T), so it is reasonably small), stringifyMaxDepth() lets you change the depth or pass null to disable the safety checks entirely, and stringifyArbitraryDepth() takes an allocator and provides safety at any depth. And similarly for writeStream*().
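
A sketch of how the three entry points line up (the parameter order is my approximation of the API described here, not copied from the source):

const std = @import("std");

fn demo(gpa: std.mem.Allocator, value: anytype, out: anytype) !void {
    // Default: safety checks up to 256 nesting levels, no allocator needed.
    try std.json.stringify(value, .{}, out);

    // Raise the checked depth, or pass null to drop the safety checks entirely.
    try std.json.stringifyMaxDepth(value, .{}, out, 1024);
    try std.json.stringifyMaxDepth(value, .{}, out, null);

    // Allocator-backed bookkeeping: safety checks at any nesting depth.
    try std.json.stringifyArbitraryDepth(gpa, value, .{}, out);
}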

Perhaps the only thing missing is switching the 256 depth to null in certain build modes. Would that be idiomatic to do?

@thejoshwolfe (Contributor, Author) commented Jul 18, 2023

I don't know how to debug the CI failures. I tried attaching a debugger, and it looks like it might be in __zig_probe_stack assembly?

I'll try bisecting the problem in my code.

@andrewrk force-pushed the json-write-stream branch from ccee113 to fd34cbe on July 20, 2023 at 23:27
@andrewrk (Member) commented Jul 20, 2023

I rebased this on your behalf and then force pushed. If you're getting "segmentation fault" rather than a stack trace with a debug build, it very likely means a stack overflow in the compiler. I'm guessing you added some comptime checks that are triggering #13724.

Another clue is -femit-docs, which could indicate a bug in Autodoc. Inserting these two log statements (sketched after the list) would likely lead directly to a fix:

  • log decl name and file name when analyzing it
  • log when entering and exiting autodoc logic
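
For instance, debug logging of roughly this shape (the function, message text, and variable names are hypothetical placeholders, not actual Autodoc code):

const std = @import("std");

// Placeholder standing in for Autodoc.analyzeDecl / walkInstruction; the point
// is just where the suggested log lines would go.
fn walkSomething(decl_name: []const u8, file_name: []const u8) void {
    std.log.debug("autodoc: analyzing decl '{s}' in '{s}'", .{ decl_name, file_name });
    std.log.debug("autodoc: entering walk logic", .{});
    defer std.log.debug("autodoc: leaving walk logic", .{});
    // ... the real analysis would happen here ...
}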

Edit: gdb could not handle the stack overflow but lldb did, and it is indeed a stack overflow:

    frame #592: 0x0000000005aa4ebd zig`Autodoc.walkInstruction(self=0x00007ffffffeb7d0, file=0x00007fffd801b770, parent_scope=0x00007ffffff939e8, parent_src=(line = 0, bytes = 0, src_node = 0), inst_index=0, need_type=true) at Autodoc.zig:3119:59
    frame #593: 0x0000000005a8d777 zig`Autodoc.walkInstruction(self=0x00007ffffffeb7d0, file=0x000000000ec0e3d0, parent_scope=0x00007ffffffdcf20, parent_src=(line = 75, bytes = 3963, src_node = 245), inst_index=263, need_type=true) at Autodoc.zig:1110:40
    frame #594: 0x0000000005cf5f54 zig`Autodoc.walkRef(self=0x00007ffffffeb7d0, file=0x000000000ec0e3d0, parent_scope=0x00007ffffffdcf20, parent_src=(line = 75, bytes = 3963, src_node = 245), ref=347, need_type=true) at Autodoc.zig:4827:36
    frame #595: 0x0000000005a9c3fa zig`Autodoc.walkInstruction(self=0x00007ffffffeb7d0, file=0x000000000ec0e3d0, parent_scope=0x00007ffffffdcf20, parent_src=(line = 75, bytes = 3963, src_node = 245), inst_index=262, need_type=true) at Autodoc.zig:2386:32
    frame #596: 0x0000000005f7f9a3 zig`Autodoc.analyzeDecl(self=0x00007ffffffeb7d0, file=0x000000000ec0e3d0, scope=0x00007ffffffdcf20, parent_src=(line = 0, bytes = 0, src_node = 0), decl_indexes=0x00007ffffffca1b8, priv_decl_indexes=0x00007ffffffca1d0, d=Zir.DeclIterator.Item @ 0x00007ffffffc6960) at Autodoc.zig:3500:49
    frame #597: 0x0000000005d01668 zig`Autodoc.analyzeAllDecls(self=0x00007ffffffeb7d0, file=0x000000000ec0e3d0, scope=0x00007ffffffdcf20, parent_inst_index=0, parent_src=(line = 0, bytes = 0, src_node = 0), decl_indexes=0x00007ffffffca1b8, priv_decl_indexes=0x00007ffffffca1d0) at Autodoc.zig:3383:29
    frame #598: 0x0000000005aa4ebd zig`Autodoc.walkInstruction(self=0x00007ffffffeb7d0, file=0x000000000ec0e3d0, parent_scope=0x00007ffffffe5e48, parent_src=(line = 0, bytes = 0, src_node = 0), inst_index=0, need_type=false) at Autodoc.zig:3119:59
    frame #599: 0x0000000005a89690 zig`Autodoc.generateZirData(self=0x00007ffffffeb7d0) at Autodoc.zig:331:33
    frame #600: 0x0000000005ab19b4 zig`Compilation.update(comp=0x000000000ebfcb90, main_progress_node=0x00007ffffffebc40) at Compilation.zig:2081:44
    frame #601: 0x0000000005ae017f zig`main.updateModule(comp=0x000000000ebfcb90) at main.zig:3870:24
    frame #602: 0x0000000005b0fc1c zig`main.buildOutputType(gpa=mem.Allocator @ 0x000000000de73638, arena=mem.Allocator @ 0x00007fffffff9ea0, all_args=[][]const u8 @ 0x00007ffffffed8c0, arg_mode=main.ArgMode @ 0x00000000002540e6) at main.zig:3305:17
    frame #603: 0x0000000005982145 zig`main.mainArgs(gpa=mem.Allocator @ 0x000000000de73638, arena=mem.Allocator @ 0x00007fffffff9ea0, args=[][]const u8 @ 0x00007fffffff9840) at main.zig:275:31
    frame #604: 0x000000000597f2c6 zig`main.main at main.zig:213:20
    frame #605: 0x000000000597ed1f zig`main [inlined] start.callMain at start.zig:608:37
    frame #606: 0x000000000597ed13 zig`main [inlined] start.initEventLoopAndCallMain at start.zig:542:5
    frame #607: 0x000000000597ed13 zig`main at start.zig:492:36
    frame #608: 0x000000000597ecb6 zig`main(c_argc=5, c_argv=0x00007fffffffa278, c_envp=0x00007fffffffa2a8) at start.zig:507:101
    frame #609: 0x00007ffff7ce024e libc.so.6`__libc_start_call_main + 126
    frame #610: 0x00007ffff7ce0309 libc.so.6`__libc_start_main@@GLIBC_2.34 + 137
    frame #611: 0x000000000597e925 zig`_start + 37

Above this (toward the top of the stack), the same frames repeat until the stack overflows.

@ianprime0509 (Contributor) commented Jul 21, 2023

If it's any help, the following stripped down example reproduces the same issue:

pub fn writeStreamMaxDepth(out_stream: anytype, comptime max_depth: ?usize) WriteStream(
    @TypeOf(out_stream),
    if (max_depth) |d| .{ .checked_to_fixed_depth = d } else .assumed_correct,
) {
    return .{};
}

pub fn WriteStream(
    comptime OutStream: type,
    comptime safety_checks: union(enum) {
        checked_to_arbitrary_depth,
        checked_to_fixed_depth: usize,
        assumed_correct,
    },
) type {
    _ = OutStream;
    _ = safety_checks;
    return struct {};
}

The part confusing Autodoc seems to be related to the inferred type of the if expression in the second parameter to the WriteStream call. The following example works unless the explicit @as(SafetyChecks, ...) is removed:

pub fn writeStreamMaxDepth(out_stream: anytype, comptime max_depth: ?usize) WriteStream(
    @TypeOf(out_stream),
    @as(SafetyChecks, if (max_depth) |d| .{ .checked_to_fixed_depth = d } else .assumed_correct),
) {
    return .{};
}

pub const SafetyChecks = union(enum) {
    checked_to_arbitrary_depth,
    checked_to_fixed_depth: usize,
    assumed_correct,
};

pub fn WriteStream(
    comptime OutStream: type,
    comptime safety_checks: SafetyChecks,
) type {
    _ = OutStream;
    _ = safety_checks;
    return struct {};
}

Unfortunately this is as far as my understanding goes. I don't really understand why the as_nodes in %14 of the ZIR are pointing back to %8 as the destination type when %8 seems to be the call to WriteStream itself rather than the expected type of the parameter (SafetyChecks). This seems to be the source of the infinite recursion in Autodoc (as to walk the destination type ref %8, it needs to walk the args, which eventually leads it to an as_node(%8, ...), causing it to walk %8 again, etc.). This is also why the explicit @as version above works fine.

      %8 = call(.compile_time, %6, [
        {
          %9 = typeof_builtin({
            %10 = break_inline(%9, %2)
          }) node_offset:2:5 to :2:24
          %11 = break_inline(%8, %9)
        },
        {
          %14 = block({
            %12 = is_non_null(%5) node_offset:3:9 to :3:18
            %13 = condbr(%12, {
              %15 = optional_payload_unsafe(%5) node_offset:3:24 to :3:56
              %16 = field_type(%8, checked_to_fixed_depth) node_offset:3:53 to :3:54
              %17 = as_node(%16, %15) node_offset:3:53 to :3:54
              %18 = struct_init_anon([checked_to_fixed_depth=%17]) node_offset:3:24 to :3:56
              %19 = as_node(%8, %18) node_offset:3:24 to :3:56
              %22 = break(%14, %19)
            }, {
              %20 = enum_literal("assumed_correct") token_offset:3:63 to :3:78
              %21 = as_node(%8, %20) node_offset:3:62 to :3:78
              %23 = break(%14, %21)
            }) node_offset:3:5 to :3:78
          }) node_offset:3:5 to :3:78
          %24 = as_node(%8, %14) node_offset:3:5 to :3:78
          %25 = break_inline(%8, %24)
        },
      ]) node_offset:1:77 to :1:89

@kristoff-it (Member)

Rebase the branch onto the latest master; I've fixed the infinite loop in 426e737.

@thejoshwolfe (Contributor, Author)

Thanks for your help @ianprime0509 and @kristoff-it ! ❤️

@andrewrk (Member) left a comment

something about making a safety check null

Yeah it would be idiomatic to avoid safety checks in the unsafe release modes, and of course it is always idiomatic to not store dead memory. However, that is fine to leave as a follow-up enhancement.

Nice work!

I request you please either rebase the commits and clean up the messages, or squash the whole branch upon merging.

@thejoshwolfe merged commit 8924f81 into master on Jul 21, 2023
@thejoshwolfe deleted the json-write-stream branch on July 21, 2023 at 23:56
@thejoshwolfe (Contributor, Author)

I've added the breaking changes summary to my 0.11 std.json upgrade guide here: https://gist.github.com/thejoshwolfe/0642d7f42485d4d632a22ffe05dd6f29
