
std.json: Unify stringify and writeStream #16405

Merged
thejoshwolfe merged 24 commits into master from json-write-stream on Jul 21, 2023

Conversation

@thejoshwolfe (Contributor) commented Jul 14, 2023

Inspired by this comment:

zig/src/Autodoc.zig

Lines 946 to 953 in 3ec3374

try std.json.stringify(@field(self, case.name), opts, w);
jsw.state_index -= 1;
// TODO: we should not reach into the state of the
// json writer, but alas, this is what's
// necessary with the current api.
// would be nice to have a proper integration
// between the json writer and the generic
// std.json.stringify implementation

and my own experience adding a fifth copy-paste of code that looks like this:

                try out_stream.writeByte(':');
                if (child_options.whitespace.separator) {
                    try out_stream.writeByte(' ');
                }

I figured it was time to implement stringify() using writeStream(), and that's what this PR does. The user-supplied pub fn jsonStringify method that plugs into this mechanism now takes a json.WriteStream instead of a regular out_stream plus options. This makes implementations of custom stringifiers dramatically simpler:

// Before:
        pub fn jsonStringify(self: @This(), options: StringifyOptions, out_stream: anytype) !void {
            try out_stream.writeByte('{');
            var field_output = false;
            var child_options = options;
            child_options.whitespace.indent_level += 1;
            var it = self.map.iterator();
            while (it.next()) |kv| {
                if (!field_output) {
                    field_output = true;
                } else {
                    try out_stream.writeByte(',');
                }
                try child_options.whitespace.outputIndent(out_stream);
                try encodeJsonString(kv.key_ptr.*, options, out_stream);
                try out_stream.writeByte(':');
                if (child_options.whitespace.separator) {
                    try out_stream.writeByte(' ');
                }
                try stringify(kv.value_ptr.*, child_options, out_stream);
            }
            if (field_output) {
                try options.whitespace.outputIndent(out_stream);
            }
            try out_stream.writeByte('}');
        }

// After:
        pub fn jsonStringify(self: @This(), jws: anytype) !void {
            try jws.beginObject();
            var it = self.map.iterator();
            while (it.next()) |kv| {
                try jws.objectField(kv.key_ptr.*);
                try jws.write(kv.value_ptr.*);
            }
            try jws.endObject();
        }

This PR almost entirely rewrites json.WriteStream.

The new API does away with the finely specified arrayElem, emitNumber, emitString, etc. Instead, you call objectField() for object keys and write(x) for every other value. You still call beginArray/endArray for arrays and beginObject/endObject for objects, and write(x) handles everything that was previously supported only by the stringify() entrypoint, such as structs and unions. There is a grammar in the WriteStream docs that specifies how you must call these functions.
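
For example, here is a minimal sketch of driving the new API by hand, following that grammar (the writeStream() entry point and its options argument are my assumption of how the helper is invoked; the method names are the ones described above):

const std = @import("std");

pub fn main() !void {
    var buffer = std.ArrayList(u8).init(std.heap.page_allocator);
    defer buffer.deinit();

    // Assumed entry point: default safety checks, no allocator required.
    var jws = std.json.writeStream(buffer.writer(), .{ .whitespace = .indent_2 });
    defer jws.deinit();

    try jws.beginObject();
    try jws.objectField("name");
    try jws.write("std.json"); // strings, numbers, structs, unions, ... all go through write()
    try jws.objectField("depths");
    try jws.beginArray();
    try jws.write(1);
    try jws.write(2);
    try jws.endArray();
    try jws.endObject();

    std.debug.print("{s}\n", .{buffer.items});
}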

This comment is also realized in this PR:

/// TODO A future iteration of this API will allow passing `null` for this value,
/// and disable safety checks in release builds.

That said, the design has some questionable details. There are two different functions to call depending on whether you want the safety, which is primarily because you only need an allocator for the safe version. Should this instead be implemented by checking the build mode or something? Please advise.

Breaking changes to StringifyOptions

The structure of StringifyOptions is overhauled to be more flat.

  • escape_unicode moved to the top level (it applies to object keys and is still relevant in strings-as-arrays mode, so it needs to be accessible outside the .String string mode; see the sketch after this list). TODO: rename to ensure_ascii, like the Python option; since ASCII is a subset of Unicode, the name escape_unicode isn't precisely what it's doing. Also TODO: treat 0x7f as a non-ASCII character, which would slightly change the behavior.
  • escape_solidus removed. I'm pretty sure this option exists in Zig only because the JSON spec allows / to be escaped, which in turn exists so that server-generated JSON inside <script> tags can escape <\/script>; that is very far removed from anything Zig cares about. A proper solution for escaping dangerous web characters would look a lot more like Go's implementation, which escapes <, >, & and some obscure whitespace characters instead of /. For now, this can be done as a post-processing pass over the string, and if we want a more performant single-pass solution integrated with std.json, we can add that later.
  • string = .Array replaced by .emit_strings_as_arrays = true. It's unclear why we even have this option, but I didn't have a compelling reason to remove it.
  • whitespace.indent_level removed. The internal bookkeeping is now done with a field inside the WriteStream rather than a mutable field in the options object. Technically, this removes the ability for a caller to add extra indentation levels; that could easily be reintroduced, but I'd like to see a use case before we add it back. (And if we do add it back, should it be a slice instead of a number?)
  • whitespace.separator and whitespace.indent combined into .whitespace = .minified vs .whitespace = .indent_2 etc. I'm guessing we don't need to support a use case with indentation but no separator after the :, or a separator after the : but no indentation; it seems like you either want minified or pretty-printed output. Please let me know if there's another use case I'm not considering. We've got 1, 2, 3, 4, and 8 space indentation and 1 tab indentation. Do we need anything else? This is probably all we need, but it's easy to add more.
  • emit_null_optional_fields is unchanged. ... However, I think we should change the behavior to omit optional fields that are null only when their default value is actually null. That's TODO for another time though.
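
Taken together, the flattened options read roughly like this (a sketch using the field names listed above; the helper function and writer wiring are illustrative):

const std = @import("std");

// Passing the reworked options to stringify; every knob is now a top-level field.
fn dumpJson(value: anytype, writer: anytype) !void {
    try std.json.stringify(value, .{
        .escape_unicode = true, // now top level, so it also applies to object keys
        .emit_strings_as_arrays = false, // replaces string = .Array
        .emit_null_optional_fields = true, // unchanged
        .whitespace = .indent_4, // replaces whitespace.indent + whitespace.separator
    }, writer);
}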

The default whitespace formatting across the various APIs is now unified and consistent. Previously, the default StringifyOptions{} meant minified whitespace, but StringifyOptions{ .whitespace = .{} } gave 4-space pretty printing, and WriteStream defaulted to 1-space pretty printing. Now the default everywhere is minified output, and switching to pretty printing is as simple as changing .{} to .{ .whitespace = .indent_2 }.
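
A small sketch of the unified default (the exact expected output is my reading of the description above, not copied from the test suite):

const std = @import("std");

test "minified by default, one option for pretty printing" {
    const Point = struct { x: i32, y: i32 };
    const p: Point = .{ .x = 1, .y = 2 };

    var out = std.ArrayList(u8).init(std.testing.allocator);
    defer out.deinit();

    // Default StringifyOptions{}: minified, no spaces or newlines.
    try std.json.stringify(p, .{}, out.writer());
    try std.testing.expectEqualStrings("{\"x\":1,\"y\":2}", out.items);

    // Pretty printing is one field away.
    out.clearRetainingCapacity();
    try std.json.stringify(p, .{ .whitespace = .indent_2 }, out.writer());
}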

@16hournaps

I am basically completely uninvolved with Zig development, BUT I use JSON a lot and I agree with you. When working with C structs it is sometimes desirable to output some fields manually via emitNumber/emitString etc. and to just json.stringify the rest. I would only say that merging objectField and the emitXXXXXX calls into one call is confusing, since it does not clearly differentiate (for the user) which call supplies the field name and which supplies the field value. Maybe these should stay different. But other than that, thumbs up for this one!

@thejoshwolfe marked this pull request as ready for review on July 16, 2023 at 23:39
Comment on lines -930 to -939
if (self.int.negated) try w.writeAll("-");
try jsw.emitNumber(self.int.value);
@thejoshwolfe (Contributor, Author) Jul 16, 2023

I'm pretty sure this was broken, because the old emitNumber() would render large integers in quotes, resulting in -"12345678901234567890". The new code will render it as "-12345678901234567890".

@andrewrk (Member)

There are two different functions to call depending on whether you want the safety, which is primarily because you only need an allocator for the safe version. Should this instead be implemented by checking the build mode or something? Please advise.

The null hypothesis is to implement something with minimal resource usage, and a debug version that requires an allocator and tracks more state to provide debugging tools is extra. So I suggest that the unqualified function name be the one that does not have an allocator parameter, since it's not actually needed.

If you can provide assertions without an allocator parameter then you can base it on the optimization mode and not require the user to make a choice.
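
For illustration, a minimal sketch of keying such a default off the build mode (this is my own sketch, not code from this PR; default_max_depth is a hypothetical name):

const builtin = @import("builtin");

// Keep the nesting-depth assertions in the safe build modes and skip the
// bookkeeping entirely in the unsafe release modes.
const default_max_depth: ?usize = switch (builtin.mode) {
    .Debug, .ReleaseSafe => 256,
    .ReleaseFast, .ReleaseSmall => null,
};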

"safe" and "unsafe" are not really the concepts we use in Zig. Instead, the concepts are:

Therefore I think stringifyChecked would be a better name than stringifyUnsafe.

The use case of writing JSON text with a compile-time known upper bound nesting depth, with assertions to prevent mistakes, is regressed in this PR since it requires an allocator now.

In summary, there is not really a way to ask for an allocator that is "for safety checks", and these changes are problematic because, as you suspected, there is no simple way to automatically do the correct thing for each build mode. So, very likely, in practice this change would result in many users generating invalid JSON (via the unchecked function) with the Zig standard library and thinking Zig is a silly language that can't even figure out text processing, while another set of users (using the checked function) would notice that Zig's JSON processing is slower than its competitors and think Zig is a silly language that can't even do simple text processing at high performance.

@thejoshwolfe (Contributor, Author)

@andrewrk I believe I've addressed all your concerns. The default now does safety checks up to 256 nesting levels (which I believe has about the same memory footprint as an empty ArrayList(T), so it is reasonably small), stringifyMaxDepth() lets you change the depth or pass null to disable the safety checks entirely, and stringifyArbitraryDepth() takes an allocator and provides safety at any depth. And similarly for writeStream*().
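
A sketch of how the three entry points line up (the parameter order is my approximation of the API described here, not copied from the source):

const std = @import("std");

fn demo(gpa: std.mem.Allocator, value: anytype, out: anytype) !void {
    // Default: safety checks up to 256 nesting levels, no allocator needed.
    try std.json.stringify(value, .{}, out);

    // Raise the checked depth, or pass null to drop the safety checks entirely.
    try std.json.stringifyMaxDepth(value, .{}, out, 1024);
    try std.json.stringifyMaxDepth(value, .{}, out, null);

    // Allocator-backed bookkeeping: safety checks at any nesting depth.
    try std.json.stringifyArbitraryDepth(gpa, value, .{}, out);
}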

Perhaps the only thing missing is switching the 256 depth to null in certain build modes. Would that be idiomatic to do?

@thejoshwolfe (Contributor, Author) commented Jul 18, 2023

I don't know how to debug the CI failures. I tried attaching a debugger, and it looks like it might be in __zig_probe_stack assembly?

I'll try bisecting the problem in my code.

@andrewrk force-pushed the json-write-stream branch from ccee113 to fd34cbe on July 20, 2023 at 23:27
@andrewrk (Member) commented Jul 20, 2023

I rebased this on your behalf and then force pushed. If you're getting "segmentation fault" rather than a stack trace with a debug build, it very likely means a stack overflow in the compiler. I'm guessing you added some comptime checks that are triggering #13724.

Another clue is -femit-docs, which could indicate a bug in Autodoc. Inserting these two log statements (sketched after the list) would likely lead directly to a fix:

  • log decl name and file name when analyzing it
  • log when entering and exiting autodoc logic
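
For instance, debug logging of roughly this shape (the function, message text, and variable names are hypothetical placeholders, not actual Autodoc code):

const std = @import("std");

// Placeholder standing in for Autodoc.analyzeDecl / walkInstruction; the point
// is just where the suggested log lines would go.
fn walkSomething(decl_name: []const u8, file_name: []const u8) void {
    std.log.debug("autodoc: analyzing decl '{s}' in '{s}'", .{ decl_name, file_name });
    std.log.debug("autodoc: entering walk logic", .{});
    defer std.log.debug("autodoc: leaving walk logic", .{});
    // ... the real analysis would happen here ...
}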

Edit: gdb could not handle the stack overflow but lldb did, and it is indeed a stack overflow:

    frame #592: 0x0000000005aa4ebd zig`Autodoc.walkInstruction(self=0x00007ffffffeb7d0, file=0x00007fffd801b770, parent_scope=0x00007ffffff939e8, parent_src=(line = 0, bytes = 0, src_node = 0), inst_index=0, need_type=true) at Autodoc.zig:3119:59
    frame #593: 0x0000000005a8d777 zig`Autodoc.walkInstruction(self=0x00007ffffffeb7d0, file=0x000000000ec0e3d0, parent_scope=0x00007ffffffdcf20, parent_src=(line = 75, bytes = 3963, src_node = 245), inst_index=263, need_type=true) at Autodoc.zig:1110:40
    frame #594: 0x0000000005cf5f54 zig`Autodoc.walkRef(self=0x00007ffffffeb7d0, file=0x000000000ec0e3d0, parent_scope=0x00007ffffffdcf20, parent_src=(line = 75, bytes = 3963, src_node = 245), ref=347, need_type=true) at Autodoc.zig:4827:36
    frame #595: 0x0000000005a9c3fa zig`Autodoc.walkInstruction(self=0x00007ffffffeb7d0, file=0x000000000ec0e3d0, parent_scope=0x00007ffffffdcf20, parent_src=(line = 75, bytes = 3963, src_node = 245), inst_index=262, need_type=true) at Autodoc.zig:2386:32
    frame #596: 0x0000000005f7f9a3 zig`Autodoc.analyzeDecl(self=0x00007ffffffeb7d0, file=0x000000000ec0e3d0, scope=0x00007ffffffdcf20, parent_src=(line = 0, bytes = 0, src_node = 0), decl_indexes=0x00007ffffffca1b8, priv_decl_indexes=0x00007ffffffca1d0, d=Zir.DeclIterator.Item @ 0x00007ffffffc6960) at Autodoc.zig:3500:49
    frame #597: 0x0000000005d01668 zig`Autodoc.analyzeAllDecls(self=0x00007ffffffeb7d0, file=0x000000000ec0e3d0, scope=0x00007ffffffdcf20, parent_inst_index=0, parent_src=(line = 0, bytes = 0, src_node = 0), decl_indexes=0x00007ffffffca1b8, priv_decl_indexes=0x00007ffffffca1d0) at Autodoc.zig:3383:29
    frame #598: 0x0000000005aa4ebd zig`Autodoc.walkInstruction(self=0x00007ffffffeb7d0, file=0x000000000ec0e3d0, parent_scope=0x00007ffffffe5e48, parent_src=(line = 0, bytes = 0, src_node = 0), inst_index=0, need_type=false) at Autodoc.zig:3119:59
    frame #599: 0x0000000005a89690 zig`Autodoc.generateZirData(self=0x00007ffffffeb7d0) at Autodoc.zig:331:33
    frame #600: 0x0000000005ab19b4 zig`Compilation.update(comp=0x000000000ebfcb90, main_progress_node=0x00007ffffffebc40) at Compilation.zig:2081:44
    frame #601: 0x0000000005ae017f zig`main.updateModule(comp=0x000000000ebfcb90) at main.zig:3870:24
    frame #602: 0x0000000005b0fc1c zig`main.buildOutputType(gpa=mem.Allocator @ 0x000000000de73638, arena=mem.Allocator @ 0x00007fffffff9ea0, all_args=[][]const u8 @ 0x00007ffffffed8c0, arg_mode=main.ArgMode @ 0x00000000002540e6) at main.zig:3305:17
    frame #603: 0x0000000005982145 zig`main.mainArgs(gpa=mem.Allocator @ 0x000000000de73638, arena=mem.Allocator @ 0x00007fffffff9ea0, args=[][]const u8 @ 0x00007fffffff9840) at main.zig:275:31
    frame #604: 0x000000000597f2c6 zig`main.main at main.zig:213:20
    frame #605: 0x000000000597ed1f zig`main [inlined] start.callMain at start.zig:608:37
    frame #606: 0x000000000597ed13 zig`main [inlined] start.initEventLoopAndCallMain at start.zig:542:5
    frame #607: 0x000000000597ed13 zig`main at start.zig:492:36
    frame #608: 0x000000000597ecb6 zig`main(c_argc=5, c_argv=0x00007fffffffa278, c_envp=0x00007fffffffa2a8) at start.zig:507:101
    frame #609: 0x00007ffff7ce024e libc.so.6`__libc_start_call_main + 126
    frame #610: 0x00007ffff7ce0309 libc.so.6`__libc_start_main@@GLIBC_2.34 + 137
    frame #611: 0x000000000597e925 zig`_start + 37

Above this (toward the top of the stack), the same frames repeat until the stack overflows.

@ianprime0509 (Contributor) commented Jul 21, 2023

If it's any help, the following stripped down example reproduces the same issue:

pub fn writeStreamMaxDepth(out_stream: anytype, comptime max_depth: ?usize) WriteStream(
    @TypeOf(out_stream),
    if (max_depth) |d| .{ .checked_to_fixed_depth = d } else .assumed_correct,
) {
    return .{};
}

pub fn WriteStream(
    comptime OutStream: type,
    comptime safety_checks: union(enum) {
        checked_to_arbitrary_depth,
        checked_to_fixed_depth: usize,
        assumed_correct,
    },
) type {
    _ = OutStream;
    _ = safety_checks;
    return struct {};
}

The part confusing Autodoc seems to be related to the inferred type of the if expression in the second parameter to the WriteStream call. The following example works unless the explicit @as(SafetyChecks, ...) is removed:

pub fn writeStreamMaxDepth(out_stream: anytype, comptime max_depth: ?usize) WriteStream(
    @TypeOf(out_stream),
    @as(SafetyChecks, if (max_depth) |d| .{ .checked_to_fixed_depth = d } else .assumed_correct),
) {
    return .{};
}

pub const SafetyChecks = union(enum) {
    checked_to_arbitrary_depth,
    checked_to_fixed_depth: usize,
    assumed_correct,
};

pub fn WriteStream(
    comptime OutStream: type,
    comptime safety_checks: SafetyChecks,
) type {
    _ = OutStream;
    _ = safety_checks;
    return struct {};
}

Unfortunately this is as far as my understanding goes. I don't really understand why the as_nodes in %14 of the ZIR are pointing back to %8 as the destination type when %8 seems to be the call to WriteStream itself rather than the expected type of the parameter (SafetyChecks). This seems to be the source of the infinite recursion in Autodoc (as to walk the destination type ref %8, it needs to walk the args, which eventually leads it to an as_node(%8, ...), causing it to walk %8 again, etc.). This is also why the explicit @as version above works fine.

      %8 = call(.compile_time, %6, [
        {
          %9 = typeof_builtin({
            %10 = break_inline(%9, %2)
          }) node_offset:2:5 to :2:24
          %11 = break_inline(%8, %9)
        },
        {
          %14 = block({
            %12 = is_non_null(%5) node_offset:3:9 to :3:18
            %13 = condbr(%12, {
              %15 = optional_payload_unsafe(%5) node_offset:3:24 to :3:56
              %16 = field_type(%8, checked_to_fixed_depth) node_offset:3:53 to :3:54
              %17 = as_node(%16, %15) node_offset:3:53 to :3:54
              %18 = struct_init_anon([checked_to_fixed_depth=%17]) node_offset:3:24 to :3:56
              %19 = as_node(%8, %18) node_offset:3:24 to :3:56
              %22 = break(%14, %19)
            }, {
              %20 = enum_literal("assumed_correct") token_offset:3:63 to :3:78
              %21 = as_node(%8, %20) node_offset:3:62 to :3:78
              %23 = break(%14, %21)
            }) node_offset:3:5 to :3:78
          }) node_offset:3:5 to :3:78
          %24 = as_node(%8, %14) node_offset:3:5 to :3:78
          %25 = break_inline(%8, %24)
        },
      ]) node_offset:1:77 to :1:89

@kristoff-it (Member)

Rebase the branch onto the latest master; I've fixed the infinite loop in 426e737.

@thejoshwolfe (Contributor, Author)

Thanks for your help @ianprime0509 and @kristoff-it ! ❤️

@andrewrk (Member) left a comment

something about making a safety check null

Yeah it would be idiomatic to avoid safety checks in the unsafe release modes, and of course it is always idiomatic to not store dead memory. However, that is fine to leave as a follow-up enhancement.

Nice work!

I request you please either rebase the commits and clean up the messages, or squash the whole branch upon merging.

@thejoshwolfe merged commit 8924f81 into master on Jul 21, 2023
@thejoshwolfe deleted the json-write-stream branch on July 21, 2023 at 23:56
@thejoshwolfe (Contributor, Author)

I've added the breaking changes summary to my 0.11 std.json upgrade guide here: https://gist.github.com/thejoshwolfe/0642d7f42485d4d632a22ffe05dd6f29
