Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Complete diff-match-patch port (closes #25) #26

Open
wants to merge 182 commits into
base: main
Choose a base branch
from

Conversation

mnemnion
Copy link
Contributor

This branch has a completed port of the match and patch algorithms, with the necessary test coverage.

It actually doesn't include the delta code, because I think the delta format sucks and it's worth improving.

Another major change is Unicode awareness: this branch will not cut into Unicode codepoints. This may conceivably come with some performance penalty for ASCII code, if so, the logic is isolated to a few functions and can be feature-gated accordingly. I expect the damage to be minor, perhaps unnoticeable, once I finish performance tuning the algorithms.

A diff which truncates code point sequences is not inherently incorrect, in that it will still apply properly and give a correct before-and-after value. It can result in patches which are invalid UTF-8, which is undesirable, but the patchAddContext logic could prevent this at the boundary.

However, correct code point handling is essential for a change I haven't yet implemented: making the diffLineMode function work with a useful number of lines. A reasonable median for number of lines in a text file is 1000 (made up number), and the current diffLineMode gives up after a round one-quarter of those lines, meaning that the effort spent iterating and hashing may, or may not, be made up for by the amount it speeds up a diff. Line mode pays off more for longer files, so this is worth correcting.

Because this library is UTF-8 native, we don't have to take a shortcut of ignoring surrogate pairs, and therefore, the approach I intend to take here will allow for 1,112,034 lines of text to be compared. Engineering a test which will actually trigger the "give up" condition is going to be difficult enough that I suggest that we comptime-limit the amount of lines severely in test mode, in order to cover that case.

I've focused on compatibility with the existing library, as the useful default for a port, but I don't think we should let that hold us back here. For the (notional) sake of compatibility, perhaps some of the functionality I'm interested in improving could be offered in a compatibility mode, but there are several least-common-denominator aspects to the design which I'd like to liberate this library from. I expect very few uses will cross over into some other language, and it's not difficult to provide for that use case while offering a better alternative for the Zig ecosystem.

This isn't ready for merge yet, so I'd appreciate if detailed critique waits until it is, at which point it is welcome. As a bit of self-critique, I'm not satisfied that there are at least three copies of the Unicode-breaking algorithm, which differ in the specifics. I think I can get it down to two: an adjustForward and an adjustBackward, and call/inline them in the places where they get used.

Directional/design feedback, any bugs spotted while kicking the tires, and any additional test coverage that one of you might spot a need for and want to contribute, are all immediately welcomed. Please let me know what you think.

mnemnion added 30 commits July 1, 2024 20:24
This adds a test running step and a zig.zon manifest.  Also adds to the
.gitignore the new location of caches, .zig-out.
halfMatch now correctly manages its own memory, like it's supposed to.
The tests currently create Diffs with constant/static text, which
causes a panic on any attempt to free said text.

This pass fixes a couple leaks, and lays a foundation for consistently
freeing any diff's .text field when no longer in use: this requires
that, as it will be in real programs, the diff text is allocated to
begin with.
I'll need to knock out all the tests in diffMergeSemantic so that
I can flip the rest of the diffCleanupMerge tests on, without trying
to free .rodata memory.

The real annoyance of the decision to punt on memory management is in
the tests, which all have to be rewritten.  Whoever you are, you could
at least have written the tests responsibly, and spared the person who
cleans up your sloppy code from a bunch of drudgery.
Found a double free in the falsed-out one, so that's next up...
Also changing the use of .replaceRange, which lead me to think that the
item at the pointer location was being, y'know, replaced, with a use of
.insert, which is,, correct.
I had missed that some slices in cleanupSemantic were being taken from
the same string/slice that was already at that index, due to the prior
code renaming them.  Ganbatte!
The library now manages its own memory, as it should.
A diff splitting a UTF-8 codepoint is possible: Ω and ϩ, for example,
share a common suffix.  This split will make the diff invalid UTF-8,
even if both texts are valid.  A greater concern is common prefixes:
every Greek letter between Α and ο shares the prefix byte 0xce, meaning
that splits on a prefix would be the rule, not the exception, in normal
multibyte texts.

This patch prevents the prefix and suffix diffing from ever splitting
on such common prefixes and suffixes, by backing out when a diff lands
on a follow byte.  It remains to ensure that other splits are never in
the midst of a multibyte codepoint.
The only thing which matters here is not splitting codepoints down the
middle.  Since prefix and suffix matching will also refuse to split,
later repair stages won't re-split on us.  As the comments note, this
doesn't validate UTF-8, and might create less-than-perfect diffs if the
texts contain bad Unicode, a situation and outcome I do not care about
even slightly.

In addition to fixing the problem where diffs would be invalid utf-8 in
many ordinary cases, what this gets us is the ability to use codepoints
as indices for the diffLine function.  Since diffing won't damage utf-8
sequences, we can diff in linemode and decode each sequence back to the
codepoint, which will serve as an offset into the line_array.
I believe that's it for operations which perform splits on common seqs.

The fun part will be getting code coverage for all these pathways.  Not
a clue how to handle 1,110,032 line bailouts for diffLinesToChars.
@mnemnion mnemnion changed the title WIP: Complete diff-match-patch port (closes #25) Complete diff-match-patch port (closes #25) Jul 21, 2024
@mnemnion
Copy link
Contributor Author

The work on this branch is now complete.

I think it would be good to have some larger tests for diffLineMode, and that broadly speaking more tests for this library would be a good goal. Line coverage is at 99.6%, that includes one errdefer in diffCompute which it should be possible to trigger (but I haven't figured out how, and given how many checkAllAllocationFailures blocks it appears in, that might be wrong), and a branch in diffCleanupSemantic which the ported tests don't hit. There's also an @panic which I left in diffCommonOverlap, for now, although I'm fairly sure that, by construction, it will not be triggered except by a bug elsewhere in the code. The rest of the non-covered lines are spurious, they show up in test code which is 100% getting called, that might be a comptime artifact but the reasons are unclear.

Full line coverage, however, doesn't mean that it all works correctly, and it would be good to supplement that with some integration tests which do non-trivial diffing. Also worth checking if the Unicode-aware code causes any performance regression on ASCII text, my guess is that this would be minor, if visible at all, but it's worth having an actual answer.

I know the previous branch isn't merged in yet, so no special hurry with this one. I'm using it in another application, and pointing that at this repo at such time as it all gets merged in will very easy. But it's ready for detailed review when you have time and inclination to do so.

@travisstaloch
Copy link
Collaborator

Just pulled latest zls and this branch in order to test drive. But this branch doesn't build:

~/.../zig/zls $ zig build -Doptimize=ReleaseSafe
install
└─ install zls
   └─ zig build-exe zls ReleaseSafe native 1 errors
src/diff.zig:26:19: error: expected type 'error{OutOfMemory}', found 'error{OutOfMemory,BadPatchString}'
    const diffs = try dmp.diff(arena.allocator(), before, after, true);
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/diff.zig:26:19: note: 'error.BadPatchString' not a member of destination error set

My local zls changes

diff --git a/build.zig.zon b/build.zig.zon
index e4ce16d..6809a60 100644
--- a/build.zig.zon
+++ b/build.zig.zon
@@ -15,9 +15,12 @@
             .url = "https://github.com/ziglibs/known-folders/archive/47076c6b11214a218e9244471d8762310820911a.tar.gz",
             .hash = "12209d2738a2e1dbd3781c2e5f01a2ea877dcfeea53efdfa1913247297d328e6b207",
         },
         .diffz = .{
-            .url = "https://github.com/ziglibs/diffz/archive/ef45c00d655e5e40faf35afbbde81a1fa5ed7ffb.tar.gz",
-            .hash = "1220102cb2c669d82184fb1dc5380193d37d68b54e8d75b76b2d155b9af7d7e2e76d",
+              .path = "../diffz",
         },
         .@"lsp-codegen" = .{
             .url = "https://github.com/zigtools/zig-lsp-codegen/archive/13d1d9e44e1c602953437438b163950298ff8d85.tar.gz",

@travisstaloch
Copy link
Collaborator

travisstaloch commented Jul 21, 2024

Is this branch meant to be used in zls or just #23? If not, please ignore my comment above.

@travisstaloch
Copy link
Collaborator

If it is meant to be used in zls, here is a potential zls patch. I'm not sure whether InvalidRequest is the correct error mapping though:

error.BadPatchString => @intFromEnum(types.ErrorCodes.InvalidRequest),
~/.../zig/zls $ gd master
diff --git a/build.zig.zon b/build.zig.zon
index e4ce16d..84e8706 100644
--- a/build.zig.zon
+++ b/build.zig.zon
@@ -16,8 +16,10 @@
             .hash = "12209d2738a2e1dbd3781c2e5f01a2ea877dcfeea53efdfa1913247297d328e6b207",
         },
         .diffz = .{
-            .url = "https://github.com/ziglibs/diffz/archive/ef45c00d655e5e40faf35afbbde81a1fa5ed7ffb.tar.gz",
-            .hash = "1220102cb2c669d82184fb1dc5380193d37d68b54e8d75b76b2d155b9af7d7e2e76d",
+            //.url = "https://github.com/ziglibs/diffz/archive/ef45c00d655e5e40faf35afbbde81a1fa5ed7ffb.tar.gz",
+            // TODO change back to ziglibs once this is merged
+            .url = "https://github.com/mnemnion/diffz/archive/bb9a1ae4e7220e5360024df61c369d51516c6860.tar.gz",
+            .hash = "12205c450ca56247e78753a09784f74cb45a6d372c48653e21f64acdc07d5dc40b38",
         },
         .@"lsp-codegen" = .{
             .url = "https://github.com/zigtools/zig-lsp-codegen/archive/13d1d9e44e1c602953437438b163950298ff8d85.tar.gz",
diff --git a/src/Server.zig b/src/Server.zig
index f8c4a22..82bc647 100644
--- a/src/Server.zig
+++ b/src/Server.zig
@@ -138,7 +138,7 @@ pub const Error = error{
     /// The client has canceled a request and a server as detected
     /// the cancel.
     RequestCancelled,
-};
+} || diff.Error;

 pub const Status = enum {
     /// the server has not received a `initialize` request
@@ -1945,6 +1945,7 @@ fn processMessageReportError(server: *Server, message: Message) ?[]const u8 {
             .request => |request| return server.sendToClientResponseError(request.id, lsp.JsonRPCMessage.Response.Error{
                 .code = @enumFromInt(switch (err) {
                     error.OutOfMemory => @intFromEnum(types.ErrorCodes.InternalError),
+                    error.BadPatchString => @intFromEnum(types.ErrorCodes.InvalidRequest),
                     error.ParseError => @intFromEnum(types.ErrorCodes.ParseError),
                     error.InvalidRequest => @intFromEnum(types.ErrorCodes.InvalidRequest),
                     error.MethodNotFound => @intFromEnum(types.ErrorCodes.MethodNotFound),
diff --git a/src/diff.zig b/src/diff.zig
index a3e602f..ecdf577 100644
--- a/src/diff.zig
+++ b/src/diff.zig
@@ -10,7 +10,7 @@ const dmp = DiffMatchPatch{
     .diff_timeout = 250,
 };

-pub const Error = error{OutOfMemory};
+pub const Error = error{OutOfMemory} || DiffMatchPatch.DiffError;

 pub fn edits(
     allocator: std.mem.Allocator,

@travisstaloch
Copy link
Collaborator

I'm test driving this patch now in zls with the the diff above. I have been test driving #23 for the past 2 weeks or so and didn't notice any issues. However I haven't written much zig during this time so this might not be saying much.

@mnemnion
Copy link
Contributor Author

Hmm. So the BadPatchString error is for, well, a bad/malformed patch. There are a lot of ways that a patch can be wrong, and just the one way that it can be right.

But it's actually isolated to one public function, so it doesn't have to be a part of the signature for the others. But then we have a DiffError error set which doesn't cover all the errors in the library, which seems like the nomenclature is incorrect.

I think the move here is to keep both errors in the DiffError set, but define the functions which only return OutOfMemory as only returning OutOfMemory. That way, the DiffError set answers the question "what errors does the entire library throw" and the part of it which ZLS uses will stay compatible with the original interface.

It also seems better with specifically OutOfMemory, if functions which only throw that one error aren't defined as part of a wider error set.

@mnemnion
Copy link
Contributor Author

It does occur to me that code calling diffz should, in general, add the DiffError set to error sets for functions which call it. But I'm still going to slim the interface down so that the functions which can't return BadPatchString won't be in that error set, because I don't think it makes sense for functions which will only receive OutOfMemory to have to handle all the possible errors in the library.

It doesn't make a lot of sense for functions where that's the only case
to handle to use a larger superset.
@mnemnion
Copy link
Contributor Author

Done, that was relatively painless.

Kinda makes me think that DiffError should be PatchError, since those are the only functions which use it. But that's a cosmetic change which could be made instantly, so I'll defer on that for now.

@travisstaloch
Copy link
Collaborator

Thanks! With that last commit, zls no longer requires any changes to build with this PR.

@travisstaloch
Copy link
Collaborator

I'm test driving this patch now in zls with the the diff above.

Update: I've been writing more zig this week and haven't noticed any issues.

@mnemnion
Copy link
Contributor Author

mnemnion commented Aug 2, 2024

That reminded me that I had gotten ZLS to build earlier in the week, but hadn't patched in the branch.

So I did that. So far so good! I removed the arena, and tossed in dmp.diffCleanupEfficiency for good measure. I had had some notion of using the patch mechanism to apply the diff, but a closer look at the source makes it clear that it's getting transmogrified into JSON and sent elsewhere, makes sense though.

No idea if the overhead of making an arena for each diff covers the overhead of freeing things individually, I suspect it's a wash though.

@mnemnion
Copy link
Contributor Author

mnemnion commented Aug 5, 2024

I found a crashing case in Unicode handling, I should get a chance to debug it soon.

There's some odd false sharing happening which is codeunit related,
but at least it will no longer aggressively blow stack.
@mnemnion
Copy link
Contributor Author

mnemnion commented Aug 6, 2024

Fixed. The Unicode-aware diffBisectSplit logic would sometimes create a before or after pair with empty strings. Since this doesn't reduce the problem space for half of the code, it would happen again, until stack overflow.

Now it doesn't do that. If one of the pairs is empty, that means the other pair is a single insert/delete diff.

@AdjectiveAllison
Copy link

How far is this from being merged in? I've been following for a while and from what I remember it was pretty close to functional. Do you need any help getting over the finish line?

@mnemnion
Copy link
Contributor Author

I've been running ZLS with it patched in since August with no problems. But that doesn't actually touch the new code.

The work on this branch is done, although now that it's on my mind again I wouldn't mind doing a linting pass. It's been working in ohsnap for awhile, which some folks are starting to use.

What it could use is a test which handles a large-ish amount of diffed text in multibyte Unicode, using the line-oriented optimization. The bug fixed on 5 August turned up using ohsnap to diff RuneSet tests, which have a lot of multibyte sequences, and that code is original, the other systems use UTF-16-LE and this version uses UTF-8.

So if you wanted to hit up a source like the Wikipedia database, and work up a couple of 1-5kB diff tests using, say, Greek for two-byte and Chinese for three, that would be the last thing I've thought of to add here. The branch itself is done, and I think I got kcov to 100% line coverage, it was at least two nines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants