Complete diff-match-patch port (closes #25) #26

mnemnion · 2024-07-11T18:25:24Z

This branch has a completed port of the match and patch algorithms, with the necessary test coverage.

It actually doesn't include the delta code, because I think the delta format sucks and it's worth improving.

Another major change is Unicode awareness: this branch will not cut into Unicode codepoints. This may conceivably come with some performance penalty for ASCII code, if so, the logic is isolated to a few functions and can be feature-gated accordingly. I expect the damage to be minor, perhaps unnoticeable, once I finish performance tuning the algorithms.

A diff which truncates code point sequences is not inherently incorrect, in that it will still apply properly and give a correct before-and-after value. It can result in patches which are invalid UTF-8, which is undesirable, but the patchAddContext logic could prevent this at the boundary.

However, correct code point handling is essential for a change I haven't yet implemented: making the diffLineMode function work with a useful number of lines. A reasonable median for number of lines in a text file is 1000 (made up number), and the current diffLineMode gives up after a round one-quarter of those lines, meaning that the effort spent iterating and hashing may, or may not, be made up for by the amount it speeds up a diff. Line mode pays off more for longer files, so this is worth correcting.

Because this library is UTF-8 native, we don't have to take a shortcut of ignoring surrogate pairs, and therefore, the approach I intend to take here will allow for 1,112,034 lines of text to be compared. Engineering a test which will actually trigger the "give up" condition is going to be difficult enough that I suggest that we comptime-limit the amount of lines severely in test mode, in order to cover that case.

I've focused on compatibility with the existing library, as the useful default for a port, but I don't think we should let that hold us back here. For the (notional) sake of compatibility, perhaps some of the functionality I'm interested in improving could be offered in a compatibility mode, but there are several least-common-denominator aspects to the design which I'd like to liberate this library from. I expect very few uses will cross over into some other language, and it's not difficult to provide for that use case while offering a better alternative for the Zig ecosystem.

This isn't ready for merge yet, so I'd appreciate if detailed critique waits until it is, at which point it is welcome. As a bit of self-critique, I'm not satisfied that there are at least three copies of the Unicode-breaking algorithm, which differ in the specifics. I think I can get it down to two: an adjustForward and an adjustBackward, and call/inline them in the places where they get used.

Directional/design feedback, any bugs spotted while kicking the tires, and any additional test coverage that one of you might spot a need for and want to contribute, are all immediately welcomed. Please let me know what you think.

This adds a test running step and a zig.zon manifest. Also adds to the .gitignore the new location of caches, .zig-out.

halfMatch now correctly manages its own memory, like it's supposed to.

The tests currently create Diffs with constant/static text, which causes a panic on any attempt to free said text. This pass fixes a couple leaks, and lays a foundation for consistently freeing any diff's .text field when no longer in use: this requires that, as it will be in real programs, the diff text is allocated to begin with.

I'll need to knock out all the tests in diffMergeSemantic so that I can flip the rest of the diffCleanupMerge tests on, without trying to free .rodata memory. The real annoyance of the decision to punt on memory management is in the tests, which all have to be rewritten. Whoever you are, you could at least have written the tests responsibly, and spared the person who cleans up your sloppy code from a bunch of drudgery.

Found a double free in the falsed-out one, so that's next up...

Also changing the use of .replaceRange, which lead me to think that the item at the pointer location was being, y'know, replaced, with a use of .insert, which is,, correct.

I had missed that some slices in cleanupSemantic were being taken from the same string/slice that was already at that index, due to the prior code renaming them. Ganbatte!

The library now manages its own memory, as it should.

A diff splitting a UTF-8 codepoint is possible: Ω and ϩ, for example, share a common suffix. This split will make the diff invalid UTF-8, even if both texts are valid. A greater concern is common prefixes: every Greek letter between Α and ο shares the prefix byte 0xce, meaning that splits on a prefix would be the rule, not the exception, in normal multibyte texts. This patch prevents the prefix and suffix diffing from ever splitting on such common prefixes and suffixes, by backing out when a diff lands on a follow byte. It remains to ensure that other splits are never in the midst of a multibyte codepoint.

The only thing which matters here is not splitting codepoints down the middle. Since prefix and suffix matching will also refuse to split, later repair stages won't re-split on us. As the comments note, this doesn't validate UTF-8, and might create less-than-perfect diffs if the texts contain bad Unicode, a situation and outcome I do not care about even slightly. In addition to fixing the problem where diffs would be invalid utf-8 in many ordinary cases, what this gets us is the ability to use codepoints as indices for the diffLine function. Since diffing won't damage utf-8 sequences, we can diff in linemode and decode each sequence back to the codepoint, which will serve as an offset into the line_array.

I believe that's it for operations which perform splits on common seqs. The fun part will be getting code coverage for all these pathways. Not a clue how to handle 1,110,032 line bailouts for diffLinesToChars.

I intend to circle back and make all the BadPatchText's check for fails as well.

This brings the line-mode allocation failure test down to a point where runtimes are comparable to other tests.

mnemnion · 2024-07-21T21:20:43Z

The work on this branch is now complete.

I think it would be good to have some larger tests for diffLineMode, and that broadly speaking more tests for this library would be a good goal. Line coverage is at 99.6%, that includes one errdefer in diffCompute which it should be possible to trigger (but I haven't figured out how, and given how many checkAllAllocationFailures blocks it appears in, that might be wrong), and a branch in diffCleanupSemantic which the ported tests don't hit. There's also an @panic which I left in diffCommonOverlap, for now, although I'm fairly sure that, by construction, it will not be triggered except by a bug elsewhere in the code. The rest of the non-covered lines are spurious, they show up in test code which is 100% getting called, that might be a comptime artifact but the reasons are unclear.

Full line coverage, however, doesn't mean that it all works correctly, and it would be good to supplement that with some integration tests which do non-trivial diffing. Also worth checking if the Unicode-aware code causes any performance regression on ASCII text, my guess is that this would be minor, if visible at all, but it's worth having an actual answer.

I know the previous branch isn't merged in yet, so no special hurry with this one. I'm using it in another application, and pointing that at this repo at such time as it all gets merged in will very easy. But it's ready for detailed review when you have time and inclination to do so.

travisstaloch · 2024-07-21T21:35:30Z

Just pulled latest zls and this branch in order to test drive. But this branch doesn't build:

~/.../zig/zls $ zig build -Doptimize=ReleaseSafe
install
└─ install zls
   └─ zig build-exe zls ReleaseSafe native 1 errors
src/diff.zig:26:19: error: expected type 'error{OutOfMemory}', found 'error{OutOfMemory,BadPatchString}'
    const diffs = try dmp.diff(arena.allocator(), before, after, true);
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/diff.zig:26:19: note: 'error.BadPatchString' not a member of destination error set

My local zls changes

diff --git a/build.zig.zon b/build.zig.zon
index e4ce16d..6809a60 100644
--- a/build.zig.zon
+++ b/build.zig.zon
@@ -15,9 +15,12 @@
             .url = "https://github.com/ziglibs/known-folders/archive/47076c6b11214a218e9244471d8762310820911a.tar.gz",
             .hash = "12209d2738a2e1dbd3781c2e5f01a2ea877dcfeea53efdfa1913247297d328e6b207",
         },
         .diffz = .{
-            .url = "https://github.com/ziglibs/diffz/archive/ef45c00d655e5e40faf35afbbde81a1fa5ed7ffb.tar.gz",
-            .hash = "1220102cb2c669d82184fb1dc5380193d37d68b54e8d75b76b2d155b9af7d7e2e76d",
+              .path = "../diffz",
         },
         .@"lsp-codegen" = .{
             .url = "https://github.com/zigtools/zig-lsp-codegen/archive/13d1d9e44e1c602953437438b163950298ff8d85.tar.gz",

travisstaloch · 2024-07-21T22:00:17Z

Is this branch meant to be used in zls or just #23? If not, please ignore my comment above.

travisstaloch · 2024-07-21T22:07:47Z

If it is meant to be used in zls, here is a potential zls patch. I'm not sure whether InvalidRequest is the correct error mapping though:

error.BadPatchString => @intFromEnum(types.ErrorCodes.InvalidRequest),

~/.../zig/zls $ gd master
diff --git a/build.zig.zon b/build.zig.zon
index e4ce16d..84e8706 100644
--- a/build.zig.zon
+++ b/build.zig.zon
@@ -16,8 +16,10 @@
             .hash = "12209d2738a2e1dbd3781c2e5f01a2ea877dcfeea53efdfa1913247297d328e6b207",
         },
         .diffz = .{
-            .url = "https://github.com/ziglibs/diffz/archive/ef45c00d655e5e40faf35afbbde81a1fa5ed7ffb.tar.gz",
-            .hash = "1220102cb2c669d82184fb1dc5380193d37d68b54e8d75b76b2d155b9af7d7e2e76d",
+            //.url = "https://github.com/ziglibs/diffz/archive/ef45c00d655e5e40faf35afbbde81a1fa5ed7ffb.tar.gz",
+            // TODO change back to ziglibs once this is merged
+            .url = "https://github.com/mnemnion/diffz/archive/bb9a1ae4e7220e5360024df61c369d51516c6860.tar.gz",
+            .hash = "12205c450ca56247e78753a09784f74cb45a6d372c48653e21f64acdc07d5dc40b38",
         },
         .@"lsp-codegen" = .{
             .url = "https://github.com/zigtools/zig-lsp-codegen/archive/13d1d9e44e1c602953437438b163950298ff8d85.tar.gz",
diff --git a/src/Server.zig b/src/Server.zig
index f8c4a22..82bc647 100644
--- a/src/Server.zig
+++ b/src/Server.zig
@@ -138,7 +138,7 @@ pub const Error = error{
     /// The client has canceled a request and a server as detected
     /// the cancel.
     RequestCancelled,
-};
+} || diff.Error;

 pub const Status = enum {
     /// the server has not received a `initialize` request
@@ -1945,6 +1945,7 @@ fn processMessageReportError(server: *Server, message: Message) ?[]const u8 {
             .request => |request| return server.sendToClientResponseError(request.id, lsp.JsonRPCMessage.Response.Error{
                 .code = @enumFromInt(switch (err) {
                     error.OutOfMemory => @intFromEnum(types.ErrorCodes.InternalError),
+                    error.BadPatchString => @intFromEnum(types.ErrorCodes.InvalidRequest),
                     error.ParseError => @intFromEnum(types.ErrorCodes.ParseError),
                     error.InvalidRequest => @intFromEnum(types.ErrorCodes.InvalidRequest),
                     error.MethodNotFound => @intFromEnum(types.ErrorCodes.MethodNotFound),
diff --git a/src/diff.zig b/src/diff.zig
index a3e602f..ecdf577 100644
--- a/src/diff.zig
+++ b/src/diff.zig
@@ -10,7 +10,7 @@ const dmp = DiffMatchPatch{
     .diff_timeout = 250,
 };

-pub const Error = error{OutOfMemory};
+pub const Error = error{OutOfMemory} || DiffMatchPatch.DiffError;

 pub fn edits(
     allocator: std.mem.Allocator,

travisstaloch · 2024-07-21T22:17:11Z

I'm test driving this patch now in zls with the the diff above. I have been test driving #23 for the past 2 weeks or so and didn't notice any issues. However I haven't written much zig during this time so this might not be saying much.

mnemnion · 2024-07-22T00:17:44Z

Hmm. So the BadPatchString error is for, well, a bad/malformed patch. There are a lot of ways that a patch can be wrong, and just the one way that it can be right.

But it's actually isolated to one public function, so it doesn't have to be a part of the signature for the others. But then we have a DiffError error set which doesn't cover all the errors in the library, which seems like the nomenclature is incorrect.

I think the move here is to keep both errors in the DiffError set, but define the functions which only return OutOfMemory as only returning OutOfMemory. That way, the DiffError set answers the question "what errors does the entire library throw" and the part of it which ZLS uses will stay compatible with the original interface.

It also seems better with specifically OutOfMemory, if functions which only throw that one error aren't defined as part of a wider error set.

mnemnion · 2024-07-22T00:27:04Z

It does occur to me that code calling diffz should, in general, add the DiffError set to error sets for functions which call it. But I'm still going to slim the interface down so that the functions which can't return BadPatchString won't be in that error set, because I don't think it makes sense for functions which will only receive OutOfMemory to have to handle all the possible errors in the library.

It doesn't make a lot of sense for functions where that's the only case to handle to use a larger superset.

mnemnion · 2024-07-22T00:37:52Z

Done, that was relatively painless.

Kinda makes me think that DiffError should be PatchError, since those are the only functions which use it. But that's a cosmetic change which could be made instantly, so I'll defer on that for now.

travisstaloch · 2024-07-22T00:46:42Z

Thanks! With that last commit, zls no longer requires any changes to build with this PR.

travisstaloch · 2024-08-01T21:05:36Z

I'm test driving this patch now in zls with the the diff above.

Update: I've been writing more zig this week and haven't noticed any issues.

mnemnion · 2024-08-02T04:09:25Z

That reminded me that I had gotten ZLS to build earlier in the week, but hadn't patched in the branch.

So I did that. So far so good! I removed the arena, and tossed in dmp.diffCleanupEfficiency for good measure. I had had some notion of using the patch mechanism to apply the diff, but a closer look at the source makes it clear that it's getting transmogrified into JSON and sent elsewhere, makes sense though.

No idea if the overhead of making an arena for each diff covers the overhead of freeing things individually, I suspect it's a wash though.

mnemnion · 2024-08-05T16:39:27Z

I found a crashing case in Unicode handling, I should get a chance to debug it soon.

There's some odd false sharing happening which is codeunit related, but at least it will no longer aggressively blow stack.

mnemnion · 2024-08-06T00:29:15Z

Fixed. The Unicode-aware diffBisectSplit logic would sometimes create a before or after pair with empty strings. Since this doesn't reduce the problem space for half of the code, it would happen again, until stack overflow.

Now it doesn't do that. If one of the pairs is empty, that means the other pair is a single insert/delete diff.

AdjectiveAllison · 2024-12-23T15:31:39Z

How far is this from being merged in? I've been following for a while and from what I remember it was pretty close to functional. Do you need any help getting over the finish line?

mnemnion · 2024-12-31T04:25:15Z

I've been running ZLS with it patched in since August with no problems. But that doesn't actually touch the new code.

The work on this branch is done, although now that it's on my mind again I wouldn't mind doing a linting pass. It's been working in ohsnap for awhile, which some folks are starting to use.

What it could use is a test which handles a large-ish amount of diffed text in multibyte Unicode, using the line-oriented optimization. The bug fixed on 5 August turned up using ohsnap to diff RuneSet tests, which have a lot of multibyte sequences, and that code is original, the other systems use UTF-16-LE and this version uses UTF-8.

So if you wanted to hit up a source like the Wikipedia database, and work up a couple of 1-5kB diff tests using, say, Greek for two-byte and Chinese for three, that would be the last thing I've thought of to add here. The branch itself is done, and I think I got kcov to 100% line coverage, it was at least two nines.

mnemnion added 30 commits July 1, 2024 20:24

Add build.zig.zon, update build.zig, .gitignore

9c27090

This adds a test running step and a zig.zon manifest. Also adds to the .gitignore the new location of caches, .zig-out.

Module name is diffz, not DiffMatchPatch

4792861

More paths, .target, .optimize

10bcba0

Memory-managed halfMatch function

64bc1bf

halfMatch now correctly manages its own memory, like it's supposed to.

Managed memory in diffLinesToChars

0c7c4e2

Managed memory in diffCharsToLines

a5c993f

One test at a time...

7c5cccc

More tests

dbc3a84

Found a double free in the falsed-out one, so that's next up...

Fix two double-frees

0b2f274

Also changing the use of .replaceRange, which lead me to think that the item at the pointer location was being, y'know, replaced, with a use of .insert, which is,, correct.

Free two more clobbered diff texts

18b7531

Restore all diffCleanMerge tests

1ebee82

Restore frees to diffCleanupSemanticLossless

350a85e

Tests of diffCleanupSemanticLossless pass

fe21b51

alloc -> allocator

5ff7a76

Tests pass for diffBisect

b56449f

Catch a few leaks, lather, rinse, repeat

2388fff

Defer free until slice is taken

7a47bd7

I had missed that some slices in cleanupSemantic were being taken from the same string/slice that was already at that index, due to the prior code renaming them. Ganbatte!

Fait accompli

eecdb47

The library now manages its own memory, as it should.

Polish things up

7032b4b

Remove unused temp diff

87b1b32

Add (failing!) greek test

006d7a1

Add some easy two-byte tests

7ebdf12

Back out of overlaps if we clip a character

56104d4

I believe that's it for operations which perform splits on common seqs. The fun part will be getting code coverage for all these pathways. Not a clue how to handle 1,110,032 line bailouts for diffLinesToChars.

Less than zero impossible for usize

2f2e57e

Add coverage step for build system

7b7cd26

Add diffIndex, diffBeforeText, diffAfterText

0dbfb97

mnemnion added 9 commits July 13, 2024 12:19

Fix: logic and memory bugs in patchFromHeader

1d0c79f

I intend to circle back and make all the BadPatchText's check for fails as well.

Check memory failure within testBadPatchString

42f36f4

Add post-first-patch failing case

e01668b

Tracking down line-mode fails

ba58e7e

Port fix for line mode from memory branch

f22ccb4

Add configurable line threshold

85a84b7

Add check-lines limit to DiffMatchPatch struct

5d3a83f

This brings the line-mode allocation failure test down to a point where runtimes are comparable to other tests.

Roadmap updates

1fd9187

Make diffCleanupSemantic public

bf22ea4

mnemnion changed the title ~~WIP: Complete diff-match-patch port (closes #25)~~ Complete diff-match-patch port (closes #25) Jul 21, 2024

Remove if (false) branches and XXX bool

bb9a1ae

Change most error sets to error{OutOfMemory}

3a1b7ef

It doesn't make a lot of sense for functions where that's the only case to handle to use a larger superset.

Panic? At the backout!

b8d5661

mnemnion added 2 commits August 5, 2024 19:37

Fix stack overflow in diffBisectSplit

c87a288

There's some odd false sharing happening which is codeunit related, but at least it will no longer aggressively blow stack.

Bugfix for the bugfix

0249d68

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Complete diff-match-patch port (closes #25) #26

Complete diff-match-patch port (closes #25) #26

mnemnion commented Jul 11, 2024

mnemnion commented Jul 21, 2024

travisstaloch commented Jul 21, 2024

travisstaloch commented Jul 21, 2024 •

edited

Loading

travisstaloch commented Jul 21, 2024

travisstaloch commented Jul 21, 2024

mnemnion commented Jul 22, 2024

mnemnion commented Jul 22, 2024

mnemnion commented Jul 22, 2024

travisstaloch commented Jul 22, 2024

travisstaloch commented Aug 1, 2024

mnemnion commented Aug 2, 2024

mnemnion commented Aug 5, 2024

mnemnion commented Aug 6, 2024

AdjectiveAllison commented Dec 23, 2024

mnemnion commented Dec 31, 2024

Complete diff-match-patch port (closes #25) #26

Are you sure you want to change the base?

Complete diff-match-patch port (closes #25) #26

Conversation

mnemnion commented Jul 11, 2024

mnemnion commented Jul 21, 2024

travisstaloch commented Jul 21, 2024

travisstaloch commented Jul 21, 2024 • edited Loading

travisstaloch commented Jul 21, 2024

travisstaloch commented Jul 21, 2024

mnemnion commented Jul 22, 2024

mnemnion commented Jul 22, 2024

mnemnion commented Jul 22, 2024

travisstaloch commented Jul 22, 2024

travisstaloch commented Aug 1, 2024

mnemnion commented Aug 2, 2024

mnemnion commented Aug 5, 2024

mnemnion commented Aug 6, 2024

AdjectiveAllison commented Dec 23, 2024

mnemnion commented Dec 31, 2024

travisstaloch commented Jul 21, 2024 •

edited

Loading