-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Complete diff-match-patch port (closes #25) #26
base: main
Are you sure you want to change the base?
Conversation
This adds a test running step and a zig.zon manifest. Also adds to the .gitignore the new location of caches, .zig-out.
halfMatch now correctly manages its own memory, like it's supposed to.
The tests currently create Diffs with constant/static text, which causes a panic on any attempt to free said text. This pass fixes a couple leaks, and lays a foundation for consistently freeing any diff's .text field when no longer in use: this requires that, as it will be in real programs, the diff text is allocated to begin with.
I'll need to knock out all the tests in diffMergeSemantic so that I can flip the rest of the diffCleanupMerge tests on, without trying to free .rodata memory. The real annoyance of the decision to punt on memory management is in the tests, which all have to be rewritten. Whoever you are, you could at least have written the tests responsibly, and spared the person who cleans up your sloppy code from a bunch of drudgery.
Found a double free in the falsed-out one, so that's next up...
Also changing the use of .replaceRange, which lead me to think that the item at the pointer location was being, y'know, replaced, with a use of .insert, which is,, correct.
I had missed that some slices in cleanupSemantic were being taken from the same string/slice that was already at that index, due to the prior code renaming them. Ganbatte!
The library now manages its own memory, as it should.
A diff splitting a UTF-8 codepoint is possible: Ω and ϩ, for example, share a common suffix. This split will make the diff invalid UTF-8, even if both texts are valid. A greater concern is common prefixes: every Greek letter between Α and ο shares the prefix byte 0xce, meaning that splits on a prefix would be the rule, not the exception, in normal multibyte texts. This patch prevents the prefix and suffix diffing from ever splitting on such common prefixes and suffixes, by backing out when a diff lands on a follow byte. It remains to ensure that other splits are never in the midst of a multibyte codepoint.
The only thing which matters here is not splitting codepoints down the middle. Since prefix and suffix matching will also refuse to split, later repair stages won't re-split on us. As the comments note, this doesn't validate UTF-8, and might create less-than-perfect diffs if the texts contain bad Unicode, a situation and outcome I do not care about even slightly. In addition to fixing the problem where diffs would be invalid utf-8 in many ordinary cases, what this gets us is the ability to use codepoints as indices for the diffLine function. Since diffing won't damage utf-8 sequences, we can diff in linemode and decode each sequence back to the codepoint, which will serve as an offset into the line_array.
I believe that's it for operations which perform splits on common seqs. The fun part will be getting code coverage for all these pathways. Not a clue how to handle 1,110,032 line bailouts for diffLinesToChars.
I intend to circle back and make all the BadPatchText's check for fails as well.
This brings the line-mode allocation failure test down to a point where runtimes are comparable to other tests.
The work on this branch is now complete. I think it would be good to have some larger tests for Full line coverage, however, doesn't mean that it all works correctly, and it would be good to supplement that with some integration tests which do non-trivial diffing. Also worth checking if the Unicode-aware code causes any performance regression on ASCII text, my guess is that this would be minor, if visible at all, but it's worth having an actual answer. I know the previous branch isn't merged in yet, so no special hurry with this one. I'm using it in another application, and pointing that at this repo at such time as it all gets merged in will very easy. But it's ready for detailed review when you have time and inclination to do so. |
Just pulled latest zls and this branch in order to test drive. But this branch doesn't build: ~/.../zig/zls $ zig build -Doptimize=ReleaseSafe
install
└─ install zls
└─ zig build-exe zls ReleaseSafe native 1 errors
src/diff.zig:26:19: error: expected type 'error{OutOfMemory}', found 'error{OutOfMemory,BadPatchString}'
const diffs = try dmp.diff(arena.allocator(), before, after, true);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/diff.zig:26:19: note: 'error.BadPatchString' not a member of destination error set My local zls changes diff --git a/build.zig.zon b/build.zig.zon
index e4ce16d..6809a60 100644
--- a/build.zig.zon
+++ b/build.zig.zon
@@ -15,9 +15,12 @@
.url = "https://github.com/ziglibs/known-folders/archive/47076c6b11214a218e9244471d8762310820911a.tar.gz",
.hash = "12209d2738a2e1dbd3781c2e5f01a2ea877dcfeea53efdfa1913247297d328e6b207",
},
.diffz = .{
- .url = "https://github.com/ziglibs/diffz/archive/ef45c00d655e5e40faf35afbbde81a1fa5ed7ffb.tar.gz",
- .hash = "1220102cb2c669d82184fb1dc5380193d37d68b54e8d75b76b2d155b9af7d7e2e76d",
+ .path = "../diffz",
},
.@"lsp-codegen" = .{
.url = "https://github.com/zigtools/zig-lsp-codegen/archive/13d1d9e44e1c602953437438b163950298ff8d85.tar.gz", |
Is this branch meant to be used in zls or just #23? If not, please ignore my comment above. |
If it is meant to be used in zls, here is a potential zls patch. I'm not sure whether InvalidRequest is the correct error mapping though: error.BadPatchString => @intFromEnum(types.ErrorCodes.InvalidRequest), ~/.../zig/zls $ gd master
diff --git a/build.zig.zon b/build.zig.zon
index e4ce16d..84e8706 100644
--- a/build.zig.zon
+++ b/build.zig.zon
@@ -16,8 +16,10 @@
.hash = "12209d2738a2e1dbd3781c2e5f01a2ea877dcfeea53efdfa1913247297d328e6b207",
},
.diffz = .{
- .url = "https://github.com/ziglibs/diffz/archive/ef45c00d655e5e40faf35afbbde81a1fa5ed7ffb.tar.gz",
- .hash = "1220102cb2c669d82184fb1dc5380193d37d68b54e8d75b76b2d155b9af7d7e2e76d",
+ //.url = "https://github.com/ziglibs/diffz/archive/ef45c00d655e5e40faf35afbbde81a1fa5ed7ffb.tar.gz",
+ // TODO change back to ziglibs once this is merged
+ .url = "https://github.com/mnemnion/diffz/archive/bb9a1ae4e7220e5360024df61c369d51516c6860.tar.gz",
+ .hash = "12205c450ca56247e78753a09784f74cb45a6d372c48653e21f64acdc07d5dc40b38",
},
.@"lsp-codegen" = .{
.url = "https://github.com/zigtools/zig-lsp-codegen/archive/13d1d9e44e1c602953437438b163950298ff8d85.tar.gz",
diff --git a/src/Server.zig b/src/Server.zig
index f8c4a22..82bc647 100644
--- a/src/Server.zig
+++ b/src/Server.zig
@@ -138,7 +138,7 @@ pub const Error = error{
/// The client has canceled a request and a server as detected
/// the cancel.
RequestCancelled,
-};
+} || diff.Error;
pub const Status = enum {
/// the server has not received a `initialize` request
@@ -1945,6 +1945,7 @@ fn processMessageReportError(server: *Server, message: Message) ?[]const u8 {
.request => |request| return server.sendToClientResponseError(request.id, lsp.JsonRPCMessage.Response.Error{
.code = @enumFromInt(switch (err) {
error.OutOfMemory => @intFromEnum(types.ErrorCodes.InternalError),
+ error.BadPatchString => @intFromEnum(types.ErrorCodes.InvalidRequest),
error.ParseError => @intFromEnum(types.ErrorCodes.ParseError),
error.InvalidRequest => @intFromEnum(types.ErrorCodes.InvalidRequest),
error.MethodNotFound => @intFromEnum(types.ErrorCodes.MethodNotFound),
diff --git a/src/diff.zig b/src/diff.zig
index a3e602f..ecdf577 100644
--- a/src/diff.zig
+++ b/src/diff.zig
@@ -10,7 +10,7 @@ const dmp = DiffMatchPatch{
.diff_timeout = 250,
};
-pub const Error = error{OutOfMemory};
+pub const Error = error{OutOfMemory} || DiffMatchPatch.DiffError;
pub fn edits(
allocator: std.mem.Allocator, |
I'm test driving this patch now in zls with the the diff above. I have been test driving #23 for the past 2 weeks or so and didn't notice any issues. However I haven't written much zig during this time so this might not be saying much. |
Hmm. So the But it's actually isolated to one public function, so it doesn't have to be a part of the signature for the others. But then we have a I think the move here is to keep both errors in the It also seems better with specifically |
It does occur to me that code calling |
It doesn't make a lot of sense for functions where that's the only case to handle to use a larger superset.
Done, that was relatively painless. Kinda makes me think that |
Thanks! With that last commit, zls no longer requires any changes to build with this PR. |
Update: I've been writing more zig this week and haven't noticed any issues. |
That reminded me that I had gotten ZLS to build earlier in the week, but hadn't patched in the branch. So I did that. So far so good! I removed the arena, and tossed in No idea if the overhead of making an arena for each diff covers the overhead of freeing things individually, I suspect it's a wash though. |
I found a crashing case in Unicode handling, I should get a chance to debug it soon. |
There's some odd false sharing happening which is codeunit related, but at least it will no longer aggressively blow stack.
Fixed. The Unicode-aware diffBisectSplit logic would sometimes create a before or after pair with empty strings. Since this doesn't reduce the problem space for half of the code, it would happen again, until stack overflow. Now it doesn't do that. If one of the pairs is empty, that means the other pair is a single insert/delete diff. |
How far is this from being merged in? I've been following for a while and from what I remember it was pretty close to functional. Do you need any help getting over the finish line? |
I've been running ZLS with it patched in since August with no problems. But that doesn't actually touch the new code. The work on this branch is done, although now that it's on my mind again I wouldn't mind doing a linting pass. It's been working in ohsnap for awhile, which some folks are starting to use. What it could use is a test which handles a large-ish amount of diffed text in multibyte Unicode, using the line-oriented optimization. The bug fixed on 5 August turned up using So if you wanted to hit up a source like the Wikipedia database, and work up a couple of 1-5kB diff tests using, say, Greek for two-byte and Chinese for three, that would be the last thing I've thought of to add here. The branch itself is done, and I think I got kcov to 100% line coverage, it was at least two nines. |
This branch has a completed port of the match and patch algorithms, with the necessary test coverage.
It actually doesn't include the delta code, because I think the delta format sucks and it's worth improving.
Another major change is Unicode awareness: this branch will not cut into Unicode codepoints. This may conceivably come with some performance penalty for ASCII code, if so, the logic is isolated to a few functions and can be feature-gated accordingly. I expect the damage to be minor, perhaps unnoticeable, once I finish performance tuning the algorithms.
A diff which truncates code point sequences is not inherently incorrect, in that it will still apply properly and give a correct before-and-after value. It can result in patches which are invalid UTF-8, which is undesirable, but the patchAddContext logic could prevent this at the boundary.
However, correct code point handling is essential for a change I haven't yet implemented: making the
diffLineMode
function work with a useful number of lines. A reasonable median for number of lines in a text file is 1000 (made up number), and the currentdiffLineMode
gives up after a round one-quarter of those lines, meaning that the effort spent iterating and hashing may, or may not, be made up for by the amount it speeds up a diff. Line mode pays off more for longer files, so this is worth correcting.Because this library is UTF-8 native, we don't have to take a shortcut of ignoring surrogate pairs, and therefore, the approach I intend to take here will allow for 1,112,034 lines of text to be compared. Engineering a test which will actually trigger the "give up" condition is going to be difficult enough that I suggest that we comptime-limit the amount of lines severely in test mode, in order to cover that case.
I've focused on compatibility with the existing library, as the useful default for a port, but I don't think we should let that hold us back here. For the (notional) sake of compatibility, perhaps some of the functionality I'm interested in improving could be offered in a compatibility mode, but there are several least-common-denominator aspects to the design which I'd like to liberate this library from. I expect very few uses will cross over into some other language, and it's not difficult to provide for that use case while offering a better alternative for the Zig ecosystem.
This isn't ready for merge yet, so I'd appreciate if detailed critique waits until it is, at which point it is welcome. As a bit of self-critique, I'm not satisfied that there are at least three copies of the Unicode-breaking algorithm, which differ in the specifics. I think I can get it down to two: an
adjustForward
and anadjustBackward
, and call/inline them in the places where they get used.Directional/design feedback, any bugs spotted while kicking the tires, and any additional test coverage that one of you might spot a need for and want to contribute, are all immediately welcomed. Please let me know what you think.