Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Lackluster in diff hunks that have both code changes and indentation changes #820

Open
unphased opened this issue Dec 3, 2021 · 6 comments

Comments

@unphased
Copy link

unphased commented Dec 3, 2021

Example of a diff I generated today that represents the type of workload I care about the most for a tool like this:

Raw git diff:

diff --git i/csvselectcols w/csvselectcols
index 2d0fee0..7fe30bb 100755
--- i/csvselectcols
+++ w/csvselectcols
@@ -7,20 +7,24 @@ def sanitize(s):
 cols = sys.argv[1:]
 idxs = []
 
+def parse_header(cols, header_row):
+    names = [n for n in cols if not n.isnumeric()]
+    if len(names) > 0:
+        if all(name in header_row for name in names):
+            # provided colnames all check out
+            idxs_ = [int(i) if i.isnumeric() else header_row.index(i) for i in cols]
+        else:
+            print (f"Could not find one of the given colnames {str(names)} in header: {str(header_row)}")
+            exit()
+        print(f"indices derived: {idxs_}")
+    else:
+        idxs_ = [int(x) for x in cols]
+    return idxs_
+
 header = True
 for row in csv.reader(sys.stdin):
     if header:
-        names = [n for n in cols if not n.isnumeric()]
-        if len(names) > 0:
-            if all(name in row for name in names):
-                # provided colnames all check out
-                idxs = [int(i) if i.isnumeric() else row.index(i) for i in cols]
-            else:
-                print (f"Could not find one of the given colnames {str(names)} in header: {str(row)}")
-                exit()
-            print(f"idxs: {idxs}")
-        else:
-            idxs = [int(x) for x in cols]
+        idxs = parse_header(cols, row)
 
     print("\t".join([sanitize(row[i]) for i in idxs]))
     header = False

image

Note that Delta kind of "fails" here because it is hampered by not being able to trace content that remained the same across line boundaries.

I wonder if we could maybe add an option for dedicating more compute toward word diffs that we do but across the entire hunk? This would be potentially expensive but surely it is manageable if doing it across the entire input still runs okayishly fast for me.

Am I being unfair on this by throwing in an indentation change?

@unphased unphased changed the title 🐛 Lacking robust detection within unchanged areas 🐛 Lacking in diff hunks that have both code changes and indentation changes Dec 3, 2021
@unphased unphased changed the title 🐛 Lacking in diff hunks that have both code changes and indentation changes 🐛 Lackluster in diff hunks that have both code changes and indentation changes Dec 3, 2021
@dandavison
Copy link
Owner

Your diff is really clear -- nice work. I'd definitely like to have discussions like this about how delta can do better and take more interesting approaches. And yes, I see no reason in principle why we couldn't investigate having delta read in an entire hunk and process all hunk lines together. Currently, as you know, delta reads all lines in a "subhunk" (i.e. a sequence of minus and plus lines, unbroken by any zero lines: https://github.com/dandavison/delta/blob/master/ARCHITECTURE.md#handling-diff-hunk-lines

For this particular diff, what does it look like with diff -w and diff -w --color-moved?

@dandavison
Copy link
Owner

Incidentally, you inspired me to also investigate more minimal diff styles in delta. I was experimenting with using underline for the unchanged parts of lines. But I haven't yet figure out a nice way in the code to remove the underline from (leading) whitespace. OTOH perhaps it's hideous :)

image
    minus-non-emph-style = ul
    plus-non-emph-style = syntax ul

@dandavison
Copy link
Owner

But your diff above really emphasizes that there are interesting avenues to explore in delta involving "squashed diffs" (#152) and hunk rearrangements. How's your Rust? :)

@unphased
Copy link
Author

unphased commented Dec 4, 2021

I cannot conceive of a better reason to learn Rust! It will take some time to get up to speed, but I think a confluence of forces are taking shape for me on this front. I have at least one other person on my team at work (small startup) who is interested in Rust, and I also expect I will want to build some portions of my next big personal project in it. Plus this that's 3 good reasons to really get into it this time around.

I'll also point out that there is an aspect that is poor about the diff that I used in my example above. Note how both the vanilla diff and delta (which is constrained to its output structure on a line basis) have identified the unchanged region

header = True
 for row in csv.reader(sys.stdin):
     if header:

as unchanged while my diff actually treats this region as a removal and addition!. Since it's being treated by my diff as a move (which it knows not how to detect, let alone indicate as a move), it is suffering from the same issue of being shown to the user in a duplicated way, even though it is unchanged! Now, code moves realistically speaking would always be shown in duplicate, but it's rather important at least to indicate them. note that git's move detection cannot activate if any changes show up in the moved code (and the zebra feature acts on line boundaries, which in my eyes is only mildly better than nothing).

Therefore I think this example that I gave inadvertently demonstrates yet another class of deficiency in diffs, that for many practical diffs it would be nice to be able to do a 2nd level meta diff in it.

@dandavison
Copy link
Owner

dandavison commented Dec 4, 2021

I cannot conceive of a better reason to learn Rust! It will take some time to get up to speed,

Incidentally, I can really highly recommend the O'Reilly book Programming Rust by Jim Blandy and Jason Orendorff. It's enjoyable to read, and is pitched at experienced programmers, often explicitly explaining Rust in the context of C++ (not that I know C++ well myself). The best programming language book I've ecountered.

@unphased
Copy link
Author

unphased commented Dec 5, 2021

Thanks for the rec!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants