Diff output is sometimes confusing and/or hard to read #274

schellj · 2021-09-16T22:23:37Z

I've been working with a test apparatus for comparing JSON data. For ease-of-use and readability, I'd like to be able to compare prettified/indented JSON strings and print the differences in an easy-to-read way.

I've found that the cmp.Diff output can be quite varied even on similar inputs and can be either easy-to-read, somewhat hard to read, or hard to read and confusing depending on the exact input for non-obvious reasons.

See https://play.golang.org/p/mu-TsXoGEEL (pasted below):

Code:

package main

import (
	"fmt"

	"github.com/google/go-cmp/cmp"
)

func main() {
	fmt.Printf("Confusing diff output:\n%s\n", cmp.Diff(`{
  "id": 1,
  "with_package_maps": true,
  "with_other_maps": true
}`, `{
  "id": 1434180,
  "with_package_maps": true,
  "with_other_maps": true
}`))

	fmt.Printf("Non-confusing, but somewhat hard to find differences diff output:\n%s\n", cmp.Diff(`{
  "id": 1,
  "foo": true,
  "bar": true,
}`, `{
  "id": 1434180,
  "foo": true,
  "bar": true,
}`))

	fmt.Printf("Non-confusing, easy to read and find differences diff output:\n%s\n", cmp.Diff(`
{
  "id": 1,
  "with_package_maps": true,
  "with_other_maps": true
}`, `
{
  "id": 1434180,
  "with_package_maps": true,
  "with_other_maps": true
}`))
}

Output:

Confusing diff output:
  strings.Join({
  	"{\n  \"id\": 1",
+ 	"434180",
  	",\n  \"with_package_maps\": true,\n  \"with_other_maps\": true\n}",
  }, "")

Non-confusing, easy to read and find differences diff output:
  (
  	"""
  	
  	{
- 	  "id": 1,
+ 	  "id": 1434180,
  	  "with_package_maps": true,
  	  "with_other_maps": true
  	}
  	"""
  )

Non-confusing, but somewhat harder to find differences diff output:
  string(
- 	"{\n  \"id\": 1,\n  \"foo\": true,\n  \"bar\": true,\n}",
+ 	"{\n  \"id\": 1434180,\n  \"foo\": true,\n  \"bar\": true,\n}",
  )

The text was updated successfully, but these errors were encountered:

dsnet · 2021-09-16T22:33:55Z

I've found that the cmp.Diff output can be quite varied even on similar inputs and can be either easy-to-read, somewhat hard to read, or hard to read and confusing depending on the exact input for non-obvious reasons.

When it comes to free-form text, the reporter logic is just a bunch of heuristics to try and choose between several different ways to present the data. We can adjust the heuristics, but it's a hard problem. Bug reports like this help us tune the heuristics.

That said, have you considered doing a more structured diff of the data? https://play.golang.org/p/29x4dsDurlQ
In the example, we pass a transformer to convert an arbitrary string to a structured map[string]interface{} so that cmp.Diff can do a much better job presenting the difference.

schellj · 2021-09-16T22:50:57Z

Hi @dsnet, thanks for the reply.

I have tried approaches similar to your example some, but didn't necessarily find them more useful due to times like below where the rendered representation shows integer data in scientific notation and/or as a float:

  string(Inverse(ParseJSON, map[string]interface{}{
- 	"id":                float64(1),
+ 	"id":                float64(1.43418e+06),
  	"with_other_maps":   bool(true),
  	"with_package_maps": bool(true),
  }))

I saw another approach that splits the input string on newlines (e.g., https://play.golang.org/p/opdEWiJjXCJ). The output is more consistent, but not as nice as the triple-quote format. However, I've had some difficultly getting the package to generate the triple-quote format.

That said, I understand that this is a hard problem. I found my examples particularly confusing because the inputs are fairly similar and could only generate the first example with the strings.Join output with the JSON keys like with_package_maps and with_other_maps for whatever reason.

dsnet · 2021-09-16T22:56:21Z

  strings.Join({
  	"{\n  \"id\": 1",
+ 	"434180",
  	",\n  \"with_package_maps\": true,\n  \"with_other_maps\": true\n}",
  }, "")

For this case, the cause is because the logic determined that it was "more efficient" to use that representation due to this heuristic:

go-cmp/cmp/report_slices.go

Lines 148 to 150 in 395a0ac

    
           efficiencyLines := float64(esLines.Dist()) / float64(len(esLines)) 
        
           efficiencyBytes := float64(esBytes.Dist()) / float64(len(esBytes)) 
        
           isPureLinedText = efficiencyLines < 4*efficiencyBytes

Technically, the heuristic is correct since this representation only has a single insertion operation, while the triple-quote representation would have had a insertion+deletion.

  string(
- 	"{\n  \"id\": 1,\n  \"foo\": true,\n  \"bar\": true,\n}",
+ 	"{\n  \"id\": 1434180,\n  \"foo\": true,\n  \"bar\": true,\n}",
  )

For the second case in your example, the cause is because we avoid special string formatting for anything shorter than 64 bytes:

go-cmp/cmp/report_slices.go

Lines 83 to 84 in 395a0ac

    
           const minLength = 64 
        
           return vx.Len() >= minLength && vy.Len() >= minLength

we can decrease this 32 or something.

schellj · 2021-09-16T23:01:36Z

Thanks for the information.

Another issue that I've noticed is that running the same test with the same input/output will sometimes output the strings.Join format and other times the triple-quote format.

dsnet · 2021-09-16T23:24:21Z

Another issue that I've noticed is that running the same test with the same input/output will sometimes output the strings.Join format and other times the triple-quote format.

Without looking into it, that's probably due to the deliberate non-determinism with the internal diffing algorithm where one way of doing it returns results that are slightly more optimal than another. It's been on my TODO list to make that algorithm more optimal. The current algorithm favors performance over being optimal. It is guaranteed to be O(n), while an optimal algorithm is O(n^2). It'd be nice if we can get closer to optimal without being O(n^2).

schellj · 2021-09-16T23:35:23Z

Understood re: non-determinism.

Outside of a magical way to solve this complex problem for all cases, It seems there isn't really anything concrete to do here, so I'll close the case.

Thanks for the discussion.

dsnet · 2021-09-16T23:41:02Z

I sent out #275 to fix the one that looks the worst.

schellj closed this as completed Sep 16, 2021

dsnet mentioned this issue Sep 16, 2021

Reduce minimum length for specialized string diffing #275

Merged

mgartner mentioned this issue Jan 14, 2022

roachtest: tlp failed cockroachdb/cockroach#74675

Closed

prashantv mentioned this issue Mar 21, 2024

String diff for UUIDs can use strings.Join syntax which makes it more confusing #356

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Diff output is sometimes confusing and/or hard to read #274

Diff output is sometimes confusing and/or hard to read #274

schellj commented Sep 16, 2021 •

edited

Loading

dsnet commented Sep 16, 2021 •

edited

Loading

schellj commented Sep 16, 2021

dsnet commented Sep 16, 2021

schellj commented Sep 16, 2021

dsnet commented Sep 16, 2021

schellj commented Sep 16, 2021

dsnet commented Sep 16, 2021

Diff output is sometimes confusing and/or hard to read #274

Diff output is sometimes confusing and/or hard to read #274

Comments

schellj commented Sep 16, 2021 • edited Loading

dsnet commented Sep 16, 2021 • edited Loading

schellj commented Sep 16, 2021

dsnet commented Sep 16, 2021

schellj commented Sep 16, 2021

dsnet commented Sep 16, 2021

schellj commented Sep 16, 2021

dsnet commented Sep 16, 2021

schellj commented Sep 16, 2021 •

edited

Loading

dsnet commented Sep 16, 2021 •

edited

Loading