Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diff output is sometimes confusing and/or hard to read #274

Closed
schellj opened this issue Sep 16, 2021 · 7 comments · Fixed by #275
Closed

Diff output is sometimes confusing and/or hard to read #274

schellj opened this issue Sep 16, 2021 · 7 comments · Fixed by #275

Comments

@schellj
Copy link

schellj commented Sep 16, 2021

I've been working with a test apparatus for comparing JSON data. For ease-of-use and readability, I'd like to be able to compare prettified/indented JSON strings and print the differences in an easy-to-read way.

I've found that the cmp.Diff output can be quite varied even on similar inputs and can be either easy-to-read, somewhat hard to read, or hard to read and confusing depending on the exact input for non-obvious reasons.

See https://play.golang.org/p/mu-TsXoGEEL (pasted below):

Code:

package main

import (
	"fmt"

	"github.com/google/go-cmp/cmp"
)

func main() {
	fmt.Printf("Confusing diff output:\n%s\n", cmp.Diff(`{
  "id": 1,
  "with_package_maps": true,
  "with_other_maps": true
}`, `{
  "id": 1434180,
  "with_package_maps": true,
  "with_other_maps": true
}`))

	fmt.Printf("Non-confusing, but somewhat hard to find differences diff output:\n%s\n", cmp.Diff(`{
  "id": 1,
  "foo": true,
  "bar": true,
}`, `{
  "id": 1434180,
  "foo": true,
  "bar": true,
}`))

	fmt.Printf("Non-confusing, easy to read and find differences diff output:\n%s\n", cmp.Diff(`
{
  "id": 1,
  "with_package_maps": true,
  "with_other_maps": true
}`, `
{
  "id": 1434180,
  "with_package_maps": true,
  "with_other_maps": true
}`))
}

Output:

Confusing diff output:
  strings.Join({
  	"{\n  \"id\": 1",
+ 	"434180",
  	",\n  \"with_package_maps\": true,\n  \"with_other_maps\": true\n}",
  }, "")

Non-confusing, easy to read and find differences diff output:
  (
  	"""
  	
  	{
- 	  "id": 1,
+ 	  "id": 1434180,
  	  "with_package_maps": true,
  	  "with_other_maps": true
  	}
  	"""
  )

Non-confusing, but somewhat harder to find differences diff output:
  string(
- 	"{\n  \"id\": 1,\n  \"foo\": true,\n  \"bar\": true,\n}",
+ 	"{\n  \"id\": 1434180,\n  \"foo\": true,\n  \"bar\": true,\n}",
  )
@dsnet
Copy link
Collaborator

dsnet commented Sep 16, 2021

I've found that the cmp.Diff output can be quite varied even on similar inputs and can be either easy-to-read, somewhat hard to read, or hard to read and confusing depending on the exact input for non-obvious reasons.

When it comes to free-form text, the reporter logic is just a bunch of heuristics to try and choose between several different ways to present the data. We can adjust the heuristics, but it's a hard problem. Bug reports like this help us tune the heuristics.

That said, have you considered doing a more structured diff of the data? https://play.golang.org/p/29x4dsDurlQ
In the example, we pass a transformer to convert an arbitrary string to a structured map[string]interface{} so that cmp.Diff can do a much better job presenting the difference.

@schellj
Copy link
Author

schellj commented Sep 16, 2021

Hi @dsnet, thanks for the reply.

I have tried approaches similar to your example some, but didn't necessarily find them more useful due to times like below where the rendered representation shows integer data in scientific notation and/or as a float:

  string(Inverse(ParseJSON, map[string]interface{}{
- 	"id":                float64(1),
+ 	"id":                float64(1.43418e+06),
  	"with_other_maps":   bool(true),
  	"with_package_maps": bool(true),
  }))

I saw another approach that splits the input string on newlines (e.g., https://play.golang.org/p/opdEWiJjXCJ). The output is more consistent, but not as nice as the triple-quote format. However, I've had some difficultly getting the package to generate the triple-quote format.

That said, I understand that this is a hard problem. I found my examples particularly confusing because the inputs are fairly similar and could only generate the first example with the strings.Join output with the JSON keys like with_package_maps and with_other_maps for whatever reason.

@dsnet
Copy link
Collaborator

dsnet commented Sep 16, 2021

  strings.Join({
  	"{\n  \"id\": 1",
+ 	"434180",
  	",\n  \"with_package_maps\": true,\n  \"with_other_maps\": true\n}",
  }, "")

For this case, the cause is because the logic determined that it was "more efficient" to use that representation due to this heuristic:

go-cmp/cmp/report_slices.go

Lines 148 to 150 in 395a0ac

efficiencyLines := float64(esLines.Dist()) / float64(len(esLines))
efficiencyBytes := float64(esBytes.Dist()) / float64(len(esBytes))
isPureLinedText = efficiencyLines < 4*efficiencyBytes

Technically, the heuristic is correct since this representation only has a single insertion operation, while the triple-quote representation would have had a insertion+deletion.

  string(
- 	"{\n  \"id\": 1,\n  \"foo\": true,\n  \"bar\": true,\n}",
+ 	"{\n  \"id\": 1434180,\n  \"foo\": true,\n  \"bar\": true,\n}",
  )

For the second case in your example, the cause is because we avoid special string formatting for anything shorter than 64 bytes:

const minLength = 64
return vx.Len() >= minLength && vy.Len() >= minLength

we can decrease this 32 or something.

@schellj
Copy link
Author

schellj commented Sep 16, 2021

Thanks for the information.

Another issue that I've noticed is that running the same test with the same input/output will sometimes output the strings.Join format and other times the triple-quote format.

@dsnet
Copy link
Collaborator

dsnet commented Sep 16, 2021

Another issue that I've noticed is that running the same test with the same input/output will sometimes output the strings.Join format and other times the triple-quote format.

Without looking into it, that's probably due to the deliberate non-determinism with the internal diffing algorithm where one way of doing it returns results that are slightly more optimal than another. It's been on my TODO list to make that algorithm more optimal. The current algorithm favors performance over being optimal. It is guaranteed to be O(n), while an optimal algorithm is O(n^2). It'd be nice if we can get closer to optimal without being O(n^2).

@schellj
Copy link
Author

schellj commented Sep 16, 2021

Understood re: non-determinism.

Outside of a magical way to solve this complex problem for all cases, It seems there isn't really anything concrete to do here, so I'll close the case.

Thanks for the discussion.

@dsnet
Copy link
Collaborator

dsnet commented Sep 16, 2021

I sent out #275 to fix the one that looks the worst.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants