
diff-match-patch: use line mode for large screen updates #12133

Closed
wants to merge 6 commits

Conversation

codeofdusk
Contributor

Link to issue number:

Closes #12130.

Summary of the issue:

In some terminal sessions (such as git log output), partial output (such as parts of words) is sometimes read.

Description of how this pull request fixes the issue:

This PR includes an updated version of nvda_dmp. For each diff, nvda_dmp starts at the line level. If only one line of output is new, nvda_dmp returns a character diff (using the previous algorithm); otherwise, the line-mode diff is returned.
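
For illustration, a minimal sketch of this hybrid approach using the public diff-match-patch Python API is shown below; the function name, the line-counting heuristic, and the single-line fallback threshold are assumptions for this example, not the actual nvda_dmp implementation:

from diff_match_patch import diff_match_patch


def hybrid_diff(old_text, new_text):
    """Line-level diff first; fall back to a character diff when only one line is new."""
    dmp = diff_match_patch()
    # Line mode: map each unique line to a placeholder character, diff, then map back.
    chars1, chars2, line_array = dmp.diff_linesToChars(old_text, new_text)
    diffs = dmp.diff_main(chars1, chars2, False)
    dmp.diff_charsToLines(diffs, line_array)
    # Count the lines reported as inserted by the line-level diff.
    new_line_count = sum(
        len(text.splitlines()) for op, text in diffs if op == dmp.DIFF_INSERT
    )
    if new_line_count <= 1:
        # Only one line of output is new: a character diff gives finer granularity,
        # e.g. when a single word changes in the middle of a line.
        diffs = dmp.diff_main(old_text, new_text)
        dmp.diff_cleanupSemantic(diffs)
    return diffs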

Testing strategy:

Known issues with pull request:

None.

Change log entry:

None needed.

Code Review Checklist:

This checklist is a reminder of things commonly forgotten in a new PR.
Authors, please do a self-review and confirm you have considered the following items.
Mark items you have considered by checking them.
You can do this when editing the Pull request description with an x: [ ] becomes [x].
You can also check the checkboxes after the PR is created.

  • Pull Request description is up to date.
  • Unit tests.
  • System (end to end) tests.
  • Manual tests.
  • User Documentation.
  • Change log entry.
  • Context sensitive help for GUI changes.

@codeofdusk
Contributor Author

Cc @feerrenrut, @michaelDCurran.

@codeofdusk
Contributor Author

Also CC @seanbudd.

@AppVeyorBot

See test results for failed build of commit 4844497dd1

@seanbudd seanbudd self-requested a review March 17, 2021 04:43
@feerrenrut
Contributor

The nvda_dmp repo has had some non-trivial changes. The approach is much more involved now and will need to be carefully considered for bugs (especially buffer overruns) and performance. Unit tests for nvda_dmp would help.

@codeofdusk
Contributor Author

The protocol and NVDA-facing code have had no changes from the version on master – only the internal logic has changed.

@feerrenrut
Contributor

Regardless, a crash or performance issue in nvda_dmp will cause significant issues for NVDA users. For us to update to a newer version, we need to be confident that it won't have problematic corner cases. Automated tests will help to verify this.

@codeofdusk
Contributor Author

I have added automated tests to clarify the newly expected behaviour of nvda_dmp.

@feerrenrut
Contributor

Thanks @codeofdusk. Could you add performance tests as well? This will demonstrate the improvements this PR claims to provide.

For instance, you could have a test that shows that performance scales with input size according to some mathematical function: record the time to diff several lines, then compare it to the time to diff twice that, ten times that, a hundred times that, and a thousand times that.

If you do this for each of the diff types, the performance implications will be clear. It would also be good to see what happens when nvda_dmp is pushed past its limits, e.g. does the application handle it gracefully or crash? This needs to be taken into account in NVDA.
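
For illustration, a scaling check along these lines might look like the sketch below; the synthetic test data is made up for this example, and the nvda_dmp._char_mode call assumes the internal helper exercised by the speedtest script later in this conversation:

import timeit

import nvda_dmp


def make_sample(n_lines):
    # Synthetic data: n_lines of text, with every tenth line changed in the "new" version.
    old = "\n".join(f"line {i}" for i in range(n_lines))
    new = "\n".join(
        f"LINE {i}" if i % 10 == 0 else f"line {i}" for i in range(n_lines)
    )
    return old, new


def measure(diff_func, n_lines, repeats=10):
    # Time the diff function over the synthetic before/after texts.
    old, new = make_sample(n_lines)
    return timeit.timeit(lambda: diff_func(old, new), number=repeats)


if __name__ == "__main__":
    baseline = measure(nvda_dmp._char_mode, 100)
    for factor in (2, 10, 100, 1000):
        elapsed = measure(nvda_dmp._char_mode, 100 * factor)
        print(f"{factor}x the input: {elapsed / baseline:.1f}x the baseline time")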

@codeofdusk
Contributor Author

OK, I've run tests with the DMP test text files. Times listed are for running each algorithm 100 times.

  • nvda_dmp (char mode) is the current (before this PR) behaviour when setting "diff algorithm" to "prefer diff-match-patch".
  • nvda_dmp (hybrid mode) is the algorithm implemented in this PR (line mode for changes spanning multiple lines, char mode otherwise). In this test, the hybrid mode selects the line-mode algorithm, as more than one line is different.
  • All nvda_dmp algorithms are dramatically faster than Difflib.
Old text has 12979 characters, new text has 11918.
Trying nvda_dmp (character mode)...
nvda_dmp (character mode) ran in 8.7300149 seconds.
Trying nvda_dmp (line mode)...
nvda_dmp (line mode) ran in 0.10564380000000106 seconds.
Trying nvda_dmp (hybrid mode)...
nvda_dmp (hybrid mode) ran in 0.10549289999999978 seconds.
Trying NVDA Difflib-based algorithm...
NVDA Difflib-based algorithm ran in 144.0994433 seconds.
Winner: nvda_dmp (hybrid mode)

@feerrenrut
Contributor

Could you add these as automated tests, please? This PR adds a significant change in complexity to the nvda_dmp repository, with the claim of a performance improvement. Adding automated tests for performance will demonstrate that improvement and allow us to inspect the methodology used to test.

Writing performance tests can be hard if you try to work with absolute values; e.g. a function may run in 0.01 seconds on your machine but take 0.05 on mine. Rather than using absolute values, you can use relative values compared to some baseline. Realistically, this is what we care about anyway: how the performance of the algorithm changes as the size of the data it has to work on changes.

You might then be able to show that a baseline for char mode with small data takes X seconds, and that:

  • Bigger data with char mode is approximately 2 X seconds
  • Much bigger data with char mode is approximately 10 X seconds
  • Huge data with char mode is approximately 100 X seconds
  • Bigger data with line mode is approximately 1.9 X seconds
  • Much bigger data with line mode is approximately 5 X seconds
  • Huge data with line mode is approximately 30 X seconds
  • Bigger data with hybrid mode is approximately 1.9 X seconds
  • Much bigger data with hybrid mode is approximately 5 X seconds
  • Huge data with hybrid mode is approximately 30 X seconds

Then, it would be wise to talk about the trade-offs of these different approaches. Automated tests can help to demonstrate this, and will likely be clearer.

@codeofdusk
Contributor Author

Sorry, how are you suggesting I implement this? Should I commit the speedtest script to the repo?

@feerrenrut
Contributor

I don't know what the speedtest script is; I haven't seen it. I assume it runs NVDA? I don't think NVDA should be involved, and I would implement these as unit tests.

Suggestion:

  • Extend the unit tests that you have already added to nvda_dmp. These don't need to test with NVDA; just test each of the methods in isolation.
  • You will need to use a perfTimer to check the performance.
  • You'll need some test data; due to its size, it is probably best to store it in independent files, which should be committed to the repo.

Tests might look like this:

def setupTestFixture(self):
    self.baseline_small = self.charMode_baseline_small()
    self.baseline_medium = self.charMode_baseline_medium()
    ... etc

def charMode_baseline_small(self):
    # Loads test data from file for a "small" before, after, and the expected diff buffer.
    beforeBuf, afterBuf, expectedDiffBuf = loadSmallChangesFile()
    startTime = time.perf_counter()
    actualDiffBuf = nvda_dmp._char_mode(beforeBuf, afterBuf)
    endTime = time.perf_counter()
    self.assertEqual(actualDiffBuf, expectedDiffBuf)  # double check for regressions
    return endTime - startTime

def lineMode_small(self):
    # Loads test data from file for a "small" before, after, and the expected diff buffer.
    beforeBuf, afterBuf, expectedDiffBuf = loadSmallChangesFile()
    startTime = time.perf_counter()
    actualDiffBuf = nvda_dmp._line_mode(beforeBuf, afterBuf)
    endTime = time.perf_counter()
    self.assertEqual(actualDiffBuf, expectedDiffBuf)  # double check for regressions
    timeTaken = endTime - startTime
    expectedChangeInAlgoComplexity = 0.9
    self.assertAlmostEqual(self.baseline_small * expectedChangeInAlgoComplexity, timeTaken, delta=somethingReasonable)

... etc

@codeofdusk
Contributor Author

The speedtest script doesn't run NVDA, but it does include some NVDA code (notably the existing Difflib algorithm modified to run standalone). Since NVDA code is included in the script, I haven't included it in the repo as I'm not sure how to handle the licensing.

The new algorithm is behind a feature flag in an unreleased version of NVDA, and this PR provides significant accuracy improvements with little to no performance impact, as demonstrated by the speedtest output. If you're still concerned about performance, could this be investigated in a follow-up PR making DMP the default, once users have had time to test these new changes on master?

@feerrenrut
Contributor

I'm concerned about the introduced complexity, without adequate automated testing, and without reproducible proof of the benefit. Adding automated tests can resolve these concerns.

@codeofdusk
Contributor Author

Here are automated tests verifying accuracy. The performance test script is below (speedtest text files here):

import difflib
import nvda_dmp
import timeit

with open("speedtest1.txt", errors="ignore") as fin:
    old = fin.read()
with open("speedtest2.txt", errors="ignore") as fin:
    new = fin.read()


def difflib_nvda(oldLines, newLines):
    "The current Difflib algorithm in NVDA."
    outLines = []

    prevLine = None
    for line in difflib.ndiff(oldLines, newLines):
        if line[0] == "?":
            # We're never interested in these.
            continue
        if line[0] != "+":
            # We're only interested in new lines.
            prevLine = line
            continue
        text = line[2:]
        if not text or text.isspace():
            prevLine = line
            continue

        if prevLine and prevLine[0] == "-" and len(prevLine) > 2:
            # It's possible that only a few characters have changed in this line.
            # If so, we want to speak just the changed section, rather than the entire line.
            prevText = prevLine[2:]
            textLen = len(text)
            prevTextLen = len(prevText)
            # Find the first character that differs between the two lines.
            for pos in range(min(textLen, prevTextLen)):
                if text[pos] != prevText[pos]:
                    start = pos
                    break
            else:
                # We haven't found a differing character so far and we've hit the end of one of the lines.
                # This means that the differing text starts here.
                start = pos + 1
            # Find the end of the differing text.
            if textLen != prevTextLen:
                # The lines are different lengths, so assume the rest of the line changed.
                end = textLen
            else:
                for pos in range(textLen - 1, start - 1, -1):
                    if text[pos] != prevText[pos]:
                        end = pos + 1
                        break

            if end - start < 15:
                # Less than 15 characters have changed, so only speak the changed chunk.
                text = text[start:end]

        if text and not text.isspace():
            outLines.append(text)
        prevLine = line

    return outLines


def test_diff_algo(name, func):
    print(f"Trying {name}...")
    test_time = timeit.timeit(lambda: func(old, new), number=100)
    print(f"{name} ran in {test_time} seconds.")
    return (name, test_time)


def test():
    print(f"Old text has {len(old)} characters, new text has {len(new)}.")
    t = [
        test_diff_algo("nvda_dmp (character mode)", nvda_dmp._char_mode),
        test_diff_algo("nvda_dmp (line mode)", nvda_dmp._line_mode),
        test_diff_algo("nvda_dmp (hybrid mode)", nvda_dmp._hybrid_mode),
        test_diff_algo("NVDA Difflib-based algorithm", difflib_nvda),
    ]
    res = min(t, key=lambda r: r[1])
    print(f"Winner: {res[0]}")


if __name__ == "__main__":
    test()

@codeofdusk
Contributor Author

Rethinking this PR, as even difflib seems to miss text (i.e. unlike what I previously thought, #12130 isn't DMP-specific, but seems to be worse with DMP). Will run with DMP in consoles again for a while and see what I find.

@codeofdusk codeofdusk closed this May 4, 2021
@codeofdusk codeofdusk deleted the dmp-update-linemode branch May 4, 2021 05:27