Harden heuristics against `Regexp::TimeoutError` errors #6518

lildude · 2023-08-16T13:59:19Z

Description

Ruby 3.2 allows you to set a process global Regexp.timeout = <time> to limit the impact of poorly written regexes and prevent ReDoS attacks. GitHub is now enforcing a timeout which has revealed...

a) we're not playing nicely and rescuing the Regexp::TimeoutError errors so GitHub may sometime respond with a 500 error or not render a file at all
b) #6417 introduced a regex which suffers from catastrophic backtracking which ultimately causes the Regexp::TimeoutError errors. This was missed as the issue only comes to light when analysing largish files.

This PR addresses both by rescuing the Regexp::TimeoutError exceptions when analysing using the heuristics strategy - we return an empty result as misidentifcation is better than a 500 error - and replaces the bad regex that brought this to light.

Checklist:

I am adding new or changing current functionality
- I have added or updated the tests for the new or changed functionality.

Ruby 3.2 allows you to set a process global `Regexp.timeout = <time>` to limit the impact of poor regex and prevent ReDoS attacks. GitHub is now enforcing a timeout so we need to be sure to play nicely.

DecimalTurn · 2023-08-16T15:02:32Z

This looks good to me. On top of solving backtracking issues, [\s\S]* also makes things much more readable as it's clear to me that the original regex was written to match "anything including line endings" until we get to endsolid.

In a future PR, we might also want to replace other instances of (?:.|[\r\n])* even if they aren't causing trouble just yet.

EDIT: Minor difference I see is that the original insisted that endsolid be on a seperate line. The new regex (by removing the ^) allows solid and endsolid to be on the same line and still matches.

lildude · 2023-08-16T15:40:06Z

In a future PR, we might also want to replace other instances of (?:.|[\r\n])* even if they aren't causing trouble just yet.

Looks like we might only have two more uses of this pattern.

EDIT: Minor difference I see is that the original insisted that endsolid be on a seperate line. The new regex (by removing the ^) allows solid and endsolid to be on the same line and still matches.

Oh yes. I'll add back the ^ - it doesn't make much difference adding it back.

Alhadis · 2023-08-19T14:31:38Z

lib/linguist/heuristics.yml

@@ -686,7 +686,7 @@ disambiguations:
 - extensions: ['.stl']
  rules:
  - language: STL
-    pattern: '\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)'
+    pattern: '\A\s*solid[\s\S]*^endsolid(?:$|\s)'


First of all, solid[\s\S] will match stuff like solidobject (which I presume isn't valid STL syntax), so either (?=$|\s) or (?:$|\s) is necessary here.

Second, because this heuristic is anchored to the first-line of input (\A), it's possible to match the terminating endsolid directive using a separate regex:

Suggested change

pattern: '\A\s*solid[\s\S]*^endsolid(?:$|\s)'

and:

- pattern: '\A\s*solid(?:$|\s)'

- pattern: '^\s*endsolid(?:$|\s)'

(Nota bene: I haven't tested these changes locally)

Is there a reason why (?:$|\s) is preferred over \b in this case?

2. Second, because this heuristic is anchored to the first-line of input (\A), it's possible to match the terminating endsolid directive using a separate regex:

Splitting things makes this much more expensive. The second regex here is really expensive (11144 steps) and quite possibly going to lead to another case of things running amok as the files grow compared my initial suggestion (29 steps), which now you point it out, needs improving for the situation you mentioned.

These all seem to match as be much more performant:

\A\s*solid(?:$|\s)[\s\S]*^endsolid(?:$|\s) => 33 steps

\A\s*solid\b[\s\S]*^endsolid(?:$|\s) => 30 steps

\A\s*solid\b[\s\S]*^endsolid\b => 28 steps

Sorry for the perf problems, everyone.

Does regex101 reliably approximate the cost to run these regexps in Ruby? The good original, before my bad version, was '\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)'; the closest thing that works in regex101 is /\A\s*solid(?=$|\s)(?:.*?)\nendsolid(?:$|\s)/gms, and it claims that is super expensive (107792 steps).

OK, I tested the regexps in Ruby.

None of them perform badly on the test case, or on a 5000 line file I found.

Here's how long each regex takes to run on the 5000-line file, in milliseconds:

original: 1.9683429999859072 current: 2.581663000048138 proposed: 1.2971370000159368

regex101 is using a different regex engine.

The numbers above are for Ruby 3.2.2 on my laptop. Here's the code:

source code

require 'benchmark' ORIGINAL_RE = /\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)/ CURRENT_RE = /\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)/ PROPOSED_RE = /\A\s*solid[\s\S]*^endsolid(?:$|\s)/ text = File.read("SV05_bed.stl") t = Benchmark.measure { 1000.times do ORIGINAL_RE.match?(text) end } puts "original: #{t.real}" t = Benchmark.measure { 1000.times do CURRENT_RE.match?(text) end } puts "current: #{t.real}" t = Benchmark.measure { 1000.times do PROPOSED_RE.match?(text) end } puts "proposed: #{t.real}"

regex101 is using a different regex engine.

Ruby and PCRE differ in their interpretation of "multiline" modifiers (the parts scoped with (?m:…)). Specifically:

(?m:...)
means…
(?s:...)
means…

Ruby . matches any character, including newlines N/A (No such modifier)

PCRE/Perl ^ and $ respectively match the beginning and end of each line, rather than the input. . matches any character, including newlines; implies (?m).

See, the behaviour that (?m) normally enables in Perl/PCRE is instead the default in Ruby; the \A and \Z/\z assertions are used to match the start and end of the input string, respectively (which would normally be achieved using ^ and $ without multiline mode).

Now, why is this relevant? Because benchmarks could be skewed in favour of whichever regex happens to fail first (if it won't match . because the dotall modifier is needed, then the regex engine logically has no reason to continue reading past the first newline).

Alhadis · 2023-08-19T14:37:30Z

@lildude Hope you don't mind my changing the PR's title; it exceeded the 72-character limit of well-formed Git commit-messages. 😉

Alhadis · 2023-08-26T06:53:42Z

Linguist's heuristics are trying to

Scale well (natively-supported regex constructs like \R and . in multiline mode will likely be more performant than something like (?:.|\r?\n)*),
Be implemented purely as regular expressions, something that greatly restricts our available options when dealing with issues like these. Recall that Linguist's heuristics were historically implemented in Ruby (à la Linguist::Generated), albeit mostly with regular expressions,
Satisfy the limitations of a foreign regex engine; i.e., by removing lookarounds and replacing \R with (?:.|\r?\n)*.

To quote Matthew 6:24-34: No one can serve two masters. If @jesuschrist were active on GitHub, I'm sure he'd agree that we're trying to serve three masters (i.e., conflicting objectives). That third objective is starting to become a liability, IMHO.

/cc @jorendorff

DecimalTurn · 2023-08-26T23:21:34Z

... by removing lookarounds and replacing \R with (?:.|\r?\n)*

I thought we were replacing \R by (?:\r?\n|\r) like discussed here.

In the case that this PR addresses, we tried replacing (?m).* with (?:.|[\r\n])*, but it turns out that [\s\S]* is much better for performance.

Based on the Regex101 debugger, it seems like [\s\S]* allows the engine to skip all the way to the end (just like with (?m).*) which is quite helpful when the file is large:

jorendorff · 2023-08-28T18:16:57Z

Scale well (natively-supported regex constructs like \R and . in multiline mode will likely be more performant than something like (?:.|\r?\n)*),

The portable regex in this PR is 50% faster than the original (which uses \R and . in multiline mode). So there aren't really conflicting priorities here.

I didn't meant to introduce one -- of course Linguist is a Ruby library above all, and you must do what is right for the project in that light.

lib/linguist/heuristics.yml

DecimalTurn · 2023-09-06T17:57:41Z

@lildude are you considering a minor release for these changes (ie. v7.26.1) since this can affect other projects (such as go-enry) that are dependant on Linguist or were you planning to wait for the next major release?

Also, should there be a warning about this issue in the notes for https://github.com/github-linguist/linguist/releases/tag/v7.26.0 ?

Alhadis · 2023-09-07T04:13:54Z

lildude · 2023-09-07T08:15:52Z

@lildude are you considering a minor release for these changes (ie. v7.26.1) since this can affect other projects (such as go-enry) that are dependant on Linguist or were you planning to wait for the next major release?

@DecimalTurn I was going to make a patch release, however it's taken so long to finish this PR (I've been distracted by higher priorities of my day job) that it's now time for a major release anyway as the freeze date for the next GitHub Enterprise Server (GHES) release is fast approaching and I like the Linguist updates to bake for a bit in prod before the freeze in case any niggles pop-up (it's a PITA to backport them to GHES 😁).

Gonna merge this and start getting things in line for a new release early next week.

lildude added 2 commits August 16, 2023 13:50

Rescue Regexp::TimeoutError errors

8591817

Ruby 3.2 allows you to set a process global `Regexp.timeout = <time>` to limit the impact of poor regex and prevent ReDoS attacks. GitHub is now enforcing a timeout so we need to be sure to play nicely.

Replace catastrophic backreferencing regex

51cafe7

lildude requested a review from a team as a code owner August 16, 2023 13:59

Skip test if Ruby < 3.2.0

abf654e

Add back ^

5c46551

Alhadis requested changes Aug 19, 2023

View reviewed changes

Alhadis changed the title ~~Improve heuristic strategy to rescue Regexp::TimeoutError errors and fix a bad regex~~ Harden heuristics against Regexp::TimeoutError errors Aug 19, 2023

lildude mentioned this pull request Aug 25, 2023

Add hosts to Host file aliases #6486

Merged

Update regex

5c825cf

jorendorff reviewed Aug 29, 2023

View reviewed changes

lib/linguist/heuristics.yml Show resolved Hide resolved

lildude requested a review from Alhadis August 31, 2023 02:35

DecimalTurn mentioned this pull request Sep 6, 2023

Update Linguist to v7.26.0 go-enry/go-enry#169

Merged

Alhadis approved these changes Sep 7, 2023

View reviewed changes

lildude merged commit a0a6d59 into master Sep 7, 2023

lildude deleted the lildude/regex-improvements branch September 7, 2023 08:22

github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Harden heuristics against `Regexp::TimeoutError` errors #6518

Harden heuristics against `Regexp::TimeoutError` errors #6518

lildude commented Aug 16, 2023 •

edited

Loading

DecimalTurn commented Aug 16, 2023 •

edited

Loading

lildude commented Aug 16, 2023

Alhadis Aug 19, 2023

DecimalTurn Aug 20, 2023

lildude Aug 21, 2023

jorendorff Aug 24, 2023 •

edited

Loading

jorendorff Aug 25, 2023 •

edited

Loading

jorendorff Aug 25, 2023 •

edited

Loading

Alhadis Aug 26, 2023

`(?m:...)`

`(?s:...)`

Alhadis commented Aug 19, 2023

Alhadis commented Aug 26, 2023

DecimalTurn commented Aug 26, 2023 •

edited

Loading

jorendorff commented Aug 28, 2023

DecimalTurn commented Sep 6, 2023 •

edited

Loading

Alhadis commented Sep 7, 2023

lildude commented Sep 7, 2023

-    pattern: '\A\s*solid[\s\S]*^endsolid(?:$|\s)'
+    and:
+    - pattern: '\A\s*solid(?:$|\s)'
+    - pattern: '^\s*endsolid(?:$|\s)'

	`(?m:...)` means…	`(?s:...)` means…
Ruby	`.` matches any character, including newlines	N/A (No such modifier)
PCRE/Perl	`^` and `$` respectively match the beginning and end of each line, rather than the input.	`.` matches any character, including newlines; implies `(?m)`.

Harden heuristics against Regexp::TimeoutError errors #6518

Harden heuristics against Regexp::TimeoutError errors #6518

Conversation

lildude commented Aug 16, 2023 • edited Loading

Description

Checklist:

DecimalTurn commented Aug 16, 2023 • edited Loading

lildude commented Aug 16, 2023

Alhadis Aug 19, 2023

Choose a reason for hiding this comment

DecimalTurn Aug 20, 2023

Choose a reason for hiding this comment

lildude Aug 21, 2023

Choose a reason for hiding this comment

jorendorff Aug 24, 2023 • edited Loading

Choose a reason for hiding this comment

jorendorff Aug 25, 2023 • edited Loading

Choose a reason for hiding this comment

jorendorff Aug 25, 2023 • edited Loading

Choose a reason for hiding this comment

Alhadis Aug 26, 2023

Choose a reason for hiding this comment

(?m:...)

(?s:...)

Alhadis commented Aug 19, 2023

Alhadis commented Aug 26, 2023

DecimalTurn commented Aug 26, 2023 • edited Loading

jorendorff commented Aug 28, 2023

DecimalTurn commented Sep 6, 2023 • edited Loading

Alhadis commented Sep 7, 2023

lildude commented Sep 7, 2023

Harden heuristics against `Regexp::TimeoutError` errors #6518

Harden heuristics against `Regexp::TimeoutError` errors #6518

lildude commented Aug 16, 2023 •

edited

Loading

DecimalTurn commented Aug 16, 2023 •

edited

Loading

jorendorff Aug 24, 2023 •

edited

Loading

jorendorff Aug 25, 2023 •

edited

Loading

jorendorff Aug 25, 2023 •

edited

Loading

`(?m:...)`

`(?s:...)`

DecimalTurn commented Aug 26, 2023 •

edited

Loading

DecimalTurn commented Sep 6, 2023 •

edited

Loading