Harden heuristics against `Regexp::TimeoutError` errors #6518
Ruby 3.2 allows you to set a process global `Regexp.timeout = <time>` to limit the impact of poor regex and prevent ReDoS attacks. GitHub is now enforcing a timeout so we need to be sure to play nicely.
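As a rough sketch of the mechanism described above (not Linguist's actual code): on Ruby 3.2 or newer, `Regexp.timeout=` sets a process-wide limit, and a match that exceeds it raises `Regexp::TimeoutError`. The `safe_match?` helper and the one-second limit are hypothetical, chosen only for illustration.

```ruby
# Requires Ruby 3.2 or newer.
# Process-wide limit, in seconds, for any single regexp match.
# The value 1.0 is an arbitrary choice for this sketch.
Regexp.timeout = 1.0

# Hypothetical helper: treat a timed-out match as "no match" instead of
# letting the exception propagate to the caller.
def safe_match?(pattern, data)
  pattern.match?(data)
rescue Regexp::TimeoutError
  false
end

puts safe_match?(/\A\s*solid(?:$|\s)/, "solid cube\n")
```

Returning `false` here mirrors the trade-off in this PR: a missed identification is preferable to an unhandled exception.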
This looks good to me. On top of solving backtracking issues, In a future PR, we might also want to replace other instances of EDIT: Minor difference I see is that the original insisted that |
Looks like we might only have two more uses of this pattern.
Oh yes. I'll add back the |
lib/linguist/heuristics.yml

    @@ -686,7 +686,7 @@ disambiguations:
     - extensions: ['.stl']
       rules:
       - language: STL
    -    pattern: '\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)'
    +    pattern: '\A\s*solid[\s\S]*^endsolid(?:$|\s)'
1. First of all, `solid[\s\S]` will match stuff like `solidobject` (which I presume isn't valid STL syntax), so either `(?=$|\s)` or `(?:$|\s)` is necessary here.
2. Second, because this heuristic is anchored to the first-line of input (`\A`), it's possible to match the terminating `endsolid` directive using a separate regex:

Suggested change

    -    pattern: '\A\s*solid[\s\S]*^endsolid(?:$|\s)'
    +    and:
    +    - pattern: '\A\s*solid(?:$|\s)'
    +    - pattern: '^\s*endsolid(?:$|\s)'

(Nota bene: I haven't tested these changes locally)
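The false positive and the split-pattern suggestion above can be sketched in plain Ruby. The sample inputs are made up and heavily simplified (real ASCII STL facets carry loops and vertices), and `split_match?` is a hypothetical stand-in for how an `and:` rule combines its patterns, not Linguist's implementation.

```ruby
# Pattern from the diff under review: nothing stops "solid" from being the
# prefix of a longer word.
LOOSE = /\A\s*solid[\s\S]*^endsolid(?:$|\s)/

# The two patterns from the suggested change.
OPENING = /\A\s*solid(?:$|\s)/
CLOSING = /^\s*endsolid(?:$|\s)/

# Hypothetical stand-in for an `and:` rule: every pattern must match.
def split_match?(data)
  OPENING.match?(data) && CLOSING.match?(data)
end

stl   = "solid cube\n  facet normal 0 0 1\n  endfacet\nendsolid cube\n"
bogus = "solidobject\nendsolid\n" # made-up input, not real STL

p LOOSE.match?(stl)    # true
p LOOSE.match?(bogus)  # true: the false positive described above
p split_match?(stl)    # true
p split_match?(bogus)  # false
```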
Is there a reason why `(?:$|\s)` is preferred over `\b` in this case?
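For reference, one observable difference between the two forms: `\b` also succeeds before punctuation, whereas `(?:$|\s)` requires whitespace or the end of a line. A small sketch with made-up inputs (the constant names are hypothetical):

```ruby
WORD_BOUNDARY = /\Asolid\b/
SPACE_OR_EOL  = /\Asolid(?:$|\s)/

samples = ["solid", "solid cube", "solid\n", "solid-state", "solid.stl", "solidobject"]

samples.each do |s|
  # Print each input alongside the verdict of both patterns.
  puts "#{s.inspect}: word boundary=#{WORD_BOUNDARY.match?(s)}, space or EOL=#{SPACE_OR_EOL.match?(s)}"
end
```

Both patterns accept the first three inputs and reject `solidobject`; only the `\b` form accepts `solid-state` and `solid.stl`.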
> 2. Second, because this heuristic is anchored to the first-line of input (`\A`), it's possible to match the terminating `endsolid` directive using a separate regex:

Splitting things makes this much more expensive. The second regex here is really expensive (11144 steps) and quite possibly going to lead to another case of things running amok as the files grow, compared to my initial suggestion (29 steps), which, now you point it out, needs improving for the situation you mentioned.

These all seem to match and be much more performant:
Sorry for the perf problems, everyone.
Does regex101 reliably approximate the cost to run these regexps in Ruby? The good original, before my bad version, was `'\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)'`; the closest thing that works in regex101 is `/\A\s*solid(?=$|\s)(?:.*?)\nendsolid(?:$|\s)/gms`, and it claims that is super expensive (107792 steps).
OK, I tested the regexps in Ruby.
None of them perform badly on the test case, or on a 5000 line file I found.
Here's how long each regex takes to run on the 5000-line file, in milliseconds:
original: 1.9683429999859072
current: 2.581663000048138
proposed: 1.2971370000159368
regex101 is using a different regex engine.
The numbers above are for Ruby 3.2.2 on my laptop. Here's the code:

    require 'benchmark'

    ORIGINAL_RE = /\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)/
    CURRENT_RE = /\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)/
    PROPOSED_RE = /\A\s*solid[\s\S]*^endsolid(?:$|\s)/

    text = File.read("SV05_bed.stl")

    # Each block runs 1000 matches, so t.real (seconds for the whole block)
    # is numerically the time per match in milliseconds.
    t = Benchmark.measure {
      1000.times do
        ORIGINAL_RE.match?(text)
      end
    }
    puts "original: #{t.real}"

    t = Benchmark.measure {
      1000.times do
        CURRENT_RE.match?(text)
      end
    }
    puts "current: #{t.real}"

    t = Benchmark.measure {
      1000.times do
        PROPOSED_RE.match?(text)
      end
    }
    puts "proposed: #{t.real}"
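The benchmark above depends on a local file. A self-contained variant might look like the sketch below, which assumes a synthetic ASCII STL body in place of `SV05_bed.stl`; the facet contents, repeat count, and run count are arbitrary, so the timings will not reproduce the figures quoted earlier.

```ruby
require 'benchmark'

PATTERNS = {
  "original" => /\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)/,
  "current"  => /\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)/,
  "proposed" => /\A\s*solid[\s\S]*^endsolid(?:$|\s)/,
}.freeze

# Synthetic stand-in for an ASCII STL file (an assumption, not SV05_bed.stl).
FACET = <<~STL
  facet normal 0 0 1
    outer loop
      vertex 0 0 0
      vertex 1 0 0
      vertex 0 1 0
    endloop
  endfacet
STL
TEXT = "solid bed\n" + FACET * 700 + "endsolid bed\n"

RUNS = 100

PATTERNS.each do |name, re|
  # Benchmark.realtime returns elapsed wall-clock seconds for the block.
  seconds = Benchmark.realtime { RUNS.times { re.match?(TEXT) } }
  puts format("%-9s %.4f ms per match", "#{name}:", seconds * 1000 / RUNS)
end
```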
> regex101 is using a different regex engine.

Ruby and PCRE differ in their interpretation of "multiline" modifiers (the parts scoped with `(?m:…)`). Specifically:

| | `(?m:...)` means… | `(?s:...)` means… |
|---|---|---|
| Ruby | `.` matches any character, including newlines | N/A (No such modifier) |
| PCRE/Perl | `^` and `$` respectively match the beginning and end of each line, rather than the input. | `.` matches any character, including newlines; implies `(?m)`. |

See, the behaviour that `(?m)` normally enables in Perl/PCRE is instead the default in Ruby; the `\A` and `\Z`/`\z` assertions are used to match the start and end of the input string, respectively (which would normally be achieved using `^` and `$` without multiline mode).

Now, why is this relevant? Because benchmarks could be skewed in favour of whichever regex happens to fail first (if it won't match `.` because the dotall modifier is needed, then the regex engine logically has no reason to continue reading past the first newline).
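The Ruby side of this comparison is easy to check directly. A minimal sketch using a made-up two-line string:

```ruby
text = "first\nsecond"

# In Ruby, ^ and $ are line anchors without any modifier.
p text.match?(/^second/)            # true
p text.match?(/first$/)             # true

# \A and \z anchor to the whole string instead.
p text.match?(/\Asecond/)           # false
p text.match?(/first\z/)            # false

# Without (?m), "." stops at a newline; with it, "." crosses lines.
p text.match?(/first.second/)       # false
p text.match?(/first(?m:.)second/)  # true
```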
@lildude Hope you don't mind my changing the PR's title; it exceeded the 72-character limit of well-formed Git commit-messages. 😉
Linguist's heuristics are trying to
To quote Matthew 6:24-34: /cc @jorendorff |
I thought we were replacing In the case that this PR addresses, we tried replacing Based on the Regex101 debugger, it seems like |
The portable regex in this PR is 50% faster than the original (which uses I didn't mean to introduce one -- of course Linguist is a Ruby library above all, and you must do what is right for the project in that light.
@lildude are you considering a minor release for these changes (i.e. v7.26.1) since this can affect other projects (such as go-enry) that are dependent on Linguist, or were you planning to wait for the next major release? Also, should there be a warning about this issue in the notes for https://github.com/github-linguist/linguist/releases/tag/v7.26.0 ?
@DecimalTurn I was going to make a patch release, however it's taken so long to finish this PR (I've been distracted by higher priorities of my day job) that it's now time for a major release anyway as the freeze date for the next GitHub Enterprise Server (GHES) release is fast approaching and I like the Linguist updates to bake for a bit in prod before the freeze in case any niggles pop up (it's a PITA to backport them to GHES 😁). Gonna merge this and start getting things in line for a new release early next week.
Description

Ruby 3.2 allows you to set a process global `Regexp.timeout = <time>` to limit the impact of poorly written regexes and prevent ReDoS attacks. GitHub is now enforcing a timeout which has revealed...

a) we're not playing nicely and rescuing the `Regexp::TimeoutError` errors so GitHub may sometimes respond with a 500 error or not render a file at all

b) #6417 introduced a regex which suffers from catastrophic backtracking which ultimately causes the `Regexp::TimeoutError` errors. This was missed as the issue only comes to light when analysing largish files.

This PR addresses both by rescuing the `Regexp::TimeoutError` exceptions when analysing using the heuristics strategy - we return an empty result as misidentification is better than a 500 error - and replaces the bad regex that brought this to light.

Checklist: