-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regular expressions in DNR API (regexFilter) #344
Comments
Chrome's current implementation is hostile for anything but trivial expressions because its internal limit manifests randomly, see this example:
A better approach might be to limit the average It'd be also helpful if we can specify both urlFilter and regexFilter so that |
About the spec of Let's see why this is a problem:
If you ask me, the only sensible solution here would be to make the |
This has been requested before, e.g. at crbug.com/1249915. I'm not opposed, but support for it would not make it easier to arrive at a decision for the desired syntax of
The These validation tools being part of the extension API is a non-issue. The lack of clarity on the syntax of
I can see how attractive that is to extension developers and filter list maintainers, because that format is somewhat well-documented and generally well understood. That advantage in the API design should also be weighted against the runtime impact when the API is used in practice. Besides memory cost, even seemingly small regular expressions can result in catastrophic performance (ReDoS). If we are able to come up with a subset of regular expressions that list authors (and other DNR extensions) actually need, then we can try to reach consensus on the desired syntax that are supported by the DNR API across all browsers. A limited target is easier to optimize for than something as broad as "JS regex". For example, |
@gorhill shared a list of regular expressions that are incompatible with Chrome's current declarativeNetRequest regular expression handling: uBlockOrigin/uBOL-home#15 (comment)
|
Here a full list of unsupported RegExps (with errors codes from Chrome) from our AdGuard filters: 160 errors
We set some limits (chosen empirically) for RegExps groups and maximum length here to try to prevent memory overflow. |
It looks like last report was a bit out of date, so I double checked it in the latest version of chrome (version 110.0.5481.96 arm64) with our latest filters. Here it is: Full list of unsupported regexps (116 rules) with error codes grouped by filters lists
AnalysisUnsupported syntax (syntaxError): 29 errorsFirst of all, these are regular expressions with negative lookahead. Rules that cannot be implemented using current declarativeNetRequest: 21 rules
Some of the rules can be rewritten using Rules that can be fixed: 9 rules
Too complicated regex (memoryLimitExceeded): 87 errorsMost of the rules cannot be converted and it’s unclear how to rewrite them in order to fix the error. We thought there might be some sort of global flag "/g" for the RE2 syntax, because it seems that in the PCRE engine the global flag allows you to increase the range group by several orders of magnitude, but we didn't find a global flag here. |
In the case of Safari there are many more invalid regular expressions in the filters. Maybe instead of discussing just Chrome, we could use Safari as a base line and demonstrate what else we would like to have and why? Negative lookahead
Here's how negative lookahead is typically used. Block everything on a website save for several allowed resourcesA rule would look like this: This use case is covered by Block every first-party request save for several locationsHere's a typical rule for this use case: It's tempting to rewrite it into a pair of rules (blocking + unblocking):
However, it is not exactly the same as it unblocks the second location completely. Filter maintainers usually resort to using negative lookahead when completely unblocking something is a problem. If we had something like
|
WebKit bug tracking this issue: |
I'd just like to add a specific issue I'm running into with this not being specified properly. As an extension developer using DNR, the majority of rules aren't dynamic. They're static, and need to be bundled together with the extension. Doing that bundling is typically done through some good old command line build tools, like Webpack. However, at the point of doing that bundling, there isn't really a good way to check the regex format. It isn't well specified as this issue says. The DNR APIs to check if it's valid like This similarly makes it difficult for any filter list authors to have, for example, CI jobs that ensure that regex filters are valid. At eyeo, we literally maintain a Node package which uses the C++ native extension that imports RE2 and recreates the memory limits used to see if Chrome will call a regex valid (https://www.npmjs.com/package/@eyeo/abp2dnr). This is of course quite fragile because the limits aren't specified, they're just implementation details. Any changes to RE2 or Chrome could bring our regex validation out of sync with Chrome's actual validation. |
I noticed only negative lookahead was mentioned in the issue so just in case want to note that we actively use positive lookahead too. |
The provided example can also be solved by using A DNR implementation can itself extract a token from a regex, which token can then be used as precursor test against the target URL -- this is what uBO does internally to avoid executing regex-based filters and this allows to skip testing a majority of regex-based filters for any given target URL. In my opinion such optimization detail is best left to the DNR engine rather than to ask extensions to deal with this. |
The DNR API currently has three primary ways to match requests by their URL:
urlFilter
- match URL by a literal substring with optional wildcards; Chrome's docs for urlFilter, and also Firefox's source code docs for some undocumented cases.requestDomains
- match domain or superdomain by a literal string; Chrome's docs for requestDomains.regexFilter
- match URL by a regular expression; Chrome's docs for regexFilter.The format of
regexFilter
is poorly specified. This issue is about coming up with whatregexFilter
should support, potentially beyond what the current implementations offer, following the 2023-01-19 WECG meeting (meeting notes will be available once #343 is merged).In Chrome,
regexpFilter
is basically the syntax of the underlying RE2 library plus the additional implementation-dependent constraint that the memory usage of an individual regex may not exceed 2kb (source). Chrome heavily relied on the RE2 library for its implementation, not just for matching, but also for extra optimizations (source code comment in regex_rules_matcher.h). Chrome limits the number of regexFilter rules to 1000.In Safari, all DNR rules are internally converted to regular expressions and the maximum number of supported rules is 150k, without a specific smaller limit of regexFilter rules. Its supported regexp syntax is documented at https://developer.apple.com/documentation/safariservices/creating_a_content_blocker#3030754 . This linked documentation is not Safari's DNR docs, but the underlying Content blocker API that Safari uses. There is a human-readable high-level description of this content blocker API at https://webkit.org/blog/4062/targeting-domains-with-content-blockers/ . Internally, Safari compiles the regular expressions to a set of deterministic finite automatons (DFA). These DFAs are optimized and compiled to bytecode. An interpreter uses this bytecode to find the desired actions for a given URL. This implementation is not a generic regexp engine, so any regexp feature needs to be carefully examined before it can be supported.
In Firefox,
regexFilter
is not implemented yet. Ideally we can reach a resolution here so that extensions don't have to encounter breaking changes.In this issue, the goal is to determine the desired syntax of
regexFilter
, in a way that is useful to extension developers and feasible to implement optimally in web browsers.The text was updated successfully, but these errors were encountered: