-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support UTF-8 label matchers: Add new parser #3453
Support UTF-8 label matchers: Add new parser #3453
Conversation
f446c6d
to
3f0e90d
Compare
I forgot to signoff the commit, forced pushed. |
3f0e90d
to
24b7b52
Compare
bb4e042
to
e975ad6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropping the first half of my review - it's mostly nits. I still need to go through parse.go
and its tests but so far so good (with the caveat that I don't know much about parsers but hope that by the end of this feature, I'm more educated on the matter 😄 )
da17465
to
687e615
Compare
I changed the name of the PR to start with "Support UTF-8 matchers". I will use this on all PRs for this ongoing work to make it easier to find PRs related to this epic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job so far, your test coverage is truly impressive - I played around modifying things as I went along and every time I changed something there were justified cases against them.
I've dropped you a set of extra nits that I'd love to discuss - I'm not done with parse.go
as it's very dense but we're getting there.
matchers/parse/parse.go
Outdated
// and TokenNone is not one of the accepted kinds. It is possible to use either | ||
// Scan() or Peek() as fn depending on whether accept should consume or peek | ||
// the next token. | ||
func (p *Parser) accept(fn func() (Token, error), kind ...TokenKind) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the need to try and keep some level of parity between accept
and expect
but I don't see the correlation between them.
My understanding of the current usage of accept
is Peek
at the next token and tell me if it's the kind I expect. As such, it feels clear to me to just inline this as it's only relevant on the parseOpenParen
function and there's no need for an extraction.
I propose the following:
// In parse.go
func (p *Parser) PeekNext() (Token, error) {
t, err := p.lexer.Peek()
if err != nil {
return Token{}, err
}
if t.Kind == TokenNone {
return Token{}, fmt.Errorf("0:%d: %w", len(p.input), ErrEOF)
}
return t, nil
}
// in token.go
// IsAny verifies that the token is of any specified TokenKinds.
func (t Token) IsAny(kinds ...TokenKind) bool {
for _, k := range kinds {
if t.Kind == k {
return true
}
}
return false
}
And finally, change parseOpenParen
func (p *Parser) parseOpenParen(l *Lexer) (parseFn, error) {
// Can start with an optional open brace.
currentToken, err := p.peekNext()
if err != nil {
if errors.Is(err, ErrEOF) {
return p.parseEOF, nil
}
return nil, err
}
p.hasOpenParen = currentToken.IsAny(TokenOpenBrace)
// If the token was an open brace it must be scanned so the token
// following it can be peeked.
if p.hasOpenParen {
if _, err = l.Scan(); err != nil {
panic("Unexpected error scanning open brace")
}
// If the next token is a close brace there are no matchers in the input,
// and we can just parse the close brace.
currentToken, err = p.peekNext()
if err != nil {
return nil, fmt.Errorf("%s: %w", err, ErrNoCloseBrace)
}
if currentToken.IsAny(TokenCloseBrace) {
return p.parseCloseParen, nil
}
}
if currentToken.IsAny(TokenCloseBrace) {
return p.parseCloseParen, nil
}
return p.parseLabelMatcher, nil
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. I'm thinking if I should just update the lexer to return an ErrorEOF
error at the same time as TokenNone
if there is no more input. I think doing so would remove the need for peekNext
and instead we can just call l.Peek()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline, agreed that we'll move the ErrorEOF
errors into the lexer to simplify what's going on the parser. I'll wait on that before I continue with the review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I'm going to continue with moving ErrorEOF
into the lexer, it doesn't work quite as well as I had expected. I think instead I'm going to go with the other option where the lexer makes its position available via a public method and we remove input
from the parser.
5dd3efa
to
858262d
Compare
Without having done a thorough review, just a thought:
I like this as it matches the Prometheus naming restrictions. And even with the upcoming introduction of UTF-8 strings as names, the requirement to quote the name would be exactly the same. Yay, consistency! However, the AM matchers aren't really fully consistent with PromQL anyway (e.g. single quotes and backticks aren't allowed). How about going the other way and only require double-quoting if a specific subset of characters is contained? Which would be something like |
Speaking of escape sequences: I see cases like |
Single quotes and backticks can be supported if needed. The
I don't think it's that useful because in the case of regex there will be some regexes that work without double quoting and others that don't, depending on the contents of the regex. For example |
And you also needed to change whether escape sequences are honored or not.
I think it's mostly useful because fewer current use cases would break. I leave it to you and @gotjosh to make the trade-off here. Just saying that technically, every string is a regexp, even without any characters that have a special meaning in regexps. So even with the more restrictive character set as in this PR, I'm not feeling strongly either way. I just want you to make an informed call. "Consistency with Prometheus" and "break as few existing use cases as possible" are arguments in different directions, and I would weigh each one much heavier than "characters with special meaning in regexps should always require quoting". |
Ah! Yes, I remember! Perhaps we can add that in future if there is demand for it, as it won't be a breaking change.
The concern I have here is that I don't think we can give useful error messages to users should they change their regex from For example, the error message will be something like:
The user will then change their regex to
It doesn't know if the users intention was to have a regex What do you think @gotjosh? |
Are you saying that the current state of this PR always requires quoting for regexps? E.g. |
Yes that's correct! It should all be mentioned under Breaking changes at the top of the PR (see All non [a-zA-Z_:][a-zA-Z0-9_:]* values must be double quoted). |
The reasoning behind this is that the original goals of this project were 1. to eliminate parsing ambiguities in the current parser and 2. to support UTF-8. If we allow unquoted control characters we have parsing ambiguities. For example |
But my counter proposal does require quoting for all characters that are potentially part of the comparison operator ( |
Yes I understand (don't forget commas need to be quoted too). The concern I have is about error messages. If we relax the grammar so just
and then can change it to an invalid matcher by adding a comparison operator such as:
giving the error message:
The main discussion point for me is that I'm not sure this error message is useful enough to help the user understand they need to add double quotes. But because the lexer is designed around this being double quoted, to facilitate better error messages for this case it will need to be rewritten to be aware of the structure of a matcher rather than just a dumb tokenizer. The options I see are the following:
|
Maybe that error message isn't the worst. It says that a label value is expected, and what it got was a character that is invalid in a label value unless the label value is quoted. In different news, I don't think this problem is linked to regexp matching. BTW: None of these says that quoting is needed. And my proposal (to be more liberal with special characters and only ban those that create ambiguities) would actually make both examples pass, which would clearly be less confusing than a confusing error message. |
You make a good point about the error messages for the cases. When I first started this work (this was before the proposal to allow UTF-8 in Prometheus/Alertmanager was accepted) I had started with a much stricter grammar where all text had to be double quoted. We later relaxed this to Given that the grammar has been relaxed before I think we can relax it further provided doing so does not add parsing ambiguities to the grammar - which this change should not. At this time I'm impartial to doing so and am willing to be persuaded for either case. If we choose to relax the grammar as suggested, I suppose the question remaining is how can we add this to the lexer. To answer that question I think we can do something like the following: diff --git a/matchers/parse/lexer.go b/matchers/parse/lexer.go
index cdf25161..a7546540 100644
--- a/matchers/parse/lexer.go
+++ b/matchers/parse/lexer.go
@@ -32,6 +32,10 @@ func isNum(r rune) bool {
return r >= '0' && r <= '9'
}
+func isReserved(r rune) bool {
+ return unicode.IsSpace(r) || strings.ContainsRune("{}!=~,", r)
+}
+
// ExpectedError is returned when the next rune does not match what is expected.
type ExpectedError struct {
input string
@@ -168,7 +172,7 @@ func (l *Lexer) Scan() (Token, error) {
l.rewind()
tok, l.err = l.scanQuoted()
return tok, l.err
- case r == '_' || isAlpha(r):
+ case !isReserved(r):
l.rewind()
tok, l.err = l.scanIdent()
return tok, l.err
@@ -191,7 +195,7 @@ func (l *Lexer) Scan() (Token, error) {
func (l *Lexer) scanIdent() (Token, error) {
for r := l.next(); r != eof; r = l.next() {
- if !isAlpha(r) && !isNum(r) && r != '_' && r != ':' {
+ if isReserved(r) {
l.rewind()
break
} |
This would also allow unquoted matchers in non-Latin alphabets, such as:
which if Google translate is correct is the Chinese word for test. |
3e6453c
to
daa7cdf
Compare
Signed-off-by: George Robinson <george.robinson@grafana.com>
Signed-off-by: George Robinson <george.robinson@grafana.com>
Signed-off-by: George Robinson <george.robinson@grafana.com>
Signed-off-by: George Robinson <george.robinson@grafana.com>
Signed-off-by: George Robinson <george.robinson@grafana.com>
Signed-off-by: George Robinson <george.robinson@grafana.com>
Signed-off-by: George Robinson <george.robinson@grafana.com>
Signed-off-by: George Robinson <george.robinson@grafana.com>
Signed-off-by: George Robinson <george.robinson@grafana.com>
Signed-off-by: George Robinson <george.robinson@grafana.com>
Signed-off-by: George Robinson <george.robinson@grafana.com>
Signed-off-by: George Robinson <george.robinson@grafana.com>
Signed-off-by: George Robinson <george.robinson@grafana.com>
Signed-off-by: George Robinson <george.robinson@grafana.com>
Signed-off-by: George Robinson <george.robinson@grafana.com>
Thank you very much for your contribution. |
I did some additional verification this morning and found a couple of examples we might want to fix.
|
Do you mean backslash? About the OM vs. |
Yes that's what I meant! 😄 |
Thanks for the clarification on |
Here is the PR that rejects backslashes outside double quotes #3571 for reference. |
* [CHANGE] Deprecate and remove api/v1/ #2970 * [CHANGE] Remove unused feature flags #3676 * [CHANGE] Newlines in smtp password file are now ignored #3681 * [CHANGE] Change compat metrics to counters #3686 * [CHANGE] Do not register compat metrics in amtool #3713 * [CHANGE] Remove metrics from compat package #3714 * [CHANGE] Mark muted alerts #3793 * [FEATURE] Add metric for inhibit rules #3681 * [FEATURE] Support UTF-8 label matchers #3453, #3507, #3523, #3483, #3567, #3568, #3569, #3571, #3595, #3604, #3619, #3658, #3659, #3662, #3668, 3572 * [FEATURE] Add counter to track alerts dropped outside of time_intervals #3565 * [FEATURE] Add date and tz functions to templates #3812 * [FEATURE] Add limits for silences #3852 * [FEATURE] Add time helpers for templates #3863 * [FEATURE] Add auto GOMAXPROCS #3837 * [FEATURE] Add auto GOMEMLIMIT #3895 * [FEATURE] Add Jira receiver integration #3590 * [ENHANCEMENT] Add the receiver name to notification metrics #3045 * [ENHANCEMENT] Add the route ID to uuid #3372 * [ENHANCEMENT] Add duration to the notify success message #3559 * [ENHANCEMENT] Implement webhook_url_file for discord and msteams #3555 * [ENHANCEMENT] Add debug logs for muted alerts #3558 * [ENHANCEMENT] API: Allow the Silences API to use their own 400 response #3610 * [ENHANCEMENT] Add summary to msteams notification #3616 * [ENHANCEMENT] Add context reasons to notifications failed counter #3631 * [ENHANCEMENT] Add optional native histogram support to latency metrics #3737 * [ENHANCEMENT] Enable setting ThreadId for Telegram notifications #3638 * [ENHANCEMENT] Allow webex roomID from template #3801 * [BUGFIX] Add missing integrations to notify metrics #3480 * [BUGFIX] Add missing ttl in pushhover #3474 * [BUGFIX] Fix scheme required for webhook url in amtool #3409 * [BUGFIX] Remove duplicate integration from metrics #3516 * [BUGFIX] Reflect Discord's max length message limits #3597 * [BUGFIX] Fix nil error in warn logs about incompatible matchers #3683 * [BUGFIX] Fix a small number of inconsistencies in compat package logging #3718 * [BUGFIX] Fix log line in featurecontrol #3719 * [BUGFIX] Fix panic in acceptance tests #3592 * [BUGFIX] Fix flaky test TestClusterJoinAndReconnect/TestTLSConnection #3722 * [BUGFIX] Fix crash on errors when url_file is used #3800 * [BUGFIX] Fix race condition in dispatch.go #3826 * [BUGFIX] Fix race conditions in the memory alerts store #3648 * [BUGFIX] Hide config.SecretURL when the URL is incorrect. #3887 * [BUGFIX] Fix invalid silence causes incomplete updates #3898 * [BUGFIX] Fix leaking of Silences matcherCache entries #3930 * [BUGFIX] Close SMTP submission correctly to handle errors #4006 Signed-off-by: SuperQ <superq@gmail.com>
What this pull request does
This pull request adds the new label matchers parser as proposed in #3353. Included is a number of compliance tests comparing the grammar supported in the new parser with the existing parser in pkg/labels. The compliance tests can be run passing the "compliance" tag when running go test.
Motivation
The original motivation for writing this parser was to add support for matching label names containing
.
and spaces to grafana/grafana. However, about the same time I learned that Prometheus maintainers agreed to add support for UTF-8 labels in Alertmanager, and so I decided to further the work to see if it could be upstreamed to Alertmanager instead.The original source code can be found at grobinson-grafana/matchers.
Supported grammar
This LL(1) parser in its current version is not 100% compatible with the existing regular expression, although it is close and can be modified if required. The grammar can be understood as follows:
Here are some examples of valid inputs:
and some examples of invalid inputs:
Breaking changes
All ^{}!=~,"'`\ and whitespace must be double quoted
It is possible to use UTF-8 on both sides of the expression. However, label names and label values that contain one or more ^{}!=~,"'` characters or whitespace must be double quoted.
#### Expressions must start and end with open and closing bracesAll expressions must start and end with{
and}
, although this can be relaxed if required. For examplefoo=bar
is not valid, it must be{foo=bar}
.#### Trailing commas are not permittedTrailing commas are not permitted. For example{foo=bar,}
is not valid, it must be{foo=bar}
.#### All non[a-zA-Z_:][a-zA-Z0-9_:]*
values must be double quotedThe set of unquoted characters is now the same on both sides of the expression. In other words, both label names and label values without double quotes must match the regular expression[a-zA-Z_:][a-zA-Z0-9_:]*
. For example{foo=!bar}
is not valid, it must be{foo="!bar"}
. In current versions of Alertmanager, unquoted label values can contain all UTF-8 code points with the exception of comma, such as{foo=!bar}
.There are two reasons for this:1. It's no longer possible to write ambiguous matchers which I feel is something Alertmanager should fix. For example is{foo=~}
equivalent to{foo="~"}
or{foo=~""}
?2. If we restrict the=
,!
,~
characters to double quotes we can keep the grammar LL(1). Without this restriction lookahead/backtrack is required to parse matchers such as{foo==~!=!~bar}
which are valid in current versions of Alertmanager.Errors
One of the goals with this LL(1) parser is to provide better error messages than what is possible using just a regular expression. For example:
Benchmarks
I've also provided a number of benchmarks of both the LL(1) parser and regex parser that supports UTF-8. These can be found at grobinson-grafana/matchers-benchmarks. However, to run them
go.mod
must be updated to use the branch https://github.com/grafana/prometheus-alertmanager/tree/yuri-tceretian/utf-8-label-names here.