Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend allowed characters for grok field names (#21745) #31653

Merged
merged 3 commits into from
Jun 29, 2018

Conversation

original-brownbear
Copy link
Member

Fixes #21745
I couldn't make it as lenient as matching every visible char because whatever is matched by the subname group is itself used as the name of a matcher group (so things like \ won't work without additional escaping).

So I left level of leniency out for now (I can look into making that work as well, but it's going to be a trickier change potentially) and used @clintongormley 's suggestion here (#21745 (comment)) of simply using unicode properties (:alnum: matches all letters and numbers here) + allowed punctuation (I defined allowed as exactly what LS grok allows so only [\[\]@] was added) for now.

let me know what you think :)

@original-brownbear original-brownbear added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Jun 28, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@@ -439,4 +439,13 @@ public void testExponentialExpressions() {
run.set(false);
assertThat(e.getMessage(), equalTo("grok pattern matching was interrupted after [200] ms"));
}

public void testUnicodeFieldnames() {
Copy link
Contributor

@talevy talevy Jun 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to add more tests here for when things are expected to fail and/or succeed.

maybe pick random unicode codepoints between a certain range that is known to be :alphanum: and
have explicit checks for random ones that are out of the alphanum range. I do not know too much about these ranges, so I understand if that is not practical.

at least we can add tests for fieldnames with randomAlphas & randomInts to ensure that the existing A-z0-9 will continue to work

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@talevy as far as I know the [:alphanum:] range isn't contiguous (it def. isn't at the boundary between letters and numbers I think).
So:

add tests for fieldnames with randomAlphas & randomInts to ensure that the existing A-z0-9 will continue to work

seems the way to go imo (obviously could choose a different range as well but I don't think it would be logically different from just testing the old range + 2 concrete examples of things that broke before).

Are you sure we want to test the negative case here too? So basically just use some char that we def don't support in the filename like say (field and match for null?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just as a sanity check that declares we do not support arbitrary unicode. I don't think we have that around

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@talevy sure, will add that (though come to think of it, the behaviour there kind of sucks in that we're just quietly returning null for fieldnames that can't even work to begin with).

I wonder if we should maybe add some validation for the grokPattern to the Grok constructor (not necessarily in this PR ... more thinking in general)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the behaviour there kind of sucks in that we're just quietly returning null for fieldnames that can't even work to begin

This definitely sounds like something that should be fixed in a followup.

@jasontedor jasontedor requested a review from rjernst June 28, 2018 14:56
@jasontedor
Copy link
Member

@rjernst Pinging for review.

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, please don't title commits as "be more lenient". Leniency has a negative connotation among the team. Instead, something like "Extend allowed characters for grok field names" is more apt.


public void testUnicodeFieldnames() {
for(String fieldName : Arrays.asList("@metadata", "@metädata", "@metädat[a]")) {
String line = "foo";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this put into a local var? It's only used once.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -439,4 +439,13 @@ public void testExponentialExpressions() {
run.set(false);
assertThat(e.getMessage(), equalTo("grok pattern matching was interrupted after [200] ms"));
}

public void testUnicodeFieldnames() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that more tests would be good. A common pattern we have in tests is to add a special assert method for something like this, then call that method from different tests cases. I would have tests for random alphanumeric, special characters (including each that are defined in the regex), and then something outside of the characters defined in the regex.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjernst done I think :)

Note:

  • The : is in the regex but not a supported part of a field name because of the possibility of type definition in the Grok section making that impossible.
  • Added a simple check for the unsupported () as an example of a char falling outside of the allowed range.

@original-brownbear original-brownbear changed the title Be more lenient with fieldnames in Grok Processor (#21745) Extend allowed characters for grok field names (#21745) Jun 28, 2018
@original-brownbear
Copy link
Member Author

@rjernst all points addressed I think/hope :)

@original-brownbear
Copy link
Member Author

test failure looks like #30869

@original-brownbear
Copy link
Member Author

Jenkins test this

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with one caveat: why is colon in the regex if colon is not actually supported?

I think the test failures you have will be fixed by merging in latest master (those tests have been @AwaitsFix)

@original-brownbear
Copy link
Member Author

@rjernst

thanks, will squash+merge when green then :)

with one caveat: why is colon in the regex if colon is not actually supported?

We need to match on the colon there to match two colon cases (with type) like %{NUMBER:rating:double}. We then resolve the correct name eventually by calling the eager match methods later in the execution but we can't actually resolve a name with colon in it because its indistinguishable from the type portion (at least not without introducing escaping into the syntax).

I think the test failures you have will be fixed by merging in latest master (those tests have been @awaitsfix)

rebased now, let's see :)

@rjernst
Copy link
Member

rjernst commented Jun 29, 2018

rebased now

In the future, please do not use rebases. Merging allows reviewers to more easily incrementally look at changes, and the elasticsearch project is configured to only do squash merges when pushing the merge button, so there is no need to squash on your own.

@original-brownbear original-brownbear merged commit b7b413e into elastic:master Jun 29, 2018
@original-brownbear original-brownbear deleted the 21745 branch June 29, 2018 07:12
@original-brownbear
Copy link
Member Author

@rjernst backport this to 6.x?

dnhatn added a commit that referenced this pull request Jun 29, 2018
* master:
  Do not check for object existence when deleting repository index files (#31680)
  Remove extra check for object existence in repository-gcs read object (#31661)
  Support multiple system store types (#31650)
  [Test] Clean up some repository-s3 tests (#31601)
  [Docs] Use capital letters in section headings (#31678)
  [DOCS] Add PQL language Plugin (#31237)
  Merge AzureStorageService and AzureStorageServiceImpl and clean up tests (#31607)
  TEST: Fix test task invocation (#31657)
  Revert "[TEST] Mute failing tests in NativeRealmInteg and ReservedRealmInteg"
  Fix RealmInteg test failures
  Extend allowed characters for grok field names (#21745) (#31653)
  [DOCS] Fix licensing API details (#31667)
  [TEST] Mute failing tests in NativeRealmInteg and ReservedRealmInteg
  Fix CreateSnapshotRequestTests Failure (#31630)
  Configurable password hashing algorithm/cost (#31234)
  [TEST] Mute failing NamingConventionsTaskIT tests
  [DOCS] Replace CONFIG_DIR with ES_PATH_CONF (#31635)
  Core: Require all actions have a Task (#31627)
@rjernst
Copy link
Member

rjernst commented Jun 29, 2018

Yes 6.x should be fine. Please also put version labels for each branch this is checked into. For example, this PR would have v7.0.0 and v6.4.0 (because 6.x is currently 6.4).

@original-brownbear
Copy link
Member Author

@rjernst thanks, back ported in #31722 and labels added :)

dnhatn added a commit that referenced this pull request Jul 4, 2018
* 6.x:
  Fix not waiting for Netty ThreadDeathWatcher in IT (#31758) (#31789)
  [Docs] Correct default window_size (#31582)
  S3 fixture should report 404 on unknown bucket (#31782)
  [ML] Limit ML filter items to 10K (#31731)
  Fixture for Minio testing (#31688)
  [ML] Return statistics about forecasts as part of the jobsstats and usage API (#31647)
  [DOCS] Add missing get mappings docs to HLRC (#31765)
  [DOCS] Starting Elasticsearch (#31701)
  Fix coerce validation_method in GeoBoundingBoxQueryBuilder (#31747)
  Painless: Complete Removal of Painless Type (#31699)
  Consolidate watcher setting update registration (#31762)
  [DOCS] Adds empty 6.3.1 release notes page
  ingest: Introduction of a bytes processor (#31733)
  [test] don't run bats tests for suse boxes (#31749)
  Add analyze API to high-level rest client (#31577)
  Implemented XContent serialisation for GetIndexResponse (#31675)
  [DOCS] Typos
  DOC: Add examples to the SQL docs (#31633)
  Add support for AWS session tokens (#30414)
  Watcher: Reenable start/stop yaml tests (#31754)
  JDBC: Fix stackoverflow on getObject and timestamp conversion (#31735)
  Support multiple system store types (#31650)
  Add write*Blob option to replace existing blob (#31729)
  Split CircuitBreaker-related tests (#31659)
  Painless: Add Context Docs (#31190)
  Docs: Remove missing reference
  Migrate scripted metric aggregation scripts to ScriptContext design (#30111)
  Watcher: Fix chain input toXcontent serialization (#31721)
  Remove _all example (#31711)
  rest-high-level: added get cluster settings (#31706)
  Docs: Match the examples in the description (#31710)
  [Docs] Correct typos (#31720)
  Extend allowed characters for grok field names (#21745) (#31653) (#31722)
  [DOCS] Check for Windows and *nix file paths (#31648)
  [ML] Validate ML filter_id (#31535)
  Fix gradle4.8 deprecation warnings (#31654)
  Update numbers to reflect 4-byte UTF-8-encoded characters (#27083)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement v6.4.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants