Extend allowed characters for grok field names (#21745) #31653

original-brownbear · 2018-06-28T12:42:04Z

Fixes #21745
I couldn't make it as lenient as matching every visible char because whatever is matched by the subname group is itself used as the name of a matcher group (so things like \ won't work without additional escaping).

So I left level of leniency out for now (I can look into making that work as well, but it's going to be a trickier change potentially) and used @clintongormley 's suggestion here (#21745 (comment)) of simply using unicode properties (:alnum: matches all letters and numbers here) + allowed punctuation (I defined allowed as exactly what LS grok allows so only [\[\]@] was added) for now.

let me know what you think :)

elasticmachine · 2018-06-28T12:54:20Z

Pinging @elastic/es-core-infra

talevy · 2018-06-28T14:19:33Z

libs/grok/src/test/java/org/elasticsearch/grok/GrokTests.java

@@ -439,4 +439,13 @@ public void testExponentialExpressions() {
        run.set(false);
        assertThat(e.getMessage(), equalTo("grok pattern matching was interrupted after [200] ms"));
    }
+
+    public void testUnicodeFieldnames() {


I think it would be good to add more tests here for when things are expected to fail and/or succeed.

maybe pick random unicode codepoints between a certain range that is known to be :alphanum: and
have explicit checks for random ones that are out of the alphanum range. I do not know too much about these ranges, so I understand if that is not practical.

at least we can add tests for fieldnames with randomAlphas & randomInts to ensure that the existing A-z0-9 will continue to work

@talevy as far as I know the [:alphanum:] range isn't contiguous (it def. isn't at the boundary between letters and numbers I think).
So:

add tests for fieldnames with randomAlphas & randomInts to ensure that the existing A-z0-9 will continue to work

seems the way to go imo (obviously could choose a different range as well but I don't think it would be logically different from just testing the old range + 2 concrete examples of things that broke before).

Are you sure we want to test the negative case here too? So basically just use some char that we def don't support in the filename like say (field and match for null?

just as a sanity check that declares we do not support arbitrary unicode. I don't think we have that around

@talevy sure, will add that (though come to think of it, the behaviour there kind of sucks in that we're just quietly returning null for fieldnames that can't even work to begin with).

I wonder if we should maybe add some validation for the grokPattern to the Grok constructor (not necessarily in this PR ... more thinking in general)?

the behaviour there kind of sucks in that we're just quietly returning null for fieldnames that can't even work to begin

This definitely sounds like something that should be fixed in a followup.

jasontedor · 2018-06-28T14:56:59Z

@rjernst Pinging for review.

rjernst

In general, please don't title commits as "be more lenient". Leniency has a negative connotation among the team. Instead, something like "Extend allowed characters for grok field names" is more apt.

rjernst · 2018-06-28T15:18:51Z

libs/grok/src/test/java/org/elasticsearch/grok/GrokTests.java

+
+    public void testUnicodeFieldnames() {
+        for(String fieldName : Arrays.asList("@metadata", "@metädata", "@metädat[a]")) {
+            String line = "foo";


Why is this put into a local var? It's only used once.

rjernst · 2018-06-28T15:21:01Z

libs/grok/src/test/java/org/elasticsearch/grok/GrokTests.java

@@ -439,4 +439,13 @@ public void testExponentialExpressions() {
        run.set(false);
        assertThat(e.getMessage(), equalTo("grok pattern matching was interrupted after [200] ms"));
    }
+
+    public void testUnicodeFieldnames() {


I agree that more tests would be good. A common pattern we have in tests is to add a special assert method for something like this, then call that method from different tests cases. I would have tests for random alphanumeric, special characters (including each that are defined in the regex), and then something outside of the characters defined in the regex.

@rjernst done I think :)

Note:

The : is in the regex but not a supported part of a field name because of the possibility of type definition in the Grok section making that impossible.

Added a simple check for the unsupported () as an example of a char falling outside of the allowed range.

original-brownbear · 2018-06-28T17:20:12Z

@rjernst all points addressed I think/hope :)

original-brownbear · 2018-06-28T19:06:38Z

test failure looks like #30869

original-brownbear · 2018-06-28T21:58:08Z

Jenkins test this

rjernst

LGTM, with one caveat: why is colon in the regex if colon is not actually supported?

I think the test failures you have will be fixed by merging in latest master (those tests have been @AwaitsFix)

original-brownbear · 2018-06-29T04:23:17Z

@rjernst

thanks, will squash+merge when green then :)

with one caveat: why is colon in the regex if colon is not actually supported?

We need to match on the colon there to match two colon cases (with type) like %{NUMBER:rating:double}. We then resolve the correct name eventually by calling the eager match methods later in the execution but we can't actually resolve a name with colon in it because its indistinguishable from the type portion (at least not without introducing escaping into the syntax).

I think the test failures you have will be fixed by merging in latest master (those tests have been @awaitsfix)

rebased now, let's see :)

rjernst · 2018-06-29T04:38:48Z

rebased now

In the future, please do not use rebases. Merging allows reviewers to more easily incrementally look at changes, and the elasticsearch project is configured to only do squash merges when pushing the merge button, so there is no need to squash on your own.

original-brownbear · 2018-06-29T07:16:23Z

@rjernst backport this to 6.x?

* master: Do not check for object existence when deleting repository index files (#31680) Remove extra check for object existence in repository-gcs read object (#31661) Support multiple system store types (#31650) [Test] Clean up some repository-s3 tests (#31601) [Docs] Use capital letters in section headings (#31678) [DOCS] Add PQL language Plugin (#31237) Merge AzureStorageService and AzureStorageServiceImpl and clean up tests (#31607) TEST: Fix test task invocation (#31657) Revert "[TEST] Mute failing tests in NativeRealmInteg and ReservedRealmInteg" Fix RealmInteg test failures Extend allowed characters for grok field names (#21745) (#31653) [DOCS] Fix licensing API details (#31667) [TEST] Mute failing tests in NativeRealmInteg and ReservedRealmInteg Fix CreateSnapshotRequestTests Failure (#31630) Configurable password hashing algorithm/cost (#31234) [TEST] Mute failing NamingConventionsTaskIT tests [DOCS] Replace CONFIG_DIR with ES_PATH_CONF (#31635) Core: Require all actions have a Task (#31627)

rjernst · 2018-06-29T15:45:22Z

Yes 6.x should be fine. Please also put version labels for each branch this is checked into. For example, this PR would have v7.0.0 and v6.4.0 (because 6.x is currently 6.4).

)

original-brownbear · 2018-07-02T12:37:37Z

@rjernst thanks, back ported in #31722 and labels added :)

* 6.x: Fix not waiting for Netty ThreadDeathWatcher in IT (#31758) (#31789) [Docs] Correct default window_size (#31582) S3 fixture should report 404 on unknown bucket (#31782) [ML] Limit ML filter items to 10K (#31731) Fixture for Minio testing (#31688) [ML] Return statistics about forecasts as part of the jobsstats and usage API (#31647) [DOCS] Add missing get mappings docs to HLRC (#31765) [DOCS] Starting Elasticsearch (#31701) Fix coerce validation_method in GeoBoundingBoxQueryBuilder (#31747) Painless: Complete Removal of Painless Type (#31699) Consolidate watcher setting update registration (#31762) [DOCS] Adds empty 6.3.1 release notes page ingest: Introduction of a bytes processor (#31733) [test] don't run bats tests for suse boxes (#31749) Add analyze API to high-level rest client (#31577) Implemented XContent serialisation for GetIndexResponse (#31675) [DOCS] Typos DOC: Add examples to the SQL docs (#31633) Add support for AWS session tokens (#30414) Watcher: Reenable start/stop yaml tests (#31754) JDBC: Fix stackoverflow on getObject and timestamp conversion (#31735) Support multiple system store types (#31650) Add write*Blob option to replace existing blob (#31729) Split CircuitBreaker-related tests (#31659) Painless: Add Context Docs (#31190) Docs: Remove missing reference Migrate scripted metric aggregation scripts to ScriptContext design (#30111) Watcher: Fix chain input toXcontent serialization (#31721) Remove _all example (#31711) rest-high-level: added get cluster settings (#31706) Docs: Match the examples in the description (#31710) [Docs] Correct typos (#31720) Extend allowed characters for grok field names (#21745) (#31653) (#31722) [DOCS] Check for Windows and *nix file paths (#31648) [ML] Validate ML filter_id (#31535) Fix gradle4.8 deprecation warnings (#31654) Update numbers to reflect 4-byte UTF-8-encoded characters (#27083)

original-brownbear requested a review from talevy June 28, 2018 12:42

original-brownbear added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Jun 28, 2018

talevy added >enhancement review labels Jun 28, 2018

talevy reviewed Jun 28, 2018

View reviewed changes

jasontedor requested a review from rjernst June 28, 2018 14:56

rjernst requested changes Jun 28, 2018

View reviewed changes

original-brownbear changed the title ~~Be more lenient with fieldnames in Grok Processor (#21745)~~ Extend allowed characters for grok field names (#21745) Jun 28, 2018

original-brownbear force-pushed the 21745 branch from 8c903ef to c8397fb Compare June 28, 2018 17:18

rjernst approved these changes Jun 28, 2018

View reviewed changes

original-brownbear added 3 commits June 29, 2018 06:16

Be more lenient with fieldnames in Grok Processor (elastic#21745)

5318e38

Be more lenient with fieldnames in Grok Processor (elastic#21745)

125933b

Be more lenient with fieldnames in Grok Processor (elastic#21745)

0ef98d6

original-brownbear force-pushed the 21745 branch from c8397fb to 0ef98d6 Compare June 29, 2018 04:17

original-brownbear merged commit b7b413e into elastic:master Jun 29, 2018

original-brownbear deleted the 21745 branch June 29, 2018 07:12

original-brownbear mentioned this pull request Jul 2, 2018

Extend allowed characters for grok field names (#21745) (#31653) #31722

Merged

original-brownbear added a commit that referenced this pull request Jul 2, 2018

Extend allowed characters for grok field names (#21745) (#31653) (#31722

db3d4e6

)

original-brownbear added v7.0.0 v6.4.0 labels Jul 2, 2018

Mpdreamz mentioned this pull request Sep 24, 2018

[meta] 6.4.0 release elastic/elasticsearch-net#3397

Closed

89 tasks

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extend allowed characters for grok field names (#21745) #31653

Extend allowed characters for grok field names (#21745) #31653

original-brownbear commented Jun 28, 2018

elasticmachine commented Jun 28, 2018

talevy Jun 28, 2018 •

edited

Loading

original-brownbear Jun 28, 2018

talevy Jun 28, 2018

original-brownbear Jun 28, 2018

rjernst Jun 28, 2018

jasontedor commented Jun 28, 2018

rjernst left a comment

rjernst Jun 28, 2018

original-brownbear Jun 28, 2018

rjernst Jun 28, 2018

original-brownbear Jun 28, 2018

original-brownbear commented Jun 28, 2018

original-brownbear commented Jun 28, 2018

original-brownbear commented Jun 28, 2018

rjernst left a comment

original-brownbear commented Jun 29, 2018

rjernst commented Jun 29, 2018 •

edited

Loading

original-brownbear commented Jun 29, 2018

rjernst commented Jun 29, 2018

original-brownbear commented Jul 2, 2018

Extend allowed characters for grok field names (#21745) #31653

Extend allowed characters for grok field names (#21745) #31653

Conversation

original-brownbear commented Jun 28, 2018

elasticmachine commented Jun 28, 2018

talevy Jun 28, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jasontedor commented Jun 28, 2018

rjernst left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

original-brownbear commented Jun 28, 2018

original-brownbear commented Jun 28, 2018

original-brownbear commented Jun 28, 2018

rjernst left a comment

Choose a reason for hiding this comment

original-brownbear commented Jun 29, 2018

rjernst commented Jun 29, 2018 • edited Loading

original-brownbear commented Jun 29, 2018

rjernst commented Jun 29, 2018

original-brownbear commented Jul 2, 2018

talevy Jun 28, 2018 •

edited

Loading

rjernst commented Jun 29, 2018 •

edited

Loading