Combine date processor patterns into single parser #83942

danhermann · 2022-02-15T13:41:13Z

Combines all custom patterns into a single parser so that no more than a single exception is thrown while searching for a matching pattern. This significantly improves performance in scenarios where multiple patterns are attempted and was suggested by @joegallo as a possible alternative to #83801.

Relates to #73918

elasticmachine · 2022-02-15T13:41:17Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2022-02-15T13:41:38Z

Hi @danhermann, I've created a changelog YAML for you.


joegallo

Mostly LGTM -- but even so, I don't think it's fair for me to give a load-bearing +1 on this.

joegallo · 2022-02-15T14:50:45Z

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/DateProcessor.java

@@ -72,10 +72,22 @@
        this.targetField = targetField;
        this.formats = formats;
        this.dateParsers = new ArrayList<>(this.formats.size());
+        List<String> javaFormats = new ArrayList<>(this.formats.size());


Nit: this can be final

joegallo · 2022-02-15T14:51:06Z

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/DateProcessor.java

        for (String format : formats) {
            DateFormat dateFormat = DateFormat.fromString(format);
-            dateParsers.add((params) -> dateFormat.getFunction(format, newDateTimeZone(params), newLocale(params)));
+            if (DateFormat.Java == dateFormat) {
+                javaFormats.add(format);


It's probably worth doing our future selves a solid and adding a comment that explains what this is doing and why. Maybe something like:

// pull out the java formats separately so they can all be processed as a single combined date parser (see below)

joegallo

LGTM (but, like I said, I think you should get a second set of 👀 on this)

danhermann · 2022-02-18T14:30:14Z

The original motivation for this PR is the poor performance of the date processor when multiple formats are specified. For each format that fails to match the input, an exception is thrown by the JavaDateFormatter::doParse method (despite its somewhat confusing claim in the javadoc not to do so), and profiling has shown that to be quite expensive in a number of common use cases.

One potential solution proposed in #83801 is to provide a method that really does not throw exceptions on parsing failures. That solution would resolve all the performance problems with exceptions thrown for date parsing failures with no change in behavior for the date processor.

Another approach, proposed originally by @joegallo, groups all the Java time formats specified in the date processor's format option into a single JavaDateFormatter instance with multiple formats rather than creating a distinct JavaDateFormatter instance with a single format each. The JavaDateFormatter::doParse method throws an exception only if all of the supplied formats fail, so this would eliminate exceptions for date processors that specify multiple Java time formats, except in the case where none of the patterns match the input.

Unfortunately, it does involve a potential change in behavior for the date processor because, in addition to Java time formats, the date processor supports several "standardized" formats such as ISO8601, UNIX, and UNIX_MS. When any of those standardized formats are specified, a statically-initialized instance of JavaDateFormatter is used. If both standardized time formats and custom Java time formats are specified, the formats could be attempted in an order not specified by the user since all of the custom Java time formats are grouped together and attempted last. Note also that because the standardized formats use distinct JavaDateFormatter instances, any parse failures for standardized formats still produce exceptions, so more than one exception could still be thrown by the date processor with this approach: up to (# of standardized formats) + 1 (for any custom Java time formats).
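The ordering change described above can be illustrated with a small standalone sketch. It is not the code in this PR: the class name `FormatPartitionSketch`, the `STANDARDIZED` set, and the `parserGroups` helper are all hypothetical, and the set only lists the three standardized names mentioned in this comment. Each inner list stands for one parser instance, so the number of groups is also the worst-case number of exceptions per document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class FormatPartitionSketch {
    // Hypothetical stand-in for the "standardized" format names mentioned above.
    static final Set<String> STANDARDIZED = Set.of("ISO8601", "UNIX", "UNIX_MS");

    // Returns the parser groups in the order they would be attempted:
    // each standardized format keeps its own parser, and all custom
    // Java time patterns are collected into one combined parser at the end.
    static List<List<String>> parserGroups(List<String> formats) {
        List<List<String>> groups = new ArrayList<>();
        final List<String> javaFormats = new ArrayList<>();
        for (String format : formats) {
            if (STANDARDIZED.contains(format)) {
                groups.add(List.of(format));
            } else {
                javaFormats.add(format);
            }
        }
        if (!javaFormats.isEmpty()) {
            groups.add(List.copyOf(javaFormats));
        }
        return groups;
    }

    public static void main(String[] args) {
        // The user listed a custom pattern first, but it ends up in the last group.
        List<List<String>> groups = parserGroups(List.of("yyyy-MM-dd", "ISO8601", "dd/MM/yyyy"));
        System.out.println("attempt order: " + groups);
        System.out.println("worst-case exceptions per document: " + groups.size());
    }
}
```

With the formats `yyyy-MM-dd`, `ISO8601`, `dd/MM/yyyy`, the sketch yields two groups: `ISO8601` alone, then both custom patterns together. That matches the (# of standardized formats) + 1 bound and shows the user-specified order being rearranged.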

dakrone · 2022-02-23T22:02:07Z

> For each format that fails to match the input, an exception is thrown by the JavaDateFormatter::doParse method (despite its somewhat confusing claim not to do so in the javadoc) and profiling that has shown that to be quite expensive in a number of common use cases.

Can you explain this a little bit more? The code appears to only throw an exception once all parsers have been tried. I'm not super familiar with the way the ingest node date processor works, so is it that it throws an exception for each date processor? Also, how is this expensive? I would not expect throwing a single exception for a document to be so expensive. Is this still a problem after #83764? Have we done another flame graph after that change?

danhermann · 2022-02-24T12:53:14Z

> > For each format that fails to match the input, an exception is thrown by the JavaDateFormatter::doParse method (despite its somewhat confusing claim not to do so in the javadoc) and profiling that has shown that to be quite expensive in a number of common use cases.
>
> Can you explain this a little bit more? The code appears to only throw an exception once all parsers have been tried. I'm not super familiar with the way that the ingest node date processor, so is it that it throws an exception for each date processor?

That's the core issue this PR is addressing -- the date processor creates a distinct JavaDateFormatter instance with a single DateTimeFormatter parser per date format, so each attempt to match the supplied date string to a format results in either success or a thrown exception because all (one) of the parsers in the JavaDateFormatter instance failed.
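The difference between the two arrangements can be sketched with plain `java.time`, without any Elasticsearch classes. This is an illustration under assumptions, not the JavaDateFormatter implementation: `CombinedParserSketch`, `parseSeparately`, `parseCombined`, and the `exceptionsThrown` counter are hypothetical names. The combined variant relies on `Format.parseObject(String, ParsePosition)`, which reports a mismatch by returning null rather than throwing.

```java
import java.text.ParsePosition;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.TemporalAccessor;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CombinedParserSketch {
    // Counts exceptions raised while searching for a matching pattern.
    static int exceptionsThrown = 0;

    static List<DateTimeFormatter> formatters(String... patterns) {
        List<DateTimeFormatter> result = new ArrayList<>();
        for (String pattern : patterns) {
            result.add(DateTimeFormatter.ofPattern(pattern, Locale.ROOT));
        }
        return result;
    }

    // One single-pattern parser per format: every miss throws (and is caught).
    static TemporalAccessor parseSeparately(List<DateTimeFormatter> parsers, String input) {
        for (DateTimeFormatter parser : parsers) {
            try {
                return parser.parse(input);
            } catch (DateTimeParseException e) {
                exceptionsThrown++;
            }
        }
        throw new IllegalArgumentException("unable to parse date [" + input + "]");
    }

    // One combined parser: misses are detected via ParsePosition, and an
    // exception is thrown only after every pattern has failed.
    static TemporalAccessor parseCombined(List<DateTimeFormatter> parsers, String input) {
        for (DateTimeFormatter parser : parsers) {
            ParsePosition pos = new ParsePosition(0);
            Object parsed = parser.toFormat().parseObject(input, pos);
            if (parsed != null && pos.getErrorIndex() < 0 && pos.getIndex() == input.length()) {
                return (TemporalAccessor) parsed;
            }
        }
        exceptionsThrown++;
        throw new DateTimeParseException("failed to parse with all patterns", input, 0);
    }

    public static void main(String[] args) {
        List<DateTimeFormatter> parsers =
            formatters("dd/MM/yyyy HH:mm:ss", "yyyy-MM-dd'T'HH:mm:ss", "yyyy/MM/dd HH:mm:ss");
        String input = "2022/02/15 13:41:13"; // only the third pattern matches

        exceptionsThrown = 0;
        LocalDateTime a = LocalDateTime.from(parseSeparately(parsers, input));
        System.out.println(a + " separate parsers, exceptions: " + exceptionsThrown);

        exceptionsThrown = 0;
        LocalDateTime b = LocalDateTime.from(parseCombined(parsers, input));
        System.out.println(b + " combined parser, exceptions: " + exceptionsThrown);
    }
}
```

In this sketch, an input that matches only the last of three patterns costs two exceptions with separate parsers and none with the combined one. Both variants return the same parsed value.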

> Also, how is this expensive? I would not expect throwing a single exception for a document to be so expensive, is this still a problem after #83764? Have we done another flame graph after that change?

In addition to the typical reasons for Java exceptions being slow and not recommended for flow control in tight loops, ingest pipelines tend to have deep stack traces which are extra expensive to gather. We have both profiler results and a number of bug reports in which the date processor accounts for more running time than the other 15 or 20 processors in the pipeline combined.
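The point about deep stack traces can be made concrete with a small sketch. By default the JVM records the current call stack when a Throwable is constructed, so the work done per exception grows with the depth of the call site. The class `StackDepthSketch` and its `capturedFrames` helper are hypothetical, and plain recursion stands in for a deeply nested pipeline; this is not a benchmark and says nothing about the actual depths seen in ingest pipelines.

```java
class StackDepthSketch {
    // Recurses `depth` extra frames, then constructs an exception and reports
    // how many stack frames were captured for it.
    static int capturedFrames(int depth) {
        if (depth == 0) {
            return new RuntimeException("parse failure").getStackTrace().length;
        }
        return capturedFrames(depth - 1);
    }

    public static void main(String[] args) {
        System.out.println("frames captured from a shallow call site: " + capturedFrames(5));
        System.out.println("frames captured from a deep call site:    " + capturedFrames(200));
    }
}
```

Every additional frame between the entry point and the failing parse is captured again for each thrown exception, which is one reason a per-format exception inside a long processor chain is more costly than the same exception thrown near the top of the stack.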

Combine patterns into single parser

ef8f604

danhermann added the >enhancement, :Data Management/Ingest Node, and v8.2.0 labels Feb 15, 2022

elasticmachine added the Team:Data Management label Feb 15, 2022

danhermann and others added 3 commits February 15, 2022 07:41

Update docs/changelog/83942.yaml

2526329

checkstyle

815b860

Merge branch '73918_multiple_parsers' of https://github.com/danherman…

ed2225d

…n/elasticsearch into 73918_multiple_parsers

joegallo requested changes Feb 15, 2022

review comments

71acd14

joegallo approved these changes Feb 15, 2022

dakrone assigned joegallo Mar 17, 2022

salvatore-campagna added v8.3.0 and removed v8.2.0 labels Mar 30, 2022

craigtaverner added v8.4.0 and removed v8.3.0 labels May 25, 2022

elasticsearchmachine changed the base branch from master to main July 22, 2022 23:09

mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022

csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022

kingherc added v8.7.0 and removed v8.6.0 labels Nov 16, 2022

rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023

gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023

pugnascotia added v8.10.0 and removed v8.9.0 labels Jun 22, 2023

quux00 added v8.11.0 and removed v8.10.0 labels Aug 16, 2023

mattc58 added v8.12.0 and removed v8.11.0 labels Oct 4, 2023

brianseeders added v8.13.0 and removed v8.12.0 labels Dec 6, 2023

elasticsearchmachine added v8.14.0 and removed v8.13.0 labels Feb 14, 2024

elasticsearchmachine added v8.15.0 and removed v8.14.0 labels Apr 17, 2024

elasticsearchmachine added v8.16.0 and removed v8.15.0 labels Jul 4, 2024

mark-vieira added v9.0.0 and removed v8.16.0 labels Sep 11, 2024
