Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine date processor patterns into single parser #83942

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

danhermann
Copy link
Contributor

Combines all custom patterns into a single parser so that no more than a single exception is thrown while searching for a matching pattern. This significantly improves performance in scenarios where multiple patterns are attempted and was suggested by @joegallo as a possible alternative to #83801.

Relates to #73918

@danhermann danhermann added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.2.0 labels Feb 15, 2022
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Feb 15, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Copy link
Collaborator

Hi @danhermann, I've created a changelog YAML for you.

Copy link
Contributor

@joegallo joegallo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM -- but even so I don't think it's fair for me to give a load bearing +1 on this.

@@ -72,10 +72,22 @@
this.targetField = targetField;
this.formats = formats;
this.dateParsers = new ArrayList<>(this.formats.size());
List<String> javaFormats = new ArrayList<>(this.formats.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this can be final

for (String format : formats) {
DateFormat dateFormat = DateFormat.fromString(format);
dateParsers.add((params) -> dateFormat.getFunction(format, newDateTimeZone(params), newLocale(params)));
if (DateFormat.Java == dateFormat) {
javaFormats.add(format);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably worth doing our future selves a solid and adding a comment that explains what this is doing and why. Maybe something like:

// pull out the java formats separately so they can all be processed as a single combined date parser (see below)

Copy link
Contributor

@joegallo joegallo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (but, like I said, I think you should get a second set of 👀 on this)

@danhermann
Copy link
Contributor Author

The original motivation for this PR is the poor performance of the date processor when multiple formats are specified. For each format that fails to match the input, an exception is thrown by the JavaDateFormatter::doParse method (despite its somewhat confusing claim not to do so in the javadoc) and profiling that has shown that to be quite expensive in a number of common use cases.

One potential solution proposed in #83801 is to provide a method that really does not throw exceptions on parsing failures. That solution would resolve all the performance problems with exceptions thrown for date parsing failures with no change in behavior for the date processor.

Another approach proposed originally by @joegallo groups all the Java time formats specified in the date processor's format option into a single JavaDateFormatter instance with multiple formats rather than creating a distinct JavaDateFormatter instance with a single format each. The JavaDateFormatter::doParse method throws an exception only if all of the supplied formats fail, so this would eliminate exceptions in the date processor that specify multiple Java time formats except in the case where none of the patterns match the input. Unfortunately, it does involve a potential change in behavior for the date processor because in addition to Java time formats, the date processor supports several "standardized" formats such as ISO8601, UNIX, and UNIX_MS. When any of those standardized formats are specified, a statically-initialized instance of JavaDateFormatter is used. If both standardized time formats and custom Java time formats are specified, the formats could be attempted in an order not specified by the user since all of the custom Java time formats are grouped together and attempted last. Note also that because the standardized formats use distinct JavaDateFormatter instances, any parse failures for standardized formats produce exceptions so a greater number of exceptions could be thrown by the date processor with this approach: (# of standardized formats) + 1 (for any custom Java time formats).

@dakrone
Copy link
Member

dakrone commented Feb 23, 2022

For each format that fails to match the input, an exception is thrown by the JavaDateFormatter::doParse method (despite its somewhat confusing claim not to do so in the javadoc) and profiling that has shown that to be quite expensive in a number of common use cases.

Can you explain this a little bit more? The code appears to only throw an exception once all parsers have been tried. I'm not super familiar with the way that the ingest node date processor, so is it that it throws an exception for each date processor? Also, how is this expensive? I would not expect throwing a single exception for a document to be so expensive, is this still a problem after #83764? Have we done another flame graph after that change?

@danhermann
Copy link
Contributor Author

For each format that fails to match the input, an exception is thrown by the JavaDateFormatter::doParse method (despite its somewhat confusing claim not to do so in the javadoc) and profiling that has shown that to be quite expensive in a number of common use cases.

Can you explain this a little bit more? The code appears to only throw an exception once all parsers have been tried. I'm not super familiar with the way that the ingest node date processor, so is it that it throws an exception for each date processor?

That's the core issue this PR is addressing -- the date processor creates a distinct JavaDataFormatter instance with a single DateTimeFormatter parser per date format so each attempt to match the supplied date string to a format results in either success or a thrown exception because all (one) of the parsers in the JavaDateFormatter instance failed.

Also, how is this expensive? I would not expect throwing a single exception for a document to be so expensive, is this still a problem after #83764? Have we done another flame graph after that change?

In addition to the typical reasons for Java exceptions being slow and not recommended for flow control in tight loops, ingest pipelines tend to have deep stack traces which are extra expensive to gather. We have both profiler results and a number of bug reports in which the date processor accounts for more running time than the other 15 or 20 processors in the pipeline combined.

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:09
@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@csoulios csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022
@kingherc kingherc added v8.7.0 and removed v8.6.0 labels Nov 16, 2022
@rjernst rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement Team:Data Management Meta label for data/management team v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.