[Ingest] Expose domainSplit() in ingest script processor and possibly aggregations #36359

Open
LucaWintergerst opened this issue Dec 7, 2018 · 5 comments
Labels
:Core/Infra/Scripting Scripting abstractions, Painless, and Mustache :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Core/Infra Meta label for core/infra team Team:Data Management Meta label for data/management team

Comments

@LucaWintergerst
Contributor

Describe the feature:
The domainSplit() Painless method splits domains into their parts (subdomain, TLD, ...). It was first introduced when Machine Learning was integrated into Elasticsearch, and was exposed in scripted fields so that ML jobs that need this information can work.

However, this functionality is also incredibly useful as part of ingest. No other part of our stack has a substitute for it (apart from Packetbeat, which does something similar by default).
There's also no good workaround: the public suffix list is required to do "good" domain splitting, and scripted fields cannot be used in many parts of Kibana. Computing the split in a scripted field likely also carries a small performance hit.
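
To illustrate why the public suffix list is needed, here is a minimal Python sketch of the naive alternative. It is not the Painless implementation, and `naive_registered_domain` is a hypothetical helper that simply assumes the registered domain is the last two labels of the hostname:

```python
def naive_registered_domain(hostname: str) -> str:
    """Hypothetical helper: treat the last two labels as the registered domain."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


# Works when the public suffix is a single label...
print(naive_registered_domain("www.elastic.co"))     # elastic.co
# ...but breaks on multi-label suffixes: "co.uk" is a public suffix,
# so the registered domain should have been "example.co.uk".
print(naive_registered_domain("www.example.co.uk"))  # co.uk
```

Without a list of suffixes such as `co.uk`, a script has no way to tell these two cases apart.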

@rjernst and @polyfractal discussed this briefly and agreed that it makes sense to have.

One remaining question to work out is whether it also makes sense to have this available in scripted aggregations.

@rjernst rjernst added the :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache label Dec 7, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@rjernst rjernst added the discuss label Dec 7, 2018
@jakelandis
Contributor

jakelandis commented Dec 14, 2018

We discussed this today in Fixit Friday and agreed that this would be useful in other parts of Elasticsearch, and that it is something we want to pursue.

We still need to discuss with the @elastic/machine-learning team whether they are agreeable to moving this code from ml to a more common place in the source tree (which may require relicensing it). We also need to discuss how to maintain the list of top level domains.

@droberts195
Contributor

droberts195 commented Dec 14, 2018

> We also need to discuss how to maintain the list of top level domains.

One option would be to work from the public suffix data file instead of the compressed version embedded in the code. We could ship public_suffix_list.dat as a resource file and parse it at startup. Updating it would then simply be a matter of updating that file in the source tree. (Or we could ship it as a config file and parse it from the config directory if we wanted end users to be able to update it independently of a new release.)

We actually had some C++ code to do this in a previous product - I'll dig it out for you.
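
As a rough illustration of the "parse the data file at startup" idea, here is a minimal Python sketch. It is not the C++ code mentioned above and not the existing ML implementation. `SAMPLE_PSL` is a tiny hand-written excerpt in the public suffix list format (`//` comments, `*.` wildcard rules, `!` exception rules), and `parse_psl`, `split_domain`, and the returned tuple shape are all assumptions made for the example:

```python
from typing import NamedTuple, Set, Tuple

# Hand-written excerpt in the public suffix list format, NOT the real file.
SAMPLE_PSL = """\
// comment lines start with two slashes
com
co
uk
co.uk
*.ck
!www.ck
"""


class Rules(NamedTuple):
    exact: Set[str]      # plain rules, e.g. "co.uk"
    wildcard: Set[str]   # "*.ck" stored as "ck"
    exception: Set[str]  # "!www.ck" stored as "www.ck"


def parse_psl(text: str) -> Rules:
    """Parse public-suffix-list-style text into rule sets (done once at startup)."""
    exact, wildcard, exception = set(), set(), set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        rule = line.split()[0].lower()
        if rule.startswith("!"):
            exception.add(rule[1:])
        elif rule.startswith("*."):
            wildcard.add(rule[2:])
        else:
            exact.add(rule)
    return Rules(exact, wildcard, exception)


def split_domain(hostname: str, rules: Rules) -> Tuple[str, str, str]:
    """Return (subdomain, registered_domain, public_suffix) for a hostname."""
    labels = hostname.lower().rstrip(".").split(".")
    suffix_len = 1  # implicit default rule "*": the last label is the suffix
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        n = len(labels) - i
        if candidate in rules.exception:
            # An exception rule wins outright; its suffix is one label shorter.
            suffix_len = n - 1
            break
        parent = ".".join(labels[i + 1:])
        if candidate in rules.exact or (parent and parent in rules.wildcard):
            suffix_len = max(suffix_len, n)  # the longest matching rule wins
    if len(labels) <= suffix_len:
        # The hostname is itself a public suffix: nothing is registered.
        return ("", "", ".".join(labels))
    cut = len(labels) - suffix_len
    return (
        ".".join(labels[:cut - 1]),   # subdomain
        ".".join(labels[cut - 1:]),   # registered domain
        ".".join(labels[cut:]),       # public suffix
    )


rules = parse_psl(SAMPLE_PSL)
print(split_domain("www.example.co.uk", rules))
print(split_domain("a.b.foo.ck", rules))
```

With rules kept in a data file like this, updating the list means replacing the file rather than regenerating code, which is the appeal of the resource-file or config-file options.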

@jakelandis jakelandis added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Mar 13, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@rjernst rjernst added Team:Data Management Meta label for data/management team Team:Core/Infra Meta label for core/infra team labels May 4, 2020
@mbudge

mbudge commented Oct 13, 2020

The public suffix file is the best way to get the top level domain, subdomain, registered domain, root domain and last but not least the domain.

@rjernst rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020
@jimczi jimczi removed the needs:triage Requires assignment of a team area label label Jan 12, 2021
7 participants