Skip to content
This repository has been archived by the owner on Jan 31, 2024. It is now read-only.

Allow dissect to be type aware #10

Closed
vjsamuel opened this issue Apr 15, 2020 · 16 comments · Fixed by elastic/beats#18683
Closed

Allow dissect to be type aware #10

vjsamuel opened this issue Apr 15, 2020 · 16 comments · Fixed by elastic/beats#18683
Labels

Comments

@vjsamuel
Copy link

Grok processor currently allows the parsing to be type aware as seen here:
https://www.elastic.co/guide/en/elasticsearch/reference/master/grok-processor.html#grok-basics

It would save the dissect processor some cycles on using convert processor to imply a non string type on the data.

@ph
Copy link

ph commented Apr 16, 2020

I was looking at the syntax of grok, we are using something like %{syntax:field:value}, when we look at the dissect syntax we often use a special char to define the meaning and separe the key from the actual behavior.

# : for string limited
/3 : for reordering of items.

So looking at this I think we would like to find something that has some meaning maybe we could use > as an analogy we cat into something which could be a type.

A few examples:

%{key>int} Key is a number
%{v1>bool} Key is either true, false, TRUE, FALSE, 1:0, f, t
%{v2>float} Key is a float

I think we should keep a limited list of types, I think grok have a good list of type: int, long, double, float and boolean.

cc @jsvd @jakelandis @andresrc

@ph ph added the discuss label Apr 16, 2020
@vjsamuel
Copy link
Author

thanks @ph. can we enhance the proposal to accommodate timestamp parsing as well? we have had a lot of customers ask us to be able to efficiently parse timestamps out

@ph
Copy link

ph commented Apr 16, 2020

Date or timestamp is a bigger problem but I think we could make it work with patterns maybe something like:

%{key>date} this would use a predefined list of format and we could expand it and allow to define parsing hints that could look something like %{key>date([dd/MM/yyyy hh:mm:s, dd/MM/yyyy])}, where the first matching is fine.

The main problem I see is how we define the format because we have multiples implementations: Java/Go and this will be more complex to converge.

Maybe the date is an exception and it should be a separate processor if you look at Grok it doesn't consider the date as a parsable primitive and you are encourage to delegate the work to a specialized processors: Date filter in Logstash and the Date processor in Elasticsearch.

@jsvd
Copy link
Member

jsvd commented Apr 20, 2020

So looking at this I think we would like to find something that has some meaning maybe we could use > as an analogy we cat into something which could be a type.

Dissect relies on -> to discard padding, so maybe | could be used instead as a "pipe-to-convert" metaphor. So in order to parse and convert this line format:

1   123.12    true  :size=32
999 12.1      false :weight=55

We'd use:

"%{id->|int} %{value->|float} %{buy->|float}:%{?a}=%{&a|float}"

Which would produce:

{ "id": 1, "value": 123.12, "buy": true, "size": 32.0 }

and

{ "id": 999, "value": 12.1, "buy": false, "weight": 55.0 }

@jsvd
Copy link
Member

jsvd commented Apr 20, 2020

Also, for reference, the logstash dissect implementation has a convert_datatype directive that is called after the mapping is processed.

@ph
Copy link

ph commented Apr 20, 2020

I like the ->| idea, we didn't implement the convert_data option on our end because we did have a processor to handle that. :)

@jsvd
Copy link
Member

jsvd commented Apr 20, 2020

Just to be clear, | would be the operator, ->| is just an example of how the operator is combined with other existing ones.

This reminds me that we also need to define where this operator can be placed within the expression (usually at the end) and which operators can it be combined with (e.g. maybe don’t allow combining with +?)

@ph
Copy link

ph commented Apr 20, 2020

Oh, I've misread that so ->| would mean discard padding than pipe to specified type.

For the position, your suggestion to have it at the end of the expression is a good thing.

Correct I think we should disallow it for + and /n because we would need to support more than just string append.

  • and /n Append with order - instructs the parser to append this key's to the value of the prior key with the same name based on order. The + modifier must be placed on the left of the key name and /n modifier placed to the right of the key name, where n = order. The order must start at 1. see example below

I haven't looked at this but maybe + /n and \ should be considered unique per expression, maybe we could call this family of symbol: "modifier"

@jsvd
Copy link
Member

jsvd commented Apr 20, 2020

we didn't implement the convert_data option on our end because we did have a processor to handle that. :)

And we also should consider that it could be the right call to make here. Implementing conversion brings many edge cases, which some we've handled in the mutate filter's convert (still has issues). Users already get hit by this disconnect when using logstash's dissect filter, example here. Even elasticsearch's convert processor has its own implementation details and feature requests.

Adding this capability to dissect may cause some expectation that all of these have parity between them, especially for use cases like this one, where a user is already relying on the convert processor and moves to the not-exactly-the-same dissect's convert feature.

@ph
Copy link

ph commented Apr 20, 2020

@jsvd maybe, this indeed brings a new set of challenges, what I am worried about is their interaction with templates, a coercing problem could indeed lead to an problem to indexing.

@jsvd By curiosity have you ever considered removing the convert from dissect?

@jsvd
Copy link
Member

jsvd commented Apr 20, 2020

@jsvd By curiosity have you ever considered removing the convert from dissect?

I haven't thought about it. it's been there since day 0, and we're not sure how many folks depend on it. we're spending some time thinking about telemetry and that could help us understand if this feature is used or if it should be deprecated.

@vjsamuel
Copy link
Author

@ph @jsvd is there reservations in doing this or is this something we can take up and submit PRs?

@jsvd
Copy link
Member

jsvd commented May 7, 2020

On the Logstash side it's unlikely we'll be implementing this in a near future, as logstash already provides the ability to do conversion within the dissect filter and on this side there hasn't been a demand for this either due to performance or reduced configuration size.

That said I don't want to block you from contributing this feature. If you agree with any of the syntax suggestions here - or have another - and have created a go implementation that demonstrates it working, we can review the PRs:

  • me/PH and Jake for the dissect-specification PR
  • someone on the Beats side for the beats processor change

Regarding feature parity across implementations, maybe we can start versioning this spec. For example, current version can be 1.0.0, and this new feature would land in 1.1.0, since it's a new capability that doesn't break previous ones.

@ph
Copy link

ph commented May 7, 2020

I agree with @jsvd proposal.

@vjsamuel
Copy link
Author

vjsamuel commented May 7, 2020

@ph @jsvd we are ok with the use of the specification that you have recommended and will go by the same.

@sandhyatallapanneni
Copy link

Hi There,

I am writing pattern in logstash version 7.10.2 or 7.15.2 for my log having different lines of pattern using Dissect filter with its key Modifier-> to remove space or spaces as per the log line before the next delimiter. But the Modifier-> is not working. Any suggestions? Please help

My log line is like below:

2020-07-03 14:49:02,003 INFO [org.jboss.as.ee] (MSC service thread 1-2) Configuring component class
2021-06-09 02:21:42,303 DEBUG [org.jboss.as.ee] (ServerService Thread Pool -- 56) Activating EE subsystem

My filter pattern with Dissect is like below:

filter {
        dissect {
mapping => { "message" => "%{logtime} %{+logtime} %{loglevel->} [%{modulename}] (%{thread}) %{logmessage}" }
}
    date {
                match => ["logtime", "ISO8601", "YYYY-MM-dd HH:mm:ss,SSS"]
    }
    mutate {
remove_field => ["path", "message"]
     }
}

I have two spaces after log level INFO and 1 space after log level DEBUG. So trying to use -> Modifier
Using grok I can get it working but want to know why is the Dissect Key Modifier -> failing and saying pattern not found.

Please help

Thanks!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
4 participants