Parser to pass through data to processors #15694

Closed
juha-ylikoski opened this issue Jul 31, 2024 · 21 comments · Fixed by #15697
Labels
feature request Requests for new plugin and for new features to existing plugins

Comments

@juha-ylikoski

Use Case

I think it would be beneficial to be able to offload parsing of the data in inputs to external processor plugins. I have a use case where I would like to be able to parse cbor and transform it in processors before outputting it.

However, this would currently require me to write a parser plugin in Go and recompile telegraf with it, instead of allowing me to write an external plugin as is possible for processors.

Expected behavior

A parser which could read arbitrary binary data into e.g. a base64-encoded value which is passed into a metric like:

mqtt_consumer,data-format=foo,topic=data passthrough="oWNmb29jYmFy" 1722329727288412521
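
For illustration, the field value in the example above is just the base64 text of the raw payload bytes. A minimal Python sketch of that round trip, using only the standard library; the helper names `passthrough_field` and `recover_payload` are hypothetical and not part of Telegraf:

```python
import base64


def passthrough_field(payload: bytes) -> str:
    """Encode raw input bytes as a base64 string suitable for a string field."""
    return base64.b64encode(payload).decode("ascii")


def recover_payload(field_value: str) -> bytes:
    """Decode the base64 field value back into the original bytes."""
    return base64.b64decode(field_value, validate=True)


# The nine bytes below are the CBOR encoding of the map {"foo": "bar"}.
example = b"\xa1cfoocbar"
encoded = passthrough_field(example)
```

With these nine bytes, `encoded` equals the `oWNmb29jYmFy` value shown in the example metric, and `recover_payload(encoded)` returns the original bytes unchanged.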

Actual behavior

I need to write a custom cbor parser and recompile telegraf.

Additional info

I have written a parser like this, and I'm willing to create a PR for it if this project is willing to accept it.

@juha-ylikoski juha-ylikoski added the feature request Requests for new plugin and for new features to existing plugins label Jul 31, 2024
@powersj
Contributor

powersj commented Jul 31, 2024

Hi,

I believe our xpath parser can parse CBOR documents already. Have you looked at that? We added support after #13464 and resolved this in #13480.

However, this would currently require me to write a parser plugin in Go and recompile telegraf with it, instead of allowing me to write an external plugin as is possible for processors.

In these types of cases what we have suggested is using the exec processor. You can post-process any additional fields using an external tool already.

Let me know what you think about the xpath parser!

@powersj powersj added the waiting for response waiting for response from contributor label Jul 31, 2024
@juha-ylikoski
Author

I believe our xpath parser can parse CBOR documents already. Have you looked at that? We added support after #13464 and resolved this in #13480.

Thanks, I had not noticed this.

I looked into it, and I'm a bit confused about the documentation. To me, it looks like I have to know the structure of the data when it's incoming, and I cannot just pass all the data as fields / binary to parsers to post process (which is what I want).

Also, if I used this xpath_cbor parser, I could no longer receive anything but cbor from that particular input (if I have understood correctly). If the parser just wrapped the input bytes into e.g. base64, they could be passed to an external processor which could do the proper processing based on e.g. the input's tags.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Jul 31, 2024
@powersj
Contributor

powersj commented Jul 31, 2024

I have to know the structure of the data when it's incoming, and I cannot just pass all the data as fields / binary to parsers to post process (which is what I want).

In this case, please use the execd processor. This is really what this is for. After an input, you can pass your data to an external processor to do whatever you want with the data, including parsing out a field.

@powersj powersj added the waiting for response waiting for response from contributor label Jul 31, 2024
@juha-ylikoski
Author

In this case, please use the execd processor. This is really what this is for. After an input, you can pass your data to an external processor to do whatever you want with the data, including parsing out a field.

I'm using the execd processor, but as far as I know I cannot use one as an input parser (at least that's how I read this doc), and the input (in my case cbor coming from mqtt) has to be parsed into a metric before I can use the execd processor.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Jul 31, 2024
@powersj
Contributor

powersj commented Jul 31, 2024

Correct - you can continue to use your mqtt input to produce the example metric that you provided in your original post. Then the processors are run, and telegraf sends that data to your processor to do whatever processing you want on that field. That processor returns the new/updated metric.
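
The flow described here can be sketched as a small external program that reads one metric per line and writes the updated metric back. This is only an illustration under assumptions: it presumes metrics arrive as line protocol on stdin and are returned on stdout, it uses the `passthrough` field name from the example metric, and it swaps in a placeholder `payload_bytes` field where real parsing would happen. The names `process_line` and `run` are hypothetical, and the regex is a shortcut rather than a full line protocol parser.

```python
import base64
import re

# Matches a string field named "passthrough" that holds base64 text.
_FIELD = re.compile(r'passthrough="([A-Za-z0-9+/=]*)"')


def process_line(line: str) -> str:
    """Replace the base64 field with the decoded payload size as an integer field."""

    def _swap(match):
        raw = base64.b64decode(match.group(1))
        # Placeholder for real processing, e.g. decoding CBOR or JSON from `raw`.
        return f"payload_bytes={len(raw)}i"

    return _FIELD.sub(_swap, line)


def run(stdin, stdout) -> None:
    """Read metrics line by line, write each processed metric back, flush each time."""
    for line in stdin:
        stdout.write(process_line(line.rstrip("\n")) + "\n")
        stdout.flush()
```

An executable wrapper would import `sys` and call `run(sys.stdin, sys.stdout)`. Flushing after every line matters for a long-running child process, since buffered output would otherwise delay metrics.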

@juha-ylikoski
Author

Do you want me to create a pr for an input processor which enables this functionality, or do we continue to use a fork of telegraf which has this?

@powersj
Contributor

powersj commented Jul 31, 2024

I'm not following what your proposal is.

In your original message:

I would like to be able to parse cbor and transform it in processors before outputting it.

We have both a parser to parse CBOR already and the execd processor to let you transform this data all you want.

@juha-ylikoski
Author

I looked into it, and I'm a bit confused about the documentation. To me, it looks like I have to know the structure of the data when it's incoming, and I cannot just pass all the data as fields / binary to parsers to post process (which is what I want).

If I understood the documentation correctly, I'm not able to parse arbitrarily formatted cbor into a metric for later post-processing, and if I used the parser, I would not be able to also receive json from the same input.

@powersj
Contributor

powersj commented Jul 31, 2024

I'm not able to parse arbitrarily formatted cbor into a metric for later post-processing, and if I used the parser, I would not be able to also receive json from the same input.

Correct, a parser only handles a single data format when parsing the data.

Are you wanting to set up an mqtt input to parse two types of data, both JSON and cbor messages?

@powersj
Contributor

powersj commented Jul 31, 2024

What might help is if you could provide a better description of the entire scenario of what you are trying to do. What data is coming in, what do you use to read it, and what processing do you plan to do with it?

@powersj powersj added the waiting for response waiting for response from contributor label Jul 31, 2024
@juha-ylikoski
Author

Yeah I might have been a bit unclear with the entire scenario.

I have an mqtt input which has multiple topics. Part of the topic identifies what kind of data format the payload should be (E.g. Foo/json1, bar/json2, foo/cbor).

I then need to parse and process content of these messages (with execd plugins) and ultimately output to influxdb (and maybe others).

I have created a parser plugin which creates a metric with a single string field. This field contains b64 encoded bytes from the input (mqtt) and is passed to the processors which based on the mqtt topic (tag) process the payload.

Now as far as I know I cannot have an execd plugin as a parser and I cannot have a parser which accepts arbitrary binary and just puts it into a metric for later processing.

I hope this clarifies my scenario.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Jul 31, 2024
powersj added a commit to powersj/telegraf that referenced this issue Jul 31, 2024
@powersj
Contributor

powersj commented Jul 31, 2024

Thanks for the flow, I am following along a bit better.

I have created a parser plugin which creates a metric with a single string field. This field contains b64 encoded bytes from the input (mqtt) and is passed to the processors which based on the mqtt topic (tag) process the payload.

When you say passed to processors, are these telegraf processor plugins acting on it, or additional processors that your custom parser calls? I want to ensure I understand the flow of the metrics and order of calling in your custom build.

What if we added a base64 option to our value parser (#15697), which would take the message and encode it in base64 as the field? This sounds like what you are doing in your custom parser?

@juha-ylikoski
Author

To be a bit more specific we have a starlark processor which extracts data format as a tag from the mqtt topic (tag) and then we have multiple execd processors which have filters for data formats (one parses json1, one parses json2 and one parses cbor).

Basically I also just took your value parser, removed some stuff, and put the input bytes into a field after b64 encoding them to a string.
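
The routing described in this comment can be sketched in a few lines. This is a simplification, not the actual setup: in the thread the data format is extracted by a starlark processor and handled by separate, filtered execd processors, whereas the sketch folds both steps into one Python dispatch. The function names and the `DECODERS` registry are hypothetical, and using the last topic segment as the data format is an assumption based on the example topics.

```python
import base64
import json


def data_format_from_topic(topic: str) -> str:
    """Use the last topic segment as the data format, e.g. "Foo/json1" -> "json1"."""
    return topic.rsplit("/", 1)[-1]


def decode_payload(topic: str, b64_field: str, decoders: dict):
    """Pick a decoder by data format and apply it to the base64-decoded payload."""
    fmt = data_format_from_topic(topic)
    if fmt not in decoders:
        raise ValueError(f"no decoder registered for data format {fmt!r}")
    return decoders[fmt](base64.b64decode(b64_field))


# Hypothetical registry; a "cbor" entry would need a third-party decoder
# such as cbor2.loads, which is left out to keep the sketch stdlib-only.
DECODERS = {"json1": json.loads, "json2": json.loads}
```

Keeping the parser format-agnostic and deciding per topic afterwards is what lets one mqtt input carry several payload formats.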

@powersj
Contributor

powersj commented Jul 31, 2024

To be a bit more specific we have a starlark processor which extracts data format as a tag from the mqtt topic (tag) and then we have multiple execd processors which have filters for data formats (one parses json1, one parses json2 and one parses cbor).

Thanks for confirming the flow! It really does help me understand what and how you are handling the data.

Basically I also just took your value parser, removed some stuff, and put the input bytes into a field after b64 encoding them to a string.

Ah ok! Would you be able to try out the artifacts in #15697 with the value parser and base64 data type?

@juha-ylikoski
Author

I can try it tomorrow when I get back to work.

@juha-ylikoski
Author

Ah ok! Would you be able to try out the artifacts in #15697 with the value parser and base64 data type?

The parser is almost exactly what I made and almost works, but I think that for binary formats (like cbor), using the stripped string and then re-encoding it might lose data / not work:

		value = base64.StdEncoding.EncodeToString([]byte(vStr))

And I would suggest doing this instead:

		value = base64.StdEncoding.EncodeToString(buf)

I changed this one line, recompiled, tested, and it worked now like my parser.

This can be replicated with e.g. this data:

>>> import cbor2, base64
>>> cbor2.dumps({'foo': {'id': 217056256, 'data': [0, 0, 0, 40, 40, 0, 0, 0]}, 'timestamp': 1722494850.30815})
b'\xa2cfoo\xa2bid\x1a\x0c\xf0\x04\x00ddata\x88\x00\x00\x00\x18(\x18(\x00\x00\x00itimestamp\xfbA\xd9\xaa\xcb\xe0\x93\xb8\xbb'
>>> stripped = b'8AQAZGRhdGGIAAAAGCgYKAAAAGl0aW1lc3RhbXD7Qdmqy+CTuLs='
>>> buf = b'omNmb2+iYmlkGgzwBABkZGF0YYgAAAAYKBgoAAAAaXRpbWVzdGFtcPtB2arL4JO4uw=='
>>> cbor2.loads(base64.b64decode(stripped))
CBORSimpleValue(value=16)
>>> cbor2.loads(base64.b64decode(buf))
{'foo': {'id': 217056256, 'data': [0, 0, 0, 40, 40, 0, 0, 0]}, 'timestamp': 1722494850.30815}
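
The truncation can also be reproduced without Telegraf or `cbor2`. The sketch below assumes, hypothetically, that the lossy path keeps only the last whitespace-separated token of the input before encoding. That is an assumption about the cause, not the parser's actual code, but it is consistent with the `stripped` value above: the payload contains a form-feed byte (`\x0c`), which counts as whitespace. The function names are illustrative.

```python
import base64

# Raw CBOR bytes copied from the session above.
RAW = (
    b"\xa2cfoo\xa2bid\x1a\x0c\xf0\x04\x00ddata\x88\x00\x00\x00\x18(\x18("
    b"\x00\x00\x00itimestamp\xfbA\xd9\xaa\xcb\xe0\x93\xb8\xbb"
)


def encode_last_token(buf: bytes) -> str:
    """Assumed lossy path: keep only the last whitespace-separated token, then encode."""
    return base64.b64encode(buf.split()[-1]).decode("ascii")


def encode_raw(buf: bytes) -> str:
    """Lossless path: encode the untouched input buffer."""
    return base64.b64encode(buf).decode("ascii")
```

Under that assumption, `encode_last_token(RAW)` yields the `stripped` string from the session and `encode_raw(RAW)` yields the `buf` string, which is why encoding the original buffer avoids the data loss.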

@srebhan
Member

srebhan commented Aug 1, 2024

@juha-ylikoski would you please try to push your change to @powersj's branch? Not sure if you got the permissions... Please make sure you signed the CLA before your push!

@powersj
Contributor

powersj commented Aug 1, 2024

@juha-ylikoski would you please try to push your change to @powersj's branch?

Only maintainers can push to PRs, so I'll take a look at this shortly.

@powersj
Contributor

powersj commented Aug 1, 2024

@juha-ylikoski,

I've pushed an update and new artifacts will be up in 20-30 mins. It uses the raw string as you suggested. Let me know!

@juha-ylikoski
Author

@powersj this seems to work as I would expect and does not trim the binary.

@powersj
Contributor

powersj commented Aug 1, 2024

Awesome, thank you for confirming! I'll get this landed.
