Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for multiple output connectors #306

Merged
merged 26 commits into from
Mar 7, 2023

Conversation

ekneg54
Copy link
Collaborator

@ekneg54 ekneg54 commented Jan 31, 2023

  • Add support for multiple output connectors
  • changes rule language of selective_extractor, pseudonymizer, pre_detector to support multiple outputs

TODO:

  • validation of output_names in selective_extractor, pseudonymizer, pre_detector against defined outputs
  • add storing extra_data with multiple outputs
  • add documentation for callback behavior if two default outputs are defined

ToDo but NotNow:

@ekneg54 ekneg54 self-assigned this Jan 31, 2023
@ekneg54 ekneg54 linked an issue Jan 31, 2023 that may be closed by this pull request
@ekneg54 ekneg54 changed the title set output as dict in pipeline add support for multiple output connectors Jan 31, 2023
@ekneg54 ekneg54 force-pushed the dev-add-multiple-output-processors branch from bc4698c to e9cc825 Compare February 11, 2023 15:18
@codecov-commenter
Copy link

codecov-commenter commented Feb 11, 2023

Codecov Report

Base: 91.25% // Head: 91.41% // Increases project coverage by +0.15% 🎉

Coverage data is based on head (3f5cac1) compared to base (a224468).
Patch coverage: 98.72% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #306      +/-   ##
==========================================
+ Coverage   91.25%   91.41%   +0.15%     
==========================================
  Files         120      120              
  Lines        8235     8293      +58     
==========================================
+ Hits         7515     7581      +66     
+ Misses        720      712       -8     
Impacted Files Coverage Δ
logprep/util/auto_rule_tester/auto_rule_tester.py 86.65% <50.00%> (-0.24%) ⬇️
logprep/processor/pre_detector/processor.py 94.00% <85.71%> (ø)
logprep/abc/output.py 100.00% <100.00%> (ø)
logprep/abc/processor.py 100.00% <100.00%> (ø)
logprep/framework/pipeline.py 95.04% <100.00%> (+0.25%) ⬆️
logprep/processor/clusterer/processor.py 93.54% <100.00%> (ø)
logprep/processor/pseudonymizer/processor.py 98.19% <100.00%> (ø)
logprep/processor/selective_extractor/processor.py 100.00% <100.00%> (ø)
logprep/processor/selective_extractor/rule.py 100.00% <100.00%> (ø)
logprep/util/configuration.py 97.65% <100.00%> (+0.38%) ⬆️
... and 2 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ekneg54 ekneg54 marked this pull request as ready for review February 12, 2023 15:18
Copy link
Collaborator

@ppcad ppcad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very nice feature, but would it be possible to have multiple outputs for extra data as well?
Furthermore, it is not verified if the output name for custom data is valid. Logprep will therefore run and throw a KeyError as soon as custom data is written.

@ekneg54
Copy link
Collaborator Author

ekneg54 commented Feb 14, 2023

Sure... I will give it a try.👍😁

@ppcad
Copy link
Collaborator

ppcad commented Feb 15, 2023

Thank you!

@ekneg54
Copy link
Collaborator Author

ekneg54 commented Feb 15, 2023

For the selective_extractor storing extra_output to multiple outputs should yet be possible by defining different rules for different outputs or topics.

we could make the same in the pseudonymizer and the pre_detector, that you are able to overwrite the the output and topic defined in the processor configuration on rule level.

this would be the most useful solution in my opinion.

@ppcad
Copy link
Collaborator

ppcad commented Feb 15, 2023

If the goal is to write extra data for the same event into multiple outputs, then having multiple rules with different outputs would have the problem that events need to be processed by each rule variant. This would multiply the rule count by the number of outputs, which would multiply the processing time of that processor and make it hard to maintain the rules.
Depending on the processor it might not work at all, since it could also change the event by a random value (hash/uuid) or it can't overwrite etc.

@ekneg54 ekneg54 force-pushed the dev-add-multiple-output-processors branch from 5fb9b87 to a8d00d7 Compare February 15, 2023 10:14
@ekneg54 ekneg54 marked this pull request as draft February 15, 2023 10:15
@ekneg54 ekneg54 force-pushed the dev-add-multiple-output-processors branch from a8d00d7 to 25d9d9a Compare February 15, 2023 10:18
@ekneg54
Copy link
Collaborator Author

ekneg54 commented Feb 15, 2023

ok... good point... then I have to dig a lot deeper keeping my fingers crossed to not get to a really ugly solution :-D

@ppcad
Copy link
Collaborator

ppcad commented Feb 15, 2023

Thank you very much!

Copy link
Collaborator

@herrfeder herrfeder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested locally with individiual selective_extractor configs on my host and the FileInputConnector.

Incompability with console_output

The output type console_output doesn't work with the new implementation at all as it will raise the following error:

2023-02-16 09:11:08,030 Pipeline ERROR   : A critical error occurred for processor SelectiveExtractor (selective_extractor) when processing an event, processing was aborted: (AttributeError: module 'sys' has no attribute 'first_topic') (59 in ~29.0 sek)

Apparently this output connector doesn't like the topic attribute but ommiting target_topic isn't possible which is also confusing when using file outputs.

Confusion about multi output selective extraction

According to my review comments I don't understand the actual intended test behaviour as I would assume with individual selective_extractor the affected outputs should be different.

Suggestion for optimized dealing with multiple outputs through consecutive pipeline processors

And I guess it touches somehow @ppcad comments I don't see the possibility to create individual streams when using multi-outputs which leads to the described behaviour that for each additional output all rules will be executed again.
An easy but not perfect possibility to deal with it in my opinion would be to introduce (just like in Logstash) Meta-Fields that can be filtered. When addressing multiple or even new outputs in a multi-output aware processor a meta-field could be added {"output":"output_stream_123"} which can be filtered in the following processors in the pipeline. This would still touch the filter of each following processor but doesn't execute the full processor behaviour.

tests/acceptance/test_multiple_outputs.py Show resolved Hide resolved
tests/acceptance/test_multiple_outputs.py Show resolved Hide resolved
@ppcad
Copy link
Collaborator

ppcad commented Feb 16, 2023

I tested locally with individiual selective_extractor configs on my host and the FileInputConnector.

Incompability with console_output

The output type console_output doesn't work with the new implementation at all as it will raise the following error:

2023-02-16 09:11:08,030 Pipeline ERROR   : A critical error occurred for processor SelectiveExtractor (selective_extractor) when processing an event, processing was aborted: (AttributeError: module 'sys' has no attribute 'first_topic') (59 in ~29.0 sek)

Apparently this output connector doesn't like the topic attribute but ommiting target_topic isn't possible which is also confusing when using file outputs.

Yes, I have also noticed that. It works if target_topic is set to stdout.

Confusion about multi output selective extraction

According to my review comments I don't understand the actual intended test behaviour as I would assume with individual selective_extractor the affected outputs should be different.

Suggestion for optimized dealing with multiple outputs through consecutive pipeline processors

And I guess it touches somehow @ppcad comments I don't see the possibility to create individual streams when using multi-outputs which leads to the described behaviour that for each additional output all rules will be executed again. An easy but not perfect possibility to deal with it in my opinion would be to introduce (just like in Logstash) Meta-Fields that can be filtered. When addressing multiple or even new outputs in a multi-output aware processor a meta-field could be added {"output":"output_stream_123"} which can be filtered in the following processors in the pipeline. This would still touch the filter of each following processor but doesn't execute the full processor behaviour.

Extra data doesn't get passed to the next processor, so it needs to be generated again for each output if the goal is to send the same extra data to different outputs. To generate it the whole event needs to be processed again.

@ekneg54
Copy link
Collaborator Author

ekneg54 commented Feb 16, 2023

It is a draft... I am not ready

@ekneg54 ekneg54 marked this pull request as ready for review February 22, 2023 08:22
@ekneg54
Copy link
Collaborator Author

ekneg54 commented Feb 22, 2023

@ppcad @herrfeder ... I'm ready now... please have a look

Copy link
Collaborator

@herrfeder herrfeder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we finished the review with at least 6 eyes gracefully, I can finish this review with minor request changes.

logprep/util/auto_rule_tester/auto_rule_tester.py Outdated Show resolved Hide resolved
logprep/util/configuration.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@ppcad ppcad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the changes!

@ekneg54
Copy link
Collaborator Author

ekneg54 commented Mar 6, 2023

thank you for your approve. I will merge this, if we are ready. Because it will break our configuration.

@ekneg54 ekneg54 force-pushed the dev-add-multiple-output-processors branch from 9537f30 to fbb55d0 Compare March 7, 2023 17:12
@ekneg54 ekneg54 merged commit 49303b8 into main Mar 7, 2023
@ekneg54 ekneg54 deleted the dev-add-multiple-output-processors branch March 10, 2023 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for multiple input and output connectors
5 participants