
Make kafka output store offset for successfully delivered events #516

Draft: wants to merge 9 commits into main

Conversation

@ppcad (Collaborator) commented Feb 1, 2024

No description provided.

@ppcad ppcad requested review from dtrai2 and ekneg54 February 1, 2024 11:17
@ppcad ppcad self-assigned this Feb 1, 2024
@ppcad ppcad linked an issue Feb 1, 2024 that may be closed by this pull request
@codecov-commenter commented

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (9e19caa) 91.56% compared to head (8c484b9) 91.65%.

Files                                          Patch %   Missing lines
logprep/connector/confluent_kafka/input.py     87.09%    4 ⚠️
logprep/connector/confluent_kafka/output.py    97.29%    1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #516      +/-   ##
==========================================
+ Coverage   91.56%   91.65%   +0.09%     
==========================================
  Files         130      130              
  Lines        9496     9551      +55     
==========================================
+ Hits         8695     8754      +59     
+ Misses        801      797       -4     


@ekneg54 (Collaborator) left a comment


The idea is clear to me, but the mechanic is not, and I think the problem is spread across the output AND the input.

Getting the possibility to add meta fields to the event is pretty cool, as we need this in the http_input_connector too.

The guarantee of delivery is not optional: there is no possibility to opt out and have a fire-and-forget kafka output as before. Please consider making the whole mechanic configurable.

As implemented for now, I do not think this will do the job, and I have big doubts about the performance of this solution. Please have a look at my remarks.

logprep/connector/confluent_kafka/input.py (review thread, outdated, resolved)
Comment on lines 439 to 443
for meta_field in ("last_partition", "last_offset"):
    try:
        del event_dict["_metadata"][meta_field]
    except (TypeError, KeyError):
        pass
@ekneg54 (Collaborator):

Suggested change
- for meta_field in ("last_partition", "last_offset"):
-     try:
-         del event_dict["_metadata"][meta_field]
-     except (TypeError, KeyError):
-         pass
+ metadata.pop("last_partition", None)
+ metadata.pop("last_offset", None)

Easier to read. (You have to adjust the tests, because you pass string data where a dict is expected.)

@ppcad (Collaborator, Author):

I have changed it, but I kept the try/except to cover the case where _metadata is not a dict, since this field might already exist in the event and be of any type. I did not check for dict directly, since the non-dict case is unlikely to occur, and the try/except is more performant that way.
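
For illustration, a minimal sketch of the merged approach described above, assuming the variable names from the diff context (event_dict, _metadata); the exact exception tuple depends on the final code:

    event_dict = {"_metadata": "not a dict"}  # worst case: field exists with a non-dict type

    metadata = event_dict.get("_metadata")
    try:
        # pop() tolerates missing keys; the try/except tolerates a
        # _metadata field that is not a dict at all
        metadata.pop("last_partition", None)
        metadata.pop("last_offset", None)
    except (AttributeError, TypeError):
        pass  # _metadata was present but not a dict, so there is nothing to remove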

Comment on lines 466 to 467
if metadata is None:
    raise FatalInputError(self, "Metadata for setting offsets can't be 'None'")
@ekneg54 (Collaborator):

Suggested change
- if metadata is None:
-     raise FatalInputError(self, "Metadata for setting offsets can't be 'None'")

If it can't be None, we should ensure it is not None to reflect the non-optional type hint. But this seems to be the wrong place to check it; we should check earlier to fail faster.

My suggestion is to set metadata to an empty dict in batch_finished_callback if it is None.

@ppcad (Collaborator, Author):

Thanks, I changed it to be set in batch_finished_callback.

Comment on lines +486 to 487
Should be called by output connectors if they are finished processing a batch of records.
"""
@ekneg54 (Collaborator):

Suggested change
      Should be called by output connectors if they are finished processing a batch of records.
      """
+     metadata = {} if metadata is None else metadata

This way we ensure it can't be 'None' in further processing. Please adjust the type hints accordingly.
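
A sketch of what the adjusted signature could look like; the stub class is an assumption for illustration, the real method lives in logprep's kafka input connector:

    from typing import Optional

    class KafkaInputStub:  # stand-in for the real input connector class
        def batch_finished_callback(self, metadata: Optional[dict] = None) -> None:
            """Should be called by output connectors if they are finished
            processing a batch of records."""
            metadata = {} if metadata is None else metadata
            # from here on, metadata is guaranteed to be a dict, so no
            # later "can't be None" check is required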

@ppcad (Collaborator, Author):

Thanks, I've added that suggestion.


@Metric.measure_time()
def _write_backlog(self):
    self._producer.flush(self._config.flush_timeout)
@ekneg54 (Collaborator):

The flush_timeout in opensearch and elasticsearch is the time to guarantee message delivery, so this confuses me.

@ppcad (Collaborator, Author):

flush_timeout was already used for flush in shut_down and in case of a BufferError.
flush internally calls poll until the internal buffer is empty, ensuring that all messages get sent.
We could rename it, but calling it flush_timeout for the flush method makes sense in my opinion.

@ekneg54 (Collaborator):

Yes, it makes sense. But the other option is to rename the parameters in elasticsearch and opensearch so that the term "flush_timeout" means the same thing globally in logprep. I would prefer to change it here and now, to get rid of this inconsistency and not raise another pull request to change it anywhere else.

if error:
    raise FatalOutputError(output=self, message=error)
self.metrics.number_of_successfully_delivered_events += 1
self.input_connector.batch_finished_callback(metadata=partition_offset)
@ekneg54 (Collaborator):

As I understand it, batch_finished_callback is called on every successfully delivered message, right? If so, this would decrease performance drastically, because on every successful delivery the GIL is held by this method.

Also, the called method is named BATCH_finished_callback, but now it is called on every single message?

Consider using the equivalent mechanic as in the opensearch_output: write all successful deliveries into a list (you should use a deque for this). Then, when the list is full, get the last committable offset for each partition and call batch_finished_callback with it.

As implemented here, it is also possible that you commit offsets for messages that are not yet delivered. Example, with the kafka topic partition offsets:

current: 0, committed: 0
you consume a message -> current: 1, committed: 0
you consume a further message -> current: 2, committed: 0
you deliver message 2 and the callback is called -> current: 2, committed: 2

What about the first message, which was not delivered? Kafka thinks it is delivered now, but it actually is not.
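
A rough sketch of the suggested deque-based mechanic; the class name, batch size, callback wiring, and metadata shape are assumptions for illustration, and logprep's actual opensearch_output mechanic may differ in detail:

    from collections import deque

    class DeliveryTracker:
        """Illustrative only: collects delivery confirmations, commits per batch."""

        def __init__(self, input_connector, batch_size=500):
            self.input_connector = input_connector
            self.batch_size = batch_size
            self._delivered = deque()  # (partition, offset) of each delivered message

        def on_delivery(self, error, partition, offset):
            # registered as the producer's delivery callback
            if error is not None:
                raise RuntimeError(f"delivery failed: {error}")
            self._delivered.append((partition, offset))
            if len(self._delivered) >= self.batch_size:
                self._commit_batch()

        def _commit_batch(self):
            # reduce the batch to the last delivered offset per partition and
            # hand it over in a single call instead of one call per message
            last_offsets = {}
            while self._delivered:
                partition, offset = self._delivered.popleft()
                last_offsets[partition] = max(offset, last_offsets.get(partition, -1))
            self.input_connector.batch_finished_callback(metadata=last_offsets)

As the offset example above shows, committing the highest delivered offset per partition is only safe once all lower offsets have been delivered as well, so a real implementation would also have to account for gaps.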

name="number_of_successfully_delivered_events",
)
)
"""Number of events that were successfully delivered to Kafka"""
@ekneg54 (Collaborator):

Yes, OK, it is not part of your issue now, but could you please add the documentation for the other kafka_output config parameters?

@ppcad (Collaborator, Author):

I have added some documentation now.

Development

Successfully merging this pull request may close these issues:

Ensure Correct Offsets for Kafka Output
3 participants