USMON-1008: Parse Produce Response #28526

DanielLavie · 2024-08-18T08:29:20Z

What does this PR do?

This PR adds support for parsing Kafka produce responses in the kernel as part of the USM Kafka monitoring feature in system-probe. It captures error codes and measures latency for Kafka produce requests.

Motivation

Capturing error codes and latency for Kafka produce requests is crucial, as it provides the same visibility for produce requests that we already offer for fetch requests. This ensures that customers can effectively monitor and analyze their Kafka traffic.

Additional Notes

Possible Drawbacks / Trade-offs

Currently, we only support parsing produce requests with a single partition, so we’ve also limited response parsing to single-partition produce requests. While this hasn't caused issues in our dogfooding environment or with customers so far, we may revisit this decision in the future. Expanding support for multiple partitions would add complexity to the code.

Describe how to test/QA your changes

Load test results:

Link to staging deployment

pr-commenter · 2024-08-18T12:34:36Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=43523913 --os-family=ubuntu

Note: This applies to commit c533f19

pr-commenter · 2024-08-18T13:11:22Z

Regression Detector

Regression Detector Results

Run ID: b1775dee-5c6d-4c43-b13c-778a9338d03a (links: Metrics dashboard, Target profiles)

Baseline: b8d3295
Comparison: 6faf8b6

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	links
➖	pycheck_lots_of_tags	% cpu utilization	+2.10	[-0.44, +4.64]	Logs
➖	basic_py_check	% cpu utilization	+2.08	[-0.71, +4.87]	Logs
➖	tcp_syslog_to_blackhole	ingress throughput	+0.51	[-12.16, +13.17]	Logs
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	+0.33	[-0.55, +1.21]	Logs
➖	otel_to_otel_logs	ingress throughput	+0.06	[-0.75, +0.87]	Logs
➖	idle	memory utilization	+0.06	[+0.01, +0.10]	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	+0.00	[-0.01, +0.01]	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.00	[-0.00, +0.00]	Logs
➖	file_tree	memory utilization	-0.55	[-0.64, -0.45]	Logs

Bounds Checks

perf	experiment	bounds_check_name	replicates_passed
❌	idle	memory_usage	9/10

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we flag a change in performance as a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
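The three criteria above can be expressed as a small predicate. This is only an illustrative sketch in Go, not the Regression Detector's actual code; the struct, field, and function names are hypothetical, and the 5.00% tolerance is taken from the report above.

```go
package main

import "math"

// experimentResult mirrors one row of the results table (hypothetical type).
type experimentResult struct {
	Name          string
	DeltaMeanPct  float64 // estimated Δ mean %
	CILow, CIHigh float64 // bounds of the 90% confidence interval
	Erratic       bool    // true if the experiment config marks it "erratic"
}

// effectSizeTolerancePct is the |Δ mean %| threshold quoted in the report.
const effectSizeTolerancePct = 5.0

// isRegression returns true only when all three criteria hold.
func isRegression(r experimentResult) bool {
	bigEnough := math.Abs(r.DeltaMeanPct) >= effectSizeTolerancePct
	excludesZero := r.CILow > 0 || r.CIHigh < 0
	return bigEnough && excludesZero && !r.Erratic
}
```

For example, file_tree (-0.55, CI [-0.64, -0.45]) has an interval that excludes zero but an effect size well under 5%, so it is not flagged.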

DanielLavie · 2024-09-04T15:11:00Z

/merge

dd-devflow · 2024-09-04T15:11:10Z

🚂 MergeQueue: waiting for PR to be ready

This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with the /remove command.

Use /merge -c to cancel this operation!

dd-devflow · 2024-09-04T15:55:43Z

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 22m.

Use /merge -c to cancel this operation!

DanielLavie added 2 commits August 15, 2024 17:11

Added the initial KAFKA_PRODUCE_RESPONSE states

b27063f

Laying the foundations for produce response parser programs

f16bc85

github-actions bot added the component/system-probe label Aug 18, 2024

DanielLavie added the team/usm (The USM team) label Aug 18, 2024

Merge branch 'main' into daniel.lavie/USMON-1008-produce-response-sbs

be23bde

DanielLavie added 10 commits August 21, 2024 15:55

Merge branch 'main' into daniel.lavie/USMON-1008-produce-response-sbs

e69c76b

Split kafka_continue_parse_response_partition_loop to fix 4.14 verifier issue

70f8b62

Merge branch 'main' into daniel.lavie/USMON-1008-produce-response-sbs

aeec6be

Fixed api_key to target_api_key when passing it as compile time value

6fca6e0

Getting to parse produce response and fixed debug log

d258c70

Parsing up to KAFKA_PRODUCE_RESPONSE_NUM_PARTITIONS

9c2a270

Added missing code for parsing produce response

80f6f22

Fixed a bug in produce parsing and update the produce response state as needed. Still, I don't see the error code in the user mode while running UT

d8956b6

Now seeing produce error code and latency in the user mode, still need to fix the UT though

c93a607

Added the ability to add a response function to the ParseProduceRaw tests, and changed the helper functions to support this

753fa33

DanielLavie added the changelog/no-changelog and qa/done (Skip QA week as QA was done before merge and regressions are covered by tests) labels Aug 26, 2024

DanielLavie changed the title ~~USMON-1008 produce response step by step~~ USMON-1008: Parse Produce Response Aug 26, 2024

DanielLavie added 5 commits August 26, 2024 12:02

Removed unnecessary produce response states in the Kernel

afbd91e

Fixed switch indent and casing

97d79f2

Merge branch 'main' into daniel.lavie/USMON-1008-produce-response-sbs

ca896a5

Renamed kafka_continue_parse_response_partition_loop to kafka_continue_parse_response_partition_loop_fetch

9e85cbf

Fixed another switch-case indentation

07c8fad

DanielLavie marked this pull request as ready for review August 26, 2024 09:39

DanielLavie requested a review from a team as a code owner August 26, 2024 09:39

Removed redundant comment

fe1da1b

vitkyrka approved these changes Aug 27, 2024

pkg/network/usm/kafka_monitor_test.go (review thread resolved)

pkg/network/ebpf/c/protocols/kafka/kafka-parsing.h (review thread resolved)

pkg/network/protocols/kafka/protocol.go (review thread resolved)

Added missing fallthrough

1ae207f

DanielLavie added 2 commits August 27, 2024 15:52

Added "split" tests for raw produce

28b3d48

Merge branch 'main' into daniel.lavie/USMON-1008-produce-response-sbs

71bd319

DanielLavie mentioned this pull request Sep 1, 2024

Add UT for Kafka Producer with noAcks Configuration #28944

Merged

DanielLavie added 8 commits September 2, 2024 13:34

Merge branch 'main' into daniel.lavie/USMON-1008-produce-response-sbs

a8fc495

Merge branch 'main' into daniel.lavie/USMON-1008-produce-response-sbs

e54420d

Using the required ack field in the produce request to decide if we should wait for the produce response

d940d4b

Fixed produce raw tests to include request acks, so the Kernel logic will parse the response

b120beb

Added "produce no required acks" counter to the Kernel telemetry

18ddc9d

Merge branch 'main' into daniel.lavie/USMON-1008-produce-response-sbs

c9dc25c

Incrementing produce_no_required_acks only for produce requests now

6faf8b6

Updated a comment in kafka statkeeper.go

c533f19

dd-mergequeue bot merged commit 01d4ab5 into main Sep 4, 2024
290 of 296 checks passed

dd-mergequeue bot deleted the daniel.lavie/USMON-1008-produce-response-sbs branch September 4, 2024 16:19

github-actions bot added this to the 7.58.0 milestone Sep 4, 2024
