Batch lambda logs API data before sending to APM-Server #314

Merged · 16 commits merged into elastic:main on Oct 25, 2022

Conversation

@lahsivjar (Contributor) commented on Sep 22, 2022

Motivation

This PR is motivated by two factors:

  1. Currently, data collected from the Logs API is sent to APM-Server one event at a time, as it is received. This PR batches the data before sending it to APM-Server. Batches are bounded both by the number of events they contain and by their age, measured from the first entry added to them (see the sketch after this list).
  2. Before this PR, the extension drops data until metadata from the agents is available. This PR adds logic to buffer events in the extension and, if the buffer is full, to push back on the Lambda Logs API until metadata is available. Because we push back on all log events from the Logs API, including platform.runtimeDone, this behavior can also affect the lifecycle of the extension. If the Lambda Logs API drops logs due to resource constraints caused by our pushback, it publishes a new platform.logsDropped event, which we log from the extension. How long the Logs API retains this event is not clear; it might also be dropped, although that has not happened in any of my tests so far. However, assuming that metadata becomes available without much delay, this should not be a deal-breaker.
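
A minimal sketch of the size- and age-based batching described in point 1. The type and method names here are illustrative assumptions, not the actual implementation in this PR:

```go
package apmproxy

import (
	"sync"
	"time"
)

// batch accumulates serialized events and reports when it should be flushed,
// either because it has reached maxSize entries or because the first entry
// is older than maxAge.
type batch struct {
	mu      sync.Mutex
	entries [][]byte
	created time.Time
	maxSize int
	maxAge  time.Duration
}

func newBatch(maxSize int, maxAge time.Duration) *batch {
	return &batch{maxSize: maxSize, maxAge: maxAge}
}

// add appends an event and returns true if the batch should be flushed.
func (b *batch) add(event []byte) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.entries) == 0 {
		b.created = time.Now() // age is measured from the first entry
	}
	b.entries = append(b.entries, event)
	return len(b.entries) >= b.maxSize || time.Since(b.created) >= b.maxAge
}

// reset drains the batch and returns its contents for sending to APM-Server.
func (b *batch) reset() [][]byte {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := b.entries
	b.entries = nil
	return out
}
```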

Related Issues

Closes #311

@github-actions bot added the aws-λ-extension (AWS Lambda Extension) label on Sep 22, 2022
@apmmachine commented on Sep 22, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-10-19T07:33:07.255+0000

  • Duration: 8 min 20 sec

Test stats 🧪

  • Failed: 0
  • Passed: 176
  • Skipped: 2
  • Total: 178

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@lahsivjar lahsivjar requested a review from a team September 22, 2022 08:32
Resolved review threads (outdated): apmproxy/batch.go (×6), apmproxy/apmserver.go
@@ -692,7 +694,7 @@ func TestInfoRequestHangs(t *testing.T) {
lambdaServerInternals := newMockLambdaServer(t, logsapiAddr, eventsChannel, l)

eventsChain := []MockEvent{
-	{Type: InvokeStandardInfo, APMServerBehavior: Hangs, ExecutionDuration: 1, Timeout: 500},
+	{Type: InvokeStandardInfo, APMServerBehavior: Hangs, ExecutionDuration: 1, Timeout: 5},
@lahsivjar (Contributor, Author) commented on Sep 29, 2022
[For Reviewers] I am not exactly sure about the purpose of this test. Based on the comment, it seems to test that the extension times out if the APM-Server hangs. Before this PR, the event-processing loop for this test was terminated by the sending of the RuntimeDone event, since the configured timeout in the test was 500; however, after the implementation of pushback, the RuntimeDone event is no longer processed when no metadata is available. I have updated the Timeout to 5 so that the test times out based on the deadline condition.

(for ref: the logic to terminate the event processing loop)
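
For illustration, a minimal sketch of what deadline-based termination of the event-processing loop could look like; the function, channel, and package names are assumptions for this sketch, not the extension's actual code:

```go
package extension

import (
	"context"
	"time"
)

// processEvents drains log events until either the runtimeDone signal arrives
// or the invocation deadline (derived from the function timeout) expires.
func processEvents(ctx context.Context, deadline time.Time, events <-chan []byte, runtimeDone <-chan struct{}) {
	timer := time.NewTimer(time.Until(deadline))
	defer timer.Stop()
	for {
		select {
		case e := <-events:
			_ = e // forward the event to the APM data pipeline
		case <-runtimeDone:
			return // normal termination: the runtime reported it is done
		case <-timer.C:
			return // deadline condition: terminate even if runtimeDone never arrives
		case <-ctx.Done():
			return
		}
	}
}
```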

@axw (Member) left a comment:
Generally looks good, but I'd prefer if we could avoid having the logsapi code be aware of metadata. I think ideally logsapi would just send log events to a channel, and if that channel is full then the log handler can eventually time out and cause the Lambda platform to retry later.
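
A rough sketch of the design being suggested here, where the logsapi handler only sends events to a channel and relies on a timeout to make the Lambda platform retry later; the names, the timeout value, and the LogEvent type are illustrative assumptions:

```go
package logsapi

import (
	"encoding/json"
	"net/http"
	"time"
)

// LogEvent is a minimal stand-in for a Lambda Logs API event.
type LogEvent struct {
	Type   string          `json:"type"`
	Record json.RawMessage `json:"record"`
}

// handleLogEvents decodes the Logs API payload and tries to hand each event
// to the processing channel. If the channel stays full past the timeout, it
// returns a 5xx so the Lambda platform buffers the batch and retries later.
func handleLogEvents(out chan<- LogEvent, timeout time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var events []LogEvent
		if err := json.NewDecoder(r.Body).Decode(&events); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		for _, e := range events {
			select {
			case out <- e:
			case <-time.After(timeout):
				// Pushback: the consumer is not keeping up (e.g. metadata is
				// not yet available), so let the platform retry this batch.
				http.Error(w, "buffer full", http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}
```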

Resolved review threads (outdated): logsapi/event.go, logsapi/route_handlers.go, apmproxy/batch.go (×2), apmproxy/apmserver.go
@simitt (Contributor) left a comment:
Only left mostly cosmetic comments for now.

Resolved review threads (outdated): apmproxy/apmserver.go, apmproxy/batch.go
@@ -46,13 +46,15 @@ const (
 	defaultDataForwarderTimeout time.Duration = 3 * time.Second
 	defaultReceiverAddr = ":8200"
 	defaultAgentBufferSize int = 100
+	defaultMaxBatchSize int = 50
+	defaultMaxBatchAge time.Duration = 10 * time.Second
A contributor commented:
This seems like a long period of time for the extension. How did you come up with 10 seconds?

@lahsivjar (Contributor, Author) replied:
Nice catch. Considering the nature of Lambda functions, the duration is definitely too long. My original intention in adding an age factor to the batch was to account for functions with very low throughput; I got caught up in keeping the batch at a good size.

Ideally, I think this value should be chosen based on the type of workload, so maybe we can expose a configuration option for it(?)

For a good default, we need to balance the batch size and the age. Lambda allows a function to run for a maximum of 15 minutes, but I think most functions run for < 1 second(?). So we can set the default to 2 seconds, aiming to flush everything by the second invocation at the latest. What is your recommendation here?

A contributor replied:
2 seconds sounds reasonable. We might need to expose a config option for this at some point. Right now, given that the logs collection is in Tech Preview, I'd rather suggest disabling the collection if the maxBatchAge becomes an issue for customers.

@lahsivjar (Contributor, Author) replied:
@simitt

> I'd rather suggest disabling the collection if the maxBatchAge becomes an issue for customers.

FYI, the batching also applies to platform metrics. Is this in line with your expectations?

Resolved review thread: apmproxy/client.go

@felixbarny (Member) commented:
I've not been following this too closely, so sorry if this was mentioned already. The Logs API itself is already buffered; see https://docs.aws.amazon.com/lambda/latest/dg/runtimes-logs-api.html#runtimes-logs-api-buffering. Could we just rely on that config, rather than the user having to worry about buffering configs in both the Lambda Logs API settings and the APM extension settings?

@lahsivjar (Contributor, Author) commented on Oct 17, 2022

@felixbarny This PR aims to utilize the Lambda API's buffering instead of buffering events in the extension (prior to this PR we were buffering Lambda logs in the extension). So, for Lambda logs, if the extension is not able to consume the events, we return a 5xx to the Lambda API, making Lambda buffer the logs (or drop them if the buffer settings are exceeded). The ELASTIC_APM_LAMBDA_AGENT_DATA_BUFFER_SIZE config is for buffering events from APM agents. The batch config introduced in this PR controls the number of events in the payload sent to APM-Server (see the sketch below).

Let me know if this doesn't fully address your concern.
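
For context, a hedged sketch of what flushing a batch as a single payload to APM-Server might look like; the ndjson framing and /intake/v2/events path reflect the APM Server intake API, but the function and parameter names here are illustrative assumptions:

```go
package apmproxy

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
)

// flushBatch sends the buffered events as one newline-delimited JSON payload.
// metadata is the agent metadata line that must precede the events.
func flushBatch(ctx context.Context, client *http.Client, serverURL string, metadata []byte, events [][]byte) error {
	var body bytes.Buffer
	body.Write(metadata)
	body.WriteByte('\n')
	for _, e := range events {
		body.Write(e)
		body.WriteByte('\n')
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, serverURL+"/intake/v2/events", &body)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/x-ndjson")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("APM-Server returned status %d", resp.StatusCode)
	}
	return nil
}
```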

@felixbarny (Member) replied:
Thanks for bringing me up to speed. Makes a lot of sense!

@lahsivjar (Contributor, Author) commented on Oct 18, 2022

@axw I have updated the PR so that logsapi is no longer aware of metadata availability. Pushback to Lambda in case of errors is not complete yet; I will create a separate issue to track that. The diagram below gives the gist of the implementation in the current PR:

[diagram: pr314]

To summarize:

What the current PR does:

  1. Batches events for the request to APM-Server
  2. Buffers events in the extension until metadata is available (a minimal sketch of this buffering follows the lists below)
  3. If the buffer is full, the extension pushes back on the Lambda API

What the current PR doesn't do:

  1. Push back on the Lambda Logs API for cases other than the internal buffer being full.
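
A minimal sketch, under stated assumptions, of what "buffer until metadata is available" could look like in the forwarding goroutine; the metadataAvailable channel and the other names are illustrative (and, per the review feedback below, such a channel would ideally be encapsulated within the apmproxy package):

```go
package apmproxy

import "context"

// forward buffers events until metadata has been received, then flushes the
// buffer and forwards subsequent events as they arrive.
func forward(ctx context.Context, events <-chan []byte, metadataAvailable <-chan struct{}, send func([]byte)) {
	var pending [][]byte
	haveMetadata := false
	for {
		select {
		case <-metadataAvailable:
			haveMetadata = true
			for _, e := range pending {
				send(e) // flush everything buffered while metadata was missing
			}
			pending = nil
			metadataAvailable = nil // fired once; a nil channel is never selected again
		case e := <-events:
			if haveMetadata {
				send(e)
			} else {
				pending = append(pending, e) // buffer until metadata arrives
			}
		case <-ctx.Done():
			return
		}
	}
}
```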

@lahsivjar lahsivjar requested a review from axw October 18, 2022 11:05
@axw (Member) left a comment:
Looks good overall. I'd like to get rid of the metadataAvailable channel if possible. I think it's something that can be encapsulated within the apmproxy package.

Resolved review threads (outdated): logsapi/event.go (×2), apmproxy/apmserver.go (×3), apmproxy/batch_test.go (×2), apmproxy/apmserver_test.go
@lahsivjar lahsivjar requested a review from axw October 19, 2022 05:06
@axw (Member) left a comment:
Just noticed one other issue, otherwise LGTM.

Resolved review thread (outdated): apmproxy/apmserver.go
@lahsivjar lahsivjar merged commit b54644c into elastic:main Oct 25, 2022
@lahsivjar lahsivjar deleted the 311_buffer_data_tk2 branch October 25, 2022 01:40
Labels: aws-λ-extension (AWS Lambda Extension)

Linked issue closed by this pull request: Batch data collected by log streams before sending to APM-Server

5 participants