Batch lambda logs API data before sending to APM-Server #314

Merged · 16 commits merged into elastic:main on Oct 25, 2022

Conversation

@lahsivjar (Contributor) commented on Sep 22, 2022

Motivation

This PR is motivated by two factors:

  1. Currently, data collected from the Logs API is sent to APM-Server one event at a time, as it is received. This PR batches the data before sending it to APM-Server. Batches are bounded both by the number of events they contain and by their age, measured from the first entry added to them (see the sketch after this list).
  2. Before this PR, the extension drops data until metadata from the agents is available. This PR adds logic to buffer events in the extension and, if the buffer is full, to push back on the Lambda Logs API until metadata is available. Because we push back on all log events from the Logs API, including platform.runtimeDone, this behavior can also affect the lifecycle of the extension. If the Lambda Logs API drops logs due to resource constraints caused by our pushback, it publishes a new platform.logsDropped event, which we log from the extension. How long the Logs API retains this event is not clear; it might also be dropped, although that has not happened in any of my tests so far. However, assuming that metadata becomes available without much delay, this should not be a deal-breaker.
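
A minimal sketch of the size- and age-based batching described in point 1. The type and method names here are illustrative assumptions, not the actual implementation in this PR:

```go
package apmproxy

import (
	"sync"
	"time"
)

// batch accumulates serialized events and reports when it should be flushed,
// either because it has reached maxSize entries or because the first entry
// is older than maxAge.
type batch struct {
	mu      sync.Mutex
	entries [][]byte
	created time.Time
	maxSize int
	maxAge  time.Duration
}

func newBatch(maxSize int, maxAge time.Duration) *batch {
	return &batch{maxSize: maxSize, maxAge: maxAge}
}

// add appends an event and returns true if the batch should be flushed.
func (b *batch) add(event []byte) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.entries) == 0 {
		b.created = time.Now() // age is measured from the first entry
	}
	b.entries = append(b.entries, event)
	return len(b.entries) >= b.maxSize || time.Since(b.created) >= b.maxAge
}

// reset drains the batch and returns its contents for sending to APM-Server.
func (b *batch) reset() [][]byte {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := b.entries
	b.entries = nil
	return out
}
```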

Related Issues

Closes #311

@github-actions bot added the aws-λ-extension (AWS Lambda Extension) label on Sep 22, 2022
@apmmachine commented on Sep 22, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-10-19T07:33:07.255+0000

  • Duration: 8 min 20 sec

Test stats 🧪

  • Failed: 0
  • Passed: 176
  • Skipped: 2
  • Total: 178

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@lahsivjar lahsivjar requested a review from a team September 22, 2022 08:32
Resolved review threads (outdated): apmproxy/batch.go (×6), apmproxy/apmserver.go
@@ -692,7 +694,7 @@ func TestInfoRequestHangs(t *testing.T) {
lambdaServerInternals := newMockLambdaServer(t, logsapiAddr, eventsChannel, l)

eventsChain := []MockEvent{
-	{Type: InvokeStandardInfo, APMServerBehavior: Hangs, ExecutionDuration: 1, Timeout: 500},
+	{Type: InvokeStandardInfo, APMServerBehavior: Hangs, ExecutionDuration: 1, Timeout: 5},
@lahsivjar (Contributor, Author) commented on Sep 29, 2022
[For Reviewers] I am not exactly sure about the purpose of this test. Based on the comment, it seems to test that the extension times out if the APM-Server hangs. Before this PR, the event-processing loop for this test was terminated by the sending of the RuntimeDone event, since the configured timeout in the test was 500; however, after the implementation of pushback, the RuntimeDone event is no longer processed when no metadata is available. I have updated the Timeout to 5 so that the test times out based on the deadline condition.

(for ref: the logic to terminate the event processing loop)
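
For illustration, a minimal sketch of what deadline-based termination of the event-processing loop could look like; the function, channel, and package names are assumptions for this sketch, not the extension's actual code:

```go
package extension

import (
	"context"
	"time"
)

// processEvents drains log events until either the runtimeDone signal arrives
// or the invocation deadline (derived from the function timeout) expires.
func processEvents(ctx context.Context, deadline time.Time, events <-chan []byte, runtimeDone <-chan struct{}) {
	timer := time.NewTimer(time.Until(deadline))
	defer timer.Stop()
	for {
		select {
		case e := <-events:
			_ = e // forward the event to the APM data pipeline
		case <-runtimeDone:
			return // normal termination: the runtime reported it is done
		case <-timer.C:
			return // deadline condition: terminate even if runtimeDone never arrives
		case <-ctx.Done():
			return
		}
	}
}
```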

@axw (Member) left a comment:
Generally looks good, but I'd prefer if we could avoid having the logsapi code be aware of metadata. I think ideally logsapi would just send log events to a channel, and if that channel is full then the log handler can eventually time out and cause the Lambda platform to retry later.
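
A rough sketch of the design being suggested here, where the logsapi handler only sends events to a channel and relies on a timeout to make the Lambda platform retry later; the names, the timeout value, and the LogEvent type are illustrative assumptions:

```go
package logsapi

import (
	"encoding/json"
	"net/http"
	"time"
)

// LogEvent is a minimal stand-in for a Lambda Logs API event.
type LogEvent struct {
	Type   string          `json:"type"`
	Record json.RawMessage `json:"record"`
}

// handleLogEvents decodes the Logs API payload and tries to hand each event
// to the processing channel. If the channel stays full past the timeout, it
// returns a 5xx so the Lambda platform buffers the batch and retries later.
func handleLogEvents(out chan<- LogEvent, timeout time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var events []LogEvent
		if err := json.NewDecoder(r.Body).Decode(&events); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		for _, e := range events {
			select {
			case out <- e:
			case <-time.After(timeout):
				// Pushback: the consumer is not keeping up (e.g. metadata is
				// not yet available), so let the platform retry this batch.
				http.Error(w, "buffer full", http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}
```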

Resolved review threads (outdated): logsapi/event.go, logsapi/route_handlers.go, apmproxy/batch.go (×2), apmproxy/apmserver.go
@simitt (Contributor) left a comment:
Only left mostly cosmetic comments for now.

Resolved review threads (outdated): apmproxy/apmserver.go, apmproxy/batch.go
@@ -46,13 +46,15 @@ const (
 	defaultDataForwarderTimeout time.Duration = 3 * time.Second
 	defaultReceiverAddr = ":8200"
 	defaultAgentBufferSize int = 100
+	defaultMaxBatchSize int = 50
+	defaultMaxBatchAge time.Duration = 10 * time.Second
A contributor commented:
This seems like a long period of time for the extension. How did you come up with 10 seconds?

@lahsivjar (Contributor, Author) replied:
Nice catch. Considering the nature of Lambda functions, the duration is definitely too long. My original intention in adding an age factor to the batch was to account for functions with very low throughput; I got caught up in keeping the batch at a good size.

Ideally, I think this value should be chosen based on the type of workload, so maybe we can expose a configuration option for it(?)

For a good default, we need to balance the batch size and the age. Lambda allows a function to run for a maximum of 15 minutes, but I think most functions run for < 1 second(?). So we can set the default to 2 seconds, aiming to flush everything by the second invocation at the latest. What is your recommendation here?

A contributor replied:
2 seconds sounds reasonable. We might need to expose a config option for this at some point. Right now, given that the logs collection is in Tech Preview, I'd rather suggest disabling the collection if the maxBatchAge becomes an issue for customers.

@lahsivjar (Contributor, Author) replied:
@simitt

> I'd rather suggest disabling the collection if the maxBatchAge becomes an issue for customers.

FYI, the batching also applies to platform metrics. Is this in line with your expectations?

Resolved review thread: apmproxy/client.go

@felixbarny (Member) commented:
I've not been following this too closely, so sorry if this was mentioned already. The Logs API itself is already buffered; see https://docs.aws.amazon.com/lambda/latest/dg/runtimes-logs-api.html#runtimes-logs-api-buffering. Could we just rely on that config, rather than the user having to worry about buffering configs in both the Lambda Logs API settings and the APM extension settings?

@lahsivjar (Contributor, Author) commented on Oct 17, 2022

@felixbarny This PR aims to utilize the Lambda API's buffering instead of buffering events in the extension (prior to this PR we were buffering Lambda logs in the extension). So, for Lambda logs, if the extension is not able to consume the events, we return a 5xx to the Lambda API, making Lambda buffer the logs (or drop them if the buffer settings are exceeded). The ELASTIC_APM_LAMBDA_AGENT_DATA_BUFFER_SIZE config is for buffering events from APM agents. The batch config introduced in this PR controls the number of events in the payload sent to APM-Server (see the sketch below).

Let me know if this doesn't fully address your concern.
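
For context, a hedged sketch of what flushing a batch as a single payload to APM-Server might look like; the ndjson framing and /intake/v2/events path reflect the APM Server intake API, but the function and parameter names here are illustrative assumptions:

```go
package apmproxy

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
)

// flushBatch sends the buffered events as one newline-delimited JSON payload.
// metadata is the agent metadata line that must precede the events.
func flushBatch(ctx context.Context, client *http.Client, serverURL string, metadata []byte, events [][]byte) error {
	var body bytes.Buffer
	body.Write(metadata)
	body.WriteByte('\n')
	for _, e := range events {
		body.Write(e)
		body.WriteByte('\n')
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, serverURL+"/intake/v2/events", &body)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/x-ndjson")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("APM-Server returned status %d", resp.StatusCode)
	}
	return nil
}
```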

@felixbarny (Member) replied:
Thanks for bringing me up to speed. Makes a lot of sense!

@lahsivjar (Contributor, Author) commented on Oct 18, 2022

@axw I have updated the PR so that logsapi is no longer aware of metadata availability. Pushback to Lambda in case of errors is not complete yet; I will create a separate issue to track that. The diagram below gives the gist of the implementation in the current PR:

[diagram: pr314]

To summarize:

What the current PR does:

  1. Batches events for the request to APM-Server
  2. Buffers events in the extension until metadata is available (a minimal sketch of this buffering follows the lists below)
  3. If the buffer is full, the extension pushes back on the Lambda API

What the current PR doesn't do:

  1. Push back on the Lambda Logs API for cases other than the internal buffer being full.
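
A minimal sketch, under stated assumptions, of what "buffer until metadata is available" could look like in the forwarding goroutine; the metadataAvailable channel and the other names are illustrative (and, per the review feedback below, such a channel would ideally be encapsulated within the apmproxy package):

```go
package apmproxy

import "context"

// forward buffers events until metadata has been received, then flushes the
// buffer and forwards subsequent events as they arrive.
func forward(ctx context.Context, events <-chan []byte, metadataAvailable <-chan struct{}, send func([]byte)) {
	var pending [][]byte
	haveMetadata := false
	for {
		select {
		case <-metadataAvailable:
			haveMetadata = true
			for _, e := range pending {
				send(e) // flush everything buffered while metadata was missing
			}
			pending = nil
			metadataAvailable = nil // fired once; a nil channel is never selected again
		case e := <-events:
			if haveMetadata {
				send(e)
			} else {
				pending = append(pending, e) // buffer until metadata arrives
			}
		case <-ctx.Done():
			return
		}
	}
}
```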

@lahsivjar lahsivjar requested a review from axw October 18, 2022 11:05
@axw (Member) left a comment:
Looks good overall. I'd like to get rid of the metadataAvailable channel if possible. I think it's something that can be encapsulated within the apmproxy package.

Resolved review threads (outdated): logsapi/event.go (×2), apmproxy/apmserver.go (×3), apmproxy/batch_test.go (×2), apmproxy/apmserver_test.go
@lahsivjar lahsivjar requested a review from axw October 19, 2022 05:06
@axw (Member) left a comment:
Just noticed one other issue, otherwise LGTM.

Resolved review thread (outdated): apmproxy/apmserver.go
@lahsivjar lahsivjar merged commit b54644c into elastic:main Oct 25, 2022
@lahsivjar lahsivjar deleted the 311_buffer_data_tk2 branch October 25, 2022 01:40
Labels: aws-λ-extension (AWS Lambda Extension)

Linked issue closed by this pull request: Batch data collected by log streams before sending to APM-Server

5 participants