Batch lambda logs API data before sending to APM-Server #314
Conversation
```diff
@@ -692,7 +694,7 @@ func TestInfoRequestHangs(t *testing.T) {
 	lambdaServerInternals := newMockLambdaServer(t, logsapiAddr, eventsChannel, l)

 	eventsChain := []MockEvent{
-		{Type: InvokeStandardInfo, APMServerBehavior: Hangs, ExecutionDuration: 1, Timeout: 500},
+		{Type: InvokeStandardInfo, APMServerBehavior: Hangs, ExecutionDuration: 1, Timeout: 5},
```
[For Reviewers] I am not exactly sure about the purpose of this test. Based on the comment, it seems to verify that the extension times out when the APM Server hangs. Before this PR, the event-processing loop for this test was terminated by the `RuntimeDone` event, since the configured timeout in the test was 500. However, after the implementation of pushback, the `RuntimeDone` event is no longer processed when no metadata is available. I have updated `Timeout` to 5 so that the test times out based on the deadline condition.
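For illustration, a rough sketch of the behaviour described here, with hypothetical names (`processEvents`, `LogEvent`) rather than the extension's actual code: without metadata, `runtimeDone` no longer exits the loop, so only the deadline does.

```go
package extension

import (
	"context"
	"time"
)

// LogEvent is a simplified stand-in for a Logs API event (hypothetical).
type LogEvent struct {
	Type string
}

// processEvents sketches the loop described above: while metadata is not
// yet available, platform.runtimeDone no longer ends the loop, so the only
// way out is the deadline derived from the function timeout.
func processEvents(ctx context.Context, logEvents <-chan LogEvent, deadline time.Time, metadataAvailable func() bool) {
	timer := time.NewTimer(time.Until(deadline))
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			return // deadline reached, e.g. when the APM Server hangs
		case e := <-logEvents:
			if e.Type == "platform.runtimeDone" && !metadataAvailable() {
				continue // pushback: runtimeDone is not processed without metadata
			}
			// forward or otherwise handle the event here
		}
	}
}
```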
Generally looks good, but I'd prefer if we could avoid having the logsapi code be aware of metadata. I think ideally logsapi would just send log events to a channel, and if that channel is full then the log handler can eventually time out and cause the Lambda platform to retry later.
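A minimal sketch of that shape, assuming hypothetical names (`handleLogs`, a raw payload channel); this is not the extension's actual handler:

```go
package extension

import (
	"io"
	"net/http"
	"time"
)

// handleLogs sketches the suggestion above: the logsapi handler only knows
// about a channel of raw event payloads. If the channel stays full past a
// short timeout, the handler answers with a 5xx so the Lambda Logs API keeps
// the batch buffered and retries the delivery later.
func handleLogs(out chan<- []byte) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "read error", http.StatusBadRequest)
			return
		}
		select {
		case out <- body:
			w.WriteHeader(http.StatusOK)
		case <-time.After(200 * time.Millisecond):
			// Consumer is busy (e.g. still waiting for metadata): push back
			// and let the platform retry later.
			http.Error(w, "buffer full", http.StatusServiceUnavailable)
		}
	}
}
```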
Only left mostly cosmetic comments for now.
apmproxy/client.go (Outdated)

```diff
@@ -46,13 +46,15 @@ const (
 	defaultDataForwarderTimeout time.Duration = 3 * time.Second
 	defaultReceiverAddr                       = ":8200"
 	defaultAgentBufferSize      int           = 100
+	defaultMaxBatchSize         int           = 50
+	defaultMaxBatchAge          time.Duration = 10 * time.Second
```
This seems like a long period of time for the extension. How did you come up with 10 seconds?
Nice catch. Considering the nature of a Lambda function, that duration is definitely too big. My original intention in adding an age factor to the batch was to account for functions with very low throughput; I got caught up in keeping the batch at a good size. Ideally, this value should be chosen based on the type of workload, so maybe we can expose a configuration option for it(?).
For a good default, we need to balance the batch size against the batch age. Lambda allows a function to run for at most 15 minutes, but I think most functions take < 1 second(?). So we could keep the default at 2 seconds, aiming to flush everything by the second invocation at the latest. What is your recommendation here?
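To make the trade-off concrete, here is a small sketch of a flush loop that respects both limits; `runBatcher` and its parameters are hypothetical names, not the PR's implementation:

```go
package extension

import "time"

// runBatcher sketches the size/age balance discussed above: a batch is
// flushed either when it reaches maxBatchSize entries or when maxBatchAge
// has elapsed since the first entry was buffered, whichever comes first.
func runBatcher(in <-chan []byte, flush func([][]byte), maxBatchSize int, maxBatchAge time.Duration) {
	var batch [][]byte
	timer := time.NewTimer(maxBatchAge)
	if !timer.Stop() {
		<-timer.C // drain; the age window only starts with the first entry
	}
	for {
		select {
		case data, ok := <-in:
			if !ok {
				if len(batch) > 0 {
					flush(batch) // flush whatever is left on shutdown
				}
				return
			}
			if len(batch) == 0 {
				timer.Reset(maxBatchAge) // age window starts at the first entry
			}
			batch = append(batch, data)
			if len(batch) >= maxBatchSize {
				flush(batch)
				batch = nil
				if !timer.Stop() {
					select {
					case <-timer.C:
					default:
					}
				}
			}
		case <-timer.C:
			if len(batch) > 0 {
				flush(batch) // age limit reached before the size limit
				batch = nil
			}
		}
	}
}
```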
2 seconds sounds reasonable. We might need to expose a config option for this at some point. Right now, given that logs collection is in Tech Preview, I'd rather suggest disabling the collection if the `maxBatchAge` becomes an issue for customers.
> I'd rather suggest to disable the collection if the maxBatchAge becomes an issue for customers.

FYI, the batching also applies to platform metrics. Is this as per your expectations?
> 2 seconds sounds reasonable. We might need to expose a config option for this at some point. Right now, given that the logs collection is in Tech Preview, I'd rather suggest to disable the collection if the maxBatchAge becomes an issue for customers.

I've not been following this too closely, so sorry if this was mentioned already. The Logs API itself is already buffered, see https://docs.aws.amazon.com/lambda/latest/dg/runtimes-logs-api.html#runtimes-logs-api-buffering. Could we just rely on this config rather than the user having to worry about buffering settings in both the Lambda Logs API and the APM extension?
@felixbarny This PR aims to utilize the Lambda API's buffering instead of buffering events in the extension (prior to this PR we were buffering Lambda logs in the extension). So, for Lambda logs, if the extension is not able to consume the events, we return a 5xx to the Lambda API, making Lambda buffer the logs (or drop them if the buffer settings are exceeded). The config … Let me know if this doesn't address your concern properly.
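For reference, the buffering knobs felixbarny points to live in the Logs API subscription request. A sketch of just that block follows; the field names are as documented by AWS, while the values are illustrative, not the extension's actual settings.

```go
package extension

// Buffering mirrors the "buffering" block of the Logs API subscription
// request mentioned above (field names per the AWS Logs API docs); the
// values below are illustrative only.
type Buffering struct {
	MaxItems  uint32 `json:"maxItems"`  // max number of events buffered before delivery
	MaxBytes  uint32 `json:"maxBytes"`  // max total bytes buffered before delivery
	TimeoutMS uint32 `json:"timeoutMs"` // max time (ms) to buffer before delivery
}

// exampleBuffering shows how these knobs could be tuned instead of (or in
// addition to) batching inside the extension.
var exampleBuffering = Buffering{
	MaxItems:  1000,
	MaxBytes:  262144,
	TimeoutMS: 1000,
}
```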
Thanks for bringing me up to speed. Makes a lot of sense!
@axw I have updated the PR to avoid … To summarize:

What the current PR does: …

What the current PR doesn't do: …
Looks good overall. I'd like to get rid of the `metadataAvailable` channel if possible. I think it's something that can be encapsulated within the apmproxy package.
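One possible shape for that encapsulation, purely as a sketch with made-up names (`SetMetadata`, `waitForMetadata`), not the actual apmproxy API:

```go
package apmproxy

import "sync"

// Client sketches how the metadata-available signal could stay internal to
// the apmproxy package rather than being handed around as a channel. The
// names here are illustrative, not the package's real API.
type Client struct {
	metadataOnce  sync.Once
	metadataReady chan struct{}
}

func NewClient() *Client {
	return &Client{metadataReady: make(chan struct{})}
}

// SetMetadata records that metadata has been received and unblocks any
// forwarder waiting to send batched data.
func (c *Client) SetMetadata() {
	c.metadataOnce.Do(func() { close(c.metadataReady) })
}

// waitForMetadata is what the forwarding path would call internally before
// flushing a batch to the APM Server.
func (c *Client) waitForMetadata() {
	<-c.metadataReady
}
```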
Just noticed one other issue, otherwise LGTM.
Motivation

The PR is motivated by 2 factors:

… `platform.runtimeDone`. In case the Lambda Logs API drops logs due to resource constraints generated by our pushback, a new event `platform.logsDropped` is published, which we log from the extension. The retention of this log event by the Logs API is not clear; it is possible that this log event is also dropped, but that hasn't occurred in any of my tests so far. However, assuming that metadata is available without much delay, this might not be a deal-breaker.

Related Issues
Closes #311