Feature request: Buffer debug logs and only emit on exception #4432

hiselitelordship opened this issue May 31, 2024 · 9 comments
@hiselitelordship

Use case

CloudWatch Logs are vital, but expensive. To control costs you switch the log level from DEBUG to INFO. When an exception happens you have very little context to debug what happened. You switch back to DEBUG level and hope(!) the error happens again.

Solution/User Experience

With a buffer configured, debug logs would not be written to CloudWatch Logs unless an exception is encountered. In this way, CWL stays low cost and minimalist. But, like a stack trace, when an exception is encountered the buffer is flushed to CWL and all the context of the preceding events is captured.

The buffer should not create memory pressure and become the cause of exceptions. With something like a ring buffer, the oldest entries would be discarded in favour of the freshest. There is also likely a point at which a flush to CWL takes too long, especially as an execution approaches the 900s execution limit. Therefore the memory allocation or number of buffered lines should be configurable.
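
A minimal sketch of the ring-buffer idea, using collections.deque with an entry-count cap (the real feature might cap by bytes instead; all names here are illustrative):

from collections import deque

# Hypothetical sketch: retain only the freshest N debug records.
MAX_BUFFERED_LINES = 1000
debug_buffer: deque[str] = deque(maxlen=MAX_BUFFERED_LINES)

def buffered_debug(message: str) -> None:
    # Oldest entries are evicted automatically once maxlen is reached
    debug_buffer.append(message)

def flush_buffer_on_exception() -> None:
    # Called from an exception path: emit everything retained, oldest first
    while debug_buffer:
        print(debug_buffer.popleft())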

Alternative solutions

No response


@hiselitelordship hiselitelordship added feature-request feature request triage Pending triage from maintainers labels May 31, 2024

@heitorlessa
Contributor

(haven't forgotten, adjusting to jet lag from ANZ as I'm back today) - this is a great idea and we were waiting for customer demand to consider it.

I'll share thoughts and questions tomorrow

@heitorlessa heitorlessa removed the triage Pending triage from maintainers label Jun 3, 2024
@heitorlessa heitorlessa moved this from Triage to Ideas in Powertools for AWS Lambda (Python) Jun 4, 2024
@heitorlessa
Contributor

unless an exception is encountered

This is the tricky part. As of now, I can think of these scenarios:

  • Handled exceptions. Code is within a try/except block and customers could do one of the following to flush the buffer (see the sketch after this list):
    • [explicit] call logger.hypothetical_flush_debug_buffer() as part of the except block
    • [implicit] call logger.exception("something went wrong"), which by default captures and formats a stack trace, but we'd transparently call logger.hypothetical_flush_debug_buffer(). Here be dragons, as some customers could love or hate this transparent fashion (e.g., how do I disable this?).
  • Unhandled exceptions. The function crashed due to an unexpected and unhandled exception. Lambda doesn't provide an unhandled-exception mechanism (Bug: log_uncaught_exceptions only works locally and not on lambda #1798), but we could work around it in two ways:
    • If the customer sets Logger(log_uncaught_exceptions=True) and uses @logger.inject_lambda_context, we could catch all exceptions, log and flush the debug buffer, and re-raise.
    • We could register a no-op internal extension (separate thread), giving us 500ms for any cleanup upon timeout and such, including background jobs. That, however, would come at a cost (latency to be measured).
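
A rough sketch of the two handled-exception paths, reusing the hypothetical method name from above (the buffering configuration itself is omitted since the API isn't designed yet):

from aws_lambda_powertools import Logger

logger = Logger(service="payment")

def risky_operation():
    raise ValueError("simulated failure")

def lambda_handler(event, context):
    try:
        risky_operation()
    except ValueError:
        # [explicit] flush the hypothetical buffer ourselves before re-raising
        logger.hypothetical_flush_debug_buffer()
        raise
    except Exception:
        # [implicit] logger.exception() formats the stack trace and would
        # transparently flush the buffer under this proposal
        logger.exception("something went wrong")
        raise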

Did I miss anything?

I'd love to hear your thoughts on both.

@hiselitelordship
Author

I imagine the new behaviour would be Logger(debug_on_exception=True). Explicit and backward compatible. Existing code would therefore not buffer and every debug line is published as expected. No surprises for anyone already using Powertools.

If I set this new option, I expect the logger.exception to flush out the buffer like a stack trace – that is why I set the option. If I didn’t set the option above, then I don’t want this buffering feature and there will be no stack trace or buffering.

But your point remains. If I want to selectively stop publishing the “stack trace” I can’t.

@heitorlessa heitorlessa moved this from Ideas to Backlog in Powertools for AWS Lambda (Python) Jun 7, 2024
@heitorlessa
Contributor

heitorlessa commented Jun 7, 2024

I expect the logger.exception to flush out the buffer like a stack trace
Explicit and backward compatible

Yeah, that's the direction we'd go given our tenets. On Monday (17th), I'll summarize actionable points and tests we've made while we wait for customer +1s.

NOTE: we'll implement this across all languages once we have sufficient demand.

@heitorlessa heitorlessa added help wanted Could use a second pair of eyes/hands need-customer-feedback Requires more customers feedback before making or revisiting a decision labels Jun 9, 2024
@xkortex

xkortex commented Jun 11, 2024

First, I love the idea of this feature; it's definitely a "have your observability cake and eat it too" moment.

Logger(debug_on_exception=True)

+1, that seems like the best way to start. An env var that also sets it would be additionally welcome, as users could then turn on debug tracing via click-ops in response to an unforeseen issue, without a code deploy.
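
For instance, something like this; both the debug_on_exception flag and the env var name are hypothetical, purely to illustrate the idea:

import os

from aws_lambda_powertools import Logger

# Hypothetical env var mirroring the proposed constructor flag, so it can be
# flipped from the console without a redeploy
debug_on_exception = os.getenv("POWERTOOLS_DEBUG_ON_EXCEPTION", "false").lower() == "true"
logger = Logger(service="payment", debug_on_exception=debug_on_exception)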

If customer sets Logger(log_uncaught_exceptions=True) and uses @logger.inject_lambda_context, we could catch all exceptions, log and flush the debug buffer, and re-raise.

This seems to me like the least surprising behavior, given a function wrapped with @logger.inject_lambda_context. That seems the most logical hardpoint to catch broad exceptions.

As for in-line vs separate thread, whew, that's a challenging one. I definitely see the appeal of the lambda returning as fast as possible on error, but you really don't want your error-handling code to itself get swallowed by the void, so you'd especially want to make sure the thread kicked off has a very high degree of success at emitting the error. Unless extension threads are able to keep the lambda alive long enough to complete the handler, I'd be concerned about the lambda being scaled-in mid-process. But I would think the error packaging would be fast enough that it'd be unlikely to drop (I'm just not knowledgeable enough about lambda innards to know what keeps one alive).

Perhaps you could define a handler interface with a default stderr implementation and let users inject their own function, so they could override the error rollup behavior, use threads or not, and customize the debug sink (maybe they want to dump to S3 or Datadog or whatever).
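
A minimal sketch of that kind of pluggable interface (all names here are invented for illustration, not an actual Powertools API):

import sys
from typing import Callable, Iterable

# A flush handler receives the buffered records and decides where they go
FlushHandler = Callable[[Iterable[str]], None]

def stderr_flush_handler(records: Iterable[str]) -> None:
    # Default sink: write buffered records to stderr
    for record in records:
        print(record, file=sys.stderr)

def flush_buffer(records: Iterable[str], handler: FlushHandler = stderr_flush_handler) -> None:
    # Users could inject a handler that ships records to S3, Datadog, etc.,
    # with or without a background thread
    handler(records)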

@heitorlessa
Contributor

@sthulb when you can, would you mind sharing the memory test you did for verbose log records kept in a circular buffer for this feature?

We're handling some critical operational bits this week, hence my extended delay in responding. We're still committed to doing this feature and plan to deprecate log sampling (as this is a much better alternative).

@leandrodamascena
Contributor

Hey everyone! We started a RFC draft in this discussion: aws-powertools/powertools-lambda-typescript#3410

Even though this discussion belongs to the Powertools TypeScript repository, this will be implemented in all runtimes - if the RFC is accepted.

Anyone who wants to participate in the discussion is very welcome; we intend to continue the discussions until mid-January and then decide how the implementation will look.

Thank you

@leandrodamascena leandrodamascena moved this from Backlog to Pending review in Powertools for AWS Lambda (Python) Dec 30, 2024
@dreamorosi dreamorosi moved this from Pending review to Ideas in Powertools for AWS Lambda (Python) Jan 24, 2025
@leandrodamascena
Contributor

Hi everyone, I'm here with some updates on this issue. We were discussing the implementation and API design in this discussion and came to some decisions. I'll copy @dreamorosi's original message and change the examples/changes to Python instead of TypeScript.

Thanks for the thoughtful answer.

After some additional discussion with the team, I think we settled on an API that should accommodate all the requirements that we discussed.

Below are the decision points that we can use to guide the implementation:

When do we buffer

With this in mind, I think my final position would be to follow these rules:

  • We use _X_AMZN_TRACE_ID (as you suggested) as the key to group the logs that belong to an invocation and buffer them
  • If you emit logs during the INIT phase (aka outside the scope of an invocation), these logs are never buffered, because _X_AMZN_TRACE_ID is not yet present there

Regarding the second point, the experience would look like this:

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig
from aws_lambda_powertools.utilities.typing import LambdaContext

# Initialize the logger with a buffer
buffer_config = BufferConfig(max_bytes=10240)
logger = Logger(service="payment", buffer_config=buffer_config, level="DEBUG")

logger.debug("init complete")  # Not buffered

def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.debug("start request")  # Buffered

which translates, in your logs, to this:

Request 1

{ "message": "init complete", "level": "DEBUG", ...other keys }

Request 2

// no logs

What do we buffer

By default, when buffering is enabled, we'll buffer only log levels DEBUG and TRACE. We settled on this default because we want to nudge customers to make better design decisions and emit logs at the correct log level. In principle, diagnostic data that might be superfluous unless there's an error should be emitted at one of these levels, while less verbose levels like INFO and WARN should be used for data that should be logged during regular operations too.

With this in mind, logger would work like this when buffering is enabled:

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig
from aws_lambda_powertools.utilities.typing import LambdaContext

# Initialize the logger with a buffer
buffer_config = BufferConfig(max_bytes=10240)
logger = Logger(service="payment", buffer_config=buffer_config)  # INFO is the default log level

def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.debug("I am a debug")  # buffered
    logger.info("I am an info")  # NOT buffered
    logger.warning("I am a warning")  # NOT buffered
    logger.error("I am an error")  # NOT buffered

We understand, however, that some customers might want to make different tradeoffs, either because they don't fully control the logs or because they want to reduce costs to a minimum and really only emit certain logs, while still having them in a buffer for when things go wrong.

To accommodate this use case, we're also giving the flexibility to change the level at which the logger starts buffering. For example, if I want to buffer up to WARN, I can do so by setting an option in the buffer configuration:

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig
from aws_lambda_powertools.utilities.typing import LambdaContext

# Initialize the logger with a buffer that captures up to WARN
buffer_config = BufferConfig(max_bytes=10240, buffer_at_verbosity="WARN")
logger = Logger(service="payment", buffer_config=buffer_config)  # INFO is the default log level

def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.debug("I am a debug")  # buffered
    logger.info("I am an info")  # buffered
    logger.warning("I am a warning")  # buffered
    logger.error("I am an error")  # NOT buffered

How do we empty the buffer

When it comes to flushing the log buffer, after a lot of discussion with @leandrodamascena I think we'd like to settle on three APIs:

  1. manual imperative flushing
  2. flush on logger.error() (enabled by default when buffering is active, can be opted-out)
  3. flush on unhandled error when using decorator (enabled by default when buffering is active, can be opted-out)

manual flushing

The first one is pretty straightforward: customers should be able to flush the buffer at any time by calling a method:

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig
from aws_lambda_powertools.utilities.typing import LambdaContext

# Initialize the logger with a buffer
buffer_config = BufferConfig(max_bytes=10240)
logger = Logger(service="payment", buffer_config=buffer_config)  # INFO is the default log level

def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.debug("I am a debug")  # buffered
    logger.flush_buffer()

Flush on error log

When buffering is enabled, we'll flush the buffer whenever customers call logger.error(). This allows customers to get all the data about an invocation when they need it the most, aka when there's an error.

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig
from aws_lambda_powertools.utilities.typing import LambdaContext

# Initialize the logger with a buffer
buffer_config = BufferConfig(max_bytes=10240)
logger = Logger(service="payment", buffer_config=buffer_config)  # INFO is the default log level

def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.debug("I am a debug")  # buffered

    try:
        function_doesnt_exist()
    except Exception:
        logger.error("unable to perform operation")  # buffer is flushed

Customers can also opt-out from this if they want to have tight control over the buffering mechanism:

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig
from aws_lambda_powertools.utilities.typing import LambdaContext

# Initialize the logger with a buffer, opting out of flush-on-error
buffer_config = BufferConfig(max_bytes=10240, flush_on_error_log=False)  # True by default
logger = Logger(service="payment", buffer_config=buffer_config)  # INFO is the default log level

def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.debug("I am a debug")  # buffered

    try:
        function_doesnt_exist()
    except Exception:
        logger.error("unable to perform operation")  # not buffered, written as a normal log

    logger.flush_buffer()

flush on unhandled error

When buffering is enabled, and customers are using our inject_lambda_context decorator, we can also intercept unhandled errors and flush the buffer before re-raising the customer's error.
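
A sketch of how that could look, reusing the buffer configuration from the examples above (under this proposal the flush-on-unhandled-error behaviour is on by default when buffering is active):

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig
from aws_lambda_powertools.utilities.typing import LambdaContext

buffer_config = BufferConfig(max_bytes=10240)
logger = Logger(service="payment", buffer_config=buffer_config)  # INFO is the default log level

@logger.inject_lambda_context
def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.debug("I am a debug")  # buffered
    raise RuntimeError("something unexpected")  # buffer is flushed, then the error is re-raised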

What happens when the buffer is full

As discussed above and in other threads, we'll use a circular buffer structure. This means that, depending on the max_bytes setting and how many logs are actually buffered, some of the previously buffered logs might be evicted before the buffer is flushed.

To avoid customer confusion, when this happens we'll emit a warning log as the buffer is flushed. This allows us to respect the customer's buffering level while still informing them about the missing logs.

For example:

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig
from aws_lambda_powertools.utilities.typing import LambdaContext

# Initialize the logger with a buffer
buffer_config = BufferConfig(max_bytes=10240)
logger = Logger(service="payment", buffer_config=buffer_config)  # INFO is the default log level

def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.debug("I am a debug 1")  # buffered
    logger.debug("I am a debug 2")  # buffered
    logger.debug("I am a debug 3")  # buffer was full, so "I am a debug 1" was evicted
    logger.flush_buffer()

which results in:

{ "message": "I am a debug 2", "level": "DEBUG", ...other keys }
{ "message": "I am a debug 3", "level": "DEBUG", ...other keys }
{ "message": `One or more log entries were evicted from the buffer because it got full. Consider increasing the "max_bytes" setting if you want to retain more logs`, "level": "WARN", ...other keys }

How do we optimize buffer storage

By default we'll set the buffer to a relatively low size (TBD). While customers can configure this amount, certain use cases might require additional optimization. For these cases we'll allow customers to enable buffer compression, which optimizes for space at the cost of CPU overhead (its actual impact will depend on the Lambda function's allocated resources).

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig

# Initialize the logger with a compressed buffer
buffer_config = BufferConfig(max_bytes=10240, compress=True)  # False by default
logger = Logger(service="payment", buffer_config=buffer_config)  # INFO is the default log level

When this setting is enabled, we'll use the zlib module from the standard library to compress the logs as they are added to the buffer, and decompress them if/when the buffer is flushed.
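
A minimal sketch of that per-entry approach with zlib; the buffer internals here are illustrative, not the final design:

import zlib

class CompressedLogBuffer:
    # Illustrative only: compress serialized log lines as they are buffered
    def __init__(self) -> None:
        self._entries: list[bytes] = []

    def add(self, serialized_log: str) -> None:
        # Compress each record on the way in, trading CPU for space
        self._entries.append(zlib.compress(serialized_log.encode("utf-8")))

    def flush(self) -> list[str]:
        # Decompress everything on the way out, then clear the buffer
        entries, self._entries = self._entries, []
        return [zlib.decompress(entry).decode("utf-8") for entry in entries]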

@leandrodamascena leandrodamascena added logger and removed help wanted Could use a second pair of eyes/hands need-customer-feedback Requires more customers feedback before making or revisiting a decision labels Feb 8, 2025
@leandrodamascena leandrodamascena self-assigned this Feb 8, 2025