Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sdk-logs] Add Transform API on batch #5733

Closed

Conversation

CodeBlanch
Copy link
Member

@CodeBlanch CodeBlanch commented Jul 1, 2024

Changes

  • Added Transform API on Batch<T> which can be used to modify and/or filter
    records contained in a batch

Details

Ran into an issue with a customer doing this (more or less):

        private ExportResult Export(in Batch<LogRecord> batch)
        {
            LogRecord[] storage = ArrayPool<LogRecord>.Shared.Rent((int)batch.Count);

            try
            {
                int storageCount = 0;
                int redactedFieldCount = 0;

                foreach (var logRecord in batch) // Drains the batch
                {
                    redactedFieldCount += ExecuteRedactionEngine(logRecord);
                    storage[storageCount++] = logRecord;
                }

                if (redactedFieldCount > 0)
                {
                    LogRedactedFieldSummary(redactedFieldCount);
                }          

                // Call inner exporter to actually perform the export
                return innerExporter.Export(
                    new Batch<LogRecord>(storage, storageCount)); // Create a new batch
            }
            finally
            {
                ArrayPool<LogRecord>.Shared.Return(storage);
            }
        }

ExecuteRedactionEngine is a bunch of logic for inspecting the log record and removing sensitive information. Customer specifically does this over the batch at the time of export to not block the application when it is creating the telemetry.

The issue here is while processing foreach (var logRecord in batch) the completed LogRecords may be returned to the pool (if they were rented from it).

What this PR is proposing to solve this is add a new API on Batch<T> which can be used to do this processing safely by mutating an instance of Batch<T>:

    private ExportResult Export(in Batch<LogRecord> batch)
    {
        var transformationState = new TransformationState
        {
             RedactionEngine = this.RedactionEngine,
        };

        batch.Transform(OnTransformLogRecord, ref this.RedactionEngine);

        if (transformationState.RedactedFieldCount > 0)
        {
            LogRedactedFieldSummary(transformationState.RedactedFieldCount);
        }   

        return innerExporter.Export(in batch);
    }

    private static bool OnTransformLogRecord(LogRecord logRecord, ref TransformationState state)
    {
        state.RedactedFieldCount += state.RedactionEngine.ExecuteRedactionEngine(logRecord);
        return true;
    }

    private struct TransformationState
    {
        public RedactionEngine RedactionEngine;
        public int RedactedFieldCount;
    }

Merge requirement checklist

  • CONTRIBUTING guidelines followed (license requirements, nullable enabled, static analysis, etc.)
  • Unit tests added/updated
  • Appropriate CHANGELOG.md files updated for non-trivial changes
  • Changes in public API reviewed (if applicable)

@CodeBlanch CodeBlanch added the logs Logging signal related label Jul 1, 2024
@github-actions github-actions bot added the pkg:OpenTelemetry Issues related to OpenTelemetry NuGet package label Jul 1, 2024
@CodeBlanch CodeBlanch changed the title [sdk-logs] Add WithProcessor API on batch [sdk-logs] Add Transform API on batch Jul 5, 2024
Copy link

codecov bot commented Jul 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.34%. Comparing base (6250307) to head (5dbdced).
Report is 285 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5733      +/-   ##
==========================================
+ Coverage   83.38%   86.34%   +2.96%     
==========================================
  Files         297      254      -43     
  Lines       12531    11120    -1411     
==========================================
- Hits        10449     9602     -847     
+ Misses       2082     1518     -564     
Flag Coverage Δ
unittests ?
unittests-Project-Experimental 86.32% <100.00%> (?)
unittests-Project-Stable 86.01% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/OpenTelemetry/Batch.cs 99.42% <100.00%> (+1.88%) ⬆️
...nTelemetry/Internal/OpenTelemetrySdkEventSource.cs 80.16% <100.00%> (-1.74%) ⬇️
src/OpenTelemetry/Logs/Pool/LogRecordPoolHelper.cs 100.00% <ø> (ø)

... and 196 files with indirect coverage changes

@CodeBlanch
Copy link
Member Author

@reyang I made some changes to the design after our initial offline discussion. Please take a look when you have a moment and let me know how you feel about it now.

  • Renamed from WithProcessor -> Transform

  • Switched the design from executing the transformation as items were read to doing all the work upfront. I liked the previous design better but my concern was it could lead to a case where the batch returns a count which differs from the number of items which would actually be read out (in the case of filtering). Probably wouldn't be an issue for most exporters, but it could be an issue for something so I went with a safe-by-default design.

  • In order to do the work upfront I utilized the ArrayPool to store the transformed/filtered batch. The rented array has to be returned when the batch processing is done. We could have relied on users calling Dispose to do that, but I felt most users would miss that. In order to make it easy & safe I switched the design from returning a new batch to mutating the existing batch. This way users performing transformation/filtering don't need to worry about disposing anything, the batch processor which created it initially will handle that.

Copy link
Contributor

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

@github-actions github-actions bot added the Stale Issues and pull requests which have been flagged for closing due to inactivity label Jul 13, 2024
@github-actions github-actions bot removed the Stale Issues and pull requests which have been flagged for closing due to inactivity label Jul 16, 2024
@CodeBlanch CodeBlanch marked this pull request as ready for review July 17, 2024 22:33
@CodeBlanch CodeBlanch requested a review from a team July 17, 2024 22:33
@Kielek
Copy link
Contributor

Kielek commented Jul 22, 2024

@CodeBlanch, the feature looks good, but after short verification with https://github.com/open-telemetry/opentelemetry-specification/tree/v1.35.0 there is no such transformer. I think that it should be defined on the specification level, then we could implement it. For now, it can be hidden by the unstable features.

@pellared, FYI as you working on similar API for the Go SDK.

@pellared
Copy link
Member

pellared commented Jul 22, 2024

FYI as you working on similar API for the Go SDK.

This looks like violation of the SDK specification which does not handle filtering.
CC @open-telemetry/technical-committee

@CodeBlanch
Copy link
Member Author

@pellared I hear you, but I don't think it is that cut and dry. This issue is caused by implementation details in .NET. Do other languages have a batch implementation based on a circular buffer backed by records from a pool? 😄 I looked at Java a bit. It seems the batch there is a simple ArrayList or Collection. If users want to massage that batch and then delegate the processing work to some other exporter they may filter, transform, or augment the items, no? This will not work as expected in .NET due to the implementation. The spec doesn't really say much at all about what shape the batch should have. Do we need it to be more opinionated? IMO the situation is we made some perf optimizations which lead to surprising behavior in code which would work with other SDKs. This API is to give users a way to do it safely. IMO it is well within our purview to do that as it is to specialize our implementation(s) in the first place.

PS: Just to be clear this isn't a hypothetical exercise. There are users doing this today and they are currently experiencing data loss 😢

@pellared
Copy link
Member

pellared commented Jul 22, 2024

Go batching log processor has also a circular buffer. I will do my best to take a look later and compare the implementations. If you want you can look as well: https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/log/batch.go and https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/log/exporter.go

then delegate the processing work to some other exporter they may filter, transform, or augment the items, no?

These are scenarios not supported by the specification (sic!) and probably by many SDKs as well. A little reference open-telemetry/opentelemetry-specification#4067 (comment)

@pellared
Copy link
Member

This API is to give users a way to do it safely. IMO it is well within our purview to do that as it is to specialize our implementation(s) in the first place.

May there other ways on how to do it?

The biggest concern is that you are not only adding "transform" capabilities but also "filtering" which is not covered by the specification. Why wouldn't you like to filter the records before they reach the batching processor?

I assume that adding a "transform" (without filtering capabilities) to the batching exporter will be totally acceptable given the implementation specifics (as you are using a pool).

@CodeBlanch
Copy link
Member Author

@pellared

Why wouldn't you like to filter the records before they reach the batching processor?

I kind of touched on this in the description:

Customer specifically does this over the batch at the time of export to not block the application when it is creating the telemetry.

But let me expand on it.

Users could perform transformation via a processor before things hit the batch. This is actually way easier than filtering to do 🤣 But that is done synchronously before the "Log" call returns. In this case the transformation is complex. Lots of scanning and regex execution and who-knows-what-else. The user decided to do this over the batch instead because in the case of batch export processor this is done lazily on the dedicated thread. The idea is to defer the expensive processing so the impact is not felt every time a log is emitted. Perhaps the spec is a bit naive in its prescription of where this type of thing must be done?

I assume that adding a "transform" (without filtering capabilities) to the batching exporter will be totally acceptable given the implementation specifics (as you are using a pool).

This particular user actually does not need filtering. They just need transformation. I added the filtering bit. Because, why not? 😄 In my mind expensive filtering can also benefit from being performed lazily against the batch. If we are to add this API it felt natural to also let it filter. If we come back later and need a separate API to filter that makes it (potentially) two calls and some awkwardness in that we would have a transform API and a filter API but the filter API can always transform given LogRecord is mutable. I would be willing to drop filtering but I would prefer to be pragmatic (help users accomplish their scenarios) vs dogmatic (not have it just because the spec doesn't say we must/may/should).

May there other ways on how to do it?

Thinking out of the box. We could expose LogRecord.AddReference. If correctly called that would prevent the LogRecords from going back to the pool when taken from one batch and put into another. But I do feel that is a worse design. It isn't super friendly and leaks implementation details. Currently none of the pooling is exposed.

Copy link
Contributor

github-actions bot commented Aug 1, 2024

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

@github-actions github-actions bot added the Stale Issues and pull requests which have been flagged for closing due to inactivity label Aug 1, 2024
Copy link
Contributor

github-actions bot commented Aug 9, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Aug 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logs Logging signal related pkg:OpenTelemetry Issues related to OpenTelemetry NuGet package Stale Issues and pull requests which have been flagged for closing due to inactivity
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants