bugfix: AWS SQS MD5 Hash Mismatch #1572

Merged

Conversation

ppittle
Member

@ppittle ppittle commented Feb 6, 2024

Fixes #1492

Changes

A recent change to the AWS SDK's SQS client switched its wire protocol from XML to JSON. As a result, the SDK's internal request pipeline behaves differently, and AWSTracingPipelineHandler can no longer manipulate SendMessageRequest.MessageAttributes after the Marshalling step has completed.

This change moves AWSTracingPipelineHandler to run before Marshalling (i.e. serialization) so that OpenTelemetry headers can be correctly injected.

Injecting propagation context into outgoing requests has been moved to a new handler, AWSPropagatorPipelineHandler, which runs later in the SDK request pipeline.
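
To illustrate why handler order matters, here is a minimal, self-contained sketch. It is a toy model and not the AWS SDK: `ToyRequest`, `ToyPipeline`, and the `name=value;` digest are hypothetical stand-ins, and the real SDK pipeline and SQS digest format are more involved. It only shows that an attribute added after marshalling never reaches the wire, so the digest of what was sent no longer matches the digest of the request object.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Inject first, then marshal: the trace attribute is part of the wire body.
Console.WriteLine(ToyPipeline.Send(NewRequest(), ToyPipeline.InjectTraceContext, ToyPipeline.Marshal)); // True

// Marshal first, then inject: the wire body is missing the trace attribute.
Console.WriteLine(ToyPipeline.Send(NewRequest(), ToyPipeline.Marshal, ToyPipeline.InjectTraceContext)); // False

static ToyRequest NewRequest()
{
    var request = new ToyRequest();
    request.MessageAttributes["customer"] = "example";
    return request;
}

// Hypothetical stand-in for SendMessageRequest; not a real SDK type.
sealed class ToyRequest
{
    public SortedDictionary<string, string> MessageAttributes { get; } = new(StringComparer.Ordinal);

    // What would go over the wire; written only by the marshalling step.
    public string WireBody { get; set; } = "";
}

static class ToyPipeline
{
    // Simplified serialization; NOT the real SQS wire or digest format.
    private static string Serialize(ToyRequest r) =>
        string.Concat(r.MessageAttributes.Select(a => $"{a.Key}={a.Value};"));

    private static string Md5Hex(string text) =>
        Convert.ToHexString(MD5.HashData(Encoding.UTF8.GetBytes(text)));

    public static void Marshal(ToyRequest r) => r.WireBody = Serialize(r);

    public static void InjectTraceContext(ToyRequest r) =>
        r.MessageAttributes["traceparent"] = "hypothetical-trace-context";

    // Runs the handlers in order, then compares the digest of the wire body
    // (the "service" view) with the digest of the request object (the "client" view).
    public static bool Send(ToyRequest r, params Action<ToyRequest>[] handlers)
    {
        foreach (var handler in handlers) handler(r);
        return Md5Hex(r.WireBody) == Md5Hex(Serialize(r));
    }
}
```

The second call models the failure described in the linked issue: the client-side digest covers an attribute that the service never received.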

Service Name Mapping

Moving AWSTracingPipelineHandler earlier in the AWS SDK pipeline also required changing AWSServiceHelper to use requestContext.ServiceMetaData.ServiceId instead of requestContext.Request.ServiceName, because the Request has not yet been populated at that point.

ServiceId and ServiceName can differ for some AWS Services, requiring an update to AWSServiceType. I have manually verified all 3 Service Names in AWSServiceType for correctness.

@ppittle ppittle requested a review from a team February 6, 2024 22:25
@ppittle
Member Author

ppittle commented Feb 6, 2024

Would it be possible to add @normj or @birojnayak as optional reviewers?

Contributor

@normj normj left a comment

Looks good. I recommend adding more context to the PR description for reviewers who are less familiar with the internals of the AWS .NET SDK.

@birojnayak
Contributor

@ppittle let's try to add a test if possible. A customer has reported this as a bug, so let's see if we can add one.

@Kielek Kielek added the comp:instrumentation.aws Things related to OpenTelemetry.Instrumentation.AWS label Feb 9, 2024
@ppittle ppittle force-pushed the ppittle/bug/aws-sqs-md5-hash-mismatch branch 2 times, most recently from 00b6d81 to 6ec938b Compare February 10, 2024 01:25
@ppittle
Member Author

ppittle commented Feb 10, 2024

@ppittle let's try to add a test if possible. A customer has reported this as a bug, so let's see if we can add one.

I have reworked the unit tests to reflect the change in implementation.

I also looked at adding a proper regression test - https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/1572/files#diff-e5d83ef47b6e8c2bac87983d6e1ede500c9c42506c262afb183d4ff7bb557db2R177

However, this doesn't fully cover the issue raised. The exception occurred because the older version didn't send all attributes to the SQS service. When SQS replied with the hash of what it had received, that hash did not match the one calculated by the SDK, and the SDK raised an exception to the user.

The unit tests intercept the SDK pipeline to set a mock web response rather than making a live call to an SQS instance. Covering this issue in a unit test would require the test to precalculate the expected hash and inject that value into the mock response. In my opinion, the maintenance cost of that test code is high, and it would not give much certainty of detecting future changes to the SDK pipeline's internals.

It would be more appropriate to have an integration test, but without an AWS Account to test against, this is also not a feasible option.

I recommend we leave the test coverage as it is, but I welcome a second opinion.
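
For a sense of what "precalculating the expected hash" would involve, here is a rough sketch. It follows my reading of the digest scheme described in the SQS Developer Guide (attributes sorted by name; each part written as a 4-byte big-endian length plus UTF-8 bytes; a 1-byte transport type) and should be treated as an assumption rather than a verified implementation. `SqsAttributeDigest` is a hypothetical helper, it handles only String/Number attributes, and the attribute values below are made up.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

var attributes = new Dictionary<string, (string DataType, string Value)>
{
    ["traceparent"] = ("String", "hypothetical-trace-context"),
};

// Value a mock SendMessage response would have to carry as its attribute digest.
Console.WriteLine(SqsAttributeDigest.Compute(attributes));

// Hypothetical helper; not part of this PR or of the AWS SDK.
static class SqsAttributeDigest
{
    public static string Compute(IReadOnlyDictionary<string, (string DataType, string Value)> attributes)
    {
        using var buffer = new MemoryStream();

        // Assumption: attributes are encoded in ascending order of name.
        foreach (var (name, attribute) in attributes.OrderBy(a => a.Key, StringComparer.Ordinal))
        {
            WriteLengthPrefixed(buffer, name);
            WriteLengthPrefixed(buffer, attribute.DataType);
            buffer.WriteByte(1); // assumed transport type for String/Number values
            WriteLengthPrefixed(buffer, attribute.Value);
        }

        // Assumption: the service reports the digest as lowercase hex.
        return Convert.ToHexString(MD5.HashData(buffer.ToArray())).ToLowerInvariant();
    }

    private static void WriteLengthPrefixed(Stream stream, string text)
    {
        var bytes = Encoding.UTF8.GetBytes(text);
        Span<byte> length = stackalloc byte[4];
        BinaryPrimitives.WriteInt32BigEndian(length, bytes.Length);
        stream.Write(length);
        stream.Write(bytes, 0, bytes.Length);
    }
}
```

A test built this way would duplicate logic that lives inside the SDK and the service, which is the maintenance cost mentioned above.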

@ppittle ppittle force-pushed the ppittle/bug/aws-sqs-md5-hash-mismatch branch from 6ec938b to 31344ee Compare February 10, 2024 01:46
@@ -198,8 +199,8 @@ private void ValidateAWSActivity(Activity aws_activity, Activity parent)

 private void ValidateDynamoActivityTags(Activity ddb_activity)
 {
-    Assert.Equal("DynamoDBv2.Scan", ddb_activity.DisplayName);
-    Assert.Equal("DynamoDBv2", Utils.GetTagValue(ddb_activity, "aws.service"));
+    Assert.Equal("DynamoDB.Scan", ddb_activity.DisplayName);
Contributor

Why did this assert change?

Member Author

The AWSServiceName tag is set from the result of calling AWSServiceHelper.GetAWSServiceName. This PR changes the implementation of GetAWSServiceName to use ServiceMetaData.ServiceId instead of requestContext.Request.ServiceName, because moving the handler code earlier in the SDK pipeline means the requestContext.Request object isn't yet available.

ServiceId and ServiceName are sometimes subtly different, as is the case with DynamoDB: DynamoDB vs DynamoDBv2.

I could see an argument that this is a breaking change, since existing users would see the new tag value after upgrading. To prevent that, I'd need to hardcode a mapping table for DynamoDB and SNS. When I considered this, my view was that the name change was acceptable and the maintenance overhead of the mapping table was not worth it.

But I'd very much welcome additional opinions, especially if CNCF has guidance around this issue.
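
For reference, the mapping table considered (and rejected) above might look roughly like the sketch below. `LegacyServiceNames` and `Resolve` are hypothetical names, not code from this PR. The only pair filled in is the DynamoDB one mentioned in this thread; the SNS entry is left out because its values are not stated here.

```csharp
using System;
using System.Collections.Generic;

Console.WriteLine(LegacyServiceNames.Resolve("DynamoDB")); // DynamoDBv2
Console.WriteLine(LegacyServiceNames.Resolve("SQS"));      // SQS (no override, passed through)

// Hypothetical helper; not part of this PR.
static class LegacyServiceNames
{
    // ServiceId -> the ServiceName value that was reported before this change.
    private static readonly Dictionary<string, string> ByServiceId = new(StringComparer.Ordinal)
    {
        ["DynamoDB"] = "DynamoDBv2", // pair taken from the discussion above
        // An SNS entry would go here; its names are not given in this thread.
    };

    public static string Resolve(string serviceId) =>
        ByServiceId.TryGetValue(serviceId, out var legacyName) ? legacyName : serviceId;
}
```

The trade-off is the one described above: every service whose ServiceId differs from its old ServiceName would need an entry that someone keeps up to date.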

Contributor

As you said, the name ideally shouldn't change even though the logic can. But I see similar changes here. Approving; let's mark this as a breaking change. @Kielek FYI.

@birojnayak
Contributor

@Kielek or anyone, could you add a tag to mark this change as a breaking change? Or let us know the procedure.

@Kielek
Contributor

Kielek commented Feb 12, 2024

@Kielek or anyone, could you add a tag to mark this change as a breaking change? Or let us know the procedure.

Typically, you should add an entry to the Changelog.
Consider the following format:

* **Breaking Change**: Your message here
  ([#PR_ID](https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/PR_ID))

@ppittle ppittle force-pushed the ppittle/bug/aws-sqs-md5-hash-mismatch branch from 31344ee to 87085d1 Compare February 12, 2024 20:20
@ppittle
Member Author

ppittle commented Feb 12, 2024

@Kielek or anyone, could you add a tag to mark this change as a breaking change? Or let us know the procedure.

Typically, you should add an entry to the Changelog. Consider the following format:

* **Breaking Change**: Your message here
  ([#PR_ID](https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/PR_ID))

Updated Changelog

Contributor

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Feb 20, 2024
@ppittle
Member Author

ppittle commented Feb 20, 2024

@Kielek - looks like this PR has a pending reviewer:

open-telemetry/dotnet-contrib-approvers was requested for review as a code owner

Are you able to approve, or point me towards someone who can?


codecov bot commented Feb 20, 2024

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (71655ce) 73.91% compared to head (1ed3a85) 80.64%.
Report is 150 commits behind head on main.

@@            Coverage Diff             @@
##             main    #1572      +/-   ##
==========================================
+ Coverage   73.91%   80.64%   +6.73%     
==========================================
  Files         267      114     -153     
  Lines        9615     3080    -6535     
==========================================
- Hits         7107     2484    -4623     
+ Misses       2508      596    -1912     
Flag Coverage Δ
unittests-Solution 80.64% <97.61%> (?)

Files Coverage Δ
...rumentation.AWS/Implementation/AWSServiceHelper.cs 100.00% <100.00%> (ø)
...strumentation.AWS/Implementation/AWSServiceType.cs 66.66% <ø> (ø)
...AWS/Implementation/AWSTracingPipelineCustomizer.cs 90.90% <100.00%> (+3.40%) ⬆️
...on.AWS/Implementation/AWSTracingPipelineHandler.cs 85.41% <100.00%> (-2.34%) ⬇️
...tion.AWS/Implementation/SnsRequestContextHelper.cs 81.81% <100.00%> (-13.19%) ⬇️
...tion.AWS/Implementation/SqsRequestContextHelper.cs 81.81% <100.00%> (-13.19%) ⬇️
...rumentation.AWS/TracerProviderBuilderExtensions.cs 100.00% <100.00%> (ø)
...AWS/Implementation/AWSPropagatorPipelineHandler.cs 95.23% <95.23%> (ø)

... and 221 files with indirect coverage changes

@utpilla utpilla removed the Stale label Feb 20, 2024
@utpilla
Contributor

utpilla commented Feb 20, 2024

@ppittle Could you please fix the CI errors? We can then merge the PR.

@ppittle ppittle force-pushed the ppittle/bug/aws-sqs-md5-hash-mismatch branch from 61c8b0e to 5b860ad Compare February 22, 2024 04:10
…arlier in the pipeline. This is necessary to support SQS becoming a JSON service and no longer supporting writing to MessageAttributes after Marshalling has occurred.
…tes after RequestContext.Request has been built so that the Propagator can inject request headers.
@ppittle ppittle force-pushed the ppittle/bug/aws-sqs-md5-hash-mismatch branch from 5b860ad to 867ad33 Compare February 22, 2024 04:25
@ppittle
Member Author

ppittle commented Feb 22, 2024

@ppittle Could you please fix the CI errors? We can then merge the PR.

I've pushed an update. @utpilla, can you approve the CI/CD run?

@utpilla utpilla merged commit bfe42e2 into open-telemetry:main Feb 22, 2024
33 checks passed
@ppittle ppittle deleted the ppittle/bug/aws-sqs-md5-hash-mismatch branch February 23, 2024 04:04
Labels
comp:instrumentation.aws Things related to OpenTelemetry.Instrumentation.AWS
Development

Successfully merging this pull request may close these issues.

SendMessage with MessageAttributes on latest version of AWS SDK throws "Attribute MD5 hash mismatch" exception
6 participants