Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the AWS SDK dependencies issue causing the "not found, ResolveEndpointV2" error #39454

Merged
merged 4 commits into from
May 15, 2024

Conversation

gpop63
Copy link
Contributor

@gpop63 gpop63 commented May 7, 2024

Overview

When trying to use aws module with metricsets like cloudwatch, rds etc. we get some error Exiting: 1 error: error creating aws metricset: failed DescribeRegions: not found, ResolveEndpointV2. This happens for 8.15. 8.14 seems fine - tested by me and @lucian-ioan.

The solution is to upgrade the AWS Go SDKs and fix a minor pointer value.

Solution provided by @agithomas

go get -u http://github.com/aws/aws-sdk-go-v2/...
go clean -modcache
go mod tidy
mage build

aws/aws-sdk-go-v2#2397
aws/aws-sdk-go-v2#2370

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

We are going to test the following input and metricset to ensure they work as expected with the updated dependencies:

Filebeat inputs:

Input Assignee Test Notes
awscloudwatch @aliabbas-elastic OK
awss3 @zmoog OK Polling and SQS modes

Metricbeat metricsets

Metricset Assignee Test Notes
aws / awshealth @agithomas OK
aws / billing @gpop63 OK 8.14 & 8.15
aws / cloudwatch @lucian-ioan OK
awsfargate / task_stats @zmoog - task_stats does not call AWS APIs

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 7, 2024
@botelastic
Copy link

botelastic bot commented May 7, 2024

This pull request doesn't have a Team:<team> label.

@mergify mergify bot assigned gpop63 May 7, 2024
Copy link
Contributor

mergify bot commented May 7, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @gpop63? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Collaborator

elasticmachine commented May 7, 2024

❕ Build Aborted

Either there was a build timeout or someone aborted the build.

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Start Time: 2024-05-08T10:44:35.144+0000

  • Duration: 363 min 9 sec

Test stats 🧪

Test Results
Failed 0
Passed 29051
Skipped 2024
Total 31075

Steps errors 2

Expand to view the steps failures

Installing Node.js 18.17.1
  • Took 3 min 27 sec . View more details here
  • Description: set -e export NVM_DIR="${HOME}/.nvm" [ -s "${NVM_DIR}/nvm.sh" ] && . "${NVM_DIR}/nvm.sh" nvm install 18.17.1 nvm version | head -n1 > "/var/lib/jenkins/workspace/PR-39454-2-0bce2cb8-3465-44c5-94ad-3bf3cd9deea7/src/github.com/elastic/beats@tmp/.nvm-node-version"
Error signal
  • Took 0 min 0 sec . View more details here
  • Description: Error 'org.jenkinsci.plugins.workflow.steps.FlowInterruptedException'

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@ishleenk17
Copy link
Contributor

@gpop63 : the issue is seen only with 8.15 ?
All prior stacks are fine?
Did something change in the SDK's being used for AWS between 8.14 and 8.15 ?

@zmoog
Copy link
Contributor

zmoog commented May 8, 2024

Mixing some old & new AWS SDK dependencies brings the clients into an inconsistent state, causing the error.

@zmoog zmoog self-requested a review May 8, 2024 10:42
@zmoog
Copy link
Contributor

zmoog commented May 8, 2024

I created a branch to fix the same problem, but I'll drop my branch and focus on reviewing this PR instead.

@zmoog
Copy link
Contributor

zmoog commented May 8, 2024

The best summary of this issue comes from a comment in the second link in the issue description:

this error occurs when using a version of the core aws-sdk-go-v2 module at or above v1.23.0 (essentially, released on or after 11/15/23) in combination with a service module released before that date. The most recent comment from HemantDubey-ACS is an example of this - the service/secretsmanager v1.21.3 is from before that breaking point.

Upgrading all modules under the aws-sdk-go-v2 namespace will resolve this issue.

A few days ago, we upgraded the core aws-sdk-go-v2 module to 1.26.1, crossing the red line. Since most other module versions stayed the same, the issue happened.

We must upgrade the service modules to match the breaking point introduced by the core aws-sdk-go-v2 module.

Reviewing changelogs so far, I found the following three breaking changes:

  • "Bump minimum go version to 1.19"
  • "BREAKFIX: Change in MaxResults datatype from value to pointer type in cognito-sync service."
  • "BREAKFIX: Correct nullability and default value representation of various input fields across a large number of services. See 2162."

At first glance, the other two breaking changes seem minor and something we should catch at compilation time. But it's worth a closer look.

Breaking changes review

Bump minimum go version to 1.19

The minimum go version to 1.19 is fine. We currently build Beats using 1.21.9.

BREAKFIX: Change in MaxResults datatype from value to pointer type in cognito-sync service

A small-ish breaking change that modified a few values to pointers.

Related to https://github.com/aws/aws-sdk-go-v2/releases/tag/release-2023-09-18

BREAKFIX: Correct nullability and default value representation of various input fields across a large number of services

This was a small-ish breaking change that modified a few values to pointers. For example, an update to IncludeLinkedAccounts is required due to this "breakfix". See aws/aws-sdk-go-v2@abd72f5#diff-3605753f6f3e0d05dadd669940add74574c80af02a1bdaa72ee5d2bf35ab5e30 for more.


Here is the list of dependencies upgrade:

Module From To Changelog Breaking Changes
github.com/aws/aws-sdk-go-v2/config v1.17.7 v1.27.11 https://github.com/aws/aws-sdk-go-v2/blob/main/config/CHANGELOG.md
github.com/aws/aws-sdk-go-v2/credentials v1.12.20 v1.17.11 https://github.com/aws/aws-sdk-go-v2/blob/main/credentials/CHANGELOG.md
github.com/aws/aws-sdk-go-v2/service/cloudwatch v1.26.0 v1.38.0 https://github.com/aws/aws-sdk-go-v2/blob/main/service/cloudwatch/CHANGELOG.md
github.com/aws/aws-sdk-go-v2/service/cloudwatchlogs v1.15.5 v1.35.1 https://github.com/aws/aws-sdk-go-v2/blob/main/service/cloudwatchlogs/CHANGELOG.md
github.com/aws/aws-sdk-go-v2/service/costexplorer v1.18.4 v1.38.0 https://github.com/aws/aws-sdk-go-v2/blob/main/service/costexplorer/CHANGELOG.md BREAKFIX: Correct nullability and default value representation of various input fields across a large number of services. See 2162.
github.com/aws/aws-sdk-go-v2/service/ec2 v1.36.1 v1.160.0 https://github.com/aws/aws-sdk-go-v2/blob/main/service/ec2/CHANGELOG.md
github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2 v1.18.4 v1.30.5 https://github.com/aws/aws-sdk-go-v2/blob/main/service/elasticloadbalancingv2/CHANGELOG.md BREAKFIX: Correct nullability and default value representation of various input fields across a large number of services. See 2162.
github.com/aws/aws-sdk-go-v2/service/iam v1.18.4 v1.32.0 https://github.com/aws/aws-sdk-go-v2/blob/main/service/iam/CHANGELOG.md
github.com/aws/aws-sdk-go-v2/service/organizations v1.15.2 v1.27.3 https://github.com/aws/aws-sdk-go-v2/blob/main/service/organizations/CHANGELOG.md
github.com/aws/aws-sdk-go-v2/service/rds v1.20.1 v1.78.0 https://github.com/aws/aws-sdk-go-v2/blob/main/service/rds/CHANGELOG.md BREAKFIX: Correct nullability and default value representation of various input fields across a large number of services. See 2162.
github.com/aws/aws-sdk-go-v2/service/resourcegroupstaggingapi v1.13.5 v1.21.4 https://github.com/aws/aws-sdk-go-v2/blob/main/service/resourcegroupstaggingapi/CHANGELOG.md BREAKFIX: Change in MaxResults datatype from value to pointer type in cognito-sync service.
github.com/aws/aws-sdk-go-v2/service/s3 v1.27.11 v1.53.1 https://github.com/aws/aws-sdk-go-v2/blob/main/service/s3/CHANGELOG.md BREAKFIX: Correct nullability and default value representation of various input fields across a large number of services. See 2162.
github.com/aws/aws-sdk-go-v2/service/sqs v1.18.4 v1.31.4 https://github.com/aws/aws-sdk-go-v2/blob/main/service/sqs/CHANGELOG.md
github.com/aws/aws-sdk-go-v2/service/sts v1.16.19 v1.28.6 https://github.com/aws/aws-sdk-go-v2/blob/main/service/sts/CHANGELOG.md BREAKFIX: Change in MaxResults datatype from value to pointer type in cognito-sync service.
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.12.17 v1.16.1 https://github.com/aws/aws-sdk-go-v2/blob/main/feature/ec2/imds/CHANGELOG.md
github.com/aws/aws-sdk-go-v2/feature/s3/manager v1.11.33 v1.16.15 https://github.com/aws/aws-sdk-go-v2/blob/main/feature/s3/manager/CHANGELOG.md BREAKING CHANGE Correct nullability of a large number of S3 structure fields. See #2162.
github.com/aws/aws-sdk-go-v2/service/cloudformation v1.20.4 v1.50.0 https://github.com/aws/aws-sdk-go-v2/blob/main/service/cloudformation/CHANGELOG.md BREAKFIX: Correct nullability and default value representation of various input fields across a large number of services. See 2162.
github.com/aws/aws-sdk-go-v2/service/health v1.17.0 v1.24.4 https://github.com/aws/aws-sdk-go-v2/blob/main/service/health/CHANGELOG.md
github.com/aws/aws-sdk-go-v2/service/kinesis v1.15.8 v1.27.4 https://github.com/aws/aws-sdk-go-v2/blob/main/service/kinesis/CHANGELOG.md BREAKFIX: Change in MaxResults datatype from value to pointer type in cognito-sync service.

Copy link
Contributor

@zmoog zmoog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we upgraded the core aws-sdk-go-v2 module to 1.26.1, we have two scenarios:

  1. Revert the core module upgrade
  2. Upgrade the services

Since we still have time before 8.15, upgrading the services seems the most appropriate response.

We could upgrade the services to the minimum version to restore functionality or target the latest and greatest stable versions.

I searched the changelogs of the upgraded modules and found only a few minor and manageable breaking changes. For the list and comments, see #39454 (comment).

@zmoog zmoog changed the title [metricbeat] [aws] fix aws sdk ResolveEndpointV2 Fix the AWS SDK dependencies issue causing the "not found, ResolveEndpointV2" error May 8, 2024
@zmoog zmoog marked this pull request as ready for review May 8, 2024 15:38
@zmoog zmoog requested review from a team as code owners May 8, 2024 15:38
@lucian-ioan lucian-ioan self-requested a review May 8, 2024 15:48
@ishleenk17
Copy link
Contributor

Since we upgraded the core aws-sdk-go-v2 module to 1.26.1, we have two scenarios:

  1. Revert the core module upgrade
  2. Upgrade the services

Since we still have time before 8.15, upgrading the services seems the most appropriate response.

We could upgrade the services to the minimum version to restore functionality or target the latest and greatest stable versions.

I searched the changelogs of the upgraded modules and found only a few minor and manageable breaking changes. For the list and comments, see #39454 (comment).
@zmoog
Since we have time for 8.15, I think it should be ok to upgrade the service dependencies.
However, we need to ensure that sufficient testing of the modules happens after that. Would the Integration testing be sufficient?
@tommyers-elastic : :Let me know if you think otherwise.

@zmoog
Copy link
Contributor

zmoog commented May 9, 2024

However, we need to ensure that sufficient testing of the modules happens after that. Would the Integration testing be sufficient?

Earlier today, we discussed the need to add automated smoke tests daily or weekly. Smoke tests are essential to detect this kind of failure. I'll create an issue for this.

In the meantime, since not all modules have system tests, I guess we need to run a round of manual tests. I am currently checking the aws-s3 and cloudwatch inputs.

@agithomas
Copy link
Contributor

Earlier today, we discussed the need to add automated smoke tests daily or weekly. Smoke tests are essential to detect this kind of failure. I'll create an issue for this.

@zmoog , as part of the current CI/CD pipeline test, we run integration tests for all modules. Have you come across the reason why this issue is not captured by the existing tests ?

@zmoog
Copy link
Contributor

zmoog commented May 13, 2024

as part of the current CI/CD pipeline test, we run integration tests for all modules. Have you come across the reason why this issue is not captured by the existing tests ?

Not yet. I need to dig deeper to understand why tests didn't fail in the PR that introduced the core aws-sdk-go-v2 module update.

@agithomas
Copy link
Contributor

Validated for AWS Health. No issues noted.

go clean -modcache && go mod tidy && mage build && metricbeat -e -v

image

@zmoog
Copy link
Contributor

zmoog commented May 13, 2024

Testing awss3 in "polling mode" using the following config:

  - module: aws  
    cloudtrail:  
      enabled: true  
  
      # AWS S3 bucket arn
      var.bucket_arn: 'arn:aws:s3:::<redacted>'
  
      # Number of workers on S3 bucket
      var.number_of_workers: 5

      # Use access_key_id, secret_access_key and/or session_token instead of shared credential file
      var.access_key_id: "<redacted>"
      var.secret_access_key: "<redacted>"

CleanShot 2024-05-13 at 10 47 13@2x

The input seems able to process S3 objects successfully.

@lucian-ioan
Copy link
Contributor

I've also tested AWS CloudWatch metrics and it works fine:

- module: aws
  period: 5m
  access_key_id: <redacted>
  secret_access_key: <redacted>
  metricsets:
    - cloudwatch
  metrics:
    - namespace: AWS/EC2
      name: ["CPUUtilization", "DiskWriteOps"]
      resource_type: ec2:instance
image

@ali786XI
Copy link
Contributor

ali786XI commented May 13, 2024

Tested the aws-cloudwatch input with the config as follows:-

filebeat.inputs:
- type: aws-cloudwatch
  log_group_arn: arn:aws:logs:eu-north-1:627286350114:log-group:aws-cloudtrail-logs-627286350134-53d435bd:* #masked
  scan_frequency: 24h
  start_position: end
  access_key_id: <redacted>
  secret_access_key: <redacted>

image

@zmoog
Copy link
Contributor

zmoog commented May 13, 2024

Testing awss3 in "SQS mode" using the following config:

  - module: aws  
    cloudtrail:  
      enabled: true  
  
      # AWS SQS queue url
      var.queue_url: https://sqs.eu-north-1.amazonaws.com/<redacted>/mbranca-cloudtrail-logs

      # Use access_key_id, secret_access_key and/or session_token instead of shared credential file
      var.access_key_id: "<redacted>"
      var.secret_access_key: "<redacted>"

CleanShot 2024-05-13 at 12 36 56@2x

The input seems able to successfully process S3 objects from notification via SQS message.

@zmoog zmoog force-pushed the fix_aws-sdk_ResolveEndpointV2 branch from a05331f to b4d242d Compare May 13, 2024 10:43
Copy link
Contributor

@zmoog zmoog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Manual tests show that this PR restored inputs/metricsets functionality.

@gpop63
Copy link
Contributor Author

gpop63 commented May 13, 2024

AWS billing working fine.

WhatsApp Image 2024-05-13 at 14 08 38_f9f5de0d

Copy link
Contributor

@ali786XI ali786XI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@agithomas agithomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, based on the manual testing

Copy link
Contributor

@lucian-ioan lucian-ioan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@kaiyan-sheng kaiyan-sheng mentioned this pull request May 13, 2024
6 tasks
@zmoog
Copy link
Contributor

zmoog commented May 14, 2024

Hey @elastic/beats-tech-leads, whenever you have a moment, could you please take a look at this pull request? 🙇

@gpop63 gpop63 merged commit 35ed4d7 into elastic:main May 15, 2024
220 checks passed
v1v added a commit to v1v/beats that referenced this pull request May 15, 2024
…-actions

* upstream/main: (313 commits)
  github-action: delete opentelemetry workflow (elastic#39559)
  updatecli: move to the .github folder and support for signed commits (elastic#39472)
  Osquerybeat: Add action responses data stream (elastic#39143)
  [winlogbeat] performance improvment; avoid rendering event message twice (elastic#39544)
  Fix the AWS SDK dependencies issue causing the "not found, ResolveEndpointV2" error (elastic#39454)
  x-pack/filebeat/input/cel: add http metrics collection (elastic#39503)
  build(deps): bump github.com/elastic/elastic-agent-libs from 0.9.4 to 0.9.7 (elastic#39424)
  Remove unused env vars from pipelines (elastic#39534)
  [BK] - Remove osx steps from branch execution (elastic#39552)
  [BK] - Remove certain steps from running for Branches (elastic#39533)
  Allow dependabot report BK status checks (elastic#39540)
  Remove hardcoded module definitions in CI (elastic#39506)
  Explicitly set DOCKER_PULL, RACE_DETECTOR and TEST_COVERAGE for pipelines (elastic#39510)
  Fixed pipelines formatting (elastic#39513)
  Update filebeat pipeline to match Jenkins steps (elastic#39261)
  Add error check to groupToEvents so we don't blindly add error values (elastic#39404)
  Remove fields not needed for session view in add_session_view processor (elastic#39500)
  `aws-s3` input: Split S3 poller and SQS reader into explicit input objects (elastic#39353)
  ci(jenkins): remove post-build notifications (elastic#39483)
  [DOCS] Add the `read_pipeline` cluster privilege for winlogbeat and the `auto_configure` index privilege to beats documentation (elastic#38534)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs_team Indicates that the issue/PR needs a Team:* label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants