Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Make sure EbsOptimizedInfo exists when selecting instances by EBS maximum bandwidth #6663

Merged
merged 1 commit into from
Aug 7, 2024

Conversation

kanwren
Copy link
Contributor

@kanwren kanwren commented Aug 6, 2024

We observed that g2.xlarge has .EbsInfo.EbsOptimizedSupport == "default", but no .EbsInfo.EbsOptimizedInfo:

$ aws ec2 describe-instance-types --region us-west-2 --instance-types 'g2.xlarge' | jq '.InstanceTypes[0].EbsInfo'
{
  "EbsOptimizedSupport": "default",
  "EncryptionSupport": "supported",
  "NvmeSupport": "required"
}

This can cause a nil pointer dereference when gathering instance info, following the support for EBS maximum bandwidth introduced in #5925

{"level":"INFO","time":"2024-08-05T18:52:30.307Z","logger":"controller","message":"Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference","commit":"490ef94","controller":"nodeclaim.disruption","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"utility-7lkkw"},"namespace":"","name":"utility-7lkkw","reconcileID":"997d55c4-ab30-4309-a0bb-32ad3e4b89b6"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1fa3011]

goroutine 1986 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:111 +0x1e5
panic({0x26f1360?, 0x4b7d1a0?})
	runtime/panic.go:770 +0x132
github.com/aws/karpenter-provider-aws/pkg/providers/instancetype.computeRequirements(0xc00cfef400, {0x0, 0x0, 0x0}, {0xc0000560fb, 0x9}, {0x33cf790, 0xc00e9847f0})
	github.com/aws/karpenter-provider-aws/pkg/providers/instancetype/types.go:169 +0x1711
github.com/aws/karpenter-provider-aws/pkg/providers/instancetype.NewInstanceType({0x33c95d8, 0xc0088ef1a0}, 0xc00cfef400, {0xc0000560fb?, 0xc0088f8df0?}, {0xc00831a2d0, 0x1, 0x1}, 0x0, 0x0, ...)
	github.com/aws/karpenter-provider-aws/pkg/providers/instancetype/types.go:58 +0xb4
github.com/aws/karpenter-provider-aws/pkg/providers/instancetype.(*DefaultProvider).List.func2(0xc00cfef400, 0x411f45?)
	github.com/aws/karpenter-provider-aws/pkg/providers/instancetype/instancetype.go:167 +0x3a8
github.com/samber/lo.Map[...]({0xc001a31a88?, 0x316, 0x2b26360}, 0xc0088f9208?)
	github.com/samber/lo@v1.39.0/slice.go:29 +0x66
github.com/aws/karpenter-provider-aws/pkg/providers/instancetype.(*DefaultProvider).List(0xc0006e6540, {0x33c95d8, 0xc0088ef1a0}, 0x0, 0xc005a12908)
	github.com/aws/karpenter-provider-aws/pkg/providers/instancetype/instancetype.go:155 +0xf3e
github.com/aws/karpenter-provider-aws/pkg/cloudprovider.(*CloudProvider).GetInstanceTypes(0xc000a4f3e0, {0x33c95d8, 0xc0088ef1a0}, 0xc008a0a1e0)
	github.com/aws/karpenter-provider-aws/pkg/cloudprovider/cloudprovider.go:172 +0x24a
github.com/aws/karpenter-provider-aws/pkg/cloudprovider.(*CloudProvider).isAMIDrifted(0xc000a4f3e0?, {0x33c95d8?, 0xc0088ef1a0?}, 0xc0088e88c0, 0x0?, 0xc0057589a0, 0xc008a04248)
	github.com/aws/karpenter-provider-aws/pkg/cloudprovider/drift.go:70 +0x4d
github.com/aws/karpenter-provider-aws/pkg/cloudprovider.(*CloudProvider).isNodeClassDrifted(0xc000a4f3e0, {0x33c95d8, 0xc0088ef1a0}, 0xc0088e88c0, 0xc008a0a1e0, 0xc008a04248)
	github.com/aws/karpenter-provider-aws/pkg/cloudprovider/drift.go:50 +0xd7
github.com/aws/karpenter-provider-aws/pkg/cloudprovider.(*CloudProvider).IsDrifted(0xc000a4f3e0, {0x33c95d8, 0xc0088ef1a0}, 0xc0088e88c0)
	github.com/aws/karpenter-provider-aws/pkg/cloudprovider/cloudprovider.go:208 +0x35f
sigs.k8s.io/karpenter/pkg/cloudprovider/metrics.(*decorator).IsDrifted(0xc0004161f0, {0x33c95d8, 0xc0088ef1a0}, 0xc0088e88c0)
	sigs.k8s.io/karpenter@v0.37.0/pkg/cloudprovider/metrics/cloudprovider.go:151 +0x14f
sigs.k8s.io/karpenter/pkg/controllers/nodeclaim/disruption.(*Drift).isDrifted(0xc0004162d0, {0x33c95d8, 0xc0088ef1a0}, 0xc0087cbc20, 0xc0088e88c0)
	sigs.k8s.io/karpenter@v0.37.0/pkg/controllers/nodeclaim/disruption/drift.go:104 +0xd7
sigs.k8s.io/karpenter/pkg/controllers/nodeclaim/disruption.(*Drift).Reconcile(0xc0004162d0, {0x33c95d8, 0xc0088ef1a0}, 0xc0087cbc20, 0xc0088e88c0)
	sigs.k8s.io/karpenter@v0.37.0/pkg/controllers/nodeclaim/disruption/drift.go:67 +0x275
sigs.k8s.io/karpenter/pkg/controllers/nodeclaim/disruption.(*Controller).Reconcile(0xc0002cf140, {0x33c95d8, 0xc0088ef050}, 0xc0088e88c0)
	sigs.k8s.io/karpenter@v0.37.0/pkg/controllers/nodeclaim/disruption/controller.go:95 +0x369
sigs.k8s.io/controller-runtime/pkg/reconcile.(*objectReconcilerAdapter[...]).Reconcile(0x339c740, {0x33c95d8, 0xc0088ef050}, {{{0x0, 0x0?}, {0xc00070e400?, 0x5?}}})
	sigs.k8s.io/controller-runtime@v0.18.2/pkg/reconcile/reconcile.go:142 +0x194
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x33cfd88?, {0x33c95d8?, 0xc0088ef050?}, {{{0x0?, 0xb?}, {0xc00070e400?, 0x0?}}})
	sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:114 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00040cdc0, {0x33c9610, 0xc0004fd040}, {0x28a1fc0, 0xc001a9dbe0})
	sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:311 +0x3bc
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00040cdc0, {0x33c9610, 0xc0004fd040})
	sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:261 +0x1be
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:222 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 354
	sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:218 +0x486

Add an extra check to EBS maximum bandwidth selection to make sure that this key exists.

@kanwren kanwren requested a review from a team as a code owner August 6, 2024 20:33
@kanwren kanwren requested a review from njtran August 6, 2024 20:33
Copy link

netlify bot commented Aug 6, 2024

Deploy Preview for karpenter-docs-prod ready!

Name Link
🔨 Latest commit 1780bbf
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/66b312e7b32c5e0008e34ac6
😎 Deploy Preview https://deploy-preview-6663--karpenter-docs-prod.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@kanwren kanwren force-pushed the kanwren/ebs-optimized-info-check branch from 014efe6 to 648033f Compare August 6, 2024 20:36
Copy link

@scottcrawford03 scottcrawford03 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@kanwren
Copy link
Contributor Author

kanwren commented Aug 6, 2024

I would add this or a similar instance type to the generated test data, but it seems there are no instance types in us-east-1 (the region used for generation) that have this issue:

$ aws ec2 describe-instance-types --region us-east-1 | jq 'def count(f): reduce f as $i (0; . + 1); count(.InstanceTypes[] | select(.EbsInfo.EbsOptimizedInfo == null and .EbsInfo.EbsOptimizedSupport == "default"))'
0

@kanwren
Copy link
Contributor Author

kanwren commented Aug 6, 2024

Tentatively added g2.xlarge to the tests, which required switching the region used for gathering test instance types.

jonathan-innis
jonathan-innis previously approved these changes Aug 7, 2024
Copy link
Contributor

@jonathan-innis jonathan-innis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

@jonathan-innis jonathan-innis enabled auto-merge (squash) August 7, 2024 05:58
@kanwren
Copy link
Contributor Author

kanwren commented Aug 7, 2024

Removing g2.xlarge from the tests, since it seems to be EOL and has a bunch of weird data. In this case, g2.xlarge has {"MemoryInfo": {"SizeInMiB": 0}}, which caused the memory metrics test to fail, since it expected to get a value > 0; it's expected that we wouldn't get sane data from those. I'll leave the code change simply to prevent nil panics in case of a weird instance type like this, but it makes sense that the tests choke on it.

Full description below if you want to see for yourself :p

Full describe-instance-types output on g2.xlarge
{
  "InstanceTypes": [
    {
      "InstanceType": "g2.xlarge",
      "CurrentGeneration": false,
      "FreeTierEligible": false,
      "SupportedUsageClasses": [],
      "SupportedRootDeviceTypes": [
        "ebs",
        "instance-store"
      ],
      "SupportedVirtualizationTypes": [
        "hvm"
      ],
      "BareMetal": false,
      "Hypervisor": "xen",
      "ProcessorInfo": {
        "SupportedArchitectures": [
          "x86_64"
        ],
        "Manufacturer": "Other"
      },
      "VCpuInfo": {
        "DefaultVCpus": 4
      },
      "MemoryInfo": {
        "SizeInMiB": 0
      },
      "InstanceStorageSupported": true,
      "InstanceStorageInfo": {
        "TotalSizeInGB": 0,
        "Disks": [
          {
            "SizeInGB": 0,
            "Count": 1,
            "Type": "hdd"
          }
        ],
        "NvmeSupport": "unsupported",
        "EncryptionSupport": "unsupported"
      },
      "EbsInfo": {
        "EbsOptimizedSupport": "default",
        "EncryptionSupport": "supported",
        "NvmeSupport": "required"
      },
      "NetworkInfo": {
        "NetworkPerformance": "Unspecified",
        "MaximumNetworkInterfaces": 4,
        "MaximumNetworkCards": 1,
        "DefaultNetworkCardIndex": 0,
        "NetworkCards": [
          {
            "NetworkCardIndex": 0,
            "NetworkPerformance": "Unspecified",
            "MaximumNetworkInterfaces": 4
          }
        ],
        "Ipv4AddressesPerInterface": 15,
        "Ipv6AddressesPerInterface": 0,
        "Ipv6Supported": false,
        "EnaSupport": "unsupported",
        "EfaSupported": false,
        "EncryptionInTransitSupported": false,
        "EnaSrdSupported": false
      },
      "PlacementGroupInfo": {
        "SupportedStrategies": [
          "cluster",
          "partition",
          "spread"
        ]
      },
      "HibernationSupported": false,
      "BurstablePerformanceSupported": false,
      "DedicatedHostsSupported": false,
      "AutoRecoverySupported": false,
      "SupportedBootModes": [
        "legacy-bios"
      ],
      "NitroEnclavesSupport": "unsupported",
      "NitroTpmSupport": "unsupported",
      "PhcSupport": "unsupported"
    }
  ]
}

auto-merge was automatically disabled August 7, 2024 06:19

Head branch was pushed to by a user without write access

`g2.xlarge` has `.EbsInfo.EbsOptimizedSupport == "default"`, but no
`.EbsInfo.EbsOptimizedInfo`. When selecting instance by EBS maximum
bandwidth, make sure that this key exists.
@kanwren kanwren force-pushed the kanwren/ebs-optimized-info-check branch from 1e78d61 to a98c0a2 Compare August 7, 2024 06:19
@kanwren kanwren force-pushed the kanwren/ebs-optimized-info-check branch from a98c0a2 to 1780bbf Compare August 7, 2024 06:23
Copy link
Contributor

@jonathan-innis jonathan-innis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

@jonathan-innis jonathan-innis enabled auto-merge (squash) August 7, 2024 20:08
@coveralls
Copy link

Pull Request Test Coverage Report for Build 10279220049

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 79.274%

Totals Coverage Status
Change from base Build 10278905359: 0.0%
Covered Lines: 5982
Relevant Lines: 7546

💛 - Coveralls

@jonathan-innis jonathan-innis merged commit 525a56b into aws:main Aug 7, 2024
16 checks passed
@kanwren kanwren deleted the kanwren/ebs-optimized-info-check branch August 7, 2024 20:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants