Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/npe on update #640

Merged
merged 7 commits into from
Aug 10, 2022

Conversation

Miles-Garnsey
Copy link
Member

What this PR does:

Because the cassandra.ApplyAuth function call within reconcileDatacenters creates a PodTemplateSpec within the cassandra.DatacenterConfig, certain downstream reconciliation functions have come to depend on the creation of that struct to behave in an error free way.

When auth==false this causes NPEs which crash the operator in spectacular fashion.

This PR ensures that all of those downstream functions guard against a nil PodTemplateSpec, consistently return errors, no longer expect to each create their own PodTemplateSpec and that any errors that do occur are handled gracefully.

Which issue(s) this PR fixes:
#637

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

@Miles-Garnsey Miles-Garnsey requested a review from a team as a code owner July 28, 2022 07:48
// Ensure we have a valid PodTemplateSpec before proceeding to modify it.
if dcConfig.PodTemplateSpec == nil {
dcConfig.PodTemplateSpec = &corev1.PodTemplateSpec{}
// we need to declare at least one container, otherwise the PodTemplateSpec struct will be invalid
Copy link
Contributor

@burmanm burmanm Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't PodTemplateSpec stay as nil? PodTemplateSpec is not required by cass-operator and definitely shouldn't be created if you're not intending to use it to customize something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always customize things in K8ssandra-operator and always rely on the podTemplateSpec.
There's only one very edgy case where it's not true, which is when auth is disabled, medusa is disabled and metrics filters are explicitly voided with an empty list.
Doing this will simplify a lot future developments on K8ssandra-operator.

@Miles-Garnsey
Copy link
Member Author

@adejanovski we're seeing a single test failure here on CreateSingleDatacenterCluster which (surprisingly!) appears to be legitimate.

Log: k8ssandra-operator-b9cc64b7c-4gcp6.log

The specific error relates to duplicate cassandra containers being created in the STS... This is likely due to the changes you asked for where we do cassandra.UpdateCassandraContainer() in reconcileDatacenters(), which has uncovered a another new bug in AddContainersToPodTemplateSpec()

The single-dc manifest is extremely complex, however, I'd say that the fact that it declares spec.containers[0] == cassandra may play a role.

When the reconciliation runs, it calls UpdateCassandraContainer here, which creates the container cassandra in the PodTemplateSpec. Then, it calls AddContainersToPodTemplateSpec(), which adds the cassandra container again.

Because this function is defined (see here) without the strategic merge logic we'd expect, we end up with two separate cassandra containers. FWIW, this is an error irrespective of how we've modified reconcileDatacenters, the fact that it was originally working is a bit of a fluke IMO.

I've dropped the STS here: sts.txt

Here's the error it creates:

create Pod weirdclustername-dc1-default-sts-0 in StatefulSet weirdclustername-dc1-default-sts failed error: Pod "weirdclustername-dc1-default-sts-0" is invalid: [spec.containers[0].

image: Required value, spec.containers[1].name: Duplicate value: "cassandra"]

@Miles-Garnsey
Copy link
Member Author

@adejanovski I think this one is good to go pending tests passing (the previous round of tests did). Can you please confirm you're happy with it and I'll merge.

k8ssandra-example.yaml Show resolved Hide resolved
func ApplyAuth(dcConfig *DatacenterConfig, authEnabled bool) error {

if dcConfig.PodTemplateSpec == nil {
return errors.New("PodTemplateSpec was nil, cannot add auth settings")
Copy link
Contributor

@adutra adutra Aug 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be curious to learn what others think, but I'm torn on the idea of returning errors when PodTemplateSpec is nil, for a few reasons:

  1. Such errors are not the user's fault but rather the developer's, and yet users would see these messages in logs, but how would they be able to fix them by themselves? Put in other words, returning an error would signify that the error is a runtime error, i.e. something that happens occasionally – while in fact, this is a design error that should really never happen at runtime.
  2. Such errors being the developer's fault – our fault :-) – I'd rather see them panic here, so that our unit tests can "fail spectacularly" as you say, and thus quickly detect such problems. If the issue stems from setting auth to false, maybe we'd need to improve the existing tests so that they can catch this one, and other similar nil pointer issues. But returning an error without fixing the issue would just make the controller enter an infinite loop of failed reconcile attempts.
  3. Speaking of which, the real fix is in datacenters.go, and your changes look good there. IMHO the changes in that file are enough to consider the problem solved. In all places where dcConfig.PodTemplateSpec is read, I'd simply remove the nil checks and consider that this field is assumed to never be nil.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm 👎 on returning an error in functions that work with PodTemplateSpec. I agree about the real fix being in datacenters.go and that the changes look good.

I also agree that this is something that should be handled with test coverage vs making functions unnecessarily convoluted. If we aren't confident in our ability to ensure PodTemplateSpec will be initialized before it's use and having adequate test coverage, then maybe we shouldn't export dcConfig.PodTemplateSpec, i.e., change it to dcConfig.podTemplateSpec and provide a method on dcConfig for accessing it. That method would ensure it is non-nil. I'm good though with the changes in datacenters.go :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RE errors being returned: I think that's some good feedback. The invariant we are trying to maintain is that the PodTemplateSpec must never be nil. I have done this by returning a runtime error, and it is correct to say that (because of the way we handle such errors) it probably isn't quite the optimal design.

The question is how we can maintain this invariant via a compile time check, perhaps the solution is to make the PodTemplateSpec a value type instead of a pointer? That way it can never be nil and we can remove the nil checks everywhere.

The reality is that (while it may have originally made sense for the PodTemplateSpec to be a pointer as it was optional) the logic has evolved and it is no longer optional at all. We should perhaps reflect that in the types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, we had a similar situation with CassandraConfig and its fields, which were all pointers until a while ago; and we had nil checks everywhere. They were subsequently turned into value types and that indeed made things much cleaner. The 2 aspects to watch when doing this change imho are: turning nilness tests into equality tests; and how things get serialized/deserialized.

@Miles-Garnsey
Copy link
Member Author

I've done more manual testing on this and it definitely fixes the NPEs we were seeing.

While I do appreciate the point that perhaps it is not optimal to return errors from some of the functions as this PR does, I also think that cassandra.DatacenterConfig.PodTemplateSpec should no longer be a pointer (since it is not optional). That's the way to fix the root cause of this problem really.

Because making PodTemplateSpec a value type is a much bigger refactor, can I suggest that we merge this as is, then turn our hand to the refactoring tasks. Leaving this unmerged while we debate the finer points of the type system risks missing the forest for the trees a little, given that a commonly used piece of dev functionality is currently extremely broken.

@adutra
Copy link
Contributor

adutra commented Aug 9, 2022

can I suggest that we merge this as is, then turn our hand to the refactoring tasks.

Fair enough, I'll approve but please go ahead and create a follow-up issue to further refactor PodTemplateSpec.

@Miles-Garnsey Miles-Garnsey merged commit 1a23daa into k8ssandra:main Aug 10, 2022
adejanovski pushed a commit that referenced this pull request Sep 13, 2022
* Fix NPEs when auth == false and refactor all PodTemplateSpec linked functions for consistent error returns.
* CRD upgrades.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants