-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bugfix/npe on update #640
Bugfix/npe on update #640
Conversation
…nctions for consistent error returns.
fe22c52
to
9596e0f
Compare
controllers/k8ssandra/datacenters.go
Outdated
// Ensure we have a valid PodTemplateSpec before proceeding to modify it. | ||
if dcConfig.PodTemplateSpec == nil { | ||
dcConfig.PodTemplateSpec = &corev1.PodTemplateSpec{} | ||
// we need to declare at least one container, otherwise the PodTemplateSpec struct will be invalid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't PodTemplateSpec stay as nil? PodTemplateSpec is not required by cass-operator and definitely shouldn't be created if you're not intending to use it to customize something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always customize things in K8ssandra-operator and always rely on the podTemplateSpec.
There's only one very edgy case where it's not true, which is when auth is disabled, medusa is disabled and metrics filters are explicitly voided with an empty list.
Doing this will simplify a lot future developments on K8ssandra-operator.
@adejanovski we're seeing a single test failure here on Log: k8ssandra-operator-b9cc64b7c-4gcp6.log The specific error relates to duplicate cassandra containers being created in the STS... This is likely due to the changes you asked for where we do The When the reconciliation runs, it calls Because this function is defined (see here) without the strategic merge logic we'd expect, we end up with two separate cassandra containers. FWIW, this is an error irrespective of how we've modified I've dropped the STS here: sts.txt Here's the error it creates:
|
…plateSpec` so that cassandra container is only eer added once.
@adejanovski I think this one is good to go pending tests passing (the previous round of tests did). Can you please confirm you're happy with it and I'll merge. |
func ApplyAuth(dcConfig *DatacenterConfig, authEnabled bool) error { | ||
|
||
if dcConfig.PodTemplateSpec == nil { | ||
return errors.New("PodTemplateSpec was nil, cannot add auth settings") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be curious to learn what others think, but I'm torn on the idea of returning errors when PodTemplateSpec
is nil, for a few reasons:
- Such errors are not the user's fault but rather the developer's, and yet users would see these messages in logs, but how would they be able to fix them by themselves? Put in other words, returning an error would signify that the error is a runtime error, i.e. something that happens occasionally – while in fact, this is a design error that should really never happen at runtime.
- Such errors being the developer's fault – our fault :-) – I'd rather see them panic here, so that our unit tests can "fail spectacularly" as you say, and thus quickly detect such problems. If the issue stems from setting auth to false, maybe we'd need to improve the existing tests so that they can catch this one, and other similar nil pointer issues. But returning an error without fixing the issue would just make the controller enter an infinite loop of failed reconcile attempts.
- Speaking of which, the real fix is in datacenters.go, and your changes look good there. IMHO the changes in that file are enough to consider the problem solved. In all places where
dcConfig.PodTemplateSpec
is read, I'd simply remove the nil checks and consider that this field is assumed to never be nil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm 👎 on returning an error in functions that work with PodTemplateSpec. I agree about the real fix being in datacenters.go and that the changes look good.
I also agree that this is something that should be handled with test coverage vs making functions unnecessarily convoluted. If we aren't confident in our ability to ensure PodTemplateSpec will be initialized before it's use and having adequate test coverage, then maybe we shouldn't export dcConfig.PodTemplateSpec
, i.e., change it to dcConfig.podTemplateSpec
and provide a method on dcConfig
for accessing it. That method would ensure it is non-nil. I'm good though with the changes in datacenters.go :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RE errors being returned: I think that's some good feedback. The invariant we are trying to maintain is that the PodTemplateSpec must never be nil. I have done this by returning a runtime error, and it is correct to say that (because of the way we handle such errors) it probably isn't quite the optimal design.
The question is how we can maintain this invariant via a compile time check, perhaps the solution is to make the PodTemplateSpec a value type instead of a pointer? That way it can never be nil and we can remove the nil checks everywhere.
The reality is that (while it may have originally made sense for the PodTemplateSpec to be a pointer as it was optional) the logic has evolved and it is no longer optional at all. We should perhaps reflect that in the types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, we had a similar situation with CassandraConfig
and its fields, which were all pointers until a while ago; and we had nil checks everywhere. They were subsequently turned into value types and that indeed made things much cleaner. The 2 aspects to watch when doing this change imho are: turning nilness tests into equality tests; and how things get serialized/deserialized.
I've done more manual testing on this and it definitely fixes the NPEs we were seeing. While I do appreciate the point that perhaps it is not optimal to return errors from some of the functions as this PR does, I also think that Because making PodTemplateSpec a value type is a much bigger refactor, can I suggest that we merge this as is, then turn our hand to the refactoring tasks. Leaving this unmerged while we debate the finer points of the type system risks missing the forest for the trees a little, given that a commonly used piece of dev functionality is currently extremely broken. |
Fair enough, I'll approve but please go ahead and create a follow-up issue to further refactor PodTemplateSpec. |
* Fix NPEs when auth == false and refactor all PodTemplateSpec linked functions for consistent error returns. * CRD upgrades.
What this PR does:
Because the
cassandra.ApplyAuth
function call withinreconcileDatacenters
creates a PodTemplateSpec within thecassandra.DatacenterConfig
, certain downstream reconciliation functions have come to depend on the creation of that struct to behave in an error free way.When
auth==false
this causes NPEs which crash the operator in spectacular fashion.This PR ensures that all of those downstream functions guard against a nil
PodTemplateSpec
, consistently return errors, no longer expect to each create their ownPodTemplateSpec
and that any errors that do occur are handled gracefully.Which issue(s) this PR fixes:
#637
Checklist