Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't allow descriptionless assertion statements in production code #68616

Closed
mark-vieira opened this issue Feb 5, 2021 · 17 comments
Closed
Labels
:Delivery/Build Build or test infrastructure discuss Team:Delivery Meta label for Delivery team

Comments

@mark-vieira
Copy link
Contributor

mark-vieira commented Feb 5, 2021

We use assertions instead of exceptions in various places throughout the Elasticsearch code base. We should probably enforce a convention here being either a) explicitly disallow the use of assertions altogether (even though we force -ea) and use exceptions instead or b) require that a description message be provided with the assertion to avoid things like java.lang.AssertionError: null in our logs.

I haven't really looked into it but this seems like something we could enforce with checkstyle.

@mark-vieira mark-vieira added the :Delivery/Build Build or test infrastructure label Feb 5, 2021
@elasticmachine elasticmachine added the Team:Delivery Meta label for Delivery team label Feb 5, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@mark-vieira
Copy link
Contributor Author

I've added the discuss label here since this has broad implications on coding conventions in the codebase.

@DaveCTurner
Copy link
Contributor

DaveCTurner commented Feb 8, 2021

explicitly disallow the use of assertions altogether

Please don't do this. Assertions are extremely powerful pieces of documentation that make reading and reasoning about code so much easier. They protect against future changes that end up calling code in ways that the original author did not anticipate. If anything, IMO we should be finding ways to use assertions more, not discouraging their use. We use assertions to verify invariants of the system that would be too expensive to compute in production code, so simply replacing them with exceptions would not be feasible.

require that a description message be provided with the assertion

I'd rather not do this either. java.lang.AssertionError: null also comes with a stack trace pointing to the line that failed, and that is almost always enough. This rule would force changes like the following:

diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationState.java b/server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationState.java
index 51e634893be..dc0766f80eb 100644
--- a/server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationState.java
+++ b/server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationState.java
@@ -137,14 +137,14 @@ public class CoordinationState {
         assert getLastCommittedConfiguration().isEmpty() : getLastCommittedConfiguration();
         assert lastPublishedVersion == 0 : lastPublishedVersion;
         assert lastPublishedConfiguration.isEmpty() : lastPublishedConfiguration;
-        assert electionWon == false;
+        assert electionWon == false : "election already won";
         assert joinVotes.isEmpty() : joinVotes;
         assert publishVotes.isEmpty() : publishVotes;

         assert initialState.term() == 0 : initialState + " should have term 0";
         assert initialState.version() == getLastAcceptedVersion() : initialState + " should have version " + getLastAcceptedVersion();
-        assert initialState.getLastAcceptedConfiguration().isEmpty() == false;
-        assert initialState.getLastCommittedConfiguration().isEmpty() == false;
+        assert initialState.getLastAcceptedConfiguration().isEmpty() == false : "empty last-accepted config";
+        assert initialState.getLastCommittedConfiguration().isEmpty() == false : "empty last-committed config";

         persistedState.setLastAcceptedState(initialState);
     }

I don't think the proposed extra ceremony adds any value over and above the line number on which the assertion tripped.

@mark-vieira
Copy link
Contributor Author

mark-vieira commented Feb 8, 2021

hey protect against future changes that end up calling code in ways that the original author did not anticipate.

I don't see how assertions are a better solution to this than throwing exceptions. Additionally exceptions can be better documented in the javadocs for particular apis. We don't need to make the checked exceptions so there's no additional load for callers.

We use assertions to verify invariants of the system that would be too expensive to compute in production code, so simply replacing them with exceptions would not be feasible.

In what way is this more expensive? Either way we are evaluating a boolean expression, then throwing an error which involves filling stacktraces, etc.

I don't think the proposed extra ceremony adds any value over and above the line number on which the assertion tripped.

This assumes folks readily have the ES code available to them. When/if these assertions are tripped in production we should at least give folks looking at the logs an indication of what exploded, no? Having an error with a null message is not the greatest experience for folks trying to debug issues.

@DaveCTurner
Copy link
Contributor

Perhaps it would help if you could share some more context around why you've raised this for discussion. There are no linked tickets or anything, but I feel you might be talking about a specific encounter with an uninformative AssertionError that caused you some debugging headaches. There certainly are some uninformative assertions, and I agree we should strive to avoid them, but that's very different from completely banning this extraordinarily useful tool.

I don't see how assertions are a better solution to this than throwing exceptions.

The two solutions express very different intents. Assertions say "this is always true, there are other mechanisms in the surrounding code that guarantee it, if it's false then that's definitely a bug, fail the test" whereas conditional exceptions say "this might legitimately be false; rejecting it is not necessarily bad, it's up to the caller to handle our rejection". Essentially nothing handles AssertionError outside of a few very special cases. The difference is extremely valuable: I find branchy exception-throwing code to be much more time-consuming to read than straight-line asserting code.

In what way is this more expensive? Either way we are evaluating a boolean expression, then throwing an error which involves filling stacktraces, etc.

We have much more complicated/expensive things that are only checked if assertions are enabled. For instance the machinery for catching concurrency bugs on the indexing path, based around things like this (which has caught bugs):

// we maintain a second map that only receives the updates that we skip on the actual map (unsafe ops)
// this map is only maintained if assertions are enabled
private volatile Maps unsafeKeysMap = new Maps();

This is on the critical path for indexing. I don't think it's feasible to make these checks in production, but it would also be terrible to lose these checks entirely, or even to try and get similarly strong coverage by writing focussed tests instead. It's so much more powerful to have every test verify these things.

This is a similar example:

private boolean invariant() {
// local checkpoints only set during primary mode
assert primaryMode || checkpoints.values().stream().allMatch(lcps -> lcps.localCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO);
// global checkpoints only set during primary mode
assert primaryMode || checkpoints.values().stream().allMatch(cps -> cps.globalCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO);
// relocation handoff can only occur in primary mode
assert handoffInProgress == false || primaryMode;
// a relocated copy is not in primary mode
assert relocated == false || primaryMode == false;
// the current shard is marked as in-sync when the global checkpoint tracker operates in primary mode
assert primaryMode == false || checkpoints.get(shardAllocationId).inSync;
// the routing table and replication group is set when the global checkpoint tracker operates in primary mode
assert primaryMode == false || (routingTable != null && replicationGroup != null) :
"primary mode but routing table is " + routingTable + " and replication group is " + replicationGroup;
// when in primary mode, the current allocation ID is the allocation ID of the primary or the relocation allocation ID
assert primaryMode == false
|| (routingTable.primaryShard().allocationId().getId().equals(shardAllocationId)
|| routingTable.primaryShard().allocationId().getRelocationId().equals(shardAllocationId));
// during relocation handoff there are no entries blocking global checkpoint advancement
assert handoffInProgress == false || pendingInSync.isEmpty() :
"entries blocking global checkpoint advancement during relocation handoff: " + pendingInSync;
// entries blocking global checkpoint advancement can only exist in primary mode and when not having a relocation handoff
assert pendingInSync.isEmpty() || (primaryMode && !handoffInProgress);
// the computed global checkpoint is always up-to-date
assert primaryMode == false
|| globalCheckpoint == computeGlobalCheckpoint(pendingInSync, checkpoints.values(), globalCheckpoint)
: "global checkpoint is not up-to-date, expected: " +
computeGlobalCheckpoint(pendingInSync, checkpoints.values(), globalCheckpoint) + " but was: " + globalCheckpoint;
// when in primary mode, the global checkpoint is at most the minimum local checkpoint on all in-sync shard copies
assert primaryMode == false
|| globalCheckpoint <= inSyncCheckpointStates(checkpoints, CheckpointState::getLocalCheckpoint, LongStream::min)
: "global checkpoint [" + globalCheckpoint + "] "
+ "for primary mode allocation ID [" + shardAllocationId + "] "
+ "more than in-sync local checkpoints [" + checkpoints + "]";
// we have a routing table iff we have a replication group
assert (routingTable == null) == (replicationGroup == null) :
"routing table is " + routingTable + " but replication group is " + replicationGroup;
assert replicationGroup == null || replicationGroup.equals(calculateReplicationGroup()) :
"cached replication group out of sync: expected: " + calculateReplicationGroup() + " but was: " + replicationGroup;
// all assigned shards from the routing table are tracked
assert routingTable == null || checkpoints.keySet().containsAll(routingTable.getAllAllocationIds()) :
"local checkpoints " + checkpoints + " not in-sync with routing table " + routingTable;
for (Map.Entry<String, CheckpointState> entry : checkpoints.entrySet()) {
// blocking global checkpoint advancement only happens for shards that are not in-sync
assert pendingInSync.contains(entry.getKey()) == false || entry.getValue().inSync == false :
"shard copy " + entry.getKey() + " blocks global checkpoint advancement but is in-sync";
// in-sync shard copies are tracked
assert entry.getValue().inSync == false || entry.getValue().tracked :
"shard copy " + entry.getKey() + " is in-sync but not tracked";
}
// all pending in sync shards are tracked
for (String aId : pendingInSync) {
assert checkpoints.get(aId) != null : "aId [" + aId + "] is pending in sync but isn't tracked";
}
if (primaryMode && indexSettings.isSoftDeleteEnabled() && hasAllPeerRecoveryRetentionLeases) {
// all tracked shard copies have a corresponding peer-recovery retention lease
for (final ShardRouting shardRouting : routingTable.assignedShards()) {
if (checkpoints.get(shardRouting.allocationId().getId()).tracked) {
assert retentionLeases.contains(getPeerRecoveryRetentionLeaseId(shardRouting))
: "no retention lease for tracked shard [" + shardRouting + "] in " + retentionLeases;
assert PEER_RECOVERY_RETENTION_LEASE_SOURCE.equals(
retentionLeases.get(getPeerRecoveryRetentionLeaseId(shardRouting)).source())
: "incorrect source [" + retentionLeases.get(getPeerRecoveryRetentionLeaseId(shardRouting)).source()
+ "] for [" + shardRouting + "] in " + retentionLeases;
}
}
}
return true;
}

We don't want to check all this jazz every time we take or drop that mutex in production, it should not be necessary since we're confident that our test coverage would catch a problem here before release, but we absolutely want these checks to happen in every test that does any sort of replication to give us that confidence.

Also the boolean expression itself might be super-expensive to compute. For instance (this one has also caught bugs):

assert event.previousState() == coordinationState.get().getLastAcceptedState() ||
XContentHelper.convertToMap(
JsonXContent.jsonXContent, Strings.toString(event.previousState()), false
).equals(
XContentHelper.convertToMap(
JsonXContent.jsonXContent,
Strings.toString(clusterStateWithNoMasterBlock(coordinationState.get().getLastAcceptedState())),
false))
: Strings.toString(event.previousState()) + " vs "
+ Strings.toString(clusterStateWithNoMasterBlock(coordinationState.get().getLastAcceptedState()));

We could technically solve these things without assertions if we had some kind of global flag that is only enabled in tests that skips over the expensive test-only checks in production, and used these checks to throw an exception that nobody ever caught, and maybe even had some warnings in the IDE to make sure we didn't inadvertently change behaviour based on whether the checks were skipped or not, and also added compiler support to reduce the syntactic overhead of the checks, but we already have this functionality today (it's assertions) and if we reinvented it I don't see that we'd have made any progress on the problem that a failure might not contain all the information needed to diagnose what went wrong.

When/if these assertions are tripped in production

Wait, what? Surely we're only talking about tests? We absolutely shouldn't have folks running Elasticsearch in production with assertions enabled, not least because of the terrible performance consequences. We discussed adding a bootstrap check for this (#66271) which I argued against on the grounds that it would be trivial to bypass, but I could be persuaded if we're encountering production clusters with assertions enabled too often, and I would definitely change my position if the alternative is banning assertions altogether.

not the greatest experience for folks trying to debug issues.

Debugging an AssertionError without the code in front of you is not supposed to be a great experience, whether it has an informative message or not. I'd frankly be amazed if it was even feasible. Hitting an AssertionError is definitely a bug in the code, it should not be possible to get one no matter how badly you misuse Elasticsearch from the outside.

@mark-vieira
Copy link
Contributor Author

We have much more complicated/expensive things that are only checked if assertions are enabled.

I think I was working under the incorrect assumption that we enable assertions in prod, which we don't, so this makes sense. If the primary point of assertions is to catch when ES developers break certain assumptions in tests, that makes a lot more sense. Sorry for the misdirection here.

Wait, what? Surely we're only talking about tests? We absolutely shouldn't have folks running Elasticsearch in production with assertions enabled, not least because of the terrible performance consequences.

Yes, you are absolutely correct.

Back to your original statement:

We use assertions to verify invariants of the system that would be too expensive to compute in production code

I think the trouble I have is that I'm used to a strict "no assertions in production code" due to the fact that enabling/disabling them obviously has behavioral consequences at runtime. Depending on how we are doing exception handling, it may be that even our tests are swallowing some AssertionErrors. It doesn't seem like we do catch(Throwable very often so perhaps this is not a big deal and the reality is that things exploding in production due to conditions "slipping past" assertions is a very rare occurrence, in which case I'm happy to close this issue.

@tvernum
Copy link
Contributor

tvernum commented Feb 9, 2021

Like @DaveCTurner, I think we should definitely keep using assertions.

I'm -0 on enforcing messages.
I think it's OK for assertions that have a single possible trigger (e.g. assert foo != null) to not have a message (of course, I'd prefer that the JVM generated a message automatically like C does, but ...)

However, we have a lot of branches, and line numbers change with some frequency, so if there are real cases where people are finding it hard to track down the cause of an assertion purely from a file + line number, then I'd be OK with requiring messages.

@DaveCTurner
Copy link
Contributor

I think I was working under the incorrect assumption that we enable assertions in prod, which we don't, so this makes sense. If the primary point of assertions is to catch when ES developers break certain assumptions in tests, that makes a lot more sense. Sorry for the misdirection here.

Aha ok yes this makes a lot more sense now.

I think the trouble I have is that I'm used to a strict "no assertions in production code" due to the fact that enabling/disabling them obviously has behavioral consequences at runtime.

Indeed, this is a real concern, we have had bugs due to this (e.g. #29585). They're pretty rare tho, given how many assertions there are in the codebase and how few of them have meaningful side-effects that result in bugs, vs how many bugs we catch with assertions, so on balance I still think assertions are a good thing.

My IntelliJ warns on assertions that obviously have side effects. It's not perfect of course but it does catch a lot of cases. Maybe we should fail the build on that warning?

It doesn't seem like we do catch(Throwable very often so perhaps this is not a big deal

Yes, catching Throwable is something we strive to avoid as much as possible - catching any Error is unreasonable behaviour (whether an AssertionError or something else). See #19231 for the big change in that direction, and also things like #30845 and #68210.

@mark-vieira
Copy link
Contributor Author

My IntelliJ warns on assertions that obviously have side effects. It's not perfect of course but it does catch a lot of cases. Maybe we should fail the build on that warning?

That seems like something we should do, at least for non-test code. Any assertion statements should be side-effect free for obvious reasons.

@dnhatn
Copy link
Member

dnhatn commented Feb 9, 2021

Indeed, this is a real concern, we have had bugs due to this (e.g. #29585). They're pretty rare tho, given how many assertions there are in the codebase and how few of them have meaningful side-effects that result in bugs, vs how many bugs we catch with assertions, so on balance I still think assertions are a good thing.

IIRC we discussed this and preferred to run our tests with and without assertions enabled.

@mark-vieira
Copy link
Contributor Author

IIRC we discussed this and preferred to run our tests with and without assertions enabled.

Hmm, I wasn't aware of this. I presume we would only be able to do this for external cluster tests, unless we are certain we don't ever use assert in our test cases and rely exclusively on junit Assert methods.

@dnhatn
Copy link
Member

dnhatn commented Feb 9, 2021

@mark-vieira I think @jasontedor suggested this.

@jasontedor
Copy link
Member

Indeed, we have no chance to catch the assertion with side effects issues (of which we've had a few over the years) in CI if we only run there with assertions enabled. The old suggestion of mine is indeed that we run the external cluster tests with assertions disabled.

I don't think javac can be made to fail if an assert has side effects, I think a linter like FindBugs is needed for that, which would be a big hill to climb right now to be able to enable (the codebase would have to be cleaned). Also we have a place in the codebase where we deliberate use that it has a side effect that we'd need to suppress (we suppress the IntelliJ warnings there):

static {
boolean enabled = false;
/*
* If assertions are enabled, the following line will be evaluated and enabled will have the value true, otherwise when assertions
* are disabled enabled will have the value false.
*/
// noinspection ConstantConditions,AssertWithSideEffects
assert enabled = true;
// noinspection ConstantConditions
ENABLED = enabled;
}

@jpountz
Copy link
Contributor

jpountz commented Feb 15, 2021

My IntelliJ warns on assertions that obviously have side effects. It's not perfect of course but it does catch a lot of cases. Maybe we should fail the build on that warning?

@DaveCTurner I'm curious how you reconcile this suggestion with the example from LiveVersionMap, which you linked in a previous comment that performs assignments in assertions. Are you considering maintaining a list of exceptions where we believe that side-effects bring more value than risk, or should we find another way to check invariants in LiveVersionMap?

@hendrikmuhs
Copy link
Contributor

+1 for keeping it the way it is today, for the reasons @DaveCTurner already pointed out. I rarely use assertions, but there are a few use cases where I don't want to fail in prod but in CI. Examples are around optimistic locking and certain optimization execution paths. Those are not catastrophic failures but good to know for debugging. I could live with enforcing a message, however I trust the developer/code reviewer to add them if they add value.

@DaveCTurner
Copy link
Contributor

Are you considering maintaining a list of exceptions where we believe that side-effects bring more value than risk

I envisioned less a separate list of exceptions and more that one could suppress individual cases at the call site, like we do with other generally-good-but-not-universal ideas like forbidden-APIs and @SuppressForbidden. I've no idea if this is technically feasible in this case, but I find the general pattern of blocking accidental slip-ups while still allowing a deliberate bypass to be powerful.

@mark-vieira
Copy link
Contributor Author

Alright, I think at best we don't have strong agreement on this but we do generally acknowledge that issues with assertion side-effects can lead to production issues. As Jason has described I've opened #69067 to specifically address that and I'll close this issue for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Delivery/Build Build or test infrastructure discuss Team:Delivery Meta label for Delivery team
Projects
None yet
Development

No branches or pull requests

8 participants