Don't allow descriptionless assertion statements in production code #68616

mark-vieira · 2021-02-05T22:36:51Z

We use assertions instead of exceptions in various places throughout the Elasticsearch code base. We should probably enforce a convention here being either a) explicitly disallow the use of assertions altogether (even though we force -ea) and use exceptions instead or b) require that a description message be provided with the assertion to avoid things like java.lang.AssertionError: null in our logs.

I haven't really looked into it but this seems like something we could enforce with checkstyle.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2021-02-05T22:36:54Z

Pinging @elastic/es-delivery (Team:Delivery)

mark-vieira · 2021-02-05T22:37:30Z

I've added the discuss label here since this has broad implications on coding conventions in the codebase.

DaveCTurner · 2021-02-08T10:16:15Z

explicitly disallow the use of assertions altogether

Please don't do this. Assertions are extremely powerful pieces of documentation that make reading and reasoning about code so much easier. They protect against future changes that end up calling code in ways that the original author did not anticipate. If anything, IMO we should be finding ways to use assertions more, not discouraging their use. We use assertions to verify invariants of the system that would be too expensive to compute in production code, so simply replacing them with exceptions would not be feasible.

require that a description message be provided with the assertion

I'd rather not do this either. java.lang.AssertionError: null also comes with a stack trace pointing to the line that failed, and that is almost always enough. This rule would force changes like the following:

diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationState.java b/server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationState.java
index 51e634893be..dc0766f80eb 100644
--- a/server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationState.java
+++ b/server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationState.java
@@ -137,14 +137,14 @@ public class CoordinationState {
         assert getLastCommittedConfiguration().isEmpty() : getLastCommittedConfiguration();
         assert lastPublishedVersion == 0 : lastPublishedVersion;
         assert lastPublishedConfiguration.isEmpty() : lastPublishedConfiguration;
-        assert electionWon == false;
+        assert electionWon == false : "election already won";
         assert joinVotes.isEmpty() : joinVotes;
         assert publishVotes.isEmpty() : publishVotes;

         assert initialState.term() == 0 : initialState + " should have term 0";
         assert initialState.version() == getLastAcceptedVersion() : initialState + " should have version " + getLastAcceptedVersion();
-        assert initialState.getLastAcceptedConfiguration().isEmpty() == false;
-        assert initialState.getLastCommittedConfiguration().isEmpty() == false;
+        assert initialState.getLastAcceptedConfiguration().isEmpty() == false : "empty last-accepted config";
+        assert initialState.getLastCommittedConfiguration().isEmpty() == false : "empty last-committed config";

         persistedState.setLastAcceptedState(initialState);
     }

I don't think the proposed extra ceremony adds any value over and above the line number on which the assertion tripped.

mark-vieira · 2021-02-08T19:26:25Z

hey protect against future changes that end up calling code in ways that the original author did not anticipate.

I don't see how assertions are a better solution to this than throwing exceptions. Additionally exceptions can be better documented in the javadocs for particular apis. We don't need to make the checked exceptions so there's no additional load for callers.

We use assertions to verify invariants of the system that would be too expensive to compute in production code, so simply replacing them with exceptions would not be feasible.

In what way is this more expensive? Either way we are evaluating a boolean expression, then throwing an error which involves filling stacktraces, etc.

I don't think the proposed extra ceremony adds any value over and above the line number on which the assertion tripped.

This assumes folks readily have the ES code available to them. When/if these assertions are tripped in production we should at least give folks looking at the logs an indication of what exploded, no? Having an error with a null message is not the greatest experience for folks trying to debug issues.

DaveCTurner · 2021-02-08T22:06:30Z

Perhaps it would help if you could share some more context around why you've raised this for discussion. There are no linked tickets or anything, but I feel you might be talking about a specific encounter with an uninformative AssertionError that caused you some debugging headaches. There certainly are some uninformative assertions, and I agree we should strive to avoid them, but that's very different from completely banning this extraordinarily useful tool.

I don't see how assertions are a better solution to this than throwing exceptions.

The two solutions express very different intents. Assertions say "this is always true, there are other mechanisms in the surrounding code that guarantee it, if it's false then that's definitely a bug, fail the test" whereas conditional exceptions say "this might legitimately be false; rejecting it is not necessarily bad, it's up to the caller to handle our rejection". Essentially nothing handles AssertionError outside of a few very special cases. The difference is extremely valuable: I find branchy exception-throwing code to be much more time-consuming to read than straight-line asserting code.

In what way is this more expensive? Either way we are evaluating a boolean expression, then throwing an error which involves filling stacktraces, etc.

We have much more complicated/expensive things that are only checked if assertions are enabled. For instance the machinery for catching concurrency bugs on the indexing path, based around things like this (which has caught bugs):

elasticsearch/server/src/main/java/org/elasticsearch/index/engine/LiveVersionMap.java

Lines 184 to 186 in 2d1e8b3

    
           // we maintain a second map that only receives the updates that we skip on the actual map (unsafe ops) 
        
           // this map is only maintained if assertions are enabled 
        
           private volatile Maps unsafeKeysMap = new Maps();

This is on the critical path for indexing. I don't think it's feasible to make these checks in production, but it would also be terrible to lose these checks entirely, or even to try and get similarly strong coverage by writing focussed tests instead. It's so much more powerful to have every test verify these things.

This is a similar example:

elasticsearch/server/src/main/java/org/elasticsearch/index/seqno/ReplicationTracker.java

Lines 763 to 848 in 1b6ad96

    
           private boolean invariant() { 
        
               // local checkpoints only set during primary mode 
        
               assert primaryMode || checkpoints.values().stream().allMatch(lcps -> lcps.localCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO); 
        
               // global checkpoints only set during primary mode 
        
               assert primaryMode || checkpoints.values().stream().allMatch(cps -> cps.globalCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO); 
        
               // relocation handoff can only occur in primary mode 
        
               assert handoffInProgress == false || primaryMode; 
        
               // a relocated copy is not in primary mode 
        
               assert relocated == false || primaryMode == false; 
        
               // the current shard is marked as in-sync when the global checkpoint tracker operates in primary mode 
        
               assert primaryMode == false || checkpoints.get(shardAllocationId).inSync; 
        
               // the routing table and replication group is set when the global checkpoint tracker operates in primary mode 
        
               assert primaryMode == false || (routingTable != null && replicationGroup != null) : 
        
                   "primary mode but routing table is " + routingTable + " and replication group is " + replicationGroup; 
        
               // when in primary mode, the current allocation ID is the allocation ID of the primary or the relocation allocation ID 
        
               assert primaryMode == false 
        
                       || (routingTable.primaryShard().allocationId().getId().equals(shardAllocationId) 
        
                       || routingTable.primaryShard().allocationId().getRelocationId().equals(shardAllocationId)); 
        
               // during relocation handoff there are no entries blocking global checkpoint advancement 
        
               assert handoffInProgress == false || pendingInSync.isEmpty() : 
        
                   "entries blocking global checkpoint advancement during relocation handoff: " + pendingInSync; 
        
               // entries blocking global checkpoint advancement can only exist in primary mode and when not having a relocation handoff 
        
               assert pendingInSync.isEmpty() || (primaryMode && !handoffInProgress); 
        
               // the computed global checkpoint is always up-to-date 
        
               assert primaryMode == false 
        
                       || globalCheckpoint == computeGlobalCheckpoint(pendingInSync, checkpoints.values(), globalCheckpoint) 
        
                       : "global checkpoint is not up-to-date, expected: " + 
        
                       computeGlobalCheckpoint(pendingInSync, checkpoints.values(), globalCheckpoint) + " but was: " + globalCheckpoint; 
        
               // when in primary mode, the global checkpoint is at most the minimum local checkpoint on all in-sync shard copies 
        
               assert primaryMode == false 
        
                       || globalCheckpoint <= inSyncCheckpointStates(checkpoints, CheckpointState::getLocalCheckpoint, LongStream::min) 
        
                       : "global checkpoint [" + globalCheckpoint + "] " 
        
                       + "for primary mode allocation ID [" + shardAllocationId + "] " 
        
                       + "more than in-sync local checkpoints [" + checkpoints + "]"; 
        
               // we have a routing table iff we have a replication group 
        
               assert (routingTable == null) == (replicationGroup == null) : 
        
                   "routing table is " + routingTable + " but replication group is " + replicationGroup; 
        
               assert replicationGroup == null || replicationGroup.equals(calculateReplicationGroup()) : 
        
                   "cached replication group out of sync: expected: " + calculateReplicationGroup() + " but was: " + replicationGroup; 
        
               // all assigned shards from the routing table are tracked 
        
               assert routingTable == null || checkpoints.keySet().containsAll(routingTable.getAllAllocationIds()) : 
        
                   "local checkpoints " + checkpoints + " not in-sync with routing table " + routingTable; 
        
               for (Map.Entry<String, CheckpointState> entry : checkpoints.entrySet()) { 
        
                   // blocking global checkpoint advancement only happens for shards that are not in-sync 
        
                   assert pendingInSync.contains(entry.getKey()) == false || entry.getValue().inSync == false : 
        
                       "shard copy " + entry.getKey() + " blocks global checkpoint advancement but is in-sync"; 
        
                   // in-sync shard copies are tracked 
        
                   assert entry.getValue().inSync == false || entry.getValue().tracked : 
        
                       "shard copy " + entry.getKey() + " is in-sync but not tracked"; 
        
               } 
        
               // all pending in sync shards are tracked 
        
               for (String aId : pendingInSync) { 
        
                   assert checkpoints.get(aId) != null : "aId [" + aId + "] is pending in sync but isn't tracked"; 
        
               } 
        
               if (primaryMode && indexSettings.isSoftDeleteEnabled() && hasAllPeerRecoveryRetentionLeases) { 
        
                   // all tracked shard copies have a corresponding peer-recovery retention lease 
        
                   for (final ShardRouting shardRouting : routingTable.assignedShards()) { 
        
                       if (checkpoints.get(shardRouting.allocationId().getId()).tracked) { 
        
                           assert retentionLeases.contains(getPeerRecoveryRetentionLeaseId(shardRouting)) 
        
                               : "no retention lease for tracked shard [" + shardRouting + "] in " + retentionLeases; 
        
                           assert PEER_RECOVERY_RETENTION_LEASE_SOURCE.equals( 
        
                               retentionLeases.get(getPeerRecoveryRetentionLeaseId(shardRouting)).source()) 
        
                               : "incorrect source [" + retentionLeases.get(getPeerRecoveryRetentionLeaseId(shardRouting)).source() 
        
                               + "] for [" + shardRouting + "] in " + retentionLeases; 
        
                       } 
        
                   } 
        
               } 
        
               return true; 
        
           }

We don't want to check all this jazz every time we take or drop that mutex in production, it should not be necessary since we're confident that our test coverage would catch a problem here before release, but we absolutely want these checks to happen in every test that does any sort of replication to give us that confidence.

Also the boolean expression itself might be super-expensive to compute. For instance (this one has also caught bugs):

elasticsearch/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java

Lines 1115 to 1124 in 2d1e8b3

    
           assert event.previousState() == coordinationState.get().getLastAcceptedState() || 
        
               XContentHelper.convertToMap( 
        
                   JsonXContent.jsonXContent, Strings.toString(event.previousState()), false 
        
               ).equals( 
        
                   XContentHelper.convertToMap( 
        
                       JsonXContent.jsonXContent, 
        
                       Strings.toString(clusterStateWithNoMasterBlock(coordinationState.get().getLastAcceptedState())), 
        
                       false)) 
        
               : Strings.toString(event.previousState()) + " vs " 
        
               + Strings.toString(clusterStateWithNoMasterBlock(coordinationState.get().getLastAcceptedState()));

We could technically solve these things without assertions if we had some kind of global flag that is only enabled in tests that skips over the expensive test-only checks in production, and used these checks to throw an exception that nobody ever caught, and maybe even had some warnings in the IDE to make sure we didn't inadvertently change behaviour based on whether the checks were skipped or not, and also added compiler support to reduce the syntactic overhead of the checks, but we already have this functionality today (it's assertions) and if we reinvented it I don't see that we'd have made any progress on the problem that a failure might not contain all the information needed to diagnose what went wrong.

When/if these assertions are tripped in production

Wait, what? Surely we're only talking about tests? We absolutely shouldn't have folks running Elasticsearch in production with assertions enabled, not least because of the terrible performance consequences. We discussed adding a bootstrap check for this (#66271) which I argued against on the grounds that it would be trivial to bypass, but I could be persuaded if we're encountering production clusters with assertions enabled too often, and I would definitely change my position if the alternative is banning assertions altogether.

not the greatest experience for folks trying to debug issues.

Debugging an AssertionError without the code in front of you is not supposed to be a great experience, whether it has an informative message or not. I'd frankly be amazed if it was even feasible. Hitting an AssertionError is definitely a bug in the code, it should not be possible to get one no matter how badly you misuse Elasticsearch from the outside.

mark-vieira · 2021-02-08T22:19:39Z

We have much more complicated/expensive things that are only checked if assertions are enabled.

I think I was working under the incorrect assumption that we enable assertions in prod, which we don't, so this makes sense. If the primary point of assertions is to catch when ES developers break certain assumptions in tests, that makes a lot more sense. Sorry for the misdirection here.

Wait, what? Surely we're only talking about tests? We absolutely shouldn't have folks running Elasticsearch in production with assertions enabled, not least because of the terrible performance consequences.

Yes, you are absolutely correct.

Back to your original statement:

We use assertions to verify invariants of the system that would be too expensive to compute in production code

I think the trouble I have is that I'm used to a strict "no assertions in production code" due to the fact that enabling/disabling them obviously has behavioral consequences at runtime. Depending on how we are doing exception handling, it may be that even our tests are swallowing some AssertionErrors. It doesn't seem like we do catch(Throwable very often so perhaps this is not a big deal and the reality is that things exploding in production due to conditions "slipping past" assertions is a very rare occurrence, in which case I'm happy to close this issue.

tvernum · 2021-02-09T02:35:17Z

Like @DaveCTurner, I think we should definitely keep using assertions.

I'm -0 on enforcing messages.
I think it's OK for assertions that have a single possible trigger (e.g. assert foo != null) to not have a message (of course, I'd prefer that the JVM generated a message automatically like C does, but ...)

However, we have a lot of branches, and line numbers change with some frequency, so if there are real cases where people are finding it hard to track down the cause of an assertion purely from a file + line number, then I'd be OK with requiring messages.

DaveCTurner · 2021-02-09T09:07:42Z

I think I was working under the incorrect assumption that we enable assertions in prod, which we don't, so this makes sense. If the primary point of assertions is to catch when ES developers break certain assumptions in tests, that makes a lot more sense. Sorry for the misdirection here.

Aha ok yes this makes a lot more sense now.

I think the trouble I have is that I'm used to a strict "no assertions in production code" due to the fact that enabling/disabling them obviously has behavioral consequences at runtime.

Indeed, this is a real concern, we have had bugs due to this (e.g. #29585). They're pretty rare tho, given how many assertions there are in the codebase and how few of them have meaningful side-effects that result in bugs, vs how many bugs we catch with assertions, so on balance I still think assertions are a good thing.

My IntelliJ warns on assertions that obviously have side effects. It's not perfect of course but it does catch a lot of cases. Maybe we should fail the build on that warning?

It doesn't seem like we do catch(Throwable very often so perhaps this is not a big deal

Yes, catching Throwable is something we strive to avoid as much as possible - catching any Error is unreasonable behaviour (whether an AssertionError or something else). See #19231 for the big change in that direction, and also things like #30845 and #68210.

mark-vieira · 2021-02-09T16:01:02Z

My IntelliJ warns on assertions that obviously have side effects. It's not perfect of course but it does catch a lot of cases. Maybe we should fail the build on that warning?

That seems like something we should do, at least for non-test code. Any assertion statements should be side-effect free for obvious reasons.

dnhatn · 2021-02-09T16:03:13Z

Indeed, this is a real concern, we have had bugs due to this (e.g. #29585). They're pretty rare tho, given how many assertions there are in the codebase and how few of them have meaningful side-effects that result in bugs, vs how many bugs we catch with assertions, so on balance I still think assertions are a good thing.

IIRC we discussed this and preferred to run our tests with and without assertions enabled.

mark-vieira · 2021-02-09T16:06:39Z

IIRC we discussed this and preferred to run our tests with and without assertions enabled.

Hmm, I wasn't aware of this. I presume we would only be able to do this for external cluster tests, unless we are certain we don't ever use assert in our test cases and rely exclusively on junit Assert methods.

dnhatn · 2021-02-09T16:08:34Z

@mark-vieira I think @jasontedor suggested this.

jasontedor · 2021-02-09T18:00:55Z

Indeed, we have no chance to catch the assertion with side effects issues (of which we've had a few over the years) in CI if we only run there with assertions enabled. The old suggestion of mine is indeed that we run the external cluster tests with assertions disabled.

I don't think javac can be made to fail if an assert has side effects, I think a linter like FindBugs is needed for that, which would be a big hill to climb right now to be able to enable (the codebase would have to be cleaned). Also we have a place in the codebase where we deliberate use that it has a side effect that we'd need to suppress (we suppress the IntelliJ warnings there):

elasticsearch/server/src/main/java/org/elasticsearch/Assertions.java

Lines 24 to 34 in 8fff763

    
               static { 
        
                   boolean enabled = false; 
        
                   /* 
        
                    * If assertions are enabled, the following line will be evaluated and enabled will have the value true, otherwise when assertions 
        
                    * are disabled enabled will have the value false. 
        
                    */ 
        
                   // noinspection ConstantConditions,AssertWithSideEffects 
        
                   assert enabled = true; 
        
                   // noinspection ConstantConditions 
        
                   ENABLED = enabled; 
        
               }

jpountz · 2021-02-15T14:08:51Z

My IntelliJ warns on assertions that obviously have side effects. It's not perfect of course but it does catch a lot of cases. Maybe we should fail the build on that warning?

@DaveCTurner I'm curious how you reconcile this suggestion with the example from LiveVersionMap, which you linked in a previous comment that performs assignments in assertions. Are you considering maintaining a list of exceptions where we believe that side-effects bring more value than risk, or should we find another way to check invariants in LiveVersionMap?

hendrikmuhs · 2021-02-15T15:14:20Z

+1 for keeping it the way it is today, for the reasons @DaveCTurner already pointed out. I rarely use assertions, but there are a few use cases where I don't want to fail in prod but in CI. Examples are around optimistic locking and certain optimization execution paths. Those are not catastrophic failures but good to know for debugging. I could live with enforcing a message, however I trust the developer/code reviewer to add them if they add value.

DaveCTurner · 2021-02-15T15:27:30Z

Are you considering maintaining a list of exceptions where we believe that side-effects bring more value than risk

I envisioned less a separate list of exceptions and more that one could suppress individual cases at the call site, like we do with other generally-good-but-not-universal ideas like forbidden-APIs and @SuppressForbidden. I've no idea if this is technically feasible in this case, but I find the general pattern of blocking accidental slip-ups while still allowing a deliberate bypass to be powerful.

mark-vieira · 2021-02-16T17:34:42Z

Alright, I think at best we don't have strong agreement on this but we do generally acknowledge that issues with assertion side-effects can lead to production issues. As Jason has described I've opened #69067 to specifically address that and I'll close this issue for now.

mark-vieira added the :Delivery/Build Build or test infrastructure label Feb 5, 2021

elasticmachine added the Team:Delivery Meta label for Delivery team label Feb 5, 2021

mark-vieira added the discuss label Feb 5, 2021

mark-vieira mentioned this issue Feb 16, 2021

Integrations tests should be regularly run with assertions disabled #69067

Open

mark-vieira closed this as completed Feb 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't allow descriptionless assertion statements in production code #68616

Don't allow descriptionless assertion statements in production code #68616

mark-vieira commented Feb 5, 2021 •

edited

Loading

elasticmachine commented Feb 5, 2021

mark-vieira commented Feb 5, 2021

DaveCTurner commented Feb 8, 2021 •

edited

Loading

mark-vieira commented Feb 8, 2021 •

edited

Loading

DaveCTurner commented Feb 8, 2021

mark-vieira commented Feb 8, 2021

tvernum commented Feb 9, 2021

DaveCTurner commented Feb 9, 2021

mark-vieira commented Feb 9, 2021

dnhatn commented Feb 9, 2021

mark-vieira commented Feb 9, 2021

dnhatn commented Feb 9, 2021

jasontedor commented Feb 9, 2021

jpountz commented Feb 15, 2021

hendrikmuhs commented Feb 15, 2021

DaveCTurner commented Feb 15, 2021

mark-vieira commented Feb 16, 2021

Don't allow descriptionless assertion statements in production code #68616

Don't allow descriptionless assertion statements in production code #68616

Comments

mark-vieira commented Feb 5, 2021 • edited Loading

elasticmachine commented Feb 5, 2021

mark-vieira commented Feb 5, 2021

DaveCTurner commented Feb 8, 2021 • edited Loading

mark-vieira commented Feb 8, 2021 • edited Loading

DaveCTurner commented Feb 8, 2021

mark-vieira commented Feb 8, 2021

tvernum commented Feb 9, 2021

DaveCTurner commented Feb 9, 2021

mark-vieira commented Feb 9, 2021

dnhatn commented Feb 9, 2021

mark-vieira commented Feb 9, 2021

dnhatn commented Feb 9, 2021

jasontedor commented Feb 9, 2021

jpountz commented Feb 15, 2021

hendrikmuhs commented Feb 15, 2021

DaveCTurner commented Feb 15, 2021

mark-vieira commented Feb 16, 2021

mark-vieira commented Feb 5, 2021 •

edited

Loading

DaveCTurner commented Feb 8, 2021 •

edited

Loading

mark-vieira commented Feb 8, 2021 •

edited

Loading