Flush checks the size of the current batch to see if it still has work,
but a race can happen if pendingOpsCount > maxPendingOpCount and
operations have been placed in bufferedOperations.
When flush is called, scheduleCurrentBatchLocked() used to be called with flush=true, which would in turn call sendBatchLocked() with flush=true. If the call to sendBatchLocked() did not get rate limited, it would add a listener to the actual batch write that would call scheduleCurrentBatchLocked() again if flush == true.
If the current batch's size is 0, scheduleCurrentBatchLocked() simply returns, ending the flush.
If pendingOpsCount > maxPendingOpCount and operations are being buffered into bufferedOperations instead of a batch, there is a chance that scheduleCurrentBatchLocked() will execute before any pending operations complete, find the current batch empty, and end the "flush" prematurely.
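To make that sequence concrete, here is a heavily simplified, hypothetical model of the scheduling logic; the names mirror the description above, but none of this is the actual BulkWriter source:

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

// Hypothetical, heavily simplified model of the race described above.
final class FlushRaceSketch {
  private static final int MAX_PENDING_OP_COUNT = 500;

  private final Executor executor = Executors.newSingleThreadExecutor();
  private final Queue<Runnable> bufferedOperations = new ArrayDeque<>();
  private final Queue<Runnable> currentBatch = new ArrayDeque<>();
  private int pendingOpsCount = 0;

  synchronized void enqueue(Runnable op) {
    if (pendingOpsCount > MAX_PENDING_OP_COUNT) {
      // Too many in-flight operations: park the op outside the current batch.
      bufferedOperations.add(op);
    } else {
      currentBatch.add(op);
      pendingOpsCount++;
    }
  }

  synchronized void flush() {
    scheduleCurrentBatchLocked(true);
  }

  private synchronized void scheduleCurrentBatchLocked(boolean flush) {
    if (currentBatch.isEmpty()) {
      // Race: all remaining work may still be sitting in bufferedOperations,
      // but the empty current batch makes the "flush" end right here.
      return;
    }
    sendBatchLocked(flush);
  }

  private synchronized void sendBatchLocked(boolean flush) {
    CompletableFuture<Void> commit = commitCurrentBatch(); // stand-in for the RPC
    commit.thenRunAsync(() -> {
      if (flush) {
        // Listener added to the batch write: re-check for more work. If the
        // buffered operations have not been promoted into the current batch
        // yet, this sees size 0 and the flush silently ends.
        scheduleCurrentBatchLocked(true);
      }
    }, executor);
  }

  private CompletableFuture<Void> commitCurrentBatch() {
    currentBatch.clear();
    return CompletableFuture.completedFuture(null);
  }
}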
This fix avoids that with a global flush state and a flush "endpoint" that marks the last operation that needs to be flushed. This allows flush() to be called more than once (including a terminal call to close(), if it came to that).
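A rough sketch of that idea, with hypothetical names (the real change is in the branch linked below); the point is that the flush completes based on a recorded endpoint rather than on the current batch's size:

import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the "global flush state + flush endpoint" idea;
// names here are illustrative, not the actual fix.
final class FlushEndpointSketch {
  // Always represents "everything enqueued so far has finished".
  private CompletableFuture<Void> lastOperation = CompletableFuture.completedFuture(null);
  // The endpoint of the most recent flush()/close() call.
  private CompletableFuture<Void> lastFlushOperation = CompletableFuture.completedFuture(null);

  synchronized void track(CompletableFuture<Void> op) {
    // Chain each operation so a single future covers all prior work.
    lastOperation = CompletableFuture.allOf(lastOperation, op);
  }

  synchronized CompletableFuture<Void> flush() {
    // The flush "endpoint": this flush is done when the last operation
    // enqueued at this moment has settled, regardless of whether the current
    // batch happens to be empty right now. A later flush() or close() simply
    // captures a newer endpoint.
    lastFlushOperation = lastOperation;
    return lastFlushOperation;
  }
}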
We will end up calling scheduleCurrentBatchLocked() more often from sendBatchLocked(), which is done via callback, so there will be some additional overhead there. When we did 30,000 inserts in quick succession, many of the calls to sendBatchLocked() were coming from the rate-limit backoff branch, and those all had flush=false.
Environment details
OS type and version: Linux/macOS
Java version: 21
Version(s): com.google.cloud:google-cloud-firestore:3.22.0, com.google.cloud:google-cloud-firestore:3.24.2
Steps to reproduce
Insert 30,000 items into a collection within a try-with-resources block.
The application hangs once exiting the try-with-resources block.
Code example
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonProperty;
import java.time.LocalDate;

@JsonIgnoreProperties(ignoreUnknown = true)
public record HistoricalReference(
    @JsonProperty("Identifier")
    String identifier,
    @JsonProperty("Start date")
    LocalDate startDate
    // ...
) { }
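For reference, a minimal repro along the lines of the steps above might look like the following; the collection name and per-document fields are made up, and it assumes BulkWriter can be used as a try-with-resources resource, as the steps imply:

import com.google.cloud.firestore.BulkWriter;
import com.google.cloud.firestore.CollectionReference;
import com.google.cloud.firestore.Firestore;
import com.google.cloud.firestore.FirestoreOptions;
import java.util.Map;

public class BulkWriterHangRepro {
  public static void main(String[] args) throws Exception {
    Firestore firestore = FirestoreOptions.getDefaultInstance().getService();
    CollectionReference collection = firestore.collection("historical-references");

    // Hypothetical repro sketch: queue 30,000 writes, then let the
    // try-with-resources block close the BulkWriter.
    try (BulkWriter bulkWriter = firestore.bulkWriter()) {
      for (int i = 0; i < 30_000; i++) {
        Map<String, Object> fields = Map.of("index", i, "payload", "row-" + i);
        bulkWriter.set(collection.document(), fields);
      }
      // Leaving the block calls bulkWriter.close(), which flushes everything
      // still pending; this is where the application hangs.
    }

    firestore.close();
  }
}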
@tom-andersen & @milaGGL - I've been testing with the fix for a few days, and I still see a few problems. The one I've been running down is difficult for me to understand, but here's what I've got so far:
We create somewhere between 15,000 and 30,000 write operations. Occasionally, something will happen that will prevent some of the earlier operations from making their way into a BulkCommitBatch until the very end of the batch handling. The chain of futures leading up to lastFlushOperation gets very long.
In one case I have logs for, I see the 819th - 837th operations not getting into a batch until the final operation - in this case, the 14,530th. When those operations finally run and finish, they are the last, and at that point every operation has executed successfully. The futures chain starts propagating success up towards lastFlushOperation, but something happens and everything falls off the rails somewhere around the 13,174th operation. At that point, propagation stops, and lastFlushOperation will never finish.
This does not happen every time, and I have varying streaks of lucky and unlucky runs.
bulkWriter.flush() (and therefore bulkWriter.close()) have a race when checking whether there are more items to flush. We noticed this in our Cloud Function application, which tries to write 30,000 records to a collection.
I have a fix that is working here:
https://github.com/thewildllama/java-firestore/tree/bulk-writer-flush-fix
I detailed the problem in my commit message, which is quoted above.