Flush checks the size of the current batch to see if it still has work,
but a race can happen if pendingOpsCount > maxPendingOpCount and
operations have been placed in bufferedOperations.
When flush is called, scheduleCurrentBatchLocked() used to be called with flush=true, which would in turn call sendBatchLocked() with flush=true. If the call to sendBatchLocked() did not get rate limited, it would add a listener to the actual batch write that would call scheduleCurrentBatchLocked() again if flush == true.
If the current batch's size is 0, scheduleCurrentBatchLocked() simply returns, ending the flush.
If pendingOpsCount > maxPendingOpCount and operations are being buffered into bufferedOperations instead of a batch, there is a chance that scheduleCurrentBatchLocked() will execute before any pending operations complete, find the current batch empty, and end the "flush" prematurely.
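To make that sequence concrete, here is a heavily simplified, hypothetical model of the scheduling logic; the names mirror the description above, but none of this is the actual BulkWriter source:

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

// Hypothetical, heavily simplified model of the race described above.
final class FlushRaceSketch {
  private static final int MAX_PENDING_OP_COUNT = 500;

  private final Executor executor = Executors.newSingleThreadExecutor();
  private final Queue<Runnable> bufferedOperations = new ArrayDeque<>();
  private final Queue<Runnable> currentBatch = new ArrayDeque<>();
  private int pendingOpsCount = 0;

  synchronized void enqueue(Runnable op) {
    if (pendingOpsCount > MAX_PENDING_OP_COUNT) {
      // Too many in-flight operations: park the op outside the current batch.
      bufferedOperations.add(op);
    } else {
      currentBatch.add(op);
      pendingOpsCount++;
    }
  }

  synchronized void flush() {
    scheduleCurrentBatchLocked(true);
  }

  private synchronized void scheduleCurrentBatchLocked(boolean flush) {
    if (currentBatch.isEmpty()) {
      // Race: all remaining work may still be sitting in bufferedOperations,
      // but the empty current batch makes the "flush" end right here.
      return;
    }
    sendBatchLocked(flush);
  }

  private synchronized void sendBatchLocked(boolean flush) {
    CompletableFuture<Void> commit = commitCurrentBatch(); // stand-in for the RPC
    commit.thenRunAsync(() -> {
      if (flush) {
        // Listener added to the batch write: re-check for more work. If the
        // buffered operations have not been promoted into the current batch
        // yet, this sees size 0 and the flush silently ends.
        scheduleCurrentBatchLocked(true);
      }
    }, executor);
  }

  private CompletableFuture<Void> commitCurrentBatch() {
    currentBatch.clear();
    return CompletableFuture.completedFuture(null);
  }
}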
This fix avoids that with a global flush state and a flush "endpoint" that marks the last operation that needs to be flushed. This allows flush() to be called more than once (including a terminal call to close(), if it came to that).
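A rough sketch of that idea, with hypothetical names (the real change is in the branch linked below); the point is that the flush completes based on a recorded endpoint rather than on the current batch's size:

import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the "global flush state + flush endpoint" idea;
// names here are illustrative, not the actual fix.
final class FlushEndpointSketch {
  // Always represents "everything enqueued so far has finished".
  private CompletableFuture<Void> lastOperation = CompletableFuture.completedFuture(null);
  // The endpoint of the most recent flush()/close() call.
  private CompletableFuture<Void> lastFlushOperation = CompletableFuture.completedFuture(null);

  synchronized void track(CompletableFuture<Void> op) {
    // Chain each operation so a single future covers all prior work.
    lastOperation = CompletableFuture.allOf(lastOperation, op);
  }

  synchronized CompletableFuture<Void> flush() {
    // The flush "endpoint": this flush is done when the last operation
    // enqueued at this moment has settled, regardless of whether the current
    // batch happens to be empty right now. A later flush() or close() simply
    // captures a newer endpoint.
    lastFlushOperation = lastOperation;
    return lastFlushOperation;
  }
}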
We will end up calling scheduleCurrentBatchLocked() more often from sendBatchLocked(), which is done via callback, so there will be some additional overhead there. When we did 30,000 inserts in quick succession, many of the calls to sendBatchLocked() were coming from the rate-limit backoff branch, and those all had flush=false.
Environment details
OS type and version: Linux/macOS
Java version: 21
Version(s): com.google.cloud:google-cloud-firestore:3.22.0, com.google.cloud:google-cloud-firestore:3.24.2
Steps to reproduce
Insert 30,000 items into a collection within a try-with-resources block.
The application hangs once exiting the try-with-resources block.
Code example
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonProperty;
import java.time.LocalDate;

@JsonIgnoreProperties(ignoreUnknown = true)
public record HistoricalReference(
    @JsonProperty("Identifier")
    String identifier,
    @JsonProperty("Start date")
    LocalDate startDate
    // ...
) { }
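For reference, a minimal repro along the lines of the steps above might look like the following; the collection name and per-document fields are made up, and it assumes BulkWriter can be used as a try-with-resources resource, as the steps imply:

import com.google.cloud.firestore.BulkWriter;
import com.google.cloud.firestore.CollectionReference;
import com.google.cloud.firestore.Firestore;
import com.google.cloud.firestore.FirestoreOptions;
import java.util.Map;

public class BulkWriterHangRepro {
  public static void main(String[] args) throws Exception {
    Firestore firestore = FirestoreOptions.getDefaultInstance().getService();
    CollectionReference collection = firestore.collection("historical-references");

    // Hypothetical repro sketch: queue 30,000 writes, then let the
    // try-with-resources block close the BulkWriter.
    try (BulkWriter bulkWriter = firestore.bulkWriter()) {
      for (int i = 0; i < 30_000; i++) {
        Map<String, Object> fields = Map.of("index", i, "payload", "row-" + i);
        bulkWriter.set(collection.document(), fields);
      }
      // Leaving the block calls bulkWriter.close(), which flushes everything
      // still pending; this is where the application hangs.
    }

    firestore.close();
  }
}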
@tom-andersen & @milaGGL - I've been testing with the fix for a few days, and I still see a few problems. The one I've been running down is difficult for me to understand, but here's what I've got so far:
We create somewhere between 15,000 and 30,000 write operations. Occasionally, something will happen that will prevent some of the earlier operations from making their way into a BulkCommitBatch until the very end of the batch handling. The chain of futures leading up to lastFlushOperation gets very long.
In one case I have logs for, I see the 819th - 837th operations not getting into a batch until the final operation - in this case, the 14,530th. When those operations finally run and finish, they are the last, and at that point every operation has executed successfully. The futures chain starts propagating success up towards lastFlushOperation, but something happens and everything falls off the rails somewhere around the 13,174th operation. At that point, propagation stops, and lastFlushOperation will never finish.
This does not happen every time, and I have varying streaks of lucky and unlucky runs.
bulkWriter.flush() (and therefore bulkWriter.close()) have a race when checking whether there are more items to flush. We noticed this in our Cloud Function application, which tries to write 30,000 records to a collection.
I have a fix that is working here:
https://github.com/thewildllama/java-firestore/tree/bulk-writer-flush-fix
I detailed the problem in my commit message, which is quoted above.