Insertion failures cause segfault in concurrent index update actor #99

Closed
dongxinEric opened this issue Feb 28, 2019 · 2 comments

@dongxinEric
Contributor

A segmentation fault is sometimes raised at https://github.com/FoundationDB/fdb-document-layer/blob/master/src/QLContext.actor.cpp#L418

After some basic debugging, it looks like the memory behind one of the items in the Cartesian product iterator has become invalid.

@dongxinEric dongxinEric self-assigned this Feb 28, 2019
@dongxinEric
Contributor Author

dongxinEric commented Mar 3, 2019

Some observations. It seems to happen in the following scenario:

  1. Multiple documents are inserted or updated in one transaction.
  2. One of the operations fails, for whatever reason.
  3. The other actors are then cancelled, at least in theory.
  4. Some actor nevertheless keeps running and accesses memory that has since become invalid, which causes the segfault.
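
The suspected sequence can be modelled in plain C++ without Flow. This is only a sketch under assumptions, not the Document Layer code: `SketchTxnState`, `SketchActor`, `BatchResult`, and `runBatch` are hypothetical names, and the transaction reset is modelled as destroying the shared state. Actors hold that state through `std::weak_ptr`, so a late access is detected and counted instead of dereferencing freed memory.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for per-transaction state shared by several actors.
struct SketchTxnState {
    std::map<std::string, std::string> pendingWrites;
};

// Hypothetical stand-in for one insert/update actor in the batch.
struct SketchActor {
    std::weak_ptr<SketchTxnState> state;  // non-owning view of the shared state
    std::string key;
    bool cancelled = false;

    // Returns true if the write was applied, false if the actor was cancelled
    // or the state it depends on no longer exists.
    bool step() {
        if (cancelled) return false;
        std::shared_ptr<SketchTxnState> s = state.lock();
        if (!s) return false;  // with a raw pointer this would be a use-after-free
        s->pendingWrites[key] = "doc";
        return true;
    }
};

struct BatchResult {
    std::size_t applied = 0;        // writes made before the failure
    std::size_t staleAccesses = 0;  // non-cancelled actors that ran after the reset
};

// Runs the actors in order. The actor at index failAt "fails", which resets the
// shared state (step 2). If cancelOthers is false, the remaining actors keep
// running against state that is gone (step 4).
BatchResult runBatch(std::size_t actorCount, std::size_t failAt, bool cancelOthers) {
    assert(failAt < actorCount);
    std::shared_ptr<SketchTxnState> state = std::make_shared<SketchTxnState>();
    std::vector<SketchActor> actors;
    for (std::size_t i = 0; i < actorCount; ++i) {
        SketchActor a;
        a.state = state;
        a.key = "doc" + std::to_string(i);
        actors.push_back(a);
    }

    BatchResult result;
    for (std::size_t i = 0; i < actors.size(); ++i) {
        if (i == failAt) {
            state.reset();  // sole owner released: the shared state is destroyed
            if (cancelOthers) {
                for (SketchActor& a : actors) a.cancelled = true;
            }
            continue;
        }
        bool wasLive = !actors[i].cancelled;
        if (actors[i].step()) {
            ++result.applied;
        } else if (wasLive) {
            ++result.staleAccesses;  // a live actor touched invalidated state
        }
    }
    return result;
}
```

Under these assumptions, `runBatch(4, 1, false)` leaves two live actors touching state that no longer exists, while `runBatch(4, 1, true)` leaves none.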

I cannot reproduce it locally, but we still see the segfault in our nightly correctness run: about a dozen out of 100K iterations. Thanks to the solid FDB KVS backing this, and because the actual error (step 2) is returned to the user before the segfault, the crash is transparent to the user: the monitor process automatically restarts the main fdbdoc process once it dies. There is no correctness violation either, because nothing has been committed at that point.

@hgray1 hgray1 assigned apkar and unassigned dongxinEric Mar 11, 2019
@apkar apkar changed the title from "Segmentation Fault when running correctness tests" to "Insertion failures cause segfault in concurrent index update actor" Mar 13, 2019
@apkar apkar added the bug Something isn't working label Mar 13, 2019
@apkar
Contributor

apkar commented Mar 13, 2019

The batch of inserts in a single request runs under a single transaction, with the transaction loop in doRetry(). When an insert in this batch fails, the catch block calls tr->onError on the FDB transaction, which resets the transaction. Any concurrent actors still using this transaction are left in a bad state. Two fixes are needed:

  • The transaction shouldn't reset itself to a clean state on non-retryable errors. Instead, it should fail any further requests on that transaction. That gives concurrent actors a chance to deal with the cancelled transaction. This needs changes in fdbclient.
  • DocTransaction should cancel all ongoing actors when the transaction fails.

This issue deals with the second task.
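
To make the two fixes concrete, here is a minimal sketch in plain C++ rather than Flow. Everything in it is an assumption made for illustration: `SketchTransaction`, `SketchDocTransaction`, `CancelToken`, and the `transaction_cancelled` message are hypothetical and are not the fdbclient or Document Layer API. The transaction refuses further requests after a non-retryable error (first fix), and the wrapper cancels every actor it started before the transaction is touched again (second fix).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the underlying transaction (first fix): after a
// non-retryable error it rejects further requests instead of silently
// resetting to a clean state.
class SketchTransaction {
public:
    void set(const std::string& key, const std::string& value) {
        if (failed_) throw std::runtime_error("transaction_cancelled");
        writes_[key] = value;
    }
    void failNonRetryable() {
        failed_ = true;
        writes_.clear();  // nothing from the failed attempt is kept
    }
    bool failed() const { return failed_; }
    std::size_t writeCount() const { return writes_.size(); }

private:
    bool failed_ = false;
    std::map<std::string, std::string> writes_;
};

// Hypothetical cancellation flag handed to each concurrent actor.
struct CancelToken {
    bool cancelled = false;
};

// Hypothetical stand-in for DocTransaction (second fix): it tracks the actors
// it started and cancels all of them when the transaction fails.
class SketchDocTransaction {
public:
    std::shared_ptr<CancelToken> startActor() {
        std::shared_ptr<CancelToken> token = std::make_shared<CancelToken>();
        ongoing_.push_back(token);
        return token;
    }

    // Error path: cancel every ongoing actor first, then fail the transaction.
    void onFailure() {
        for (std::shared_ptr<CancelToken>& token : ongoing_) token->cancelled = true;
        ongoing_.clear();
        tr_.failNonRetryable();
        assert(tr_.failed());
    }

    // One unit of actor work: check for cancellation before using the transaction.
    bool actorWrite(const CancelToken& token, const std::string& key) {
        if (token.cancelled) return false;
        tr_.set(key, "doc");
        return true;
    }

    std::size_t ongoingCount() const { return ongoing_.size(); }
    SketchTransaction& transaction() { return tr_; }

private:
    SketchTransaction tr_;
    std::vector<std::shared_ptr<CancelToken>> ongoing_;
};
```

In this sketch a cancelled actor returns early, and any code path that skips the cancellation check still gets an explicit error from the transaction rather than operating on reset state.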

apkar added a commit to apkar/fdb-document-layer that referenced this issue Mar 13, 2019
…r->onError()

By not cleaning DocTransaction state which internally depends on FDB
transaction state, concurrent actors during onError() are causing
segfault.