replicating Security DB results in failed bootstrap #735
Comments
here is a zip of my test project |
This is a biggie because we cannot bootstrap into Production without replicating the Security database. I also forgot to mention that when you change the call to MarkLogic from |
Any notes in your ErrorLog.txt to give us an idea of where/when/why setup.xqy failed? Thanks! |
The annoying and no less mystifying thing about this is that no error is written to the error log. All the "contention" messages on the Security db are at debug level. I originally tracked the error down to the call to admin:forest-create on line 1615 of setup.xqy; however, that was when I was creating App-Services forests. I bet it is still going in there, though. I traced the flow into this API and it never seemed to come out. I had try/catch envelopes, but they didn't help. I simply have no idea how or where it failed. |
I wonder what the essential difference between /v1/eval and QC eval is.. I did notice that QC forces different-transaction. Maybe /v1/eval doesn't by default? I'd have to dig around in /v1/eval sources. |
The logic to apply replica forests has changed considerably. That may be the root cause.. |
Initial test doesn't look good. Bootstrap fails:
And after that Admin UI, and other things stop responding.. |
That is using latest dev.. |
I ran this against a 3 node cluster with 8.0-6.4, and I noticed the following ErrorLog messages. I first see these messages appear:
after which it waits. At some point the bootstrap times out to fail with the earlier mentioned XDMP-FORESTNOT. After that ErrorLog suddenly continues with:
And once finished synchronizing, the cluster returns to a responsive state. Note: I tried with 8.0-1.1 before, and that didn't complete successfully, leaving an unresponsive cluster, which I had to clear fully.. |
@grtjn so just to summarise: with 8.0-6.4, bootstrap fails but the deployment is eventually successful? |
@joecrean still running tests. Ping me offline if you like.. |
I think I have been able to pinpoint the issue, though I still wonder why this never caused trouble in the past.. |
I have derived the following so far: the bootstrap gets past admin:forest-create, and even to the point of admin:save-config-without-restart. At that stage the Security database starts closing, and gets stuck at middle closing. This is because the bootstrap code has not finished yet for some reason. It used to complete without trouble in the past, but since MarkLogic 8 (and also with 9) the bootstrap code gets stuck because it tries to access the Security database, which by that time is in the middle closing state. I.e. a deadlock. Only when the evals time out with an XDMP-FORESTNOT are the handles on the Security db released, so that the Security db can unmount, remount, and initialize replication. I added a check to see if Security has been touched, and if so bail out of the bootstrap. Doing that makes the bootstrap end normally and in a timely manner. I need to do some extra checks though as to why this is necessary.. |
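For readers following along, the deadlock described above can be illustrated with a toy model. This is a minimal Python sketch, not MarkLogic or Roxy code: ToyDatabase, begin_restart, and bootstrap are hypothetical names, and TimeoutError merely stands in for the XDMP-FORESTNOT timeout. It assumes a database that cannot finish closing while a handle is open, and that blocks new requests while it is closing.

```python
import threading


class ToyDatabase:
    """Hypothetical stand-in for a database that cannot close while handles are open."""

    def __init__(self):
        self._cond = threading.Condition()
        self._handles = 0
        self.state = "open"

    def acquire(self, timeout):
        # New requests block while the database is closing.
        with self._cond:
            if not self._cond.wait_for(lambda: self.state == "open", timeout):
                raise TimeoutError("database not available")
            self._handles += 1

    def release(self):
        with self._cond:
            self._handles -= 1
            self._cond.notify_all()

    def begin_restart(self):
        # Mark the database as closing right away, then finish in the background
        # once every open handle has been released.
        with self._cond:
            self.state = "closing"
        worker = threading.Thread(target=self._finish_restart)
        worker.start()
        return worker

    def _finish_restart(self):
        with self._cond:
            self._cond.wait_for(lambda: self._handles == 0)
            self.state = "open"
            self._cond.notify_all()


def bootstrap(db, bail_out, timeout=0.2):
    """Hold a handle, trigger a restart, then optionally touch the database again."""
    db.acquire(timeout)  # the running bootstrap holds a handle
    worker = None
    try:
        worker = db.begin_restart()  # saving the config starts the close
        if bail_out:
            return "ok"  # stop here instead of touching the closing database
        db.acquire(timeout)  # blocks: closing waits on us, we wait on closing
        db.release()
        return "ok"
    except TimeoutError:
        return "timed out"
    finally:
        db.release()  # only now can the restart complete
        if worker is not None:
            worker.join()


if __name__ == "__main__":
    print(bootstrap(ToyDatabase(), bail_out=False))
    print(bootstrap(ToyDatabase(), bail_out=True))
```

In this model the run without the bail-out only ends when the second access times out, after which the handle is released and the restart finishes. The run with the bail-out returns immediately and the restart completes right away.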
Same issue occurs on MarkLogic 7, except that it doesn't wait with releasing handles, so one doesn't really notice there is a deadlock:
|
So I had seen this issue when I added the code to scale out the cluster. I had tested with an older version of Roxy at that point, and it had the same problem. So I didn't delve into it deeper (I just reran bootstrap and it would complete after 1 or 2 re-runs, typically). All of my runs were on ML8. |
I did a lot of research, and discovered there are many xdmp:eval calls against the Security database, most of them irrelevant to --replicate-internals, except for those in setup:save-cleanup-state. Commenting out this line helped a lot, but disables functionality important to setting up replica forests: https://github.com/marklogic/roxy/blob/dev/deploy/lib/xquery/setup.xqy#L1344 I need to see if I can remove the xdmp:eval in that function, or execute it earlier somehow.. |
The update for the cleanup state should be done very late in the execution (after successfully creating the new replicas has been completed). The eval was used just so that we could run against the security DB to get some role information. I don't think we can change it to not run against that DB....but we should be able to move it earlier. Although at this point, I don't know that the evals in here run until everything is done. |
I'll dig in and try some things.. |
Actually, I think you can remove the xdmp:eval in that function. It is unnecessary to give doc permissions to the admin role (admin users can always see everything), so no need to check if the current user has admin role either. |
Fixed #735: removed unnecessary eval against Security
@joecrean the fix was merged into the dev branch. You could run Please close the ticket if you are satisfied by the fix.. |
One sec, in my last PR I moved amps to before db creation, but an amp depends on the db, which must exist before the amp can be created. I'll open a PR for that shortly.. |
#735: accidental move of create-amps, breaking self-test
@grtjn One thing - ml env wipe does not remove the Security replica. |
To wipe the internal replicas you need to use |
that wiped the grin off my face... haha i crack me up |
Fixed in dev |
The issue
Short description of the problem:
When you try to replicate the Security database as part of a bootstrap, the bootstrap seems to continue after the Security db goes offline and gets into a kind of death loop trying to look up details in the Security db. Here is a sample of the output in the ML ErrorLog.txt.
What are the steps to reproduce the problem?
Create a 3 node cluster using Docker, Red Hat SLES 6, ML 8.0-6.1. My host laptop is Mac OS X. Download and create a vanilla Roxy project. Set systems-dbs=Security. Run the bootstrap as below; eventually you will see the error message below.
Here is my local.properties
Tech Specs
Which Operating System are you using?
SLES 6
Which version of MarkLogic are you using?
8.0-6.1
Which version of Roxy are you using (see version.txt)?
1.7.5