This repository has been archived by the owner on Sep 21, 2022. It is now read-only.

Add cluster health check to drain script before shutting down local node #209

Closed
wants to merge 8 commits

Conversation

iamejboy
Contributor

Thanks for opening a PR. Please make sure you've read and followed the Contributing guide, including signing the Contributor License Agreement.

Feature or Bug Description

Amended the mysql drain script to check cluster node health before allowing the local node to shut down.

Motivation

We think this makes cluster updates safer, since cluster health is checked during drain before the local node is shut down.

Related Issue

#208

@cfdreddbot

Hey iamejboy!

Thanks for submitting this pull request!

All pull request submitters and commit authors must have a Contributor License Agreement (CLA) on-file with us. Please sign the appropriate CLA (individual or corporate).

When sending a signed CLA, please provide your GitHub username (for an individual CLA) or the list of GitHub usernames that can make pull requests on behalf of your organization (for a corporate CLA).

If you are confident that you're covered under a Corporate CLA, please make sure you've publicized your membership in the appropriate Github Org, per these instructions.

Once you've publicized your membership, one of the owners of this repository can close and reopen this pull request, and dreddbot will take another look.

@cf-gitbot
Collaborator

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/157656403

The labels on this github issue will be updated when the story is started.

@ctaymor
Contributor

ctaymor commented May 21, 2018

Hi @CAFxX,
Thanks for the PR.
Please respond to the issue to let us know when you've signed the CLA, and we'll close and reopen the PR to satisfy the robots.

  • Caroline

@iamejboy
Contributor Author

Hi @ctaymor,

We believe we are covered by a Corporate CLA and have already publicized our membership.

Regards

@ctaymor
Contributor

ctaymor commented May 22, 2018

Closing and reopening to get cfdreddbot to pick up the CLA.

@ctaymor closed this May 22, 2018
@ctaymor reopened this May 22, 2018
@cfdreddbot

Hey iamejboy!

Thanks for submitting this pull request! I'm here to inform the recipients of the pull request that you and the commit authors have already signed the CLA.

@cf-gitbot
Collaborator

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/157781101

The labels on this github issue will be updated when the story is started.


@bgandon left a comment


This is a fantastic contribution, as this release was desperately missing such a drain script. Thank you for contributing it!

As you'll see, I'm not an expert in wsrep internals, but rather in authoring BOSH releases and writing best-practice drain scripts. For the MariaDB side, I trust your own expertise and tests.

Please review my comments, push updates, and don't hesitate to answer where appropriate.

@@ -1,6 +1,59 @@
#!/bin/bash -e
#!/bin/bash -eu

You can see an example drain script here: https://github.com/gstackio/exemplar-release/blob/latest/jobs/paragon/templates/drain

The structure of drain scripts includes quite a few common patterns that are very specific to drain scripts. Copying file descriptor 1 to file descriptor 3 is an elegant way of providing activity logs for the drain script while still conforming to the BOSH requirement of only outputting an integer to the original stdout.

Other advice and examples can be found here: https://github.com/gstackio/exemplar-release/tree/latest#drain-docs

Here, as a start, I would advise you to use the #!/usr/bin/env bash shebang for portability, and to set -eu -o pipefail afterwards so that the options cannot be overridden.
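For instance, the opening of the script could look like this (just a sketch):

    #!/usr/bin/env bash

    # setting the options here rather than on the shebang line means they still
    # apply even if someone runs the script as `bash drain`
    set -eu -o pipefail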

return_code=$?
echo 0
exit ${return_code}
NODE_IP=<%= spec.ip %>

Please read the section about encoding: https://github.com/gstackio/exemplar-release#encoding

Every BOSH property injected with ERB in this script should be properly escaped.
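For example, one possible approach (a sketch only; using Ruby's Shellwords here is my own assumption, not something the encoding guide necessarily prescribes):

    <%
      # shell-escape every injected property so it is safe to paste into shell code
      require 'shellwords'
    %>
    NODE_IP=<%= Shellwords.escape(spec.ip) %>
    MYSQL_PORT=<%= Shellwords.escape(p('cf_mysql.mysql.port').to_s) %>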

Contributor Author


Thanks! I had that previously, but removed it after changing the code while testing. smh

NODE_IP=<%= spec.ip %>
MYSQL_PORT=<%= p("cf_mysql.mysql.port") %>

LOG_DIR="/var/vcap/sys/log/mysql/"

Please remove this line and replace it with the logging pattern demonstrated in https://github.com/gstackio/exemplar-release/blob/latest/jobs/paragon/templates/drain

Then your echo "..." &>> "$LOG_DIR/drain.log" statements will become plain echo "..." because of the default redirection, and the exit protocol (exit status and echoing the integer) will be handled automatically at script exit.
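A minimal sketch of that pattern:

    # keep the original stdout on file descriptor 3: that is where the final
    # integer for BOSH must go; everything else is appended to the drain log
    exec 3>&1
    exec 1>> /var/vcap/sys/log/mysql/drain.log 2>&1

    echo "checking cluster health before draining..."   # lands in the log

    # ... drain checks ...

    echo 0 >&3   # the only output BOSH sees on the original stdout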

fi
}

CLUSTER_NODES=(`wsrep_var wsrep_incoming_addresses $NODE_IP $MYSQL_PORT | sed -e 's/,/ /g'`)

Please use Bash-isms consistently, and use the $() syntax here instead of backticks.

Please quote "$NODE_IP" and "$MYSQL_PORT" to be sure the exact values are passed.
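For instance:

    CLUSTER_NODES=($(wsrep_var wsrep_incoming_addresses "$NODE_IP" "$MYSQL_PORT" | sed -e 's/,/ /g'))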

# rolling restart can actually cause data loss (e.g. if a node that is out
# of sync is used to bootstrap the cluster): in this case we fail immediately.
for NODE in "${CLUSTER_NODES[@]}"; do
NODE_IP=`echo $NODE | cut -d ":" -f 1`

Here, please:

  1. use $()
  2. quote $NODE properly
  3. don't over-quote ":" as it is a literal

All in all, I suggest you write NODE_IP=$(echo "$NODE" | cut -d: -f1)

And if you agree with this change, please apply it three more times below.

NODE_IP=`echo $NODE | cut -d ":" -f 1`
NODE_PORT=`echo $NODE | cut -d ":" -f 2`
cluster_status=`wsrep_var wsrep_cluster_status $NODE_IP $NODE_PORT`
if [ "$cluster_status" != "Primary" ]; then

Here, quoting "Primary" is unnecessary as it is a literal string anyway.

node_ip = spec.ip
hosts = (node_ip.split(".")[0..-2] + ["%"]).join('.')
%>
GRANT USAGE ON mysql.* TO 'drain'@'<%= hosts %>' IDENTIFIED BY '<%= p('cf_mysql.mysql.drain.db_password') %>'

It would be best to find some Ruby syntax to SQL-escape those values that are injected here.
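For instance, doubling single quotes would be one way (just a sketch; a proper quoting helper would be more robust, and backslashes in the password would still need handling under MySQL's default SQL mode):

    GRANT USAGE ON mysql.* TO 'drain'@'<%= hosts %>'
      IDENTIFIED BY '<%= p('cf_mysql.mysql.drain.db_password').gsub("'", "''") %>'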

@@ -0,0 +1,5 @@
<%
node_ip = spec.ip
hosts = (node_ip.split(".")[0..-2] + ["%"]).join('.')

This is not the way we should grant the drain user.

This script will be run any time the MariaDB node IPs change anyway, and you cannot assume that the node IPs will follow a specific addressing scheme. So, you could loop over the list of Galera nodes here (using a BOSH link) and output one GRANT per IP address.

But there is an even simpler option: this script only needs to allow the current node to connect, i.e. <%= spec.ip %>, because a similar drain script will exist on the other nodes and take care of the proper GRANT, just in time when it is necessary.

Contributor Author


So, you could loop over the list of Galera nodes here (using a BOSH link) and output one GRANT per IP address

this script only needs to allow the current node to connect, i.e. <%= spec.ip %>, because a similar drain script will exist on the other nodes and take care of the proper GRANT, just in time when it is necessary.

  • the script also needs to allow the other node IPs: during drain on the local node, it connects to the other nodes' databases to check their status.

timeout 5 \
/usr/local/bin/mysql --defaults-file=/var/vcap/jobs/mysql/config/drain.cnf -h "$host" -P "$port" \
--execute="SHOW STATUS LIKE '$var_name'" -N |\
awk '{print $2}' | tr -d '\n'

Style: this would look better if you put the pipe | at the beginning of this line instead of at the end of the previous line.
E.g.:

      --execute="SHOW STATUS LIKE '$var_name'" -N \
      | awk '{print $2}' \
      | tr -d '\n'

if [ "$state" != "Synced" ]; then
echo "wsrep_local_state_comment of node '$NODE_IP' is '$state' (expected 'Synced'): retry drain in 5 seconds" &>> "$LOG_DIR/drain.log"
# TODO: rewrite to avoid using dynamic drain (soon to be deprecated)
echo -5 # retry in 5 seconds

If you write -5 here, then you must exit 0 right away, as per the drain script contract.
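In other words (a sketch of the dynamic drain convention):

    echo -5   # ask BOSH to call the drain script again in 5 seconds
    exit 0    # the script itself must still terminate successfully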

@ndhanushkodi
Contributor

Thanks for the feedback @bgandon. @iamejboy, if you want to make any of these changes, feel free. Otherwise we can take care of any cleanup when we merge the PR.

@iamejboy We were going over the functionality where it checks the Synced state for all of the nodes. Typically, two nodes wouldn't be doing a state transfer while the third node is being drained. Any state transfer should be finished before draining of the third node starts.

Are we missing an edge case here where you've seen this happen?

We suggest exit 1 in the non-Synced case, since it removes the echo -5, which risks becoming an infinite loop if a node is stuck in a non-Synced state. In that case the bosh deploy would just hang forever, since there is no drain timeout.

It also seems impossible for a node to be Synced and non-Primary. If that is the case, we probably don't need to check the Primary component. Does this match your understanding of these two fields?

Thoughts?

@iamejboy
Contributor Author

@bgandon thanks for your feedback. @ndhanushkodi Yes, you may take care of the cleanup.

Are we missing an edge case here where you've seen this happen?

  • We have not seen it yet, as we are still evaluating the cluster carefully, but we were surprised that drain doesn't do a cluster check as an additional safety net.

It also seems impossible for a node to be Synced and non-Primary

  • I see; if that is the case, there's no need to check the other.

@CAFxX
Contributor

CAFxX commented May 27, 2018

Typically, two nodes wouldn't be doing a state transfer while the third node is being drained. Any state transfer should be finished before draining of the third node starts.

Yes, there are a couple of cases AFAICT:

  • Consider the case in which a node gets stopped by the IaaS during, but independently from, the deploy (e.g. node 1 crashes and restarts before node 3 is drained; when node 1 becomes a joiner, node 2 becomes a donor; if you now stop node 3, the only node in the Synced state, before node 2 goes back to Synced... we are likely to go kaboom). While checking that all nodes are Synced doesn't 100% solve the problem (bad timing could still cause the IaaS to restart a node exactly when another node is being deployed), it limits, almost for free, the window in which the cluster can enter a non-functioning state.
  • As explained in https://fromdual.com/murphys-law-is-also-valid-for-galera-cluster-for-mysql a node can switch from DONOR back to SYNCED significantly before the joiner node switches from JOINING to SYNCED, so it seems that checking only the local node is not enough.

@ndhanushkodi
Contributor

Hey @CAFxX and @iamejboy,

We were trying to do some verification and testing around the changes in the drain script. We partitioned off one node so it was Initialized while the other nodes were Synced.

Unfortunately, when the node is partitioned off, the other nodes drop it from their wsrep_incoming_addresses list, so the healthy nodes continue through the drain script without detecting the degraded state of the cluster.

The script as it is does catch some very timing-dependent cases, but not larger cluster problems.

One thought we had was to check all IPs of the cluster, using the bosh link to get them all. That causes problems during a scale-down though, since BOSH deletes the unneeded nodes first, and the drain script would then be unable to connect to them.

Any thoughts on how it might be made better?

@CAFxX
Contributor

CAFxX commented May 31, 2018

Any thoughts on how it might be made better?

I feel like there should be an easier way, but this should work (using some drain magic):

  • at the end of the drain script, check if the current node is being deleted; if so, record in an ad-hoc drain table in the database that this node is being removed from the cluster
  • in the drain script, instead of checking all nodes in the bosh link, check only the nodes in the bosh link that don't appear in the drain table
  • in the post-deploy script, create the drain table (if it doesn't exist) or empty the drain table (if it exists)

or, similarly:

  • in the post-deploy script, create the drain table (if it doesn't exist) and add all nodes to it
  • in the drain script, check only the nodes that appear in the drain table
  • at the end of the drain script, if the local node is being deleted, remove it from the drain table
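Just to make the first variant concrete, a rough sketch of the bookkeeping (the table name, column, and mysql client config file are made up for illustration):

    MYSQL="/usr/local/bin/mysql --defaults-file=/var/vcap/jobs/mysql/config/drain.cnf"

    # post-deploy: create the bookkeeping table if needed, then empty it
    $MYSQL -e "CREATE TABLE IF NOT EXISTS mysql.drain_info (node_ip VARCHAR(45) PRIMARY KEY)"
    $MYSQL -e "TRUNCATE TABLE mysql.drain_info"

    # end of the drain script, only when the local node is being deleted:
    $MYSQL -e "INSERT IGNORE INTO mysql.drain_info (node_ip) VALUES ('$NODE_IP')"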

@ndhanushkodi do you think this would work?

@CAFxX
Contributor

CAFxX commented Jun 7, 2018

@ndhanushkodi any feedback?

@ldangeard-orange
Contributor

Hi,
I tested split-brain on a Galera cluster. In this case the statuses are:

+---------------------------+--------------------------------------+
| Variable_name             | Value                                |
+---------------------------+--------------------------------------+
| wsrep_cluster_status      | non-Primary                          |
| wsrep_local_state_comment | Initialized                          |
+---------------------------+--------------------------------------+

@ctaymor
Contributor

ctaymor commented Jun 11, 2018

Hi @CAFxX,
Having a drain table sounds like a really expensive fix to this problem. We would then have to add logic to avoid dumping it every time we back up the database, or when users migrate their database (for instance to PXC).

We thought of one potential solution which would catch some of these cases and might be less expensive. The drain script could look at the bosh link to get the IPs of the cluster, and call the galera-healthcheck API for the other nodes. If a node is unreachable, we would drain, assuming it was deleted. If a node reports a healthy status, we would drain. If a node reports unhealthy, we would fail the drain script.
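In drain-script terms, we are imagining something roughly like this (a sketch only; OTHER_NODE_IPS stands for the IPs taken from the bosh link, the galera-healthcheck port is assumed to be 9200, and curl reports 000 when it cannot connect at all):

    for node_ip in "${OTHER_NODE_IPS[@]}"; do
      status_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://${node_ip}:9200/" || true)
      if [[ "$status_code" -eq 000 || "$status_code" -eq 200 ]]; then
        continue   # unreachable (assume deleted) or healthy: keep draining
      fi
      echo "node ${node_ip} reports unhealthy (HTTP ${status_code}): failing drain"
      exit 1
    done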

This would catch some cases of an unhealthy cluster, but the drain script would not fail in a full network partition.
What do you think about that trade-off?

@CAFxX
Contributor

CAFxX commented Jun 12, 2018

@ctaymor for now it could work. I would suggest also double-checking a node's unreachability via ICMP over a period of time (i.e. only treating the host as gone if it does not respond to pings for e.g. 60 seconds -- this would obviously slow down the shrinking process, but I would argue that such an operation is rarely executed anyway). The rationale is that at least some IaaSes (e.g. vSphere) have provisions to immediately restart a missing VM, so pinging the VM for a period of time should hopefully catch the case in which the VM disappears (e.g. due to a host crash) right in the middle of a deploy, as well as the case in which the VM is running but mysql is not working.
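Something along these lines (just a sketch; the 60-second window and the polling interval are arbitrary examples):

    # return 0 only if the host stays silent for the whole window
    node_is_gone() {
      local ip="$1" deadline=$(( $(date +%s) + 60 ))
      while (( $(date +%s) < deadline )); do
        if ping -c 1 -W 1 "$ip" > /dev/null 2>&1; then
          return 1   # the VM answered: not deleted, just unhealthy
        fi
        sleep 5
      done
      return 0
    }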

Better long-term solutions would be either:

  • asking @cppforlife to include the number/list of old/new nodes of the instance group in the drain env vars. This way we would not need "heuristics" to figure out if the cluster is shrinking.
  • using consul to register and healthcheck the cluster nodes and to record which nodes are being removed from the cluster, and then using this information during drain (basically using consul instead of the drain table I mentioned above)

@ctaymor
Contributor

ctaymor commented Jun 14, 2018

@CAFxX I think you hit the nail on the head: we're trying to come up with a mediocre workaround, when the solution to enabling good drain scripts for clustered services should really be pushed down to the bosh level. I'd recommend opening an issue in BOSH.

Thanks for spending all this time working on trying to make the drain scripts better.

@CAFxX
Contributor

CAFxX commented Jun 14, 2018

@ctaymor ok, we'll improve the current good-enough-for-now drain implementation as discussed here, and in the background we'll open an issue against BOSH. Based on the Slack discussion, we should allow the operator to opt out of the ICMP protection (e.g. for the case in which ICMP is not allowed in the deployment networks).

@CAFxX
Contributor

CAFxX commented Jun 17, 2018

I also reached out to Dmitriy, and it seems there is already some activity happening at SAP around extending bosh for similar use cases:

https://github.com/cloudfoundry/bosh-notes/blob/master/proposals/drain-actions.md

@iamejboy
Contributor Author

Hi @ctaymor & @ndhanushkodi

Updated the PR with the cleanup plus what was discussed above. Feel free to re-test and amend it for any remaining errors you find.

Regards,

@CAFxX
Contributor

CAFxX commented Jul 6, 2018

@ctaymor @ndhanushkodi @bgandon have you had a chance to look at this?

@ndhanushkodi
Contributor

Hey @iamejboy @CAFxX,

We've added a story to our backlog to pull in some of these changes to fix the drain script issues, and we may change the implementation a little. Unfortunately, our team is stretched a little thin and this will have to be prioritized low for now, but we should be able to eventually get to it.

Thanks,
Nitya and Joseph (@jpalermo)

@CAFxX
Contributor

CAFxX commented Jul 6, 2018

@ndhanushkodi our production rollout of c2c is blocked on this (and on the other c2c bug @iamejboy reported a few days ago, though it seems a candidate fix is available for that now), so we were hoping it would happen sooner than "eventually". Just to organize work on our side, do you have a rough estimate of when you realistically expect to be able to work on this?

@ctaymor
Contributor

ctaymor commented Jul 17, 2018

Hi @CAFxX, we do want to go ahead and try to unblock you.
We're scheduling it to be worked on soon. https://www.pivotaltracker.com/story/show/157656403

  • Caroline

@jpalermo
Member

We updated the drain script to use the Galera Healthcheck endpoint rather than determining node health directly (since this is a job Galera Healthcheck already performs).

Should be merged into the develop branch now. Let us know if you have any questions or concerns.

@jpalermo
Member

Hey @CAFxX. Looks like our pipelines ran into a problem.

When deleting a deployment, BOSH does not respect the max-in-flight option. So rather than draining one node at a time, it drains all of them at the same time. With the new drain script behavior in place, this caused most of the delete-deployments we ran to fail.

You can get around this with a --force or a --skip-drain, but we don't think this is a reasonable requirement for deleting a healthy deployment.

We talked with the BOSH team to see if there is any way for us to differentiate scaling down a node vs deleting a deployment from the perspective of the drain script, but there is no way to do so. This is not the first time they've heard of this issue, and they are going to prioritize some work to resolve the parallel draining, but there is no guarantee it will get done soon.

We don't think there is a simple enough solution in cf-mysql-release to implement this behavior currently.

However, we did port the behavior over to our new pxc-release. This is intended to be a replacement for cf-mysql-release moving forward.

Because we've structured the jobs differently in pxc-release, and separated the galera-agent/galera_health_check job out with its own drain scripts, we think the drain script will work there without any problem.

@CAFxX
Contributor

CAFxX commented Jul 26, 2018

@jpalermo in my understanding the drain script is designed to aid graceful/zero-downtime lifecycles for bosh jobs and, at least from my perspective, it's hard to describe deleting a whole deployment as a "graceful" operation. If we agree on this, then using --force/--skip-drain in the pipelines should be OK (but I agree that the BOSH behavior is at fault here and should be changed, so much so that I already filed cloudfoundry-attic/bosh-notes#47 a while back).

However, we did port the behavior over to our new pxc-release. This is intended to be a replacement for cf-mysql-release moving forward.

We would like to avoid using pre-release/non-GA components in production or, as in the case of the current cf-mysql-release, components with known issues that can affect reliability.

@jpalermo
Member

@CAFxX if it were just our pipelines, we would probably consider requiring the --force/--skip-drain, but there are many automated deployments and tools out in production customer instances that depend on a vanilla bosh delete-deployment working.

Open source software is a bit strange, since it can be hard to determine what is pre-release and what is ready for production. I can tell you that our biggest deployments are now using pxc-release in production and Pivotal's latest version of Pivotal Cloud Foundry just shipped with pxc-release as the default database.

We have not yet reached the 1.0 release for pxc: https://www.pivotaltracker.com/story/show/157528771

At that point we plan to transition cf-deployment away from cf-mysql-release. The only thing holding up the 1.0 release at this point is that we want to change the bosh property syntax for creating users/databases; it has nothing to do with stability or performance.

@CAFxX
Contributor

CAFxX commented Jul 27, 2018

@jpalermo would you be open to leaving it behind an opt-in manifest property? We are also considering pxc-release, but we were really close to rolling this out, and moving to pxc would require redoing a significant amount of validation work, which would push back our c2c rollout by weeks.

@jpalermo
Member

I'll see what the mysql product managers think of that idea today.

@iamejboy
Contributor Author

iamejboy commented Aug 8, 2018

Hi @jpalermo @ctaymor @ndhanushkodi,

Thanks for adding this as an option.

By the way, can we remove the 000 here?

if [[ $status_code -eq 000 || $status_code -eq 200 ]]; then

Since we already say in the spec description that the "--skip-drain flag is required in order to delete a deployment", we could add "or scaling down" to it as well.

In this case, we would fail the drain if there is an unreachable node, a crashed node, or simply a monit-stopped node in the cluster. The drain script would then be more useful for making sure the cluster is healthy before shutting down the local node. When 000 is accepted, we assume the node is being scaled down, but that's not always the case: it could be any of the situations mentioned above.
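i.e. roughly (a sketch of the stricter check; variable names are illustrative):

    if [[ "$status_code" -eq 200 ]]; then
      continue   # node is healthy: keep checking the remaining nodes
    fi
    echo "node ${node_ip} is unreachable or unhealthy (HTTP ${status_code}): failing drain"
    exit 1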

cc: @CAFxX
