This repository has been archived by the owner on Oct 5, 2022. It is now read-only.

support 2 data centres HA with no central ZooKeeper cluster (or 2 single JVMs which can restart themselves with no fabric) #622

Open
jstrachan opened this issue Jan 30, 2014 · 10 comments


@jstrachan
Contributor

It'd be awesome to be able to support fabric8 when folks have exactly 2 data centres and either can go down; or when they have just 2 JVMs and want to implement a master/slave broker pair.

In either case there are only 2 things, either of which could fail. Since there are only 2, there's no chance of a quorum.

Another similar use case is the retail scenario, where the internet connection can be flaky, so joining a remote fabric on startup is not feasible.

However, in these cases the remote DC / container could start up its own fabric, then discover the master git repo (when a remote fabric is available) and push/pull to it - but still start up if there's no remote fabric.

This would mean that the single DC or single container would basically be a standalone, separate fabric; it'd just be able to share configurations across fabrics. So this idea is a little bit like store and forward with ActiveMQ: you could wire together 2 fabrics, so that when the remote fabric is available a fabric bridge could connect to its ZK and copy the relevant data into the local ZK cluster (using ephemeral nodes so the data goes away if the remote ZK fabric disappears).

An added benefit of this approach is that things can restart even when there's no ZK - provided they previously started up so that they have a git repo and the like; it's just that they can't sync any changes or discover any services on the remote fabric.

@jstrachan
Contributor Author

So the 2 main pieces to solve this would be:

  • allow 2 fabrics to elect leaders that bridge to each other and push/pull the git stuff, so that configuration applied on one fabric gets applied to the other (see the sketch after this list)
  • a ZK bridge for discovery purposes, so that containers in one fabric can see containers in another fabric. It might be that folks want to keep them quite separate really, other than from a management perspective (which is easy to do too; we'd just use the DC name as a directory in all ZK cluster folders), but we could then still do global management views and so forth
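As a minimal sketch of the git half, assuming JGit and a remote for the other fabric already configured on the local config repo (the remote name and repo path here are placeholders, not existing fabric8 code):

```java
import java.io.File;
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.api.PullResult;

public class ConfigBridge {
    /**
     * Pull the other fabric's profile changes and push ours back.
     * If the remote fabric is unreachable this throws; the caller just
     * retries later, so each fabric keeps working standalone meanwhile.
     */
    public static void sync(File localConfigRepo, String remoteName) throws Exception {
        try (Git git = Git.open(localConfigRepo)) {
            // Merge whatever the other fabric's elected leader has pushed.
            PullResult pull = git.pull()
                    .setRemote(remoteName)
                    .setRemoteBranchName("master")
                    .call();
            if (!pull.isSuccessful()) {
                throw new IllegalStateException("Manual merge needed for " + remoteName);
            }
            // Publish local profile edits so the other fabric can pull them.
            git.push().setRemote(remoteName).setPushAll().call();
        }
    }
}
```

Only the container that wins the leader election in each fabric would run a sync like this, so the two git masters talk to each other directly.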

@jstrachan
Contributor Author

The ZK bridging may wish to use this:
http://zookeeper.apache.org/doc/trunk/zookeeperObservers.html

though we need to make sure we don't mess up any of the existing semantics (e.g. the way the ZK cluster stuff works for master / slave and the like)
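For reference, wiring an observer into an ensemble only takes two small config changes per the linked doc (the hostnames below are placeholders):

```
# zoo.cfg on the observer host only
peerType=observer

# server list shared by every ensemble member, observer included
server.1=zk1.dc-a:2888:3888
server.2=zk2.dc-a:2888:3888
server.3=zk3.dc-a:2888:3888
server.4=zk1.dc-b:2888:3888:observer
```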

@jstrachan
Contributor Author

I was chatting on IM with @chirino recently; he made the excellent point that (paraphrasing badly here...)

Most folks have been dealing with the 2 data centre failover scenario for a while by making one a known master: the main data centre (so run 2 ZKs there). If the other DC fails, it's fine. If the master fails, it's a manual process to create a new ZK server so that the backup DC can take over.

This is easily achievable now really; i.e. use 3 ZK nodes and favour DC A, which has more ZKs, over DC B. Then B can fail and things just work.

If A fails, then it's a manual process to configure DC B to use just a single ZK instance (rather than assuming, say, 3 ZK servers including A's), once it's been manually established that A really is down and it's not a network split.
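A rough sketch of that manual step, with placeholder hostnames (2 voters in DC A, 1 in DC B):

```
# normal operation - same server list in every zoo.cfg, 2 of the 3 votes live in DC A
server.1=zk1.dc-a:2888:3888
server.2=zk2.dc-a:2888:3888
server.3=zk1.dc-b:2888:3888

# manual failover - only after a human has confirmed DC A is genuinely down
# (not a network split): remove the server.* lines from zk1.dc-b's zoo.cfg
# and restart it, so it runs standalone and no longer needs a quorum.
```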

@iocanel
Member

iocanel commented May 30, 2014

Quoting ZooKeeper documentation:

Forming a ZK ensemble between two datacenters is a problematic endeavour as the high variance in latency between the datacenters could lead to false positive failure detection and partitioning. However if the ensemble runs entirely in one datacenter, and the second datacenter runs only Observers, partitions aren't problematic as the ensemble remains connected. Clients of the Observers may still see and issue proposals.

The main reason for this is that EVERY ZooKeeper write operation requires a vote from the servers. So splitting the servers across different datacenters will dramatically impact performance.

So I guess that we should always have all ZooKeeper servers run in a single datacenter (the master one) and then have an observer running in each client datacenter.

Let's assume datacenters A and B, where datacenter A hosts the ensemble. If the 2 datacenters get disconnected, B will lose communication with ZooKeeper and no read/write operations will be possible there. A will continue to work properly.

What is our goal here?
How do we want datacenter B to work?
Do we want the 2 clusters to split and proceed like nothing ever happened?
If so, how would we "merge" the data back when the connection is restored?

@jstrachan
Contributor Author

When folks are happy with a single master data centre, that seems fine.

It seems common for folks to have 2 or 3 data centres where any data centre
can go down & there should be no single point of failure. The only real way
to do this with ZK seems to be having an ensemble per data centre. That's
fine & works today; but I can see folks wanting to combine the data centres
into some form of complete fabric from a management perspective.

Am thinking of things like:

  • configuration changes should be replicated across data centres (when they
    are all online); I guess that's a case of pushing changes from the
    git master in one DC to the others when there's a change. Not being able to
    replicate changes when DCs are disconnected is fine though, provided they
    sync up later when they are reconnected.
  • allow AMQ brokers to form store & forward networks across DCs (while
    connected)
  • view all containers in all DCs in the CLI and hawtio console

For the last 2 items I wondered if replicating some state between DCs into
each DC's ensemble might be a simple solution; where one DC elects a leader
and uses an observer on each other DC's ensemble and replicates the
relevant state to its local ensemble; so the existing code in fabric8
doesn't have to maintain N connections to different ensembles etc.

In the case of AMQ it'd be used to advertise brokers in remote DCs for network
bridges to use. For the container view, replicating container metadata across DCs
would be a simple fix.
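For the AMQ case, a broker in one DC could then point an ordinary static network connector at whatever broker address got advertised from the other DC (the broker hostname below is a placeholder):

```xml
<!-- activemq.xml fragment in DC A: store & forward bridge to a DC B broker -->
<networkConnectors>
  <networkConnector name="to-dc-b"
                    uri="static:(tcp://broker1.dc-b:61616)"
                    duplex="true"/>
</networkConnectors>
```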

Another option would be to just force all the relevant code in fabric8 to
maintain a ZK connection to all DCs; I just figured mirroring state in each
ensemble with ephemeral nodes would be the simplest thing to do.
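A rough sketch of that mirroring idea using Curator; the connect strings and znode paths are made up, and mounting the remote data under the DC name is just the convention suggested above, not existing fabric8 behaviour:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.PathChildrenCache;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.curator.utils.ZKPaths;
import org.apache.zookeeper.CreateMode;

public class DcMirror {
    public static void main(String[] args) throws Exception {
        // Connect to the remote DC (ideally via its observer) and to the local ensemble.
        CuratorFramework remote = CuratorFrameworkFactory.newClient(
                "zk-observer.dc-b:2181", new ExponentialBackoffRetry(1000, 3));
        CuratorFramework local = CuratorFrameworkFactory.newClient(
                "zk1.dc-a:2181", new ExponentialBackoffRetry(1000, 3));
        remote.start();
        local.start();

        String remotePath = "/fabric/registry/containers";      // hypothetical source path
        String localPath = "/fabric/registry/dc-b/containers";  // mirrored under the DC name

        // Watch the remote children and mirror them locally as EPHEMERAL nodes,
        // so the copies vanish automatically if this bridge (or the link) dies.
        PathChildrenCache cache = new PathChildrenCache(remote, remotePath, true);
        cache.getListenable().addListener((client, event) -> {
            if (event.getData() == null) {
                return; // connection state events carry no child data
            }
            String node = ZKPaths.getNodeFromPath(event.getData().getPath());
            String target = ZKPaths.makePath(localPath, node);
            switch (event.getType()) {
                case CHILD_ADDED:
                    local.create().creatingParentsIfNeeded()
                            .withMode(CreateMode.EPHEMERAL)
                            .forPath(target, event.getData().getData());
                    break;
                case CHILD_UPDATED:
                    local.setData().forPath(target, event.getData().getData());
                    break;
                case CHILD_REMOVED:
                    local.delete().forPath(target);
                    break;
                default:
                    break;
            }
        });
        cache.start();
        Thread.currentThread().join(); // keep the bridge running
    }
}
```

Because the mirrored nodes are ephemeral, losing the bridge (or the inter-DC link) simply makes the remote entries disappear from the local ensemble.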


@iocanel
Member

iocanel commented May 30, 2014

If we want to combine fabrics for purely management purposes, I guess we could have multiple isolated fabrics (1 per DC) that would talk to each other via some sort of bridge.

We could use this bridge to push / pull profile configuration between DCs and also publish the local DC's container attributes to the remote datacenter. As long as we don't try to perform inter-DC locking / leader election, it should work ok.

@jstrachan
Contributor Author

Agreed. Am mostly just thinking it's a way to do cross-DC configuration,
discovery & tooling really. Never for multi-DC leader election.

To help avoid folks accidentally doing multi-DC leader election stuff, we
should be careful about what gets replicated and where it's put. E.g. we maybe
need to add the DC name (or "local" for the current ensemble) to the paths used
when we replicate amq / camel / web / API endpoints; then it's easier to do
local-only or DC-aware discovery etc.
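For example (the paths below are made up, just to illustrate the convention):

```
/fabric/registry/clusters/amq/default/local/broker-1    registered by this DC
/fabric/registry/clusters/amq/default/dc-b/broker-7     mirrored from DC B (ephemeral)
```

Local-only discovery then just ignores everything outside its own DC directory.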


@brozow
Contributor

brozow commented Jul 12, 2014

This is a very interesting use case for me. I have been thinking about building a distributed monitoring and data collection system on top of fabric8 for a large retail customer. They want to have monitoring and data collection in each branch because the branch internet can be flaky (for example, it can go down when the store closes). My objective was to be able to roll out new services into new containers in the branches via fabric8, but I was uncertain how ZooKeeper handled the flaky internet. Is it possible to solve this scenario today?

If not, how could my team and I get started working toward making this work?

If it helps to make anything simpler, I suspect there would only be configuration changes initiated in the data center and pushed to the stores as they connect and retrieve them.

@davsclaus
Member

I wonder if this is on our table for v2 when we stand on top of kube / OS3?

@jstrachan
Contributor Author

It's looking like Kubernetes is going to use "ubernetes" to solve the multi-data centre problem:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/proposals/federation.md
