Side by Side Upgrade
In this wiki we describe how to upgrade a Xenon based service using a side-by-side approach. Side-by-side in this case means that we will install the newer version of the service on a separate set of nodes and migrate all necessary data from the old node cluster to the new node cluster.
For the side-by-side upgrade to work your project will need to meet several requirements:
- You need additional resources to set up a second node cluster.
- The new node cluster needs to be able to connect to the old node cluster.
- The new node cluster should be fully formed and stable. It is recommended to set the quorum to the node group size during migration so that node failures are not masked during the upgrade and instead cause the migration task to fail (it can always be retried); see the sketch after this list.
- There might need to be a small time window during which requests to the service are queued (e.g. at an edge device) or fail (the service appears offline).
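Here is a minimal sketch of raising the quorum to the node group size before migration. It assumes the standard `NodeGroupService.UpdateQuorumRequest` and that you have a `ServiceHost` reference available to send the request from; the node count is a placeholder:

```java
import com.vmware.xenon.common.Operation;
import com.vmware.xenon.common.ServiceHost;
import com.vmware.xenon.common.UriUtils;
import com.vmware.xenon.services.common.NodeGroupService.UpdateQuorumRequest;
import com.vmware.xenon.services.common.ServiceUriPaths;

public class QuorumHelper {
    /**
     * Raise the membership quorum to the node group size so that a node
     * failure during migration surfaces as a task failure instead of being
     * masked (the migration can always be retried).
     */
    public static void setQuorumToGroupSize(ServiceHost host, int nodeGroupSize) {
        UpdateQuorumRequest request = UpdateQuorumRequest.create(true)
                .setMembershipQuorum(nodeGroupSize);

        Operation patch = Operation
                .createPatch(UriUtils.buildUri(host, ServiceUriPaths.DEFAULT_NODE_GROUP))
                .setBody(request)
                .setReferer(host.getUri());
        host.sendRequest(patch);
    }
}
```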
Besides the aforementioned requirements there is also a set of assumptions that we are making:
- Only data/state needs to be migrated between the old and new node cluster.
- The system needs to be tolerant to temporarily inconsistent data.
- On the old node cluster active tasks will finish during the maintenance window.
- No indirect interaction between old and new node cluster through third party entities, e.g. external data sources or hardware.
- Your clients will be able to connect to the new node cluster, e.g. you use an external load balancer and point it to the new node cluster.
If your cluster has AuthN/AuthZ services enabled, then in order to seamlessly access both node groups the same user (same `documentSelfLink`) needs to exist in each of them.
For example, if the example@vmware.com user exists in the old node group with the `documentSelfLink` `/core/authz/user-groups/example@vmware.com`, then a user with the same `documentSelfLink` `/core/authz/user-groups/example@vmware.com` needs to exist in the new node group.
As long as the user `documentSelfLink`s are the same, an auth token for the user works against both node groups.
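As a hedged illustration (not part of the original walkthrough), the sketch below posts a user document with a pinned `documentSelfLink` to both node groups. It assumes user documents live under `ServiceUriPaths.CORE_AUTHZ_USERS` and that a `ServiceHost` is available to send the requests; adapt the factory link to wherever your user documents actually live.

```java
import java.net.URI;

import com.vmware.xenon.common.Operation;
import com.vmware.xenon.common.ServiceHost;
import com.vmware.xenon.common.UriUtils;
import com.vmware.xenon.services.common.ServiceUriPaths;
import com.vmware.xenon.services.common.UserService;

public class UserMirroringHelper {
    /**
     * Hypothetical helper: create the same user (same documentSelfLink) in the
     * old and the new node groups so auth tokens remain valid against both.
     */
    public static void mirrorUser(ServiceHost host, String email, URI... clusterBaseUris) {
        UserService.UserState user = new UserService.UserState();
        user.email = email;
        // Pinning the self link ensures the document link is identical in both groups.
        user.documentSelfLink = email;

        for (URI base : clusterBaseUris) {
            Operation post = Operation
                    .createPost(UriUtils.buildUri(base, ServiceUriPaths.CORE_AUTHZ_USERS))
                    .setBody(user)
                    .setReferer(host.getUri());
            host.sendRequest(post);
        }
    }
}
```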
The side-by-side upgrade will proceed in the following steps:
- Install the new service on a separate set of hardware.
- Enter the maintenance window.
- Do the data migration.
- Enable the new service.
- Exit maintenance window.
When entering the maintenance window, users should not be able to start new tasks on the old or new node cluster. Otherwise the data migration might not be complete, since the underlying Lucene query only returns results available before the query was issued.
The data migration step starts one migration service for each data entity type that needs to be migrated. The migration service queries the old node cluster for entities and posts them to the new node cluster.
In case your data model changed, e.g. new fields in the newer version of the entity need to be filled in during migration, you can supply a transformation service that the MigrationService will call in order to transform each entity before posting it to the destination.
This allows you to do simple transformations like field renamings or filling in missing values, as well as splitting and merging objects. In order to merge objects, the transformation service will need to call into the old service to retrieve the necessary objects.
In case the amount of data is larger than you can possibly migrate within your maintenance window, you will need to start the migration before entering the maintenance window. To support continuous data migration, the migration service can be supplied with a timestamp, which it uses to query for all documents updated after that timestamp. The migration service also returns the latest document update time it saw while paging through all entities retrieved from the old service.
After the data migration is completed, we need to make the service available to all users. How to do this depends on how you currently make the old service available. If you are using an external load balancer, you can point that load balancer to the new service.
Now you can allow users to reach the new service.
This walkthrough provides a hands-on example showcasing how Xenon supports a live, side-by-side upgrade of a service. For this walkthrough, the existing (old) Xenon service is running on what we'll call the "blue" node cluster; the new version of the Xenon service will be on a separate "green" node cluster.
This walkthrough focuses solely on the Xenon components of a live upgrade; it will not go into specifics relating to a load balancer or API gateway.
The MigrationTaskService provides built-in support to migrate state from the "blue" node cluster to the new "green" node cluster.
Assuming you've cloned the `xenon` repo locally, modify `DecentralizedControlPlaneHost.java`'s `start()` method to also start the `MigrationTaskService`. This is done by adding the following line after the `UiService` is started.
// Start migration service
super.startFactory(new MigrationTaskService());
// Don't forget to add the import!
// import com.vmware.xenon.services.common.MigrationTaskService;
Next, build Xenon locally using `mvn clean install -DskipTests`. Then from the root `xenon` directory, start up our "blue" node cluster (which will only consist of a single Xenon node for this walkthrough):
java -jar xenon-host/target/xenon-host-*-jar-with-dependencies.jar \
--port=8000 --id=blueNodeAtPort8000 \
--adminPassword=changeme \
--sandbox=xenon-host/target/xenonSandboxBlue
In a separate terminal window, start up the "green" node cluster in a similar manner (but on a different port).
java -jar xenon-host/target/xenon-host-*-jar-with-dependencies.jar \
--port=8001 --id=greenNodeAtPort8001 \
--adminPassword=changeme \
--sandbox=xenon-host/target/xenonSandboxGreen
Please refer to the Starting Xenon Host page for details on other available command-line arguments.
NOTE: In our example, the ExampleService implementation is exactly the same between both "blue" and "green" deployments. In reality, these implementations would be different for an upgrade, but this walkthrough still shows how state from a previous node cluster can be migrated to a new node cluster.
Also, we only have a single node in each "cluster" ... but this is intentional to keep the walkthrough straightforward. The same steps apply for a true "cluster" of nodes (given that the "blue" and "green" node clusters are separate from each other).
Let's put some test data into the "blue" node cluster. We'll use Apache's `ab` benchmarking utility to send HTTP POSTs to create 1,000 example service instances, but feel free to use whatever tool you want.
echo '{"name": "example-1", "counter": 1}' > /tmp/xenonExampleService.json
ab -p /tmp/xenonExampleService.json -T "application/json" -c 5 -n 1000 http://localhost:8000/core/examples
At this point, the "blue" node has 1,000 example service instances; the "green" node has zero. You can verify this:
# 8000 --> blue, has 1000 instances ; 8001 --> green, has 0 instances
curl localhost:8000/core/examples
curl localhost:8001/core/examples
For this walkthrough, we only wish to migrate `ExampleService` state to the "green" cluster. As mentioned earlier, you'll need to create a `MigrationTaskService` instance for each service factory you wish to migrate.
Save the following `MigrationTaskService` details into a file located at `/tmp/xenonMigrateExampleService.json`:
{
    "sourceNodeGroupReference": "http://localhost:8000/core/node-groups/default",
    "destinationNodeGroupReference": "http://localhost:8001/core/node-groups/default",
    "sourceFactoryLink": "/core/examples",
    "destinationFactoryLink": "/core/examples",
    "continuousMigration": false
}
NOTE: `continuousMigration` is an optional field that defaults to `false`. If you set it to `true`, an ongoing migration task will be used (which defaults to firing every minute). See MigrationTaskService for more details.
Fire off an HTTP POST to create the migration task using your favorite HTTP client.
curl -H "Content-Type: application/json" --data @/tmp/xenonMigrateExampleService.json http://localhost:8001/management/migration-tasks
The response should supply a `documentSelfLink` where you can see the status of the migration. I used that value and confirmed the migration finished by examining the response of `curl localhost:8001/management/migration-tasks/26167d31-63df-4dfc-a604-5774348d25f5`.
Now, you can also see that all the `ExampleService` state was migrated to the "green" cluster.
curl localhost:8001/core/examples
It's important to note that the `MigrationTaskService` can compare a ServiceDocument's `documentUpdateTimeMicros` and only migrate it if it hasn't been migrated yet. This means you can kick off a new `MigrationTaskService` multiple times without worrying that duplicates will be added to the "green" cluster.
You're done! All the state from the "blue" cluster has successfully been migrated to the "green" cluster.
The `MigrationTaskService` also supports changes to a service's state (aka `ServiceDocument`) between versions. This comes in handy if the "green" version of your service introduces new fields (or renames fields) that need to be properly initialized from the old version's state.
The `MigrationTaskService` accepts a stateless `transformationServiceLink` that takes the "blue" (or old) state as input and transforms it to the expected "green" (or new) state when migrating.
In this walkthrough, we did not include a `transformationServiceLink` in our POST body, so there was no transformation... but this functionality is available if you need it.
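If you do need it, a transformation service is just a stateless Xenon service. The sketch below is a hypothetical example, assuming the `TransformRequest`/`TransformResponse` contract that `MigrationTaskService` uses when the `USE_TRANSFORM_REQUEST` migration option is set (check the `MigrationTaskService` in your Xenon version for the exact contract); the counter initialization is purely illustrative.

```java
import java.util.HashMap;

import com.vmware.xenon.common.Operation;
import com.vmware.xenon.common.StatelessService;
import com.vmware.xenon.common.Utils;
import com.vmware.xenon.services.common.ExampleService.ExampleServiceState;
import com.vmware.xenon.services.common.MigrationTaskService.TransformRequest;
import com.vmware.xenon.services.common.MigrationTaskService.TransformResponse;

/**
 * Hypothetical transformation service: adjusts the old ("blue") state before
 * the migration task writes it to the "green" destination factory.
 */
public class ExampleTransformationService extends StatelessService {
    public static final String SELF_LINK = "/example-transformations";

    @Override
    public void handlePost(Operation post) {
        TransformRequest request = post.getBody(TransformRequest.class);
        ExampleServiceState state = Utils.fromJson(request.originalDocument, ExampleServiceState.class);

        // Illustrative transformation: make sure a field the new version relies on is initialized.
        if (state.counter == null) {
            state.counter = 0L;
        }

        // Map the transformed document (as JSON) to the destination factory link.
        TransformResponse response = new TransformResponse();
        response.destinationLinks = new HashMap<>();
        response.destinationLinks.put(Utils.toJson(state), request.destinationLink);

        post.setBody(response).complete();
    }
}
```

Such a service would be started on the destination host and referenced via `transformationServiceLink` in the migration task body.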
If you don't want to migrate all state to your "green" cluster, you can also provide a `querySpec` for your `MigrationTaskService` instance that defines precisely the state that should be migrated.
If the migration task is migrating the `/core/examples` factory, the default `querySpec` used (if not provided) would be:
"querySpec": {
"query": {
"occurance": "MUST_OCCUR",
"booleanClauses": [
{
"occurance": "MUST_OCCUR",
"term": {
"propertyName": "documentSelfLink",
"matchValue": "/core/examples/*",
"matchType": "WILDCARD"
}
} ]
},
"resultLimit": 500,
"options": [
"EXPAND_CONTENT"
]
}
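If you only want a subset of documents, you can narrow this spec. The following is a hedged Java sketch (the "name starts with example-1" filter is a made-up criterion for illustration) that builds such a `querySpec` and attaches it to the migration task state:

```java
import java.util.EnumSet;

import com.vmware.xenon.common.ServiceDocument;
import com.vmware.xenon.services.common.MigrationTaskService;
import com.vmware.xenon.services.common.QueryTask;

public class MigrationQuerySpecExample {
    public static MigrationTaskService.State buildFilteredMigration() {
        QueryTask.QuerySpecification spec = new QueryTask.QuerySpecification();
        spec.query = QueryTask.Query.Builder.create()
                // Only documents under the /core/examples factory ...
                .addFieldClause(ServiceDocument.FIELD_NAME_SELF_LINK,
                        "/core/examples/*", QueryTask.QueryTerm.MatchType.WILDCARD)
                // ... whose (hypothetical) name starts with "example-1".
                .addFieldClause("name", "example-1*", QueryTask.QueryTerm.MatchType.WILDCARD)
                .build();
        spec.resultLimit = 500;
        spec.options = EnumSet.of(QueryTask.QuerySpecification.QueryOption.EXPAND_CONTENT);

        MigrationTaskService.State migration = new MigrationTaskService.State();
        migration.querySpec = spec;
        // sourceNodeGroupReference, destinationNodeGroupReference and the
        // factory links would be filled in as in the JSON example above.
        return migration;
    }
}
```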
More details on Xenon queries can be found in the Xenon wiki's query documentation.
Admittedly, there is a bit of hand-waving here relating to a truly live upgrade with zero downtime, especially when it comes to how/when to configure an external load balancer or API gateway to ensure that your service's state is both consistent and available.
One method of ensuring a "live upgrade" might consist of:
- Fire off a migration task for each service factory you want to migrate
- Monitor the migration tasks. Once they are finished (or close to finished), configure your load balancer to queue all requests
- Perform a final migration and wait until all tasks have finished
- Configure the load balancer to route traffic to the "green" node cluster
Even if the "blue" cluster contains a lot of data, the migration is done while the "blue" cluster is still live and responding to clients. Once most of the state is migrated, the maintanence window (where the load balancer is queuing requests) will be so small that clients will barely notice.
The migration task retrieves each document from its owner node by comparing `documentOwnerId` and its `hostId`.
For example, suppose there are two hosts, `host-1` and `host-2`, and two documents, `doc-a` and `doc-b`. `host-1` is the owner of `doc-a` and `host-2` is the owner of `doc-b`. Each node locally stores both `doc-a` and `doc-b` since xenon synchronizes documents.
When the migration task starts, it issues a `LocalQueryTask` to retrieve documents on each node. For `host-1`, it will initially fetch `doc-a` and `doc-b`, then compare `documentOwnerId` and filter out `doc-b` because `doc-b`'s owner is `host-2`. This behavior ensures that the authoritative document (the document on the owner node) is migrated to the new environment.
If data is restored from some other host with a different hostId (e.g. `host-old`) and the document has never changed on the new node, the document still keeps the old hostId as its document owner.
For example, `doc-old` with owner `host-old` exists on both `host-1` and `host-2` because data was restored from `host-old` earlier. When migration happens, `doc-old` will be filtered out on both `host-1` and `host-2` due to the document owner mismatch; therefore, `doc-old` will NOT be migrated to the new environment.
To prevent the migration from skipping documents due to owner mismatch, we recommend using the same hostId on the new node when restoring data.
When `MigrationOption#ALL_VERSIONS` is specified in the migration request, the task attempts to migrate all historical documents (old document versions) by performing the same operations in order. (E.g. when the source has 3 versions of a document, v0=POST, v1=PATCH, v2=PUT, the migration task will attempt POST, PATCH, and PUT operations with the versioned documents as input.)
The migrated documents may not end up with the same version numbers as in the source, but the order of the history will be maintained.
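For reference, here is a hedged Java sketch of requesting this behavior, assuming the `migrationOptions` field on `MigrationTaskService.State` (a JSON request body would carry the equivalent option):

```java
import java.util.EnumSet;

import com.vmware.xenon.services.common.MigrationTaskService;
import com.vmware.xenon.services.common.MigrationTaskService.MigrationOption;

public class AllVersionsMigrationExample {
    public static MigrationTaskService.State buildAllVersionsMigration() {
        MigrationTaskService.State state = new MigrationTaskService.State();
        // Replay every historical version in order, not just the latest document.
        state.migrationOptions = EnumSet.of(MigrationOption.ALL_VERSIONS);
        // Node group references and factory links would be set as in the earlier example.
        return state;
    }
}
```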
NOTE: When migrating a document with a DELETE in its history, the destination will only have the history after the delete. This is due to the DELETE change in xenon 1.3.7+, where DELETE now purges past versions. In prior versions, a POST with `PRAGMA_DIRECTIVE_FORCE_INDEX_UPDATE` after a DELETE added a new version on top of the existing history.
When migrating a large number of documents, the following items may need to be considered in order to improve performance.
The `documentExpirationTimeMicros` of the migration request (`MigrationTaskService.State`) is used as the lifetime of the migration task. When migrating a large amount of data, the timeout should be increased to avoid the task failing due to timeout. This value is also used for the expiration of the `QueryTask`s the migration issues to retrieve documents.
The migration task generates a lot of POST requests to the destination system. To avoid overwhelming it with traffic, the number of concurrently running migration tasks should be controlled.
When fetching data from the source hosts, increasing `querySpec.resultLimit` reduces the number of pages the data-fetch query needs to go through.
Document count estimation is disabled by default and is controlled by the `ESTIMATE_COUNT` migration request option. Since a count query is an expensive operation in xenon, it should be kept disabled when migrating a large dataset.
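Putting these considerations together, here is a hedged Java sketch of a migration task state tuned for a large dataset; the two-hour expiration and the 1000 result limit are arbitrary example values:

```java
import java.util.concurrent.TimeUnit;

import com.vmware.xenon.common.Utils;
import com.vmware.xenon.services.common.MigrationTaskService;
import com.vmware.xenon.services.common.QueryTask;

public class LargeMigrationTuningExample {
    public static MigrationTaskService.State buildLargeMigration() {
        MigrationTaskService.State state = new MigrationTaskService.State();

        // Give the task (and the query tasks it issues) enough time to finish.
        state.documentExpirationTimeMicros =
                Utils.fromNowMicrosUtc(TimeUnit.HOURS.toMicros(2));

        // Fetch larger pages from the source so fewer pages need to be walked.
        state.querySpec = new QueryTask.QuerySpecification();
        state.querySpec.resultLimit = 1000;
        // Build state.querySpec.query as shown in the default querySpec example above.

        // Leave ESTIMATE_COUNT off (the default): count queries are expensive in xenon.
        return state;
    }
}
```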