Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove restriction in overwriting worker info within join logic #18275

Merged
merged 3 commits into from
Oct 23, 2023

Conversation

lucyge2022
Copy link
Contributor

What changes are proposed in this pull request?

Now with worker id can be assumed from a different worker instance whether on a different pod in k8s or a different host machine for baremetal. The creation onto the persisted ring path : /DHT/DefaultAlluxioCluster/AUTHORIZED/ should not bail if a different value is seen.

Why are the changes needed?

to enable rejoin of a worker bearing same worker id but with different host or other WorkerInfo fileds.

Does this PR introduce any user facing changes?

No

Copy link
Contributor

@jiacheliu3 jiacheliu3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work! I left a comment on a corner case that might take a bit of discussion. I feel we may want to add a bit more in-line comments on that situation?

LOG.warn("Same worker entity found bearing same workerid,"
+ "maybe benign if pod restart in k8s env or same worker"
+ " scheduled to restart on another machine in baremetal env.");
mAlluxioEtcdClient.createForPath(pathOnRing, Optional.of(serializedEntity));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you now observe a new worker with the same identity, would there be an old worker with the same id on the ring? If so then that obsolete one needs to be removed? Hmmmm but what about the corner case that the admin incorrectly used the same ID on multiple workers, like copying the config blindly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it will override, for enterprise customer normally we should always do automated scripts such as ansible to run deployment instead of letting customer do stuff like copy properties. typical adding-node or new-deployment should always do a wipe-out clean to clean system info files such as worker_identity file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: log the old one so we could tell from the logs if something is unexpected

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -93,11 +92,14 @@ public void join(WorkerInfo workerInfo) throws IOException {
byte[] serializedEntity = entity.serialize();
// If there's existing entry, check if it's me.
if (existingEntityBytes != null) {
// It's not me, something is wrong.
// It's not me, or not the same me.
if (!Arrays.equals(existingEntityBytes, serializedEntity)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why compare by the serialized form? The WorkerServiceEntity implements equals and ignores comparing WorkerNetAddress deliberately. If compared by the serialized form, the netaddress would be taken into account, which I find confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in bare metal cases, if a worker restart none of the fields in workerserviceentity will get changed, and the content should be the same, so we don't need to insert the kv pair again to etcd. if not , then thats a case of the worker changing hostname like in k8s, hence comparing the serialized form, just to say is it the same me on etcd? or its a different me, and im about to overwriting the content

"Some other member with same id registered on the ring, bail.");
// In k8s this might be bcos worker pod restarting with the same worker identity
// but certain fields such as hostname has been changed. Register to ring path anyway.
LOG.warn("Same worker entity found bearing same workerid,"
Copy link
Contributor

@dbw9580 dbw9580 Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the serialized forms don't compare equal, it doesn't mean the two have different worker IDs. Could be different network address, or anything ephemeral in WorkerServiceEntity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah exactly, we found the key already exists, so now we have to overwrite it with a different WorkerServiceEntity content, before it doesnt allow doing so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I see what you are trying to achieve. If the serialized forms are equal, then you don't have to overwrite anything. Makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

Copy link
Contributor

@dbw9580 dbw9580 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@jiacheliu3 jiacheliu3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the work!

LOG.warn("Same worker entity found bearing same workerid,"
+ "maybe benign if pod restart in k8s env or same worker"
+ " scheduled to restart on another machine in baremetal env.");
mAlluxioEtcdClient.createForPath(pathOnRing, Optional.of(serializedEntity));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: log the old one so we could tell from the logs if something is unexpected

@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: FAIL
    • The title of the PR does not pass all the checks. Please fix the following issues:
      • First word of title ("Restart") is not an imperative verb. Please use one of the valid words
  • Commits associated with Github account: PASS

Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks.

@lucyge2022 lucyge2022 changed the title Restart join logic Remove restriction in overwriting worker info within join logic Oct 23, 2023
@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: PASS
  • Commits associated with Github account: PASS

All checks passed!

@lucyge2022 lucyge2022 added the type-feature This issue is a feature request label Oct 23, 2023
@lucyge2022
Copy link
Contributor Author

alluxio-bot, merge this please.

@alluxio-bot alluxio-bot merged commit 0884fc3 into Alluxio:main Oct 23, 2023
14 checks passed
ssz1997 pushed a commit to ssz1997/alluxio that referenced this pull request Dec 15, 2023
### What changes are proposed in this pull request?

Now with worker id can be assumed from a different worker instance whether on a different pod in k8s or a different host machine for baremetal. The creation onto the persisted ring path : /DHT/DefaultAlluxioCluster/AUTHORIZED/ should not bail if a different value is seen.

### Why are the changes needed?

to enable rejoin of a worker bearing same worker id but with different host or other WorkerInfo fileds.

### Does this PR introduce any user facing changes?

No

			pr-link: Alluxio#18275
			change-id: cid-51322e010e0d51ae4f81268c2bb607b568f08c46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-feature This issue is a feature request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants