-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove restriction in overwriting worker info within join logic #18275
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work! I left a comment on a corner case that might take a bit of discussion. I feel we may want to add a bit more in-line comments on that situation?
LOG.warn("Same worker entity found bearing same workerid," | ||
+ "maybe benign if pod restart in k8s env or same worker" | ||
+ " scheduled to restart on another machine in baremetal env."); | ||
mAlluxioEtcdClient.createForPath(pathOnRing, Optional.of(serializedEntity)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you now observe a new worker with the same identity, would there be an old worker with the same id on the ring? If so then that obsolete one needs to be removed? Hmmmm but what about the corner case that the admin incorrectly used the same ID on multiple workers, like copying the config blindly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it will override, for enterprise customer normally we should always do automated scripts such as ansible to run deployment instead of letting customer do stuff like copy properties. typical adding-node or new-deployment should always do a wipe-out clean to clean system info files such as worker_identity file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: log the old one so we could tell from the logs if something is unexpected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -93,11 +92,14 @@ public void join(WorkerInfo workerInfo) throws IOException { | |||
byte[] serializedEntity = entity.serialize(); | |||
// If there's existing entry, check if it's me. | |||
if (existingEntityBytes != null) { | |||
// It's not me, something is wrong. | |||
// It's not me, or not the same me. | |||
if (!Arrays.equals(existingEntityBytes, serializedEntity)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why compare by the serialized form? The WorkerServiceEntity
implements equals
and ignores comparing WorkerNetAddress
deliberately. If compared by the serialized form, the netaddress would be taken into account, which I find confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in bare metal cases, if a worker restart none of the fields in workerserviceentity will get changed, and the content should be the same, so we don't need to insert the kv pair again to etcd. if not , then thats a case of the worker changing hostname like in k8s, hence comparing the serialized form, just to say is it the same me on etcd? or its a different me, and im about to overwriting the content
"Some other member with same id registered on the ring, bail."); | ||
// In k8s this might be bcos worker pod restarting with the same worker identity | ||
// but certain fields such as hostname has been changed. Register to ring path anyway. | ||
LOG.warn("Same worker entity found bearing same workerid," |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the serialized forms don't compare equal, it doesn't mean the two have different worker IDs. Could be different network address, or anything ephemeral in WorkerServiceEntity
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah exactly, we found the key already exists, so now we have to overwrite it with a different WorkerServiceEntity content, before it doesnt allow doing so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I see what you are trying to achieve. If the serialized forms are equal, then you don't have to overwrite anything. Makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks
2fec0ec
to
b597c61
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the work!
LOG.warn("Same worker entity found bearing same workerid," | ||
+ "maybe benign if pod restart in k8s env or same worker" | ||
+ " scheduled to restart on another machine in baremetal env."); | ||
mAlluxioEtcdClient.createForPath(pathOnRing, Optional.of(serializedEntity)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: log the old one so we could tell from the logs if something is unexpected
Automated checks report:
Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks. |
d7bf480
to
c5d1162
Compare
Automated checks report:
All checks passed! |
alluxio-bot, merge this please. |
### What changes are proposed in this pull request? Now with worker id can be assumed from a different worker instance whether on a different pod in k8s or a different host machine for baremetal. The creation onto the persisted ring path : /DHT/DefaultAlluxioCluster/AUTHORIZED/ should not bail if a different value is seen. ### Why are the changes needed? to enable rejoin of a worker bearing same worker id but with different host or other WorkerInfo fileds. ### Does this PR introduce any user facing changes? No pr-link: Alluxio#18275 change-id: cid-51322e010e0d51ae4f81268c2bb607b568f08c46
What changes are proposed in this pull request?
Now with worker id can be assumed from a different worker instance whether on a different pod in k8s or a different host machine for baremetal. The creation onto the persisted ring path : /DHT/DefaultAlluxioCluster/AUTHORIZED/ should not bail if a different value is seen.
Why are the changes needed?
to enable rejoin of a worker bearing same worker id but with different host or other WorkerInfo fileds.
Does this PR introduce any user facing changes?
No