Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add http server port in the worker net address #18499

Merged
merged 6 commits into from
Jan 31, 2024

Conversation

ChunxuTang
Copy link
Member

@ChunxuTang ChunxuTang commented Jan 25, 2024

What changes are proposed in this pull request?

This PR enables registering the worker's HTTP server's port in the etcd. This helps to find worker's restful APIs from the Python client.

Why are the changes needed?

Alluxio Python client (e.g. in ML use cases) needs to connect to the worker's REST APIs. But as the http server port isn't included in the worker's information in the etcd, the client fails to find the API endpoint.

Does this PR introduce any user facing changes?

No.

@ChunxuTang ChunxuTang changed the title [WIP- Don't review] Add http server port in the worker net address Add http server port in the worker net address Jan 27, 2024
@ChunxuTang
Copy link
Member Author

Hi @lucyge2022 @jja725 @dbw9580 @ssyssy , could you take a review of this etcd registration change?

@ChunxuTang
Copy link
Member Author

Verified the backward compatibility by
start worker with old version -> stop worker -> start new version of worker -> see if it works
Cc @lucyge2022

Comment on lines 60 to 61
@com.google.gson.annotations.SerializedName("HttpServerPort")
private int mHttpServerPort;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This port is for the restful API server on the worker, right? There is already a "web port" which runs the web UI. Since both the UI and the restful API runs over HTTP, and possibly more services that run over HTTP will be added, can we be more specific that this port is used for the restful API exclusively?

Suggested change
@com.google.gson.annotations.SerializedName("HttpServerPort")
private int mHttpServerPort;
@com.google.gson.annotations.SerializedName("RestApiPort")
private int mRestApiPort;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HttpServerPort is used in the PropertyKey as the config for the restful API. The name is WORKER_HTTP_SERVER_PORT. If we change the mHttpServerPort to mRestApiPort, it will cause naming inconsistencies since other WorkerNetAddress variables (e.g. mWebPort, mDataPort) follow the same names as configs in the PropertyKey.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest keeping the name for now. If we want to change the name in the future, we'll need to change both this one and the property key (and probably any other places using HTTP server port...).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u also add some comment here, saying this is a "optional"-like field which will not be included in equalness comparison

Comment on lines 257 to 258
// Skip the comparison of mHttpServerPort for backward compatibility
// && mHttpServerPort == that.mHttpServerPort;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's incompatible with comparing the restful api port for equality?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also update the hashCode implementation, or leave a comment explaining why mHttpServerPort is excluded from computing the hashcode.

Copy link
Member Author

@ChunxuTang ChunxuTang Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incompatibility.

As @lucyge2022 mentioned, the previous etcd-registered worker information doesn't have HTTP server ports. 1) When we do the comparison, the lack of HTTP ports will cause an issue. 2) It will also influence worker registration if we just add an HTTP port for an old worker (I added a new test to cover this scenario).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hashCode.

Updated with a comment on the exclusion of mHttpServerPort.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we do the comparison, the lack of HTTP ports will cause an issue

When checking equality, WorkerServiceEntity explicitly ignores all ephemeral information about a worker such as network addresses, and only compares the worker identity. This is why WorkerIdentity exists. Therefore, a different HTTP port won't cause issue, as any other existing network addresses like hostname and grpc port.

+ "clean local worker identity settings to continue.",
existingEntity.get().getWorkerNetAddress().toString(),
existingEntity.get().getIdentity()));
if (existingEntity.get().equals(entity)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some comment below, like
// same entity, update with new information.

WorkerInfo wkr2 = new WorkerInfo()
.setIdentity(workerIdentity1)
.setAddress(new WorkerNetAddress()
.setHost("worker2").setContainerHost("containerhostname2")
.setRpcPort(2000).setDataPort(2001).setWebPort(2011)
.setDomainSocketPath("/var/lib/domain.sock"));
.setDomainSocketPath("/var/lib/domain.sock").setHttpServerPort(2021));
membershipManager.join(wkr1);
// bring wrk1 down and join wrk2 with a same worker identity.
membershipManager.stopHeartBeat(wkr1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add a test to have a worker join without http port set and afterwards join with http port set , it should be allowed to join and the list of workers should be unchanged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test testWorkerHttpServerPorts. Could you take a review?

@@ -233,6 +254,8 @@ public boolean equals(Object o) {
&& mDataPort == that.mDataPort
&& mWebPort == that.mWebPort
&& mDomainSocketPath.equals(that.mDomainSocketPath);
// Skip the comparison of mHttpServerPort for backward compatibility
// && mHttpServerPort == that.mHttpServerPort;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u remove line 258?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@lucyge2022
Copy link
Contributor

Verified the backward compatibility by start worker with old version -> stop worker -> start new version of worker -> see if it works Cc @lucyge2022

cool perfect thanks!

Comment on lines 108 to 112
// only the service name and the identity are the factors that define
// what a WorkerServiceEntity is. Any other is either ephemeral or supplementary data.
return mIdentity.equals(anotherO.mIdentity)
&& getServiceEntityName().equals(anotherO.getServiceEntityName());
&& getServiceEntityName().equals(anotherO.getServiceEntityName())
&& mAddress.equals(anotherO.getWorkerNetAddress());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment explicitly states that WorkerServiceEntity ignores worker address for equality comparison. If a worker with the same ID restarts and the address changes, it will leave two different entries on etcd and causes the clients to think there are multiple workers, but there are actually only one worker.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bowen, actually when collision happens, in etcd comparison it directly compares the value bytes (EtcdMembershipManager.java line 134) and under that situation we evaluate if there's a need to update, and then we use equals here, if equals we know its the same worker and then update with new info, this is to introduce backward compatibility after adding a field "httpport" into the info to persist on etcd. Can u check if this equals is used somewhere else apart from etcdmembership join that will cause other problems?

Comment on lines 257 to 258
// Skip the comparison of mHttpServerPort for backward compatibility
// && mHttpServerPort == that.mHttpServerPort;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we do the comparison, the lack of HTTP ports will cause an issue

When checking equality, WorkerServiceEntity explicitly ignores all ephemeral information about a worker such as network addresses, and only compares the worker identity. This is why WorkerIdentity exists. Therefore, a different HTTP port won't cause issue, as any other existing network addresses like hostname and grpc port.

public void testWorkerHttpServerPorts() throws Exception {
final MembershipManager membershipManager = getHealthyEtcdMemberMgr();
Configuration.set(PropertyKey.WORKER_MEMBERSHIP_MANAGER_TYPE, MembershipType.ETCD);
Configuration.set(PropertyKey.ETCD_ENDPOINTS, getClientEndpoints());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u dont need to set the config here anymore if getHealthyEtcdMemberMgr is used

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

workerNetAddress.setHttpServerPort(1021);
membershipManager.join(wkr);
// check if the worker is rejoined and information updated
curWorkerInfo = membershipManager.getLiveMembers().getWorkerById(workerIdentity);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also assert the getallmembers only have one worker -> meaning that we are not creating two entries

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! I added that assertion.

@@ -401,4 +401,44 @@ public void testSameWorkerIdentityConflict() throws Exception {
Assert.assertTrue(curWorkerInfo.isPresent());
Assert.assertEquals(wkr2.getAddress(), curWorkerInfo.get().getAddress());
}

@Test
public void testWorkerHttpServerPorts() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u say testOptionalFieldChangeInWorkerAddress

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Updated the function name to testOptionalHttpPortChangeInWorkerAddress.

@ChunxuTang
Copy link
Member Author

@lucyge2022 @dbw9580 I updated the PR according to our discussion. Could you take a look again?

Comment on lines 274 to 288
public boolean customizedEquals(Object o) {
if (this == o) {
return true;
}
if (!(o instanceof WorkerNetAddress)) {
return false;
}
WorkerNetAddress that = (WorkerNetAddress) o;
return mHost.equals(that.mHost)
&& mContainerHost.equals(that.mContainerHost)
&& mSecureRpcPort == that.mSecureRpcPort
&& mRpcPort == that.mRpcPort
&& mDataPort == that.mDataPort
&& mWebPort == that.mWebPort
&& mDomainSocketPath.equals(that.mDomainSocketPath);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this does not have to be put in WorkerNetAddress because the costumized comparison can be different depending on the caller. You can remove this and move code to WorkerServiceEntity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I moved the comparison logic to WorkerServiceEntity.

WorkerServiceEntity anotherO = (WorkerServiceEntity) o;
return mIdentity.equals(anotherO.mIdentity)
&& getServiceEntityName().equals(anotherO.getServiceEntityName())
&& mAddress.customizedEquals(anotherO.getWorkerNetAddress());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move the comparison here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use name as consideredEquivalent ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, or be explicit about what is considered equivalent, e.g. equalsIgnoringOptionalFields

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the function name to equalsIgnoringOptionalFields.

Copy link
Contributor

@dbw9580 dbw9580 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@ChunxuTang ChunxuTang added area-worker Alluxio worker workload-machine-learning Machine Learning Workloads labels Jan 31, 2024
@ChunxuTang
Copy link
Member Author

@lucyge2022 @dbw9580 @ssyssy @jja725 Thanks for your review!

@ChunxuTang
Copy link
Member Author

alluxio-bot, merge this please

@alluxio-bot
Copy link
Contributor

merge failed:
Merge refused because pull request does not have label start with type-

@ChunxuTang ChunxuTang added the type-feature This issue is a feature request label Jan 31, 2024
@ChunxuTang
Copy link
Member Author

alluxio-bot, merge this please

@alluxio-bot alluxio-bot merged commit 262b2b3 into Alluxio:main Jan 31, 2024
14 checks passed
@ChunxuTang ChunxuTang deleted the http-port branch January 31, 2024 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-s3-rest-api area-worker Alluxio worker type-feature This issue is a feature request workload-machine-learning Machine Learning Workloads
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants