Fix etcd calls exception handling and adjust failure detection timeout #18554

Merged
merged 4 commits into from
Mar 28, 2024

lucyge2022
Contributor

What changes are proposed in this pull request?

  1. Catch all exceptions thrown while fetching the worker cluster view from etcd during a worker-view refresh, instead of propagating them all the way to the caller.
  2. Make the worker failure detection timeout (the timeout used to decide that a worker is in the FAILED state) configurable by setting the lease TTL of the service discovery entity.
  3. Make newLeaseInternal always overwrite the key with the newly created lease.
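
As an illustration of item 1, here is a minimal sketch of a refresh that catches failures instead of propagating them. The class name `WorkerViewCache` and the `Supplier` standing in for the etcd fetch are hypothetical, and falling back to the last successfully fetched view is an assumption; this is not the code in this PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical illustration only; not the Alluxio implementation. */
final class WorkerViewCache {
  // Stands in for the call that fetches the worker cluster view from etcd.
  private final Supplier<List<String>> fetcher;
  // Last view that was fetched successfully (assumed fallback behavior).
  private volatile List<String> lastKnown = Collections.emptyList();

  WorkerViewCache(Supplier<List<String>> fetcher) {
    this.fetcher = fetcher;
  }

  /** Refreshes the view; on any runtime failure the previous view is kept. */
  List<String> refresh() {
    try {
      lastKnown = Collections.unmodifiableList(new ArrayList<>(fetcher.get()));
    } catch (RuntimeException e) {
      // Log and swallow so the failure does not reach the caller's IO path.
      System.err.println("Worker view refresh failed, keeping cached view: " + e);
    }
    return lastKnown;
  }
}
```

With this shape, a caller that reads the worker list during an etcd outage still gets the previous view rather than an exception.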

Why are the changes needed?

To resolve:

  1. Currently a worker is considered down only 2 seconds after it disconnects from etcd, which is too short. Make the failure detection timeout configurable for registered service discovery services.
  2. An etcd-unavailable runtime exception is propagated to the caller, which is not ideal. For now it is caught at the FileSystemContext.getCachedWorkers layer so that it does not propagate to the IO layer and cause an unnecessary UFS fallback, such as a cold read.
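
To make the effect of the timeout in item 1 concrete, here is a small local simulation of lease-style failure detection with a configurable TTL. In etcd the lease expiry is enforced by the server; the class `LeaseFailureDetector`, its methods, and the injected clock are all hypothetical and only model the idea.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical local model of lease-based failure detection. */
final class LeaseFailureDetector {
  private final long ttlMillis;      // configurable failure detection timeout
  private final LongSupplier clock;  // injected so the example is testable
  private final Map<String, Long> lastKeepAlive = new HashMap<>();

  LeaseFailureDetector(long ttlMillis, LongSupplier clock) {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  /** Records a keep-alive for the worker, as a lease renewal would. */
  void keepAlive(String workerId) {
    lastKeepAlive.put(workerId, clock.getAsLong());
  }

  /** A worker is failed if it never registered or its lease outlived the TTL. */
  boolean isFailed(String workerId) {
    Long last = lastKeepAlive.get(workerId);
    return last == null || clock.getAsLong() - last > ttlMillis;
  }
}
```

With a TTL of 10 seconds, a worker that has been silent for 2 seconds is still treated as live, whereas a 2-second TTL would already be on the edge of marking it failed.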

Does this PR introduce any user facing changes?

No

@JiamingMai JiamingMai added the type-bug This issue is about a bug label Mar 27, 2024
Contributor

@JiamingMai JiamingMai left a comment

LGTM. Thanks for the work!

Contributor

@Kai-Zhang Kai-Zhang left a comment

LGTM

@@ -68,11 +74,11 @@ public class MembershipManagerTest {
 /*
 @BeforeClass
 public static void init() {
-PropertyConfigurator.configure("alluxio/conf/log4j.properties");
+PropertyConfigurator.configure("/Users/lucyge/Documents/github/alluxio/conf/log4j.properties");
Contributor

Note that the absolute path contains your personal information.

Contributor Author

Ah, good catch, thanks!

@lucyge2022
Contributor Author

alluxio-bot, merge this please.

@alluxio-bot alluxio-bot merged commit fd5478e into Alluxio:main Mar 28, 2024
14 checks passed