
Improve distributed load #18153

Merged 1 commit into Alluxio:main from the yimin/improve-distributed-load branch on Sep 15, 2023

Conversation

@elega (Contributor) commented Sep 14, 2023

What changes are proposed in this pull request?

Improve distributed load

  1. Configurable job failure criteria
  2. A configuration option to control whether a load job is restored from the journal
  3. An option to skip files that are already fully loaded
  4. A retry count for failed files
  5. Bug fixes

Why are the changes needed?

To enhance the distributed load tool

Does this PR introduce any user facing changes?

Yes. The skip-if-exists option is added to the distributed load CLI.

@elega force-pushed the yimin/improve-distributed-load branch from c967e8e to c8bc49d on September 15, 2023 at 11:29
@JiamingMai (Contributor) commented:

LGTM

@JiamingMai added the type-feature (feature request) label on Sep 15, 2023
@elega (Contributor, Author) commented Sep 15, 2023:

alluxio-bot, merge this please.

@alluxio-bot merged commit 6a9f5fd into Alluxio:main on Sep 15, 2023
12 checks passed
@jja725 (Contributor) left a comment:

Thanks for the improvement, but I do have some concerns regarding retry and the health threshold.

  public static final PropertyKey MASTER_DORA_LOAD_JOB_TOTAL_FAILURE_COUNT_THRESHOLD =
      intBuilder(Name.MASTER_DORA_LOAD_JOB_TOTAL_FAILURE_COUNT_THRESHOLD)
          .setDefaultValue(-1)
          .setDescription("The load job total load failure count threshold. -1 means never fail.")
@jja725 (Contributor):

Any reason we set this to never fail by default? I think in a production environment we should fail fast if there are a lot of exceptions. Also, we record every failure, which can cause a lot of memory pressure on the cluster.

@elega (Contributor, Author):

Because sometimes the loading takes quite long. In my last experience, it took 2-3 days to load all the data from the UFS (about 300M files, 0.5 PB). If we fail the job partway through whenever there are too many errors, the job will never succeed. So what I did was let it load as much as possible and never fail.
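For illustration, a minimal sketch of how a total failure-count threshold with -1 meaning "never fail" can be enforced; the class and method names here are hypothetical and not the PR's actual code:

// Hypothetical sketch, not the PR's implementation: enforce a total
// failure-count threshold where -1 disables the check entirely.
public class LoadFailurePolicy {
  private final int mFailureThreshold; // value of the threshold property; -1 = never fail
  private int mTotalFailureCount = 0;

  public LoadFailurePolicy(int failureThreshold) {
    mFailureThreshold = failureThreshold;
  }

  /** Records one failed file and returns true if the job should now be failed. */
  public synchronized boolean recordFailureAndCheck() {
    mTotalFailureCount++;
    return mFailureThreshold >= 0 && mTotalFailureCount > mFailureThreshold;
  }
}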

@@ -2291,6 +2291,38 @@ public String toString() {
          .setConsistencyCheckLevel(ConsistencyCheckLevel.WARN)
          .setScope(Scope.MASTER)
          .build();
  public static final PropertyKey MASTER_SCHEDULER_RESTORE_JOB_FROM_JOURNAL =
      booleanBuilder(Name.MASTER_SCHEDULER_RESTORE_JOB_FROM_JOURNAL)
@jja725 (Contributor):

I'm OK with adding an option, but I would prefer to have fewer properties; one of the big complaints from customers is that we have too many. In which situation would we benefit from setting this to false? The only one I can think of is a test environment.

@elega (Contributor, Author):

I guess we discussed this offline. I proposed stopping the restoration of incomplete jobs from the journal, which you had concerns about; that's why I added an option for it. Mind reminding me if there's a better way?

When I used this distributed load tool on a real cluster, I noticed that after restarting the cluster the job just starts again unexpectedly, which created some operational challenges.
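To make the intent concrete, a hedged sketch of how a restore-from-journal switch could gate job restoration when the master starts; only the property name comes from the diff, the surrounding class and method are hypothetical:

// Hypothetical sketch: skip restoring journaled jobs when the
// restore-from-journal property is disabled. Not the PR's actual code.
import java.util.List;

public final class JournalRestoreGate {
  private JournalRestoreGate() {}

  /**
   * @param restoreFromJournal value of MASTER_SCHEDULER_RESTORE_JOB_FROM_JOURNAL
   * @param journaledJobs unfinished jobs recovered from the journal
   */
  public static void maybeResume(boolean restoreFromJournal, List<Runnable> journaledJobs) {
    if (!restoreFromJournal) {
      return; // operators can keep unfinished jobs from resuming after a restart
    }
    for (Runnable job : journaledJobs) {
      job.run(); // resume each unfinished job
    }
  }
}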

@@ -108,11 +108,14 @@ public class DoraLoadJob extends AbstractJob<DoraLoadJob.DoraLoadTask> {

  // Job states
  private final Queue<String> mRetryFiles = new ArrayDeque<>();
  private final Map<String, Integer> mRetryCount = new ConcurrentHashMap<>();
@jja725 (Contributor):

I don't quite understand the point of a retry count: if an error is retryable we just retry (and shouldn't need to retry many times), and if it's not retryable we simply don't retry. What we should do is pass the correct exception to the scheduler.

@elega (Contributor, Author):

First, this retryable flag isn't reliable. Workers populate it based on the exception type, and sometimes we don't handle that correctly; e.g., in the previous implementation any UnknownRuntimeException was marked as not retryable, which is mostly wrong. In addition, individual UFS implementations may throw different exceptions, so it's hard to set a correct flag on the worker side.

Second, even if an error is retryable, we should not retry it indefinitely; endless retries might block the following tasks or degrade performance significantly.

Also, sometimes the retryable flag isn't available at all. What if the RPC itself fails? Do we want to retry in that case too?
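A minimal sketch of the bounded-retry idea discussed above, built around the mRetryFiles queue and mRetryCount map shown in the diff; the retry limit and method name are assumptions, not the PR's actual code:

// Illustrative sketch only; the limit and method name are assumed.
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;

public class BoundedRetryTracker {
  private static final int MAX_RETRIES_PER_FILE = 3; // assumed limit

  private final Queue<String> mRetryFiles = new ArrayDeque<>();
  private final Map<String, Integer> mRetryCount = new ConcurrentHashMap<>();

  /** Queues a failed file for another attempt unless it has hit the limit. */
  public synchronized boolean retryOrGiveUp(String ufsPath) {
    int attempts = mRetryCount.merge(ufsPath, 1, Integer::sum);
    if (attempts > MAX_RETRIES_PER_FILE) {
      return false; // give up and count the file as permanently failed
    }
    mRetryFiles.offer(ufsPath);
    return true;
  }
}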

@jja725 (Contributor):

https://www.alibabacloud.com/help/en/oss/developer-reference/error-handling-1
Regarding error handling, we can polish our UFS exceptions, such as AlluxioS3Exception, and handle each exception type properly. I agree that we could have a retry limit for failed files to avoid unlimited retries, but it should be combined with the health threshold to ensure we are not storing too many failed files.

      AlluxioRuntimeException t = AlluxioRuntimeException.from(e);
      errors.add(LoadFileFailure.newBuilder().setUfsStatus(status.toProto())
          .setCode(t.getStatus().getCode().value())
          .setRetryable(true)
@jja725 (Contributor):

Why always set retryable to true? Then what's the meaning of the retryable flag?

@elega (Contributor, Author):

Because this flag is not reliable: we are not able to tell whether an exception is really retryable just by looking at t.isRetryable(). This caused a couple of issues when I used the tool previously, either ignoring retryable errors or retrying endlessly. If you want to improve this, I think you can default retryable to true and only mark a handful of exceptions as non-retryable.
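A sketch of the approach suggested here: default to retryable and mark only a short, explicit list of clearly permanent errors as non-retryable. The specific exception types chosen below are illustrative assumptions, not the PR's actual classification:

// Illustrative sketch; the non-retryable exception types are assumptions.
import java.io.FileNotFoundException;
import java.nio.file.AccessDeniedException;

public final class RetryableClassifier {
  private RetryableClassifier() {}

  /** Returns false only for errors that a retry cannot fix. */
  public static boolean isRetryable(Throwable t) {
    return !(t instanceof FileNotFoundException
        || t instanceof AccessDeniedException
        || t instanceof IllegalArgumentException);
  }
}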

          .setCode(t.getStatus().getCode().value())
          .setRetryable(t.isRetryable() && permissionCheckSucceeded)
          .setMessage(t.getMessage()).build());
      if (!loadData || !status.isFile()) {
@jja725 (Contributor):

Nit: I'm not sure changing the else condition to continue helps readability.

@jja725 (Contributor) commented Sep 15, 2023:

And could you elaborate more on the bug-fixing part? It may be hiding somewhere I didn't notice.
