Improve distributed load #18362
Conversation
private final Queue<LoadSubTask> mRetrySubTasksDLQ = new ArrayDeque<>();
// Only the most recent 1k failures are persisted.
// If more are needed, can turn on debug LOG for this class.
private final Queue<Pair<LoadSubTask, String>> mRecentFailures =
Previously we stored the whole error message for every error. Each entry takes up about 2KB of memory, which I think is too much, and --verbose cannot really display that much information anyway. So I only keep the most recent 1,000 entries to avoid OOM.
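A minimal sketch of this bounded "recent failures" idea, assuming a cap of 1,000 entries and oldest-first eviction; the class and field names here are illustrative stand-ins, not the exact DoraLoadJob code.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;

public class RecentFailureBuffer {
  // Assumption: the cap matches the "most recent 1000" figure from the discussion above.
  private static final int MAX_RECENT_FAILURES = 1000;

  // Each entry pairs a subtask's UFS path with its (possibly truncated) error message.
  private final Queue<Map.Entry<String, String>> mRecentFailures = new ArrayDeque<>();

  public synchronized void record(String ufsPath, String errorMessage) {
    if (mRecentFailures.size() >= MAX_RECENT_FAILURES) {
      mRecentFailures.poll(); // evict the oldest failure so memory stays bounded
    }
    mRecentFailures.offer(new SimpleImmutableEntry<>(ufsPath, errorMessage));
  }
}
```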
LoadSubTask subTaskToRetry = mRetrySubTasksDLQ.poll();
String path = subTaskToRetry.getUfsPath();
try {
  if (GET_LATEST_UFS_ON_RETRY) {
Provide an option to avoid fetching metadata again: getting a UFS status without any concurrency is inefficient, and the files to load (in AI use cases) barely change.
We can potentially delete this option and remove the handling of this situation. If we get a not-found exception here, it means we would get an exception from the worker anyway. Having this just complicates the logic. I would defer to you to make the call.
Yeah, I agree with you. I actually wanted to delete this one, but I didn't want to be too aggressive. If you also think it should be removed, I'd be more than happy to do so.
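For context, a hedged sketch of how such a flag could gate the metadata re-fetch on retry. GET_LATEST_UFS_ON_RETRY comes from the diff above, while MetadataFetcher and prepareRetry are hypothetical stand-ins, not Alluxio APIs.

```java
import java.io.FileNotFoundException;

public class RetryPathSketch {
  /** Hypothetical stand-in for whatever re-reads UFS metadata for a path. */
  interface MetadataFetcher {
    void refresh(String ufsPath) throws FileNotFoundException;
  }

  // Option discussed above: skip the extra metadata call since loaded files rarely change.
  private static final boolean GET_LATEST_UFS_ON_RETRY = true;

  /** Returns true if the subtask should be retried, false if it should be dropped. */
  static boolean prepareRetry(String ufsPath, MetadataFetcher fetcher) {
    try {
      if (GET_LATEST_UFS_ON_RETRY) {
        // Re-check the file in the UFS before retrying.
        fetcher.refresh(ufsPath);
      }
      return true;
    } catch (FileNotFoundException e) {
      // The file is gone from the UFS; the worker would fail anyway, so drop the retry.
      return false;
    }
  }
}
```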
@@ -249,46 +327,6 @@ private Map<WorkerInfo, DoraLoadTask> aggregateSubTasks(List<LoadSubTask> subTas
    return workerToTaskMap;
  }

private List<LoadSubTask> createSubTasks(UfsStatus ufsStatus, Set<WorkerInfo> workers) { |
these have been moved inside the LoadSubTaskIterator
  addFileFailure(subTask.getUfsPath(), type, message);
  mRetryCount.remove(subTask);
  return true;
if (subTask.isRetry() || !isHealthy()) {
Basically we only retry once, so if a subtask has already been retried, just fail it.
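A small sketch of this retry-at-most-once rule, assuming a subtask exposes an isRetry() flag as in the diff above; the surrounding types are simplified stand-ins rather than the actual job classes.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class RetryOncePolicy {
  /** Hypothetical minimal stand-in for the real LoadSubTask. */
  interface LoadSubTask {
    boolean isRetry(); // true if this subtask already came out of the retry queue
  }

  private final Queue<LoadSubTask> mRetrySubTasksDLQ = new ArrayDeque<>();

  /** Returns true if the subtask is finally failed, false if it was queued for one retry. */
  boolean onSubTaskFailure(LoadSubTask subTask, boolean jobHealthy) {
    if (subTask.isRetry() || !jobHealthy) {
      // Already retried once (or the job is unhealthy): record a permanent failure.
      return true;
    }
    mRetrySubTasksDLQ.offer(subTask); // first failure: give it exactly one more chance
    return false;
  }
}
```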
@@ -402,7 +478,8 @@ public int hashCode() {

  @Override
  public boolean isHealthy() {
    long totalFailureCount = mTotalFailureCount.get();
    // Tasks to retry are considered as "failed" tasks when calculating if the job is healthy.
    long totalFailureCount = mTotalFinalFailureCount.get() + mRetrySubTasksDLQ.size();
^^
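The "^^" points at the in-diff comment above. A hedged sketch of that health check, where retry-queue entries count toward the failure total; the threshold value and field names are assumptions for illustration, not the real configuration.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;

public class HealthCheckSketch {
  private final AtomicLong mTotalFinalFailureCount = new AtomicLong();
  private final Queue<Object> mRetrySubTasksDLQ = new ArrayDeque<>();
  // Assumed threshold, e.g. terminate the job once ~5M failures/retries have piled up.
  private final long mFailureThreshold = 5_000_000L;

  public boolean isHealthy() {
    // Subtasks still waiting for retry are treated as failures for this decision,
    // so an ever-growing retry queue fails the job instead of exhausting memory.
    long totalFailureCount = mTotalFinalFailureCount.get() + mRetrySubTasksDLQ.size();
    return totalFailureCount < mFailureThreshold;
  }
}
```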
@@ -38,11 +41,22 @@ public MembershipManagerWorkerProvider(MembershipManager membershipMgr,
    FileSystemContext context) {
  mMembershipManager = membershipMgr;
  mContext = context;
  if (context != null && context.getClusterConf() != null) {
The previous implementation wasn't consistent with the client implementation, where USER_DYNAMIC_CONSISTENT_HASH_RING_ENABLED is checked to determine whether dead workers are included or not.
I'm open to reverting this change, but I feel the current state is inconsistent.
let's keep it for consistency for now.
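A hedged sketch of the consistency point being discussed: the same flag the client checks decides which worker set is handed to the hash ring. The mapping shown (dynamic ring uses live workers only, static ring keeps all registered workers) and all types are assumptions for illustration, not the actual MembershipManagerWorkerProvider code.

```java
import java.util.List;

public class WorkerSelectionSketch {
  /** Hypothetical stand-in for the membership manager. */
  interface MembershipSource {
    List<String> getLiveWorkers();
    List<String> getAllWorkers(); // live + dead
  }

  // Mirrors USER_DYNAMIC_CONSISTENT_HASH_RING_ENABLED from the discussion above.
  private final boolean mDynamicHashRingEnabled;

  WorkerSelectionSketch(boolean dynamicHashRingEnabled) {
    mDynamicHashRingEnabled = dynamicHashRingEnabled;
  }

  List<String> workersForHashRing(MembershipSource members) {
    // Assumed direction: a dynamic ring only sees live workers so it can move with
    // membership; a static ring keeps every registered worker so assignments stay stable.
    return mDynamicHashRingEnabled ? members.getLiveWorkers() : members.getAllWorkers();
  }
}
```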
Task task = tasksQ.peek();
// make sure only 1 task is running at a time
if (task == null || task.getResponseFuture() != null) {
  LOG.debug("head task is {}", (task == null) ? "NULL" : "already running");
  return;
}
if (blkWorkerClientResource == null) {
I'm not sure about this either.
If the change above is applied and USER_DYNAMIC_CONSISTENT_HASH_RING_ENABLED is set to true, a membership change will lead to a subtask being dropped. We need a way to notify the job so that it can fail the task.
I would agree with this change. Previously we didn't add that, since I think in the AI/ML use case the dynamic ring is not that useful: all data needs to be loaded into Alluxio first (and that's where the load command is used).
But I guess adding the file directly to the failures here is also not correct behavior in DoraLoadJob under the dynamic hash ring situation, since we want to assign the task to another worker.
My bad. I meant adding this file to the retry queue instead of just failing it.
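A sketch of the resolution reached in this thread, using hypothetical types: when the ring moves and a worker no longer owns a subtask, the subtask is handed back to the job's retry queue rather than recorded as a file failure, since only the assignment is stale.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

public class RingMoveSketch {
  /** Hypothetical stand-in for the real LoadSubTask. */
  interface SubTask {
    String getUfsPath();
  }

  /** Job side: re-queue instead of failing the file. */
  static final class Job {
    private final Queue<SubTask> mRetrySubTasksDLQ = new ArrayDeque<>();

    void onSubTaskDropped(SubTask subTask) {
      mRetrySubTasksDLQ.offer(subTask); // re-plan onto whichever worker now owns it
    }

    int pendingRetries() {
      return mRetrySubTasksDLQ.size();
    }
  }

  /** Scheduler side: drop subtasks whose target worker is no longer in the ring. */
  static void dropIfReassigned(SubTask subTask, boolean workerStillOwnsIt,
      Consumer<SubTask> notifyJob) {
    if (!workerStillOwnsIt) {
      notifyJob.accept(subTask); // e.g. job::onSubTaskDropped
    }
  }
}
```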
Overall LGTM, thanks for the improvement!
        .setConsistencyCheckLevel(ConsistencyCheckLevel.WARN)
public static final PropertyKey MASTER_DORA_LOAD_JOB_FAILED_FILE_LIST_DIR =
    stringBuilder(Name.MASTER_DORA_LOAD_JOB_FAILED_FILE_LIST_DIR)
        .setDefaultValue(format("${%s}/distributed_load", Name.WORK_DIR))
Could we put this under job/load? Potentially other jobs would put information in this directory too.
I propose /job_results/load.
}
mRetryDeadLetterQueueSize = job.mRetrySubTasksDLQ.size();
mTimeElapsed =
    (job.mJobFinishTimestamp == -1 ? CommonUtils.getCurrentMs() : job.mJobFinishTimestamp)
Using Optional instead would be more intuitive.
You're right, thanks.
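A sketch of the Optional suggestion, assuming the -1 sentinel on mJobFinishTimestamp is replaced by an OptionalLong; the names are illustrative, not the exact PR code.

```java
import java.util.OptionalLong;

public class ElapsedTimeSketch {
  private final long mJobStartTimestampMs;
  // Empty while the job is still running; set exactly once when the job finishes.
  private volatile OptionalLong mJobFinishTimestampMs = OptionalLong.empty();

  ElapsedTimeSketch(long jobStartTimestampMs) {
    mJobStartTimestampMs = jobStartTimestampMs;
  }

  void markFinished(long nowMs) {
    mJobFinishTimestampMs = OptionalLong.of(nowMs);
  }

  long timeElapsedMs(long nowMs) {
    // Use the finish time if present, otherwise "now": no magic -1 to remember.
    return mJobFinishTimestampMs.orElse(nowMs) - mJobStartTimestampMs;
  }
}
```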
if (mCurrentUfsStatusSubTaskIterator.hasNext()) {
  return mCurrentUfsStatusSubTaskIterator.next();
}
UfsStatus ufsStatus = mUfsStatusIterator.next();
We need to check mUfsStatusIterator.hasNext() before calling next().
Good catch... I let it throw NoSuchElementException now.
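A sketch of the agreed behavior, assuming a simplified two-level iterator in place of LoadSubTaskIterator: hasNext() advances through the inner iterators, and next() surfaces NoSuchElementException once everything is exhausted.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class SubTaskIteratorSketch<T> implements Iterator<T> {
  private final Iterator<Iterator<T>> mOuter; // e.g. one inner iterator per UfsStatus
  private Iterator<T> mCurrent = Collections.emptyIterator();

  SubTaskIteratorSketch(Iterator<Iterator<T>> outer) {
    mOuter = outer;
  }

  @Override
  public boolean hasNext() {
    // Advance to the next non-empty inner iterator, if any.
    while (!mCurrent.hasNext() && mOuter.hasNext()) {
      mCurrent = mOuter.next();
    }
    return mCurrent.hasNext();
  }

  @Override
  public T next() {
    if (!hasNext()) {
      // Matches the resolution above: fail loudly instead of calling next() blindly.
      throw new NoSuchElementException("no more subtasks");
    }
    return mCurrent.next();
  }
}
```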
    batchBuilder.add(mLoadSubTaskIterator.next());
    numSubTasks++;
  } else if (!mRetrySubTasksDLQ.isEmpty()) {
    LoadSubTask subTaskToRetry = mRetrySubTasksDLQ.poll();
A previous strategy was to retry at the end of the load unless there were too many subtasks waiting to be retried (to avoid memory pressure), and that's why we have RETRY_BLOCK_CAPACITY. What do you think?
Yea, this is a really good question... I just followed what we discussed last time in the meeting.
The downside of retrying immediately is that if a worker goes down for a while (e.g. 15 minutes), the immediate retry won't succeed and the load subtask will be counted as a failure...
However, retrying at the end of the loading process, as you mentioned, caused memory pressure. That's why I made a lot of changes to keep the retry queue slim. By my estimate, each retrying subtask in the queue takes about 0.5KB in memory, so 10M subtasks to retry take about 5GB. I also updated the isHealthy() method to take the tasks in the retry queue into account; we can set the job failure threshold to something like 5M to terminate the job if too many tasks pile up in the retry queue.
Actually, I have a better idea though. When we add subtasks to the retry queue, we also attach a timestamp, and we only retry subtasks that are "old" enough (e.g. only those which initially failed at least 15 minutes ago); we can make this configurable.
I'm going to add a TODO here for further improvement (likely finishing it in Q4 for the 3.1 release), but let me know if you think it's necessary to make such a change in this PR. Thanks!
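A sketch of that timestamp-gated retry idea (left as a TODO in the PR), using hypothetical names: entries remember when they first failed and are only polled once they are older than a configurable delay.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class DelayedRetryQueueSketch<T> {
  private static final class Entry<E> {
    final E mSubTask;
    final long mFailedAtMs;

    Entry(E subTask, long failedAtMs) {
      mSubTask = subTask;
      mFailedAtMs = failedAtMs;
    }
  }

  private final Queue<Entry<T>> mQueue = new ArrayDeque<>();
  private final long mMinAgeMs; // e.g. 15 minutes, assumed to be configurable

  DelayedRetryQueueSketch(long minAgeMs) {
    mMinAgeMs = minAgeMs;
  }

  void add(T subTask, long nowMs) {
    mQueue.offer(new Entry<>(subTask, nowMs)); // remember when the first failure happened
  }

  /** Returns the next subtask whose first failure is at least mMinAgeMs old, or null. */
  T pollReady(long nowMs) {
    Entry<T> head = mQueue.peek();
    if (head == null || nowMs - head.mFailedAtMs < mMinAgeMs) {
      return null; // nothing old enough yet; keep scheduling fresh subtasks instead
    }
    return mQueue.poll().mSubTask;
  }
}
```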
LGTM, we can revisit the retry strategy later on
alluxio-bot, merge this please. |
### What changes are proposed in this pull request?

Features:
1. Retry at the end of the load process, only once
2. Persist the failed file list once the load job is done

Improvements:
1. Only keep the most recent 1000 errors to save memory
2. Create a LoadSubTaskIterator and refactor the task generation process
3. Add an option to skip fetching the UFS status again to speed up the retry (defaulted on)
4. --progress displays the elapsed time
5. Add a metric to record error attribution
6. --progress additionally displays subtask-level metrics
7. Fast-fail the task when the hash ring moves
8. --progress --verbose displays information about the subtask (instead of the file name only)

Bug fixes:
1. Fix the wrong behavior where the distributed load still gives all workers when the dynamic hash ring is disabled
2. Fix the incorrect display of the loading progress report

![image](https://github.com/Alluxio/alluxio/assets/6771554/ff2c9ccb-0eed-472d-98cc-9cc53def37ba)

pr-link: #18362
change-id: cid-7fb3f6f9230c04a01f7f466b32726c906e191ee2