Basic reschedule subtask #2467

Merged: 10 commits, Sep 24, 2021


@fyrestone (Contributor) commented Sep 17, 2021

What do these changes do?

  • Handle the subtask result in one place: the SubtaskManagerActor.
    • before
      • The SubtaskExecutionActor calls TaskAPI.set_subtask_result to set the subtask result.
      • The SubtaskManagerActor handles run_subtask exceptions.
    • after
      • The SubtaskManagerActor gets and sets the subtask result, and also handles run_subtask exceptions.
  • Acquire and release global slots in the supervisor.
    • before
      • The SubtaskExecutionActor releases global slots.
    • after
      • The SubtaskManagerActor releases global slots.
  • Add an option, subtask_max_reschedules, to reschedule failed subtasks. (This PR does not handle a crash of the worker main pool.)
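
For readers who want a picture of the "after" flow, here is a minimal sketch, not the code in this PR. It assumes a single manager owns the result, the retry loop, and the slot release. `ToySubtaskManager`, `SubtaskResult`, `_release_slots`, and `make_flaky` are hypothetical names made up for illustration; only `subtask_max_reschedules` and `run_subtask` come from the description above, and the real actor interfaces in Mars may look quite different.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List


@dataclass
class SubtaskResult:
    subtask_id: str
    status: str  # "succeeded" or "errored"
    attempts: int
    error: str = ""


@dataclass
class ToySubtaskManager:
    """Hypothetical stand-in for a manager that owns results, retries and slots."""

    subtask_max_reschedules: int = 0
    released: List[str] = field(default_factory=list)

    async def _release_slots(self, subtask_id: str) -> None:
        # Stand-in for releasing global slots; here we only record the call.
        self.released.append(subtask_id)

    async def run(
        self,
        subtask_id: str,
        run_subtask: Callable[[str], Awaitable[None]],
    ) -> SubtaskResult:
        attempts = 0
        try:
            while True:
                attempts += 1
                try:
                    await run_subtask(subtask_id)
                    return SubtaskResult(subtask_id, "succeeded", attempts)
                except Exception as exc:
                    # Give up once the reschedule budget is used up.
                    if attempts > self.subtask_max_reschedules:
                        return SubtaskResult(
                            subtask_id, "errored", attempts, repr(exc)
                        )
        finally:
            # Slots are released by the manager exactly once, whatever the outcome.
            await self._release_slots(subtask_id)


def make_flaky(failures: int) -> Callable[[str], Awaitable[None]]:
    """Build a fake run_subtask that fails `failures` times, then succeeds."""
    remaining = {"n": failures}

    async def run_subtask(subtask_id: str) -> None:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("transient failure")

    return run_subtask


manager = ToySubtaskManager(subtask_max_reschedules=2)
result = asyncio.run(manager.run("s1", make_flaky(2)))
print(result.status, result.attempts)
```

In this sketch a subtask that fails twice still succeeds with `subtask_max_reschedules=2`, because the first run plus two reschedules gives three attempts. Keeping the slot release in a `finally` block inside the manager mirrors the idea of moving that responsibility out of the execution side.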

Related issue number

N/A

@qinxuye qinxuye modified the milestones: v0.8.0b1, v0.8.0b2 Sep 21, 2021
@fyrestone fyrestone marked this pull request as ready for review September 23, 2021 09:19
@qinxuye (Collaborator) left a comment

LGTM

@qinxuye qinxuye merged commit 25d7427 into mars-project:master Sep 24, 2021
chaokunyang added a commit to chaokunyang/mars that referenced this pull request May 31, 2022
Merge branch merge_github_2524 of git@gitlab.alipay-inc.com:ray-project/mars.git into master
https://code.alipay.com/ray-project/mars/pull_requests/58?tab=diff

Signed-off-by: 捕牛 <hejialing.hjl@antgroup.com>


* [Ray] Support reconstructing worker (mars-project#2413)


* Make cmdline support third party modules (mars-project#2454)

Co-authored-by: hanguang <zhusiyuan.zsy@alibaba-inc.com>
* Support visualizing subtask graphs on Mars Web (mars-project#2426)


* Fix timeout error when waiting for a submitted task (mars-project#2457)


* Print the error message when error happens in `TaskProcessor` (mars-project#2458)


* Add nightly builds for docker images (mars-project#2456)


* Fix misuse of `name` parameter in DataFrame align (mars-project#2469)


* Fix hang when start sub pool fails (mars-project#2468)


* Refine and unify subtask detail APIs (mars-project#2465)


* Fix coverage for Azure pipeline (mars-project#2474)


* Split tileable information and subtask graph into two tabs (mars-project#2480)


* Support specified vineyard socket and skip the launching vineyardd process (mars-project#2481)


* Basic reschedule subtask (mars-project#2467)


* Compatible with scikit-learn 1.0 (mars-project#2486)

Co-authored-by: hekaisheng <kaisheng.hks@alibaba-inc.com>
* Fix wrong translation in cluster deployment. (mars-project#2489)


* Fix bug that failed to execute query when there are multiple arguments (mars-project#2490)


* Include tileable property in detail api (mars-project#2493)


* Fix version of statsmodels to pass CI (mars-project#2497)


* Implements `glm.LogisticRegression` (mars-project#2466)


* Implements bagging sampling (mars-project#2496)


* Refine MarsDMatrix & support more parameters for XGB classifier and regressor (mars-project#2498)


* Fix output of df.groupby(as_index=False).size() (mars-project#2507)


* Add preliminary implementations for ufunc methods (mars-project#2510)


* Add doc for reading csv in oss (mars-project#2514)


* [Ray] Fix serializing lambdas in web (mars-project#2512)


* Add `make_regression` support for learn module (mars-project#2515)


* Fix reduction result on empty series (mars-project#2520)


* Fix df.loc when df is empty (mars-project#2524)


* fix start subpool

* fix test_kill_and_wait_timeout

* fix autoscale timeout

* fix ray larger cluster fixture

* Update ci ray package to 1.2.2

* remove python3.6 3.8 .39 ut and upgrade ray 3.7 image

* echo python path

* fix json decode error

* fix bundle release timeout

* fix remove placement group timeout

* fix no_restart

* fix ci

* fix autoscale