Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spot] Event callback for adding notification #2106

Merged
merged 24 commits into from
Jul 7, 2023
Merged

Conversation

MaoZiming
Copy link
Collaborator

@MaoZiming MaoZiming commented Jun 20, 2023

Issue #2000: A user would like to have a customizable event callback for the spot job, so that they can implement some notification when the spot job status changes. Example:

event_callback:
  send_slack_message $JOB_ID $JOB_STATUS

Allows the controller to execute a customizable bash script when the spot status changes.

Example:
Register a Slack message web hook from here.

event_callback: |
  echo 'this is an event callback script'
  curl -X POST -H 'Content-type: application/json' --data "{\"text\":\"JOB_ID: $SKYPILOT_JOB_ID, JOB_STATUS: $JOB_STATUS\"}" https://hooks.slack.com/services/...

The user will receive a Slack notification when the spot status changes.
image

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@MaoZiming MaoZiming linked an issue Jun 20, 2023 that may be closed by this pull request
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this feature @MaoZiming! It would be very useful for the users. Left several comments.

Additional comments:

  1. We may want to call the event callback when the spot job starts recovering as well, according to the users feedback.
  2. We may want to think of whether we should run the callback asynchronously, instead of in a blocking way. Otherwise, our control logic may be significantly delayed.

sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
sky/task.py Outdated Show resolved Hide resolved
@MaoZiming MaoZiming marked this pull request as ready for review June 23, 2023 03:26
@MaoZiming
Copy link
Collaborator Author

MaoZiming commented Jun 23, 2023

@Michaelvll Incorporated all the comments. The basic skeleton for slack notification should be there. Prob need better UI

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix @MaoZiming! The slack messages were sent like a magic. It works well for STARTING/RUNNING/RECOVERING/RECOVERED/SUCCEEDED, but it seems like not working well with the CANCELLING/CANCELLED.

sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/spot_state.py Outdated Show resolved Hide resolved
sky/task.py Outdated Show resolved Hide resolved
@@ -419,9 +470,13 @@ def set_cancelling(job_id: int):
(SpotStatus.CANCELLING.value, time.time(), job_id))
if rows.rowcount > 0:
logger.info('Cancelling the job...')
if callback_func is not None:
callback_func(-1, 'CANCELLING', '')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weirdly, when I am trying this PR, the STARTING, RUNNING, RECOVERING, RECOVERED works normally, while this CANCELLING/CANCELLED state was not sent to my slack. Do you have an idea why this happens?
Fyi, I did the following:

  1. sky spot launch test.yaml
  2. manually kill the cluster on the console
  3. the states were correctly sent
  4. sky spot cancel -a: the cancelling/cancelled states were not correctly sent.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For set_cancelling outside of SpotController, I didn't pass in event_callback_func.
https://github.com/skypilot-org/skypilot/blob/spot-event-callback/sky/spot/controller.py#L456

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to pass the event_callback_func to those as well? Otherwise, sky spot cancel -a will not send the message.

Copy link
Collaborator Author

@MaoZiming MaoZiming Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to think of a clean way to do this since we need to call event_callback_func outside of SpotController and multiprocessing.Process(target=_run_controller...).
Do you know if is there an easy way to get task inside the child process?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, as we did for the _cleanup function, we can retrieve the dag using the day_yaml as we did in _cleanup function

_cleanup(job_id, dag_yaml=dag_yaml)

However, we are not able to get some of the environment variable. We can probably print some log in the event callback saying which variables are not accessible, and we can wait for the user feedback to see if they want to have some of the variables in the CANCEL.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Michaelvll Thanks for all the detailed feedback! The suggested changes should all be incorporated

sky/spot/spot_state.py Outdated Show resolved Hide resolved
sky/spot/spot_state.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this exciting PR @MaoZiming! The code looks good to me. Several small nits.

@@ -19,6 +19,7 @@
from sky.spot import recovery_strategy
from sky.spot import spot_state
from sky.spot import spot_utils
from sky.spot.spot_utils import event_callback_func
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit: import module instead of functions. https://google.github.io/styleguide/pyguide.html#224-decision

Comment on lines 436 to 437
first_task_id = 0
first_task = dag.tasks[first_task_id]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
first_task_id = 0
first_task = dag.tasks[first_task_id]
first_task = dag.tasks[0]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, do you think we can use spot_state.get_latest_task_id_status to get the current task_id, so that the message can include the information of the currently cancelled task?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. I think that's better. Thanks!

cluster_name = generate_spot_cluster_name(task.name,
job_id) if task.name else None
logger.info(f'=== START: event callback for {state!r} ===')
log_path = os.path.join(constants.SKY_LOGS_DIRECTORY,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we place it in a folder instead of the root directory of the logs? Probably, under ~/sky_logs/spot_event?

@MaoZiming
Copy link
Collaborator Author

Thank you so much @Michaelvll! I have incorporated all the suggestions.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix @MaoZiming! I just realized that the current way to pass the callback function can be a bit confusing. Do you think we can do something like the comments below?

sky/spot/spot_state.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for awesome work @MaoZiming! LGTM. Please feel free to merge it.

@MaoZiming MaoZiming merged commit 9f39f8d into master Jul 7, 2023
@MaoZiming MaoZiming deleted the spot-event-callback branch July 7, 2023 01:20
@concretevitamin
Copy link
Member

Very nice @MaoZiming!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Spot] Event callback for adding notification
3 participants