[SDK] Get Kubernetes Events for Job #1975

Merged

andreyvelich
Member

Fixes: #1863.

This will allow users to get Kubernetes Events for a Job and the Job's pods via the get_job_logs API and its verbose parameter.
For now, events are not watched the way logs are streamed with the follow=True parameter. That can be added in follow-up PRs.

The events will be returned in this format:

{
  "test-kubeflow-worker-0": [
    "2024-01-05 22:58:20 Successfully assigned kubeflow-andrey/test-kubeflow-worker-0 to .....",
    "2024-01-05 22:58:21 Container image \"docker.io/pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime\" already present on machine",
    "2024-01-05 22:58:21 Created container pytorch",
    "2024-01-05 22:58:21 Started container pytorch"
  ],
  "test-kubeflow-worker-1": [
    "2024-01-05 22:58:20 Successfully assigned kubeflow-andrey/test-kubeflow-worker-1 to ....",
    "2024-01-05 22:58:21 Container image \"docker.io/pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime\" already present on machine",
    "2024-01-05 22:58:21 Created container pytorch",
    "2024-01-05 22:58:21 Started container pytorch"
  ],
  "test-kubeflow": [
    "2024-01-05 22:58:20 Created pod: test-kubeflow-worker-0",
    "2024-01-05 22:58:20 Created pod: test-kubeflow-worker-1",
    "2024-01-05 22:58:20 Created service: test-kubeflow-worker-0",
    "2024-01-05 22:58:20 Created service: test-kubeflow-worker-1",
    "2024-01-05 22:59:11 Pod: kubeflow-andrey.test-kubeflow-worker-1 exited with code 0",
    "2024-01-05 22:59:11 Pod: kubeflow-andrey.test-kubeflow-worker-0 exited with code 0",
    "2024-01-05 22:59:13 PyTorchJob kubeflow-andrey/test-kubeflow successfully completed."
  ]
}
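
As an illustration of how the structure above could be consumed, here is a minimal sketch that flattens it into a single chronological timeline. It assumes only the format shown in this description (a 19-character timestamp, a space, then the message). The `build_timeline` helper and the sample data are hypothetical and not part of the SDK.

```python
from datetime import datetime
from typing import Dict, List, Tuple

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_LENGTH = 19  # length of a stamp such as 2024-01-05 22:58:20


def build_timeline(events: Dict[str, List[str]]) -> List[Tuple[datetime, str, str]]:
    """Flatten a mapping of object name to event lines into one sorted list."""
    timeline = []
    for object_name, lines in events.items():
        for line in lines:
            # Assumption: every line starts with a timestamp in TIMESTAMP_FORMAT.
            stamp = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
            message = line[TIMESTAMP_LENGTH:].strip()
            timeline.append((stamp, object_name, message))
    # sorted() is stable, so events sharing a timestamp keep their input order.
    return sorted(timeline, key=lambda entry: entry[0])


# Hypothetical sample in the same shape as the format shown above.
sample_events = {
    "test-kubeflow-worker-0": [
        "2024-01-05 22:58:21 Created container pytorch",
        "2024-01-05 22:58:21 Started container pytorch",
    ],
    "test-kubeflow": [
        "2024-01-05 22:58:20 Created pod: test-kubeflow-worker-0",
        "2024-01-05 22:59:13 PyTorchJob kubeflow-andrey/test-kubeflow successfully completed.",
    ],
}

timeline = build_timeline(sample_events)
for stamp, object_name, message in timeline:
    print(f"{stamp} [{object_name}] {message}")
```

Sorting by the parsed timestamp rather than the raw string keeps the sketch robust if the timestamp format ever changes to one that does not sort lexicographically.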

I also introduced a new get_job_pods API to get the data for a Job's pods.
Also, I removed support for Python3 from our SDK. I believe various Python typing capabilities weren't available in Python 3.

/assign @droctothorpe @deepanker13 @johnugeorge @tenzen-y


@andreyvelich: GitHub didn't allow me to assign the following users: droctothorpe, deepanker13.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

@coveralls

coveralls commented Jan 5, 2024

Pull Request Test Coverage Report for Build 7491318987

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on issue-1863-sdk-job-events at 42.942%

Totals Coverage Status
Change from base Build 7478335417: 42.9%
Covered Lines: 3760
Relevant Lines: 8756

💛 - Coveralls

events = self.core_api.list_namespaced_event(namespace=namespace)

# Get events for the Job and Job's pods.
for event in events.items:
Member

How do we distinguish between job and pod events? Should we add a prefix?

Member Author

@andreyvelich andreyvelich Jan 10, 2024

Right now, users can use names to get events for Job or Pods.
E.g. for PyTorchJob with name train-mnist and 2 workers:

{"train-mnist": "Events", "train-mnist-worker-0": "Events", "train-mnist-worker-1": "Events"}

Users can find all the pod names by running TrainingClient().get_pod_names()
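
To make the name-based lookup concrete, here is a small sketch that splits an events dictionary into Job events and pod events. It assumes the un-prefixed key format described in this comment. The `split_job_and_pod_events` helper is hypothetical, and the pod names are hard-coded here rather than fetched from the client.

```python
from typing import Dict, List, Tuple


def split_job_and_pod_events(
    events: Dict[str, List[str]],
    job_name: str,
    pod_names: List[str],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Separate the Job's own events from the events of its pods by name."""
    job_events = events.get(job_name, [])
    pod_events = {name: events[name] for name in pod_names if name in events}
    return job_events, pod_events


# Hypothetical data for a PyTorchJob named train-mnist with 2 workers.
events = {
    "train-mnist": ["2024-01-05 22:58:20 Created pod: train-mnist-worker-0"],
    "train-mnist-worker-0": ["2024-01-05 22:58:21 Started container pytorch"],
    "train-mnist-worker-1": ["2024-01-05 22:58:21 Started container pytorch"],
}

job_events, pod_events = split_job_and_pod_events(
    events,
    job_name="train-mnist",
    pod_names=["train-mnist-worker-0", "train-mnist-worker-1"],
)
print(job_events)
print(sorted(pod_events))
```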

Do you think we should add a prefix that identifies the type of the object?
e.g.

{"pytorchjob-train-mnist": "Events", "pod-train-mnist-worker-0": "Events", "pod-train-mnist-worker-1": "Events"}

cc @droctothorpe @tenzen-y

and pod.status.phase != constants.POD_PHASE_PENDING
):
log_streams.append(
watch.Watch().stream(
Contributor

Neat! Didn't know that the Python client supported this.

@@ -861,27 +924,58 @@ def get_job_logs(
break

# Print logs to the StdOut
-print(f"[Pod {pods[index]}]: {logline}")
+print(f"[Pod {pods[index].metadata.name}]: {logline}")
Contributor

Thoughts on colorizing by pod name? Probably overkill / can be a separate PR.

Member Author

@droctothorpe Can you elaborate here? What are your ideas on printing logs for pods?

Contributor

Here is an example of output from https://github.com/stern/stern.

image

Each pod gets a distinct color, which helps parse a giant wall of interleaved logs. Happy to tackle it in a discrete PR.
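
A rough sketch of the idea, not a proposal for the actual implementation: each pod name is assigned the next color from a small ANSI palette the first time it is seen, and that color is reused for every later line from the same pod. The `PodColorizer` class and the palette are hypothetical. ANSI escape codes may render differently, or not at all, depending on where the output is displayed.

```python
from itertools import cycle
from typing import Dict

# Basic ANSI foreground colors: red, green, yellow, blue, magenta, cyan.
ANSI_COLORS = ["\033[31m", "\033[32m", "\033[33m", "\033[34m", "\033[35m", "\033[36m"]
ANSI_RESET = "\033[0m"


class PodColorizer:
    """Assign each pod name a stable color and prefix its log lines with it."""

    def __init__(self) -> None:
        self._palette = cycle(ANSI_COLORS)
        self._assigned: Dict[str, str] = {}

    def format(self, pod_name: str, logline: str) -> str:
        # The first line from a pod picks the next palette color; later lines reuse it.
        if pod_name not in self._assigned:
            self._assigned[pod_name] = next(self._palette)
        color = self._assigned[pod_name]
        return f"{color}[Pod {pod_name}]{ANSI_RESET}: {logline}"


colorizer = PodColorizer()
first = colorizer.format("train-mnist-worker-0", "epoch 1 started")
second = colorizer.format("train-mnist-worker-1", "epoch 1 started")
third = colorizer.format("train-mnist-worker-0", "epoch 1 finished")
for line in (first, second, third):
    print(line)
```

With more pods than palette entries the colors repeat, so a larger palette or a different distinguishing scheme would be needed for big jobs.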

Member Author

I see, thanks for sharing.
Can you create an issue to discuss this?
We might need to discuss how the output will render in the different environments where this runs (e.g. Jupyter Notebook, VSCode, local terminal).

@andreyvelich andreyvelich force-pushed the issue-1863-sdk-job-events branch from e35a25b to 54b6269 Compare January 10, 2024 20:28
@andreyvelich
Member Author

@droctothorpe @johnugeorge I added the object kind to the event key for now, so it is easier to parse data from the event messages:

{
  "pytorchjob test-kubeflow": [
    "2024-01-05 22:58:20 Created pod: test-kubeflow-worker-0"
  ],
  "pod test-kubeflow-worker-0": [
    "2024-01-05 22:58:20 Successfully assigned kubeflow-andrey/test-kubeflow-worker-0 to ...."
  ]
}
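
For illustration, here is one way keys of this shape could be built from raw events. This is a sketch under assumptions, not the code merged in this PR: `group_events` and `fake_event` are hypothetical, and `SimpleNamespace` objects stand in for the event objects returned by `list_namespaced_event`, exposing only the `involved_object.kind`, `involved_object.name`, `message`, and `metadata.creation_timestamp` fields.

```python
from collections import defaultdict
from datetime import datetime
from types import SimpleNamespace
from typing import Dict, Iterable, List


def group_events(events: Iterable, job_name: str, pod_names: List[str]) -> Dict[str, List[str]]:
    """Group events for a Job and its pods under keys of the form 'kind name'."""
    wanted = {job_name, *pod_names}
    grouped = defaultdict(list)
    for event in events:
        ref = event.involved_object
        # Skip events for objects that do not belong to this Job.
        if ref.name not in wanted:
            continue
        # Assumption: creation_timestamp is a datetime, as in the Kubernetes client.
        stamp = event.metadata.creation_timestamp.strftime("%Y-%m-%d %H:%M:%S")
        grouped[f"{ref.kind.lower()} {ref.name}"].append(f"{stamp} {event.message}")
    return dict(grouped)


def fake_event(kind: str, name: str, message: str, stamp: datetime) -> SimpleNamespace:
    """Build a stand-in event carrying only the fields used above."""
    return SimpleNamespace(
        involved_object=SimpleNamespace(kind=kind, name=name),
        message=message,
        metadata=SimpleNamespace(creation_timestamp=stamp),
    )


sample = [
    fake_event("PyTorchJob", "test-kubeflow", "Created pod: test-kubeflow-worker-0",
               datetime(2024, 1, 5, 22, 58, 20)),
    fake_event("Pod", "test-kubeflow-worker-0", "Started container pytorch",
               datetime(2024, 1, 5, 22, 58, 21)),
    fake_event("Pod", "unrelated-pod", "Started container nginx",
               datetime(2024, 1, 5, 22, 58, 22)),
]

grouped = group_events(sample, "test-kubeflow", ["test-kubeflow-worker-0"])
print(grouped)
```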

@johnugeorge
Member

/lgtm
Thanks @andreyvelich

@google-oss-prow google-oss-prow bot added the lgtm label Jan 11, 2024
@google-oss-prow google-oss-prow bot merged commit 778f555 into kubeflow:master Jan 11, 2024
31 checks passed
@andreyvelich andreyvelich deleted the issue-1863-sdk-job-events branch January 11, 2024 16:42
Development

Successfully merging this pull request may close these issues.

[SDK] Get Job Pods Events
5 participants