Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix main process retrieve logic for early stopping #1988

Conversation

shaowei-su
Copy link
Contributor

@shaowei-su shaowei-su commented Oct 28, 2022

What this PR does / why we need it:
The GetMainProcesses function expects the directory for completed marker, e.g. /var/log/katib:
https://github.com/kubeflow/katib/blob/master/pkg/metricscollector/v1beta1/common/pns.go#L94
The current code will pass the entire log file i.e /var/log/katib/metrics.log. When there is only single container, this code path will work since if will assign the first non-0 pid as main process. However, when multiple sidecar containers presented this method will break.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

@coveralls
Copy link

coveralls commented Oct 28, 2022

Coverage Status

Coverage decreased (-0.02%) to 73.522% when pulling b3b4cc9 on shaowei-su:shaowei--fix-main-process-retrieve into 6b55540 on kubeflow:master.

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shaowei-su Thanks for creating this!

I remember that we don't support multiple sidecar containers (multiple processes), now.

// TODO (andreyvelich): Currently support only single child process.

What do you think about this? @andreyvelich

@shaowei-su
Copy link
Contributor Author

Hi @tenzen-y , in the deployment env I'm working on, the main process will take care of closing other sidecars.
This PR is mainly to address the bug regarding how to locate main process (you can find more info here in the comment section: https://github.com/kubeflow/katib/blob/master/pkg/metricscollector/v1beta1/common/pns.go#L94)

@tenzen-y
Copy link
Member

tenzen-y commented Nov 2, 2022

Hi @tenzen-y , in the deployment env I'm working on, the main process will take care of closing other sidecars. This PR is mainly to address the bug regarding how to locate main process (you can find more info here in the comment section: https://github.com/kubeflow/katib/blob/master/pkg/metricscollector/v1beta1/common/pns.go#L94)

@shaowei-su
I see. Thanks for clarifying!
Also, can you rebase this PR since we fixed Charmed Katib / Test?

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@tenzen-y
Copy link
Member

tenzen-y commented Nov 2, 2022

/lgtm cancel

@google-oss-prow google-oss-prow bot removed the lgtm label Nov 2, 2022
@shaowei-su shaowei-su force-pushed the shaowei--fix-main-process-retrieve branch from afe9527 to b3b4cc9 Compare November 2, 2022 16:15
@shaowei-su
Copy link
Contributor Author

Thanks @tenzen-y, PR rebased, PTAL!

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shaowei-su Thanks for the updates!
/lgtm

/cc @andreyvelich @johnugeorge

@tenzen-y
Copy link
Member

tenzen-y commented Nov 3, 2022

/assign @andreyvelich @johnugeorge

@shaowei-su
Copy link
Contributor Author

Thanks for the stamp @tenzen-y ! can we merge in this change now?

@tenzen-y
Copy link
Member

tenzen-y commented Nov 8, 2022

Thanks for the stamp @tenzen-y ! can we merge in this change now?

This PR will be merged after the approvers has reviewed the changes.

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for notice this @shaowei-su!
/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, shaowei-su

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit d8fbe6e into kubeflow:master Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants