Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example with collecting timestamp of the metrics #970

Merged

Conversation

andreyvelich
Copy link
Member

@andreyvelich andreyvelich commented Dec 12, 2019

Fixes: #944.
Hold until I change docker hub repository.
/hold


This change is Reviewable

spec:
containers:
- name: {{.Trial}}
image: docker.io/andreyvelichkevich/timestamp-metric
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had added you into kubeflowkatib, you can now push kubeflowkatib/timestamp-metric now
BTW, I think you can rename the image name to kubeflowkatib/mxnet-mnist-example to make all example has timestamp.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to rename folder and file name also to metrics-with-timestamp to mxnet-mnist-example. There is no need to highlight the termwith-timestamp

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a mxnet-mnist-example.
I took code of pytorch-mnist from here: https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/file-metrics-collector/mnist.py.

Should I change example to Mxnet in that case? Mxnet example is that one: https://github.com/apache/incubator-mxnet/blob/master/example/image-classification/train_mnist.py.
Or we can name this example kubeflowkatib/pytorch-mnist-example?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed it. IIRC @richardsliu created this image to show mxnet support.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnugeorge So what can we do with this example?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richardsliu Can you point out at the Dockerfile used?

@@ -0,0 +1,6 @@
FROM pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime

WORKDIR /var
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe /var is appropriate location for the script. If you check FSH (https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard) /var should be used for changing files.

I would suggest using /opt and ideally a subdir in there - e.g. /opt/mnist/

WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. What do you think about /app?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason you do not want to go with existing directory structure in /? I don't think it is a good practice in linux systems to put code in "random" dirs in /, but feel free to convince me otherwise:)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean in /examples/v1alpha3/metrics-with-timestamp ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean in /examples/v1alpha3/metrics-with-timestamp ?

Is this answer to @johnugeorge'S question?:-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason you do not want to go with existing directory structure in /? I don't think it is a good practice in linux systems to put code in "random" dirs in /, but feel free to convince me otherwise:)

To this question :)
Which WORKDIR you think will be better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry:) as I suggested in my first comment - I would go for /opt/mnist/mnist.py or to make it extremely obvious what it is (although that is already captured in the image name, but you never know..) /opt/metrics-with-timestamp/mnist.py or something along those lines

Copy link
Member Author

@andreyvelich andreyvelich Dec 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnugeorge @hougangliu What do you think about WORKDIR for the example as /opt/metrics-with-timestamp/mnist.py ?

@andreyvelich
Copy link
Member Author

andreyvelich commented Jan 8, 2020

I changed example from pytorch to mxnet.
I think that would be more useful for the Katib user.
Also, I use kubeflowkatib docker repository now.

I named it mxnet-mnist-timestamp, but we can think about better name.

/cc @johnugeorge @vpavlin
/hold cancel

@k8s-ci-robot
Copy link

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: vpavlin.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

I changed example from pytorch to mxnet.
I think that would be more useful for the Katib user.
Also, I use kubeflowkatib docker repository now.
/cc @johnugeorge @vpavlin
/hold cancel

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andreyvelich
Copy link
Member Author

I changed mxnet-mnist-timestamp image to mxnet-mnist as we discussed here: #1008 (comment).
I think it will be better to name it mxnet-mnist than mxnet-mnist-example.
In case that some users use old yaml files for their Katib experiments with mxnet-mnist-example image and, if we push new image to mxnet-mnist-example, it will brake current Katib experiments.

I renamed images in according yaml files, please take a look.

/cc @johnugeorge @hougangliu @gaocegege

@andreyvelich
Copy link
Member Author

I changed CI cluster num nodes to 8.
It decreased CI test time ~ 30%.
What do you think about this change @johnugeorge @hougangliu @gaocegege ?

@@ -33,6 +33,9 @@ echo "Creating GPU cluster"
gcloud --project ${PROJECT} beta container clusters create ${CLUSTER_NAME} \
--zone ${ZONE} \
--machine-type=n1-standard-8 \
--num-nodes=8 \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be changed since overall resources are fixed

@johnugeorge
Copy link
Member

/lgtm

@johnugeorge
Copy link
Member

/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 2a8e8b4 into kubeflow:master Jan 14, 2020
@andreyvelich andreyvelich deleted the issue-944-timestamp-metric-example branch October 6, 2021 00:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MetricsCollector does not parse time of the metric
5 participants