
Move perf-dash to community owned infra #549

Closed
BenTheElder opened this issue Jan 27, 2020 · 41 comments
Labels
area/infra Infrastructure management, infrastructure design, code in infra/ sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@BenTheElder
Member

http://perf-dash.k8s.io/ is running on the "k8s-mungegithub" cluster in an old Google-internal GCP project. This cluster still uses kube-lego and, as far as I can tell, is not actively maintained. We should move it to community-managed infra (or turn it down if nobody is going to maintain it).

cc @krzysied

@bartsmykla
Contributor

bartsmykla commented Jan 29, 2020

If people who have access to the current cluster can help, I can take care of the move and own this ticket.
/assign

@krzysied

Adding scalability folks.
/cc @wojtek-t
/cc @mm4tt

@mm4tt
Contributor

mm4tt commented Jan 29, 2020

I'm not sure whether the k8s-mungegithub cluster is maintained, but we definitely maintain and use perf-dash on a regular basis.

We do have access to the current cluster and can help with providing all the info you need. The only requirement we have is for the sig-scalability folks to have the ability to view perf-dash logs and be able to deploy new versions of perf-dash whenever we need it.

@BenTheElder
Member Author

@mm4tt that would be greatly appreciated! Ideally we can move things over to a CNCF cluster where the sig-scale community at large can be granted access as needed, and we can spin down the google.com cluster/project.

@mm4tt
Contributor

mm4tt commented Feb 5, 2020

Sure thing.

It looks like deploying perf-dash is as simple as deploying this deployment and service:
https://github.com/kubernetes/perf-tests/blob/master/perfdash/deployment.yaml
https://github.com/kubernetes/perf-tests/blob/master/perfdash/perfdash-service.yaml

On top of that, we have the perf-dash.k8s.io domain configured to point to the external IP address of the perf-dash service. I have no knowledge of how the domain itself is configured, though.

Let me know if there is anything else you need from us.
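For orientation, the service side of that setup has roughly the following shape. This is only a sketch: the object name, selector, and ports are assumptions, and the linked perfdash-service.yaml is authoritative. The important detail is that the Service is of type LoadBalancer, which is what provides the external IP that perf-dash.k8s.io points at.

```yaml
# Sketch only -- see perfdash-service.yaml in kubernetes/perf-tests for the real manifest.
apiVersion: v1
kind: Service
metadata:
  name: perfdash        # name assumed for illustration
spec:
  type: LoadBalancer    # exposes an external IP that DNS points at
  selector:
    app: perfdash       # selector assumed for illustration
  ports:
    - port: 80
      targetPort: 8080  # ports assumed for illustration
```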

@ameukam
Member

ameukam commented Feb 5, 2020

@bartsmykla
Contributor

I'm gonna start some work in that area today. :-) If I need anything I'll ping you, @mm4tt.

@bartsmykla
Contributor

I have started moving perf-dash to the aaa cluster, but I hit a problem: aaa currently doesn't have enough resources to satisfy these requests: https://github.com/kubernetes/perf-tests/blob/master/perfdash/deployment.yaml#L39-L44. Once I confirm they are really needed to run the tool, I'll start a conversation about adding nodes that can provide enough resources.
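For context, the linked lines are the container's resource requests. They look roughly like the sketch below; the memory figure comes from the 8GB discussion later in this thread, the CPU value is an assumption, and the file itself is authoritative.

```yaml
# Sketch only -- the linked deployment.yaml lines are authoritative.
resources:
  requests:
    cpu: "1"        # assumed for illustration
    memory: 8Gi     # matches the ~8GB of RAM discussed below
```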

@bartsmykla
Contributor

bartsmykla commented Mar 6, 2020

I discussed this with @mm4tt, and it looks like these resources are genuinely needed by the project; it's currently running on a single n1-highmem-2 node.
@dims @thockin what is the current process for requesting a new node to be added to our current aaa pool?

/assign @dims @thockin

@bartsmykla
Contributor

Also, @BenTheElder, I'm digging into how our DNS works, trying to figure out how to add another subdomain that will point to aaa.
Should it be added like gcsweb, i.e. create another ingress (which creates a load balancer) and then point the subdomain at that load balancer's IP? That seems a bit unnecessary and maybe a bit too "static"?

/assign @BenTheElder

@thockin
Member

thockin commented Mar 6, 2020 via email

@thockin
Member

thockin commented Mar 6, 2020 via email

@BenTheElder
Member Author

@bartsmykla our DNS supports standard records (I just checked the config here): we have CNAME, A, etc.

Should it be added like gcsweb, i.e. create another ingress (which creates a load balancer) and then point the subdomain at that load balancer's IP? That seems a bit unnecessary and maybe a bit too "static"?

I don't know how prescriptive we want to be about this exactly. Static load-balancer IPs have worked fine for our existing infra, though. Why would we expect it to be more "dynamic"?
(Especially note that the DNS pointing to that IP is a yaml config PR away, so if we switch it to something else later it's no big deal...)
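For what it's worth, "a yaml config PR" means adding a record to the k8s.io zone config in kubernetes/k8s.io. A hypothetical entry would look roughly like this (octodns-style; the record name, the placeholder IP, and the exact schema should follow the existing records in that repo):

```yaml
# Hypothetical record -- copy the format of the existing entries in the zone config.
perf-dash:
  type: A
  value: 203.0.113.10   # placeholder documentation IP; use the ingress / load balancer's static IP
```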

@thockin
Member

thockin commented Mar 7, 2020 via email

@bartsmykla
Contributor

@BenTheElder @thockin got it, thank you for the info; I'm gonna proceed with that approach! :-)

@bartsmykla
Contributor

Also @mm4tt, can you give your opinion about why it needs 8GB of RAM?

/assign @mm4tt
/unassign @BenTheElder

@wojtek-t
Member

wojtek-t commented Mar 9, 2020

I know that the memory footprint is a bit unexpected, but it's expected behavior, as perf-dash stores in memory all the data that it serves.
So basically how it works is:

  • for each job it supports and for each run in the last N days/weeks, it loads a number of metrics for this run from GCS and stores them in memory [that is very time consuming - initialization takes ~15-30 minutes]
  • when it serves a request, it simply serves it from memory

And all the metrics that we have, across all jobs, sum up to a lot of data.

@bartsmykla
Contributor

bartsmykla commented Mar 10, 2020

Thank you for weighing in here, @wojtek-t.

My suggestion would be to add the appropriate node to our cluster for now, and then take the discussion of whether we can and/or should improve the application itself elsewhere.

@thockin @dims wdyt?

@dims
Member

dims commented Mar 10, 2020

+1 to add an appropriate node to our cluster. if i remember right, we used terraform to stand up this cluster, so the changes would have to be done there.

@bartsmykla
Contributor

bartsmykla commented Mar 10, 2020

Thank you for your opinion, @dims. I'm gonna wait for @thockin and then I'll create a proper PR. :-)

@thockin
Member

thockin commented Mar 13, 2020

Sorry to be a pain in the rear.

@wojtek-t what sort of traffic is this serving to justify being entirely in memory? If we added n1-standard-4 VMs that's $30/mo each. We could add n1-highmem-2 for a bit less but may be less useful.

In other words, if that had to serve from disk, what bad things would happen?

@thockin
Member

thockin commented Mar 13, 2020

I am fine with adding a second pool of n1-standard-4; I just don't want to be wasteful.

bartsmykla pushed a commit to bartsmykla/k8s.io that referenced this issue Apr 6, 2020
Applies to: kubernetes#549

Signed-off-by: Bart Smykla <bsmykla@vmware.com>
@bartsmykla
Contributor

I have created a PR with the next steps in its description, so feel free to review it :-)

bartsmykla pushed a commit to bartsmykla/k8s.io that referenced this issue Apr 7, 2020
Applies to: kubernetes#549

Signed-off-by: Bart Smykla <bsmykla@vmware.com>
bartsmykla pushed a commit to bartsmykla/k8s.io that referenced this issue Apr 8, 2020
Applies to: kubernetes#549

Signed-off-by: Bart Smykla <bsmykla@vmware.com>
@bartsmykla
Contributor

As a follow-up for people who aren't following PR #721: we have perf-dash-canary.k8s.io running, but there are some issues with people from the k8s-infra-rbac-perfdash@kubernetes.io group accessing the cluster. Once that is solved and we confirm the data on perf-dash-canary.k8s.io and perf-dash.k8s.io are equivalent, we'll point the perf-dash.k8s.io subdomain at the new cluster and will be able to consider this task done. :-)

@spiffxp
Member

spiffxp commented Apr 15, 2020

/area cluster-infra

@k8s-ci-robot k8s-ci-robot added the area/infra Infrastructure management, infrastructure design, code in infra/ label Apr 15, 2020
@spiffxp spiffxp added this to Needs Triage in sig-k8s-infra via automation Apr 15, 2020
@spiffxp spiffxp moved this from Needs Triage to In Progress in sig-k8s-infra Apr 15, 2020
@spiffxp
Member

spiffxp commented Apr 15, 2020

/unassign @dims
@thockin will work with @mm4tt to resolve why they're unable to access the perfdash namespace in the aaa cluster
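For context, namespace access for a Google group like k8s-infra-rbac-perfdash@kubernetes.io is typically granted with an RBAC binding along the following lines. This is only a sketch: the binding name and the choice of the built-in edit ClusterRole are assumptions, and the real wiring lives in kubernetes/k8s.io.

```yaml
# Sketch of group-based access to the perfdash namespace; names/roles are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: perfdash-rbac       # assumed name
  namespace: perfdash
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit                # built-in ClusterRole; assumed choice
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: k8s-infra-rbac-perfdash@kubernetes.io
```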

@spiffxp
Member

spiffxp commented Apr 15, 2020

/sig scalability

@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Apr 15, 2020
@bartsmykla
Contributor

I think I found what was causing the issue with accessing the namespaces! [#758]

@bartsmykla
Contributor

@mm4tt as #770 is merged and I think @thockin has reconciled the groups, can you confirm you have access to the namespace now?

@mm4tt
Contributor

mm4tt commented Apr 21, 2020

I confirm, I have access and everything seems to be working as it should. We can proceed with pointing perf-dash.k8s.io to the new cluster.
Thanks!

@bartsmykla
Contributor

bartsmykla commented Apr 21, 2020

My plan right now is:

  1. Switching perf-dash.k8s.io to point to the aaa cluster + adding the new subdomain to the perf-dash ingress
  2. Waiting for confirmation that everything works fine after the change
  3. Removing the now-unnecessary perf-dash-canary.k8s.io record
  4. Creating a PR that changes the service type in the perf-dash repository from LoadBalancer to NodePort (see the sketch after this list)
  5. Removing the part of the deployment instructions in our repository that did this manually
  6. Celebrating a small success 🎉
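Regarding step 4, the manifest change itself is small; a sketch of the relevant part of the perfdash Service spec (field names assumed, the manifest in kubernetes/perf-tests is authoritative):

```yaml
# Sketch of the step 4 change in the perfdash Service manifest.
spec:
  type: NodePort    # was: LoadBalancer -- the shared ingress now fronts the service
```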

@bartsmykla
Contributor

I think it's time for celebration, as we managed to get all the steps done! :-)

/close

@k8s-ci-robot
Contributor

@bartsmykla: Closing this issue.

In response to this:

I think it's time for celebration, as we managed to get all the steps done! :-)

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sig-k8s-infra automation moved this from In Progress to Done Apr 22, 2020
@BenTheElder
Member Author

BenTheElder commented Apr 22, 2020 via email

@mm4tt
Contributor

mm4tt commented Apr 22, 2020

Thanks, @bartsmykla !!!
