
Move perf-dash to community owned infra #549

Closed
BenTheElder opened this issue Jan 27, 2020 · 41 comments
Labels
area/infra Infrastructure management, infrastructure design, code in infra/ sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@BenTheElder
Member

http://perf-dash.k8s.io/ is running on the "k8s-mungegithub" cluster in an old Google-internal GCP project. This cluster still uses kube-lego and, as far as I can tell, is not actively maintained. We should move it to community-managed infra (or turn it down if nobody is going to maintain it).

cc @krzysied

@bartsmykla
Contributor

bartsmykla commented Jan 29, 2020

If people who have access to the current cluster can help, I can take care of the move and own this ticket.
/assign

@krzysied

Adding scalability folks.
/cc @wojtek-t
/cc @mm4tt

@mm4tt
Contributor

mm4tt commented Jan 29, 2020

I'm not sure whether the k8s-mungegithub cluster is maintained, but we definitely maintain and use perf-dash on a regular basis.

We do have access to the current cluster and can help with providing all the info you need. The only requirement we have is for the sig-scalability folks to have the ability to view perf-dash logs and be able to deploy new versions of perf-dash whenever we need it.

@BenTheElder
Member Author

@mm4tt that would be greatly appreciated! Ideally we can move things over to a CNCF cluster where the sig-scale community at large can be granted access as needed, and we can spin down the google.com cluster/project.

@mm4tt
Contributor

mm4tt commented Feb 5, 2020

Sure thing.

It looks like deploying perf-dash is as simple as deploying this deployment and service:
https://github.com/kubernetes/perf-tests/blob/master/perfdash/deployment.yaml
https://github.com/kubernetes/perf-tests/blob/master/perfdash/perfdash-service.yaml

On top of that, we have the perf-dash.k8s.io domain configured to point to the external IP address of the perf-dash service. I have no knowledge of how the domain itself is configured, though.

Let me know if there is anything else you need from us.
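For orientation, the service side of that setup has roughly the following shape. This is only a sketch: the object name, selector, and ports are assumptions, and the linked perfdash-service.yaml is authoritative. The important detail is that the Service is of type LoadBalancer, which is what provides the external IP that perf-dash.k8s.io points at.

```yaml
# Sketch only -- see perfdash-service.yaml in kubernetes/perf-tests for the real manifest.
apiVersion: v1
kind: Service
metadata:
  name: perfdash        # name assumed for illustration
spec:
  type: LoadBalancer    # exposes an external IP that DNS points at
  selector:
    app: perfdash       # selector assumed for illustration
  ports:
    - port: 80
      targetPort: 8080  # ports assumed for illustration
```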

@ameukam
Member

ameukam commented Feb 5, 2020

@bartsmykla
Contributor

I'm gonna start some work in that area today. :-) If I need anything I'll ping you, @mm4tt.

@bartsmykla
Contributor

I have started moving perf-dash to the aaa cluster, but I hit a problem: aaa currently doesn't have enough resources to satisfy these requests: https://github.com/kubernetes/perf-tests/blob/master/perfdash/deployment.yaml#L39-L44. Once I confirm they are really needed to run the tool, I'll start a conversation about adding nodes that can provide enough resources.
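For context, the linked lines are the container's resource requests. They look roughly like the sketch below; the memory figure comes from the 8GB discussion later in this thread, the CPU value is an assumption, and the file itself is authoritative.

```yaml
# Sketch only -- the linked deployment.yaml lines are authoritative.
resources:
  requests:
    cpu: "1"        # assumed for illustration
    memory: 8Gi     # matches the ~8GB of RAM discussed below
```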

@bartsmykla
Contributor

bartsmykla commented Mar 6, 2020

I discussed this with @mm4tt, and it looks like these resources are genuinely needed by the project; it's currently running on a single n1-highmem-2 node.
@dims @thockin what is the current process for requesting a new node to be added to our current aaa pool?

/assign @dims @thockin

@bartsmykla
Contributor

Also, @BenTheElder, I'm digging into how our DNS works, trying to figure out how to add another subdomain that will point to aaa.
Should it be added like gcsweb, i.e. create another ingress (which creates a load balancer) and then point the subdomain at that load balancer's IP? That seems a bit unnecessary and maybe a bit too "static"?

/assign @BenTheElder

@thockin
Member

thockin commented Mar 6, 2020 via email

@thockin
Member

thockin commented Mar 6, 2020 via email

@BenTheElder
Member Author

@bartsmykla our DNS supports standard records (I just checked the config here): we have CNAME, A, etc.

Should it be added like gcsweb, i.e. create another ingress (which creates a load balancer) and then point the subdomain at that load balancer's IP? That seems a bit unnecessary and maybe a bit too "static"?

I don't know how prescriptive we want to be about this exactly. Static load-balancer IPs have worked fine for our existing infra, though. Why would we expect it to be more "dynamic"?
(Especially note that the DNS pointing to that IP is a yaml config PR away, so if we switch it to something else later it's no big deal...)
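For what it's worth, "a yaml config PR" means adding a record to the k8s.io zone config in kubernetes/k8s.io. A hypothetical entry would look roughly like this (octodns-style; the record name, the placeholder IP, and the exact schema should follow the existing records in that repo):

```yaml
# Hypothetical record -- copy the format of the existing entries in the zone config.
perf-dash:
  type: A
  value: 203.0.113.10   # placeholder documentation IP; use the ingress / load balancer's static IP
```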

@thockin
Member

thockin commented Mar 7, 2020 via email

@bartsmykla
Contributor

@BenTheElder @thockin got it, thank you for the info; I'm gonna proceed with that approach! :-)

@bartsmykla
Contributor

Also @mm4tt, can you give your opinion about why it needs 8GB of RAM?

/assign @mm4tt
/unassign @BenTheElder

@wojtek-t
Member

wojtek-t commented Mar 9, 2020

I know that the memory footprint is a bit unexpected, but it's expected behavior, as perf-dash stores in memory all the data that it serves.
So basically how it works is:

  • for each job it supports and for each run in the last N days/weeks, it loads a number of metrics for this run from GCS and stores them in memory [that is very time consuming - initialization takes ~15-30 minutes]
  • when it serves a request, it simply serves it from memory

And all the metrics that we have, across all jobs, sum up to a lot of data.

@bartsmykla
Contributor

bartsmykla commented Mar 10, 2020

Thank you for weighing in here, @wojtek-t.

My suggestion would be to add the appropriate node to our cluster for now, and then take the discussion of whether we can and/or should improve the application itself elsewhere.

@thockin @dims wdyt?

@dims
Member

dims commented Mar 10, 2020

+1 to add an appropriate node to our cluster. if i remember right, we used terraform to stand up this cluster, so the changes would have to be done there.

@bartsmykla
Contributor

bartsmykla commented Mar 10, 2020

Thank you for your opinion, @dims. I'm gonna wait for @thockin and then I'll create a proper PR. :-)

@thockin
Member

thockin commented Mar 13, 2020

Sorry to be a pain in the rear.

@wojtek-t what sort of traffic is this serving to justify being entirely in memory? If we added n1-standard-4 VMs that's $30/mo each. We could add n1-highmem-2 for a bit less but may be less useful.

In other words, if that had to serve from disk, what bad things would happen?

@thockin
Member

thockin commented Mar 13, 2020

I am fine with adding a second pool of n1-standard-4; I just don't want to be wasteful.

bartsmykla pushed a commit to bartsmykla/k8s.io that referenced this issue Apr 6, 2020
Applies to: kubernetes#549

Signed-off-by: Bart Smykla <bsmykla@vmware.com>
@bartsmykla
Contributor

I have created a PR with the next steps in its description, so feel free to review it :-)

bartsmykla pushed a commit to bartsmykla/k8s.io that referenced this issue Apr 7, 2020
Applies to: kubernetes#549

Signed-off-by: Bart Smykla <bsmykla@vmware.com>
bartsmykla pushed a commit to bartsmykla/k8s.io that referenced this issue Apr 8, 2020
Applies to: kubernetes#549

Signed-off-by: Bart Smykla <bsmykla@vmware.com>
@bartsmykla
Contributor

As a follow-up for people who aren't following PR #721: we have perf-dash-canary.k8s.io running, but there are some issues with people from the k8s-infra-rbac-perfdash@kubernetes.io group accessing the cluster. Once that is solved and we confirm the data on perf-dash-canary.k8s.io and perf-dash.k8s.io are equivalent, we'll point the perf-dash.k8s.io subdomain at the new cluster and will be able to consider this task done. :-)

@spiffxp
Member

spiffxp commented Apr 15, 2020

/area cluster-infra

@k8s-ci-robot k8s-ci-robot added the area/infra Infrastructure management, infrastructure design, code in infra/ label Apr 15, 2020
@spiffxp spiffxp added this to Needs Triage in sig-k8s-infra via automation Apr 15, 2020
@spiffxp spiffxp moved this from Needs Triage to In Progress in sig-k8s-infra Apr 15, 2020
@spiffxp
Member

spiffxp commented Apr 15, 2020

/unassign @dims
@thockin will work with @mm4tt to resolve why they're unable to access the perfdash namespace in the aaa cluster
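For context, namespace access for a Google group like k8s-infra-rbac-perfdash@kubernetes.io is typically granted with an RBAC binding along the following lines. This is only a sketch: the binding name and the choice of the built-in edit ClusterRole are assumptions, and the real wiring lives in kubernetes/k8s.io.

```yaml
# Sketch of group-based access to the perfdash namespace; names/roles are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: perfdash-rbac       # assumed name
  namespace: perfdash
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit                # built-in ClusterRole; assumed choice
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: k8s-infra-rbac-perfdash@kubernetes.io
```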

@spiffxp
Member

spiffxp commented Apr 15, 2020

/sig scalability

@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Apr 15, 2020
@bartsmykla
Contributor

I think I found what was causing the issue with accessing the namespaces! [#758]

@bartsmykla
Contributor

@mm4tt as #770 is merged and I think @thockin has reconciled the groups, can you confirm you have access to the namespace now?

@mm4tt
Contributor

mm4tt commented Apr 21, 2020

I confirm, I have access and everything seems to be working as it should. We can proceed with pointing perf-dash.k8s.io to the new cluster.
Thanks!

@bartsmykla
Contributor

bartsmykla commented Apr 21, 2020

My plan right now is:

  1. Switching perf-dash.k8s.io to point to the aaa cluster + adding the new subdomain to the perf-dash ingress
  2. Waiting for confirmation that everything works fine after the change
  3. Removing the now-unnecessary perf-dash-canary.k8s.io record
  4. Creating a PR that changes the service type in the perf-dash repository from LoadBalancer to NodePort (see the sketch after this list)
  5. Removing the part of the deployment instructions in our repository that did this manually
  6. Celebrating a small success 🎉
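Regarding step 4, the manifest change itself is small; a sketch of the relevant part of the perfdash Service spec (field names assumed, the manifest in kubernetes/perf-tests is authoritative):

```yaml
# Sketch of the step 4 change in the perfdash Service manifest.
spec:
  type: NodePort    # was: LoadBalancer -- the shared ingress now fronts the service
```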

@bartsmykla
Contributor

I think it's time for celebration, as we managed to get all the steps done! :-)

/close

@k8s-ci-robot
Contributor

@bartsmykla: Closing this issue.

In response to this:

I think it's time for celebration, as we managed to get all the steps done! :-)

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sig-k8s-infra automation moved this from In Progress to Done Apr 22, 2020
@BenTheElder
Member Author

BenTheElder commented Apr 22, 2020 via email

@mm4tt
Contributor

mm4tt commented Apr 22, 2020

Thanks, @bartsmykla !!!
