
build-scaleway-x64-ubuntu-16-04-2 "installer" machine is build pipeline bottleneck #952

Closed
andrew-m-leonard opened this issue Oct 7, 2019 · 30 comments

@andrew-m-leonard
Contributor

Nightly builds are getting bottlenecked on this machine because it is the only machine capable of running the installer job, due to the required GPG keys. It does not help that some very long-running jobs also run on it, e.g. Dockerfile builds taking 4+ hours and the openjdk_build_docker_multiarch docker & x64 jobs, which take 5-6 hours every day!

@sxa sxa added the enhancement label Oct 7, 2019
@sxa sxa self-assigned this Oct 7, 2019
@karianna karianna added bug and removed enhancement labels Oct 15, 2019
@sxa
Member

sxa commented Oct 29, 2019

We need more machines capable of performing this function - one machine shared across docker and installer work is not appropriate.

For the docker work, only docker is required on the machine, so that should be easy to offload.

@sxa sxa modified the milestones: November 2019, October 2019 Oct 29, 2019
@sxa
Member

sxa commented Oct 29, 2019

System being created - docker-godaddy-ubuntu1604-x64-1 - to offload this work

@sxa
Member

sxa commented Oct 29, 2019

Docker containers are struggling to connect to external systems, therefore this is not currently working on the machine ...

@sxa
Member

sxa commented Oct 29, 2019

@cmdc0de Do you know why GoDaddy-provisioned machines appear to have issues with external connectivity from docker containers? We've seen this in issue 721 as well.

@Haroon-Khel
Contributor

A proposed fix is to spin up the Ubuntu image by running docker run --network=host -it ubuntu. The docker container should then be able to connect to external systems without error.
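A quick way to sanity-check that workaround (a sketch only; apt-get update just stands in for any outbound network access):

# Default bridge networking - this is where containers on the affected machines fail to reach external systems
docker run --rm ubuntu bash -c "apt-get update"

# Proposed workaround - share the host's network stack instead
docker run --rm --network=host ubuntu bash -c "apt-get update"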

@sxa sxa modified the milestones: October 2019, November 2019 Nov 1, 2019
@sxa sxa pinned this issue Nov 1, 2019
@sxa
Member

sxa commented Nov 16, 2019

Considering switching away from godaddy for this purpose since I'd rather have a provider that works out of the box - will investigate...

@sxa sxa modified the milestones: November 2019, December 2019 Nov 29, 2019
@sxa sxa unpinned this issue Dec 12, 2019
@sxa sxa modified the milestones: December 2019, January 2020 Dec 31, 2019
@karianna karianna modified the milestones: January 2020, February 2020 Feb 3, 2020
@sxa
Member

sxa commented Feb 25, 2020

@Haroon-Khel Do you know if that option can be set as the default so that we don't need to update the scripts to make it work properly?

@Haroon-Khel
Contributor

@sxa555 I've looked through the documentation, but I can't seem to find a way to set that option globally/as a default. There's a way to do it using Docker Compose files, but I think that would be overkill. Updating our existing build scripts would be our best bet, though this issue only affects our GoDaddy machines, yes?

@sxa
Member

sxa commented Mar 3, 2020

Correct ... I suppose it would depend on how many places we needed to make the change in. It may be good to get @dinogun involved at this point to see if adding that option to each docker command is feasible and/or whether he knows of a way to default it globally.
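If the per-command route were taken, the change in the build scripts would presumably look something like this (a sketch only; BUILD_IMAGE and build.sh are placeholders, not the actual script contents):

# Before: the container uses the default bridge network, which fails on the GoDaddy hosts
docker run --rm "$BUILD_IMAGE" ./build.sh

# After: the container shares the host's network stack
docker run --rm --network=host "$BUILD_IMAGE" ./build.sh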

@karianna karianna modified the milestones: February 2020, March 2020 Mar 6, 2020
@karianna
Contributor

Adding Top priority to this as we're holding up pipelines

@sxa
Member

sxa commented Mar 31, 2020

@karianna Is it still holding pipelines up? The original problem was the docker builds chewing up all the resources on the machine, and the machine now has two executors to prevent that.

@karianna
Contributor

It's still only one host that we're relying on though right? I think we should get rid of the single point of failure in that case.

@sxa
Member

sxa commented Mar 31, 2020

Yes absolutely, but it's not currently holding up pipelines.

@aahlenst
Contributor

aahlenst commented Mar 31, 2020

Yesterday, two parallel Docker jobs blocked that machine for hours. There's a thread on Slack in #infrastructure started by Simon.

@sxa
Member

sxa commented Mar 31, 2020

OK thanks - I hadn't seen the system in a state where two docker jobs were running on it. That job should be single threaded I suspect as I'm not sure it's safe to run it in parallel. @dinogun can you comment/confirm?

[EDIT: Just checked and openjdk_build_docker_multiarch is set not to allow concurrent builds]

@sxa
Member

sxa commented Mar 31, 2020

It's still only one host that we're relying on though right? I think we should get rid of the single point of failure in that case.

We had looked at moving this to another machine at GoDaddy, but the GoDaddy servers have unresolved issues with networking in docker containers. See also adoptium/temurin-build#1044, where we have logged a few single points of failure that exist in the build systems today.

@dinogun

dinogun commented Mar 31, 2020

OK thanks - I hadn't seen the system in a state where two docker jobs were running on it. That job should be single threaded I suspect as I'm not sure it's safe to run it in parallel. @dinogun can you comment/confirm?

If two docker jobs of the same type were running in parallel, that would be very strange (and ideally should never happen). I wonder if for some reason the multiarch job and the manifest job were running at the same time. Though that should not happen either, as Jenkins should mark the machine as busy once a job begins executing, right?

@sxa
Member

sxa commented Mar 31, 2020

Though that should not happen either as jenkins should mark the machine as busy once a job begins executing right ?

Incorrect for this case - that machine has two executors and therefore allows two jobs to run in parallel (which is why I said that, despite being locked to this machine, the docker build shouldn't hold up other things, as they'll run on the second executor).

@dinogun

dinogun commented Mar 31, 2020

Hmm, can we somehow restrict all docker-related jobs to a single executor then?
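One low-tech way to get that effect, without changing the Jenkins executor count, would be to serialise the docker jobs on the node behind an exclusive lock (a sketch only; run_docker_job.sh and the lock path are hypothetical):

# Only one docker job runs at a time on this node; a second invocation
# blocks on the lock file until the first has finished.
flock /tmp/adopt-docker-jobs.lock ./run_docker_job.sh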

@sxa
Member

sxa commented Mar 31, 2020

As it happens, right now it's running two, and it is indeed the multiarch and manifest jobs that are running together:

[Screenshot: Jenkins executor view showing the multiarch and manifest jobs running in parallel on the machine]

I hadn't appreciated that both of those run for over 10 hours, so both executors are getting clogged up, preventing other jobs from running on this machine.

@dinogun

dinogun commented Mar 31, 2020

I've stopped the multiarch job for now. These jobs are not designed to run together, as they periodically clean up all docker images on the box and so would cause each other to fail. We need a way to restrict the docker jobs to only one executor.
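To illustrate why they conflict: if either job's periodic cleanup is anything like the following, it removes every unused image on the box, including layers the other job is still building from (the actual cleanup command in the scripts may differ):

# Aggressive cleanup - removes ALL unused images and build cache,
# not just this job's, so a concurrently running job will fail.
docker system prune --all --force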

@sxa
Member

sxa commented Apr 1, 2020

OK - that hasn't been the case for some time - the builds will likely have been running together unless anything else has changed, although I don't recall anyone reporting it until this week.

@sxa sxa pinned this issue Apr 1, 2020
@sxa
Member

sxa commented Apr 1, 2020

Looking at setting up one or two more machines for this. It will potentially destabilise daily docker image creation on the Linux/x64 machine until we have clear setup instructions for the docker jobs. I've got a system set up with the docker keys available and am running a test job on it at the moment. For obvious reasons this will take a while ;-)

For future reference, the server used for this appears to require at least 8GB of RAM (4GB without swap wasn't enough - I might also try 4GB with a swapfile just to check)
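For reference, the 4GB-plus-swapfile experiment mentioned above would be roughly the standard Linux steps (sizes illustrative, run as root):

# Create and enable a 4GB swapfile
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Keep it across reboots
echo '/swapfile none swap sw 0 0' >> /etc/fstab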

@sxa sxa modified the milestones: March 2020, April 2020 Apr 1, 2020
@sxa
Member

sxa commented Apr 2, 2020

The docker job ran successfully on docker-aws-ubuntu1604-x64-1. It also completed in 6h08m, compared with 31 hours for the last completed run on build-scaleway-x64-ubuntu-16-04-2.

The manifest job triggered after that build ran on docker-aws-ubuntu1604-x64-2 and completed in 2h12m, whereas previous runs on the scaleway box took up to 11 hours (possibly due to contention; the fastest recent run on that system was 5h29m).

The follow-on job on docker-aws-ubuntu1604-x64-2 has also completed (slightly faster at 4h41m compared to the aforementioned 6h08m on the other new machine). I would not be surprised if these times dropped as the machines re-run the jobs and build up more locally cached data.

I have locked ("Keep this build forever") one multiarch and one manifest job from the old machine temporarily so we can compare output if needed.

@sxa sxa closed this as completed Apr 2, 2020
@sxa sxa unpinned this issue Apr 2, 2020
@dinogun

dinogun commented Apr 3, 2020

@sxa555 docker push to DockerHub is failing on docker-aws-ubuntu1604-x64-1

The push refers to repository [docker.io/adoptopenjdk/openjdk14]
27480ab25448: Preparing
25866305528d: Preparing
16542a8fc3be: Preparing
6597da2e2e52: Preparing
977183d4e999: Preparing
c8be1b8f4d60: Preparing
c8be1b8f4d60: Waiting
denied: requested access to the resource is denied

This can happen if the auth is missing. Can you please check whether ~jenkins/.docker/config.json has been copied over as well?
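A quick check along those lines (commands are a sketch; the path is as described above):

# Confirm the Jenkins user's Docker auth config made it onto the new machine
ls -l ~jenkins/.docker/config.json
# It should contain an "auths" entry covering Docker Hub (docker.io)
grep -A 3 '"auths"' ~jenkins/.docker/config.json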

@sxa
Member

sxa commented Apr 3, 2020

Fixed - I'd copied the file over with the wrong name onto that machine - apologies

@dinogun

dinogun commented Apr 3, 2020

Can you do a quick check and see whether docker login works without any prompt for a password?

@sxa
Member

sxa commented Apr 3, 2020

Yep it's fine - I'm also re-running multiarch on x64 (we can restrict the jobs to specific combinations now instead of running all architectures!) to verify it

@sxa
Member

sxa commented Apr 3, 2020

I think it would be good if we could modify the scripts to return suitable non-zero exit codes in that situation (and others), to make it easier to judge the success or otherwise of the job from the Jenkins job status. I might have a go at that or assign one of my team to look at it.
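As a sketch of the kind of change meant here (not the actual scripts; IMAGE_TAG is a placeholder), the docker steps could propagate failure explicitly so the Jenkins job status reflects it:

# Abort on any failing command rather than carrying on silently
set -euo pipefail

if ! docker push "$IMAGE_TAG"; then
    echo "ERROR: push of $IMAGE_TAG failed" >&2
    exit 1
fi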

@dinogun

dinogun commented Apr 3, 2020

I have some upcoming changes that will fix this - in general, better reporting of failures and a summary of which specific docker images failed to build, if any.
