[ci.jenkins.io] Use a new VM instance type #3535

Closed
dduportal opened this issue Apr 21, 2023 · 11 comments

@dduportal
Contributor

What is the problem?

The current VM for ci.jenkins.io starts to show issues:

Also, this VM was sized a few years ago in a slightly different context: JDK8 for running the controller (e.g. less CPU usage but more memory usage), no UEFI bootloader (generation V1), Ubuntu 18.04.

Finally, the infrastructure layer of this VM is managed manually (it was initially created with Terraform, but then switched to manual management).

**What should we do?**

There are numerous tasks for this VM:

**How could we do it?**

Proposal: to avoid any maintenance overhead and migration risk, the infra team thought of the following plan:

  • Create a brand new VM for ci.jenkins.io:
    • With Terraform from the beginning, using modern syntax as per @smerle33 's work in [INFRA-1352] Migrate Trusted.ci on Azure #1101
    • Use an instance with better sizing (e.g. more CPUs or at least more powerful ones, less RAM, etc.) with generation V2 and a brand new SSD
    • Based on Ubuntu 22.04
    • Already in the new public network (Azure: Re-create networks to fix overlap issues and support of IPv6 #3257), with no overlap or performance issues, closer to the Azure ACP, and with IPv6 support
    • Managed by Puppet with the name ci.jenkins.io (reminder: ci.jenkins.io used to be the name under Puppet for the AWS VM; the current name of the Azure VM is azure.ci.jenkins.io)

=> this would avoid disrupting the current ci.jenkins.io service until the effective migration

Validation steps would be:

@dduportal
Contributor Author

Should use a backup policy (merged #3527):

The goal is to ensure we have a daily backup of the JENKINS_HOME of ci.jenkins.io

Azure provides a Backup System that can be used specifically for managed disks such as this one: https://learn.microsoft.com/en-us/azure/backup/backup-managed-disks.

We don't (and should not) need a VM-level backup as we use Puppet to manage the system: disaster recovery for ci.jenkins.io is to install a blank new VM and mount the restored data disk for Jenkins.

As per https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/data_protection_backup_instance_disk, we can define this using Terraform, which implies importing the ci.jenkins.io VM once and for all.

A word about encryption:

  • The backup vault is, like the VM disks, encrypted at rest with an Azure PMK key (hardware level).
  • We can keep this behavior (encryption at rest with PMK) for the backup, as ci.jenkins.io does not have any sensitive data (possibly credentials for the GH org, but that is all).
  • Note: This encryption could use a custom private key for sensitive backups such as trusted.ci's

@dduportal dduportal self-assigned this May 9, 2023
dduportal added a commit to jenkins-infra/azure that referenced this issue May 16, 2023
…m resource group (#348)

Related to jenkins-infra/helpdesk#3535, this
PR is the first major step to create a new (faster + cheaper)
ci.jenkins.io VM managed as code.

Please note the following elements:

- A new resource group is created to avoid interfering with the current
one (and to avoid any confusion). The goal is to remove the older
resource group once the migration is complete
- Neither DNS records, subnets nor the snapshot (of the data disk for
migration) are constrained by this
- StandardSSD are used instead of PremiumSSD (V1/V2). The goal is to
start with a cheaper storage and see the current IOPS usage (since we
[removed the plugin-config-history
plugin](jenkins-infra/helpdesk#3528) and
[added an S3 artifact caching
proxy](jenkins-infra/helpdesk#3496) the need
for IOPS is lowered).
- Keep using a 512 GB SSD though, because current usage is ~240 GB: if
we want to stay under 80% usage, AND since [Azure managed
disks](https://learn.microsoft.com/en-us/azure/virtual-machines/disks-types#standard-ssds)
are billed at either the 256 or 512 GB tier (unless Premium SSD v2),
it is better to take the bigger disk.

- No need for a storage account (it was only used for boot diagnostics,
which we should not need with a Terraform IaC-based VM)

- Registration to Puppet is done under the name `ci.jenkins.io` : the
current VM is [registered as
`azure.ci.jenkins.io`](https://github.com/jenkins-infra/jenkins-infra/blob/e425203e0ffafdcf8b2d5e675ad838c5e73cd687/manifests/site.pp#L76-L79)
so no risk of conflicts
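
To illustrate the disk-size reasoning above, here is a minimal sketch of the arithmetic. The two tiers (256 and 512 GB), the ~240 GB usage and the 80% ceiling come from the message; the function name is hypothetical and the strict "under 80%" comparison is an assumption.

```python
# Sketch only: tier sizes, usage and threshold are taken from the commit
# message; smallest_tier() is a hypothetical helper, not part of any tooling.
TIERS_GB = [256, 512]      # the two disk sizes considered in the message
MAX_USAGE_RATIO = 0.80     # "stay under 80% usage"


def smallest_tier(used_gb, tiers=TIERS_GB, max_ratio=MAX_USAGE_RATIO):
    """Return the smallest tier that keeps used_gb strictly under max_ratio.

    Returns None when no listed tier is large enough.
    """
    for size in sorted(tiers):
        if used_gb / size < max_ratio:
            return size
    return None


# 240 / 256 is about 0.94 (over the ceiling), 240 / 512 is about 0.47.
print(smallest_tier(240))
```

With ~240 GB in use, the 256 GB tier would already sit above the 80% ceiling, which is why the 512 GB tier is the smallest one that fits.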

Please note that no security group is added...yet. I want to start with
an "open VM" before applying security groups in a subsequent PR
(everything will be removed and re-created once verified functional).

---------

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
Co-authored-by: Tim Jacomb <21194782+timja@users.noreply.github.com>
@dduportal
Contributor Author

dduportal added a commit to jenkins-infra/azure that referenced this issue May 16, 2023
…om internet (#349)

Related to jenkins-infra/helpdesk#3535, this
PR follows up #348

By default, all accesses are forbidden in the security group, so we
cannot reach the VM.

This change adds a set of security group rules to the ci.jenkins.io
controller subnet to:

- Allow incoming SSH requests from the private VPN (as public and
private networks are peered) to the private IP of the VM
- Nice to have once the access is validated: a private DNS record in the
VPN subnet
- Allow incoming HTTP, HTTPS and TCP Inbound protocols from the internet
to the VM
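
As a rough illustration of how such rules combine with the default deny, here is a simplified model of inbound rule evaluation (lowest priority number first, first match wins, otherwise denied). It is a sketch, not the actual Terraform from the PR: the VPN CIDR, the priorities and the inbound agent TCP port (50000) are placeholder assumptions, and the model ignores destination addresses.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    priority: int   # lower numbers are evaluated first
    source: str     # CIDR the rule applies to
    port: int       # destination port
    allow: bool


# Hypothetical values, for illustration only.
VPN_CIDR = "10.0.0.0/16"   # assumed private VPN range
AGENT_PORT = 50000         # assumed inbound agent TCP port
ANY = "0.0.0.0/0"

RULES = [
    Rule(100, VPN_CIDR, 22, True),    # SSH from the private VPN only
    Rule(110, ANY, 80, True),         # HTTP from the internet
    Rule(120, ANY, 443, True),        # HTTPS from the internet
    Rule(130, ANY, AGENT_PORT, True), # inbound agents over TCP
]


def is_allowed(source_ip, port, rules=RULES):
    """Return the decision of the first matching rule, or deny by default."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port == port and ip_address(source_ip) in ip_network(rule.source):
            return rule.allow
    return False  # nothing matched: access is forbidden


print(is_allowed("10.0.1.5", 22))      # SSH from inside the assumed VPN range
print(is_allowed("203.0.113.7", 22))   # SSH from an arbitrary internet address
```

In this model, SSH is reachable only from the VPN range, while HTTP, HTTPS and the agent port are reachable from anywhere; every other port stays closed.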

---------

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Contributor Author

Delaying as it's blocked by the network peering accesses part of #3351 and by the work on the new trusted machines in #3486

@dduportal
Contributor Author

dduportal commented Jun 26, 2023

Update: ci.jenkins.io is now using inbound agents running from the new virtual network.

Watching the builds (ping @lemeurherve not urgent but I'll try to check the CI integration in datadog to see if any pattern arises here - #3573)

@dduportal
Contributor Author

dduportal commented Jul 4, 2023

Next step: bootstrapping a fully operational VM for the new ci.jenkins.io

⚠️ The old JENKINS_HOME seems to have inode issues (whether on the old disk or the snapshots). Despite copying the full disk yesterday, rsync sees all files as changed: the copy is in progress but will take multiple hours (~12 hours).
ci.jenkins.io will remain down until 5 July.

  • Restart of the service
  • Migrate DNS
  • Puppet run with letsencrypt

dduportal added a commit to jenkins-infra/azure that referenced this issue Jul 4, 2023
Related to jenkins-infra/helpdesk#3535

- Adds 2 DNS A records to reach the VM (without changing ci.jenkins.io,
yet)
- Rename resources to stick to the "controller" naming like we did with
trusted.ci (the proposal to make a module from @timja makes sense for a
controller setup as we instantiate it 3 times, so let's get trusted and
CI close in order to make the module happen later)
- Set up a first set of NSG rules to control outbound traffic

---------

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Contributor Author

Update (4th of July):

  • Migration finished late (European time)
  • Removed the cache*/, **/*.tmp, plugins/, war/ and log*/* elements from the new JENKINS_HOME
  • Service started with success ("Jenkins is up and running") and valid locally (curl -v http://localhost:8080)
  • ci.jenkins.io CNAME record created manually to allow letsencrypt
    • Initial failure as Apache expected the file to be there. Copied the /etc/letsencrypt from the former VM
    • letsencrypt enabled with success after this
  • Agent failures due to missing NSGs: added manually
  • Missing plugins which were not part of the puppet "as code" setup
    • startup works (e.g. all JCasC required plugins are defined as code)
    • It's only missing features or pipeline runtime errors (No such DSL method 'xxxx' found among steps <...>)
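
For reference, the cleanup patterns listed above (cache*/, **/*.tmp, plugins/, war/, log*/*) can be read as the following path classification. This is only one interpretation of those patterns, not the command that was actually run; the function name and the example paths are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_removed(rel_path):
    """Return True if a file path relative to JENKINS_HOME matches a removed element."""
    parts = PurePosixPath(rel_path).parts
    if not parts:
        return False
    top, name = parts[0], parts[-1]
    inside_dir = len(parts) > 1  # the path lives under a top-level directory

    # cache*/, plugins/ and war/: everything under these top-level directories
    if inside_dir and (fnmatchcase(top, "cache*") or top in ("plugins", "war")):
        return True
    # log*/*: contents of top-level directories whose name starts with "log"
    if inside_dir and fnmatchcase(top, "log*"):
        return True
    # **/*.tmp: temporary files at any depth
    return fnmatchcase(name, "*.tmp")


# Hypothetical example paths:
print(is_removed("plugins/git.jpi"))
print(is_removed("jobs/example/config.xml"))
```

Job configurations and build records are left untouched by these patterns; only caches, temporary files, logs, the exploded WAR and the plugin binaries are dropped, which is consistent with the plugins needing to be reinstalled afterwards.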

@dduportal
Contributor Author

dduportal commented Jul 5, 2023

dduportal added a commit to jenkins-infra/azure that referenced this issue Jul 5, 2023
…426)

The NSG rule names are global: I had collisions with
trusted.ci.jenkins.io's NSG while working on
jenkins-infra/helpdesk#3535.

This PR renames ci.jenkins.io's NSG rules that could be confusing.

That kind of problem would be ideally solved by creating our own
Terraform module.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Contributor Author

dduportal commented Jul 10, 2023

Todo list to close this issue:

dduportal added a commit to jenkins-infra/azure that referenced this issue Jul 11, 2023
Related to
jenkins-infra/helpdesk#3535 (comment)

This PR removes:

- The storage account `cijenkinsiovmagents` which was used by the Azure
VM plugin of the former VM
- The resource group `eastus-cijenkinsio` which was used by the Azure VM
and ACI plugins of the former VM

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Contributor Author

Closing the issue as the work is finished
