This repository has been archived by the owner on Mar 1, 2024. It is now read-only.

Continuous integration at GDS #106

Closed · wants to merge 1 commit

Conversation

tlwr
Contributor

@tlwr commented Nov 12, 2019

⏰ The deadline for first pass of submissions is Friday 29th November ⏰
After this deadline, we will work on presenting a more concrete proposal based on the user needs externalised in this document...

What

We are writing up https://docs.google.com/document/d/1PZXZwD9yrP-toI1gpaSi9NWpwl0zAVHUoJgb_65zKko/edit?ts=5dc99a4a#heading=h.2s8g0l8netcs into some documentation about how GDS should do continuous integration.

We think that a GitHub thread of comments that we can write up will be more collaborative across GDS than a Google doc.

In the following comments I will be posting some snippets from the Google doc.

How to contribute

Please comment with your thoughts and feedback; Harker and I will add them to the proposal once we've collated them.

Some prompts:

  • What does your ideal development workflow look like? (assuming continuous deployment)

  • What software do your tests require to run?

    • Do you have multi-language tests (e.g. Ruby, Go, Python) in the same repo?
    • Do you have backing services? (e.g. Postgres, RabbitMQ, Elasticsearch, Redis)
    • Do your tests require 🐝 dangerous 🐝 credentials to run? (e.g. tests spin up RDS instances)
    • Other software?
  • Is your project open source? Does it have open-source contributors, and how do they expect to contribute?

  • How do you verify the integrity of your code? (e.g. commit signing, etc)


Status

1. Collection of comments/opinions

⏰ Deadline Wednesday 27th November (DONE)

Once this deadline expires we will work to make a concrete proposal out of people's contributions. The proposal will take a form like the following (these are examples, not the actual proposal):

  • Procure a SaaS tool for doing continuous integration with the following features:

    • X
    • Y
    • Z
  • Use X open-source tool and run it centrally from TechOps, specifically for Continuous Integration, with the following features:

    • X
    • Y
    • Z

2. Proposal

This is in progress

We will write the proposal, and people should add clarifying comments.

3. Do the work

We will take the actions from the concrete proposal, for instance

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Stephen <stephen.harker@digital.cabinet.office.gov.uk>
Co-authored-by: Toby <toby.lornewelch-richards@digital.cabinet.office.gov.uk>
@tlwr
Contributor Author

tlwr commented Nov 12, 2019

Thoughts on CI/CD, based on conversations I've had about pipelines and Concourse over the last 18 months


CI / CD

This proposal is to cover continuous integration (CI) rather than continuous deployment (CD).

As an organisation we have a standardised way of continuously deploying things without human intervention using Concourse, and a large number of Concourse deployments for this purpose. We also use Concourse for triggering jobs manually.

We currently do not use software-as-a-service (SaaS) solutions for deployment, because we (depending on the provider):

  • have concerns about the security and information assurance postures of some SaaS CD software
  • have concerns about the usability, flexibility, extensibility of some SaaS CD software

Instead we have met our own user needs using Concourse/Jenkins; this is a valid exception to the TechOps principles/strategy. We are comfortable that we can operate Concourse at scale, given we have years of experience of operating Concourse.

CI / CD and trust

We have concerns about the IA/security of using SaaS solutions for CD; however, these concerns do not apply to CI tools.

GOV.UK PaaS and GSP both developed techniques for deploying code hosted by untrusted sources using commit signing. As a result we can host our code wherever we like, and instead trust developers to sign code. A trusted deployer (Concourse pipeline) is responsible for verifying the integrity of the source code from an untrusted repository (e.g. GitHub).

Therefore, we should have no concerns about the security / information assurance posture of any CI tool we pick, and instead we can evaluate the CI tools based on developer experience and cost to the taxpayer.

The following steps use an untrusted CI tool while keeping the code trusted all the way from laptop to production:

  • A developer pushes a signed commit from their laptop to version control
  • Version control hosts the signed commit
    • Version control cannot impersonate a developer due to commit signing
  • An untrusted CI tool examines the code and runs the tests
    • CI tool cannot impersonate a developer due to commit signing
  • Pull request is reviewed by another developer
    • Developer merges using a signed commit (PaaS variant)
    • Developer (or many) approves PR and merges using GitHub UI (GSP variant)
  • CD tool (Concourse) watches Git repo for signed commits
    • Verifies the signature based on configured public keys
    • If the commit is signed correctly by trusted developer, deployment proceeds

(This is the happy path; there are other threat models to consider, but they are not relevant to this proposal.)
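
To make the last two steps concrete, here is a minimal sketch, assuming the trusted developers' keys are already imported into the deployer's GPG keyring, of how a trusted deployer can check a commit's signature before deploying. The key IDs, repo path, and script itself are hypothetical illustrations, not the actual GOV.UK PaaS or GSP implementation; it only relies on `git log`'s standard `%G?` (signature status) and `%GK` (signing key) placeholders:

```python
# Minimal sketch: refuse to deploy unless HEAD carries a valid signature
# from a key the deployer already trusts. TRUSTED_KEY_IDS and the repo
# path are hypothetical; the real pipelines differ in the details.
import subprocess

TRUSTED_KEY_IDS = {"ABCDEF0123456789"}  # hypothetical developer key IDs


def signature_of(commit: str, repo: str) -> tuple[str, str]:
    """Return (status, key_id) for `commit`; status 'G' means a good signature."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "-1", "--format=%G?%n%GK", commit],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    status = out[0] if out else "N"
    key_id = out[1] if len(out) > 1 else ""
    return status, key_id


def safe_to_deploy(commit: str = "HEAD", repo: str = ".") -> bool:
    status, key_id = signature_of(commit, repo)
    # Only a good ('G') signature from a key we explicitly trust passes.
    return status == "G" and key_id in TRUSTED_KEY_IDS


if __name__ == "__main__":
    if not safe_to_deploy():
        raise SystemExit("refusing to deploy: HEAD is not signed by a trusted key")
    print("signature verified; continuing with deployment")
```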

User needs for CI

This proposal is really about making developers and service teams productive, whilst also getting the benefits of consolidation and consistency. There are two high level user needs (non-exhaustive):

As a developer writing code and pushing it to version control, I expect my tests to be run automatically using a CI service against real (ephemeral) databases, so that I do not have to run tests on my local machine.

As a code reviewer, I expect the tests for an open pull request to be run automatically using a CI service against real (ephemeral) databases, so I do not have to check out the code and run the tests locally, and so I can see, at a glance, whether the tests (and other status checks) are passing.

(This does not explicitly mention the "coding in the open" aspect, but it must be a consideration.)
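
As an illustration of the "real (ephemeral) databases" point above (not something the thread prescribes), a test suite can start its own throwaway Postgres on any runner that can run Docker, for example with the testcontainers library:

```python
# Illustrative only: the test session gets a throwaway Postgres container,
# so no CI agent needs a permanently installed database.
# Requires Docker on the runner plus testcontainers, sqlalchemy and a
# Postgres driver installed in the test environment.
import pytest
import sqlalchemy
from testcontainers.postgres import PostgresContainer


@pytest.fixture(scope="session")
def database_url():
    # The container is created before the tests and removed afterwards.
    with PostgresContainer("postgres:13") as postgres:
        yield postgres.get_connection_url()


def test_can_talk_to_real_database(database_url):
    engine = sqlalchemy.create_engine(database_url)
    with engine.connect() as connection:
        assert connection.execute(sqlalchemy.text("SELECT 1")).scalar() == 1
```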

Technical considerations

Concourse and Jenkins, the two pieces of self-hosted software we currently use for CI, meet our needs, but not optimally, for the following (non-exhaustive) reasons:

  • Proliferation of agents, with slightly different versions of software permanently installed.

    • Ideally we could version dependencies with the code that we are testing, so that it is easy to change, and so we do not have long lived infrastructure with specific versions.
  • Use of docker-compose or similar container based toolchains.

    • Operational concerns when the context of a job is not encapsulated within an ephemeral VM:
      • Dangling containers/images/volumes
      • Caching is a hard problem
    • Ends up forcing developers to learn a new toolchain (e.g. Docker), just for running tests.

@46bit

46bit commented Nov 13, 2019

I love the thought that's gone into this.

One PaaS tenant deploys each PR to a new, temporary app on the PaaS. This allows real-world previewing of how it works. I think that in your model this would require the untrusted CI tool to have PaaS credentials, or to have the trusted CD tool also watching PRs. Do you have any thoughts on using that model for relatively small GDS services?

@tlwr
Contributor Author

tlwr commented Nov 13, 2019

This section of this comment is a direct response to @46bit

I have no concerns about using "untrusted" CI tools to deploy applications to GOV.UK PaaS, provided the credentials are scoped accordingly:

e.g. for PR previews, a SaaS CI tool has:

  • a service account saas-ci-tool-pr-preview@digital.cabinet-office.gov.uk
  • which has SpaceDeveloper permissions on gds-service/pr-previews (org/space)

e.g. for team manuals, other microsites, tools, etc., a SaaS CI tool has:

  • a service account saas-ci-tool-microsites@digital.cabinet-office.gov.uk
  • which has SpaceDeveloper permissions on gds-service/public (org/space)

I think it would depend on the needs of the developer setting up PR checks.
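
To make the scoping above concrete, here is a minimal sketch of granting such a service account the SpaceDeveloper role on a single org/space via the cf CLI; the script is hypothetical, and the account, org, and space names are just the examples from this comment:

```python
# Hypothetical helper: give the PR-preview service account SpaceDeveloper on
# exactly one org/space, so the SaaS CI tool can push preview apps and
# nothing else. Assumes the operator running it is already logged in via
# `cf login` with permission to assign roles.
import subprocess

SERVICE_ACCOUNT = "saas-ci-tool-pr-preview@digital.cabinet-office.gov.uk"
ORG = "gds-service"
SPACE = "pr-previews"


def grant_pr_preview_access() -> None:
    subprocess.run(
        ["cf", "set-space-role", SERVICE_ACCOUNT, ORG, SPACE, "SpaceDeveloper"],
        check=True,
    )


if __name__ == "__main__":
    grant_pr_preview_access()
```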


This section of this comment contains more general thoughts

Developer experience

The user experience of setting up deployment pipelines with SaaS CI tools is, in my experience, frustrating; whereas Concourse is flexible and extensible enough for the needs of Platforms at GDS.

The user experience of setting up simple CI pipelines with SaaS CI tools is, in my experience, an absolute dream.

Financial resources

The financial resources required for a self-hosted CD tool running fairly custom pipelines that deploy our platforms are quite low, because deployment pipelines tend not to require much compute; we pay for capacity (e.g. we have 3 t3.medium VMs running 24x7). Self-hosting also allows us to mitigate IA concerns and to have very flexible pipelines.

The financial resources required for self-hosting CI tools are quite high, because the compute resources required are roughly correlated with developer activity/productivity. During the day, when people are working, we need to ensure tests run quickly, and capacity planning becomes harder: enter autoscaling, etc. SaaS CI tools' pricing models, which charge based on usage rather than capacity, fit this stage of the development cycle better IMO.

@LeePorte

The approach of running CI using SaaS products and CD using something self-hosted is sensible. I am also fine with PR CD happening with an appropriately scoped SaaS tool. I suggest that, as part of this, we write patterns that non-RE team members can follow easily, to reduce the proliferation of tools. I would lean towards using Circle as the SaaS tool and Concourse for CD.

Jenkins use is reducing and it looks like there is only going to be one programme left using it. That programme has much higher value things to do than migrate CI / CD.

@philandstuff
Contributor

Some thoughts

GSP's trust model

GSP's trust model has evolved over time, but we have got to a point where we trust GitHub because a) they are not a major part of our threat model and b) managing our own list of GPG keys has been painful. (I won't go into more detail here because the detail is off-topic.)

Terminology: PR builds vs CI vs CD

I think there are distinct but related things here:

  • PR builds: does the code on this branch pass tests?
  • CD: the continuous delivery pipeline:
    • CI: does the code on master pass tests?
    • the rest of CD: deployment to successive environments

In particular, I don't see PR builds as "CI": I see CI as the first step of the CD pipeline. It might be a bit prescriptivist of me, but the word "integration" in "continuous integration" means "merging your branch to master", so if it isn't on master, it's not CI.

Although PR builds and the CI step might strongly overlap in what they do, they differ in why: for a PR build, it's "is this code okay to merge?" whereas in the CI step it's "is this code okay to push to production?". CI might have outputs such as built code artefacts which the rest of the pipeline promotes to successive environments; these outputs are not relevant to PR builds.

I don't mind that other people might think differently; I mainly call this out so that we can come to agreement on the definition of terms in the context of this document. I also think it's important to note that while you might be happy running PR builds on CircleCI, you might prefer to run your CI step on Concourse, for example. You also might be happier with a lower level of assurance for PR builds than for your CI step. PR builds and CI have different but overlapping needs.

The other technology bit: artefact repositories

It's (sometimes) impossible to implement a CD pipeline without an artefact repository of some sort. We have a proliferation of tools and SaaS services at GDS:

  • jar files
    • artifactory
    • bintray
  • docker images
    • docker hub
    • ECR
    • harbor
  • .deb packages
    • aptly
  • generic blobs
    • S3

Any discussion of CD is incomplete without a nod to this.

I think content trust is more concerning with artefact repositories than with source code repositories, because injecting nefarious code into a built binary is less obvious than injecting nefarious code into a source code repo.

Deployment privileges and maintenance burden

Any CD server which has privileges to deploy to a production environment is something which we should be careful about. Access to that server should be (roughly) as tightly controlled as access to the production environment itself. In the best case, each prod environment has its own CD server for deploying to it.

However, CD servers cost time and knowledge to maintain. Having lots of them is a maintenance burden, especially in terms of keeping them all up to date. So we may prefer to have fewer CD servers for maintenance simplicity.

In short, we can either:

  • make CD servers easier to maintain by having a single CD server which can deploy everywhere
  • make CD servers less of a juicy target by having each prod environment have its own CD server

I am sure there is a reasonable way of synthesising these approaches usefully. We have something approaching this in the multi-tenant concourse, where we have separate workers for each prod environment; but I have ideas about how we could tighten this further.

My view

My view is that we should have something like this:

  • PR builds: use whatever. Concourse, GitHub Actions, CircleCI, Travis, 🤷‍♂
  • CI/CD: use Concourse.
  • artefact repositories: use SaaS (ECR, Bintray, maybe GitHub Packages), combined with some form of content trust if appropriate (e.g. pin to specific image digests, as in the sketch below)
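
As a sketch of what "pin to specific image digests" can look like in practice (an illustration, not an existing GDS tool): resolve a mutable tag to its immutable digest once, then reference only the digest downstream.

```python
# Illustration of content trust by digest pinning: resolve "repo:tag" to the
# immutable "repo@sha256:…" form so the artefact cannot change underneath the
# pipeline. Requires the docker CLI; the image name is just an example.
import json
import subprocess


def pinned_reference(image: str) -> str:
    """Return the repo@sha256 digest reference for a tagged image."""
    subprocess.run(["docker", "pull", image], check=True)
    inspect = subprocess.run(
        ["docker", "image", "inspect", image, "--format", "{{json .RepoDigests}}"],
        capture_output=True, text=True, check=True,
    )
    digests = json.loads(inspect.stdout)
    if not digests:
        raise RuntimeError(f"no registry digest recorded for {image}")
    return digests[0]


if __name__ == "__main__":
    # Prints something like library/alpine@sha256:…
    print(pinned_reference("alpine:3.18"))
```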

@vixus0

vixus0 commented Nov 20, 2019

If you're running more/different tests against code on master than you are against a PR, you're not failing fast.

From my experience, drawing a big distinction between PR builds and CI means you'll probably end up with two places where you need to maintain the same test setup in slightly different ways. If your tests have any external dependencies, that becomes particularly annoying, and you soon end up with things like travis.sh and jenkins.sh.

GitHub-integrated CI tools such as Travis can be set up to run tests against master on a merge, so I think that gets rid of the first step of the CD pipeline. At that point your CD system picks up the tested code, builds artifacts and continues from there.

Also maybe we want to stop using giant headings in comments because it draws attention away from comments which don't have them.

@tlwr
Contributor Author

tlwr commented Nov 21, 2019

Both Phil and Anshul have raised good points about PR builds & CI/CD. My preference would be for the same checks to be run at PR time and after merge to master, so that:

  • there is a single canonical way of running status checks (there may be many status checks from different tools - e.g. linting, unit tests, integration tests, static analysis - which may run in parallel)
  • master builds are in the open (copy and pasted from doc from @katstevens) - "keeps us honest" and ~"limits scope to the repo"

There are additional comments in the document about:

  • speed of the tools - "occasional slowness" - e.g. being stuck in a queue
  • brittleness of docker-compose in a container (dc-in-d)
  • how easy a tool / SaaS solution is to learn

@barrucadu

barrucadu commented Nov 21, 2019

I'm not really sure where this fits into the doc, so I'll comment here.

Something we rely on heavily in GOV.UK is parameterised builds. Many of our tests require running things against another repository - for example, every PR triggers a build of the publishing-e2e-tests integration test suite after the app's own test suite passes. Such a build has many parameters in Jenkins:

[Screenshot: parameterised build page in Jenkins, 2019-11-21 09:57]

Sometimes we need to tweak these parameters manually, for example if we're making changes to two different apps (which live in two different git repositories) and need to test them against each other. So any CI solution we go for absolutely has to support human tweaking of build parameters, preferably in a more friction-free way than having to commit a list of parameters to the branch.

I accept that this isn't ideal, that it would be nice for the repository to define the exact state of the thing and not need to rely on a human filling in the right parameters, but we're a long way from that at the moment.

A little while back we looked at using Concourse for CI, and the lack of this feature was a major part of why we didn't proceed.
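
For illustration only, and with hypothetical parameter names rather than the actual publishing-e2e-tests setup, this is roughly the shape of parameterised build being described: every parameter has a sensible default, and a human can override any of them for a single run without committing anything.

```python
# Hypothetical sketch of a parameterised build entry point: each dependent
# repo's branch defaults to master, and a human can override any of them for
# one run (e.g. to test two in-flight changes against each other) without
# committing a parameter file to the branch. The make target is an assumption.
import argparse
import subprocess


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run the integration suite")
    parser.add_argument("--app-branch", default="master",
                        help="branch of the app under test")
    parser.add_argument("--other-app-branch", default="master",
                        help="branch of the other app to test against")
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    subprocess.run(
        ["make", "test",
         f"APP_BRANCH={args.app_branch}",
         f"OTHER_APP_BRANCH={args.other_app_branch}"],
        check=True,
    )


if __name__ == "__main__":
    main()
```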

@NickColley

NickColley commented Nov 25, 2019

Just to represent the GOV.UK Design System team's needs for CI as a public comment:

Is your project open source? Does it have open-source contributors, and how do they expect to contribute?

The GOV.UK Design System aims to do open source rather than coding in the open, so we have many contributions from people outside of GDS (alphagov). Representing the community outside of government is essential for the success of the Design System.

Our workflow relies on CI running when contributors open a pull request; it helps them fix mistakes themselves and reduces the overhead for us, as we tend to get lots of smaller contributions.

We will not be able to use any CI approach that blocks external contribution. Thank you for taking all these needs into account when you figure out the next steps. We're available to chat about the details of contribution if you need :)

@tlwr
Contributor Author

tlwr commented Dec 16, 2019

People have been asking about the status of this; we're currently working on back-office things in order to proceed.

We've reviewed the comments and confirmed the hypotheses we had before we started this process:

  • We have multiple ways of doing CI/PR checks using hosted solutions; these work but have user-experience tradeoffs and do not fulfil all requirements (e.g. public / open-source PR builds)
  • We have organic adoption of SaaS CI/PR check tools from different teams, and people are happy with these solutions for CI/PR checks
  • IA is happy with us using SaaS tools for a subset of the process of getting code to production

What we are doing is:

  • reviewing and evaluating the comments
  • working out how those comments manifest themselves in the features offered by CI SaaS tools
  • drawing up a business case for a single centrally procured CI SaaS tool

In the meantime:

  • we do not want to block people from doing work or planning future work
  • we are not making any commitments to a timeline of procuring a SaaS tool (as this involves £££)
  • if you are using Travis/Circle/etc, please continue to do so
  • if you are using Concourse for PR builds as well as deployments, please continue to do so
  • the comments above have not swayed our attitude towards how we use Concourse for continuous deployment.

Separately, we have admin work to do for evaluating how we administer version control, and we will use the comments from this PR to that end.

@tlwr
Contributor Author

tlwr commented Jun 24, 2020

Closing this as we now have GitHub Enterprise procured, which is in mild use for CI

We still want a guidance page here, and I'll reference this PR when writing it

@tlwr closed this Jun 24, 2020
@tlwr deleted the ci-proposal-outcome branch June 24, 2020 12:30