
Build & Test Caching / Incremental Builds / "Modular Cloud" (Remote Computation Cache) #121

Closed
sebinsua opened this issue Oct 1, 2020 · 4 comments
Labels
enhancement New feature or request high priority

Comments

@sebinsua
Contributor

sebinsua commented Oct 1, 2020

Modular is an open-source framework built for scaling up your development workflow so you can spend more time doing productive work. One important aspect of allowing a project to scale is its build/test time; if this is high, it degrades the efficiency of the whole development workflow.

Early in the project we implemented a cache of node_modules/ to reduce or skip the installation step on CI. This is a table-stakes feature of modern open-source CI systems, and it benefits projects within the firm by cutting yarn install times from ~2 minutes to ~18 seconds.

Very recently @NMinhNguyen added this caching layer to another internal project and found that the impact was far greater: the CI process went from ~19 minutes in total to only ~3 minutes. This outcome was highly surprising. It turned out that the project in question used Nx, which incidentally stores its build cache within the node_modules/ directory we cached.

This technique of caching build information/outputs is extremely common and absolutely necessary if you wish to scale a very large project within a single repository / CI process.

It is widely seen as a way of making builds/tests up to 10x faster, which can mean a 10x higher throughput of PRs by a team, or a very significant cost saving for the business with regard to CI resource usage.

If you're wondering whether remote caching is a basic feature of all tools that let you build/test large repositories, you'd be correct. It is a foundational feature that every monorepo build toolchain implements, e.g. Bazel and Nx:

[Image: Nx Remote Cache Marketing]
[Image: Bazel Remote Cache Information]

Right from the outset of the internal *-cache project, its purpose has been to create the foundations for fast builds/tests within the firm, which Modular requires in order to provide a scalable development workflow. In a way, what has been built essentially acts like 'Nx Private Cloud' but with fewer bells and whistles.

Since a remote cache is now deployed internally, we can use it to increase the scalability of other work within a repository (e.g. coverage). We discuss tooling and primitives for doing so below.

Build in-house vs delegate to open-source tool

Rebuilding a tool like Nx or Bazel would be development-intensive.

We could choose to wrap one of these tools or we could choose to provide guidance on how they could be used on top of Modular. However, since build/test scalability appears to be an important aspect of what we're trying to achieve, completely delegating this work to huge/complex toolchains that we aren't able to control or understand well might be a mistake.

My opinion is that it'd be a good idea to create internal caching 'primitives' that can be used by each of the tasks in a development workflow. To get there, we should experiment with the bigger toolchains and try to understand their design choices, but also look at smaller libraries that each deal with a small aspect of the problem to see what approaches we can learn from them (e.g. beezel, backfill and @rushstack/package-deps-hash).

Implement CacheStorage class

It is worth mentioning that adding a new remote cache to some of the build tools is relatively easy. For example, Nx has the following class (see also nx-remotecache-gcs) and backfill has a similar CacheStorage class.

Minh suggested that we could use patch-package to add an S3 Cache Provider into backfill and then upstream this as a PR once we are happy with it.
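To make the idea of a pluggable storage class more concrete, here is a minimal sketch. The interface and class names (RemoteCacheStorage, InMemoryCacheStorage, runWithCache) are hypothetical and are not the actual Nx or backfill APIs; an S3 provider would implement the same two methods against a bucket instead of a Map.

```typescript
// Hypothetical storage contract: NOT the real Nx or backfill interface.
interface RemoteCacheStorage {
  // Resolves to the stored artifact, or undefined on a cache miss.
  fetch(hash: string): Promise<Uint8Array | undefined>;
  // Stores an artifact under the given hash.
  put(hash: string, artifact: Uint8Array): Promise<void>;
}

// In-memory stand-in for a remote backend, useful for local experiments.
class InMemoryCacheStorage implements RemoteCacheStorage {
  private readonly store = new Map<string, Uint8Array>();

  async fetch(hash: string): Promise<Uint8Array | undefined> {
    return this.store.get(hash);
  }

  async put(hash: string, artifact: Uint8Array): Promise<void> {
    this.store.set(hash, artifact);
  }
}

// Runs the task only on a cache miss, then uploads its output.
async function runWithCache(
  storage: RemoteCacheStorage,
  hash: string,
  task: () => Promise<Uint8Array>
): Promise<{ output: Uint8Array; hit: boolean }> {
  const cached = await storage.fetch(hash);
  if (cached !== undefined) {
    return { output: cached, hit: true };
  }
  const output = await task();
  await storage.put(hash, output);
  return { output, hit: false };
}
```

Keeping the contract this small is the point: the build tool only needs to know how to fetch and put by hash, so swapping the backend should not affect anything else.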

Caching Primitives, Caching Strategies & Cache Coarseness

Depending on whether you are caching package builds, very large applications, node_modules/ or tests/coverage, you may need a completely different caching strategy.

Concepts

Key generation

Action Keys

Keys are constructed from the information used to execute a task. (See also Nx's docs on this.)
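As a rough sketch of how such a key could be derived (this is not how Nx computes its hashes; the TaskInputs shape and the actionKey function are made up for illustration), we could hash the command, the content hashes of the input files and the relevant environment variables:

```typescript
import { createHash } from "node:crypto";

// Hypothetical description of a task; these names do not come from Nx or Modular.
interface TaskInputs {
  command: string; // e.g. "yarn build"
  fileHashes: Record<string, string>; // path -> content hash of each input file
  env: Record<string, string>; // environment variables that affect the output
}

// Returns the record's entries sorted by key, so that property insertion
// order does not change the resulting cache key.
function stableEntries(record: Record<string, string>): [string, string][] {
  return Object.keys(record)
    .sort()
    .map((k): [string, string] => [k, record[k]]);
}

// Derives a deterministic SHA-256 key from everything used to execute the task.
function actionKey(inputs: TaskInputs): string {
  const payload = JSON.stringify({
    command: inputs.command,
    files: stableEntries(inputs.fileHashes),
    env: stableEntries(inputs.env),
  });
  return createHash("sha256").update(payload).digest("hex");
}
```

Any change to an input file, the command or a listed environment variable yields a different key, which is what makes a key hit safe to treat as "this exact work has already been done".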

Fallbacks when there are cache misses (but useful caches are still available)

GitHub Cache has the concept of restore-keys to allow fallback to other caches.
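The fallback behaviour can be sketched as follows: try the exact key first, then each restore key in order as a prefix, preferring the most recently created match. This is a simplified model for our own primitives rather than GitHub's implementation, and CacheEntry and resolveCache are hypothetical names.

```typescript
// Hypothetical record of a cache that has been uploaded.
interface CacheEntry {
  key: string;
  createdAt: number; // e.g. a Unix timestamp in milliseconds
}

// Resolves which cache to restore: exact key first, then prefix fallbacks.
function resolveCache(
  entries: CacheEntry[],
  key: string,
  restoreKeys: string[]
): CacheEntry | undefined {
  const exact = entries.find((e) => e.key === key);
  if (exact !== undefined) {
    return exact;
  }
  // Restore keys are tried in the order given; each one is a key prefix.
  for (const prefix of restoreKeys) {
    const candidates = entries.filter((e) => e.key.startsWith(prefix));
    if (candidates.length > 0) {
      // Prefer the newest cache among those sharing the prefix.
      return candidates.reduce((a, b) => (b.createdAt > a.createdAt ? b : a));
    }
  }
  return undefined;
}
```

A fallback hit is only a starting point: the task still has to run, but it can reuse whatever in the stale cache remains valid.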

Per-package caches

Tools like Nx create a cache per package. These work well for monorepos with many independent packages, since we can decide for each package whether to skip build or test steps depending on whether a cache key exists or not.
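The per-package decision could look something like the sketch below, assuming each package already has a cache key and we know which keys exist in the cache. PackageTask, BuildPlan and planTasks are hypothetical names, not part of Nx.

```typescript
// Hypothetical pairing of a package with the cache key for its build/test task.
interface PackageTask {
  name: string;
  cacheKey: string;
}

interface BuildPlan {
  skip: string[]; // packages whose outputs can be restored from the cache
  run: string[]; // packages that must actually be built/tested
}

// Splits packages into those that can be skipped and those that must run.
function planTasks(
  packages: PackageTask[],
  availableKeys: ReadonlySet<string>
): BuildPlan {
  const plan: BuildPlan = { skip: [], run: [] };
  for (const pkg of packages) {
    if (availableKeys.has(pkg.cacheKey)) {
      plan.skip.push(pkg.name);
    } else {
      plan.run.push(pkg.name);
    }
  }
  return plan;
}
```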

Application caches

On the other hand, if you have a webpack application that spans a whole repository, and its cache is updated and persisted on every build, a tool like Nx wouldn't be a good choice. If your cache is produced from a large amount of source code that is not partitioned by package, then a single file change would change the cache key for the whole application, and key-based caching would almost never hit.

In this situation, we should not generate cache keys in the same way. We need a strategy in which the last known good cache for the branch is retrieved. A number of CI providers seem to use a time-based strategy and restore the most recently uploaded cache.
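A minimal sketch of such a "last known good" lookup is shown below. The optional fallback to another branch (for example the base branch of a PR) is an assumption on my part and not something any particular CI provider is known to do; UploadedCache and latestCacheForBranch are hypothetical names.

```typescript
// Hypothetical metadata recorded whenever a build uploads its cache.
interface UploadedCache {
  branch: string;
  uploadedAt: number; // e.g. a Unix timestamp in milliseconds
  location: string; // where the archive lives, e.g. an object-store path
}

// Returns the most recently uploaded cache for the branch, optionally
// falling back to the newest cache of another branch.
function latestCacheForBranch(
  caches: UploadedCache[],
  branch: string,
  fallbackBranch?: string
): UploadedCache | undefined {
  const newestOn = (name: string): UploadedCache | undefined =>
    caches
      .filter((c) => c.branch === name)
      .reduce<UploadedCache | undefined>(
        (best, c) => (best === undefined || c.uploadedAt > best.uploadedAt ? c : best),
        undefined
      );

  const own = newestOn(branch);
  if (own !== undefined) {
    return own;
  }
  return fallbackBranch !== undefined ? newestOn(fallbackBranch) : undefined;
}
```

Unlike the key-based approach, nothing here guarantees that the restored cache matches the current source, so this only works if the build tool itself can validate and incrementally update the cache it is given.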

Additionally, this depends on there being a way of caching the build of applications in a highly granular way (Snowpack might be the best case here?). If we're continuing to use webpack, we would need to upgrade to webpack 5, which adds support for a persistent cache. See the following comment:

The latest beta of webpack 5 has support for persistent caching, which would improve the speed of builds in the majority of cases by re-using previous work. [...]

Unfortunately, CRA doesn’t yet support webpack 5 (although I did start some work towards this upgrade back in March). Since Next.js currently supports webpack 5, if required we should have the necessary context to finish any upgrade to CRA.
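For reference, enabling the persistent cache in webpack 5 is a small configuration change. The fragment below is only a sketch: the cache directory is an arbitrary choice for illustration, and a real config would also need entry, output and loader settings.

```typescript
import * as path from "node:path";

// webpack.config.ts (sketch): only the cache-related settings are shown.
const config = {
  mode: "production",
  cache: {
    // Persist the cache to disk so it can be reused between builds.
    type: "filesystem",
    // Assumed location, chosen so that CI can save and restore one directory.
    cacheDirectory: path.resolve(process.cwd(), ".cache/webpack"),
  },
};

export default config;
```

The CI job would then restore this directory before the build and upload it afterwards, which is where a remote cache comes in.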

Open-sourcing the backend as "Modular Cloud"

If we'd like Modular to scale large development workflows outside the firm, we should consider open-sourcing a version of the internal *-cache backend without any logic specific to internal services. We could call it 'Modular Cloud', as it would be analogous to 'Nx Cloud'.

The other possibility would be to allow users to swap out the cache backend used by Modular.

@NMinhNguyen
Contributor

NMinhNguyen commented Oct 5, 2020

Not at all convinced this is a good idea, but Jest has some logic to figure out changed files and dependencies between files:

It is also possible to define custom Jest runners: https://www.youtube.com/watch?v=U_IYuAXtJZ0

Might be an interesting experiment to see if these packages can be used to detect changed modules/packages?

@sebinsua sebinsua changed the title Build Caching / Incremental Builds Build & Test Caching / Incremental Builds / "Modular Cloud" (Remote Computation Cache) Oct 6, 2020
@sebinsua
Contributor Author

sebinsua commented Oct 6, 2020

Btw, this issue assumes that we want to build and deploy all of the widgets/views within the repository and that this repository will grow very large because it's a single repository for (almost) all applications.

If that wasn't the case, or we didn't want to use caching to solve this, we could go back to the original 'module federation' solution that @NMinhNguyen and I created back in April (and only use 'affected' logic). Also, there are other solutions described here (e.g. lazy compilation).

However, presumably we'd still need caching for coverage reports of large repositories, since I understand that they're required for all of the source code.

@NMinhNguyen
Contributor

Unfortunately, CRA doesn’t yet support webpack 5 (although I did start some work towards this upgrade back in March). Since Next.js currently supports webpack 5, if required we should have the necessary context to finish any upgrade to CRA.

Related to this is @tannerlinsley's gist "Replacing Create React App with the Next.js CLI".

How dare you make a jab at Create React App!?

Firstly, Create React App is good. But it's a very rigid CLI, primarily designed for projects that require very little to no configuration. This makes it great for beginners and simple projects but unfortunately, this means that it's pretty non-extensible. Despite the involvement from big names and a ton of great devs, it has left me wanting a much better developer experience with a lot more polish when it comes to hot reloading, babel configuration, webpack configuration, etc. It's definitely simple and good, but not amazing.

Now, compare that experience to Next.js which for starters has a much larger team behind it provided by a world-class company (Vercel) who are all financially dedicated to making it the best DX you could imagine to build any React application. Next.js is the 💣-diggity. It has amazing docs, great support, can grow with your requirements into SSR or static site generation, etc.

It would be worth investigating how a Next.js app using webpack 5 works with persistent caching (perhaps using the internal cache tool we've built, or by experimenting on a branch on GitHub using the GitHub Cache Action).

@threepointone
Contributor

tracking issue for webpack 5 support in create react app facebook/create-react-app#9994

Feels like something we could potentially contribute to ourselves.


4 participants