Investigating solutions for builds at scale #32

Closed
wants to merge 1 commit into from

Conversation

@caylahamann caylahamann commented Aug 26, 2020

Investigating solution for builds at scale

Research Goals

  • Identify how long builds will take at scale (perform tests on our current site)
  • Identify possible optimizations / methods to decrease our build time

closes #17

Findings

Currently:

  • Builds take ~70 sec with the developer site for 119 pages
    • 10 seconds to build static HTML
    • Extrapolating ⇒ would take 420 seconds (or 7 minutes) to build around 5000 pages (assuming build time increases linearly with page count)
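
The 420 figure matches scaling the 10-second static HTML step from 119 pages to 5000 pages. A minimal sketch of that arithmetic, assuming build time grows linearly with page count (the function name is only illustrative):

```python
def extrapolate_build_seconds(measured_seconds, measured_pages, target_pages):
    """Scale a measured build time linearly to a larger page count."""
    seconds_per_page = measured_seconds / measured_pages
    return seconds_per_page * target_pages


# Static HTML step: 10 seconds for 119 pages, scaled to 5000 pages.
html_seconds = extrapolate_build_seconds(10, 119, 5000)
print("%.0f seconds (%.1f minutes)" % (html_seconds, html_seconds / 60))
# prints "420 seconds (7.0 minutes)"
```

If the real growth is not linear (which the open question below worries about), this estimate will be off.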

Option 1: use cache API to speed up builds

  • cache API is passed to Gatsby's Node APIs
  • two functions we can use: set cache value and get cache value
  • typically plugins will implement this to optimize build times; we can try to create our own plugin or use the API in a simpler form to get faster builds
  • currently there are no plugins for AWS Amplify (there are for S3, Gatsby Cloud)
  • level of work: maybe somebody else would be able to gauge this? To me this seems like a high level of work because we would need to investigate Gatsby's internals and figure out how to cache our information, but maybe it's pretty simple to do
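
To make the get/set idea concrete, here is a rough sketch of the pattern in Python. This is not Gatsby's cache API (that is JavaScript and passed to the Node APIs); `FileCache` and `transform_with_cache` are hypothetical names, and the sketch only assumes that an expensive transform can be keyed by a hash of its input:

```python
import hashlib
import json
import os


class FileCache:
    """Tiny key/value cache persisted as JSON files in a directory."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        # Hash the key so it is always a safe file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, digest + ".json")

    def get(self, key):
        """Return the cached value, or None when the key is missing."""
        try:
            with open(self._path(key)) as f:
                return json.load(f)
        except FileNotFoundError:
            return None

    def set(self, key, value):
        """Store a JSON-serializable value under the key."""
        with open(self._path(key), "w") as f:
            json.dump(value, f)


def transform_with_cache(cache, source, transform):
    """Run `transform` only when this exact source text has not been seen."""
    key = "transform-" + hashlib.sha256(source.encode("utf-8")).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = transform(source)
    cache.set(key, result)
    return result
```

The cache directory would have to survive between builds for this to help, which is the same persistence question that comes up for Option 3.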

Option 2: change our cloud hosting provider

Option 3: Conditional page builds

  • Experimental feature from Gatsby
  • Doesn't re-render HTML for pages with unchanged data
  • Feature works by comparing the page data from the previous build to the new page data. This will create a list of page directories that are passed to the static build process
  • .cache and public directories will need to be persisted between builds (I believe this is the case right now but I could be wrong)
  • Any code changes (in templates, components, plugins, etc.) will trigger a full build
  • Tried it out with Developer site: decreased build to 20-25 seconds
    • 4 seconds to build static HTML pages
  • Not sure how exactly this would scale
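
The comparison step described above can be sketched roughly like this. It is only an illustration of the idea (compare each page's data with the previous build and collect the pages that differ), not Gatsby's actual implementation; the function names and the use of SHA-256 digests are assumptions:

```python
import hashlib
import json


def page_data_digest(page_data):
    """Hash page data with sorted keys so equal data gives equal digests."""
    encoded = json.dumps(page_data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def pages_to_rebuild(previous_digests, current_pages):
    """Return paths whose data is new or changed since the previous build."""
    changed = []
    for path, data in current_pages.items():
        if previous_digests.get(path) != page_data_digest(data):
            changed.append(path)
    return sorted(changed)


# Digests saved from the previous build (hypothetical pages).
previous = {
    "/a": page_data_digest({"title": "A"}),
    "/b": page_data_digest({"title": "B"}),
}
# Page data from the new build: /b was edited and /c is new.
current = {
    "/a": {"title": "A"},
    "/b": {"title": "B (edited)"},
    "/c": {"title": "C"},
}
print(pages_to_rebuild(previous, current))  # prints ['/b', '/c']
```

This also shows why the previous build's output has to be persisted: without the old digests, every page looks changed.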

Option 4: Do nothing?

  • For minor changes (especially copy-only changes), we don't necessarily have to wait for a build on Amplify, so we can merge to main before the PR build finishes

Recommendation

I am not sure! Would love some feedback from the devs on this. Can talk through all the options above. It seems like the easiest and least painful option would be Option 3 (though it's also potentially risky?), and we might run into issues later.

Open Questions

  • How can I test for builds of 5000 pages? I'm currently just extrapolating, but the results might be very different once we actually have 5000 pages

Resources

Resources all linked above throughout the document!

@caylahamann caylahamann changed the base branch from develop to research August 26, 2020 14:55

zstix commented Aug 26, 2020

I'm still reading through the research findings, but I thought it would be helpful to link this thread on the Gatsby repository where the maintainers sent a request about large-scale sites using Gatsby: gatsbyjs/gatsby#19512

zstix commented Aug 26, 2020

This is all pretty promising stuff. I was worried we would be looking at 30+ minute build times. I know it's a rough estimate, but if the build time is around the 7 minute mark, that's not terrible. To me, that makes Option 4 (do nothing) a potentially viable route.

I'd be curious to do a little more digging into Option 1 (using a cache). If we could cache all pages and only update new pages, that could be a pretty nice solution (something similar to the process of Option 3). That said, I'm not sure if we have the ability to do that with Gatsby's caching system.

Speaking of Option 3 (conditional builds), you mentioned that it "Doesn't re-render HTML for pages with unchanged data". That makes me think that Gatsby is only looking at the markdown data. Does this mean that if we update the JS code for a page (i.e. the template or styles) the page won't update? If so, that might make local development for us a little trickier.

The only option I don't think is on the table is Option 2 (change hosting). Gatsby Cloud looks like it would give us a lot of benefits, but we would need to approve pricing, build a new CI system, etc. Not impossible, but probably not our first choice option.

My gut reaction is to go with Option 4 (do nothing) and get a better sense for how impactful the problem is. If it's too slow, we could look into implementing Option 1 (cache) or Option 3 (conditional builds). Curious to hear what other folks think!

@roadlittledawn

still reading, but if we do indeed create automatic directory index pages as well (for each directory that doesn't already have an index.mdx landing page), that would probably increase build time as well. the current site has ~500 "directories." even after simplifying the IA a bit, i imagine we'd still have 3/4 of that or ~375 directories. using the assumed / extrapolated rate, it would increase build time to ~7.5 minutes

roadlittledawn commented Aug 26, 2020

Re: @zstix's comment on Option 3. not sure i'm fully understanding, but that Gatsby doc does say:

Any code changes (templates, components, source handling, new plugins etc) will prompt the creation of a new webpack compilation hash and trigger a full build.

So if a PR only really has changes to markdown/content, it just looks at those files and builds them. If a PR contains changes to any other part of the site, it fully rebuilds.

If that's the case, that seems like a decent option since many of the PRs will likely be content only, and probably relatively small.

+1 to Option 4 though too. since the site can be easily run locally and i assume content authors will be doing that while making content changes, i can imagine that creating builds for each PR may not be particularly useful / necessary. maybe we just run the build once per day then?

@caylahamann

Yeah so (supposedly) conditional builds (Option 3) won't trigger a full rebuild on markdown content changes, but they would for any code change. So code changes would always get a full rebuild, and content-only changes, which are likely to be the most common change for content creators, would only rebuild the affected pages.

Still, I'm not sure this is all necessary and I'd like to do more diving into the conditional builds before we use it. Option 4 seems to be the best option?

zstix commented Aug 26, 2020

One thing I didn't originally consider: does the time it takes to start the local dev server (via npm start) get impacted as the site scales in content? If so, then 7+ minutes might not be reasonable.

We might want to run a quick test where we add a bunch of dummy pages to the dev site to scale it up to the docs site and run some gatsby commands. I'm happy to help with that (I have a few bash scripts in mind that could help with this).
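
For the "run some gatsby commands" part, a small timing helper could look like the sketch below. It is only a suggestion written in Python rather than bash; `time_command` is a hypothetical helper, and which commands to time (for example `npm run build`) is an assumption about how the test would be run:

```python
import subprocess
import time


def time_command(command):
    """Run a command and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(command)
    elapsed = time.perf_counter() - start
    return completed.returncode, elapsed
```

Calling `time_command(["npm", "run", "build"])` from the repository root before and after adding the dummy pages would give two wall-clock numbers to compare. It would not work well for `npm start`, since the dev server keeps running instead of exiting.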

@caylahamann

@zstix so running npm start shows that it takes around 9 seconds to transform and source nodes and 6 seconds to run page queries. If you would be able to write a script to add a bunch of dummy pages that would be awesome! I was trying to do this yesterday but it was taking me a little bit of time.

zstix commented Aug 26, 2020

Here's a quick python script I wrote to add some dummy pages to the developer site:

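#!/usr/bin/env python3
"""Generate dummy MDX pages to test build times at scale."""
import os

OUTPUT_DIR = "src/markdown-pages/dummy"

# lstrip() drops the leading newline so the front matter starts on line 1.
CONTENT = """
---
path: '/{path}'
title: 'Dummy Page'
description: 'This is a dummy page'
template: 'GuideTemplate'
---

This is a dummy page!
""".lstrip()


def create_page(path):
    """Write one dummy page to OUTPUT_DIR/<path>.mdx."""
    filename = os.path.join(OUTPUT_DIR, "%s.mdx" % path)
    print("Creating %s" % filename)
    with open(filename, "w") as f:
        f.write(CONTENT.format(path=path))


if __name__ == "__main__":
    # Create the target directory if it does not exist yet.
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    # range(5001) creates page-0 through page-5000.
    for i in range(5001):
        create_page("page-%s" % i)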

After using this to create 9,000 pages, we ran npm start and my machine quickly locked up. We then reduced the number of pages to roughly 5,000 and then tried running npm run build. Unfortunately, the process crashed after about 15 minutes.

These findings make me feel a little less confident about Option 4 (do nothing).

@caylahamann

Running gatsby build for 5000 files takes around 8 minutes. npm start hasn't finished yet and is driving up my CPU usage; it seems to be stuck on source and transform nodes. I looked into some possible solutions, but it seems that we would have to dive deep into our internals. We would need to figure out how to cache these nodes for development.

@caylahamann

so the issue seems to be tied to when the MDX is being transformed. I am going to do some investigating into Gatsby internals to see if there are any possible optimizations

@jerelmiller

I think it might be best to start with option 4 and see where we end up. Ideally we have something in our back pocket in case we run into some crazy scaling issues like mentioned above. Seems that Gatsby thread has some sites upward of ~25k - 50k pages, which are 5-10x this site. Seems Gatsby has this kind of scale in mind, so I think we probably should measure first before prematurely optimizing.

My guess is the type of content/types of plugins we have might have an effect. For example, if we have 10,000 images, that might have a much bigger impact on build time than 10,000 mdx files. My recommendation is that we get everything migrated over and see where our pain points are. That might also give us a good idea of where to apply the optimization (do we need to use a specific type of image, do we need to use plugin x instead of y, etc).

zstix commented Sep 1, 2020

That's a good point. We may be worrying about a problem that doesn't exist.

Option 4 (do nothing) seems like the decision (for now). We can always address this if it truly becomes a problem and it seems like we have some viable options.

@zstix zstix closed this Sep 1, 2020
@jpvajda jpvajda deleted the cayla/research-builds branch December 16, 2020 16:58
Development

Successfully merging this pull request may close these issues.

[Misc. Research] Research and identify how long builds will take at scale