Investigating solutions for builds at scale #32
I'm still reading through the research findings, but I thought it would be helpful to link this thread on the Gatsby repository, where the maintainers asked for feedback from large-scale sites using Gatsby: gatsbyjs/gatsby#19512
This is all pretty promising stuff. I was worried we would be looking at 30+ minute build times. I know it's a rough estimate, but if the build time is around the 7 minute mark, that's not terrible. To me, that makes Option 4 (do nothing) a potentially viable route.

I'd be curious to do a little more digging into Option 1 (using a cache). If we could cache all pages and only update new pages, that could be a pretty nice solution (something similar to the process of Option 3). That said, I'm not sure if we have the ability to do that with Gatsby's caching system.

Speaking of Option 3 (conditional builds), you mentioned that it "Doesn't re-render HTML for pages with unchanged data". That makes me think that Gatsby is only looking at the markdown data. Does this mean that if we update the JS code for a page (i.e. the template or styles), the page won't update? If so, that might make local development a little trickier for us.

The only option I don't think is on the table is Option 2 (change hosting). Gatsby Cloud looks like it would give us a lot of benefits, but we would need to approve pricing, build a new CI system, etc. Not impossible, but probably not our first choice.

My gut reaction is to go with Option 4 (do nothing) and get a better sense of how impactful the problem is. If it's too slow, we could look into implementing Option 1 (cache) or Option 3 (conditional builds). Curious to hear what other folks think!
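To make the "cache all pages and only update new pages" idea a bit more concrete, here is a minimal sketch of detecting which content files changed since the previous build by comparing content hashes against a saved manifest. This is not Gatsby's cache API and I haven't checked how Gatsby does it internally; the function names, the manifest file, and the `*.mdx` glob are all assumptions made up for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def hash_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_changed_pages(content_dir: Path, manifest_path: Path) -> List[str]:
    """Return content files that are new or modified since the last call.

    The manifest (a hypothetical JSON file) maps relative paths to hashes
    and is rewritten on every call so the next run compares against it.
    """
    previous = {}
    if manifest_path.exists():
        previous = json.loads(manifest_path.read_text())

    current = {
        page.relative_to(content_dir).as_posix(): hash_file(page)
        for page in sorted(content_dir.rglob("*.mdx"))
    }
    changed = [name for name, digest in current.items()
               if previous.get(name) != digest]

    manifest_path.write_text(json.dumps(current, indent=2))
    return changed
```

On a first run every page is reported as changed; after that, only new or edited files come back. Whether Gatsby's own cache can act on that information is still the open question.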
still reading, but if we do indeed create automatic directory index pages as well (for each directory that doesn't already have an
Re: @zstix comment on Option 3. not sure i'm fully understanding, but that Gatsby doc does say:
So if a PR only really has changes to markdown/content, it just looks at those files and builds them. If a PR contains changes to any other part of the site, it fully rebuilds. If that's the case, that seems like a decent option, since many of the PRs will likely be content-only and probably relatively small.

+1 to Option 4 too, though. Since the site can easily be run locally, and I assume content authors will be doing that while making content changes, I can imagine that creating builds for each PR may not be particularly useful or necessary. Maybe we just run the build once per day then?
Yeah, so (supposedly) conditional builds (Option 3) won't trigger full rebuilds on markdown content changes, but they will for any code change. So there wouldn't be an issue of a full rebuild not triggering when there are only content changes, which are likely to be the most common change for content creators. Still, I'm not sure this is all necessary, and I'd like to dig further into conditional builds before we use them. Option 4 seems to be the best option?
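The rule being discussed (content-only changes build incrementally, anything else triggers a full rebuild) can be sketched as a check over a PR's changed file list. This is only an illustration of the decision, not how Gatsby implements conditional page builds; `CONTENT_ROOT`, `CONTENT_SUFFIXES`, and both function names are assumptions, with the content directory borrowed from the `src/markdown-pages` path used elsewhere in this thread.

```python
from pathlib import PurePosixPath
from typing import Iterable

# Assumed layout: all page content lives under this directory.
CONTENT_ROOT = PurePosixPath("src/markdown-pages")
CONTENT_SUFFIXES = {".md", ".mdx"}


def is_content_file(changed_file: str) -> bool:
    """True if the path is a markdown/MDX file under the content directory."""
    path = PurePosixPath(changed_file)
    return path.suffix in CONTENT_SUFFIXES and CONTENT_ROOT in path.parents


def needs_full_rebuild(changed_files: Iterable[str]) -> bool:
    """True if any changed file is something other than page content."""
    return any(not is_content_file(name) for name in changed_files)
```

For example, a PR touching only `src/markdown-pages/dummy/page-1.mdx` would not need a full rebuild under this rule, while one that also edits a template file would.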
One thing I didn't originally consider: does the time it takes to start the local dev server (via
We might want to run a quick test where we add a bunch of dummy pages to the dev site to scale it up to the size of the docs site and run some gatsby commands. I'm happy to help with that (I have a few bash scripts in mind that could help with this).
@zstix so running
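Here's a quick Python script I wrote to add some dummy pages to the developer site:

#!/usr/bin/env python
import os

# Front matter must start on the first line of the file,
# hence the backslash right after the opening quotes.
CONTENT = """\
---
path: '/{path}'
title: 'Dummy Page'
description: 'This is a dummy page'
template: 'GuideTemplate'
---
This is a dummy page!
"""

OUTPUT_DIR = "src/markdown-pages/dummy"


def createPage(path):
    """Write one dummy .mdx page whose front matter path is /<path>."""
    content = CONTENT.format(path=path)
    filename = os.path.join(OUTPUT_DIR, "%s.mdx" % path)
    print("Creating %s" % filename)
    with open(filename, "w") as f:
        f.write(content)


if __name__ == "__main__":
    # Make sure the target directory exists before writing into it.
    if not os.path.isdir(OUTPUT_DIR):
        os.makedirs(OUTPUT_DIR)
    for i in range(5001):
        createPage("page-%s" % i)

After using this to create 9,000 pages, we ran
These findings make me feel a little less confident about Option 4 (do nothing).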
Running
So the issue seems to be tied to when the MDX is being transformed. I am going to do some investigating into Gatsby internals to see if there are any possible optimizations.
I think it might be best to start with Option 4 and see where we end up. Ideally we have something in our back pocket in case we run into some crazy scaling issues like those mentioned above. That Gatsby thread has some sites with upward of ~25k-50k pages, which is 5-10x the size of this site. Gatsby seems to have this kind of scale in mind, so I think we should measure first before prematurely optimizing.

My guess is that the type of content and the plugins we use might have an effect. For example, 10,000 images might have a much bigger impact on build time than 10,000 MDX files. My recommendation is that we get everything migrated over and see where our pain points are. That might also give us a good idea of where to apply optimizations (do we need to use a specific type of image, do we need to use plugin x instead of y, etc.).
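For the "measure first" part, a small timing wrapper is probably enough to compare runs before and after the migration. This is a rough sketch, not an established benchmark for this repo; `time_command` is a hypothetical helper name, and it only records wall-clock time and the exit code.

```python
import subprocess
import time
from typing import List, Tuple


def time_command(command: List[str]) -> Tuple[float, int]:
    """Run a command and return (elapsed wall-clock seconds, exit code)."""
    start = time.perf_counter()
    # check=False: a failed build should still report how long it took.
    result = subprocess.run(command, check=False)
    elapsed = time.perf_counter() - start
    return elapsed, result.returncode
```

Calling `time_command(["gatsby", "build"])` from the site root (assuming the Gatsby CLI is on the PATH) would give one data point; running it a few times with different page counts or plugin sets would show which of those actually drives the build time.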
That's a good point. We may be worrying about a problem that doesn't exist. Option 4 (do nothing) seems like the decision (for now). We can always address this if it truly becomes a problem, and it seems like we have some viable options.
Investigating solution for builds at scale
Research Goals
closes #17
Findings
Currently:
Option 1: use cache API to speed up builds
Option 2: change our cloud hosting provider
gatsby-plugin-s3 configuration
gatsby-plugin-netlify-cache
Option 3: Conditional page builds
Option 4: Do nothing?
Recommendation
I am not sure! I would love some feedback from the devs on this, and I can talk through all the options above. It seems like the easiest and least painful option would be Option 3 (but also potentially risky?), though we might run into issues later.
Open Questions
Resources
Resources all linked above throughout the document!