📝 infra blog post (#1127)

* 📝 first draft blog * 📝 article update * 📝 blog * 📝 improve blog post * 📝blog post improvment * 📝blog post improvment
openstatusHQ · Dec 29, 2024 · 99d3a10 · 99d3a10
1 parent 917979b
commit 99d3a10
Show file tree

Hide file tree

Showing 5 changed files with 117 additions and 1 deletion.
diff --git a/apps/web/.env.example b/apps/web/.env.example
@@ -72,4 +72,7 @@ SLACK_SUPPORT_WEBHOOK_URL=
 WORKSPACES_LOOKBACK_30=
 
 # https://turbo.build/repo/docs/crafting-your-repository/using-environment-variables#loose-mode
-TURBO_ENV_MODE=loose
+TURBO_ENV_MODE=loose
+
+OPENPANEL_CLIENT_ID=something
+OPENPANEL_CLIENT_SECRET=something
diff --git a/apps/web/public/assets/posts/infra-openstatus/hosting.png b/apps/web/public/assets/posts/infra-openstatus/hosting.png
diff --git a/apps/web/public/assets/posts/infra-openstatus/queue.png b/apps/web/public/assets/posts/infra-openstatus/queue.png
diff --git a/apps/web/public/assets/posts/infra-openstatus/tech-infra.png b/apps/web/public/assets/posts/infra-openstatus/tech-infra.png
diff --git a/apps/web/src/content/posts/openstatus-infra.mdx b/apps/web/src/content/posts/openstatus-infra.mdx
@@ -0,0 +1,113 @@
+---
+title: "Building OpenStatus: A Deep Dive into Our Infrastructure Architecture"
+description:
+  Let's deep dive in the infra behind OpenStatus. Learn how we built it and how we host it.
+author:
+  name: Thibault Le Ouay Ducasse
+  url: https://bsky.app/profile/thibaultleouay.dev
+publishedAt: 2024-12-29
+tag: engineering
+image: /assets/posts/infra-openstatus/tech-infra.png
+---
+
+## Infrastructure Overview
+
+OpenStatus is a synthetic monitoring platform designed with resilience, scalability, and efficiency in mind.
+Our users rely on us to provide real-time insights into their service health, making it essential to maintain a robust and performant infrastructure.
+
+In this post, we'll take a deep dive into our infrastructure architecture, exploring the key components, managed services, and design principles that power OpenStatus.
+
+## Application Landscape
+
+Our platform consists of several interconnected applications, each designed for a specific purpose:
+
+1. **Frontend Ecosystem**:
+    - A NextJS application that powers our marketing site, user dashboard, and status page hosted on [Vercel](https://vercel.com/).
+    - An Astro + Starlight-powered documentation application hosted on [Cloudflare Pages](https://pages.cloudflare.com/).
+
+We chose Vercel for the Next.js application because it performs exceptionally well there, the DX is great. And we selected Cloudflare Pages for the documentation since it is a static site and it's super cheap.
+
+2. **Backend Infrastructure** All our backend services are hosted on Fly.io.
+    - API server: Our public API and our alerting engine
+    - Probes/Checker: a golang app deployed globally to monitor your service
+    - Screenshot app: a service that takes screenshot of your website when we detect an downtime (Playwright)
+    - Workflow engine: a server that handles the workflow of alerting,  and our internal workflows (email automation).
+
+We chose Fly.io for our backend services because it's a great platform for deploying globally distributed services. It's also very easy to deploy and manage.
+We are planning to add more providers (e.g. Koyeb) to our probes to have a more resilient system.
+
+<Image
+  alt="Hosting providers"
+  src="/assets/posts/infra-openstatus/hosting.png"
+  width={650}
+  height={575}
+/>
+
+
+## Managed Services
+
+We also rely heavily on managed services to avoid handling it ourselves. Here are the services we use:
+
+### Scheduling
+
+Recognizing the critical nature of monitoring, we've heavily rely on CRON  to ensure timely checks:
+
+- **Cron Jobs**: Currently using Vercel Cron, with plans to migrate to Google Cron for an enhanced user experience (better UI e.g. we can see when the cron ran, retry policy).
+
+
+### Queue Architecture
+
+Due to the critical nature of checks, we are using a queue to handle task processing and retry logic:
+
+Every check is pushed to a queue and processed by our probes. If the probe fails to process the check, it is retried 3 times before being marked as failed.
+
+- **Job Queue**: Google Task Queues provide our distributed task management, with strategically segmented queues for different check frequencies
+
+We've implemented a granular queue system to ensure efficient task processing, each queue is dedicated to a specific check frequency (e.g. every minute, every 10 minutes).
+
+
+<Image
+  alt="Queue providers"
+  src="/assets/posts/infra-openstatus/queue.png"
+  width={650}
+  height={575}
+/>
+
+
+### Data Infrastructure
+
+We also don't want to handle the data infrastructure by ourselves. We rely on managed services for that:
+
+- **Primary Database**: [Turso](https://turso.tech?ref=openstatus.dev), providing a cost efficient data storage solution. We love the fact that's it's hosted SQLite database. It's just a file we can embedded in our services and sync it periodically.
+- **Analytics Database**: [Tinybird](https://www.tinybird.co?ref=openstatus.dev), enabling complex analytical queries and insights.
+
+## Design Philosophy
+
+Our infrastructure design is driven by several key principles:
+
+- **Resilience**: Ensuring high availability and fault tolerance
+- **Scalability**: Architectural choices that allow seamless growth
+- **Cost-Efficiency**: Leveraging managed services and cloud credits
+- **Performance**: Optimizing each component for maximum efficiency.
+
+
+## How much does it cost us?
+
+Our current monthly cost is around $328. This includes:
+
+- Vercel: $40, we are two members in the team, so we had to upgrade to the team plan.
+- Fly.io: $154   36*4 (all our probes at $4 average, not all regions cost the same) + $10 (for the api server)
+- Google Cloud Platform: $0 (We are still using the free credits, but we expect to pay around $50 for the queue)
+- Tinybird: $100
+- Turso: $29
+- Cloudfare: $5
+
+
+
+# Conclusion
+
+Building a resilient synthetic monitoring platform is hard. It's not just a $5 VPS that you can deploy and forget. It requires a more complex infrastructure to be able to provide a reliable service.
+
+The drawback of this approach is the complexity of providing an easy self hostable services. Which is annoying because we are an open-source project and we want to provide a self-hostable version of OpenStatus. But we are working on community edition that will be easier to deploy.
+
+*Want to start monitoring your services with OpenStatus? [Sign up for free](https://www.openstatus.dev/app/login?ref=blogpost-infra) and get started today!*