diff --git a/apps/web/.env.example b/apps/web/.env.example index f97b006ceb..c56610fd7c 100644 --- a/apps/web/.env.example +++ b/apps/web/.env.example @@ -72,4 +72,7 @@ SLACK_SUPPORT_WEBHOOK_URL= WORKSPACES_LOOKBACK_30= # https://turbo.build/repo/docs/crafting-your-repository/using-environment-variables#loose-mode -TURBO_ENV_MODE=loose \ No newline at end of file +TURBO_ENV_MODE=loose + +OPENPANEL_CLIENT_ID=something +OPENPANEL_CLIENT_SECRET=something diff --git a/apps/web/public/assets/posts/infra-openstatus/hosting.png b/apps/web/public/assets/posts/infra-openstatus/hosting.png new file mode 100644 index 0000000000..291845aa03 Binary files /dev/null and b/apps/web/public/assets/posts/infra-openstatus/hosting.png differ diff --git a/apps/web/public/assets/posts/infra-openstatus/queue.png b/apps/web/public/assets/posts/infra-openstatus/queue.png new file mode 100644 index 0000000000..794a203d39 Binary files /dev/null and b/apps/web/public/assets/posts/infra-openstatus/queue.png differ diff --git a/apps/web/public/assets/posts/infra-openstatus/tech-infra.png b/apps/web/public/assets/posts/infra-openstatus/tech-infra.png new file mode 100644 index 0000000000..4787354d9a Binary files /dev/null and b/apps/web/public/assets/posts/infra-openstatus/tech-infra.png differ diff --git a/apps/web/src/content/posts/openstatus-infra.mdx b/apps/web/src/content/posts/openstatus-infra.mdx new file mode 100644 index 0000000000..4df69c32d4 --- /dev/null +++ b/apps/web/src/content/posts/openstatus-infra.mdx @@ -0,0 +1,113 @@ +--- +title: "Building OpenStatus: A Deep Dive into Our Infrastructure Architecture" +description: + Let's deep dive in the infra behind OpenStatus. Learn how we built it and how we host it. +author: + name: Thibault Le Ouay Ducasse + url: https://bsky.app/profile/thibaultleouay.dev +publishedAt: 2024-12-29 +tag: engineering +image: /assets/posts/infra-openstatus/tech-infra.png +--- + +## Infrastructure Overview + +OpenStatus is a synthetic monitoring platform designed with resilience, scalability, and efficiency in mind. +Our users rely on us to provide real-time insights into their service health, making it essential to maintain a robust and performant infrastructure. + +In this post, we'll take a deep dive into our infrastructure architecture, exploring the key components, managed services, and design principles that power OpenStatus. + +## Application Landscape + +Our platform consists of several interconnected applications, each designed for a specific purpose: + +1. **Frontend Ecosystem**: + - A NextJS application that powers our marketing site, user dashboard, and status page hosted on [Vercel](https://vercel.com/). + - An Astro + Starlight-powered documentation application hosted on [Cloudflare Pages](https://pages.cloudflare.com/). + +We chose Vercel for the Next.js application because it performs exceptionally well there, the DX is great. And we selected Cloudflare Pages for the documentation since it is a static site and it's super cheap. + +2. **Backend Infrastructure** All our backend services are hosted on Fly.io. + - API server: Our public API and our alerting engine + - Probes/Checker: a golang app deployed globally to monitor your service + - Screenshot app: a service that takes screenshot of your website when we detect an downtime (Playwright) + - Workflow engine: a server that handles the workflow of alerting, and our internal workflows (email automation). + +We chose Fly.io for our backend services because it's a great platform for deploying globally distributed services. It's also very easy to deploy and manage. +We are planning to add more providers (e.g. Koyeb) to our probes to have a more resilient system. + +Hosting providers + + +## Managed Services + +We also rely heavily on managed services to avoid handling it ourselves. Here are the services we use: + +### Scheduling + +Recognizing the critical nature of monitoring, we've heavily rely on CRON to ensure timely checks: + +- **Cron Jobs**: Currently using Vercel Cron, with plans to migrate to Google Cron for an enhanced user experience (better UI e.g. we can see when the cron ran, retry policy). + + +### Queue Architecture + +Due to the critical nature of checks, we are using a queue to handle task processing and retry logic: + +Every check is pushed to a queue and processed by our probes. If the probe fails to process the check, it is retried 3 times before being marked as failed. + +- **Job Queue**: Google Task Queues provide our distributed task management, with strategically segmented queues for different check frequencies + +We've implemented a granular queue system to ensure efficient task processing, each queue is dedicated to a specific check frequency (e.g. every minute, every 10 minutes). + + +Queue providers + + +### Data Infrastructure + +We also don't want to handle the data infrastructure by ourselves. We rely on managed services for that: + +- **Primary Database**: [Turso](https://turso.tech?ref=openstatus.dev), providing a cost efficient data storage solution. We love the fact that's it's hosted SQLite database. It's just a file we can embedded in our services and sync it periodically. +- **Analytics Database**: [Tinybird](https://www.tinybird.co?ref=openstatus.dev), enabling complex analytical queries and insights. + +## Design Philosophy + +Our infrastructure design is driven by several key principles: + +- **Resilience**: Ensuring high availability and fault tolerance +- **Scalability**: Architectural choices that allow seamless growth +- **Cost-Efficiency**: Leveraging managed services and cloud credits +- **Performance**: Optimizing each component for maximum efficiency. + + +## How much does it cost us? + +Our current monthly cost is around $328. This includes: + +- Vercel: $40, we are two members in the team, so we had to upgrade to the team plan. +- Fly.io: $154 36*4 (all our probes at $4 average, not all regions cost the same) + $10 (for the api server) +- Google Cloud Platform: $0 (We are still using the free credits, but we expect to pay around $50 for the queue) +- Tinybird: $100 +- Turso: $29 +- Cloudfare: $5 + + + +# Conclusion + +Building a resilient synthetic monitoring platform is hard. It's not just a $5 VPS that you can deploy and forget. It requires a more complex infrastructure to be able to provide a reliable service. + +The drawback of this approach is the complexity of providing an easy self hostable services. Which is annoying because we are an open-source project and we want to provide a self-hostable version of OpenStatus. But we are working on community edition that will be easier to deploy. + +*Want to start monitoring your services with OpenStatus? [Sign up for free](https://www.openstatus.dev/app/login?ref=blogpost-infra) and get started today!* \ No newline at end of file