Skip to content

Commit

Permalink
📝 infra blog post (#1127)
Browse files Browse the repository at this point in the history
* 📝 first draft blog

* 📝 article update

* 📝 blog

* 📝 improve blog post

* 📝blog post improvment

* 📝blog post improvment
  • Loading branch information
thibaultleouay authored Dec 29, 2024
1 parent 917979b commit 99d3a10
Show file tree
Hide file tree
Showing 5 changed files with 117 additions and 1 deletion.
5 changes: 4 additions & 1 deletion apps/web/.env.example
Original file line number Diff line number Diff line change
Expand Up @@ -72,4 +72,7 @@ SLACK_SUPPORT_WEBHOOK_URL=
WORKSPACES_LOOKBACK_30=

# https://turbo.build/repo/docs/crafting-your-repository/using-environment-variables#loose-mode
TURBO_ENV_MODE=loose
TURBO_ENV_MODE=loose

OPENPANEL_CLIENT_ID=something
OPENPANEL_CLIENT_SECRET=something
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
113 changes: 113 additions & 0 deletions apps/web/src/content/posts/openstatus-infra.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
---
title: "Building OpenStatus: A Deep Dive into Our Infrastructure Architecture"
description:
Let's deep dive in the infra behind OpenStatus. Learn how we built it and how we host it.
author:
name: Thibault Le Ouay Ducasse
url: https://bsky.app/profile/thibaultleouay.dev
publishedAt: 2024-12-29
tag: engineering
image: /assets/posts/infra-openstatus/tech-infra.png
---

## Infrastructure Overview

OpenStatus is a synthetic monitoring platform designed with resilience, scalability, and efficiency in mind.
Our users rely on us to provide real-time insights into their service health, making it essential to maintain a robust and performant infrastructure.

In this post, we'll take a deep dive into our infrastructure architecture, exploring the key components, managed services, and design principles that power OpenStatus.

## Application Landscape

Our platform consists of several interconnected applications, each designed for a specific purpose:

1. **Frontend Ecosystem**:
- A NextJS application that powers our marketing site, user dashboard, and status page hosted on [Vercel](https://vercel.com/).
- An Astro + Starlight-powered documentation application hosted on [Cloudflare Pages](https://pages.cloudflare.com/).

We chose Vercel for the Next.js application because it performs exceptionally well there, the DX is great. And we selected Cloudflare Pages for the documentation since it is a static site and it's super cheap.

2. **Backend Infrastructure** All our backend services are hosted on Fly.io.
- API server: Our public API and our alerting engine
- Probes/Checker: a golang app deployed globally to monitor your service
- Screenshot app: a service that takes screenshot of your website when we detect an downtime (Playwright)
- Workflow engine: a server that handles the workflow of alerting, and our internal workflows (email automation).

We chose Fly.io for our backend services because it's a great platform for deploying globally distributed services. It's also very easy to deploy and manage.
We are planning to add more providers (e.g. Koyeb) to our probes to have a more resilient system.

<Image
alt="Hosting providers"
src="/assets/posts/infra-openstatus/hosting.png"
width={650}
height={575}
/>


## Managed Services

We also rely heavily on managed services to avoid handling it ourselves. Here are the services we use:

### Scheduling

Recognizing the critical nature of monitoring, we've heavily rely on CRON to ensure timely checks:

- **Cron Jobs**: Currently using Vercel Cron, with plans to migrate to Google Cron for an enhanced user experience (better UI e.g. we can see when the cron ran, retry policy).


### Queue Architecture

Due to the critical nature of checks, we are using a queue to handle task processing and retry logic:

Every check is pushed to a queue and processed by our probes. If the probe fails to process the check, it is retried 3 times before being marked as failed.

- **Job Queue**: Google Task Queues provide our distributed task management, with strategically segmented queues for different check frequencies

We've implemented a granular queue system to ensure efficient task processing, each queue is dedicated to a specific check frequency (e.g. every minute, every 10 minutes).


<Image
alt="Queue providers"
src="/assets/posts/infra-openstatus/queue.png"
width={650}
height={575}
/>


### Data Infrastructure

We also don't want to handle the data infrastructure by ourselves. We rely on managed services for that:

- **Primary Database**: [Turso](https://turso.tech?ref=openstatus.dev), providing a cost efficient data storage solution. We love the fact that's it's hosted SQLite database. It's just a file we can embedded in our services and sync it periodically.
- **Analytics Database**: [Tinybird](https://www.tinybird.co?ref=openstatus.dev), enabling complex analytical queries and insights.

## Design Philosophy

Our infrastructure design is driven by several key principles:

- **Resilience**: Ensuring high availability and fault tolerance
- **Scalability**: Architectural choices that allow seamless growth
- **Cost-Efficiency**: Leveraging managed services and cloud credits
- **Performance**: Optimizing each component for maximum efficiency.


## How much does it cost us?

Our current monthly cost is around $328. This includes:

- Vercel: $40, we are two members in the team, so we had to upgrade to the team plan.
- Fly.io: $154 36*4 (all our probes at $4 average, not all regions cost the same) + $10 (for the api server)
- Google Cloud Platform: $0 (We are still using the free credits, but we expect to pay around $50 for the queue)
- Tinybird: $100
- Turso: $29
- Cloudfare: $5



# Conclusion

Building a resilient synthetic monitoring platform is hard. It's not just a $5 VPS that you can deploy and forget. It requires a more complex infrastructure to be able to provide a reliable service.

The drawback of this approach is the complexity of providing an easy self hostable services. Which is annoying because we are an open-source project and we want to provide a self-hostable version of OpenStatus. But we are working on community edition that will be easier to deploy.

*Want to start monitoring your services with OpenStatus? [Sign up for free](https://www.openstatus.dev/app/login?ref=blogpost-infra) and get started today!*

0 comments on commit 99d3a10

Please sign in to comment.