Commit

docs: update technical architecture (#2047)
* We've added a lot since this was last touched, including Clickhouse,
  sqlmesh, Apollo, Dagster.
ryscheng committed Sep 3, 2024
1 parent 40271a8 commit 1f0f997
Showing 5 changed files with 94 additions and 63 deletions.
83 changes: 47 additions & 36 deletions apps/docs/docs/how-oso-works/architecture.md
@@ -9,68 +9,79 @@ deployed data pipeline so that the community can build this open data warehouse
together. All of the code for this architecture is available to view/copy/redeploy from the [OSO Monorepo](https://github.com/opensource-observer/oso).
:::

## Diagram
## Pipeline Overview

The following diagram illustrates Open Source Observer's technical architecture.
OSO maintains an [ETL](https://en.wikipedia.org/wiki/Extract%2C_load%2C_transform) data pipeline that is continuously deployed from our [monorepo](https://github.com/opensource-observer/oso/) and regularly indexes all available event data about projects in the [oss-directory](https://github.com/opensource-observer/oss-directory).

- **Extract**: raw event data from a variety of public data sources (e.g., GitHub, blockchains, npm, Open Collective)
- **Transform**: the raw data into impact metrics and impact vectors per project (e.g., # of active developers)
- **Load**: the results into various OSO data products (e.g., our API, website, widgets)
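
The three stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OSO's implementation: the `Event` record, the sample rows, and the in-memory `serving_store` are all hypothetical stand-ins for the real data sources and data products.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """Hypothetical raw event record; the field names are assumptions."""
    project: str
    developer: str
    event_type: str
    day: date


def extract() -> list[Event]:
    """Stand-in for pulling raw events from public sources (sample data only)."""
    return [
        Event("project-a", "alice", "COMMIT", date(2024, 9, 1)),
        Event("project-a", "bob", "COMMIT", date(2024, 9, 2)),
        Event("project-a", "alice", "COMMIT", date(2024, 9, 3)),
        Event("project-b", "carol", "ISSUE_OPENED", date(2024, 9, 1)),
    ]


def transform(events: list[Event]) -> dict[str, int]:
    """Reduce raw events to one metric per project: distinct developers seen."""
    developers: dict[str, set[str]] = defaultdict(set)
    for event in events:
        developers[event.project].add(event.developer)
    return {project: len(devs) for project, devs in developers.items()}


def load(metrics: dict[str, int], store: dict[str, int]) -> None:
    """Stand-in for publishing results to a data product (an in-memory dict here)."""
    store.update(metrics)


serving_store: dict[str, int] = {}
load(transform(extract()), serving_store)
```

After running this, `serving_store` holds one distinct-developer count per project, which is the kind of per-project metric the Transform step describes.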

[![OSO Architecture Diagram](https://mermaid.ink/img/pako:eNqNVMtu2zAQ_BWCJxuI0rsPAfLorU4cuO3F7GFFbS0iEinwYcUN8u9dSpQly00RHiRqNcNZzi72jUtTIF_xLMuE9spXuGKCPzWo2dYEK5E95Q7tAS27tbJUHqUPFgUXuqMI_bsyrSzBevb9TmhGy4V8b6EpWRPySsmd4LfSG-vY4tHoDF9LCM6rAy4F_9Uz4gok4wj7I77P_hR4iD8ewAP7qvdK4xwBGqqj8yfUViqk21DkBENd9JtOh2UZXfOnwpZ9Yc8B7VFwit0w40yLeY-Muj1whoBGzRH3Rnur8uCRRUcTVukCX9H24CHJC8L2-VvCS1M3FMnBy3Lm5QsQGnb_rk13y9GOsQCqwYr8ItoalGadOZsUPHMwrpTtYKI0VYWpcPeVCUXnwvKC1oLF0pCtu0ViniJscaf2A205400vu0ses8Yaic51okXuz9VONYxr3KW8L3wtKJdk7CmjkdVXmoTbtr02ZKrrPL02U0_Pwf9vhoSjbzqUnp85lGCXh8a8c5jmOkQGh-OeLTbG-b1Fap-Zu-Nu6nEvtAaPVkGl_uCHcp9hTRwdiWPhO9amL-aHzU315Fe8RluDKmgIvcWw4L7EmmTiICrAvsRZ8044CN5sj1rylbcBr3hoKHN8UEC9Xg9BLBR17Lofat1se_8LDu-M4g?type=png)](https://mermaid.live/edit#pako:eNqNVMtu2zAQ_BWCJxuI0rsPAfLorU4cuO3F7GFFbS0iEinwYcUN8u9dSpQly00RHiRqNcNZzi72jUtTIF_xLMuE9spXuGKCPzWo2dYEK5E95Q7tAS27tbJUHqUPFgUXuqMI_bsyrSzBevb9TmhGy4V8b6EpWRPySsmd4LfSG-vY4tHoDF9LCM6rAy4F_9Uz4gok4wj7I77P_hR4iD8ewAP7qvdK4xwBGqqj8yfUViqk21DkBENd9JtOh2UZXfOnwpZ9Yc8B7VFwit0w40yLeY-Muj1whoBGzRH3Rnur8uCRRUcTVukCX9H24CHJC8L2-VvCS1M3FMnBy3Lm5QsQGnb_rk13y9GOsQCqwYr8ItoalGadOZsUPHMwrpTtYKI0VYWpcPeVCUXnwvKC1oLF0pCtu0ViniJscaf2A205400vu0ses8Yaic51okXuz9VONYxr3KW8L3wtKJdk7CmjkdVXmoTbtr02ZKrrPL02U0_Pwf9vhoSjbzqUnp85lGCXh8a8c5jmOkQGh-OeLTbG-b1Fap-Zu-Nu6nEvtAaPVkGl_uCHcp9hTRwdiWPhO9amL-aHzU315Fe8RluDKmgIvcWw4L7EmmTiICrAvsRZ8044CN5sj1rylbcBr3hoKHN8UEC9Xg9BLBR17Lofat1se_8LDu-M4g)
The following diagram illustrates Open Source Observer's technical architecture.
[![OSO Architecture](https://mermaid.ink/img/pako:eNqVVU1z2jAQ_SsancwktNMcOXSGhHbaTgikNOkB9yDLC1ZjS44-gDST_96VLXDAhja-gFfvzdt9u149U65SoAPa7_djaYXNYUBiOilBkplymgOZJAb0CjQZap4JC9w6DTGNZUWJ5SJXa54xbcmPy1gSfIxLlpqVGSldkgs-j-mQW6UNiW6U7MMmY85YsYJeTH_VDP84lDGIvfO_eycprPzBiFlGPsmlkHCIYJLlT8buUDMuAKvByA4GMq3_VDqk38cy7wWsyXty60A_xRRjH4kyag1JjfS6NfAAwUpxiLhS0mqROAvEOxqwQqawAV2Dt0m2CLPb64DnqigxkjDLsy7SdyfJI-YiwATGmmnIFNZUww8agLnOu7tZ-dIY2LRMlJCjwx-QN2ZCksrPaYjume4fhTMBxmpmhZKe800l-0ESjdjSWNC9Fjm4s20aV3kOYVDS3Lbxu1rnUaDsIiS6FMuqS0jrHfBeuzoPzSSlVhyMqdUSezS7VrdSFO42v0uvpo8ZGiBYLv7AaWpTUMWb1kkenY7dYDev_yPuS0iY1z4xARdo1WeNdaPAW6bg4o1TsE3GN3VyPZxWWj5AoivcHg-VHR1dNY95Aaa7oeHsuFiXvYHUqfJvJ4-0o94W7S1ysgWLYPvrDvjPVnAwbfOrjYXQn5AQVpboWbD8HvCLz89uYGPf_Ta9Libm5tfz9CsJSz76wozT7GxY4teoekfUTm_FvXXbbKXmHx7Sc1qALphI8fJ59uGY2gwKNNZfQCnTD_6OeUEcc1bNniSnA6sdnFOt3DKjgwXLDb65Eq2EkWBoXbGFQCpwj4zrq6264V7-AjpFJyI?type=png)](https://mermaid.live/edit#pako:eNqVVU1z2jAQ_SsancwktNMcOXSGhHbaTgikNOkB9yDLC1ZjS44-gDST_96VLXDAhja-gFfvzdt9u149U65SoAPa7_djaYXNYUBiOilBkplymgOZJAb0CjQZap4JC9w6DTGNZUWJ5SJXa54xbcmPy1gSfIxLlpqVGSldkgs-j-mQW6UNiW6U7MMmY85YsYJeTH_VDP84lDGIvfO_eycprPzBiFlGPsmlkHCIYJLlT8buUDMuAKvByA4GMq3_VDqk38cy7wWsyXty60A_xRRjH4kyag1JjfS6NfAAwUpxiLhS0mqROAvEOxqwQqawAV2Dt0m2CLPb64DnqigxkjDLsy7SdyfJI-YiwATGmmnIFNZUww8agLnOu7tZ-dIY2LRMlJCjwx-QN2ZCksrPaYjume4fhTMBxmpmhZKe800l-0ESjdjSWNC9Fjm4s20aV3kOYVDS3Lbxu1rnUaDsIiS6FMuqS0jrHfBeuzoPzSSlVhyMqdUSezS7VrdSFO42v0uvpo8ZGiBYLv7AaWpTUMWb1kkenY7dYDev_yPuS0iY1z4xARdo1WeNdaPAW6bg4o1TsE3GN3VyPZxWWj5AoivcHg-VHR1dNY95Aaa7oeHsuFiXvYHUqfJvJ4-0o94W7S1ysgWLYPvrDvjPVnAwbfOrjYXQn5AQVpboWbD8HvCLz89uYGPf_Ta9Libm5tfz9CsJSz76wozT7GxY4teoekfUTm_FvXXbbKXmHx7Sc1qALphI8fJ59uGY2gwKNNZfQCnTD_6OeUEcc1bNniSnA6sdnFOt3DKjgwXLDb65Eq2EkWBoXbGFQCpwj4zrq6264V7-AjpFJyI)

## Major Components

The architecture has the following major components.

### Data Orchestration

Dagster is the central data orchestration system, which manages the entire pipeline,
from the data ingestion (e.g. via [dlt](https://docs.dagster.io/integrations/embedded-elt/dlt) connectors), the [dbt](https://docs.dagster.io/integrations/dbt) pipeline, the [sqlmesh](https://github.com/opensource-observer/dagster-sqlmesh) pipeline, to copying mart models to data serving infrastructure.

You can see our public Dagster dashboard at
[https://dagster.opensource.observer/](https://dagster.opensource.observer/).

### Data Warehouse

Currently all data is stored and processed in Google BigQuery.
Currently all data is stored and processed in
[Google BigQuery](https://cloud.google.com/bigquery/?hl=en).
All of the collected data or aggregated views used by OSO is also made publicly available here (if it is not already a public dataset on BigQuery).
Anyone can view, query, or build off of any stage in the pipeline.
In the future we plan to explore a decentralized lakehouse.

### Data Orchestration
To see all datasets that you can subscribe to, check out our
[Data Overview](../integrate/overview/index.mdx).

Dagster is the central orchestration system, which manages the entire pipeline,
from the data ingestion, the dbt pipeline, to copying marts to data serving infrastructure.
### dbt pipeline

### API
We use a [dbt](https://www.getdbt.com/) pipeline to clean and normalize the data
into a universal event table. You can read more about our event model
[here](./event.md).

The API can be used by external developers to integrate insights from OSO.
Rate limits or cost sharing subscriptions may apply to its usage depending
on the systems used. This also powers the OSO website.
### OLAP database

### Website
We use [Clickhouse](https://clickhouse.com/)
as a frontend database for serving live queries to the API server
and frontend website, as well as running a sqlmesh data pipeline.

This is the OSO website at [https://www.opensource.observer](https://www.opensource.observer). This website provides an easy to use public view into the data.
### sqlmesh pipeline

## Dependent Technologies
A [sqlmesh](https://sqlmesh.com/) pipeline
is used for computing time series metrics from
the universal event table, which is copied from the BigQuery dbt pipeline.
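
To make "time series metrics from the universal event table" concrete, here is a minimal Python sketch of one such metric: distinct active developers per day over a trailing window. It is an assumption-laden illustration, not the actual sqlmesh models; the `(day, project, developer)` row shape, the sample rows, and the `active_developers` helper are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical rows of an event table: (day, project, developer).
EVENTS = [
    (date(2024, 9, 1), "project-a", "alice"),
    (date(2024, 9, 2), "project-a", "bob"),
    (date(2024, 9, 4), "project-a", "alice"),
]


def active_developers(events, project, start, end, window_days):
    """For each day in [start, end], count the distinct developers of `project`
    that have at least one event in the trailing `window_days`-day window."""
    series = {}
    day = start
    while day <= end:
        window_start = day - timedelta(days=window_days - 1)
        devs = {
            dev
            for (event_day, event_project, dev) in events
            if event_project == project and window_start <= event_day <= day
        }
        series[day] = len(devs)
        day += timedelta(days=1)
    return series


series = active_developers(
    EVENTS, "project-a", date(2024, 9, 1), date(2024, 9, 4), window_days=2
)
```

A real pipeline would express this as incremental SQL over the event table rather than rescanning every row for each day; the sketch only shows the shape of the computation.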

Our infrastructure is based on many wonderful existing tools. Our major
dependencies are:
### API service

- Google BigQuery
- As explained above, all of the data that OSO collects and materializes lives
in public datasets in BigQuery.
- Dagster
- Dagster orchestrates all data jobs, including the collection of data
from external sources as well as handling the flow of data through the
main data pipeline.
- dbt
- This is used for data transformations to turn collected data into useful
materializations for the OSO API and website.
- OLAP database
- All dbt mart models are copied to an OLAP database for real-time queries.
This database powers the OSO API, which in turn powers the OSO website.

## Indexing Pipeline
We use [Hasura](https://hasura.io/) to automatically generate
a GraphQL API from our Clickhouse database.
We then use an [Apollo Router](https://www.apollographql.com/docs/router/)
to service user queries to the public.
The API can be used by external developers to integrate insights from OSO.
Rate limits or cost sharing subscriptions may apply to its usage depending
on the systems used. This also powers the OSO website.
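
As a rough sketch of what a client request to a GraphQL API like this might look like, the Python below builds a GraphQL-over-HTTP POST body using only the standard library. The table name `projects`, the field `active_developers`, and the `build_request_body` helper are hypothetical; the filter uses Hasura-style `where`/`_eq` syntax, and no endpoint URL is assumed, so nothing is actually sent.

```python
import json

# Hypothetical query: the table and field names are illustrative only.
QUERY = """
query ProjectMetrics($name: String!) {
  projects(where: {name: {_eq: $name}}) {
    name
    active_developers
  }
}
"""


def build_request_body(project_name: str) -> str:
    """Serialize a GraphQL-over-HTTP POST body (query text plus variables)."""
    return json.dumps({"query": QUERY, "variables": {"name": project_name}})


body = build_request_body("project-a")
```

The resulting string would be sent as the JSON body of a POST request to the API endpoint, along with whatever authentication the service requires.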

OSO maintains an [ETL](https://en.wikipedia.org/wiki/Extract%2C_load%2C_transform) data pipeline that is continuously deployed from our [monorepo](https://github.com/opensource-observer/oso/) and regularly indexes all available event data about projects in the [oss-directory](https://github.com/opensource-observer/oss-directory).
### OSO Website

- **Extract**: raw event data from a variety of public data sources (e.g., GitHub, blockchains, npm, Open Collective)
- **Transform**: the raw data into impact metrics and impact vectors per project (e.g., # of active developers)
- **Load**: the results into various OSO data products (e.g., our API, website, widgets)
The OSO website is served at
[https://www.opensource.observer](https://www.opensource.observer).
This website provides an easy to use public view into the data.
We currently use [Next.js](https://nextjs.org/)
hosted by [Vercel](https://vercel.com/).

## Open Architecture for Open Source Data

The architecture is designed to fully open to open source collaboration.
The architecture is designed to be fully open to maximum open source collaboration.
With contributions and guidance from the community,
we want Open Source Observer to evolve as we better understand
what impact looks like in different domains.
2 changes: 1 addition & 1 deletion apps/docs/package.json
@@ -27,7 +27,7 @@
"@docusaurus/theme-common": "3.4.0",
"@laxels/docusaurus-plugin-segment": "^1.0.6",
"@mdx-js/react": "^3.0.1",
"@plasmicapp/react-web": "^0.2.346",
"@plasmicapp/react-web": "^0.2.350",
"clsx": "^2.1.1",
"prism-react-renderer": "^2.3.1",
"react": "^18.3.1",
8 changes: 4 additions & 4 deletions apps/docs/plasmic.json
@@ -72,8 +72,8 @@
"icons": [
{
"id": "bBICBhwqdJEn",
"name": "ChecksvgIcon",
"moduleFilePath": "generated/docs_opensource_observer/icons/PlasmicIcon__Checksvg.tsx"
"name": "CheckSvgIcon",
"moduleFilePath": "generated/docs_opensource_observer/icons/PlasmicIcon__CheckSvg.tsx"
},
{
"id": "4hUpITJttWoK",
@@ -82,8 +82,8 @@
},
{
"id": "IUL_Z3b2VK27",
"name": "IdeaSvgrepoComsvgIcon",
"moduleFilePath": "generated/docs_opensource_observer/icons/PlasmicIcon__IdeaSvgrepoComsvg.tsx"
"name": "IdeaSvgrepoComSvgIcon",
"moduleFilePath": "generated/docs_opensource_observer/icons/PlasmicIcon__IdeaSvgrepoComSvg.tsx"
}
],
"images": [
14 changes: 7 additions & 7 deletions apps/docs/plasmic.lock
@@ -14,22 +14,22 @@
{
"type": "renderModule",
"assetId": "z50hW5Ihi9k5",
"checksum": "8041afb9d5261cb9dcfdde02df90afa5"
"checksum": "25dc6f80fe7c254e792c51fe2d43b4a9"
},
{
"type": "cssRules",
"assetId": "z50hW5Ihi9k5",
"checksum": "8041afb9d5261cb9dcfdde02df90afa5"
"checksum": "25dc6f80fe7c254e792c51fe2d43b4a9"
},
{
"type": "renderModule",
"assetId": "8u0yNVg3vXsq",
"checksum": "2d14ebf9cc6b71d89f1731028117ffc5"
"checksum": "4deb8270912232ef29700376c2ee2340"
},
{
"type": "cssRules",
"assetId": "8u0yNVg3vXsq",
"checksum": "2d14ebf9cc6b71d89f1731028117ffc5"
"checksum": "4deb8270912232ef29700376c2ee2340"
},
{
"type": "renderModule",
@@ -44,7 +44,7 @@
{
"type": "icon",
"assetId": "bBICBhwqdJEn",
"checksum": "6c340bbb97a866e45667be367e634e39"
"checksum": "f69ac7871123b31bcfd09c527de66a36"
},
{
"type": "icon",
@@ -54,7 +54,7 @@
{
"type": "icon",
"assetId": "IUL_Z3b2VK27",
"checksum": "f747464e248d0fd06360dae746c9d7f8"
"checksum": "f8ee83a372a68c5c28e98fcdd36a7be6"
},
{
"type": "image",
@@ -64,7 +64,7 @@
{
"assetId": "2CtczDeUz9jL9qnFi6NWuQ",
"type": "projectCss",
"checksum": "fb9ec7d982cabcf310fdbed935cf6a98"
"checksum": "ca22df23535ee01348fd2e1587fefcbb"
}
],
"codegenVersion": "0.0.1"
50 changes: 35 additions & 15 deletions pnpm-lock.yaml
