Analysis of the Trino project CI workflows.
This repo is an example of using Trino to perform ETL (extract, transform, and load) and generate basic reports. The complete workflow is:
- Use a custom connector to read Github's API data and save it into an S3 bucket.
- Run some queries to analyze the data stored in the S3 bucket, save results to a file and publish it to Github Pages.
All of the above is repeatable and executed on a schedule using Github Actions.
Queries could be executed directly against the tables in the Github connector (see the sketch below), but that approach has a few downsides:
- it would perform many API requests, possibly hitting rate limits
- the range of available data is limited - for example, Github Workflow runs are only available for the last 2 months
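A direct query like the following sketch would trigger paginated API calls every time it runs; the `github.default.runs` table and its columns follow typical trino-rest naming, but are assumptions here, not verified against this repo.

```sql
-- Hypothetical direct query against the API-backed tables; the schema,
-- table, and column names are assumptions, not taken from this repo.
SELECT id, status, conclusion, created_at
FROM github.default.runs
WHERE owner = 'trinodb' AND repo = 'trino'
ORDER BY created_at DESC
LIMIT 10;
```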
Data is saved to an S3 bucket, since it's cheap and easy to set up. There's no database server running, so no maintenance is required.
Materialized views are not used, because incremental updates are tricky and differ between tables (Github API endpoints).
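To illustrate why a single refresh strategy doesn't fit, an incremental load for just one table might look like the sketch below; every endpoint would need its own watermark column, and all catalog, table, and column names here are assumptions.

```sql
-- Rough sketch of an incremental load for a single table (the initial load
-- would be a plain INSERT ... SELECT without the watermark predicate).
-- Each endpoint needs its own watermark column (created_at, updated_at,
-- started_at, ...), so one generic refresh strategy does not fit.
-- All catalog, table, and column names are assumptions.
INSERT INTO hive.default.runs
SELECT *
FROM github.default.runs
WHERE owner = 'trinodb'
  AND repo = 'trino'
  AND created_at > (SELECT max(created_at) FROM hive.default.runs);
```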
- Downloads and extracts a zip archive with the custom trino-rest connector.
- Starts Trino in a container with both this additional connector (plugin) and its configuration mounted as volumes. The configuration also includes a built-in connector that can read and write to an S3 bucket. The required credentials are passed as environment variables, populated from Github Secrets.
- Runs a series of queries similar to `INSERT INTO hive.<table> SELECT * FROM github.<table>` to save the data. See the Sync class from trino-rest for more details.
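For example, a one-time table setup plus the recurring sync statement could look like the sketch below; the catalog layout and the owner/repo predicate are assumptions, and the real statements are generated by the Sync class.

```sql
-- Hypothetical setup and sync pair; catalog, schema, and predicate
-- columns are assumptions - the actual statements come from Sync.
CREATE TABLE IF NOT EXISTS hive.default.runs AS
SELECT * FROM github.default.runs WHERE owner = 'trinodb' AND repo = 'trino'
WITH NO DATA;

-- Each scheduled run then appends the current data:
INSERT INTO hive.default.runs
SELECT * FROM github.default.runs
WHERE owner = 'trinodb' AND repo = 'trino';
```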
- Starts Trino in a container with only the connector required to read data from the S3 bucket.
- Executes multiple queries, saving the results as a simple text table in a file (an example query is sketched after this list).
- Commits the updated results file to this repository, which triggers publishing it to Github Pages.
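For illustration, one such report query might aggregate the synced workflow runs per week; this is only a sketch with assumed table and column names, not one of the queries actually used here.

```sql
-- Hypothetical report query over the synced data; the table, the column
-- names, and the use of updated_at as a run's end time are all assumptions.
SELECT
    date_trunc('week', created_at) AS week,
    count(*) AS runs,
    count_if(conclusion = 'failure') AS failed_runs,
    round(avg(date_diff('second', created_at, updated_at)) / 60.0, 1) AS avg_duration_minutes
FROM hive.default.runs
GROUP BY 1
ORDER BY 1 DESC;
```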