This is a public repo used for collaboration on a student project between Solent University and Rockstone Data Ltd.
All the databases below hold 1 year and 10 minutes' worth of data at one data point per second, or roughly 31.5 million rows.
| Metric | ClickHouseDB | PostgreSQL | TimescaleDB | ArcticDB (Personal Tier AWS) |
|---|---|---|---|---|
| Raw read (seconds) | 0.339 s | 3.207 s | 4.399 s | 8.250 s |
| Read total disk usage (MB) | 240 MB | 1091 MB | 1091 MB | 481 MB |
| Downsampled read, 500 samples (seconds) | 1.961 s | 11.370 s | 11.563 s | 5.972 s |
| Write (seconds) | 1.553 s | 28.750 s | 26.299 s | 59.987 s |
| Write total disk usage (MB) | 241.26 MB | 1093 MB | 1093 MB | 482.52 MB |
The number of different database technologies is expanding, with many becoming more specialised for particular applications and offering associated performance boosts (e.g. OLTP vs OLAP) and cost savings.
At Rockstone Data we have used a relatively new column-oriented database called ClickHouse to host very large time series tables (700m+ rows). This class of database was first described in 'C-Store: A Column-oriented DBMS' and outperforms traditional row-oriented databases in both speed and storage.
The project aim is to create an interactive demonstrator running on the company website that clearly shows these benefits versus 'traditional' row-oriented databases.
Create a one-page 'dashboard' web app with a line plot of a scalar value over time, a start/end datetime picker and a database source picker. The database source options will be Postgres, Postgres + TimescaleDB or ClickHouse.
There will be a submit button that, when pressed, fetches the data for the plot, plus a 'downsampling on/off' toggle and a downsampling count text entry.
On pressing ‘submit’ a timer is started that times how long it takes to fetch the data (note this will not include the time taken for the charting library to load it).
The elapsed time will be displayed on the dashboard near the line chart. Text boxes will show the space taken up on disk by the table, the number of rows in the table, and the GB of disk storage per million rows.
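As a rough illustration of this requirement (not the final implementation; `fetch_timeseries` and the component ids below are placeholders), the fetch timing could be measured inside a Dash callback along these lines:

```python
# Minimal sketch, assuming Plotly Dash: time only the database fetch,
# not the time Plotly spends rendering the returned figure.
import time

import pandas as pd
import plotly.graph_objects as go
from dash import Dash, Input, Output, State, dcc, html


def fetch_timeseries(db_source: str) -> pd.DataFrame:
    # Placeholder: the real app would query Postgres, Postgres + TimescaleDB
    # or ClickHouse here, depending on db_source.
    return pd.DataFrame({
        "cdatetime": pd.date_range("2021-01-01", periods=600, freq="s"),
        "ts_values": range(600),
    })


app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(["postgres", "timescaledb", "clickhouse"], "clickhouse", id="db-source"),
    html.Button("Submit", id="submit"),
    html.Div(id="elapsed"),
    dcc.Graph(id="ts-plot"),
])


@app.callback(
    Output("ts-plot", "figure"),
    Output("elapsed", "children"),
    Input("submit", "n_clicks"),
    State("db-source", "value"),
    prevent_initial_call=True,
)
def update_plot(n_clicks, db_source):
    start = time.perf_counter()          # timer starts when Submit is pressed
    df = fetch_timeseries(db_source)     # only the fetch is timed
    elapsed = time.perf_counter() - start
    fig = go.Figure(go.Scatter(x=df["cdatetime"], y=df["ts_values"]))
    return fig, f"Fetch took {elapsed:.3f} s"


if __name__ == "__main__":
    app.run(debug=True)
```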
The project will be built using Python Plotly Dash. Development will use a Python venv and pip to ensure a machine-independent virtual environment. VS Code will be used as the IDE, with step-by-step debugging.
For local development the databases will run inside Docker containers: one for Postgres, two (chs and chc) for ClickHouse and, as required, another for Postgres + Timescale (the author is unfamiliar with TimescaleDB so looks to the student to advise).
Docker Compose will be used in deployment and for starting/stopping the databases locally.
Test data will need to be created on, e.g., a once-per-second basis for 1 to 100 years, ideally with variations that are visible on 24-hour, monthly and yearly plots. Advice can be provided on the best way to do this. Test data is only created once, before deployment.
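One possible approach (a sketch only; the coefficients and column names are arbitrary) is to sum sinusoids at daily, monthly and yearly periods plus some noise:

```python
# Sketch: one reading per second with patterns visible on daily, monthly
# and yearly plots. Coefficients and names are illustrative only.
# Note: a year and 10 minutes at 1 Hz is ~31.5 million rows, so this builds
# a few GB of arrays in memory; generate in chunks if that is a problem.
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01 00:00:00", "2022-01-01 00:10:00", freq="s")
t = (idx - idx[0]).total_seconds().to_numpy()

daily = 10 * np.sin(2 * np.pi * t / 86_400)            # 24-hour cycle
monthly = 25 * np.sin(2 * np.pi * t / (30 * 86_400))   # ~monthly cycle
yearly = 50 * np.sin(2 * np.pi * t / (365 * 86_400))   # yearly cycle
noise = np.random.normal(scale=2.0, size=len(idx))

df = pd.DataFrame({"cdatetime": idx, "ts_values": daily + monthly + yearly + noise})
print(len(df), "rows")
```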
Identify a class of data that can be plotted as, e.g., a geo-heatmap, spectrogram or surface plot (i.e. more dimensions), perhaps with animations or orientation controls. Generate the data and add the plot (map or other) below the line chart.
Benchmark MongoDB time series collections. Compare costs with Snowflake, Datadog, etc. (see X post here). Benchmark writing data to the large tables.
Rockstone Data has created this public repo along with a 'Project' view onto which the student will curate a backlog of 'Issues', moving them across the board Kanban-style as they progress. The student will be expected to use 'Issue comments' as a form of log-book to facilitate communication. A high level of autonomy and problem-solving initiative will be required.
The GitHub development flow will be used, where individuals create a branch for each issue they are working on; this is known as the 'Feature Branch Workflow' or 'Git Flow'. The branch will be named e.g. nt_123_short_desc, where 'nt' is the developer's initials, 123 is the issue number, followed by a one- or two-word description of the issue. We will try to adhere to the 'Joel Test' of good software practice.
We will create a docs directory and make notes in Markdown to share knowledge across the team. If we need to add screenshots, then place them in a GitHub Discussions post and link to the docs entry.
A good team consists of everyone trying to help each other; the leader is there as a coach and to maximise the productivity of the team, finding and removing blockers.
This project has four main components initially:
- the Flask/Plotly/Dash Python module
- the Postgres Docker container
- the two ClickHouse DB containers
The steps to install and run it are:
Git clone this repo, then create the virtual environment, activate it and install the packages:
cd db_bench
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
(On macOS/Linux, activate with source .venv/bin/activate instead.)
Create local versions of the config files
cp .vscode/launch.json.example .vscode/launch.json
cp .env_example .env
cp postgres.env_example postgres.env
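These `.env` files are read by the application via python-dotenv (see the troubleshooting note at the end if that import fails). A minimal sketch of loading them; `ARCTIC_URL` is the key used in the ArcticDB section later, other keys depend on the app:

```python
# Minimal sketch: load the variables defined in .env into the environment.
import os
from dotenv import load_dotenv

load_dotenv()                           # reads .env from the current directory
print(os.environ.get("ARCTIC_URL"))     # None until the ArcticDB step is done
```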
Enable all user permissions
cd etc\clickhouse-server
In chuser.xml, add the grant to the user profile:
<chuser>
    <profile>ch_profile</profile>
    <networks>
        <ip>::/0</ip>
    </networks>
    <password>chuser_pwd</password>
    <quota>ch_quota</quota>
    <grants>
        <query>GRANT ALL ON *.*</query>
    </grants>
</chuser>
This will create a subdirectory `.venv` containing a virtual Python environment, isolating the project from other projects on your computer. You may want to move across to using the Poetry package manager as one of your deliverables; it handles dependencies in a more intelligent way than venv and pip.
If you're using VS Code, note the .vscode directory which contains an entry allowing you to start and debug the project.
You can try this now, but will likely get errors about not being able to connect to the database. So the next step is to run up the Docker containers for Clickhouse and configure them. You will need Docker Desktop installed on your machine.
cd db_bench
docker-compose up ch_server ch_client
This will build your containers and run them locally. You can see their status with `docker container ls -a`.
Now we need to check that the ClickHouse database is running locally, so choose your preferred SQL client. I like to use DBeaver. Create a connection of type ClickHouse on localhost, port 8124 (specified in docker-compose.yml), user `chuser` and password `chuser_pwd` (specified in /etc/clickhouse-server/users.d/chuser.xml and .env), and we start with database `default`.
You should now be able to connect to your locally running Clickhouse docker container. When you are connected, open an SQL terminal and create the database. Disconnect and reconnect as this will refresh DBeaver - the new database will not show up on the GUI if you don't do this.
CREATE DATABASE ts_db;
Now create the demo time series table with the following SQL command. As written it generates a year and 10 minutes' worth of 1-second time series data (2021-01-01 00:00:00 to 2022-01-01 00:10:00). To try a small table first, change the end timestamp from toDateTime('2022-01-01 00:10:00') to toDateTime('2021-01-01 00:10:00'), which generates only 10 minutes of data, then switch it back once you are sure of the installation. Once again, refresh DBeaver.
CREATE TABLE ts_db.demo_ts
ENGINE = MergeTree
ORDER BY tuple()
AS
SELECT
    cdatetime,
    toSecond(cdatetime)
      + toMinute(cdatetime)
      + 2 * toHour(cdatetime)
      + 5 * toDayOfWeek(cdatetime)
      + 8 * toWeek(cdatetime)
      + 12 * toMonth(cdatetime)
      + 20 * (toYear(cdatetime) - 2021) AS ts_values
FROM
(
    SELECT toDateTime(arrayJoin(range(
        toUInt32(toDateTime('2021-01-01 00:00:00')),
        toUInt32(toDateTime('2022-01-01 00:10:00')),
        1))) AS cdatetime
);
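Once the table is built, you can sanity-check it and try a downsampled read from Python. This is only a sketch: it assumes the clickhouse-connect package (which may not be in requirements.txt) and uses bucket-averaging as the downsampling method; the benchmark figures above may have been produced differently.

```python
# Sketch: connect to the local ClickHouse container, check the row count and
# run a bucket-averaged "downsampled" read of roughly 500 points.
# Port/user/password match the DBeaver connection described above.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost", port=8124, username="chuser", password="chuser_pwd"
)

print(client.command("SELECT count() FROM ts_db.demo_ts"))  # ~31.5 million

n_samples = 500
span = int(client.command(
    "SELECT toUInt32(max(cdatetime)) - toUInt32(min(cdatetime)) FROM ts_db.demo_ts"
))
bucket = max(span // n_samples, 1)

rows = client.query(
    f"SELECT toStartOfInterval(cdatetime, INTERVAL {bucket} SECOND) AS t, "
    f"avg(ts_values) AS v FROM ts_db.demo_ts GROUP BY t ORDER BY t"
).result_rows
print(len(rows), "downsampled points")
```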
Make sure all the packages imported in `chdemoapp.py` have been installed, and then you can start the app; it should connect to the ClickHouse database and show some data. This can now also be done with the `db_bench.py` application.
To configure Postgres, run the command `docker compose up db`. This will create the `psql_db` container. Go to DBeaver and create a new connection to a Postgres database on port 5432 with the username `postgres` and password `postgres`.
Once connected, create a table with the SQL command
CREATE TABLE demo_ts (
    cdatetime TIMESTAMP,  -- TIMESTAMP rather than DATE so the 1-second resolution is kept
    ts_values INTEGER
);
and generate some data with
WITH time_series AS (
    SELECT * FROM generate_series(
        '2021-01-01 00:00:00'::timestamp,
        '2022-01-01 00:10:00'::timestamp,
        '1 second'::interval
    ) AS cdatetime
)
-- one random value per second, roughly 31.5 million rows in total
INSERT INTO demo_ts (cdatetime, ts_values)
SELECT cdatetime, (random() * 100)::integer  -- adjust the value range as needed
FROM time_series;
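As a rough cross-check of the 'Raw read' figures above, a full read of this table can be timed from Python, for example with pandas and SQLAlchemy (assuming the psycopg2 driver and the default `postgres` database; these packages may need installing):

```python
# Sketch: time a raw read of demo_ts from the local Postgres container.
# Assumes SQLAlchemy + psycopg2; credentials/port as configured above.
import time

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/postgres")

start = time.perf_counter()
df = pd.read_sql("SELECT cdatetime, ts_values FROM demo_ts", engine)
print(f"Fetched {len(df):,} rows in {time.perf_counter() - start:.3f} s")
```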
Lastly, in order to display the data on the Streamlit app, navigate to your `.streamlit` folder (the default location is `C:\Users\Username\.streamlit`), create a `secrets.toml` file and add the connection settings that the app reads.
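As an illustration only (the section and key names below are assumptions, not necessarily what `db_bench.py` actually reads), `secrets.toml` might hold connection details for the local containers, which Streamlit then exposes through `st.secrets`:

```python
# Illustration only: the section and key names here are assumptions.
# secrets.toml might contain, e.g.:
#
#   [postgres]
#   host = "localhost"
#   port = 5432
#   user = "postgres"
#   password = "postgres"
#
# which Streamlit then exposes via st.secrets:
import streamlit as st

pg = st.secrets["postgres"]
print(pg["host"], pg["port"], pg["user"])
```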
To configure Timescale, run the command `docker compose up timescaledb`. This will create the `tmscl_db` container. Go to DBeaver and create a new connection to a Timescale database on port 5433 with the username `postgres` and password `postgres`. (Timescale uses Postgres.)
Once connected, create a table with the SQL command.
CREATE TABLE demo_ts (
    cdatetime TIMESTAMP,  -- TIMESTAMP rather than DATE so the 1-second resolution is kept
    ts_values INTEGER
);
and generate some data with
WITH time_series AS (
    SELECT * FROM generate_series(
        '2021-01-01 00:00:00'::timestamp,
        '2022-01-01 00:10:00'::timestamp,
        '1 second'::interval
    ) AS cdatetime
)
-- one random value per second, roughly 31.5 million rows in total
INSERT INTO demo_ts (cdatetime, ts_values)
SELECT cdatetime, (random() * 100)::integer  -- adjust the value range as needed
FROM time_series;
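Note that, as written, `demo_ts` in the Timescale container is still a plain Postgres table; TimescaleDB's time-series features are only used once it is converted to a hypertable. A sketch of that conversion, run from Python (assuming SQLAlchemy + psycopg2 are installed):

```python
# Sketch: convert demo_ts into a TimescaleDB hypertable so the benchmark
# exercises Timescale's chunking rather than a plain Postgres table.
# Port 5433 and credentials as configured above.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5433/postgres")
with engine.begin() as conn:
    conn.execute(text(
        "SELECT create_hypertable('demo_ts', 'cdatetime', migrate_data => TRUE);"
    ))
```

Converting the table while it is still empty (before the INSERT above) avoids the slower migrate_data path.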
First, install ArcticDB locally by running the command `pip install arcticdb`.
Create an Amazon AWS account and set up an S3 bucket. Within the project `.env` file, set `ARCTIC_URL` to the URL for the S3 bucket: `'s3s://s3.<REGION_NAME>.amazonaws.com:<BUCKET_NAME>?aws_auth=true'`.
To access the S3 bucket, IAM user access can be set up on AWS; see https://docs.arcticdb.io/4.4.1/#getting-started for more information. The other, less recommended option is making the bucket publicly available by disabling the 'Block public access' settings and adding a statement to the bucket policy. These options can be found within the AWS bucket settings.
Public Bucket Policy:
{
    "Id": "BucketPolicy",
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllAccess",
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<BUCKET_NAME>",
                "arn:aws:s3:::<BUCKET_NAME>/*"
            ],
            "Principal": "*"
        }
    ]
}
Run the `arcticdb_setup.py` file with `python .\arcticdb_setup.py` from the root folder (this may take some time). This sends the same dataset from the ClickHouse database to the Arctic storage.
ImportError: cannot import name 'load_dotenv' from 'dotenv'
If you get the error message shown above, install the package `python_dotenv` instead of `dotenv`. You do not need to change the import name, as `python_dotenv` is imported under the name `dotenv`.
toml.decoder.TomlDecodeError: Key group not on a line by itself. (line 1 column 1 char 0)
If you get the error message shown above, go to your `.streamlit` folder on your computer (default is `C:\Users\Username\.streamlit`) and delete the `config.toml` file.