How to improve performance #4816

jason-heo · 2024-10-11T07:37:23Z

Hello, I'm new to GreptimeDB.

Currently, I use "Spark + Parquet" for data analysis.

I'm thinking of GreptimeDB as a distributed version of DataFusion that supports external secondary indexes.

At first glance, the query performance of GreptimeDB's EXTERNAL TABLE was significantly better compared to "Spark + Parquet." However, I'm struggling with the performance of NATIVE TABLEs.

Could someone advise me on how to improve the performance?

test result

| Query No. | GreptimeDB EXTERNAL TABLE (standalone mode) | GreptimeDB NATIVE TABLE (standalone mode) | Spark + Parquet (local, vcore=2, vram=16GB, no spill) |
|---|---|---|---|
| Q1 | 0.04 sec | 1 min 34 sec | 0.2 sec |
| Q2 | 1.42 sec | 30 sec | 1.4 sec |
| Q3 | 6.8 sec | 2 min 25 sec | 19 sec |
| Q4 | 1.64 sec | 1 min 53 sec | 4.0 sec |

queries

Q1
```
SELECT COUNT(*)
FROM tab;
```

Q2
```
SELECT COUNT(*)
FROM tab
WHERE blog_id = 'foo';
```

Q3
```
SELECT blog_id, doc_id, visitor_id, COUNT(*) AS page_view
FROM tab
GROUP BY blog_id, doc_id, visitor_id
ORDER BY COUNT(*) DESC
LIMIT 10;
```

Q4
```
SELECT COUNT(DISTINCT blog_id)
FROM tab;
```

Data

/path/to/17M-rows.gz.parquet has 17,000,000 rows.
Cardinality:
- row_id: 17,000,000 (PK field)
- blog_id: 1,080,315
- doc_id: 4,798,262
- visitor_id: 2,139,200

DDL

EXTERNAL TABLE

```
CREATE EXTERNAL TABLE external_table
WITH (
  LOCATION="/path/to/17M-rows.gz.parquet",
  FORMAT="parquet"
);
```

NATIVE TABLE: After INSERT INTO SELECT, 4 parquet files are saved in the data dir.

```
CREATE TABLE IF NOT EXISTS native_table (
  row_id STRING,
  blog_id STRING,
  doc_id STRING,
  visitor_id STRING,
  ts TIMESTAMP TIME INDEX DEFAULT '1970-01-01 00:00:00+0000',
  PRIMARY KEY(row_id)
);
```

```
INSERT INTO native_table
SELECT row_id, blog_id, doc_id, visitor_id, ts FROM external_table;
```

spark

```
CREATE TABLE logs
USING PARQUET
LOCATION "file:///path/to/17M-rows.gz.parquet";
```

GreptimeDB installation

standalone mode

```
./bin/greptime \
    standalone start \
    --http-addr 0.0.0.0:4000 \
    --rpc-addr 0.0.0.0:4001 \
    --mysql-addr 0.0.0.0:4002 \
    --postgres-addr 0.0.0.0:4003 \
    -c conf/greptimedb.conf
```

pod has total 8 cores, 32GB ram
SSD local disk

GreptimeDB conf

I don't know which configurations are important.

After googling, I just added the following lines to greptimedb.conf (though they are not related to query performance):

```
[wal]
provider = "raft_engine"
file_size = "256MB"
purge_threshold = "4GB"
purge_interval = "10m"
read_batch_size = 128
sync_write = false
```

Answered by fengjiachun

fengjiachun · 2024-10-11T12:41:29Z

Maintainer

Hello, looking at the four types of query statements you listed, external tables seem to be a more suitable choice.

If you still want to try the native table, modify the create table statement to:

```
CREATE TABLE IF NOT EXISTS native_table (
  row_id STRING,
  blog_id STRING,
  doc_id STRING,
  visitor_id STRING,
  ts TIMESTAMP TIME INDEX DEFAULT '1970-01-01 00:00:00+0000'
) WITH ('append_mode'='true');
```

I made two modifications based on your table creation statement:

- Removed the setting of PRIMARY KEY(row_id)
- Added 'append_mode', which allows the table to append data without overwriting duplicates

I also have a question: Does the ts column in your data file represent actual time, or is it just a fixed value? (GreptimeDB organizes data by time, so having ts truly represent time would be beneficial.)
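
A quick way to answer the question about ts is to look at the range and the number of distinct values in the column. The query below is only a minimal sketch: it uses the external_table and ts names from the original post, and the output aliases (earliest_ts, latest_ts, distinct_ts) are made up for illustration.

```sql
-- Sketch: check whether ts carries real, varying timestamps or a single fixed value.
-- external_table and ts come from the DDL in the original post;
-- the column aliases are hypothetical names chosen for readability.
SELECT
  MIN(ts) AS earliest_ts,
  MAX(ts) AS latest_ts,
  COUNT(DISTINCT ts) AS distinct_ts
FROM external_table;
```

If earliest_ts equals latest_ts (or distinct_ts is 1), the column is effectively a fixed value rather than actual time.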

3 replies

jason-heo Oct 12, 2024
Author

@fengjiachun

Hello.

After applying your recommendations, it became a lot FASTER! Many thanks.

The "NATIVE TABLE v2" column in the table below shows the new timings.

| Query No. | GreptimeDB EXTERNAL TABLE (standalone, pod core=8) | GreptimeDB NATIVE TABLE v1 (with PK, not append mode) | GreptimeDB NATIVE TABLE v2 (without PK, append mode) | local spark (--master=local[8] --driver-memory=16G) |
|---|---|---|---|---|
| Q1 | 0.04 sec | 1 min 34 sec | 0.01 sec | 0.23 sec |
| Q2 | 1.42 sec | 30 sec | 0.47 sec | 0.65 sec |
| Q3 | 6.8 sec | 2 min 25 sec | 3.26 sec | 6.41 sec |
| Q4 | 1.64 sec | 1 min 53 sec | 0.70 sec | 1.70 sec |

> I also have a question: Does the ts column in your data file represent actual time, or is it just a fixed value? (GreptimeDB organizes data by time, so having ts truly represent time would be beneficial.)

Yes, ts holds actual timestamps.

May I ask one more question?

How can I use a secondary index? The Data Model documentation says that "Tag columns are indexed, making queries on tags performant". So I thought all tags were indexed by default, and when I created the native table and loaded data into it, there were puffin files.

But after removing the PRIMARY KEY, there was no puffin file in the data directory. I couldn't find a way to enable an index in CREATE TABLE.

Have a nice weekend.

jason-heo Oct 13, 2024
Author

After reading the documentation on 'append_mode', I noticed that append mode can have a PRIMARY KEY, and it seems that fields in the primary key are treated as tags and automatically indexed. Is my understanding correct?

I wanted to leverage GreptimeDB's secondary index to quickly find rows.

My query patterns are:

```
WHERE blog_id = 'foo'
WHERE doc_id = 'foo'
WHERE visitor_id = 'foo'
```

I added PRIMARY KEY (blog_id, doc_id, visitor_id) to the CREATE TABLE statement, which resulted in the creation of an index at index/bar.puffin after performing an INSERT INTO.

However, my SELECT queries have become slow again.

I am not sure whether this is normal in GreptimeDB or whether I have made a mistake.
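
For reference, the table definition described in this reply would presumably look like the sketch below. It is a reconstruction, not the statement that was actually run: it assumes the table is still called native_table, that row_id stays as an ordinary column, and that 'append_mode' was kept from the earlier suggestion.

```sql
-- Reconstruction (assumed): the reply describes this change but does not show the DDL.
-- Column names and types are copied from the earlier CREATE TABLE statements.
CREATE TABLE IF NOT EXISTS native_table (
  row_id STRING,
  blog_id STRING,
  doc_id STRING,
  visitor_id STRING,
  ts TIMESTAMP TIME INDEX DEFAULT '1970-01-01 00:00:00+0000',
  -- Primary-key columns are treated as tags, which is what produces the puffin index file.
  PRIMARY KEY (blog_id, doc_id, visitor_id)
) WITH ('append_mode'='true');  -- assumed to be unchanged from the suggested DDL
```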

zhongzc Oct 14, 2024
Maintainer

> So I thought all tags were indexed by default, and when I created the native table and loaded data into it, there were puffin files.

> it seems that fields in the primary key are treated as tags and automatically indexed.

Your understanding is correct. In the current version, indexes are indeed associated with tags, and the fields within the primary key in GreptimeDB are referred to as tags.

> But after removing the PRIMARY KEY, there was no puffin file in the data directory. I couldn't find a way to enable an index in CREATE TABLE.

In the current version, there is no way to enable an index without the primary key. However, this feature is under development. In the future, you will be able to index columns that are not part of the primary key, and also to exclude certain primary key columns from indexing.
