-
Notifications
You must be signed in to change notification settings - Fork 6.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add doc_values
for fields that need to be sorted or aggregated in ElasticSearch, and disable all others.
#12782
Conversation
Unlike what is suggested in #12741, I use |
if (!elasticSearchExtension.isDocValuesEnabled()) { | ||
columnProperties.put("doc_values", false); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
entity_id
in all OAL generated metrics, how do we add this for that? I don't see OAL engine relative changes.
@@ -78,6 +78,7 @@ public StorageID id() { | |||
private String id0; | |||
@Column(name = ID1, storageOnly = true) | |||
private String id1; | |||
@ElasticSearch.EnableDocValues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does alarm start time need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does alarm start time need this?
All fields used for sorting and aggregation need this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does alarm start time need this?
This reminds me one potential issue here, if we restrict to only add fields used in this repo, it might break third parties’s custom plugin if they add more features to their own plugin by aggregating/sorting some fields that we didn’t enable doc_values. The extensibility will be restricted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will. Generally, it may be worth to see how much benefit we will get from this change. Could you try a benchmark about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But to be honest, we never guarantee users could read data from elasticsearch on their own, we only guarantee from our GraphQL/PromQL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hanahmily can you share where you find disabling doc_values
can speed up the query performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, he was talking about BanyanDB, and the concept was from this config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key use case is when we talk about _traffic
indices and trace/log indices. Conditions for that, is not used for sorting and aggregation. So, we could reduce the payload.
1. Value column of metrics.
2. Conditions of logs and traces(skywalking and zipkin) exclude latency and timestamp, which are used in sorting.
3. All searchable field in metadata(*_traffic)
In the original issue, I only proposed three use cases. Nothing more in my mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conditions for that, is not used for sorting and aggregation. So, we could reduce the payload.
What I’m wondering is this, what “payload” are we trying to reduce, as mentioned, disabling doc_values
is mainly for reducing disk space, I don’t see how it would speed up in terms of query performance like you said here #12782 (comment). If reducing disk space is our goal, disabling all possible fields will maximize the outcome, that’s why I tend to disable doc_values by default and opt in those fields that need this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reducing disk space is a benefit, and speeding up
is just from smaller files/indices perspective, which is BanyanDB side asking about.
Sorry for misleading.
Please note, we are better to verify what is the impact to existing(last version) indices? Does Elasticsearch and our storage implementation support to change this config automatically when upgrade. |
We support modifying index/template mapping actually, just run a test on the same ES server from the previous commit (master branch) and then upgrade to this branch, the existing indices's mappings remain the same and the template mappings changed as expected, which will impact the new indices created in the future. Diff of sw_metrics_all template mapping
diff --git a/tmp/before-template.json b/tmp/after-template.json
index 603432b7f9..113f050e94 100644
--- a/tmp/before-template.json
+++ b/tmp/after-template.json
@@ -27,90 +27,110 @@
},
"properties": {
"dest_service_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"agent_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"last_ping": {
- "type": "long"
+ "type": "long",
+ "doc_values": false
},
"precision": {
"index": false,
- "type": "integer"
+ "type": "integer",
+ "doc_values": false
},
"double_summation": {
"index": false,
- "type": "double"
+ "type": "double",
+ "doc_values": false
},
"labels_json": {
"index": false,
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"tag_key": {
"type": "keyword"
},
"type": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"uuid": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"summation": {
"index": false,
- "type": "long"
+ "type": "long",
+ "doc_values": false
},
"instance_traffic_name": {
"index": false,
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"percentage": {
"type": "integer"
},
"total_num": {
"index": false,
- "type": "long"
+ "type": "long",
+ "doc_values": false
},
"time_bucket": {
"type": "long"
},
"service_layer": {
- "type": "integer"
+ "type": "integer",
+ "doc_values": false
},
"component_id": {
"index": false,
"type": "integer"
},
"service_name": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"count": {
"index": false,
- "type": "long"
+ "type": "long",
+ "doc_values": false
},
"entity_id": {
"type": "keyword"
},
"denominator": {
- "type": "long"
+ "type": "long",
+ "doc_values": false
},
"numerator": {
- "type": "long"
+ "type": "long",
+ "doc_values": false
},
"dest_process_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"start_time": {
"type": "long"
},
"related_service_layer": {
- "type": "integer"
+ "type": "integer",
+ "doc_values": false
},
"instance_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"tag_value": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"ranks": {
"index": false,
@@ -118,58 +138,71 @@
},
"t_num": {
"index": false,
- "type": "long"
+ "type": "long",
+ "doc_values": false
},
"service_traffic_name_match": {
"analyzer": "oap_analyzer",
"type": "text"
},
"related_service_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"name": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"service_traffic_name": {
"copy_to": "service_traffic_name_match",
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"short_name": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"s_num": {
"index": false,
- "type": "long"
+ "type": "long",
+ "doc_values": false
},
"parameters": {
"index": false,
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"process_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"span_name": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"datatable_summation": {
"index": false,
"type": "text"
},
"detect_type": {
- "type": "integer"
+ "type": "integer",
+ "doc_values": false
},
"tag_type": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"task_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"component_ids": {
"index": false,
"type": "keyword"
},
"layer": {
- "type": "integer"
+ "type": "integer",
+ "doc_values": false
},
"int_value": {
"type": "integer"
@@ -178,122 +211,154 @@
"type": "keyword"
},
"remote_service_name": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"endpoint": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"attr0": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"total": {
"index": false,
- "type": "long"
+ "type": "long",
+ "doc_values": false
},
"ebpf_profiling_schedule_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"endpoint_traffic_name_match": {
"analyzer": "oap_analyzer",
"type": "text"
},
"service_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"datatable_count": {
"index": false,
"type": "text"
},
"source_service_instance_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"service_instance": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"profiling_support_status": {
- "type": "integer"
+ "type": "integer",
+ "doc_values": false
},
"value": {
"type": "long"
},
"source_service_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"address": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"datatable_value": {
"index": false,
"type": "text"
},
"represent_service_instance_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"dest_endpoint": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"represent_service_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"end_time": {
- "type": "long"
+ "type": "long",
+ "doc_values": false
},
"match": {
"index": false,
- "type": "long"
+ "type": "long",
+ "doc_values": false
},
"service_group": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"attr5": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"label": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"service_instance_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"related_instance_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"source_process_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"message": {
"index": false,
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"attr2": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"attr1": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"double_value": {
"type": "double"
},
"attr4": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"endpoint_traffic_name": {
"copy_to": "endpoint_traffic_name_match",
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"attr3": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"dest_service_instance_id": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"last_update_time_bucket": {
- "type": "long"
+ "type": "long",
+ "doc_values": false
},
"source_endpoint": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"service": {
- "type": "keyword"
+ "type": "keyword",
+ "doc_values": false
},
"dataset": {
"index": false,
|
OK, this seems good enough not breaking anything once we added all necessary annotations. |
CHANGES
log.