ServiceName column missing in the Otel Table in Clickhouse #1166

Closed
anmolghosh opened this issue Sep 12, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@anmolghosh

anmolghosh commented Sep 12, 2024

Component(s)

otelcollector

Component version

0.17.1

wgc version

0.40.2

controlplane version

0.107.0

router version

0.109.1

What happened?

Description

We have Cosmo deployed inside a Kubernetes cluster using the standard Helm chart provided by the WunderGraph team. When we look at the logs of the otelcollector, we see the following errors:

insert sum metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_sum
insert gauge metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_gauge
insert histogram metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_histogram

Running a count query on the above tables inside ClickHouse shows that all three tables have zero records.
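
For reference, the verification was roughly the following (a sketch; table names as reported above, each query returned 0):

-- Sketch of the count checks against the affected metrics tables
SELECT count() FROM cosmo.otel_metrics_sum;
SELECT count() FROM cosmo.otel_metrics_gauge;
SELECT count() FROM cosmo.otel_metrics_histogram;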

Also, if we review the schema in the following file, we can verify that the column ServiceName was never defined in the first place:
https://github.com/wundergraph/cosmo/blob/main/controlplane/db/schema.sql

Steps to Reproduce

Install WunderGraph Cosmo using the standard Helm chart with otelcollector enabled.
Prometheus is kept disabled on all components.

Expected Result

There are no errors in the otelcollector logs, and records are inserted into the corresponding ClickHouse tables.

Actual Result

All inserts to otel_metrics_sum, otel_metrics_gauge and otel_metrics_histogram are failing with the error:

 No such column ServiceName in table

Environment information

Environment

WunderGraph Cosmo Stack running on AWS EKS Kubernetes Cluster.
Stack deployed using helm chart

oci://ghcr.io/wundergraph/cosmo/helm-charts/cosmo

Versions:

Helm Chart: 0.11.1
cdn: 0.10.1
controlplane: 0.107.0
studio: 0.88.1
router: 0.109.1
otelcollector: 0.17.1
graphqlmetrics: 0.22.0

Router configuration

version: "1"
headers:
  all: # Header rules for all origin requests.
    request:
      - op: "propagate"
        named: Authorization
      - op: "propagate"
        named: ClientKey
      - op: "propagate"
        named: refresh_token
      - op: "propagate"
        named: Traceparent
subgraph_error_propagation:
  enabled: true
  mode: "pass-through"
traffic_shaping:
  all:
    request_timeout: 120s
cors:
  allow_headers:
    - Origin
    - Content-Length
    - Content-Type
    - ClientKey
    - Refresh_Token

Router execution config

No response

Log output

2024-09-13T00:56:28+05:30 {"level":"info","ts":1726169188.0839438,"caller":"exporterhelper/retry_sender.go:118","msg":"Exporting failed. Will retry the request after interval.","kind":"exporter","data_type":"metrics","name":"clickhouse","error":"insert sum metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_sum (fa3d81a4-19e7-4ef0-9832-dabab1acd4c5)\ninsert gauge metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_gauge (a3cb4634-2048-4bad-9ac6-8828c9631d21)\ninsert histogram metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_histogram (edbfb99c-5432-418d-875c-1538d263e1f8)","interval":"35.094430427s"}
2024-09-13T00:57:03+05:30 {"level":"info","ts":1726169223.1858504,"caller":"exporterhelper/retry_sender.go:118","msg":"Exporting failed. Will retry the request after interval.","kind":"exporter","data_type":"metrics","name":"clickhouse","error":"insert gauge metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_gauge (a3cb4634-2048-4bad-9ac6-8828c9631d21)\ninsert sum metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_sum (fa3d81a4-19e7-4ef0-9832-dabab1acd4c5)\ninsert histogram metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_histogram (edbfb99c-5432-418d-875c-1538d263e1f8)","interval":"18.860788202s"}

Additional context

No response

@anmolghosh anmolghosh added the bug Something isn't working label Sep 12, 2024

WunderGraph commits fully to Open Source and we want to make sure that we can help you as fast as possible.
The roadmap is driven by our customers and we have to prioritize issues that are important to them.
You can influence the priority by becoming a customer. Please contact us here.

@StarpTech
Contributor

StarpTech commented Sep 12, 2024

Hi @anmolghosh, this must be related to an old migration issue. I assume you have been running Cosmo for a while already. You should be able to fix this by adding the column manually to the tables. The schema definition can be found here.
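
A minimal sketch of what adding the column manually could look like, assuming ServiceName uses the LowCardinality(String) type; verify the exact type against the schema definition linked above before applying:

-- Add the missing ServiceName column to each affected metrics table.
-- On a clustered/replicated deployment you may additionally need ON CLUSTER.
ALTER TABLE cosmo.otel_metrics_sum       ADD COLUMN IF NOT EXISTS ServiceName LowCardinality(String);
ALTER TABLE cosmo.otel_metrics_gauge     ADD COLUMN IF NOT EXISTS ServiceName LowCardinality(String);
ALTER TABLE cosmo.otel_metrics_histogram ADD COLUMN IF NOT EXISTS ServiceName LowCardinality(String);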

@anmolghosh
Author

Hi @StarpTech

Yes, we have been running it for some time now. We are also in the process of migrating ClickHouse from Kubernetes-hosted to cloud-managed. I am assuming that running the migrations from scratch on the new cloud instance should fix any table inconsistencies.

@anmolghosh
Author

Hi @StarpTech

Is there any reason the table schema is not updated in either the schema file or the migrations folder?

Since Cosmo's Helm chart runs the ClickHouse migration job by default (using dbmate) when the OtelCollector is enabled, this will never execute because of the CREATE TABLE IF NOT EXISTS clause.
Ref:

So even someone setting up Cosmo from scratch might face a similar issue.
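
An illustrative sketch of the pitfall (the column list here is abbreviated and hypothetical): because the statement is guarded by IF NOT EXISTS, a table created by an older migration is left untouched, so a column later added to the definition never materializes.

-- No-op if cosmo.otel_metrics_sum already exists from an earlier migration,
-- so the ServiceName column in the updated definition is never applied.
CREATE TABLE IF NOT EXISTS cosmo.otel_metrics_sum
(
    ServiceName        LowCardinality(String),
    ResourceAttributes Map(LowCardinality(String), String),
    TimeUnix           DateTime64(9)
    -- remaining columns omitted in this sketch
)
ENGINE = MergeTree
ORDER BY TimeUnix;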

@anmolghosh
Author

Hi @StarpTech

I see the team has added Helm hooks to handle it. I will do a fresh run and see how it goes.

Thanks again.

Suggestion: We should either delete the tables from controlplane/db/schema.sql, if the goal is to create the initial tables with it, or alternatively update the file so it is in sync with the latest schema.

@StarpTech
Contributor

Hi, yes this should fix it.

Any reason the Table Schema is not updated in either of the schema file or migration folder:

This file is no longer in active use and is considered obsolete, so we plan to clean it up.

Unfortunately, the OTEL collector migration was not included, and the fix was only applied to customers who encountered issues. We understand this is not ideal, and we will make every effort to prevent such situations in the future.

Since the Cosmo's helm chart run the Clickhouse migration job by default using dbmate when OtelCollector is enabled, this will never execute because of the clause CREATE TABLE IF NOT EXISTS
Ref:

The OTEL collector applies its own migrations. There is no automated mechanism yet to modify the OTEL collector schema after the initial migration.

@StarpTech
Contributor

StarpTech commented Sep 12, 2024

Feel free to reopen it if you think it was not resolved.

@MnLujan

MnLujan commented Dec 27, 2024

Hi @StarpTech, is this issue still unresolved? Currently, I’m deploying Cosmo on Kubernetes using the Helm chart version 0.12.3, and I’m encountering problems with the clickhouse-migration job in controlplane because the otel_metrics table does not exist. Shouldn't OtelCollector create these tables in ClickHouse?

versions:
controlplane: 0.118.1
otelcollector: 0.17.1

@StarpTech
Contributor

StarpTech commented Dec 27, 2024

Hi @MnLujan, this issue was resolved a long time ago. Please share your exact steps and the complete logs of the error. Also ensure that you are on the latest Helm version. The tables are created automatically by the otelcollector, plus a few by the ClickHouse migration job from the gqlmetrics collector.

@MnLujan

MnLujan commented Dec 30, 2024

Hey @StarpTech, thanks for your response.

Here’s the flow of events:

We updated Cosmo from Helm chart version 0.8.0 to 0.12.3 and also updated the Docker images to the following versions:

  • controlplane: 0.118.1
  • studio: 0.96.0
  • otelcollector: 0.17.1
  • router: 0.150.0
  • graphqlmetrics: 0.30.2
  • cdn: 0.11.2
  • keycloak: 0.10.4

After the update, we noticed that metrics weren’t working. It turned out to be a communication issue between the Router and OtelCollector. Once that was resolved, I encountered another issue, which led me here.

I tried running the migrations manually and faced issues with another column, so I decided to recreate the PVC (since the data wasn’t critical). When recreating the PVC via Helm, the clickhouse-migrations-job from ControlPlane is triggered before completion, and I'm now getting the following errors:

  • clickhouse migration job:
Migrating ClickHouse database
ClickHouse database migration failed

ExecaError: Command failed with exit code 2: dbmate --wait --wait-timeout 30s -u 'clickhouse://user:password@clickhouse-url:9000/cosmo?dial_timeout=15s&max_execution_time=60' --no-dump-schema -d clickhouse/migrations up

Error: code: 60, message: Unknown table expression identifier 'cosmo.otel_traces' in scope SELECT TraceId, SpanId, toDateTime(Timestamp, 'UTC') AS Timestamp, SpanAttributes['wg.operation.name'] AS OperationName, toLowCardinality(SpanAttributes['wg.operation.type']) AS OperationType, SpanAttributes['wg.federated_graph.id'] AS FederatedGraphID, toLowCardinality(SpanAttributes['wg.organization.id']) AS OrganizationID, Duration, toLowCardinality(StatusCode) AS StatusCode, if((StatusMessage = 'STATUS_CODE_ERROR') OR (position(SpanAttributes['http.status_code'], '5') = 1) OR (position(SpanAttributes['http.status_code'], '4') = 1) OR mapContains(SpanAttributes, 'wg.request.error'), true, false) AS HasError, StatusMessage, SpanAttributes['wg.operation.hash'] AS OperationHash, SpanAttributes['wg.operation.content'] AS OperationContent, SpanAttributes['wg.operation.persisted_id'] AS OperationPersistedID, SpanAttributes['http.status_code'] AS HttpStatusCode, SpanAttributes['http.host'] AS HttpHost, SpanAttributes['http.user_agent'] AS HttpUserAgent, SpanAttributes['http.method'] AS HttpMethod, SpanAttributes['http.target'] AS HttpTarget, SpanAttributes['wg.client.name'] AS ClientName, SpanAttributes['wg.client.version'] AS ClientVersion, mapContains(SpanAttributes, 'wg.subscription') AS Subscription FROM cosmo.otel_traces WHERE ((SpanAttributes['wg.router.root_span']) = 'true') OR ((SpanAttributes['wg.component.name']) = 'router-server') ORDER BY Timestamp DESC

Applying: 20230825095359_traces_mv.sql
at getFinalError (file:///app/node_modules/.pnpm/execa@9.3.0/node_modules/execa/lib/return/final-error.js:6:9)
at makeError (file:///app/node_modules/.pnpm/execa@9.3.0/node_modules/execa/lib/return/result.js:108:16)
at getAsyncResult (file:///app/node_modules/.pnpm/execa@9.3.0/node_modules/execa/lib/methods/main-async.js:167:4)
at handlePromise (file:///app/node_modules/.pnpm/execa@9.3.0/node_modules/execa/lib/methods/main-async.js:150:17)
at async file:///app/dist/bin/ch-migrate.js:9:24 {
shortMessage: "Command failed with exit code 2: dbmate --wait --wait-timeout 30s -u 'clickhouse://user:password@clickhouse-url:9000/cosmo?dial_timeout=15s&max_execution_time=60' --no-dump-schema -d clickhouse/migrations up",
command: 'dbmate --wait --wait-timeout 30s -u clickhouse://user:password@clickhouse-url:9000/cosmo?dial_timeout=15s&max_execution_time=60 --no-dump-schema -d clickhouse/migrations up',
escapedCommand: "dbmate --wait --wait-timeout 30s -u 'clickhouse://user:password@clickhouse-url:9000/cosmo?dial_timeout=15s&max_execution_time=60' --no-dump-schema -d clickhouse/migrations up",
cwd: '/app',
durationMs: 180.621146,
failed: true,
timedOut: false,
isCanceled: false,
isGracefullyCanceled: false,
isTerminated: false,
isMaxBuffer: false,
isForcefullyTerminated: false,
exitCode: 2,
stdout: 'Applying: 20230825095359_traces_mv.sql',
stderr: "Error: code: 60, message: Unknown table expression identifier 'cosmo.otel_traces' in scope SELECT TraceId, SpanId, toDateTime(Timestamp, 'UTC') AS Timestamp, SpanAttributes['wg.operation.name'] AS OperationName, toLowCardinality(SpanAttributes['wg.operation.type']) AS OperationType, SpanAttributes['wg.federated_graph.id'] AS FederatedGraphID, toLowCardinality(SpanAttributes['wg.organization.id']) AS OrganizationID, Duration, toLowCardinality(StatusCode) AS StatusCode, if((StatusMessage = 'STATUS_CODE_ERROR') OR (position(SpanAttributes['http.status_code'], '5') = 1) OR (position(SpanAttributes['http.status_code'], '4') = 1) OR mapContains(SpanAttributes, 'wg.request.error'), true, false) AS HasError, StatusMessage, SpanAttributes['wg.operation.hash'] AS OperationHash, SpanAttributes['wg.operation.content'] AS OperationContent, SpanAttributes['wg.operation.persisted_id'] AS OperationPersistedID, SpanAttributes['http.status_code'] AS HttpStatusCode, SpanAttributes['http.host'] AS HttpHost, SpanAttributes['http.user_agent'] AS HttpUserAgent, SpanAttributes['http.method'] AS HttpMethod, SpanAttributes['http.target'] AS HttpTarget, SpanAttributes['wg.client.name'] AS ClientName, SpanAttributes['wg.client.version'] AS ClientVersion, mapContains(SpanAttributes, 'wg.subscription') AS Subscription FROM cosmo.otel_traces WHERE ((SpanAttributes['wg.router.root_span']) = 'true') OR ((SpanAttributes['wg.component.name']) = 'router-server') ORDER BY Timestamp DESC",
stdio: [
undefined,
'Applying: 20230825095359_traces_mv.sql',
"Error: code: 60, message: Unknown table expression identifier 'cosmo.otel_traces' in scope SELECT TraceId, SpanId, toDateTime(Timestamp, 'UTC') AS Timestamp, SpanAttributes['wg.operation.name'] AS OperationName, toLowCardinality(SpanAttributes['wg.operation.type']) AS OperationType, SpanAttributes['wg.federated_graph.id'] AS FederatedGraphID, toLowCardinality(SpanAttributes['wg.organization.id']) AS OrganizationID, Duration, toLowCardinality(StatusCode) AS StatusCode, if((StatusMessage = 'STATUS_CODE_ERROR') OR (position(SpanAttributes['http.status_code'], '5') = 1) OR (position(SpanAttributes['http.status_code'], '4') = 1) OR mapContains(SpanAttributes, 'wg.request.error'), true, false) AS HasError, StatusMessage, SpanAttributes['wg.operation.hash'] AS OperationHash, SpanAttributes['wg.operation.content'] AS OperationContent, SpanAttributes['wg.operation.persisted_id'] AS OperationPersistedID, SpanAttributes['http.status_code'] AS HttpStatusCode, SpanAttributes['http.host'] AS HttpHost, SpanAttributes['http.user_agent'] AS HttpUserAgent, SpanAttributes['http.method'] AS HttpMethod, SpanAttributes['http.target'] AS HttpTarget, SpanAttributes['wg.client.name'] AS ClientName, SpanAttributes['wg.client.version'] AS ClientVersion, mapContains(SpanAttributes, 'wg.subscription') AS Subscription FROM cosmo.otel_traces WHERE ((SpanAttributes['wg.router.root_span']) = 'true') OR ((SpanAttributes['wg.component.name']) = 'router-server') ORDER BY Timestamp DESC"
],
ipcOutput: [],
pipedFrom: []
}

I executed the command SHOW TABLES FROM cosmo; inside the ClickHouse pod and I see:

SHOW TABLES FROM cosmo

Query id: ce2c1bb1-d9eb-4552-b2a9-f81085db9c20

    ┌─name──────────────────────────────────────┐
 1. │ operation_latency_metrics_5_30            │
 2. │ operation_request_metrics_5_30            │
 3. │ router_metrics_30                         │
 4. │ router_uptime_30                          │
 5. │ schema_migrations                         │
 6. │ subgraph_latency_metrics_5_30             │
 7. │ subgraph_request_metrics_5_30             │
 8. │ traces                                    │
 9. │ traces_by_client_quarter_hourly           │
10. │ traces_by_http_status_code_quarter_hourly │
11. │ traces_by_operation_quarter_hourly        │
    └───────────────────────────────────────────┘

My understanding is that the table cosmo.otel_traces should be created by the OtelCollector, but that doesn't appear to be happening.
OtelCollector is enabled via the Helm chart's values file.
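
As a quick check (a sketch, not part of the original report), the ClickHouse client can confirm whether the collector has created the table before the controlplane migration job runs:

-- Returns 1 if the table exists, 0 otherwise.
EXISTS TABLE cosmo.otel_traces;

-- If it does exist, inspect the definition the collector created.
SHOW CREATE TABLE cosmo.otel_traces;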

Do you have any suggestions? I’d really appreciate your help. Thanks! 🙌🏼
