Cortex is in pending state without starting workloads #1735

Open
alexandreLamarre opened this issue Sep 25, 2023 · 6 comments · Fixed by #1746

alexandreLamarre commented Sep 25, 2023

steps to reproduce

  • Start with a fresh cluster
  • Install via HA preset with S3
  • UI shows monitoring backend is installed

expected

UI

  • expected the UI to show some warning that Cortex is in fact not installed (driver status = Pending here)
  • the UI is not using the correct APIs or the correct API ordering; this needs further investigation

Backend

  • expected cortex to deploy workloads

Not classified

  • expected the monitoring cluster CRD preset to not be "all"; expected replicas to be 3

opni-manager logs

[20:46:48] ERROR monitoring failed to reconcile monitoring cluster {"gateway": "opni", "namespace": "opni", "error": "Gateway.core.opni.io \"\" not found"}
github.com/rancher/opni/controllers.(*CoreMonitoringReconciler).Reconcile
	github.com/rancher/opni/controllers/core_monitoring_controller.go:86
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:118
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:314
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226
[20:46:48] ERROR Reconciler error {"controller": "monitoringcluster", "controllerGroup": "core.opni.io", "controllerKind": "MonitoringCluster", "MonitoringCluster": {"name":"opni","namespace":"opni"}, "namespace": "opni", "name": "opni", "reconcileID": "6bd50549-e075-4975-b56c-e912bc095992", "error": "Gateway.core.opni.io \"\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226

monitoring cluster CRD

spec:
  cortex:
    # .... (this is correct)
    cortexWorkloads:
      targets:
        all:
          replicas: 1
    enabled: true
  gateway: {}
  grafana:
    config: {}
    dashboardContentCacheDuration: 0s
    enabled: true
    hostname: // ... (this is correct)

gateway pod ENV

// ...
GATEWAY_NAME=opni-gateway
POD_NAMESPACE=opni
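For reference, the Gateway.core.opni.io "" not found error above lines up with the empty gateway: {} in the CRD dump: the reconciler ends up looking up a Gateway with an empty name. Below is a minimal sketch of the kind of guard/defaulting that would surface this earlier; it is not the actual opni reconciler code, and the import path, field layout, and resolveGatewayRef helper name are assumptions.

package controllers

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	corev1beta1 "github.com/rancher/opni/apis/core/v1beta1" // assumed import path
)

// resolveGatewayRef (hypothetical) validates spec.gateway before the lookup so an
// empty ref fails with a useful message instead of `Gateway.core.opni.io "" not found`.
func resolveGatewayRef(ctx context.Context, c client.Client, mc *corev1beta1.MonitoringCluster) (*corev1beta1.Gateway, error) {
	ref := mc.Spec.Gateway // assumed to carry name/namespace, matching the CRD dump above
	if ref.Name == "" {
		// spec.gateway is {} in the dumped CRD, so there is nothing sensible to look up.
		return nil, fmt.Errorf("monitoring cluster %s/%s: spec.gateway.name is empty", mc.Namespace, mc.Name)
	}
	ns := ref.Namespace
	if ns == "" {
		// The UI sometimes writes only the name (see a later comment); default to
		// the MonitoringCluster's own namespace.
		ns = mc.Namespace
	}
	gw := &corev1beta1.Gateway{}
	if err := c.Get(ctx, types.NamespacedName{Name: ref.Name, Namespace: ns}, gw); err != nil {
		return nil, fmt.Errorf("failed to resolve gateway %s/%s: %w", ns, ref.Name, err)
	}
	return gw, nil
}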
@alexandreLamarre

Deleting the CRD and attempting to re-install via the UI results in the following call failing:

http://localhost:12080/opni-api/CortexOps/configuration/default

CONFLICT: nats: wrong last sequence: 1: key exists
(details: type.googleapis.com/google.rpc.ErrorInfo)

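For context, this conflict is what NATS JetStream KV returns when Create is called on a key that already holds a value. Below is a minimal, self-contained sketch (the bucket and key names are made up; this is not opni's actual storage code) showing the Create vs. compare-and-swap Update distinction a re-install path would need.

package main

import (
	"fmt"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}

	// Hypothetical bucket/key standing in for wherever the default CortexOps
	// configuration is persisted.
	kv, err := js.CreateKeyValue(&nats.KeyValueConfig{Bucket: "cortex-config"})
	if err != nil {
		panic(err)
	}

	// First install: Create succeeds because the key does not exist yet.
	if _, err := kv.Create("default", []byte(`{"enabled":true}`)); err != nil {
		panic(err)
	}

	// Re-install after the CRD was deleted out-of-band: the key is still there, so
	// Create fails with the wrong-last-sequence / key-exists error surfaced as CONFLICT.
	if _, err := kv.Create("default", []byte(`{"enabled":true}`)); err != nil {
		fmt.Println("second create:", err)
	}

	// A compare-and-swap Update against the current revision (or deleting the key
	// first) succeeds where the blind Create does not.
	entry, err := kv.Get("default")
	if err != nil {
		panic(err)
	}
	if _, err := kv.Update("default", []byte(`{"enabled":true}`), entry.Revision()); err != nil {
		fmt.Println("update:", err)
	}
}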

alexandreLamarre commented Sep 25, 2023

There is something weird going on with the gateway ref. After I fixed it manually and got the Cortex all-in-one preset to install, switching to the HA preset mode in the UI edited the CRD to have:

gateway:
  name: opni-gateway

but did not include the namespace

@alexandreLamarre

This also did not switch Cortex to the HA preset.

The status indicates it thinks it should be all-in-one and has deleted the HA setup:

status:
  cortex:
    version: v1.16.0-opni.8
    workloadStatus:
      alertmanager:
        conditions: StatefulSet has been successfully deleted
        ready: true
      all:
        conditions: All replicas are ready
        ready: true
      compactor:
        conditions: StatefulSet has been successfully deleted
        ready: true
      distributor:
        conditions: Deployment has been successfully deleted
        ready: true
      ingester:
        conditions: StatefulSet has been successfully deleted
        ready: true
      purger:
        conditions: Deployment has been successfully deleted
        ready: true
      querier:
        conditions: StatefulSet has been successfully deleted
        ready: true
      query-frontend:
        conditions: Deployment has been successfully deleted
        ready: true
      ruler:
        conditions: Deployment has been successfully deleted
        ready: true
      store-gateway:
        conditions: StatefulSet has been successfully deleted
        ready: true
    workloadsReady: true
  image: >-
    alex7285/opni@sha256:6180a4e04fe1b310b02c437766fbe20bf3702304e5cebf38a797140647d46435
  imagePullPolicy: IfNotPresent
spec:
  cortex:
    cortexConfig:
      limits:
        compactor_blocks_retention_period: {}
      log_level: debug
      storage:
        backend: s3
        filesystem: {}
        s3:
        # ....
    cortexWorkloads:
      targets:
        all:
          replicas: 1
    enabled: true
  gateway:
    name: opni-gateway
    namespace: opni
  grafana:
    config: {}
    dashboardContentCacheDuration: 0s
    enabled: true

@alexandreLamarre

Trying to then edit the config in the UI results in the following bug: the UI shows an "invalid" error (screenshot).

@alexandreLamarre

When the HA configuration was accepted by the UI, it was not applied to the backend:

{
    "enabled": true,
    "revision": {
        "revision": "56399193"
    },
    "cortexWorkloads": {
        "targets": {
            "all": {
                "replicas": 1
            }
        }
    },
    "cortexConfig": {
        "limits": {
            "compactorBlocksRetentionPeriod": "0s"
        },
        "storage": {
            "backend": "s3",
            "s3": {
                "endpoint": "s3.us-east-1.amazonaws.com",
                "region": "us-east-1",
                "secretAccessKey": "***",
                "accessKeyId": "AKIARHLSZXXGKCKBHQVX",
                "sse": {},
                "http": {}
            },
            "filesystem": {}
        },
        "logLevel": "debug"
    },
    "grafana": {
        "enabled": true,
        "hostname": "//...
    }
}

@alexandreLamarre

This issue also tracks the UI failures and the expected UI behavior.
