Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workflow Job run duration metrics are always "+Inf" #3522

Open
7 tasks done
jeremydonahue opened this issue May 15, 2024 · 1 comment · May be fixed by #3532
Open
7 tasks done

Workflow Job run duration metrics are always "+Inf" #3522

jeremydonahue opened this issue May 15, 2024 · 1 comment · May be fixed by #3532
Labels
bug Something isn't working community Community contribution needs triage Requires review from the maintainers

Comments

@jeremydonahue
Copy link

jeremydonahue commented May 15, 2024

Checks

Controller Version

v0.27.6

Helm Chart Version

0.23.7

CertManager Version

v1.10.1

Deployment Method

Helm

cert-manager installation

Not applicable to this issue but is installed correctly.

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of contributors and maintainers if your business is so critical and therefore you need priority support
  • I've read releasenotes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • I've migrated to the workflow job webhook event (if you using webhook driven scaling)

Resource Definitions

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/instance: arc-system
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: actions-runner-controller
    app.kubernetes.io/version: 0.27.6
    core.zr.org/team: core
    helm.sh/chart: actions-runner-controller-0.23.7
  name: arc-system-actions-runner-controller-actions-metrics-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: arc-system-actions-metrics-server
      app.kubernetes.io/name: actions-runner-controller
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: arc-system-actions-metrics-server
        app.kubernetes.io/name: actions-runner-controller
    spec:
      containers:
      - args:
        - --metrics-addr=127.0.0.1:8080
        - --log-format=json
        command:
        - /actions-metrics-server
        env:
        - name: GITHUB_WEBHOOK_SECRET_TOKEN
          valueFrom:
            secretKeyRef:
              key: github_webhook_secret_token
              name: app-secrets
              optional: true
        - name: GITHUB_TOKEN
          valueFrom:
            secretKeyRef:
              key: github_token
              name: app-secrets
              optional: true
        - name: GITHUB_APP_ID
          valueFrom:
            secretKeyRef:
              key: github_app_id
              name: app-secrets
              optional: true
        - name: GITHUB_APP_INSTALLATION_ID
          valueFrom:
            secretKeyRef:
              key: github_app_installation_id
              name: app-secrets
              optional: true
        - name: GITHUB_APP_PRIVATE_KEY
          valueFrom:
            secretKeyRef:
              key: github_app_private_key
              name: app-secrets
              optional: true
        - name: GITHUB_BASICAUTH_PASSWORD
          valueFrom:
            secretKeyRef:
              key: github_basicauth_password
              name: app-secrets
              optional: true
        image: summerwind/actions-runner-controller:v0.27.6
        name: actions-metrics-server
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 1
            memory: 1Gi
      - args:
        - --secure-listen-address=0.0.0.0:8443
        - --upstream=http://127.0.0.1:8080/
        - --logtostderr=true
        - --v=10
        image: quay.io/brancz/kube-rbac-proxy:v0.13.1
        name: kube-rbac-proxy
        ports:
        - containerPort: 8443
          name: metrics-port
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 1
            memory: 1Gi
      serviceAccountName: arc-system-actions-runner-controller-actions-metrics-server
      terminationGracePeriodSeconds: 10

To Reproduce

1. Enable actions metrics server
2. Configure webhook to send events to metrics server
3. View metrics

Describe the bug

Workflow Job run duration metrics are always in the +Inf bucket due to a bug in the code used to determine the run time. It is relying on the following code to get the queued and started time:

	var (
		queuedTime    time.Time
		startedTime   time.Time
		completedTime time.Time
	)

...

			if strings.HasPrefix(line, "Waiting for a runner to pick up this job...") {
				queuedTime, _ = time.Parse(time.RFC3339, timestamp)
				continue
			}

			if strings.HasPrefix(line, "Job is about to start running on the runner:") {
				startedTime, _ = time.Parse(time.RFC3339, timestamp)
				continue
			}

			// Last line in the log will count as the completed time
			completedTime, _ = time.Parse(time.RFC3339, timestamp)

And then it subtracts the started time from the completion time to calculate the run duration of the job. However, the "Waiting for a runner to pick up this job..." and "Job is about to start running on the runner:" lines only exist in the job output until the job is completed, at which point Github removes from from the output. When those lines aren't present, completedTime is correct but startedTime is the zero value for time.Time, which means you end up with a run duration of eg. 2562047h47m16.854775807s. See https://go.dev/play/p/fPfsc6Wp-gm for example.

Describe the expected behavior

The job duration metric should be correct.

Whole Controller Logs

I know it says not to omit this, but they are not relevant here and they contain repository information I'd rather not be public. There are no errors in the logs. Sorry. Let me know if you really want these and I'll find a private way to get them to you.

Whole Runner Pod Logs

N/A, see above.

Additional Context

Here are the job logs from the same job before and after the job is complete:

Before job is complete:

2024-05-15T20:02:25.0874171Z Requested labels: self-hosted, cicd
2024-05-15T20:02:25.0874499Z Job defined at: <omitted>
2024-05-15T20:02:25.0874598Z Waiting for a runner to pick up this job...
2024-05-15T20:02:25.2431948Z Job is about to start running on the runner: runner-cicd-tdrg7-2fcbn (organization)

After job is complete:

2024-05-15T20:02:29.8079838Z Current runner version: '2.316.1'
2024-05-15T20:02:29.8086381Z Runner name: 'runner-cicd-tdrg7-2fcbn'
2024-05-15T20:02:29.8087297Z Runner group name: 'Default'
2024-05-15T20:02:29.8088200Z Machine name: 'runner-cicd-tdrg7-2fcbn'
2024-05-15T20:02:29.8091270Z ##[group]GITHUB_TOKEN Permissions
2024-05-15T20:02:29.8093123Z Contents: read
2024-05-15T20:02:29.8093671Z Metadata: read
2024-05-15T20:02:29.8094172Z Packages: read
2024-05-15T20:02:29.8094649Z ##[endgroup]
2024-05-15T20:02:29.8098048Z Secret source: Actions
2024-05-15T20:02:29.8098758Z Prepare workflow directory
2024-05-15T20:02:29.9653636Z Prepare all required actions
2024-05-15T20:02:29.9843545Z Getting action download info
...

I'll be submitting a PR to fix this shortly PR which fixes the issue: #3532

@jeremydonahue jeremydonahue added bug Something isn't working community Community contribution needs triage Requires review from the maintainers labels May 15, 2024
Copy link
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

jeremydonahue added a commit to jeremydonahue/actions-runner-controller that referenced this issue May 15, 2024
…,completed time

Default these values to the timestamps contained in the event. They will
be updated if the log contains more accurate values, however the
"Waiting for a runner to pick up this job..." and "Job is about to start
running on the runner:" lines are only present in the logs until the job
has finished, at which point they are removed from the job logs by
Github, so they can't be used to calculate the job duration after the
job has finished.

Fixes actions#3522
jeremydonahue added a commit to jeremydonahue/actions-runner-controller that referenced this issue May 20, 2024
…,completed time

Default these values to the timestamps contained in the event. They will
be updated if the log contains more accurate values, however the
"Waiting for a runner to pick up this job..." and "Job is about to start
running on the runner:" lines are only present in the logs until the job
has finished, at which point they are removed from the job logs by
Github, so they can't be used to calculate the job duration after the
job has finished.

Fixes actions#3522
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working community Community contribution needs triage Requires review from the maintainers
Projects
None yet
1 participant