Improve Jenkins alerts and reporting #3088

Closed
UlisesGascon opened this issue Nov 22, 2022 · 13 comments

@UlisesGascon
Member

I was checking some issues about machines being down (#3083, #3084...) and thought we could implement a small Grafana dashboard to check machine status (ping and latency, maybe SSH connectivity in the future) and trigger alerts if we want.

I created this POC repo that parses the current inventory (excluding localhost IPs, etc.) and generates a local dockerized environment (Telegraf + InfluxDB + Grafana). It is just a quick, rough prototype to illustrate the idea.

I saw in #3084 that we use the same stack, so it won't be very complex to adapt. What do you think? Should we work on it? Are there other alternatives like Jenkins-status that cover this gap currently?

@UlisesGascon
Member Author

I have reconsidered this issue and, drawing from the Security WG's experience in implementing the OpenSSF Scorecard Monitor, I believe we can adopt a similar approach.

We can create a GitHub Action that parses the inventory file, extracts the IPs, and attempts to ping the machines (and, in the future, SSH into them). The output will be stored as a markdown file (similar to this one), making it easy to identify which machines are UP/DOWN. We can even automatically generate new issues (similar to this one) when a machine becomes unreachable.

This process can be initiated on demand and/or scheduled as a daily CRON job.
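
To make the idea above concrete, here is a minimal sketch of the "extract the IPs and probe them" step. It is only an illustration under assumptions: the inventory snippet, the function names (`extract_ips`, `tcp_probe`, `check_all`) and the choice of a TCP connection to port 22 as the reachability check are all hypothetical, and the real inventory layout may differ.

```python
import ipaddress
import re
import socket

# Hypothetical inventory snippet; the real file layout may differ.
SAMPLE_INVENTORY = """
hosts:
  - test-example-ubuntu-x64-1: {ip: 192.0.2.10}
  - test-example-ubuntu-x64-2: {ip: 192.0.2.11}
  - local-runner: {ip: 127.0.0.1}
"""

IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ips(inventory_text):
    """Return unique, non-loopback IPv4 addresses in order of appearance."""
    found = []
    for candidate in IPV4_PATTERN.findall(inventory_text):
        try:
            address = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # matched the regex but is not a valid address
        if address.is_loopback or str(address) in found:
            continue
        found.append(str(address))
    return found


def tcp_probe(ip, port=22, timeout=3.0):
    """Assumed probe: True if a TCP connection to the SSH port succeeds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(ips, is_reachable=tcp_probe):
    """Apply a reachability probe to every address and collect the results."""
    return {ip: is_reachable(ip) for ip in ips}
```

The probe is passed in as a parameter so that a ping-based or SSH-based check could be swapped in later without touching the parsing code.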

@UlisesGascon
Member Author

UlisesGascon commented Mar 24, 2023

Based on the idea in my last message, I created this GitHub Action, Jenkins status alerts and reporting, and published it in the marketplace.

I still need to fine-tune some details, such as unit testing, but in general it is already stable and we can use it.

What can this Action do for us?

  1. The Action will store a simple database file (here is an example).
  2. The Action will compare the stored database with the latest state in Jenkins and create a new issue for each machine that was previously online and is currently offline (it ignores machines that we manually disable in Jenkins). The issues can be labeled and assigned to specific users (here is an example).
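
The comparison described in step 2 could be sketched roughly as follows. This is not the Action's actual code: the function name `newly_offline` and the `offline` / `manually_disabled` fields are assumptions made for illustration, and skipping nodes with no stored state is only one possible reading of the behavior.

```python
def newly_offline(previous, current):
    """Return names of nodes that were online before and are offline now.

    Both arguments map node name -> {"offline": bool, "manually_disabled": bool}.
    These field names are illustrative, not the Action's real schema.
    """
    alerts = []
    for name, state in current.items():
        before = previous.get(name)
        if before is None:
            continue  # new node: no earlier state to compare against
        if state.get("manually_disabled"):
            continue  # taken offline on purpose, so no issue is wanted
        if not before["offline"] and state["offline"]:
            alerts.append(name)
    return sorted(alerts)
```

Each name returned would correspond to one new issue; nodes that were already offline in the stored database are not reported again.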

Additional features:

  • It can generate a markdown report that summarizes the status of the Jenkins nodes (here is an example).
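
As a rough illustration of what such a report could look like, the sketch below renders a node-status mapping as markdown. The function name `build_report`, the input shape and the report layout are assumptions, not the format the Action actually produces.

```python
def build_report(nodes):
    """Build a markdown summary from a {node_name: is_offline} mapping."""
    offline = sorted(name for name, is_offline in nodes.items() if is_offline)
    online_count = len(nodes) - len(offline)
    lines = [
        "# Jenkins nodes status",
        "",
        f"Online: {online_count} / Offline: {len(offline)}",
        "",
        "| Node | Status |",
        "| --- | --- |",
    ]
    # One table row per node, sorted by name for stable diffs between runs.
    for name in sorted(nodes):
        status = "offline" if nodes[name] else "online"
        lines.append(f"| {name} | {status} |")
    return "\n".join(lines)
```

Sorting the rows keeps the committed file stable between runs, so a diff only shows nodes whose status changed.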

Setup proposal

I will need to create a new GitHub Actions workflow in .github/workflows with this setup:

name: "Jenkins Nodes"
on: 
  workflow_dispatch:   

permissions:
  contents: write
  pull-requests: none 
  issues: write
  packages: none

jobs:
  security-scoring:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Jenkins Alert and Reporting
        uses: UlisesGascon/jenkins-status-alerts-and-reporting@v1.0.0
        id: jenkins-status-alerts-and-reporting
        with:
          database: experimental/database.json
          jenkins-domain: 'ci.nodejs.org'
          jenkins-username: ${{ secrets.JENKINS_USERNAME }}
          jenkins-token: ${{ secrets.JENKINS_TOKEN }}
          # Issues
          generate-issue: true
          issue-assignees: 'UlisesGascon'
          issue-labels: 'incident,infra'
          create-issues-for-new-offline-nodes: false
          # Report
          report: experimental/jenkins-report.md
          report-tags-enabled: true
          # Git changes
          auto-commit: true
          auto-push: true
          github-token: ${{ secrets.GITHUB_TOKEN }}

This setup requires generating a Jenkins API token and adding the credentials to the repo secrets (JENKINS_USERNAME and JENKINS_TOKEN).

Next steps

I would love to try the tool in the Build team and collect feedback to improve the GitHub Action for the next release, as we did for the OpenSSF Scorecard Monitor in the Security WG when we adopted the tool (nodejs/security-wg#886).

There is also an opportunity to evolve the tool to create tickets when a node's disk usage is high, giving us time to fix it before the node goes offline.

What do you think, @nodejs/build? Should we try it? Do we want to wait and discuss it in the next meeting?

@UlisesGascon UlisesGascon changed the title from "Visualize machines availability in Grafana" to "Improve Jenkins alerts and reporting" Mar 24, 2023
@UlisesGascon UlisesGascon self-assigned this Mar 24, 2023
@UlisesGascon
Member Author

As agreed in #3299, I will transfer the alerts demo repository to the Node.js org. I renamed it, and I will open a separate PR once it is migrated to clean up the experimental pipeline.

[Screenshot, 2023-04-11 at 20:44:19]

@UlisesGascon
Member Author

The migration seems to be completed in https://github.com/nodejs/jenkins-alerts

@UlisesGascon
Member Author

I believe we will need a settings change to make the @nodejs/build team the owner of the repo, with the expected write access and so on.

@richardlau
Member

> The migration seems to be completed in https://github.com/nodejs/jenkins-alerts

We should probably have followed the steps in https://github.com/nodejs/admin/blob/main/transfer-repo-into-the-org.md before doing the transfer. Maybe open an issue in the admin repo explaining that transferring the repo into the org was suggested during the Build WG call, and detailing whatever needs to happen next?

@UlisesGascon
Member Author

Thank you for bringing this to my attention, @richardlau. I did not read the documentation before submitting the transfer request, and I also mistakenly believed that the transfer would not be automatic 🤦.

I will create an issue in the Admin repository to clarify the transfer process and outline the expected future for this tool.

@targos
Member

targos commented Apr 12, 2023

I added the build team to the repo with the "Maintain" role.

@UlisesGascon
Member Author

UlisesGascon commented May 23, 2023

Next steps, as agreed in #3362:

UlisesGascon added a commit to nodejs/jenkins-alerts that referenced this issue May 24, 2023
@UlisesGascon
Member Author

@richardlau can you grant me access? I requested access for the GitHub integration between the jenkins-alerts repo and the Node.js Slack.

This will push the notifications to the #nodejs-build-infra-alerts channel.

[Screenshot, 2023-05-24 at 15:36:49]

[Screenshot, 2023-05-24 at 15:39:16]

@richardlau
Member

@UlisesGascon We probably should run that by https://github.com/nodejs/admin

@UlisesGascon
Member Author

Thanks for the suggestion @richardlau! I moved the discussion to nodejs/admin#799

@UlisesGascon
Member Author

As the pending items in #3088 (comment) are completed, I will close the issue. 🎉
