Monitor Critical GitHub Actions Workflows Across Organization Repositories #4941
Comments
Tagging @peterzhuamazon @gaiksaya @getsaurabh02 @prudhvigodithi @dblock for feedback and the way forward. |
Thanks @rishabh6788, this is an important enhancement. With the gathered GitHub Actions workflow data, we can even produce a summary of force-merged pull requests, which is an important metric for OpenSearch repo health. @getsaurabh02 @dblock I would vote for the 1st option: collect the incremental PR workflows, index the data, and create a monitoring tool on top of the indexed raw data. With option 2, even if we created a custom GitHub Action for this purpose, it would be tough to update the hundreds of workflow files across all the repos, and ensuring that the action exists in every new repo is a tedious job. Thank you |
I am also in line with pull-based monitoring and with carefully choosing the data sources we want to monitor. However, there will still be gaps where certain actions only run once a month during the release phase. We need to figure out a consistent way to dry-run these actions in order to detect issues beforehand. Thanks. |
Going with option 1 we can do the following:
@getsaurabh02 @dblock @rishabh6788 @peterzhuamazon @gaiksaya |
Following is the sample schema that can be indexed to the metrics cluster.
Once we have the above information:
Thank you |
Did a deeper dive into the possible repo workflows.
|
Synced up with Prudhvi today and confirmed that the automation app is able to grab all the necessary context for the requirements. We will see if we can combine the automation app and the metrics cluster on this. Thanks. |
Here are the final flow details, implemented based on all the merged pull requests linked to this issue.
graph LR
A[GitHub Workflow Events] --> B[GitHub Automation App]
B --> C[Failure Detection]
C --> D[Workflow Failure Identified]
D --> E[CloudWatch Alarms Update]
D --> F[Failures Indexed]
E --> I{Alarm Triggered?}
I -- Yes --> G[Alerts Sent to Teams]
I -- No --> J[No Action]
F --> H[Data for Debugging and Trend Analysis]
|
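The failure-detection step in the flow above could look roughly like the following Python sketch. It assumes the automation app receives GitHub `workflow_run` webhook payloads. The function name `detect_failure`, the record layout, the set of conclusions treated as failures, and the sample repository name are all hypothetical and are not taken from the merged pull requests.

```python
from typing import Optional

# Conclusions treated as failures. This set is an assumption for illustration;
# the real app may classify runs differently.
FAILED_CONCLUSIONS = {"failure", "timed_out"}


def detect_failure(payload: dict) -> Optional[dict]:
    """Return a flat failure record for a completed, failed run, else None."""
    # Only "completed" events carry a final conclusion.
    if payload.get("action") != "completed":
        return None
    run = payload.get("workflow_run", {})
    if run.get("conclusion") not in FAILED_CONCLUSIONS:
        return None
    # Hypothetical record shape: something that could be indexed for trend
    # analysis and also used to update an alarm metric.
    return {
        "repository": payload.get("repository", {}).get("full_name"),
        "workflow": run.get("name"),
        "branch": run.get("head_branch"),
        "conclusion": run.get("conclusion"),
        "url": run.get("html_url"),
    }


# Hypothetical sample payload, trimmed to the fields used above.
sample_event = {
    "action": "completed",
    "workflow_run": {
        "name": "Publish snapshots to maven",
        "head_branch": "main",
        "conclusion": "failure",
        "html_url": "https://github.com/example-org/example-repo/actions/runs/1",
    },
    "repository": {"full_name": "example-org/example-repo"},
}

if __name__ == "__main__":
    print(detect_failure(sample_event))
```

Returning `None` for anything that is not a completed, failed run keeps the alarm-update and indexing paths limited to real failures, which matches the two branches leaving "Workflow Failure Identified" in the diagram.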
Closing this issue. |
Is your feature request related to a problem? Please describe
Background
We recently had a situation where the "publish snapshots to maven" GitHub Actions workflow started failing across all the repositories due to an issue on the Sonatype Central side. They had accidentally deleted user tokens during maintenance, and our jobs started failing with 401 errors. The operator happened to check the failed workflow on the commit they had merged and saw the snapshot workflow failure. Upon further investigation, it was found that the same workflow had been failing across all the repositories with the same error for the past 24 hours.
We need to implement a system to monitor critical GitHub Actions workflows across multiple repositories in our organization. This will help us quickly identify and respond to workflow failures or issues.
Describe the solution you'd like
Proposed Solutions
We have identified two broad categories of approaches: pull-based and push-based monitoring.
1. Pull-based Monitoring
Description
a) Onboard GitHub Actions workflow metrics onto the existing metrics framework (Recommended)
b) Use GitHub REST APIs to periodically fetch the GitHub Actions status
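As a rough illustration of option b), the sketch below polls the GitHub REST endpoint that lists workflow runs for a repository, filtered to failed runs, and counts failures per workflow name. The helper names (`runs_url`, `fetch_failed_runs`, `failures_per_workflow`) and the page size are assumptions made for this example, and pagination, rate limiting, and scheduling are left out.

```python
import json
import urllib.request
from collections import Counter
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"


def runs_url(owner: str, repo: str, status: str = "failure", per_page: int = 50) -> str:
    """Build the 'list workflow runs for a repository' URL with a status filter."""
    query = urlencode({"status": status, "per_page": per_page})
    return f"{API_ROOT}/repos/{owner}/{repo}/actions/runs?{query}"


def fetch_failed_runs(owner: str, repo: str, token: str) -> dict:
    """Fetch one page of failed runs. Needs network access and a valid token."""
    request = urllib.request.Request(
        runs_url(owner, repo),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def failures_per_workflow(payload: dict) -> Counter:
    """Count runs per workflow name in a 'workflow_runs' response body."""
    return Counter(run["name"] for run in payload.get("workflow_runs", []))
```

A periodic job could call `fetch_failed_runs` for each repository in the organization and merge the resulting counters, so that a single workflow failing everywhere (as in the snapshot incident) stands out.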
Advantages
Challenges
2. Push-based Monitoring
Description
a) Slack Notifications Integration in Workflows
b) Email Notifications
d) Webhook Integration
Advantages
Challenges
Next Steps
Questions to Consider
Please comment with your thoughts, preferences, or any additional considerations for this monitoring system.
Describe alternatives you've considered
No response
Additional context
No response