Refactor Storage Calculator to be more robust #2947

Merged: 5 commits into main from storage-calc-refine, Dec 20, 2021

@seanhamlin (Contributor) commented Nov 27, 2021

Checklist

  • Affected Issues have been mentioned in the Closing issues section
  • Documentation has been written/updated
  • PR title is ready for changelog and subsystem label(s) applied

Explain the details for making this change. What existing problem does the pull request solve?

We noticed that storage-calculator was not running for all projects in a large dedicated cluster with its own Lagoon Core. This PR refactors the storage calculator to:

  • Use a consistent code style
  • Produce debug messages that are easier to read
  • Move guard statements to the top to avoid unnecessary nesting of code
  • Remove exit 1, which can halt the storage calculator before it has run over all projects
  • Use --ignore-not-found=true and pipe less output to /dev/null
  • Time out the storageCalc pod rollout after 30 seconds, so that RWO PVs cannot stall the script for too long. The current default is 10 minutes.
  • Add a new environment variable, LAGOON_STORAGE_IGNORE_REGEX, which can be used to selectively skip certain PVCs by name. For example, it is now easy to skip reading all the solr PVCs.
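
To illustrate the guard-clause style and the PVC skip described in the list above, here is a minimal bash sketch. It is not the code from this PR: the function names `should_skip_pvc` and `filter_pvcs` are hypothetical, and it assumes that LAGOON_STORAGE_IGNORE_REGEX is matched against the PVC name as an unanchored extended regular expression, and that an unset value skips nothing.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, not the PR's implementation. Only the variable name
# LAGOON_STORAGE_IGNORE_REGEX comes from the PR description.

# Return 0 (true) when the PVC name matches the ignore regex.
# Assumption: an unset or empty regex never matches, so nothing is skipped.
should_skip_pvc() {
  local pvc_name="$1"
  local ignore_regex="${LAGOON_STORAGE_IGNORE_REGEX:-}"
  [[ -n "$ignore_regex" && "$pvc_name" =~ $ignore_regex ]]
}

# Print the PVC names that should still be measured, one per line.
# Guards sit at the top of the loop body and use `continue` rather than
# `exit 1`, so one skipped or invalid PVC does not stop the rest.
filter_pvcs() {
  local pvc
  for pvc in "$@"; do
    if [[ -z "$pvc" ]]; then
      echo "skip: empty PVC name" >&2
      continue
    fi
    if should_skip_pvc "$pvc"; then
      echo "skip: ${pvc} matches LAGOON_STORAGE_IGNORE_REGEX" >&2
      continue
    fi
    echo "$pvc"
  done
}

# Example: skip every PVC whose name contains "solr".
LAGOON_STORAGE_IGNORE_REGEX='solr' filter_pvcs nginx solr solr-data mariadb
```

With the example call, `nginx` and `mariadb` are printed on stdout and the two skip messages go to stderr, which keeps the debug output separate from the data.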

I also added a README.md file that documents much of the above.

Things that would make great follow-up PRs:

Related issues

…atements to the top. Better logging. README.
@seanhamlin seanhamlin added the 3-logging-reporting Logging & Reporting subsystem label Nov 27, 2021
@rocketeerbkw (Member)

Adding more detail from internal comms about the issue and why the changes here will help:

  • there are a lot of projects (each with one or more environments)
  • projects are enumerated in ID order, meaning new projects are processed last
  • if an RWO PV is in use, the process stalls for 10 minutes while it waits for it to fail to spawn
  • the cron process cannot complete within 24 hours, so new projects are simply skipped.
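
As a rough back-of-the-envelope check of the points above, the sketch below compares the worst-case time lost to stalled rollouts under the 10 minute default and the 30 second timeout from the PR description. The figure of 144 stalled environments is purely illustrative (it is not from the cluster in question), and `stall_minutes` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Worst-case minutes spent waiting on stalled rollouts.
# Args: number of stalled environments, timeout per stall in seconds.
stall_minutes() {
  local stalled="$1" timeout_seconds="$2"
  echo $(( stalled * timeout_seconds / 60 ))
}

# 144 is an illustrative count, chosen because 144 x 10 minutes = 24 hours.
stall_minutes 144 600   # 10 minute default: prints 1440 (a full day)
stall_minutes 144 30    # 30 second timeout: prints 72
```

Under these assumed numbers, the stalls alone consume the entire 24 hour window with the default timeout, while the 30 second timeout reduces the same stalls to a little over an hour.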

@tobybellwood tobybellwood merged commit 82bc50f into main Dec 20, 2021
@tobybellwood tobybellwood deleted the storage-calc-refine branch December 20, 2021 01:33