Update search-engine resource allocations #137
Conversation
I've been testing this configuration under load since 6:30 this morning. A few instances of OOM errors have popped up, typically during periods of high CPU usage. The CPU usage bumps into the pod's limit and causes the pod to be taken offline, in turn causing a service interruption.
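For reference, this is the kind of knob being tuned here: a minimal sketch of adjusting the container's CPU/memory requests and limits with the OpenShift CLI. The resource name and values below are placeholder assumptions, not the settings actually used in this PR.

```bash
# Illustrative only: adjust the search-engine container's requests/limits.
# "dc/search-engine" and the values are placeholder assumptions.
oc set resources dc/search-engine \
  --requests=cpu=500m,memory=1Gi \
  --limits=cpu=2,memory=4Gi

# Check what the container is currently allowed to use.
oc get dc/search-engine -o jsonpath='{.spec.template.spec.containers[0].resources}'
```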
Start parameters for Solr (to look into the other options):
Stopped the load test and will let the current pod run overnight, then review the metrics and logs in the morning.
Also need to review the API logs. It seems like some sync/scanning is going on at times throughout the day.
Compare the previous image to the current image. We did not have this many issues with the previous image. On the surface they look the same (same Solr version), but the config may be slightly different.
Rolled back to the previous search-engine image.
Image comparison:
There are no changes to any of the Solr config between the Aries-VCR commits, and both images contain the Log4j mitigations, indicating the base images were built off the same version of the openshift-solr source. This means the two images should behave the same.
Upgraded the base image to Solr 8.11.2 (built off the solr-8.11.2 branch for testing); bcgov/openshift-solr#15
Command line for the Solr 8.11.2 container:
The new version still encounters OOM issues, but it now includes a script that kills the process right away when an OOM error occurs, so the pod gets restarted whenever it hits an OOM error. Same problem, just slightly different behavior.
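For context, this is roughly the mechanism Solr's start script uses: a JVM flag that invokes a small handler script to kill the process when an OutOfMemoryError is thrown, so the platform restarts the container. A simplified sketch; the exact flags, paths, and arguments in this image may differ.

```bash
# Simplified sketch of Solr's OOM handling (exact flags/paths may differ per image).
# When the JVM hits an OutOfMemoryError it runs the handler, which kills Solr so the
# container exits and OpenShift restarts the pod.
SOLR_OPTS="$SOLR_OPTS -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh"

# oom_solr.sh, roughly: log the event, then hard-kill the Solr JVM.
#   echo "$(date) OutOfMemoryError - killing Solr" >> "$SOLR_LOGS_DIR/solr_oom_killer.log"
#   kill -9 "$SOLR_PID"
```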
Manually adjusted memory allocation and settings for testing:
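As the commit notes further down point out, Solr reads its JVM memory options from SOLR_JAVA_MEM, which overrides JAVA_OPTS. A minimal sketch of that kind of adjustment; the deployment name and heap values are placeholder assumptions, not the values actually tested.

```bash
# Illustrative only: set the Solr heap via SOLR_JAVA_MEM (placeholder values).
# Updating the env var triggers a rollout with the new JVM options.
oc set env dc/search-engine SOLR_JAVA_MEM="-Xms512m -Xmx2g"

# Confirm the setting on the deployment.
oc set env dc/search-engine --list | grep SOLR_JAVA_MEM
```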
After updating all the settings, I scanned through the logs to identify queries associated with the high-load conditions that were causing service interruptions and incorporated them into the load tests.
The search-engine pod fared much better over the past 19 hours than it did the previous few days. The pod was restarted a total of 4 times between ~3:45pm and 5pm. This was due to activity after the load testing was completed (~1:30 - 2:30). I'll be digging into the logs for details. There is some fine tuning needed. search-engine restart times (due to OOM issues, via the OOM killer script):
Recovery time (restart time) is typically under a minute.
We need a better way to parse the IP addresses out of the logs in Kibana.
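As a stopgap while the Kibana parsing is sorted out, something along these lines can pull client IPs straight from the Solr request log inside the pod; the log path and format are placeholder assumptions.

```bash
# Illustrative only: top client IPs by request count, run from inside the search-engine pod.
# The log path and format are placeholder assumptions.
grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' /var/solr/logs/solr.log \
  | sort | uniq -c | sort -rn | head -20
```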
Increased the logging level on the search-engine to see if we can capture the events leading up to the OOM issues. The issue could be related to credential indexing as opposed to search queries, or perhaps a combination of the two at times.
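For reference, Solr also lets you change logger levels at runtime through its logging endpoint, without editing log4j2.xml; a sketch, with the host and logger name as placeholder assumptions.

```bash
# Illustrative only: raise a logger's level at runtime (host and logger are placeholders).
curl "http://search-engine:8983/solr/admin/info/logging?set=org.apache.solr.core:DEBUG&wt=json"

# List current loggers and their levels.
curl "http://search-engine:8983/solr/admin/info/logging?wt=json"
```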
search-engine pod metrics since Sep 14, 2023, 9:50 AM:
The pod has restarted 8 times due to OOM errors:
Captured a log containing an OOM-triggered restart.
Summary over the weekend:
Resource settings:
Restarts (19):
Finally found some good articles on troubleshooting issues with Solr configurations. The main takeaway is to test various heap sizes. The articles also cover common problems with queries and caching configurations that can cause memory issues. The Solr docs go into how to collect metrics and visualize the JVM heap. Good resources if we continue to run into issues.
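Along the lines of what the Solr docs describe, the Metrics API can be used to pull the JVM heap numbers for that kind of monitoring and visualization; a sketch, with the host as a placeholder assumption.

```bash
# Illustrative only: JVM heap usage from Solr's Metrics API (host is a placeholder).
curl "http://search-engine:8983/solr/admin/metrics?group=jvm&prefix=memory.heap&wt=json"
```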
I've refactored the load tests to, hopefully, exercise the search-engine more. I updated the search-engine configuration (increased memory and heap) and ran the load tests for a couple of hours without issue. I'll let it run overnight at least to see how it handles the regular traffic loads.
Resource Settings:
The period between 6am and 9am was when I was performing load testing on the updated settings. There was one restart during that time when I was pushing things a bit too hard. Load testing was performed with rate limiting adjusted to allow higher query volumes; otherwise most of the queries were getting blocked. Rate limiting was reapplied following the load testing. Restarts since yesterday:
Resource Settings:
Restarts since yesterday:
Resource Settings:
Restarts since yesterday:
Resource Settings:
Load testing was performed between 8-10am using some queries added to the load test scripts. I was able to force the search-engine to restart once in that period, near the beginning of the run. Note:
Restarts since yesterday:
Resource Settings:
Restarts since last report:
Clearly there is no correlation between query volume and the OOM restarts.
All environments have been updated with the latest configurations.
Disabled the Solr cache on the search-engine; bcgov/aries-vcr#758
Reminder to monitor the active threads (
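One way to keep an eye on the active threads is the same Metrics API, which also exposes JVM thread counts; again a sketch, with the host as a placeholder assumption.

```bash
# Illustrative only: current/peak JVM thread counts (host is a placeholder).
curl "http://search-engine:8983/solr/admin/metrics?group=jvm&prefix=threads&wt=json"
```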
search-engine right before being OOM killed:
The implementation of the blocklist, #138, has allowed us to limit the unbounded queries (bcgov/aries-vcr#762) that were putting pressure on the search-engine resources. Therefore we're going to try out a resource reduction.
Resource usage has been steady and under control for both the search-engine and api pods. Memory use for the api pods has been stable. No pod restarts.
Force-pushed the branch from 2e7d002 to 37cf707.
Search-engine:
- Upgrade to Solr 8.11.2.
- Update load tests with real world heavy load queries.
- Refactor load tests to exercise the search-engine more.
- Ensure each search query requests a random page within a particular page range. This helps to minimize the number of cache hits, and therefore exercise the heap and garbage collection settings of the search-engine.
- Solr uses SOLR_JAVA_MEM to set the JAVA memory options, which overrides JAVA_OPTS.
- Adjust CPU and memory allocation to minimize container throttling and thrashing under load.
- Adjust health checks to reduce loading.

API:
- Adjust API resources to minimize container throttling and OOM issues during heavier load.

Signed-off-by: Wade Barnes <wade@neoterictech.ca>
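Regarding the "random page within a particular page range" item in the commit notes above, a minimal sketch of the idea: each request uses a different start offset so repeated queries don't simply hit the cache. The host, core name, and query are placeholder assumptions.

```bash
# Illustrative only: request a random page so repeated runs don't just hit the Solr caches.
# Host, core name, and query are placeholder assumptions.
START=$(( (RANDOM % 50) * 10 ))   # random page within the first 50 pages, 10 rows per page
curl "http://search-engine:8983/solr/credential_registry/select?q=*:*&start=${START}&rows=10"
```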
Commit history has been cleaned up, and the configurations have been deployed and tested in