Update search-engine resource allocations #137
Conversation
I've been testing this configuration under load since 6:30 this morning. A few instances of OOM errors have popped up, typically during periods of high CPU usage. The CPU usage bumps into the pod's limit and causes the pod to be taken offline, in turn causing a service interruption.
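For reference, this is the kind of knob being tuned here: a minimal sketch of adjusting the container's CPU/memory requests and limits with the OpenShift CLI. The resource name and values below are placeholder assumptions, not the settings actually used in this PR.

```bash
# Illustrative only: adjust the search-engine container's requests/limits.
# "dc/search-engine" and the values are placeholder assumptions.
oc set resources dc/search-engine \
  --requests=cpu=500m,memory=1Gi \
  --limits=cpu=2,memory=4Gi

# Check what the container is currently allowed to use.
oc get dc/search-engine -o jsonpath='{.spec.template.spec.containers[0].resources}'
```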
Start parameters for Solr (to look into the other options):
Stopped the load test and will let the current pod run overnight, then review the metrics and logs in the morning.
Also need to review the API logs. It seems like some sync/scanning is going on at times throughout the day.
Compare the previous image to the current image. We did not have this many issues with the previous image. On the surface they look the same (same Solr version), but the config may be slightly different.
Rolled back to the previous search-engine image.
Image comparison:
There are no changes to any of the Solr config between the Aries-VCR commits, and both images contain the Log4j mitigations, indicating the base images were built off the same version of the openshift-solr source. This means the two images should behave the same.
Upgraded the base image to Solr 8.11.2 (built off the solr-8.11.2 branch for testing); bcgov/openshift-solr#15
Command line for the Solr 8.11.2 container:
The new version still encounters OOM issues, but it now includes a script that kills the process right away when an OOM error occurs, so the pod gets restarted whenever it hits an OOM error. Same problem, just slightly different behavior.
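For context, this is roughly the mechanism Solr's start script uses: a JVM flag that invokes a small handler script to kill the process when an OutOfMemoryError is thrown, so the platform restarts the container. A simplified sketch; the exact flags, paths, and arguments in this image may differ.

```bash
# Simplified sketch of Solr's OOM handling (exact flags/paths may differ per image).
# When the JVM hits an OutOfMemoryError it runs the handler, which kills Solr so the
# container exits and OpenShift restarts the pod.
SOLR_OPTS="$SOLR_OPTS -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh"

# oom_solr.sh, roughly: log the event, then hard-kill the Solr JVM.
#   echo "$(date) OutOfMemoryError - killing Solr" >> "$SOLR_LOGS_DIR/solr_oom_killer.log"
#   kill -9 "$SOLR_PID"
```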
Manually adjusted memory allocation and settings for testing:
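As the commit notes further down point out, Solr reads its JVM memory options from SOLR_JAVA_MEM, which overrides JAVA_OPTS. A minimal sketch of that kind of adjustment; the deployment name and heap values are placeholder assumptions, not the values actually tested.

```bash
# Illustrative only: set the Solr heap via SOLR_JAVA_MEM (placeholder values).
# Updating the env var triggers a rollout with the new JVM options.
oc set env dc/search-engine SOLR_JAVA_MEM="-Xms512m -Xmx2g"

# Confirm the setting on the deployment.
oc set env dc/search-engine --list | grep SOLR_JAVA_MEM
```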
After updating all the settings, I scanned through the logs to identify queries associated with the high-load conditions that were causing service interruptions and incorporated them into the load tests.
The search-engine pod fared much better over the past 19 hours than it did the previous few days. The pod was restarted a total of 4 times between ~3:45pm and 5pm. This was due to activity after the load testing was completed (~1:30 - 2:30). I'll be digging into the logs for details. There is some fine tuning needed. search-engine restart times (due to OOM issues, via the OOM killer script):
Recovery time (restart time) is typically under a minute.
We need a better way to parse the IP addresses out of the logs in Kibana.
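As a stopgap while the Kibana parsing is sorted out, something along these lines can pull client IPs straight from the Solr request log inside the pod; the log path and format are placeholder assumptions.

```bash
# Illustrative only: top client IPs by request count, run from inside the search-engine pod.
# The log path and format are placeholder assumptions.
grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' /var/solr/logs/solr.log \
  | sort | uniq -c | sort -rn | head -20
```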
Increased the logging level on the search-engine to see if we can capture the events leading up to the OOM issues. The issue could be related to credential indexing as opposed to search queries, or perhaps a combination of the two at times.
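For reference, Solr also lets you change logger levels at runtime through its logging endpoint, without editing log4j2.xml; a sketch, with the host and logger name as placeholder assumptions.

```bash
# Illustrative only: raise a logger's level at runtime (host and logger are placeholders).
curl "http://search-engine:8983/solr/admin/info/logging?set=org.apache.solr.core:DEBUG&wt=json"

# List current loggers and their levels.
curl "http://search-engine:8983/solr/admin/info/logging?wt=json"
```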
search-engine pod metrics since Sep 14, 2023, 9:50 AM:
The pod has restarted 8 times due to OOM errors:
Captured a log containing an OOM-triggered restart.
Summary over the weekend:
Resource settings:
Restarts (19):
Finally found some good articles on troubleshooting issues with Solr configurations. The main takeaway is to test various heap sizes. The articles also cover common problems with queries and caching configurations that can cause memory issues. The Solr docs go into how to collect metrics and visualize the JVM heap. Good resources if we continue to run into issues.
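Along the lines of what the Solr docs describe, the Metrics API can be used to pull the JVM heap numbers for that kind of monitoring and visualization; a sketch, with the host as a placeholder assumption.

```bash
# Illustrative only: JVM heap usage from Solr's Metrics API (host is a placeholder).
curl "http://search-engine:8983/solr/admin/metrics?group=jvm&prefix=memory.heap&wt=json"
```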
I've refactored the load tests to, hopefully, exercise the search-engine more. I updated the search-engine configuration (increased memory and heap) and ran the load tests for a couple of hours without issue. I'll let it run overnight at least to see how it handles the regular traffic loads.
Resource Settings:
The period between 6am and 9am was when I was performing load testing on the updated settings. There was one restart during that time when I was pushing things a bit too hard. Load testing was performed with rate limiting adjusted to allow higher query volumes; otherwise most of the queries were getting blocked. Rate limiting was reapplied following the load testing. Restarts since yesterday:
Resource Settings:
Restarts since yesterday:
Resource Settings:
Restarts since yesterday:
Resource Settings:
Load testing was performed between 8-10am using some queries added to the load test scripts. I was able to force the search-engine to restart once in that period, near the beginning of the run. Note:
Restarts since yesterday:
Resource Settings:
Restarts since last report:
Clearly there is no correlation between query volume and the OOM restarts.
All environments have been updated with the latest configurations.
Disabled the Solr cache on the search-engine; bcgov/aries-vcr#758
Reminder to monitor the active threads (
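One way to keep an eye on the active threads is the same Metrics API, which also exposes JVM thread counts; again a sketch, with the host as a placeholder assumption.

```bash
# Illustrative only: current/peak JVM thread counts (host is a placeholder).
curl "http://search-engine:8983/solr/admin/metrics?group=jvm&prefix=threads&wt=json"
```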
search-engine right before being OOM killed:
The implementation of the blocklist, #138, has allowed us to limit the unbounded queries (bcgov/aries-vcr#762) that were putting pressure on the search-engine resources. Therefore we're going to try out a resource reduction.
Resource usage has been steady and under control for both the search-engine and api pods. Memory use for the api pods has been stable. No pod restarts.
Force-pushed the branch from 2e7d002 to 37cf707.
Search-engine:
- Upgrade to Solr 8.11.2.
- Update load tests with real world heavy load queries.
- Refactor load tests to exercise the search-engine more.
- Ensure each search query requests a random page within a particular page range. This helps to minimize the number of cache hits, and therefore exercise the heap and garbage collection settings of the search-engine.
- Solr uses SOLR_JAVA_MEM to set the JAVA memory options, which overrides JAVA_OPTS.
- Adjust CPU and memory allocation to minimize container throttling and thrashing under load.
- Adjust health checks to reduce loading.

API:
- Adjust API resources to minimize container throttling and OOM issues during heavier load.

Signed-off-by: Wade Barnes <wade@neoterictech.ca>
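Regarding the "random page within a particular page range" item in the commit notes above, a minimal sketch of the idea: each request uses a different start offset so repeated queries don't simply hit the cache. The host, core name, and query are placeholder assumptions.

```bash
# Illustrative only: request a random page so repeated runs don't just hit the Solr caches.
# Host, core name, and query are placeholder assumptions.
START=$(( (RANDOM % 50) * 10 ))   # random page within the first 50 pages, 10 rows per page
curl "http://search-engine:8983/solr/credential_registry/select?q=*:*&start=${START}&rows=10"
```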
Commit history has been cleaned up, and the configurations have been deployed and tested in