[Bug] Continued performance issues after upgrade to 4.1.1 #1896

jhk70 · 2021-03-25T21:26:39Z

Continued performance issues after upgrade to 4.1.1

Request Type

Bug

Work Environment

Question	Answer
OS version (server)	Ubuntu
OS version (client)	18.04
TheHive version / git hash	4.1.1 (docker image 4.1.1-2)
Package Type	Docker
Browser type & version	Various

Problem Description

After upgrading from 4.0.5-1 to 4.1.0 and then 4.1.1:

Audit entries don't show in the application's "live stream" view.
I get the familiar "AuditSrv" error after a while.
The "Data Index Status" section of the "Platform Status" page does not load (i.e. the user session times out before it loads).
This behaviour was consistent across 4.1.0 and 4.1.1.
The Audit table has 1,265,475 entries.

Steps to Reproduce

Upgrade TheHive as described here.
Configure a local Lucene index.
Start the server.
Use the server.

Complementary information

Other observations / debug actions:

During initial indexing, there were a number of "org.janusgraph.diskstorage.TemporaryBackendException: Temporary failure in storage backend" errors. Removing the MAX_HEAP_SIZE and HEAP_NEWSIZE settings on Cassandra eliminated these errors.
During the initial period after the upgrade, there was evidence of memory exhaustion. More RAM was added to the host, and TheHive was given 16 GB of heap via -e JAVA_OPTS='-Xms16g -Xmx16g'.
Without the "Platform Status" page, I have been able to reindex with curl:
curl -k "https://<host>:9000/api/v1/admin/index/Case/reindex" -H 'Authorization: Bearer *authwibble*'
I have re-run this for each index, and the logs show that each run completes successfully.
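For anyone repeating this across several indexes, the curl call above can be sketched in Python as follows. This is a minimal sketch, not an official client: the endpoint path and the bearer-token header are taken from the curl command, while the function names, the host, and the list of index names are hypothetical placeholders (only Case and Audit are named in this report).

```python
import urllib.request
from urllib.parse import quote


def reindex_url(base_url: str, index_name: str) -> str:
    """Build the reindex URL for one index, mirroring the path in the curl call."""
    return f"{base_url.rstrip('/')}/api/v1/admin/index/{quote(index_name, safe='')}/reindex"


def build_request(base_url: str, index_name: str, token: str) -> urllib.request.Request:
    """Prepare a GET request (curl's default method) carrying the bearer token."""
    return urllib.request.Request(
        reindex_url(base_url, index_name),
        headers={"Authorization": f"Bearer {token}"},
    )


def reindex_all(base_url, token, index_names, context=None):
    """Trigger a reindex for each name in turn and return the HTTP status per index.

    Pass an ssl.SSLContext as `context` to control certificate checks
    (the curl call above used -k to skip verification).
    """
    statuses = {}
    for name in index_names:
        request = build_request(base_url, name, token)
        with urllib.request.urlopen(request, context=context) as response:
            statuses[name] = response.status
    return statuses


# Hypothetical usage (host, token and index list are placeholders):
# reindex_all("https://hivehost01:9000", "my-api-key", ["Case", "Audit"])
```

Running the requests one after another, rather than in parallel, keeps the load on the storage backend comparable to the manual curl runs.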

Snippets from the Audit reindex logs:

Mar 25 21:39:52 hivehost01 docker[26287]: [info] o.t.s.m.Database [00000020|] Reindex job is running: 1265475 record(s) indexed
Mar 25 21:39:53 hivehost01 docker[26287]: [info] o.j.g.d.m.ManagementSystem [|] Index update job successful for [AuditRequestidMainaction]
Mar 25 21:39:53 hivehost01 docker[26287]: [info] o.t.s.m.Database [00000020|] Reindex job is finished

Mar 25 21:47:59 hivehost01 docker[26287]: [info] o.t.s.m.Database [00000020|] Reindex job is running: 0 record(s) indexed
Mar 25 21:48:00 hivehost01 docker[26287]: [info] o.t.s.m.Database [00000020|] Reindex job is running: 0 record(s) indexed
Mar 25 21:48:01 hivehost01 docker[26287]: [info] o.j.g.o.j.IndexRepairJob [|] Found index Audit
Mar 25 21:48:01 hivehost01 docker[26287]: [info] o.t.s.m.Database [00000020|] Reindex job is running: 0 record(s) indexed
Mar 25 21:48:02 hivehost01 docker[26287]: [info] o.j.g.d.m.ManagementSystem [|] Index update job successful for [Audit]
Mar 25 21:48:02 hivehost01 docker[26287]: [info] o.t.s.m.Database [00000020|] Reindex job is finished
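To compare runs more easily, the record counts can be pulled out of log lines like the ones above. The following is a rough sketch that assumes the two message formats shown in these snippets ("Reindex job is running: N record(s) indexed" and "Index update job successful for [Name]"); the function name and the pairing logic are my own, and it simply attaches the last count seen to the next success message.

```python
import re

RUNNING = re.compile(r"Reindex job is running: (\d+) record\(s\) indexed")
SUCCESS = re.compile(r"Index update job successful for \[([^\]]+)\]")


def summarize_reindex(lines):
    """Return (index_name, last_record_count) pairs in the order jobs succeeded."""
    results = []
    last_count = None
    for line in lines:
        running = RUNNING.search(line)
        if running:
            # Remember the most recent progress count reported by the job.
            last_count = int(running.group(1))
            continue
        success = SUCCESS.search(line)
        if success:
            results.append((success.group(1), last_count))
            last_count = None  # Start fresh for the next job.
    return results


# Lines shortened from the log snippets above.
SAMPLE = [
    "[info] o.t.s.m.Database [00000020|] Reindex job is running: 1265475 record(s) indexed",
    "[info] o.j.g.d.m.ManagementSystem [|] Index update job successful for [AuditRequestidMainaction]",
    "[info] o.t.s.m.Database [00000020|] Reindex job is finished",
    "[info] o.t.s.m.Database [00000020|] Reindex job is running: 0 record(s) indexed",
    "[info] o.j.g.o.j.IndexRepairJob [|] Found index Audit",
    "[info] o.t.s.m.Database [00000020|] Reindex job is running: 0 record(s) indexed",
    "[info] o.j.g.d.m.ManagementSystem [|] Index update job successful for [Audit]",
]

print(summarize_reindex(SAMPLE))
# [('AuditRequestidMainaction', 1265475), ('Audit', 0)]
```

On these snippets, the last count logged before the [Audit] success message is 0, while the [AuditRequestidMainaction] job is preceded by a count of 1265475. I don't know whether a zero count is expected for that index.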

Our implementation had been "misusing" tags (per the 4.1.0 release blog) and had some long tags containing links to raw alerts etc. This was evidenced by a 6-second load time on /api/v1/query?name=list-tags. I have deleted these tags from the "Custom Tags" view. Is it possible that something in the Audit content is causing this? Is it possible to truncate or compact the Audit table?
Probably unrelated, but I see this when the server starts:
Mar 25 21:27:40 hivehost01 docker[26287]: [warn] c.d.d.c.RequestHandler [|] Query '[4 bound values] SELECT column1,value,writetime(value) AS writetime,ttl(value) AS ttl FROM thehive.graphindex WHERE key=:key AND column1>=:sliceStart AND column1<:sliceEnd LIMIT :maxRows;' generated server side warning(s): Read 947 live rows and 5788 tombstone cells for query SELECT * FROM thehive.graphindex WHERE key = 022689a05461e7 AND column1 >= 00 AND column1 < ff LIMIT 5000; token -8419547459570797906 (see tombstone_warn_threshold)
I have deleted and reconfigured the index multiple times. After a restart (and before indexing), the "Platform Status" page loads (all indexes = "ERROR"). After I click "Reindex" on Audit, the indexing completes and the same performance issue is present. After that, I can no longer refresh or view the Index Status section of the Platform Status page.
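As a side note on the Cassandra tombstone warning quoted above, the live-row and tombstone counts can be extracted from such warnings to see how they change over time. This is only a sketch that assumes the "Read N live rows and M tombstone cells" wording shown in that log line; the function name is hypothetical.

```python
import re

TOMBSTONE = re.compile(r"Read (\d+) live rows and (\d+) tombstone cells")


def tombstone_counts(line):
    """Return (live_rows, tombstone_cells) from a warning line, or None if absent."""
    match = TOMBSTONE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Fragment of the warning quoted above.
warning = "Read 947 live rows and 5788 tombstone cells for query SELECT * FROM thehive.graphindex"
live, tombstones = tombstone_counts(warning)
print(f"tombstone share of cells read: {tombstones / (live + tombstones):.1%}")
```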

nadouani · 2021-03-26T12:42:39Z

@To-om I assigned this issue to 4.1.2, but it needs investigation. Feel free to move it out of this milestone if it requires more investigation.

jhk70 · 2021-03-26T21:33:03Z

The problem is that this issue prevents a production upgrade to 4.1.1 (I didn't mention that this is a UAT instance) and leaves us stranded on 3.x. The UI is just too slow and if I have 2 or 3 analysts logged in, the CPU on the host becomes saturated.

jhk70 added the TheHive4 and bug labels Mar 25, 2021

nadouani added the scope:performance label Mar 26, 2021

nadouani assigned To-om Mar 26, 2021

nadouani added the need:investigation label Mar 26, 2021

nadouani added this to the 4.1.2 milestone Mar 26, 2021

To-om added a commit that referenced this issue Mar 26, 2021

#1896 Limit number of flow element by age

66cbe14

To-om removed the need:investigation label Mar 26, 2021

To-om closed this as completed Mar 26, 2021
