QA testing - Behaviour of cluster restart procedure in the UI #3178

gdiazlo · 2022-08-16T14:58:09Z

Target version	Related issue	Related PR
4.3.7	wazuh/wazuh-dashboard-plugins#4277	wazuh/wazuh-dashboard-plugins#4365

Description

We need to ensure these situations work as expected, and check if we should test more situations we haven't identified yet.

Proposed checks

Restart correctly
- Synchronizes cluster and restarts correctly
- Restart node successfully
Failed to restart
- Cluster failed to restart
- Restart Node - Node disconnected during restart
- Failed to restart - unexpected API response
- Failed to synchronize after a successful restart
- Synchronize failed after an error on restart
- Synchronize successful and error on restart
- Error restarting node

Expected results

All the cases inform the user about what happened and what to do.

Configuration and considerations

It would be interesting to test timeouts between the API and the UI in the middle of a restart cycle.
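One way to exercise the timeout scenario is to poll the synchronization endpoint with an overall deadline. The following is a minimal Python sketch, not the dashboard's actual implementation: the name `wait_until_synced`, its parameters, and the default timings are assumptions made for illustration. It treats any exception or non-zero `error` field as "not ready yet" and keeps polling until the deadline.

```python
import time
from typing import Callable


def wait_until_synced(
    fetch_status: Callable[[], dict],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every node reports synced; return False once timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        try:
            # fetch_status is a caller-supplied (hypothetical) function that
            # returns the decoded JSON body of the synchronization endpoint.
            body = fetch_status()
        except Exception:
            body = None  # network failure or HTTP timeout: retry later
        if body is not None and body.get("error") == 0:
            items = body.get("data", {}).get("affected_items", [])
            if items and all(item.get("synced") for item in items):
                return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting; a caller would typically pass a function that performs the HTTP request and decodes the JSON body.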

Deblintrake09 · 2022-08-16T21:35:58Z

Review data

Tester	PR commit
@Deblintrake09	Tag v4.3.7-1

Testing environment

OS	OS version	Deployment	Image/AMI	Notes
CentOS	8	Deployer - EC2	ami-0e65b7ce2dab78ec9	Box Size c5.xlarge

Tested packages

Installed Indexer, Manager and Dashboard using the wazuh-install.sh script

Conclusion 🟡

Unexpected behavior was found during testing: some API responses are not handled and some user-facing messages are missing. In addition, when trying to manually cause a sync error, the application timed out and was unable to reconnect to the API.

Status

In progress
Pending Review
Team leader approved
Manager approved

Deblintrake09 · 2022-08-16T21:45:54Z

Task Results

Restart correctly

Synchronizes cluster and restarts correctly 🟢

Check that cluster is running and synchronized

# curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
{
   "data": {
      "affected_items": [
         {
            "name": "wazuh-1",
            "synced": true
         },
         {
            "name": "wazuh-2",
            "synced": true
         },
         {
            "name": "wazuh-3",
            "synced": true
         }
      ],
      "total_affected_items": 3,
      "total_failed_items": 0,
      "failed_items": []
   },
   "message": "Nodes ruleset synchronization status was successfully read",
   "error": 0
}

Restart Cluster

restart-cluster.mp4

Check that cluster restarted correctly and is synchronized

# curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
{
   "data": {
      "affected_items": [
         {
            "name": "wazuh-1",
            "synced": true
         },
         {
            "name": "wazuh-2",
            "synced": true
         },
         {
            "name": "wazuh-3",
            "synced": true
         }
      ],
      "total_affected_items": 3,
      "total_failed_items": 0,
      "failed_items": []
   },
   "message": "Nodes ruleset synchronization status was successfully read",
   "error": 0
}
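The before/after comparison in this check can also be done programmatically rather than by eye. Below is a small Python sketch that assumes the response has the same shape as the output shown above; the helper name `unsynced_nodes` is hypothetical.

```python
from typing import List


def unsynced_nodes(response: dict) -> List[str]:
    """Return the names of nodes whose "synced" flag is not true."""
    items = response.get("data", {}).get("affected_items", [])
    return [item["name"] for item in items if not item.get("synced", False)]
```

An empty list corresponds to the fully synchronized state shown in the responses above.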

Restart node successfully 🟢

Check that cluster is running and synchronized

# curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
{
   "data": {
      "affected_items": [
         {
            "name": "wazuh-1",
            "synced": true
         },
         {
            "name": "wazuh-2",
            "synced": true
         },
         {
            "name": "wazuh-3",
            "synced": true
         }
      ],
      "total_affected_items": 3,
      "total_failed_items": 0,
      "failed_items": []
   },
   "message": "Nodes ruleset synchronization status was successfully read",
   "error": 0
}

Restart a worker node wazuh-2

Restart.node.mp4

Failed to restart

Cluster failed to restart 🟢

Check that cluster is running and synchronized

# curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
  {
     "data": {
        "affected_items": [
           {
              "name": "wazuh-1",
              "synced": true
           },
           {
              "name": "wazuh-2",
              "synced": true
           },
           {
              "name": "wazuh-3",
              "synced": true
           }
        ],
        "total_affected_items": 3,
        "total_failed_items": 0,
        "failed_items": []
     },
     "message": "Nodes ruleset synchronization status was successfully read",
     "error": 0
  }

Restart Cluster

restart_cluster_failed.mp4

Restart Node - Node disconnected during restart 🟡

Check that cluster is running and synchronized

{
   "data": {
      "affected_items": [
         {
            "name": "wazuh-1",
            "synced": true
         },
         {
            "name": "wazuh-2",
            "synced": true
         },
         {
            "name": "wazuh-3",
            "synced": true
         }
      ],
      "total_affected_items": 3,
      "total_failed_items": 0,
      "failed_items": []
   }

Restart Worker node and stop the VM while it is restarting

restart_node_disconnects.mp4
Note: The node's VM was rebooted during the restart, and the node was removed from the cluster. The UI skips that node's restart and assumes everything is normal, but it does not tell the user that the node was removed, even though the user was actively monitoring the restart. The user may not notice this and may assume all nodes are working correctly. A message should inform the user when this happens.
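A possible way to detect this situation is to compare the node names reported before and after the restart. The Python sketch below assumes that a node removed from the cluster no longer appears in `affected_items`; the report does not confirm this, and the helper names `node_names` and `removed_nodes` are hypothetical.

```python
from typing import List, Set


def node_names(response: dict) -> Set[str]:
    """Collect node names from a synchronization-status response body."""
    items = response.get("data", {}).get("affected_items", [])
    return {item["name"] for item in items}


def removed_nodes(before: dict, after: dict) -> List[str]:
    """Return nodes present before the restart but absent afterwards."""
    return sorted(node_names(before) - node_names(after))
```

A non-empty result could be used to trigger the missing user-facing message.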

Failed to restart - unexpected API response 🟡

Restart a node from outside the web app right before clicking restart (this triggers a different API response).

Check that cluster is running and synchronized

{"title": "Bad Request", "detail": "Some Wazuh daemons are not ready yet in node \"wazuh-3\" (wazuh-modulesd->restarting, wazuh-analysisd->restarting, wazuh-execd->restarting, wazuh-db->restarting, wazuh-remoted->restarting)", "dapi_errors": {"wazuh-3": {"error": "Some Wazuh daemons are not ready yet in node \"wazuh-3\" (wazuh-modulesd->restarting, wazuh-analysisd->restarting, wazuh-execd->restarting, wazuh-db->restarting, wazuh-remoted->restarting)"}}, "error": 1017}

Restart Master node

restart_error_caused_by_sync_error.mp4
Note: Since this response is neither expected nor handled, the app cannot restart Wazuh and goes straight to the healthcheck. This response should be handled.
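To illustrate what handling this response could look like, here is a Python sketch that classifies a decoded response body by its `error` field. The error code 1017 and the `dapi_errors` field come from the response above; the function name, the status labels, and the idea of classifying responses this way are assumptions, not the plugin's actual logic.

```python
from typing import List, Tuple

DAEMONS_NOT_READY = 1017  # error code taken from the response above


def classify_api_response(body: dict) -> Tuple[str, List[str]]:
    """Return (status, affected_nodes) for a decoded API response body."""
    code = body.get("error")
    if code == 0:
        return ("ok", [])
    # Node names are the keys of "dapi_errors" when that field is present.
    nodes = sorted(body.get("dapi_errors", {}).keys())
    if code == DAEMONS_NOT_READY:
        return ("daemons_not_ready", nodes)
    return ("unexpected_error", nodes)
```

With a classification like this, the "daemons_not_ready" case could be shown to the user as "node still restarting, try again shortly" instead of falling through to the healthcheck.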

Error restarting node 🟢

Check that cluster is running and synchronized

# curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
{
   "data": {
      "affected_items": [
         {
            "name": "wazuh-1",
            "synced": true
         },
         {
            "name": "wazuh-2",
            "synced": true
         },
         {
            "name": "wazuh-3",
            "synced": true
         }
      ],
      "total_affected_items": 3,
      "total_failed_items": 0,
      "failed_items": []
   },
   "message": "Nodes ruleset synchronization status was successfully read",
   "error": 0
}

Restart Cluster

restart_node_failed.mp4

Error restarting cluster with intensive file sync - 1000 copies 🔴

Check that cluster is running and synchronized

# curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
{
   "data": {
      "affected_items": [
         {
            "name": "wazuh-1",
            "synced": true
         },
         {
            "name": "wazuh-2",
            "synced": true
         },
         {
            "name": "wazuh-3",
            "synced": true
         }
      ],
      "total_affected_items": 3,
      "total_failed_items": 0,
      "failed_items": []
   },
   "message": "Nodes ruleset synchronization status was successfully read",
   "error": 0
}

Execute copy script that copies 1000 rule files inside the master node.
Restart Cluster
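The copy script used in this test is not included in the report. As a purely hypothetical stand-in, the Python sketch below duplicates one file a given number of times into a target directory; the function name, the file naming scheme, and all paths are assumptions. Note that verbatim copies of a real rule file would contain duplicate rule IDs, so the script actually used may have differed.

```python
import shutil
from pathlib import Path
from typing import List


def make_rule_copies(source: Path, dest_dir: Path, copies: int) -> List[Path]:
    """Copy `source` into `dest_dir` `copies` times and return the new paths."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for i in range(1, copies + 1):
        # Hypothetical naming scheme: <stem>_copy_<n><suffix>
        target = dest_dir / f"{source.stem}_copy_{i}{source.suffix}"
        shutil.copyfile(source, target)
        created.append(target)
    return created
```

For example, `make_rule_copies(Path("local_rules.xml"), Path("/tmp/rules_test"), 1000)` would create 1000 files in the (assumed) target directory.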

failed_restart_during_sync.mp4

Check the API

# curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
{"title": "Wazuh Internal Error", "detail": "Timeout executing API request", "dapi_errors": {"wazuh-1": {"error": "Timeout executing API request", "logfile": "WAZUH_HOME/logs/api.log"}}, "error": 3021}

# curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
{
   "data": {
      "affected_items": [
         {
            "name": "wazuh-1",
            "synced": true
         },
         {
            "name": "wazuh-2",
            "synced": false
         },
         {
            "name": "wazuh-3",
            "synced": false
         }
      ],
      "total_affected_items": 3,
      "total_failed_items": 0,
      "failed_items": []
   },
   "message": "Nodes ruleset synchronization status was successfully read",
   "error": 0
}

Refresh the healthcheck
After a 5-minute wait, the cluster is still not in sync. Recording was started again. The healthcheck does not refresh or allow access to the API, and it fails.

sync_video_2.mp4
Wait another 10 minutes and continue waiting for sync

Sync3.mp4

The Wazuh indexer crashed after prolonged timeouts

● wazuh-indexer.service - Wazuh-indexer
   Loaded: loaded (/usr/lib/systemd/system/wazuh-indexer.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since Wed 2022-08-17 18:43:37 UTC; 5min ago
     Docs: https://documentation.wazuh.com
  Process: 11172 ExecStart=/usr/share/wazuh-indexer/bin/systemd-entrypoint -p ${PID_DIR}/wazuh-indexer.pid --quiet (c>
 Main PID: 11172 (code=killed, signal=ABRT)

ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: _java_thread_list=0x00007fd7680198b0, length=>
ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: 0x00007fd81095f0a0, 0x00007fd8109607e0, 0x000>
ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: 0x00007fd8109693e0, 0x00007fd81096ac10, 0x000>
ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: 0x00007fd8109d1790,
ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: 0x00007fd8109d5330, 0x00007fd8124bc580, 0x000>
ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: 0x00007fd812c6d400, 0x00007fd812c73040, 0x000>
ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: 0x00007fd76800dde0, 0x00007fd813e9da60, 0x000>
ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: 0x00007fd79c005230# [ timer expired, abort...>
ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd[1]: wazuh-indexer.service: Main process exited, code=killed, sta>
ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd[1]: wazuh-indexer.service: Failed with result 'signal'.

Error restarting cluster with intensive file sync - 50 copies 🔴

Check that cluster is running and synchronized

# curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
{
   "data": {
      "affected_items": [
         {
            "name": "wazuh-1",
            "synced": true
         },
         {
            "name": "wazuh-2",
            "synced": true
         },
         {
            "name": "wazuh-3",
            "synced": true
         }
      ],
      "total_affected_items": 3,
      "total_failed_items": 0,
      "failed_items": []
   },
   "message": "Nodes ruleset synchronization status was successfully read",
   "error": 0
}

Execute copy script that copies 50 rule files inside the master node.
Restart Cluster

sync_50_copies.mp4
The OS became unresponsive and commands could not be run. API calls from the console also hung.

jmv74211 · 2022-08-22T08:18:36Z

After testing the changes proposed in this issue and reviewing all the results, we have decided that the new development does not meet expectations. It will need a new iteration to improve the following aspects:

Regarding the restart/synchronization waiting modal:

The loading bar does not progress uniformly and does not tell the user approximately how long the wait will be. We propose that the bar progress continuously or, if the duration of the task is unknown, that a circular loading indicator replace the horizontal bar to show only that work is in progress.
It has no elements that show status information about what is being done. For example, "Synchronizing Wazuh": synchronizing what?

Regarding performance:

When synchronizing several GB of data (see the tests performed), the app becomes unmanageable: API timeouts occur throughout, and the wazuh-indexer even crashes.

jmv74211 · 2022-08-22T16:09:55Z

Following these reports, we have rolled back the interface changes tested in this issue. A better design will be developed and included in a future release.

In the meantime, the next testing issue #3178 has been opened to test the behavior after the rollback.

gdiazlo added team/qa dev-testing labels Aug 16, 2022

jmv74211 added the target/4.3.7 label Aug 16, 2022

jmv74211 added this to the Release 4.3.7 RC-1 milestone Aug 16, 2022

jmv74211 removed the status/not-tracked label Aug 16, 2022

jmv74211 added this to Release 4.3.7 Aug 16, 2022

jmv74211 moved this to Todo in Release 4.3.7 Aug 16, 2022

damarisg assigned Deblintrake09 Aug 16, 2022

damarisg added the subteam/qa-storm label Aug 16, 2022

Deblintrake09 moved this from Todo to In Progress in Release 4.3.7 Aug 16, 2022

jmv74211 added the release test/4.3.7 label Aug 17, 2022

vikman90 removed this from Release 4.3.7 Aug 19, 2022

mauceballos mentioned this issue Aug 19, 2022

Node down during node restarting isn't informed wazuh/wazuh-dashboard-plugins#4404

Closed

jmv74211 closed this as completed Aug 22, 2022
