QA testing - Behaviour of cluster restart procedure in the UI #3178

Closed
6 of 9 tasks
gdiazlo opened this issue Aug 16, 2022 · 4 comments

@gdiazlo
Member

gdiazlo commented Aug 16, 2022

| Target version | Related issue | Related PR |
|---|---|---|
| 4.3.7 | wazuh/wazuh-dashboard-plugins#4277 | wazuh/wazuh-dashboard-plugins#4365 |

Description

We need to ensure these situations work as expected, and check if we should test more situations we haven't identified yet.

Proposed checks

  • Restart correctly

    • Synchronizes cluster and restarts correctly
    • Restart node successfully
  • Failed to restart

    • Cluster failed to restart
    • Restart Node - Node disconnected during restart
    • Failed to restart - unexpected API response
    • Failed to synchronize after a successful restart
    • Synchronize failed after an error on restart
    • Synchronize successful and error on restart
    • Error restarting node

Expected results

All the cases inform the user about what happened and what to do.

Configuration and considerations

It would be interesting to test timeouts between the API and the UI in the middle of a restart cycle.
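One way to exercise such timeouts from a test harness is to poll the API with an explicit deadline. The following is a minimal sketch in Python, not the dashboard's implementation; `wait_for_sync`, its parameters, and the default 300-second timeout are hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable


def wait_for_sync(
    check: Callable[[], bool],
    timeout_s: float = 300.0,   # hypothetical default, not taken from the app
    interval_s: float = 5.0,    # hypothetical polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or `timeout_s` seconds elapse.

    Returns True if the check succeeded, False if the deadline passed.
    Connection errors and timeouts raised by `check` count as "not ready yet".
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if check():
                return True
        except OSError:
            # ConnectionError and TimeoutError are OSError subclasses;
            # the API may be unreachable in the middle of a restart.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `check` callable could wrap a request to the synchronization endpoint. Injecting `clock` and `sleep` lets the timeout path be tested without actually waiting.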

@Deblintrake09
Contributor

Deblintrake09 commented Aug 16, 2022

Review data

| Tester | PR commit |
|---|---|
| @Deblintrake09 | Tag v4.3.7-1 |

Testing environment

| OS | OS version | Deployment | Image/AMI | Notes |
|---|---|---|---|---|
| CentOS | 8 | <REMOTE \| Deployer - EC2 | ami-0e65b7ce2dab78ec9 | Box Size c5.xlarge |

Tested packages

Conclusion 🟡

Unexpected behavior was found during testing: unhandled responses and missing messages for the user. Also, when trying to manually cause a sync error, the application timed out and was unable to reconnect to the API.

Status

  • In progress
  • Pending Review
  • Team leader approved
  • Manager approved

@Deblintrake09
Contributor

Deblintrake09 commented Aug 16, 2022

Task Results

Restart correctly

Synchronizes cluster and restarts correctly 🟢
  • Check that cluster is running and synchronized

    # curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
    {
       "data": {
          "affected_items": [
             {
                "name": "wazuh-1",
                "synced": true
             },
             {
                "name": "wazuh-2",
                "synced": true
             },
             {
                "name": "wazuh-3",
                "synced": true
             }
          ],
          "total_affected_items": 3,
          "total_failed_items": 0,
          "failed_items": []
       },
       "message": "Nodes ruleset synchronization status was successfully read",
       "error": 0
    }
    
  • Restart Cluster

    restart-cluster.mp4
  • Check that cluster restarted correctly and is synchronized

    # curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
    {
       "data": {
          "affected_items": [
             {
                "name": "wazuh-1",
                "synced": true
             },
             {
                "name": "wazuh-2",
                "synced": true
             },
             {
                "name": "wazuh-3",
                "synced": true
             }
          ],
          "total_affected_items": 3,
          "total_failed_items": 0,
          "failed_items": []
       },
       "message": "Nodes ruleset synchronization status was successfully read",
       "error": 0
    }
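The manual comparison of the two responses above can be automated. A minimal sketch, assuming the response shape shown in the output above (`data.affected_items` entries with `name` and `synced`); the helper name `unsynced_nodes` and the sample payload are hypothetical and for illustration only.

```python
import json
from typing import List


def unsynced_nodes(response_text: str) -> List[str]:
    """Return the names of nodes whose "synced" flag is not true."""
    body = json.loads(response_text)
    if "data" not in body:
        # Error responses carry no "data" key; do not mistake them for
        # "all nodes synced".
        raise ValueError("response has no 'data' section")
    items = body["data"].get("affected_items", [])
    # A node with a missing "synced" key is treated as not synced.
    return [item["name"] for item in items if item.get("synced") is not True]


# Illustrative payload in the same shape as the API output, not real test data.
sample = """
{
   "data": {
      "affected_items": [
         {"name": "wazuh-1", "synced": true},
         {"name": "wazuh-2", "synced": false},
         {"name": "wazuh-3", "synced": true}
      ],
      "total_affected_items": 3,
      "total_failed_items": 0,
      "failed_items": []
   },
   "error": 0
}
"""
print(unsynced_nodes(sample))  # prints ['wazuh-2']
```

An empty list means every listed node reports `"synced": true`, which is the condition checked by eye before and after the restart.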
    
Restart node successfully 🟢
  • Check that cluster is running and synchronized

    # curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
    {
       "data": {
          "affected_items": [
             {
                "name": "wazuh-1",
                "synced": true
             },
             {
                "name": "wazuh-2",
                "synced": true
             },
             {
                "name": "wazuh-3",
                "synced": true
             }
          ],
          "total_affected_items": 3,
          "total_failed_items": 0,
          "failed_items": []
       },
       "message": "Nodes ruleset synchronization status was successfully read",
       "error": 0
    }
    
  • Restart a worker node wazuh-2

    Restart.node.mp4

Failed to restart

Cluster failed to restart 🟢
  • Check that cluster is running and synchronized

    # curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
    {
       "data": {
          "affected_items": [
             {
                "name": "wazuh-1",
                "synced": true
             },
             {
                "name": "wazuh-2",
                "synced": true
             },
             {
                "name": "wazuh-3",
                "synced": true
             }
          ],
          "total_affected_items": 3,
          "total_failed_items": 0,
          "failed_items": []
       },
       "message": "Nodes ruleset synchronization status was successfully read",
       "error": 0
    }
  • Restart Cluster

    restart_cluster_failed.mp4
Restart Node - Node disconnected during restart 🟡
  • Check that cluster is running and synchronized

    {
       "data": {
          "affected_items": [
             {
                "name": "wazuh-1",
                "synced": true
             },
             {
                "name": "wazuh-2",
                "synced": true
             },
             {
                "name": "wazuh-3",
                "synced": true
             }
          ],
          "total_affected_items": 3,
          "total_failed_items": 0,
          "failed_items": []
       }
    
  • Restart Worker node and stop the VM while it is restarting

    restart_node_disconnects.mp4
  • Note: The node's VM is rebooted during the restart and the node is removed from the cluster, so the UI skips the restart and assumes everything is normal. The user is not told that the node was removed, even though the UI was actively monitoring the restart, so they may not notice and may assume all nodes are working correctly. A message should inform the user when this happens.
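A possible way to surface this case is to compare the node list seen before the restart with the one seen afterwards. This is only a sketch under that assumption; `removed_nodes`, `restart_summary`, and the message wording are hypothetical and not part of the Wazuh app.

```python
from typing import Iterable, List


def removed_nodes(before: Iterable[str], after: Iterable[str]) -> List[str]:
    """Return nodes present before the restart but absent afterwards."""
    return sorted(set(before) - set(after))


def restart_summary(before: Iterable[str], after: Iterable[str]) -> str:
    """Build a user-facing message (hypothetical wording) for the restart result."""
    missing = removed_nodes(before, after)
    if missing:
        return "Restart finished, but these nodes left the cluster: " + ", ".join(missing)
    return "Restart finished; all nodes are still in the cluster."


# Node names as used in this test environment.
print(restart_summary(["wazuh-1", "wazuh-2", "wazuh-3"], ["wazuh-1", "wazuh-3"]))
```

With such a check, a node that disappears mid-restart produces an explicit warning instead of a silent success.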

Failed to restart - unexpected API response 🟡
  • Restart a node from outside the web app right before clicking restart (this triggers a new API response).

  • Check that cluster is running and synchronized

    {"title": "Bad Request", "detail": "Some Wazuh daemons are not ready yet in node \"wazuh-3\" (wazuh-modulesd->restarting, wazuh-analysisd->restarting, wazuh-execd->restarting, wazuh-db->restarting, wazuh-remoted->restarting)", "dapi_errors": {"wazuh-3": {"error": "Some Wazuh daemons are not ready yet in node \"wazuh-3\" (wazuh-modulesd->restarting, wazuh-analysisd->restarting, wazuh-execd->restarting, wazuh-db->restarting, wazuh-remoted->restarting)"}}, "error": 1017}
    
  • Restart Master node

    restart_error_caused_by_sync_error.mp4
  • Note: Because this response is neither expected nor handled, the app cannot restart Wazuh and goes straight to the health check. This response should be handled.
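Handling such responses could start with mapping the `error` code in the body to a message for the user. The sketch below is an assumption about how that might look, not the app's code: `describe_api_response` and the hint texts are hypothetical, and the only codes listed are the two that appear in this thread (1017 and 3021).

```python
from typing import Any, Dict

# Hypothetical user-facing hints, keyed by error codes observed in this issue.
HINTS = {
    1017: "Some daemons are still restarting; wait and retry.",
    3021: "The API request timed out; the cluster may be busy.",
}


def describe_api_response(body: Dict[str, Any]) -> str:
    """Turn a parsed API response body into a short message for the user."""
    code = body.get("error")
    if code == 0:
        return "OK"
    # Error bodies in this thread carry "detail"; successful ones carry "message".
    detail = body.get("detail") or body.get("message") or "no detail provided"
    hint = HINTS.get(code, "Unhandled API error; check the API logs.")
    return f"Error {code}: {detail} ({hint})"


print(describe_api_response({"title": "Bad Request", "detail": "Some Wazuh daemons are not ready yet", "error": 1017}))
```

Unknown codes fall through to a generic hint, so an unexpected response still produces a message rather than being silently skipped.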

Error restarting node 🟢
  • Check that cluster is running and synchronized
    # curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
    {
       "data": {
          "affected_items": [
             {
                "name": "wazuh-1",
                "synced": true
             },
             {
                "name": "wazuh-2",
                "synced": true
             },
             {
                "name": "wazuh-3",
                "synced": true
             }
          ],
          "total_affected_items": 3,
          "total_failed_items": 0,
          "failed_items": []
       },
       "message": "Nodes ruleset synchronization status was successfully read",
       "error": 0
    }
    
  • Restart Cluster

    restart_node_failed.mp4
Error restarting cluster with intensive file sync - 1000 copies 🔴
  • Check that cluster is running and synchronized

    # curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
    {
       "data": {
          "affected_items": [
             {
                "name": "wazuh-1",
                "synced": true
             },
             {
                "name": "wazuh-2",
                "synced": true
             },
             {
                "name": "wazuh-3",
                "synced": true
             }
          ],
          "total_affected_items": 3,
          "total_failed_items": 0,
          "failed_items": []
       },
       "message": "Nodes ruleset synchronization status was successfully read",
       "error": 0
    }
    
  • Execute copy script that copies 1000 rule files inside the master node.

  • Restart Cluster

    failed_restart_during_sync.mp4
  • Check the API

    # curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
    {"title": "Wazuh Internal Error", "detail": "Timeout executing API request", "dapi_errors": {"wazuh-1": {"error": "Timeout executing API request", "logfile": "WAZUH_HOME/logs/api.log"}}, "error": 3021}
    
    # curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
    {
       "data": {
          "affected_items": [
             {
                "name": "wazuh-1",
                "synced": true
             },
             {
                "name": "wazuh-2",
                "synced": false
             },
             {
                "name": "wazuh-3",
                "synced": false
             }
          ],
          "total_affected_items": 3,
          "total_failed_items": 0,
          "failed_items": []
       },
       "message": "Nodes ruleset synchronization status was successfully read",
       "error": 0
    }
    
  • Refresh the healthcheck


  • After a 5-minute wait, the cluster is still not in sync. Recording was started again: the app does not refresh or allow access to the API, and fails.

    sync_video_2.mp4
  • Wait another 10 minutes and continue waiting for sync

    Sync3.mp4


  • wazuh-indexer crashed after prolonged timeouts

    ● wazuh-indexer.service - Wazuh-indexer
       Loaded: loaded (/usr/lib/systemd/system/wazuh-indexer.service; enabled; vendor preset: disabled)
       Active: failed (Result: signal) since Wed 2022-08-17 18:43:37 UTC; 5min ago
         Docs: https://documentation.wazuh.com
      Process: 11172 ExecStart=/usr/share/wazuh-indexer/bin/systemd-entrypoint -p ${PID_DIR}/wazuh-indexer.pid --quiet (c>
     Main PID: 11172 (code=killed, signal=ABRT)
    
    ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: _java_thread_list=0x00007fd7680198b0, length=>
    ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: 0x00007fd81095f0a0, 0x00007fd8109607e0, 0x000>
    ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: 0x00007fd8109693e0, 0x00007fd81096ac10, 0x000>
    ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: 0x00007fd8109d1790,
    ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: 0x00007fd8109d5330, 0x00007fd8124bc580, 0x000>
    ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: 0x00007fd812c6d400, 0x00007fd812c73040, 0x000>
    ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: 0x00007fd76800dde0, 0x00007fd813e9da60, 0x000>
    ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd-entrypoint[11172]: 0x00007fd79c005230# [ timer expired, abort...>
    ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd[1]: wazuh-indexer.service: Main process exited, code=killed, sta>
    ago 17 18:43:37 ip-172-31-8-249.ec2.internal systemd[1]: wazuh-indexer.service: Failed with result 'signal'.
    
    
Error restarting cluster with intensive file sync - 50 copies 🔴
  • Check that cluster is running and synchronized

    # curl -k -X GET "https://localhost:55000/cluster/ruleset/synchronization?pretty=true" -H "Authorization: Bearer $TOKEN"
    {
       "data": {
          "affected_items": [
             {
                "name": "wazuh-1",
                "synced": true
             },
             {
                "name": "wazuh-2",
                "synced": true
             },
             {
                "name": "wazuh-3",
                "synced": true
             }
          ],
          "total_affected_items": 3,
          "total_failed_items": 0,
          "failed_items": []
       },
       "message": "Nodes ruleset synchronization status was successfully read",
       "error": 0
    }
    
  • Execute copy script that copies 50 rule files inside the master node.

  • Restart Cluster

    sync_50_copies.mp4
  • The OS became unresponsive and commands could not be run. API calls from the console also hung.

@jmv74211
Contributor

After testing the changes proposed in this issue and reviewing all the results, we have decided that the new development does not meet expectations. A new iteration is needed to improve the following aspects:

Regarding the restart/synchronization waiting modal:

  • The loading bar does not progress uniformly and gives the user no indication of roughly how long the wait will be. The bar should progress continuously or, if the duration of the task is unknown, be replaced by a circular loading indicator that only shows that work is in progress.

  • It has no elements showing the status of what is currently being done. For example, "Synchronizing Wazuh": synchronizing what?

Regarding performance:

  • When synchronizing several GB of data (see tests performed), the app becomes unmanageable: API timeouts occur throughout and wazuh-indexer can even crash.

@jmv74211
Contributor

Following these reports, we have rolled back the interface changes tested in this issue. A better design will be developed and included in a future release.

In the meantime, the next testing issue #3178 has been opened to test the behavior after the rollback.
