TFA-FIX:CEPH-83595932-To verify crashes while executing drain and mgr failover commands #4230

SrinivasaBharath · 2024-11-18T01:34:47Z

Description

The test case execution failed. The failure log is:
http://magna002.ceph.redhat.com/cephci-jenkins/results/openstack/RH/8.0/rhel-9/Regression/19.2.0-52/rados/45/tier-2_rados_test-drain-customer-issue

Jira tasks to track the issue are -

Please include Automation development guidelines. Source of Test case - New Feature/Regression Test/Close loop of customer BZs

Checklist:

Create a test case in Polarion reviewed and approved.
Create a design/automation approach doc. Optional for tests with similar tests already automated.
Review the automation design
Implement the test script and perform test runs
Submit PR for code review and approve
Update Polarion Test with Automation script details and update automation fields
If automation is part of Close loop, update BZ flag qe-test_coverage “+” and link Polarion test

openshift-ci · 2024-11-18T01:34:52Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SrinivasaBharath

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

SrinivasaBharath · 2024-11-18T04:14:25Z

I tested the scenario multiple times, and the following are the pass logs:

pdhiran · 2024-11-18T16:32:52Z

ceph/rados/serviceability_workflows.py

+                status_flag = False
+                end_time = datetime.datetime.now() + datetime.timedelta(seconds=600)
+                while end_time > datetime.datetime.now():
+                    out, err = self.cephadm.shell([status_cmd])


Any reason for executing commands with self.cephadm.shell() and then performing operations on data?

Please add the reason for using self.cephadm.shell() and then performing operations on the data.

The information is added.

pdhiran · 2024-11-18T16:36:25Z

ceph/rados/serviceability_workflows.py

+                                f"OSD remove operation is in progress {osd_id}\nOperations: {entry}"
+                            )
+                    except json.JSONDecodeError:
+                        log.info(f"The OSD removal is completed on OSD : {osd_id}")


This could mean either that the OSD removal is complete or that no OSD removal was started in the first place.
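One way to resolve this ambiguity is to remember whether the OSD was ever seen in the removal queue. The sketch below is illustrative only and is not the code in this PR: `fetch_status` is a hypothetical callable standing in for whatever runs the status command and returns its raw output, the `osd_id` key in each queue entry is an assumption, and treating non-JSON output as an empty queue mirrors the `except json.JSONDecodeError` branch in the diff.

```python
import json
import time
from typing import Callable


def wait_for_osd_removal(
    fetch_status: Callable[[], str],
    osd_id: int,
    timeout: float = 600.0,
    interval: float = 10.0,
) -> str:
    """Poll the removal queue and return 'completed', 'never_started' or 'timed_out'."""
    seen_in_queue = False
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        raw = fetch_status()
        try:
            entries = json.loads(raw)
        except json.JSONDecodeError:
            # Assumption: output that is not JSON means the queue is empty.
            entries = []
        if any(entry.get("osd_id") == osd_id for entry in entries):
            seen_in_queue = True
        else:
            # The OSD is absent: this only counts as completion if it was queued earlier.
            return "completed" if seen_in_queue else "never_started"
        time.sleep(interval)
    return "timed_out"
```

With this shape, the caller can log "removal completed" only for the `completed` result and fail the test on `never_started`.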

pdhiran · 2024-11-18T16:49:01Z

tests/rados/test_node_drain_customer_bug.py

@@ -135,6 +135,12 @@ def run(ceph_cluster, **kw):
        log.info(
            f"The OSDs in the drain node before starting the test - {osd_count_before_test} "
        )
+        cmd_set_unmanaged = (


This is expecting that the OSD spec on the cluster is with all-available-devices.

Can we write a generic method to export the spec from cluster, add a new unmanaged=True key and apply the spec?

Something similar to set_mon_service_managed_type() present in monitor workflows.

New method is created and included.
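A minimal sketch of the spec-based approach suggested in this thread, assuming the exported service spec has already been parsed into a Python dict. The name `mark_unmanaged` is hypothetical and is not the method added in this PR; exporting the spec and applying it back to the cluster are left out.

```python
import copy


def mark_unmanaged(spec: dict) -> dict:
    """Return a copy of an exported service spec with unmanaged set to True."""
    updated = copy.deepcopy(spec)  # leave the caller's spec untouched
    updated["unmanaged"] = True
    return updated
```

Because the function only adds one key to whatever spec it receives, it does not depend on the OSD service having been deployed with all-available-devices.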

pdhiran · 2024-11-18T16:49:30Z

tests/rados/test_node_drain_customer_bug.py

@@ -152,6 +158,12 @@ def run(ceph_cluster, **kw):
                "The traceback messages are noticed in logs.The error snippets are noticed in the MGR logs"
            )
            return 1
+        cmd_unset_unmanaged = (


Same as above. Let's use the new method that will be created.

Used new method.

pdhiran

Overall looks good.

A new method that sets a service to managed or unmanaged (unmanaged true/false) should be created and used here.

mergify · 2024-11-22T07:39:54Z

"This pull request now has conflicts with the target branch. Could you please resolve conflicts and force push the corrected changes?"

… failover commands and preempt scrub fix Signed-off-by: Srinivasa Bharath Kanta <skanta@redhat.com>

SrinivasaBharath · 2024-11-22T10:05:29Z

Preempt scrub fix pass log- http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-HWPGNS

harshkumarRH · 2024-11-22T10:17:10Z

ceph/rados/utils.py

@@ -45,7 +45,6 @@ def set_osd_devices_unmanaged(ceph_cluster, osd_id, unmanaged):
            break

    if not service_name:
-        log.error(f"No orch service found for osd: {osd_id}")


@SrinivasaBharath please explain why we are making this change.

harshkumarRH · 2024-11-22T10:20:13Z

ceph/rados/core_workflows.py

+            log.debug(
+                f"Setting the {service_type} service as unmanaged by cephadm. current status : {out}"
+            )
+            out["unmanaged"] = "false"


Here, instead of explicitly setting it to False, we should remove the "unmanaged" key from the dictionary if it exists.
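The suggestion can be sketched as follows, assuming the spec is a plain dict. `mark_managed` is a hypothetical name used only for illustration. Note that the diff assigns the string "false" rather than a boolean; in Python any non-empty string is truthy, so how that value is interpreted depends on later serialization, and dropping the key sidesteps the question.

```python
def mark_managed(spec: dict) -> dict:
    """Return a copy of the spec without the 'unmanaged' key."""
    updated = dict(spec)
    # pop() with a default does nothing when the key is absent, so no
    # prior "if 'unmanaged' in spec" check is needed.
    updated.pop("unmanaged", None)
    return updated
```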

harshkumarRH · 2024-11-22T10:22:40Z

ceph/rados/core_workflows.py

+
+        time.sleep(10)
+        # Checking for the unmanaged setting on service
+        cmd = "ceph orch ls"


Please replace the command with "ceph orch ls {service_type}" to avoid the for loop traversal

[ceph: root@ceph-hakumar-ryth74-node1-installer /]# ceph orch ls mon -f json-pretty
[
    {
        "placement": {
            "label": "mon"
        },
        "service_name": "mon",
        "service_type": "mon",
        "status": {
            "created": "2024-11-15T20:59:25.948847Z",
            "last_refresh": "2024-11-22T10:12:24.482009Z",
            "running": 3,
            "size": 3
        }
    }
]
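With the output already filtered by service type, the verification step reduces to reading one flag per returned entry. The sketch below assumes the JSON shape shown in the sample output and assumes that a managed service simply has no "unmanaged" key (as in that sample); `unmanaged_by_service` is a hypothetical helper, not part of the PR.

```python
import json
from typing import Dict


def unmanaged_by_service(orch_ls_json: str) -> Dict[str, bool]:
    """Map each service_name in the JSON output to its unmanaged state."""
    services = json.loads(orch_ls_json)
    return {
        svc["service_name"]: bool(svc.get("unmanaged", False))
        for svc in services
    }
```

Returning a mapping rather than a single boolean keeps the helper usable for service types that may report more than one service.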

harshkumarRH · 2024-11-22T10:25:14Z

tests/rados/test_node_drain_customer_bug.py

    replicated_config = config.get("replicated_pool")
    pool_name = replicated_config["pool_name"]
-    active_osd_list = rados_obj.get_osd_list(status="up")
+    active_osd_list = rados_obj.get_active_osd_list()


Please revert the change in this line to "active_osd_list = rados_obj.get_osd_list(status="up")"
The get_active_osd_list() method no longer exists

harshkumarRH · 2024-11-22T10:25:27Z

tests/rados/test_node_drain_customer_bug.py

@@ -133,8 +134,10 @@ def run(ceph_cluster, **kw):
    try:
        osd_count_before_test = get_node_osd_list(rados_obj, ceph_nodes, drain_host)
        log.info(
-            f"The OSDs in the drain node before starting the test - {osd_count_before_test} "
+            f"st The OSDs in the drain node before starting the te- {osd_count_before_test} "


Please fix typo

harshkumarRH · 2024-11-22T10:26:04Z

tests/rados/test_node_drain_customer_bug.py

@@ -194,7 +200,7 @@ def run(ceph_cluster, **kw):
            return 1

        if bug_exists:
-            active_osd_list = rados_obj.get_osd_list(status="up")
+            active_osd_list = rados_obj.get_active_osd_list()


Please revert to using get_osd_list(status="up")

harshkumarRH · 2024-11-22T10:31:31Z

tests/rados/test_rados_preempt_scrub.py

@@ -85,7 +85,7 @@ def run(ceph_cluster, **kw):

            log_lines = [
                "head preempted",
-                "WaitReplicas::react(const GotReplicas&) PREEMPTED",
+                "WaitReplicas::react(const GotReplicas&) PREEMPTED!",


@SrinivasaBharath I request more clarity here.
The comparison happening in def verify_preempt_log is "not in", that is, a substring check.
So if a subset of the sentence was previously not being found in the log lines, how is adding an additional character '!' going to make any difference?
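The reviewer's point can be illustrated with a small sketch, assuming the check is a plain substring test (this is a reading of the comment, not confirmed against verify_preempt_log itself). `missing_patterns` is a hypothetical helper written for this illustration.

```python
from typing import List


def missing_patterns(log_text: str, patterns: List[str]) -> List[str]:
    """Return the patterns that do not occur anywhere in log_text."""
    return [pattern for pattern in patterns if pattern not in log_text]


old = "WaitReplicas::react(const GotReplicas&) PREEMPTED"
new = old + "!"

# A log line that ends without '!' contains the old pattern but not the new one.
print(missing_patterns("osd.1 " + old + "\n", [old, new]))
# A log line that ends with '!' contains both patterns.
print(missing_patterns("osd.1 " + new + "\n", [old, new]))
```

Under a substring test, every log that contains the longer pattern also contains the shorter one, so appending '!' can only make the match stricter, never more permissive.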

SrinivasaBharath force-pushed the rados_tfa_wip branch from 35175b4 to 51b25da Compare November 18, 2024 04:04

SrinivasaBharath requested review from harshkumarRH and pdhiran November 18, 2024 04:28

SrinivasaBharath added the RADOS (Rados Core) and tfa-issue-fix (TFA automation issue fix) labels Nov 18, 2024

pdhiran reviewed Nov 18, 2024


openshift-merge-robot added the needs-rebase label Nov 22, 2024

SrinivasaBharath force-pushed the rados_tfa_wip branch from 4b5e8fa to ffb1680 Compare November 22, 2024 09:33

openshift-merge-robot removed the needs-rebase label Nov 22, 2024

TFA-FIX:CEPH-83595932-To verify crashes while executing drain and mgr…

01241e3

… failover commands and preempt scrub fix Signed-off-by: Srinivasa Bharath Kanta <skanta@redhat.com>

SrinivasaBharath force-pushed the rados_tfa_wip branch from dee438f to 01241e3 Compare November 22, 2024 10:04

SrinivasaBharath requested a review from pdhiran November 22, 2024 10:08

harshkumarRH reviewed Nov 22, 2024
