CI Failure (key symptom) in UpgradeBackToBackTest.test_upgrade_with_all_workloads #21624

Closed
vbotbuildovich opened this issue Jul 24, 2024 · 9 comments
Assignees
Labels: auto-triaged (used to know which issues have been opened from a CI job), ci-failure, ci-rca/infra (CI Root Cause Analysis - Infrastructure Issue)

Comments

@vbotbuildovich
Collaborator

vbotbuildovich commented Jul 24, 2024

https://buildkite.com/redpanda/vtools/builds/15926

Module: rptest.tests.upgrade_test
Class: UpgradeBackToBackTest
Method: test_upgrade_with_all_workloads
Arguments: {
    "single_upgrade": false
}
test_id:    UpgradeBackToBackTest.test_upgrade_with_all_workloads
status:     FAIL
run time:   573.837 seconds

RemoteCommandError({'ssh_config': {'host': 'ducktape-node-10-amazingly-saving-quetzal', 'hostname': '10.168.0.124', 'user': 'root', 'port': 22, 'password': None, 'identityfile': '/home/ubuntu/.ssh/id_rsa'}, 'hostname': 'ducktape-node-10-amazingly-saving-quetzal', 'ssh_hostname': '10.168.0.124', 'user': 'root', 'externally_routable_ip': '34.102.33.140', '_logger': <Logger rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False-490 (DEBUG)>, 'os': 'linux', '_ssh_client': <paramiko.client.SSHClient object at 0x79b870b9b6a0>, '_sftp_client': <paramiko.sftp_client.SFTPClient object at 0x79b870bbd3c0>, '_custom_ssh_exception_checks': None}, 'python3 /opt/scripts/offline_log_viewer/viewer.py --path /var/lib/redpanda/data --type controller_snapshot', 1, b'INFO:viewer:starting metadata viewer with options: Namespace(path=\'/var/lib/redpanda/data\', type=\'controller_snapshot\', topic=None, verbose=False, dump=False, force=False)\nTraceback (most recent call last):\n  File "/opt/scripts/offline_log_viewer/viewer.py", line 235, in <module>\n    main()\n  File "/opt/scripts/offline_log_viewer/viewer.py", line 215, in main\n    print_controller_snapshot(store, options.dump)\n  File "/opt/scripts/offline_log_viewer/viewer.py", line 70, in print_controller_snapshot\n    SerializableGenerator(snap.to_dict().items()))\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1346, in to_dict\n    return self.parse_snapshot(sf)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1336, in parse_snapshot\n    data = reader.read_checksum_envelope(\n  File "/opt/scripts/offline_log_viewer/reader.py", line 139, in read_checksum_envelope\n    return self.read_envelope_inner(envelope, type_read, max_version)\n  File "/opt/scripts/offline_log_viewer/reader.py", line 144, in read_envelope_inner\n    v = type_read(self, envelope.version)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1337, in <lambda>\n    
type_read=lambda r, _: self.read_snapshot(r), max_version=2)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1305, in read_snapshot\n    data[\'topics\'] = rdr.read_envelope(\n  File "/opt/scripts/offline_log_viewer/reader.py", line 133, in read_envelope\n    return self.read_envelope_inner(envelope, type_read, max_version)\n  File "/opt/scripts/offline_log_viewer/reader.py", line 144, in read_envelope_inner\n    v = type_read(self, envelope.version)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1306, in <lambda>\n    type_read=lambda r, v: self.read_topics(r, v), max_version=1)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1186, in read_topics\n    rdr.read_serde_map(\n  File "/opt/scripts/offline_log_viewer/reader.py", line 204, in read_serde_map\n    key = k_reader(self)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1134, in read_tp_ns_to_str\n    return f"{v[\'namespace\']}/{v[\'topic\']}"\nTypeError: string indices must be integers\n')
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 105, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/upgrade_test.py", line 262, in test_upgrade_with_all_workloads
    controller_snapshot = log_viewer.read_controller_snapshot(
  File "/home/ubuntu/redpanda/tests/rptest/clients/offline_log_viewer.py", line 53, in read_controller_snapshot
    return self._json_cmd(node, "--type controller_snapshot")
  File "/home/ubuntu/redpanda/tests/rptest/clients/offline_log_viewer.py", line 34, in _json_cmd
    json_out = node.account.ssh_output(cmd, combine_stderr=False)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 41, in wrapper
    return method(self, *args, **kwargs)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 397, in ssh_output
    raise RemoteCommandError(self, cmd, exit_status, stderr.read())
ducktape.cluster.remoteaccount.RemoteCommandError: root@ducktape-node-10-amazingly-saving-quetzal: Command 'python3 /opt/scripts/offline_log_viewer/viewer.py --path /var/lib/redpanda/data --type controller_snapshot' returned non-zero exit status 1. Remote error message: b'INFO:viewer:starting metadata viewer with options: Namespace(path=\'/var/lib/redpanda/data\', type=\'controller_snapshot\', topic=None, verbose=False, dump=False, force=False)\nTraceback (most recent call last):\n  File "/opt/scripts/offline_log_viewer/viewer.py", line 235, in <module>\n    main()\n  File "/opt/scripts/offline_log_viewer/viewer.py", line 215, in main\n    print_controller_snapshot(store, options.dump)\n  File "/opt/scripts/offline_log_viewer/viewer.py", line 70, in print_controller_snapshot\n    SerializableGenerator(snap.to_dict().items()))\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1346, in to_dict\n    return self.parse_snapshot(sf)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1336, in parse_snapshot\n    data = reader.read_checksum_envelope(\n  File "/opt/scripts/offline_log_viewer/reader.py", line 139, in read_checksum_envelope\n    return self.read_envelope_inner(envelope, type_read, max_version)\n  File "/opt/scripts/offline_log_viewer/reader.py", line 144, in read_envelope_inner\n    v = type_read(self, envelope.version)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1337, in <lambda>\n    type_read=lambda r, _: self.read_snapshot(r), max_version=2)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1305, in read_snapshot\n    data[\'topics\'] = rdr.read_envelope(\n  File "/opt/scripts/offline_log_viewer/reader.py", line 133, in read_envelope\n    return self.read_envelope_inner(envelope, type_read, max_version)\n  File "/opt/scripts/offline_log_viewer/reader.py", line 144, in read_envelope_inner\n    v = type_read(self, envelope.version)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1306, in 
<lambda>\n    type_read=lambda r, v: self.read_topics(r, v), max_version=1)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1186, in read_topics\n    rdr.read_serde_map(\n  File "/opt/scripts/offline_log_viewer/reader.py", line 204, in read_serde_map\n    key = k_reader(self)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1134, in read_tp_ns_to_str\n    return f"{v[\'namespace\']}/{v[\'topic\']}"\nTypeError: string indices must be integers\n'
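
For context on the final `TypeError` in the remote traceback: Python raises it when a `str` is indexed with a string key, which suggests `read_tp_ns_to_str` received a plain string where it expected a dict-like value. The snippet below is only a minimal sketch of that failure mode. The function `tp_ns_to_str` and its sample inputs are hypothetical and are not the viewer's actual code; only the f-string is copied from the traceback.

```python
def tp_ns_to_str(v):
    # Same f-string as in the traceback; it expects a mapping with both keys.
    return f"{v['namespace']}/{v['topic']}"


# A dict-like value works as intended.
ok = tp_ns_to_str({"namespace": "kafka", "topic": "t"})
print(ok)

# A plain string (hypothetical input) triggers the TypeError seen above.
try:
    tp_ns_to_str("kafka")
except TypeError as e:
    err = str(e)
    print(err)
```

The exact wording of the message varies slightly between Python versions, but it begins with "string indices must be integers" as in the log above.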

JIRA Link: CORE-5780

@vbotbuildovich vbotbuildovich added the auto-triaged and ci-failure labels Jul 24, 2024
@rpdevmp
Contributor

rpdevmp commented Jul 25, 2024

Many tests failed due to the same infra issue: remote commands kept failing.

RemoteCommandError({'ssh_config': {'host': 'ip-172-31-3-159', 'hostname': '172.31.3.159', 'user': 'root', 'port': 22, 'password': None, 'identityfile': '/home/ubuntu/.ssh/id_rsa'}, 'hostname': 'ip-172-31-3-159', 'ssh_hostname': '172.31.3.159', 'user': 'root', 'externally_routable_ip': '54.184.188.225',

Example Buildkite Job:
https://buildkite.com/redpanda/vtools/builds/15928

Going to close the others as duplicates of this issue.

  1. The "insert results into analytics DB" error that can be seen in CI runs is already fixed.
  2. Work is in progress to improve the PandaTriage logic so that it can group issues by root cause and avoid opening a GH issue for each test when there is a common infra issue, for example.
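
To illustrate the grouping idea in item 2, here is a rough sketch that buckets failures by the exception class that prefixes each failure summary. It is not PandaTriage's actual logic; `error_signature`, `group_by_signature`, and the test ids in the sample data are all made up for illustration.

```python
from collections import defaultdict


def error_signature(summary: str) -> str:
    """Return the exception class name that prefixes a failure summary."""
    return summary.split("(", 1)[0].strip()


def group_by_signature(failures):
    """Map signature -> list of test ids.

    `failures` is an iterable of (test_id, summary) pairs.
    """
    groups = defaultdict(list)
    for test_id, summary in failures:
        groups[error_signature(summary)].append(test_id)
    return dict(groups)


# Hypothetical sample data; the test ids are placeholders.
failures = [
    ("TestA.test_one", "RemoteCommandError({'ssh_config': ...})"),
    ("TestB.test_two", "RemoteCommandError({'ssh_config': ...})"),
    ("TestC.test_three", "ClientError('An error occurred ...')"),
]
groups = group_by_signature(failures)
for sig, tests in groups.items():
    print(sig, tests)
```

With a grouping like this, one GH issue could be opened per signature instead of per test. A real implementation would likely need a finer signature than the class name alone.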

@rpdevmp
Contributor

rpdevmp commented Jul 25, 2024

Also, in some tests an additional error is present:

Example Buildkite Job:
https://buildkite.com/redpanda/vtools/builds/15928

ClientError('An error occurred (AuthenticationRequired) when calling the ListBuckets operation: Authentication required.')
Traceback (most recent call last):
  File "/home/ubuntu/redpanda/tests/rptest/archival/s3_client.py", line 530, in list_objects
    res = self._list_objects(bucket=bucket,
  File "/home/ubuntu/redpanda/tests/rptest/archival/s3_client.py", line 47, in do_retry
    return fn(*args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/archival/s3_client.py", line 503, in _list_objects
    return client.list_objects_v2(Bucket=bucket,
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/botocore/client.py", line 964, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AuthenticationRequired) when calling the ListObjectsV2 operation: Authentication required.

We need to investigate and fix both of these and close all other GH issues that were opened due to builds 15926 or 15928.

P.S.
Also, going to open a Jira task to improve PandaTriage logging so that it is easier to map a test error to its GH issue and the related Buildkite job.

Example of how this should be logged:

opening issue for test failure StreamVerifierTest.test_simple_produce_consume_txn_with_add_node created issue: https://github.com/redpanda-data/redpanda/issues/21625 Based on CI job: https://buildkite.com/redpanda/vtools/builds/15926
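
A small sketch of a helper that would produce a line in this shape. The function name `format_issue_log` and its parameters are hypothetical, not existing PandaTriage code.

```python
def format_issue_log(test_id: str, issue_url: str, job_url: str) -> str:
    """Build one log line linking a test failure, its GH issue, and the CI job."""
    return (f"opening issue for test failure {test_id} "
            f"created issue: {issue_url} "
            f"Based on CI job: {job_url}")


line = format_issue_log(
    "StreamVerifierTest.test_simple_produce_consume_txn_with_add_node",
    "https://github.com/redpanda-data/redpanda/issues/21625",
    "https://buildkite.com/redpanda/vtools/builds/15926",
)
print(line)
```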

@rpdevmp rpdevmp self-assigned this Jul 25, 2024
This was referenced Jul 25, 2024
This was referenced Jul 31, 2024
@piyushredpanda
Contributor

Closing older bot-filed CI issues as we transition to a more reliable system.
