Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PESDLC-1113 Copy logs from agent and RP pods #17772

Merged
merged 11 commits into from
May 9, 2024
Merged

Conversation

savex
Copy link
Contributor

@savex savex commented Apr 10, 2024

Take advantage of call to logs property and simulate DT log copy using streaming approach to eliminate huge memory consumptions.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@savex savex requested review from travisdowns and rpdevmp April 10, 2024 21:21
@savex savex marked this pull request as ready for review April 22, 2024 22:55
@savex
Copy link
Contributor Author

savex commented Apr 22, 2024

EC2 test run:

ubuntu@ip-172-31-4-5:~/tests$  cd /home/ubuntu/tests ; /usr/bin/env /opt/.ducktape-venv/bin/python3 /home/ubuntu/.vscode-server/extensions/ms-python.debugpy-2024.5.11001012-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 48537 -- -m ducktape --compress --cluster=ducktape.cluster.json.JsonCluster --cluster-file=cluster.json --globals=globals.json --max-parallel=1 --repeat=1 --test-runner-timeout=86400000 rptest/redpanda_cloud_tests/cloud_self_test.py::SelfRedpandaCloudTest.test_healthy 
[INFO:2024-04-22 22:40:13,692]: starting test run with session id 2024-04-22--006...
[INFO:2024-04-22 22:40:13,692]: running 1 tests...
[INFO:2024-04-22 22:40:13,693]: Triggering test 1 of 1...
[INFO:2024-04-22 22:40:14,642]: RunnerClient: Loading test {'directory': '/home/ubuntu/redpanda/tests/rptest/redpanda_cloud_tests', 'file_name': 'cloud_self_test.py', 'cls_name': 'SelfRedpandaCloudTest', 'method_name': 'test_healthy', 'injected_args': None}
[INFO:2024-04-22 22:40:14,646]: RunnerClient: rptest.redpanda_cloud_tests.cloud_self_test.SelfRedpandaCloudTest.test_healthy: on run 1/1
[INFO:2024-04-22 22:40:14,968]: RunnerClient: rptest.redpanda_cloud_tests.cloud_self_test.SelfRedpandaCloudTest.test_healthy: Setting up...
[WARNING - 2024-04-22 22:40:18,650 - redpanda_cloud - create - lineno:899]: will not create cluster; already have cluster_id cojcnl1ai2dm17j4nqng
WARN [TBOT]      CLI parameters are overriding onboarding config from  config/config.go:472
INFO [TBOT]      Created directory "/tmp/tbot-data" config/destination_directory.go:132
INFO [TBOT]      Anonymous telemetry is not enabled. Find out more about Machine ID's anonymous telemetry at https://goteleport.com/docs/machine-id/reference/telemetry/ tbot/anonymous_telemetry.go:83
INFO [TBOT]      Attempting to generate new identity from token tbot/renew.go:510
INFO [AUTH]      Attempting registration via proxy server. auth/register.go:283
INFO [AUTH]      Attempting to register Bot with IAM method using regional STS endpoint auth/register.go:629
INFO [AUTH]      Successfully registered Bot with IAM method using regional STS endpoint auth/register.go:662
INFO [AUTH]      Successfully registered via proxy server. auth/register.go:290
INFO [TBOT]      Successfully generated new bot identity, valid: after=2024-04-22T22:39:20Z, before=2024-04-23T04:40:19Z, duration=6h0m59s | kind=tls, renewable=false, disallow-reissue=false, roles=[bot-buildkite-robot], principals=[-teleport-internal-join], generation=0 tbot/tbot.go:446
INFO [TBOT]      Beginning renewal loop: ttl=6h0m0s interval=6h0m0s tbot/renew.go:726
INFO [TBOT]      Started watching for CA rotations tbot/ca_rotation.go:173
INFO [TBOT]      Attempting to generate new identity from token tbot/renew.go:510
INFO [AUTH]      Attempting registration via proxy server. auth/register.go:283
INFO [AUTH]      Attempting to register Bot with IAM method using regional STS endpoint auth/register.go:629
INFO [AUTH]      Successfully registered Bot with IAM method using regional STS endpoint auth/register.go:662
INFO [AUTH]      Successfully registered via proxy server. auth/register.go:290
INFO [TBOT]      Successfully renewed bot certificates, valid: after=2024-04-22T22:39:22Z, before=2024-04-23T04:40:21Z, duration=6h0m59s | kind=tls, renewable=false, disallow-reissue=false, roles=[bot-buildkite-robot], principals=[-teleport-internal-join], generation=0 tbot/renew.go:634
INFO [TBOT]      Successfully renewed impersonated certificates for directory /tmp/machine-id, valid: after=2024-04-22T22:39:23Z, before=2024-04-23T04:40:23Z, duration=6h1m0s | kind=tls, renewable=false, disallow-reissue=true, roles=[teleport-admin], principals=[root redpanda {{internal.logins}} -teleport-internal-join], generation=0 tbot/renew.go:697
INFO [TBOT]      Persisted certificates successfully. One-shot mode enabled so exiting. tbot/renew.go:780
WARN [TBOT]      Context canceled during backoff for CA rotation watcher. Aborting. tbot/ca_rotation.go:139
=== kubectl already installed
=== k alias already created
=== k9s already installed
=== AWS VM detected. Running AWS specific configurations
=== k8s context already configured
=== grpcurl already installed
=== unzip already installed
=== rpk already installed
=== stern already installed
[INFO:2024-04-22 22:40:39,041]: RunnerClient: rptest.redpanda_cloud_tests.cloud_self_test.SelfRedpandaCloudTest.test_healthy: Running...
[WARNING - 2024-04-22 22:42:55,050 - utils - _check_oversized_allocations - lineno:120]: Ignoring oversized allocation, 217088 is less than the max allowable allocation size of 409600 bytes
[WARNING - 2024-04-22 22:42:55,050 - utils - _check_oversized_allocations - lineno:120]: Ignoring oversized allocation, 217088 is less than the max allowable allocation size of 409600 bytes
[WARNING - 2024-04-22 22:43:19,950 - utils - _check_oversized_allocations - lineno:120]: Ignoring oversized allocation, 217088 is less than the max allowable allocation size of 409600 bytes
[WARNING - 2024-04-22 22:43:19,950 - utils - _check_oversized_allocations - lineno:120]: Ignoring oversized allocation, 217088 is less than the max allowable allocation size of 409600 bytes
[WARNING - 2024-04-22 22:43:43,433 - utils - _check_oversized_allocations - lineno:120]: Ignoring oversized allocation, 217088 is less than the max allowable allocation size of 409600 bytes
[WARNING - 2024-04-22 22:43:43,434 - utils - _check_oversized_allocations - lineno:120]: Ignoring oversized allocation, 217088 is less than the max allowable allocation size of 409600 bytes
[INFO:2024-04-22 22:43:43,435]: RunnerClient: rptest.redpanda_cloud_tests.cloud_self_test.SelfRedpandaCloudTest.test_healthy: Tearing down...
[INFO:2024-04-22 22:43:43,437]: RunnerClient: rptest.redpanda_cloud_tests.cloud_self_test.SelfRedpandaCloudTest.test_healthy: PASS
[INFO:2024-04-22 22:43:43,438]: RunnerClient: rptest.redpanda_cloud_tests.cloud_self_test.SelfRedpandaCloudTest.test_healthy: Data: None
test_id:    rptest.redpanda_cloud_tests.cloud_self_test.SelfRedpandaCloudTest.test_healthy
status:     PASS
run time:   3 minutes 28.791 seconds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
========================================================================================================================================================================================================================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.8.18
session_id:       2024-04-22--006
run time:         3 minutes 30.361 seconds
tests run:        1
passed:           1
flaky:            0
failed:           0
ignored:          0
opassed:          0
ofailed:          0
opassedfips:      0
ofailedfips:      0
========================================================================================================================================================================================================================================================================================
ubuntu@ip-172-31-4-5:~/tests$

@savex
Copy link
Contributor Author

savex commented Apr 22, 2024

/ci-repeat 1

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Apr 23, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/48123#018f085c-7474-4b3e-b2ac-9d1bd0d8dae4:

"rptest.tests.full_disk_test.FullDiskReclaimTest.test_full_disk_triggers_gc"

new failures in https://buildkite.com/redpanda/redpanda/builds/48230#018f119a-6ef6-4e46-8a17-b7cbef6e856e:

"rptest.tests.rbac_upgrade_test.UpgradeMigrationCreatingDefaultRole.test_rbac_migration"

@savex savex marked this pull request as draft April 23, 2024 17:16
@savex savex force-pushed the PESDLC-1113-add-htt-logs-copy branch from 97ac409 to 5eaa5ef Compare April 24, 2024 14:25
@savex
Copy link
Contributor Author

savex commented Apr 24, 2024

/ci-repeat 1

@savex savex force-pushed the PESDLC-1113-add-htt-logs-copy branch 3 times, most recently from c72decb to 39550ff Compare April 25, 2024 22:51
@savex savex marked this pull request as ready for review April 25, 2024 22:56
@savex
Copy link
Contributor Author

savex commented Apr 25, 2024

/ci-repeat 1

@savex
Copy link
Contributor Author

savex commented Apr 25, 2024

Linter error are from separate module and not related to this change

Run shfmt -i 2 -ci -s -d .
--- tests/docker/ducktape-deps/tinygo-wasi-transforms.orig
+++ tests/docker/ducktape-deps/tinygo-wasi-transforms
@@ -10,7 +10,7 @@
   local count=0
   until "$@"; do
     exit=$?
-    count=$(($count + 1))
+    count=$((count + 1))
     if [ $count -lt $retries ]; then
       echo "Retry $count/$retries exited $exit, retrying..."
     else
Error: Process completed with exit code 1.

@savex
Copy link
Contributor Author

savex commented Apr 25, 2024

/ci-repeat 1

@savex savex force-pushed the PESDLC-1113-add-htt-logs-copy branch from 21bdc15 to 53314a2 Compare April 26, 2024 14:34
@savex savex requested a review from a team as a code owner April 26, 2024 14:34
for line in process.stdout: # type: ignore
yield line
except Exception as e:
raise RuntimeError(f'Failed to capture output') from e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What value does this exception chaining add? Actually I find the error confusing because it's not really the capture that failed in almost all cases but the process itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a way to show some helpful errors if a process failed with something meaningful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alas, there should be separate ticket/task to update this to return actual error from shell. Right now underlying error masks it under "return code 1" exception and this is useful when something wrong with the environment

@savex savex requested a review from travisdowns April 29, 2024 21:51
savex added 6 commits May 3, 2024 14:38
    Implement simple class to track time and
    log milestones with ease.
   Implement simple script that extracts specific log lines from rp log
   based on broker name and time.
   To eliminate additional function call, just implement fake account
   class inside Cloud broker with similar functionality
   To unify calls to node shell, put it inside CloudBroker class.
   This way actual shell pod could be left intact and not being
   created/deleted for each activity
   Enable KubeNodeShell to be initialized in demand
   and clean when needed
   Use CloudBroker initialization to inject script onto broker node
savex added 3 commits May 3, 2024 14:56
   Copy agent logs first
   Then use ThreadPoolExecutor to copy logs from brokers

   Use stopwatch to track time
@savex savex force-pushed the PESDLC-1113-add-htt-logs-copy branch from f0e51fc to 91d418b Compare May 3, 2024 19:59
@savex
Copy link
Contributor Author

savex commented May 3, 2024

Rebased with latest changes from @travisdowns on capture errors from cmd output

@savex savex force-pushed the PESDLC-1113-add-htt-logs-copy branch from 91d418b to 0cb51c8 Compare May 3, 2024 22:28
   Also refactor capture cmd to support latest updates to cmd
@savex savex force-pushed the PESDLC-1113-add-htt-logs-copy branch from 0cb51c8 to 97928f5 Compare May 3, 2024 22:44
tests/rptest/clients/kubectl.py Outdated Show resolved Hide resolved
@@ -16,6 +25,10 @@ def __init__(self, pod, kubectl, logger) -> None:
self._meta = pod['metadata']
self.name = self._meta['name']

# Backward compatibility
# Various classes will use this to hash and compare nodes
self.account(pod)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this account need to be a class, that we instantiate?

Actually I don't really understand what is happening here at all :).

After this line, the CloudBroker object is unchanged right?

Looking further down, may be the intent is something like self.account = account(pod)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this is a mimic of ducktape mechanism that is defined here.

Ducktape uses separate class to define sophisticated way to do SSH to node and run commands: Linux_RemoteAccount. Since we need to be consistent with normal logs extraction when searching for errors inside them in raise_on_badlogs, we just defining fake class with one needed attribute: hostname which is consumed in multiple places as if it is duck-tape's node.account. class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Primary cause for this dance is here.

Class BafLogLines used in separate module and there was a discussion couple of weeks back not to overcomplicate it with inheritance (which was original solution)

tests/rptest/services/utils.py Outdated Show resolved Hide resolved
tests/rptest/services/utils.py Outdated Show resolved Hide resolved
tests/rptest/services/utils.py Outdated Show resolved Hide resolved
tests/rptest/services/utils.py Show resolved Hide resolved
@savex savex requested a review from jackietung-redpanda May 8, 2024 21:15
@savex savex force-pushed the PESDLC-1113-add-htt-logs-copy branch from c074747 to 5ef6f10 Compare May 8, 2024 22:24
@savex savex force-pushed the PESDLC-1113-add-htt-logs-copy branch from 5ef6f10 to a8909aa Compare May 8, 2024 22:25
@savex savex merged commit 1acaf9d into dev May 9, 2024
17 checks passed
@savex savex deleted the PESDLC-1113-add-htt-logs-copy branch May 9, 2024 18:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants