Test stuck during db node_setup (apt-get update) #8257

Open
2 tasks
soyacz opened this issue Aug 6, 2024 · 3 comments

soyacz commented Aug 6, 2024

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

During db node-4 preparation, apt-get update froze (while installing the Scylla Manager agent):

2024-08-03T03:13:47.007+00:00 longevity-twcs-48h-master-db-node-9e80e135-4   !NOTICE | sudo[8491]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/apt-get update
2024-08-03T03:13:47.007+00:00 longevity-twcs-48h-master-db-node-9e80e135-4     !INFO | sudo[8491]: pam_unix(sudo:session): session opened for user root(uid=0) by scyllaadm(uid=1000)
...

The sudo[8491] process never ends.
The culprit line is

self.remoter.sudo("apt-get update", ignore_status=True)

However, this could happen in other cases too, and we have no protection against a frozen setup.

We could add a timeout to all apt update calls and/or add protection at a higher level, around

with ThreadPoolExecutor(max_workers=4) as executor:
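A rough sketch of both ideas, using only the Python standard library. The names run_with_timeout and setup_nodes are hypothetical and are not part of scylla-cluster-tests; the sketch uses subprocess in place of the project's remoter, and whether remoter.sudo accepts a timeout argument is not shown in this issue. One detail worth noting: leaving a `with ThreadPoolExecutor(...)` block calls shutdown(wait=True), so a frozen worker would still block there unless the command itself is bounded by a timeout.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, wait


def run_with_timeout(cmd, timeout):
    """Run cmd (a list of arguments) and kill it after `timeout` seconds.

    Returns the exit code, or None if the command timed out.
    Hypothetical helper, standing in for a remoter call with a timeout.
    """
    try:
        return subprocess.run(cmd, timeout=timeout, check=False).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return None


def setup_nodes(nodes, setup_fn, timeout):
    """Run setup_fn(node) for every node in parallel.

    Returns the nodes whose setup did not finish within `timeout` seconds.
    Hypothetical helper; max_workers=4 mirrors the snippet above.
    """
    executor = ThreadPoolExecutor(max_workers=4)
    futures = {executor.submit(setup_fn, node): node for node in nodes}
    done, not_done = wait(futures, timeout=timeout)
    # Do not wait for stuck workers here (requires Python 3.9+ for
    # cancel_futures). The stuck threads themselves keep running, which is
    # why the per-command timeout above is still needed.
    executor.shutdown(wait=False, cancel_futures=True)
    for future in done:
        future.result()  # re-raise any exception from a finished setup
    return [futures[future] for future in not_done]
```

With something like this, the caller could fail the test early when setup_nodes returns a non-empty list, instead of hanging until the job is aborted.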

Impact

Frozen test, lost time and money

How frequently does it reproduce?

First time seen

Installation details

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • longevity-twcs-48h-master-db-node-9e80e135-4 (34.241.150.172 | 10.4.9.233) (shards: -1)
  • longevity-twcs-48h-master-db-node-9e80e135-3 (18.201.19.226 | 10.4.10.183) (shards: 7)
  • longevity-twcs-48h-master-db-node-9e80e135-2 (18.201.166.170 | 10.4.10.140) (shards: 7)
  • longevity-twcs-48h-master-db-node-9e80e135-1 (63.33.69.199 | 10.4.11.8) (shards: 7)

OS / Image: ami-00355186dfc821610 (aws: undefined_region)

Test: longevity-twcs-48h-test
Test id: 9e80e135-c117-4b9b-a87d-15aeb5dc363c
Test name: scylla-master/tier1/longevity-twcs-48h-test
Test method: longevity_twcs_test.TWCSLongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 9e80e135-c117-4b9b-a87d-15aeb5dc363c
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 9e80e135-c117-4b9b-a87d-15aeb5dc363c

Logs:

Jenkins job URL
Argus

fruch commented Aug 6, 2024

@soyacz

Let's keep track of similar cases; I don't remember seeing many of them.

@fruch fruch changed the title Test stuck during db node_setup Test stuck during db node_setup (apt-get update) Aug 6, 2024

fruch commented Aug 6, 2024

One more thing I've seen, though I don't know if it's related: there seem to be places where we scan the Scylla repo in apt update commands while we haven't updated the key.

We recently introduced a new key for signing deb packages, so it might be connected.

@vponomaryov

And process sudo[8491] never ends. culprit line is https://github.com/scylladb/scylla-cluster-tests/blob/master/sdcm/cluster.py#L1794 but actually it could happen in other cases too and we have no protection against frozen setup.

@soyacz please use commit IDs, not branch names, when you refer to a specific line of code.
This matters even more for a file that changes as often as this one:
within a short time the line will have moved significantly.
