Windows: timeout during CPU fingerprinting #4439

Closed
capone212 opened this issue Jun 21, 2018 · 4 comments

@capone212
Contributor

capone212 commented Jun 21, 2018

Nomad version

v0.8.3

Operating system and Environment details

Windows

Nomad Client logs (if appropriate)

==> Starting Nomad agent...
==> Error starting agent: client setup failed 2: fingerprinting failed: cannot detect cpu total compute. CPU compute must be set manually using the client config option "cpu_total_compute"
    2018/06/20 21:50:25.899751 [WARN] fingerprint.cpu: 1 error(s) occurred:
* Unable to obtain CPU information: <nil>
 2018/06/20 21:50:25.899751 [DEBUG] fingerprint.cpu: core count: 24

Issue

The Nomad client crashes at startup after CPU fingerprinting fails. With enhanced logging I was able to obtain a more useful error message:


2018/06/20 13:35:13 [WARN] fingerprint.cpu: 1 error(s) occurred:
* Unable to obtain CPU information: initErr:<nil> error:context deadline exceeded

We see this problem on a small set of Windows boxes. On those boxes a WMI request takes a long time (about 17 seconds). It clearly depends on the environment, but I can't point to a specific reason for this behavior.

I am attaching the patch that solved the issue for us. I am not opening a pull request because the fix was made in an external library.

0001-increased-timeout-during-CPU-fingerprinting.patch.txt

@angrycub
Contributor

I experienced the same situation in 0.8.3 and resolved it by upgrading to 0.8.4, which includes #4265 for the log line and #4268 to extend the timeout for queries to 10 seconds.

It is interesting that you are seeing 17-second WMI queries; those will still exceed the newly extended timeout. However, you could use the same technique as #4268 and extend that timeout constant. This would at least keep your modifications within the Nomad project.

@capone212
Contributor Author

Nice! Thanks for the info.
I glanced at the changes and noticed this:
cpuInfoTimeout = 10 * time.Second

10 seconds is clearly too short for our case.

@sichkarmg

sichkarmg commented Jun 21, 2018

Hello,

Below are test results from real Windows Server 2012 R2 machines in a production environment. The numbers are execution times in seconds, written with a decimal comma. Tests were repeated 20 times for each of four servers.

server1 | 16,49
server1 | 16,56
server1 | 16,55
server1 | 16,49
server1 | 16,60
server1 | 16,48
server1 | 16,48
server1 | 16,47
server1 | 16,65
server1 | 16,50
server1 | 16,48
server1 | 16,48
server1 | 16,47
server1 | 16,50
server1 | 16,48
server1 | 16,64
server1 | 16,50
server1 | 16,51
server1 | 16,48
server1 | 16,50
server1 | 16,58
server1 | 16,49
server1 | 16,54
server1 | 16,62
server1 | 16,48
server1 | 16,51
server1 | 16,56
server2 | 16,65
server2 | 16,68
server2 | 19,87
server2 | 16,81
server2 | 16,74
server2 | 16,67
server2 | 16,68
server2 | 19,97
server2 | 19,86
server2 | 16,65
server2 | 16,68
server2 | 16,68
server2 | 16,68
server2 | 16,66
server2 | 16,69
server2 | 16,69
server2 | 16,71
server2 | 20,78
server2 | 16,69
server2 | 16,73
server2 | 16,66
server2 | 19,90
server2 | 20,36
server2 | 16,65
server2 | 38,07
server2 | 19,90
server2 | 16,67
server3 | 16,70
server3 | 16,85
server3 | 16,94
server3 | 17,08
server3 | 16,81
server3 | 20,11
server3 | 17,43
server3 | 16,71
server3 | 16,67
server3 | 16,72
server3 | 19,97
server3 | 19,93
server3 | 20,10
server3 | 16,71
server3 | 16,72
server3 | 16,81
server3 | 16,80
server3 | 16,69
server3 | 19,90
server3 | 16,95
server3 | 16,81
server3 | 16,86
server3 | 33,90
server3 | 21,06
server3 | 19,95
server3 | 16,77
server3 | 16,71
server4 | 16,77
server4 | 19,79
server4 | 16,58
server4 | 16,57
server4 | 16,64
server4 | 19,82
server4 | 16,61
server4 | 16,77
server4 | 16,65
server4 | 16,69
server4 | 16,57
server4 | 20,07
server4 | 16,63
server4 | 16,59
server4 | 16,65
server4 | 16,62
server4 | 19,90
server4 | 17,87
server4 | 16,59
server4 | 16,75
server4 | 16,63
server4 | 19,87
server4 | 16,58
server4 | 17,04
server4 | 16,62
server4 | 16,57
server4 | 16,86

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 29, 2022