Mon Dec 13 18:11:16 2010 Michael Jennings (mej)
Initial check-in.
----------------------------------------------------------------------
Mon Dec 13 19:12:54 2010 Michael Jennings (mej)
Work in progress: Initial skeleton for node health check.
----------------------------------------------------------------------
Tue Dec 14 11:15:23 2010 Michael Jennings (mej)
Completed driver script for health check. Now to write checks and
test it.
----------------------------------------------------------------------
Wed Dec 15 17:01:43 2010 Michael Jennings (mej)
Add packaging goop.
----------------------------------------------------------------------
Wed Dec 15 19:10:33 2010 Michael Jennings (mej)
Added some checks and a sample config. In the process of debugging.
----------------------------------------------------------------------
Wed Dec 15 20:51:38 2010 Michael Jennings (mej)
Debugging errors in the FS check.
----------------------------------------------------------------------
Wed Dec 15 21:19:35 2010 Michael Jennings (mej)
Thanks to Greg, I fixed the handling of subshell (pipeline)
vs. non-subshell while loops. Everything appears to be working now.
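
A minimal sketch of the difference, not taken from the NHC source
(the variable names are made up). A "while" loop on the receiving end
of a pipeline runs in a subshell, so its variable changes are lost; a
loop fed by redirection runs in the current shell and keeps them.

```shell
# Illustration only; not taken from the NHC source.
count_piped=0
printf 'a\nb\nc\n' | while read -r line; do
    count_piped=$((count_piped + 1))    # increments a copy in the subshell
done

count_direct=0
while read -r line; do
    count_direct=$((count_direct + 1))  # increments the real variable
done <<EOF
a
b
c
EOF

echo "piped=$count_piped direct=$count_direct"
```

With bash's default settings ("lastpipe" off), the piped counter stays
at 0 while the redirected one reaches 3.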
----------------------------------------------------------------------
Thu Dec 16 20:51:57 2010 Michael Jennings (mej)
Allow for more readable config files by stripping surrounding
whitespace from target and check values.
Convert comment check to native bash to avoid spawning grep and a
subshell.
Fix bug with regexp targets in config. Forgot to *actually* strip off
the slashes...
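
One way to do the whitespace stripping and the comment test without
spawning any subprocess is sketched below. The helper names
(strip_ws, is_comment, STRIPPED) are hypothetical and this is not
necessarily how NHC does it.

```shell
# Hypothetical helpers; the names are not from the NHC source.

# Trim leading/trailing whitespace from $1; result is left in $STRIPPED.
strip_ws() {
    local s="$1"
    s="${s#"${s%%[![:space:]]*}"}"   # remove leading whitespace
    s="${s%"${s##*[![:space:]]}"}"   # remove trailing whitespace
    STRIPPED="$s"
}

# Succeed if the line is a comment (first character is "#").
is_comment() {
    [[ "$1" == \#* ]]
}

strip_ws "   check_something /tmp   "
echo "[$STRIPPED]"
```

Returning the result through a variable rather than via $(...) keeps
the helper itself from forking a subshell.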
----------------------------------------------------------------------
Thu Dec 16 21:24:58 2010 Michael Jennings (mej)
Fix output redirection and debugging.
----------------------------------------------------------------------
Thu Mar 31 15:50:46 2011 Michael Jennings (mej)
Fix spec to package all installed files and directories.
Wrap I/O in "eval" to make sure $LOGFILE redirection symbols are used
properly.
Fix typo in $SILENT check.
Fix config file parsing to be compatible with bash >= 3.2.
----------------------------------------------------------------------
Thu Mar 31 17:46:42 2011 Michael Jennings (mej)
Don't log timestamp; we're trying to avoid subprocesses.
On error, syslog the reason.
----------------------------------------------------------------------
Thu Apr 21 18:46:17 2011 Michael Jennings (mej)
This is a work in progress. I'm testing a bunch of stuff, so some or
all of it may end up not working. We'll see.
- Added routine to gather /etc/passwd data into arrays
- Added userid-to-UID mapping function
- Consolidated process checks to use a single spawning of "ps"
- Added routine to gather list from TORQUE of users who currently
have jobs running on the node
- Added check for unauthorized processes running on the node
- Added timeout in background to kill nhc if it hangs to avoid
hanging pbs_mom
- Eliminated several unnecessary forks
----------------------------------------------------------------------
Fri Apr 22 14:03:32 2011 Michael Jennings (mej)
Still needs some debugging, but I've successfully eliminated all but 1
subprocess (the "ps" command). Quite good given all the script does
so far.
Also added the beginnings of a test script for making sure the
individual functions work as advertised.
----------------------------------------------------------------------
Mon Apr 25 13:00:10 2011 Michael Jennings (mej)
Final fixups for UID check. Everything appears to be working well
now.
----------------------------------------------------------------------
Wed Apr 27 12:19:11 2011 Michael Jennings (mej)
Added check to verify user processes descend from pbs_mom.
Added flexible regexp/glob match check.
Renamed utility functions to nhc_* so that only user-usable checks
start with check_*.
Added syslog function to save syslog messages until script
termination.
----------------------------------------------------------------------
Mon May 2 18:51:27 2011 Michael Jennings (mej)
Added checks for CPU socket/core/thread counts and total/free
RAM/swap/memory.
----------------------------------------------------------------------
Tue May 3 19:07:02 2011 Michael Jennings (mej)
Missed a file.
----------------------------------------------------------------------
Wed May 4 17:49:59 2011 Michael Jennings (mej)
Minor cleanups to check_ps_kswapd().
----------------------------------------------------------------------
Fri May 6 13:12:24 2011 Yong Qin (yqin)
Added check for Infiniband.
----------------------------------------------------------------------
Fri May 6 14:41:23 2011 Yong Qin (yqin)
A minor bug fix.
----------------------------------------------------------------------
Tue May 10 01:26:14 2011 Michael Jennings (mej)
Bump version.
----------------------------------------------------------------------
Tue May 10 13:01:16 2011 Yong Qin (yqin)
Added checks for Myrinet and Ethernet. Minor bug fix.
----------------------------------------------------------------------
Thu May 12 08:12:48 2011 Michael Jennings (mej)
Try alternate mechanism for IB port checks.
----------------------------------------------------------------------
Wed May 18 15:57:49 2011 Michael Jennings (mej)
Fix parsing bug. Due to bash not properly "escaping" expanded
variables inside ${VAR#...} constructs, config file lines must not
contain more than one occurrence of "||" any more.
Direct output to /dev/null, then redirect if $LOGFILE is set.
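
As a rough illustration of splitting a config line at "||" with
parameter expansion only, assuming a "target || check" layout (the
check name below is made up, and this is not the actual NHC parser):

```shell
# Illustration only; the check name is made up.
line='node* || check_example /tmp'
target="${line%%||*}"   # text before the first "||"
check="${line#*||}"     # text after the first "||"
echo "target=[$target] check=[$check]"
```

Both pieces still carry the whitespace that surrounded the "||" and
would need trimming before use.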
----------------------------------------------------------------------
Tue May 24 15:33:25 2011 Michael Jennings (mej)
Support older single-core, non-HT CPUs in /proc/cpuinfo.
----------------------------------------------------------------------
Thu Sep 1 16:16:38 2011 Michael Jennings (mej)
Fixed status reporting and added NHC label to offline message.
----------------------------------------------------------------------
Mon Sep 19 17:13:11 2011 Michael Jennings (mej)
Bump version to 1.1.
This release adds the ability to detect previously-set notes for nodes
and not overwrite them.
It will also clear notes and online nodes if all checks pass for a
node that had previously had check errors. It will only do this for
nodes whose notes begin with "NHC" to avoid bringing nodes online
which were manually offlined. Nodes marked offline which have no note
are not distinguished from down nodes and may be brought online if the
error(s) clear.
----------------------------------------------------------------------
Thu Oct 13 16:04:20 2011 Michael Jennings (mej)
Output onlining/offlining of nodes to log (with timestamp).
Log failure of health check to logfile as well as syslog.
----------------------------------------------------------------------
Wed Jan 25 15:40:58 2012 Michael Jennings (mej)
Convert to autoconf/automake for build.
----------------------------------------------------------------------
Tue Feb 7 11:11:12 2012 Michael Jennings (mej)
Various fixes and release of 1.1.4.
----------------------------------------------------------------------
Tue Mar 13 11:22:45 2012 Michael Jennings (mej)
Bump version. More consistency/cleanups.
----------------------------------------------------------------------
Fri May 4 11:31:47 2012 Michael Jennings (mej)
Remove debugging stuff for UIDs > 100. I don't really use it anyway,
and some people may want to run as other users.
Convert node online/offline scripts to use variables and $PATH to
identify where the "pbsnodes" command is and what arguments it should
take.
Add an "eval" to the execution of the check so that shell variables
can be used or altered in config files.
----------------------------------------------------------------------
Wed May 9 12:12:54 2012 Michael Jennings (mej)
Always use [[ ]] instead of [ ] (primarily for consistency).
Add customization of resource manager daemon match expression and
greater control over pbsnodes commands in online/offline helpers.
----------------------------------------------------------------------
Wed May 9 12:22:20 2012 Michael Jennings (mej)
Fix a couple conditional expressions from the last commit.
----------------------------------------------------------------------
Tue May 15 09:05:23 2012 Michael Jennings (mej)
Make sure nodes with no job files still work.
----------------------------------------------------------------------
Wed May 16 13:50:20 2012 Michael Jennings (mej)
Fix bug pointed out by Ole Holm Nielsen <ole.h.nielsen@fysik.dtu.dk>
which caused the new "eval" of config file lines to barf on the
regular expression with parentheses in the sample config. Going
forward, users will need to take care to escape shell metacharacters
appropriately in config files.
----------------------------------------------------------------------
Fri Jun 22 15:26:40 2012 Michael Jennings (mej)
Add stubs for unit test and benchmarking scripts.
Convert main NHC driver script to use functions so that it can be
loaded without needing to be executed and to facilitate testing of
some of its functionality.
----------------------------------------------------------------------
Fri Jun 29 17:52:30 2012 Michael Jennings (mej)
I smell unit tests!
----------------------------------------------------------------------
Mon Jul 2 16:33:54 2012 Michael Jennings (mej)
Move unit test framework to separate file. Now called "SHUT."
Override output functions to suppress normal NHC I/O and exception
handling.
Major refactoring of test framework to allow named tests and progress
output.
Added lots more unit tests for main nhc script.
----------------------------------------------------------------------
Mon Jul 2 17:17:38 2012 Michael Jennings (mej)
Initial test files for each check script.
----------------------------------------------------------------------
Thu Aug 16 18:12:36 2012 Michael Jennings (mej)
More work on unit tests:
- Report number of tests skipped, if any.
- Add tests for "common.nhc" module.
- Add tests for "ww_fs.nhc" module.
- Fix typos in external match checks.
----------------------------------------------------------------------
Fri Aug 24 15:29:17 2012 Michael Jennings (mej)
Finished hardware unit tests.
----------------------------------------------------------------------
Mon Aug 27 14:52:27 2012 Michael Jennings (mej)
Unit tests are finally done! Should be 100% coverage on end-user
checks too, though I don't know of any "gcov" equivalents for
bash.... ;-)
TODO: More checks!
----------------------------------------------------------------------
Mon Aug 27 16:39:47 2012 Michael Jennings (mej)
Build fixes, alternative skip syntax, and unit test changes to allow
"make test" in the spec file. Tested on RHEL4, 5, and 6 and in chroot
jails and VNFS images.
----------------------------------------------------------------------
Tue Sep 4 13:01:10 2012 Michael Jennings (mej)
Initial support for the nVidia HealthMon tool for checking the status
of nVidia CUDA GPU devices. More information can be found with the
Tesla Deployment Kit version 3 (currently in RC status).
----------------------------------------------------------------------
Tue Sep 4 17:50:00 2012 Michael Jennings (mej)
Add check for blacklisted processes.
----------------------------------------------------------------------
Wed Sep 5 16:39:59 2012 Michael Jennings (mej)
New checks for filesystem size/used/free limits based on "df" output.
Refactored check_fs_mount() to only read /proc/mounts once and
populate central array set (just like all the other modules).
Refactored unit tests accordingly.
----------------------------------------------------------------------
Thu Sep 6 09:57:41 2012 Michael Jennings (mej)
Added support for detached mode. Runs all checks in the background,
saves state to filesystem and checks it on the next run.
----------------------------------------------------------------------
Fri Sep 7 14:14:41 2012 Michael Jennings (mej)
Added unit tests for new disk space checks. Tweaked detached mode to
detach sooner. Fixed some faulty logic.
----------------------------------------------------------------------
Fri Sep 7 14:52:49 2012 Michael Jennings (mej)
A couple minor bugfixes/cleanups. This is now officially 1.2 beta.
----------------------------------------------------------------------
Wed Oct 3 16:09:05 2012 Michael Jennings (mej)
Finalized 1.2 release.
----------------------------------------------------------------------
Thu Oct 25 18:29:13 2012 Michael Jennings (mej)
Add support for NHC log rotation.
----------------------------------------------------------------------
Fri Oct 26 17:17:51 2012 Michael Jennings (mej)
Add support and unit tests for an "authorized users" whitelist.
----------------------------------------------------------------------
Mon Oct 29 17:58:30 2012 Michael Jennings (mej)
By default, don't touch nodes that are offline but have no note. Not
every site uses notes as religiously as we do, nor wants to!
----------------------------------------------------------------------
Tue Oct 30 14:17:40 2012 Michael Jennings (mej)
Fix job file location fallback handling, and look up userids for
processes where only UID is given as this may indicate a userid >8
characters rather than an unknown user.
----------------------------------------------------------------------
Tue Nov 6 13:28:13 2012 Michael Jennings (mej)
Add nhc.cron script contributed by Ole Holm Nielsen
<Ole.H.Nielsen@fysik.dtu.dk> to help minimize excessive messages from
NHC when executed via cron.
----------------------------------------------------------------------
Wed Nov 7 14:23:44 2012 Michael Jennings (mej)
Finalized 1.2.1 release.
----------------------------------------------------------------------
Wed Nov 7 15:24:26 2012 Michael Jennings (mej)
Found a bug. Re-releasing 1.2.1.
----------------------------------------------------------------------
Tue Nov 27 16:56:25 2012 Michael Jennings (mej)
Despite being specified by POSIX, apparently bash's built-in "kill"
command doesn't support signaling process groups. The watchdog timer
has been rewritten to just kill the nhc script itself. Unit tests for
the watchdog timer were also added.
----------------------------------------------------------------------
Thu Nov 29 17:57:23 2012 Michael Jennings (mej)
New check: check_hw_mcelog
This check will run "mcelog --client" by default and fail if any
output is received. If the mcelog daemon is not running, this will be
noted in the log file and syslog, but the check will pass.
----------------------------------------------------------------------
Mon Dec 17 11:03:13 2012 Michael Jennings (mej)
Reset IFS in die() handler and add quotes to traps. This should
prevent newlines being added to failure messages when certain
subcommands cause timeouts.
----------------------------------------------------------------------
Wed Jan 16 12:14:23 2013 Michael Jennings (mej)
Patch from John Hanks <john.hanks@usu.edu> for basic pdsh-style node
range support. Node ranges are now permitted and must be surrounded
by braces (e.g., "{n00[00-99].cluster}"). Multiple ranges may be
specified by separating them with commas (e.g., "{node[0-5],node8}"),
but commas may NOT be used inside the brackets ("{node[0-5,8]}").
This feature should be considered experimental at this point. Please
report any mismatches.
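
The sketch below shows how a single "prefix[lo-hi]suffix" range could
be tested against a hostname in pure bash. It is not the contributed
patch; the function name is hypothetical, and the surrounding braces
and comma-separated lists are deliberately left out.

```shell
# Hypothetical sketch; not the patch described above.
# Succeeds if $1 (hostname) falls within $2 ("prefix[lo-hi]suffix").
host_in_range() {
    local host="$1" range="$2" mid
    local range_re='^([^[]*)\[([0-9]+)-([0-9]+)\](.*)$'
    local num_re='^[0-9]+$'

    # No bracketed range: fall back to an exact string comparison.
    if ! [[ "$range" =~ $range_re ]]; then
        [[ "$host" == "$range" ]]
        return
    fi
    local prefix="${BASH_REMATCH[1]}" lo="${BASH_REMATCH[2]}"
    local hi="${BASH_REMATCH[3]}" suffix="${BASH_REMATCH[4]}"

    # The hostname must start with the prefix and end with the suffix.
    [[ "$host" == "$prefix"*"$suffix" ]] || return 1
    mid="${host#"$prefix"}"
    mid="${mid%"$suffix"}"

    # The middle must be a number of the same width, between lo and hi.
    [[ "$mid" =~ $num_re ]] || return 1
    (( ${#mid} == ${#lo} )) || return 1
    (( 10#$mid >= 10#$lo && 10#$mid <= 10#$hi ))
}
```

For example, "host_in_range n0042.cluster 'n00[00-99].cluster'"
succeeds, while the same range rejects n0142.cluster. The "10#" prefix
forces base 10 so that zero-padded numbers are not read as octal.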
----------------------------------------------------------------------
Wed Jan 16 14:24:02 2013 Michael Jennings (mej)
I've added fallback support for LDAP, NIS, etc. via "getent" based on
a suggestion and proposed patch from John Hanks <john.hanks@usu.edu>.
For users using any solution for passwd resolution other than local
/etc/passwd, there are now 2 possible alternatives.
One, you can override the use of /etc/passwd as the source of passwd
data. You can reference any file on the filesystem that's locally
accessible by setting PASSWD_DATA_SRC to the filename you want NHC to
use. This could be used to read from a cache file generated by, e.g.,
"ypcat passwd" or a similar command. This should also work with
process substitution, so you can specify something like
PASSWD_DATA_SRC='<(ypcat passwd)' instead of using a file. (Note,
though, that this will block. Insert associated caveats here.)
Two, if reading from PASSWD_DATA_SRC fails, NHC will use "getent" on
an as-needed basis to populate its internal data structures. Note
that this will mean 1 execution of getent PER missing userid or UID.
Once a particular passwd entry has been retrieved, the information
will be cached and used throughout that NHC run. Subsequent
executions of NHC will have to execute getent again for each missing
entry.
I do not have such a system, so these changes are largely untested.
Feedback is humbly requested. :-)
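
The two alternatives can be sketched roughly as follows. Only
PASSWD_DATA_SRC comes from the entry above; the array and function
names are hypothetical, and this is not the NHC implementation.

```shell
# Hypothetical sketch; array and function names are not from NHC.
PASSWD_DATA_SRC="${PASSWD_DATA_SRC:-/etc/passwd}"
UID_TO_USER=()

# Read passwd-format lines from file $1 into the UID_TO_USER array.
load_passwd() {
    local user pass uid rest num_re='^[0-9]+$'
    [[ -r "$1" ]] || return 1
    while IFS=: read -r user pass uid rest; do
        if [[ "$uid" =~ $num_re ]]; then
            UID_TO_USER[$uid]="$user"
        fi
    done < "$1"
    return 0
}

# Resolve a UID to a userid, calling getent only for uncached entries.
# The result is left in $LOOKUP_RESULT.
lookup_user() {
    local uid="$1" entry
    if [[ -z "${UID_TO_USER[$uid]}" ]]; then
        entry="$(getent passwd "$uid")" || return 1
        UID_TO_USER[$uid]="${entry%%:*}"    # cache for the rest of this run
    fi
    LOOKUP_RESULT="${UID_TO_USER[$uid]}"
}

load_passwd "$PASSWD_DATA_SRC" ||
    echo "could not read $PASSWD_DATA_SRC; relying on getent" >&2
```

Entries found in the file cost nothing extra; each entry missing from
it costs one getent call the first time it is looked up.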
----------------------------------------------------------------------
Tue Jan 22 13:32:47 2013 Michael Jennings (mej)
Bare parenthesized regular expressions are incompatible with bash 3 in
RHEL4 and RHEL5. Quoted regular expressions are incompatible with
bash 4 in RHEL6. Why? Because someone decided in bash 4 to allow
parts of regular expressions to be quoted and matched as strings
instead. Why? Beats me.
Way to go, bash developers. That's the kind of incompatibility I'd
expect from Python. :-/
Thankfully, storing the regular expressions in variables works just
fine, so we'll do that.
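
A small illustration of the variable-based workaround (the pattern
and test string are made up):

```shell
# Illustration only. Keeping the expression in a variable and
# expanding it unquoted avoids the bash 3/bash 4 parsing differences.
re='^(foo|bar)[0-9]+$'
if [[ "bar42" =~ $re ]]; then
    matched="${BASH_REMATCH[1]}"    # text captured by the first group
else
    matched=""
fi
echo "matched=$matched"
```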
----------------------------------------------------------------------
Tue Jan 22 16:56:03 2013 Michael Jennings (mej)
1.2.2 has been released.
----------------------------------------------------------------------
Tue Jan 22 16:58:44 2013 Michael Jennings (mej)
Setting up the tree for 1.2.3 development.
----------------------------------------------------------------------
Mon Feb 4 11:20:34 2013 Michael Jennings (mej)
Fix from Ole Holm Nielsen <Ole.H.Nielsen@fysik.dtu.dk> for his
nhc.cron script to place transient files in /var/lib/nhc rather than
/tmp to avoid potential symlink issues. Based on suggestions on the
hpc-monitoring Google Group from Stuart Barkley <google@4gh.net> and
Jesse Becker <hawson@gmail.com>.
----------------------------------------------------------------------
Mon Feb 11 14:22:29 2013 Michael Jennings (mej)
Added variable $DETACHED_MODE_FAIL_NODATA which can be set to 1 to
cause detached mode to return failure by default, instead of success,
when no results file is present from a previous run.
Also made sure that results files which are older than /proc,
indicating a reboot since the last run, are considered stale and
removed.
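
The staleness test can be expressed with bash's "-ot" (older than)
file operator. A minimal sketch, with a hypothetical function name
and an optional second argument added only to make it testable:

```shell
# Hypothetical helper; not the NHC implementation.
# Succeeds if file $1 is older than reference $2 (default: /proc).
results_are_stale() {
    [[ "$1" -ot "${2:-/proc}" ]]
}
```

Note that "-ot" is also true when the first file does not exist but
the second does, so a caller would check for the results file first.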
----------------------------------------------------------------------
Mon Mar 11 14:47:19 2013 Michael Jennings (mej)
Add DMI data gatherer and corresponding checks. Definitely not
speedy, but there's an awful lot of very valuable information that can
be gleaned from it. And, as always, once you take the initial hit,
you can write as many DMI checks as you like with minimal added
overhead.
Still need to write the unit tests for the new checks.
----------------------------------------------------------------------
Mon Mar 11 17:43:40 2013 Michael Jennings (mej)
Added unit tests and auto-fu for DMI stuff.
----------------------------------------------------------------------
Tue Mar 12 16:39:17 2013 Michael Jennings (mej)
Merged and tweaked patch from Aleksey Senin <aleksey@senin.name> to
support Infiniband device names in check_hw_ib.
----------------------------------------------------------------------
Tue Mar 12 16:39:19 2013 Michael Jennings (mej)
Add .gitignore file.
----------------------------------------------------------------------
Wed Mar 13 15:29:34 2013 Michael Jennings (mej)
Auto-detect resource manager on startup for later use.
----------------------------------------------------------------------
Wed Mar 13 15:29:37 2013 Michael Jennings (mej)
Move nhc_fs_[un]parse_size() to common for use in other checks.
----------------------------------------------------------------------
Wed Mar 13 15:29:39 2013 Michael Jennings (mej)
Convert existing checks to support multiple resource managers via
$NHC_RM.
----------------------------------------------------------------------
Thu Mar 14 21:08:53 2013 Michael Jennings (mej)
Merged in contributions from Dustin Rice <dustin@alaska.edu> to add
SLURM support to the node online/offline scripts. Also added SGE
support (sort of, since it's not actually necessary).
----------------------------------------------------------------------
Thu Mar 14 21:14:11 2013 Michael Jennings (mej)
Fix "make test" by assuming PBS for testing purposes.
----------------------------------------------------------------------
Fri Mar 15 16:38:28 2013 Michael Jennings (mej)
Bump to version 1.3.
----------------------------------------------------------------------
Mon Mar 18 15:53:15 2013 Michael Jennings (mej)
Simulate svnversion with git to fix RPM release numbers.
----------------------------------------------------------------------
Wed Mar 20 11:58:08 2013 Michael Jennings (mej)
Fix issues with SLURM support found during testing. Since $(...) does
not treat quotes as metacharacters, we'll need to hard-code the sinfo
arguments for obtaining the node status listing. If someone has a
better way, I'm all ears (eyes?)!
As far as I can tell, SLURM support is now fully functional.
----------------------------------------------------------------------
Wed Mar 20 12:03:37 2013 Michael Jennings (mej)
Oops; missed a few spots.
----------------------------------------------------------------------
Wed Mar 20 13:37:30 2013 Michael Jennings (mej)
Preliminary work to integrate Grid Engine support into NHC. This
currently still requires an external input loop, but I am planning to
do that in nhc as well.
----------------------------------------------------------------------
Wed Mar 20 14:41:55 2013 Michael Jennings (mej)
Finish merging the Grid Engine wrapper into NHC. The nhc script is
now directly callable as an SGE/UGE/*GE "load sensor."
----------------------------------------------------------------------
Thu Mar 21 11:48:40 2013 Michael Jennings (mej)
Use "DRAIN" state instead of "DOWN" as the latter will terminate
running jobs on the node. We don't want that. :-)
----------------------------------------------------------------------
Thu Mar 21 11:48:42 2013 Michael Jennings (mej)
More tweaks to handling of SLURM node states.
----------------------------------------------------------------------
Thu Mar 21 14:42:56 2013 Michael Jennings (mej)
Another node state I didn't know about.
----------------------------------------------------------------------
Fri Mar 22 12:35:26 2013 Michael Jennings (mej)
Add online/offline support for IBM Platform LSF.
NOTE: This is based entirely on documentation and has not been tested
at all. If you are an LSF user and are willing to help test, please
contact me! I haven't found a way to have LSF run NHC on the nodes
yet, so for now it would need to run out of cron or similar.
----------------------------------------------------------------------
Mon Apr 01 16:27:08 2013 Michael Jennings (mej)
Add support for both command line options and arbitrary environment
variable setting on the command line. A few limited options are
available; see "nhc -h" for details. For example, you can turn on
debugging and set an alternate config file location using:
# nhc -d -c /etc/nhc/alternate.conf
All other configuration settings can be manipulated on the command
line using an env-like VAR=value syntax. So for example, if you
wanted to disable marking online/offline of nodes and set the maximum
system UID to 499, you can now do:
# nhc MARK_OFFLINE=0 MAX_SYS_UID=499
Note that these parameters WILL be overridden by the config file if
they're set there!
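
For illustration, env-style VAR=value arguments could be applied
along these lines. The function name is hypothetical and this is not
the actual NHC option parser.

```shell
# Hypothetical sketch; not the NHC option parser.
# Assign every argument of the form NAME=value to shell variable NAME.
apply_settings() {
    local arg re='^([A-Za-z_][A-Za-z0-9_]*)=(.*)$'
    for arg in "$@"; do
        if [[ "$arg" =~ $re ]]; then
            # printf -v assigns without eval, so the value is never executed.
            printf -v "${BASH_REMATCH[1]}" '%s' "${BASH_REMATCH[2]}"
        fi
    done
}

apply_settings MARK_OFFLINE=0 MAX_SYS_UID=499
echo "MARK_OFFLINE=$MARK_OFFLINE MAX_SYS_UID=$MAX_SYS_UID"
```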
----------------------------------------------------------------------
Mon Apr 01 18:09:03 2013 Michael Jennings (mej)
Added 3 new checks: check_fs_inodes(), check_fs_ifree(), and
check_fs_iused(). They perform the same tasks as their
check_fs_{size,used,free}() counterparts except using inode count
instead of byte count.
----------------------------------------------------------------------
Mon Apr 01 18:09:06 2013 Michael Jennings (mej)
Add support for byte suffixes to all check_hw_{physmem,swap,mem}{,_free}() tests.
----------------------------------------------------------------------
Tue Apr 02 17:09:16 2013 Michael Jennings (mej)
Added new check: check_file_contents() will scan through a file
looking for matches to one or more patterns (regular expressions or
globs). The check will succeed iff all patterns are successfully
matched against individual lines in the file.
Some real-world usage examples:
check_file_contents /etc/passwd '/^root:x:0:0:[^:]*:/root:/bin/[a-z]*sh$/'
check_file_contents /etc/passwd 'adminusr:*' 'slurm:*' 'sshd:*'
check_file_contents /var/spool/torque/mom_priv/config '$pbsserver master'
check_file_contents /etc/hosts '10.0.0.10*master'
check_file_contents /proc/cgroups 'cpuset*1'
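
The "all patterns must match some line" logic can be sketched as
below. This is a simplified stand-in with a hypothetical name, not
the real check_file_contents(); it handles glob patterns only.

```shell
# Hypothetical sketch handling glob patterns only; regular
# expressions are left out for brevity.
file_matches_all() {
    local file="$1" pattern line found
    shift
    [[ -r "$file" ]] || return 1
    for pattern in "$@"; do
        found=0
        while IFS= read -r line; do
            if [[ "$line" == $pattern ]]; then   # unquoted: glob match
                found=1
                break
            fi
        done < "$file"
        (( found )) || return 1    # this pattern matched no line
    done
    return 0
}
```

Re-reading the file once per pattern keeps the sketch short; reading
it once into an array would be the obvious refinement.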
----------------------------------------------------------------------
Wed Apr 03 15:10:12 2013 Michael Jennings (mej)
More minor issues found during testing.
----------------------------------------------------------------------
Wed Apr 03 15:10:14 2013 Michael Jennings (mej)
I'll take "Things Missed by Unit Tests for $100, Alex."
----------------------------------------------------------------------
Wed Apr 03 15:10:17 2013 Michael Jennings (mej)
Note to self: Write more unit tests based on potential stupid
mistakes an admin could make rather than just real-world usage.
----------------------------------------------------------------------
Fri Apr 05 18:26:52 2013 Michael Jennings (mej)
Release version 1.3.
----------------------------------------------------------------------
Wed Jun 19 11:57:55 2013 Michael Jennings (mej)
SLURM doesn't prohibit nodes with subdomains, so don't forcibly
eliminate them.
----------------------------------------------------------------------
Fri Sep 27 14:10:07 2013 Michael Jennings (mej)
Use "bsdtime" option instead of "time" to make sure we get a time
value we can easily parse (MMM:SS instead of DD-HH:MM:SS). Thanks to
John Hanks <john.hanks@usu.edu> for pointing out this issue!
----------------------------------------------------------------------
Mon Oct 07 16:14:08 2013 Michael Jennings (mej)
If logging to the logfile fails for some reason, syslog an error
message and redirect to /dev/null. Thanks to Ole Holm Nielsen
<Ole.H.Nielsen@fysik.dtu.dk> for catching this issue!
----------------------------------------------------------------------
Wed Nov 06 13:55:39 2013 Michael Jennings (mej)
Increment test count on failed test too.
----------------------------------------------------------------------
Wed Nov 06 13:55:41 2013 Michael Jennings (mej)
Add new check: check_ps_service [options] <service>
This check takes a service-oriented posture. It's similar to
check_ps_daemon but allows for glob- and regexp-based matching along
with optionally restarting the service if it's not running. The user
can also specify arbitrary commands to be run if the appropriate
service daemon isn't (or is) found.
----------------------------------------------------------------------
Tue Jan 14 07:41:12 2014 Michael Jennings (mej)
Add new "test-debug" target for generating verbose debugging output
when running the unit test suite.
Work around bash regexp implementations which do not support the \b
"word boundary" binding operator.
----------------------------------------------------------------------
Tue Jan 14 08:15:39 2014 Michael Jennings (mej)
Add support for timestamping in log/debug output based on bash
$SECONDS variable. This adds a single fork() in order to get the
current UNIX time_t value via date(1). Off by default unless
debugging.
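
The idea can be sketched as follows (variable and function names are
made up, not taken from NHC): call date(1) once, then derive every
later timestamp from $SECONDS.

```shell
# Illustration only; variable names are made up.
START_EPOCH="$(date +%s)"   # the one fork: current UNIX time
SECONDS=0                   # bash increments this once per second

# Leave the current UNIX time in $NOW without spawning a process.
now() {
    NOW=$(( START_EPOCH + SECONDS ))
}

now
echo "[$NOW] example log line"
```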
----------------------------------------------------------------------
Sun Feb 09 02:24:11 2014 Michael Jennings (mej)
Properly handle missing script files.
----------------------------------------------------------------------
Sun Feb 09 02:24:18 2014 Michael Jennings (mej)
Preliminary setup for allowing checks to be done in a non-fatal manner
for general monitoring purposes.
----------------------------------------------------------------------
Sun Feb 09 02:24:26 2014 Michael Jennings (mej)
Add option (-a) and config variable (NHC_CHECK_ALL) to make individual
checks non-fatal. This will cause NHC to continue running all checks,
even if one or more of them fail, until it finishes. It then reports
how many checks failed and returns that number as its exit status.
This is intended for more general monitoring use (e.g., from cron).
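The behavior described above could be sketched roughly as follows. This
is an illustration only: run_checks and CHECK_ALL are hypothetical names
(the real variable is NHC_CHECK_ALL), and capping the count at 255 is
this sketch's choice, made because exit statuses are 8 bits wide.

```shell
#!/bin/bash
# Hypothetical sketch; run_checks and CHECK_ALL are illustrative names,
# not NHC's implementation.

# Run each argument as a check command. With CHECK_ALL=1, keep going
# after a failure and return the number of failed checks; otherwise
# stop at the first failure and return 1.
run_checks() {
    local failed=0 check
    for check in "$@"; do
        if ! eval "$check"; then
            echo "FAILED: $check" >&2
            if [[ "${CHECK_ALL:-0}" != "1" ]]; then
                return 1
            fi
            failed=$(( failed + 1 ))
        fi
    done
    # Exit statuses wrap at 256, so cap the count (sketch's choice).
    if (( failed > 255 )); then
        failed=255
    fi
    return "$failed"
}
```

With CHECK_ALL=1, calling "run_checks true false true false" returns 2;
with CHECK_ALL unset, it stops at the first "false" and returns 1.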
----------------------------------------------------------------------
Sun Feb 09 02:24:34 2014 Michael Jennings (mej)
Don't assume device files will always be there.
----------------------------------------------------------------------
Mon Feb 10 16:57:54 2014 Michael Jennings (mej)
Added two new options for check_ps_service (-s and -k) to stop/kill
services which *are* running. Similar to check_ps_blacklist but
allows for blacklisted services to be actively terminated by NHC.
----------------------------------------------------------------------
Mon Feb 10 16:57:57 2014 Michael Jennings (mej)
Support negated user (owner) matches in check_ps_service just like in
check_ps_blacklist.
----------------------------------------------------------------------
Mon Feb 10 16:58:01 2014 Michael Jennings (mej)
This was going to be 1.3.1, but the changes are extensive enough that
it will need to be a 1.4 release. Bump version accordingly.
----------------------------------------------------------------------
Mon Feb 10 17:01:53 2014 Michael Jennings (mej)
Add another svnversion fallback to spec file and uncomment.
----------------------------------------------------------------------
Wed Feb 12 12:29:58 2014 Michael Jennings (mej)
Remove the reference to $BASH_SUBSHELL. I now realize what it does,
and it's not at all what I was intending.
----------------------------------------------------------------------
Wed Feb 12 13:29:09 2014 Michael Jennings (mej)
Tweak how $OFFLINE_NODE and $ONLINE_NODE are invoked so that they can
be customized with additional capabilities instead of just specifying
a single command. This is a potential alternative to customizing the
scripts themselves.
----------------------------------------------------------------------
Wed Feb 12 13:29:11 2014 Michael Jennings (mej)
Mark pretty much everything, including nhc itself, as a config file so
that customizations don't get overwritten. Conceivably any or all of
these elements could potentially acquire site-local modifications.
The caveat, of course, being that updates would then have to be done
by hand....
----------------------------------------------------------------------
Wed Feb 12 15:37:45 2014 Michael Jennings (mej)
Better way of securing permissions.
----------------------------------------------------------------------
Wed Feb 19 15:26:18 2014 Michael Jennings (mej)
Typo.
----------------------------------------------------------------------
Wed Feb 19 17:27:17 2014 Michael Jennings (mej)
This should fix the issue spotted by Anthony DelSorbo
<adelsorb@csc.com> and Ken Nielson <knielson@adaptivecomputing.com>.
When NHC exits, it tries to terminate the watchdog timer process (a
bash process with a sleep process as a child). Because of a "bug"
(lack of feature, but that feature is specified by POSIX!) in bash,
its internal implementation of the "kill" builtin is incapable of
sending a signal to an entire process group (see also: NHC's SVN
r1201 commit). So when we sent the signal to the watchdog (i.e.,
bash) process, it died, but its child (the sleep process) didn't!
This commit rewrites the watchdog timer (again) to try to make sure
that the sleep goes away when the bash goes away.
NOTE: This was only an issue if the output of NHC was being piped
(i.e., read()) somewhere. Unfortunately, that includes pbs_mom....
;-) You can compare/verify this by running "nhc" by itself on the
command line vs. running "nhc 2>&1 | less"
----------------------------------------------------------------------
Wed Feb 19 17:31:24 2014 Michael Jennings (mej)
Add EXIT for paranoia.
----------------------------------------------------------------------
Wed Mar 05 16:15:56 2014 Michael Jennings (mej)
Add support for negating file content matches in
check_file_contents(). Prefixing the match expression with an
exclamation mark (!) will cause the check to fail if any line in the
file matches the expression. Any combination of positive and negative
match expressions may be used in the same check.
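The positive/negative matching logic above can be sketched as below.
file_matches is a hypothetical helper, not check_file_contents() itself,
and treating each expression as a bash extended regular expression is an
assumption made for the sketch.

```shell
#!/bin/bash
# Hypothetical sketch; file_matches is an illustrative name, and the
# expressions are assumed to be bash extended regular expressions.

# Usage: file_matches <file> <expr>...
# A plain expression must match at least one line of the file.
# An expression prefixed with "!" must not match any line.
file_matches() {
    local file="$1" expr line found
    shift
    [[ -r "$file" ]] || return 1
    for expr in "$@"; do
        found=0
        while IFS= read -r line || [[ -n "$line" ]]; do
            # Strip a leading "!" before using the expression as a regex.
            if [[ "$line" =~ ${expr#\!} ]]; then
                found=1
                break
            fi
        done < "$file"
        if [[ "$expr" == '!'* ]]; then
            (( found == 0 )) || return 1   # negated expression matched
        else
            (( found == 1 )) || return 1   # required expression missing
        fi
    done
    return 0
}
```

For example, "file_matches /etc/passwd '^root:' '!^baduser:'" succeeds
only if a root entry exists and no line starts with "baduser:".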
----------------------------------------------------------------------
Fri Mar 07 12:09:06 2014 Michael Jennings (mej)
Patch from Eliot Eshelman <eliot.eshelman@6by9.net> to ensure that NHC
reads the correct exit code from nvidia-healthmon.
----------------------------------------------------------------------
Fri Mar 07 22:31:57 2014 Michael Jennings (mej)
Further input from Eliot Eshelman <eliot.eshelman@6by9.net> led me to
reorder and rework the subcommand execution for the nVidia healthmon
check to ensure that command-line options were handled in the most
portable way possible.
----------------------------------------------------------------------
Fri Mar 07 22:56:51 2014 Michael Jennings (mej)
What on earth was that??
----------------------------------------------------------------------
Fri Mar 14 10:10:54 2014 Michael Jennings (mej)
Clarification of comment verbiage.
----------------------------------------------------------------------
Fri Mar 14 10:10:59 2014 Michael Jennings (mej)
Added 4 new checks, all with similar syntax, for looking at process
resource consumption. Each check looks for processes using more than
a specified amount of the resource and can take various actions when
they are found. The checks and their respective resources are:
  check_ps_cpu      - Percentage of CPU utilization
  check_ps_mem      - Amount of total system memory (absolute size)
  check_ps_physmem  - Amount of physical RAM (absolute or percentage)
  check_ps_time     - Total CPU time
Syntax is, e.g.: check_ps_cpu [flags] <threshold>
Flags accepted:
  -0          Non-fatal; report on matches, but don't terminate
  -a          Find and alert on all matches; don't die after the 1st
  -e action   Execute a command if a match is found
  -f          Full match; match against the entire command line
  -k          Kill matching processes if found
  -l          Log processes found to the NHC log
  -m match    Specifies a command (or command line) to match
  -r value    Renice matching processes by the specified factor
  -s          Log processes found to the syslog
  -u [!]user  Match only processes owned (or not owned) by user
Thresholds are specified as percentages (percent sign is optional for
check_ps_cpu), sizes (in kB or with appropriate suffix), or time (in
seconds or XXXmYYs).
Examples:
check_ps_cpu -r 19 -u '!root' 99
check_ps_mem -k -u mej -m '/leakyprog/' 24g
check_ps_physmem -l -s 90%
check_ps_time -l 720m
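As an illustration of the time threshold format, the sketch below
converts a threshold to seconds. to_seconds is a hypothetical helper,
not NHC code, and reading "XXXmYYs" as minutes plus seconds is an
assumption.

```shell
#!/bin/bash
# Hypothetical sketch; to_seconds is an illustrative helper, not NHC code.

# Convert a time threshold to seconds. Accepts plain seconds ("90"),
# minutes ("720m"), or minutes plus seconds ("12m30s").
to_seconds() {
    local t="$1"
    local re_min='^([0-9]+)m(([0-9]+)s?)?$'
    local re_sec='^([0-9]+)s?$'
    if [[ "$t" =~ $re_min ]]; then
        # 10# forces base 10 so values like "08" are not read as octal.
        echo $(( 10#${BASH_REMATCH[1]} * 60 + 10#${BASH_REMATCH[3]:-0} ))
    elif [[ "$t" =~ $re_sec ]]; then
        echo $(( 10#${BASH_REMATCH[1]} ))
    else
        return 1
    fi
}

to_seconds 720m     # prints 43200
to_seconds 12m30s   # prints 750
```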
----------------------------------------------------------------------
Fri Mar 14 16:28:50 2014 Michael Jennings (mej)
Slight efficiency improvement by reading directly from the file
instead of creating a subprocess.
----------------------------------------------------------------------
Fri Mar 14 16:28:52 2014 Michael Jennings (mej)
Fix minor cosmetic bug when timestamps are turned on -- the completion
line in the log file gave the timestamp instead of the elapsed time.
----------------------------------------------------------------------
Fri Mar 14 17:18:12 2014 Michael Jennings (mej)
Add check_loadavg() for looking at the 1-, 5-, and 15-minute load
averages on a system. Any or all may be capped. Syntax is:
check_loadavg <limit_1m> <limit_5m> <limit_15m>
Blank limits are ignored.
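A rough sketch of how such a check might work is shown below.
loadavg_ok is a hypothetical name, the optional 4th argument (an
alternate file to read) exists only to make the sketch testable, and
using awk for the floating-point comparison is a simplification; the
real check may avoid the subprocess.

```shell
#!/bin/bash
# Hypothetical sketch; loadavg_ok is an illustrative name, and the 4th
# argument (file path) is an assumption made for testability.

# Usage: loadavg_ok <limit_1m> <limit_5m> <limit_15m> [file]
# Blank limits are skipped. Returns 1 if any load average exceeds
# its limit, 0 otherwise.
loadavg_ok() {
    local file="${4:-/proc/loadavg}"
    local -a limits=("$1" "$2" "$3")
    local -a loads
    local i
    # The first three fields are the 1-, 5-, and 15-minute averages.
    read -r -a loads < "$file" || return 1
    for i in 0 1 2; do
        [[ -z "${limits[i]}" ]] && continue
        # bash has no floating point, so let awk do the comparison.
        if awk -v l="${loads[i]}" -v m="${limits[i]}" \
               'BEGIN { exit !(l > m) }'; then
            echo "load average ${loads[i]} exceeds limit ${limits[i]}" >&2
            return 1
        fi
    done
    return 0
}
```

For instance, "loadavg_ok 40 '' 20" would cap only the 1-minute and
15-minute averages and ignore the 5-minute value.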
This check was originally written in front of a live studio audience
at MoabCon 2013! (See the video at: http://go.lbl.gov/nhc-2013-mc)
----------------------------------------------------------------------
Mon Mar 17 15:27:10 2014 Michael Jennings (mej)
Added 2 new checks for looking at the results of bash's built-in
"test" command as well as file stat() values.
check_file_test() provides an interface to certain options of the
"test" command which examine file attributes without needing to shell
out to run the /bin/stat command. For example, you can check to see
if a file is readable, or writable, or if it even exists at all.
check_file_stat() goes further by actually running the /bin/stat
command and allowing you to test its results against expected values.
You can verify the owner or group of a file, or check to see if the
last-modified-time for a file is newer or older than you think it
should be.
These checks both support a *ton* of options, so documenting them all
here would make for a humongous changelog entry, but here are a few
examples:
To make sure /tmp is writable:
  check_file_test -w /tmp
To make sure the passwd file isn't empty or missing:
  check_file_test -s /etc/passwd
To make sure /dev/null is a character special device:
  check_file_test -c /dev/null
To do a full integrity check on /dev/null:
  check_file_stat -m 0666 -u 0 -g 0 -t 1 -T 3 /dev/null
To make sure /var/log/messages has recent activity:
  check_file_stat -n 7200 /var/log/messages
To verify access to a user's ~/.ssh/ tree:
  check_file_stat -m 0700 -U someuser /home/someuser
Full documentation for these checks will be on the web once I have a
chance to write it all up!
----------------------------------------------------------------------
Tue Mar 18 16:18:20 2014 Michael Jennings (mej)
Fix path to stat command.
----------------------------------------------------------------------
Tue Mar 18 16:18:23 2014 Michael Jennings (mej)
Add optional "fudge" factor to mem/swap size checks. This allows the
actual size to be within some percentage or specific number of kB of
the specified minimum/maximum and still pass the check. If not
specified, obviously, no fudge factor is used.
Example: To verify RAM size is 32GB +/- 10%:
check_hw_physmem 32g 32g 10%
-OR-
check_hw_physmem 32g 32g 3200m
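The fudge-factor comparison might look something like the sketch below.
within_fudge is a hypothetical helper, not NHC code; it assumes all
sizes have already been converted to kB and that a percentage fudge is
taken relative to each bound.

```shell
#!/bin/bash
# Hypothetical sketch; within_fudge is an illustrative name. Sizes are
# assumed to be in kB; the fudge is "N%" of each bound or absolute kB.

# Usage: within_fudge <actual_kb> <min_kb> <max_kb> [fudge]
# Succeeds if actual lies within [min - slack, max + slack].
within_fudge() {
    local actual=$1 min=$2 max=$3 fudge=${4:-0}
    local pct lo_slack hi_slack
    if [[ "$fudge" == *% ]]; then
        pct=${fudge%\%}
        lo_slack=$(( min * pct / 100 ))
        hi_slack=$(( max * pct / 100 ))
    else
        lo_slack=$fudge
        hi_slack=$fudge
    fi
    (( actual >= min - lo_slack && actual <= max + hi_slack ))
}
```

With a nominal 32 GB (33554432 kB) and a reported size of 32877000 kB,
"within_fudge 32877000 33554432 33554432 10%" succeeds, while the same
call without a fudge factor fails.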
----------------------------------------------------------------------
Wed Mar 19 09:20:07 2014 Michael Jennings (mej)
Add unit tests for fudge factor code.
----------------------------------------------------------------------
Wed Mar 19 13:54:01 2014 Michael Jennings (mej)
Create /var/run/nhc and put run-time files (like results) in there
instead of directly in /var/run.
Make $RESULTFILE depend on $NAME. Each named instance should have
independent results.
Don't have $CONFDIR or $HELPERDIR depend on $NAME; that requires a
duplicate of /etc/nhc and /usr/libexec/nhc per named instance. I
think the overwhelming majority of users will want checks and helper
scripts to be universal and only have the configuration file(s) differ
(at most). If anyone wants it the old way, it can still be overridden
via sysconfig or command line.
----------------------------------------------------------------------
Wed Mar 19 15:46:12 2014 Michael Jennings (mej)
Add new script file ww_cmd.nhc for checks based on arbitrary subcommands.
----------------------------------------------------------------------
Thu Mar 20 17:14:02 2014 Michael Jennings (mej)
Initial implementations of command-based checks. These may get further
refinement.
----------------------------------------------------------------------
Fri Mar 21 12:30:45 2014 Michael Jennings (mej)
Patch from Eliot Eshelman <eliot.eshelman@6by9.net> (slightly
modified) for SLURM support in check_ps_unauth_users().
----------------------------------------------------------------------
Fri Mar 21 13:42:39 2014 Michael Jennings (mej)
Command output matching is (preliminarily) working now.
Fixed some missing check name labels in error messages.
----------------------------------------------------------------------
Fri Mar 21 14:46:45 2014 Michael Jennings (mej)
Redo the command output check a better way.
----------------------------------------------------------------------
Mon Mar 24 12:22:14 2014 Michael Jennings (mej)
Additional unit tests for command checks.
----------------------------------------------------------------------
Mon Mar 24 12:22:17 2014 Michael Jennings (mej)
Add stubs for Moab/TORQUE checks.
----------------------------------------------------------------------
Wed Mar 26 15:11:07 2014 Michael Jennings (mej)
Added a flag to check_ps_service to just start (instead of restart)
the service. Useful for cases like sshd, where issuing a "restart"
when the daemon isn't running will kill user login sessions!
----------------------------------------------------------------------
Wed Mar 26 15:11:10 2014 Michael Jennings (mej)
Fix typos in check_ps_service unit tests.
----------------------------------------------------------------------
Wed Mar 26 15:11:12 2014 Michael Jennings (mej)
Add TORQUE/Moab-specific checks. These are still preliminary, and
unit tests for them are still pending, but they're already at least
somewhat useful.
check_moab_sched -t <timeout> -a <alert> -v <version> -m <match>
Checks the output of "mdiag -S -v" against the specified version,
alert, and/or arbitrary match expression(s). If a matching alert is
found, if the versions don't match, or if any of the match expressions
(possibly negated) trigger, the check fails. All parameters are
optional. Multiple occurrences of -m are supported.
check_moab_rm -t <timeout> -m <match>
Checks the output of "mdiag -R -v" against any specified match
expression(s). It also looks for any RMs that are not in the "Active"
state. If any RM is inactive, or if any of the match expressions
(possibly negated) trigger, the check fails. All parameters are
optional. Multiple occurrences of -m are supported.
check_moab_torque -t <timeout> -m <match>
Checks the output of "mdiag -R -v" against any specified match
expression(s). It also looks for the "scheduling" parameter to be
turned on. If "scheduling" is false, or if any of the match
expressions (possibly negated) trigger, the check fails. All
parameters are optional. Multiple occurrences of -m are supported.
----------------------------------------------------------------------
Thu Mar 27 16:10:28 2014 Michael Jennings (mej)
Skip a few tests on RHEL4 due to apparent bash bug.
----------------------------------------------------------------------
Thu Mar 27 17:01:46 2014 Michael Jennings (mej)
Explicitly test $DEBUG for ==1, not !=0
----------------------------------------------------------------------
Sun Mar 30 16:19:36 2014 Michael Jennings (mej)
Allow nhc.cron to pass options to nhc.
Fix missing flag in comment.
----------------------------------------------------------------------
Sun Mar 30 18:03:49 2014 Michael Jennings (mej)
Rewrite check_file_test to match check_file_stat calling conventions.
----------------------------------------------------------------------
Wed Apr 09 09:39:03 2014 Michael Jennings (mej)
Rename check_loadavg() to check_ps_loadavg().
----------------------------------------------------------------------
Fri Apr 18 14:59:17 2014 macabral