Anytime after the installation of the CSM services, the health of the management nodes and all CSM services can be validated.
The following are examples of when to run health checks:
- After completing the Install CSM Services step of the CSM install (not before)
- Before and after NCN reboots
- After the system is brought back up
- Any time there is unexpected behavior observed
- To provide relevant information when creating support tickets
The areas should be tested in the order they are listed on this page. Errors in an earlier check may cause errors in later checks because of dependencies.
Each section of this health check document provides links to relevant troubleshooting procedures. If additional help is needed, see CSM Troubleshooting Information.
- 0. Cray command line interface
- 1. Platform health checks
- 2. Hardware Management Services health checks
- 3. Software Management Services health checks
- 4. Gateway health and SSH access checks
- 5. Booting CSM barebones image
- 6. UAS/UAI tests
Some of the health check tests will fail if the Cray Command Line Interface (CLI) is not configured on the management NCNs. Tests with this dependency are noted in their descriptions below. These tests may be skipped but this is not recommended.
If running these checks during an initial CSM install, then to find details on configuring the Cray CLI, see Configure the Cray command line interface from the install documentation.
If running these checks after the initial CSM install, then to find details on configuring the Cray CLI, see Configure the Cray CLI from the operational documentation.
All platform health checks are expected to pass. Each check has been implemented as a Goss test that reports PASS or FAIL.
Available platform health checks:
- NCN health checks
- OPTIONAL Check of ncnHealthChecks resources
- Check of system management monitoring tools
These checks require that the Cray CLI is configured on all worker NCNs.
If ncn-m001 is the PIT node, then run these checks on ncn-m001; otherwise, run them from any master NCN.
- (ncn-m# or pit#) Run the automated tests.
  - Specify the admin user password for the management switches in the system. This is required for some of the tests to execute. read -s is used to prevent the password from being written to the screen or the shell history.

        read -r -s -p "Switch admin password: " SW_ADMIN_PASSWORD
        export SW_ADMIN_PASSWORD

  - Run the NCN and Kubernetes health checks.

        /opt/cray/tests/install/ncn/automated/ncn-k8s-combined-healthcheck

- Review results.
  Review the output and follow the instructions provided to resolve any test failures. With the exception of Known issues with NCN health checks, all health checks are expected to pass.
To dump the NCN uptimes, the node resource consumptions, and/or the list of pods not in a running state, run the following:
/opt/cray/platform-utils/ncnHealthChecks.sh -s ncn_uptimes
/opt/cray/platform-utils/ncnHealthChecks.sh -s node_resource_consumption
/opt/cray/platform-utils/ncnHealthChecks.sh -s pods_not_running
See Known issues with NCN resource checks.
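As a convenience, all three summaries above can be collected in a single pass. This is only a sketch; it assumes a Bash shell and the one-check-per-invocation -s option behavior shown above.

```bash
# Optional convenience sketch: collect all three ncnHealthChecks summaries in one pass.
# Assumes a Bash shell and one -s option per invocation, as shown above.
for check in ncn_uptimes node_resource_consumption pods_not_running; do
    echo "=== ${check} ==="
    /opt/cray/platform-utils/ncnHealthChecks.sh -s "${check}"
done
```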
If all designated prerequisites are met, the availability of system management health services may optionally be validated by accessing the URLs listed in Access System Management Health Services. It is very important to check the Prerequisites section of this document.
If one or more of the URLs listed in the procedure are inaccessible, it does not necessarily mean that the system is not healthy. It may simply mean that not all of the prerequisites have been met to allow access to the system management health tools via URL.
Information to assist with troubleshooting some of the components mentioned in the prerequisites can be accessed here:
- Troubleshoot CMN Issues
- Troubleshoot DNS Configuration Issues
- Check BGP Status and Reset Sessions
- Troubleshoot BGP not Accepting Routes from MetalLB
- Troubleshoot Services without an Allocated IP Address
- Troubleshoot Prometheus Alerts
The checks in this section do not require that the Cray CLI is configured, but in the case of failures, some of the tests will provide troubleshooting suggestions that involve using the CLI.
Execute the HMS tests to confirm that the Hardware Management Services are running and operational.
Note: Do not run multiple instances of the HMS tests concurrently as they may interfere with one another and cause false failures.
These tests may be executed on any one worker or master NCN (but not ncn-m001 if it is still the PIT node).
(ncn-mw#) Run the HMS CT tests.
/opt/cray/csm/scripts/hms_verification/run_hms_ct_tests.sh
The return code of the script is zero if all HMS CT tests run and pass, non-zero if not. On CT test errors or failures, the script will print the path to the CT test log file for the administrator to inspect. If one or more failures occur, investigate the cause of each and take remediation steps if needed. See the Interpreting HMS Health Check Results documentation for more information.
After remediating a test failure for a particular service, the tests for just that individual service can be re-run by supplying the name of the service to the run_hms_ct_tests.sh script with the -t option:
/opt/cray/csm/scripts/hms_verification/run_hms_ct_tests.sh -t <service>
To list the HMS services that can be tested, use the -l option:
/opt/cray/csm/scripts/hms_verification/run_hms_ct_tests.sh -l
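For example, after remediating a failure in a single service, a targeted re-run might look like the following. The service name used here is only illustrative; substitute a name reported by the -l option above.

```bash
# Illustrative only: re-run the CT tests for one service after remediating its failure.
# "smd" is a hypothetical example service name; use a name from the -l output.
/opt/cray/csm/scripts/hms_verification/run_hms_ct_tests.sh -t smd
```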
By the time the CSM health validation is first performed on a system, the Hardware State Manager (HSM) should have completed its discovery of the system. This section provides two steps to verify that discovery completed successfully.
- Verify that all hardware that HSM attempted to discover was successfully discovered.
  (ncn-mw#) To verify that discovery completed successfully and that Redfish endpoints for the system hardware have been populated in HSM, run the following script:
  /opt/cray/csm/scripts/hms_verification/hsm_discovery_status_test.sh
  The script will return an exit code of zero if there are no failures. Otherwise, the script will return a non-zero exit code, along with output indicating which components failed discovery and troubleshooting steps for determining why discovery failed.
- Verify that all hardware that is expected to be in the system is present in HSM.
  To verify this, a comparison is made between HSM and the System Layout Service (SLS), which provides the foundational information for the hardware that makes up the system.
  (ncn-mw#) To perform this comparison, run the following script:
  /opt/cray/csm/scripts/hms_verification/verify_hsm_discovery.py
  The script will exit with an exit code of 0 if there are no failures. If any FAIL information is displayed, the script will exit with a non-zero exit code.
Example of successful output:
HSM Cabinet Summary
===================
x1000 (Mountain)
  Discovered Nodes:         16
  Discovered Node BMCs:      5
  Discovered Router BMCs:   16
  Discovered Chassis BMCs:   8
  Compute Module slots
    Populated:  5
    Empty:     59
  Router Module slots
    Populated: 16
    Empty:     48
x3000 (River)
  Discovered Nodes:         12 (10 Mgmt, 2 Application, 0 Compute)
  Discovered Node BMCs:     11
  Discovered Router BMCs:    2
  Discovered Chassis BMCs:   0
  Discovered Cab PDU Ctlrs:  2
  Discovered CMCs:           0

River Cabinet Checks
============================
x3000 (River)
  Nodes: PASS
  NodeBMCs: PASS
  RouterBMCs: PASS
  CMCs: PASS
  CabinetPDUControllers: PASS

Mountain/Hill Cabinet Checks
============================
x1000 (Mountain)
  ChassisBMCs: PASS
  Nodes: PASS
  NodeBMCs: PASS
  RouterBMCs: PASS

EX2500 Cabinet Checks
============================
None Found.
Refer to 2.2.1 Interpreting results and 2.2.2 Known Issues in order to troubleshoot any errors or warnings.
The Cabinet Checks output is divided into four sections:
- Summary information for each cabinet.
- Detail information for River cabinets.
- Detail information for Mountain/Hill cabinets.
- Detail information for EX2500 cabinets.
In the River section, any hardware found in SLS and not discovered by HSM is considered a failure.
In the Mountain/Hill section, the only things considered failures are Chassis BMCs that are not discovered in HSM and undiscovered BMCs in populated slots.
The EX2500 section performs checks for both air-cooled and liquid-cooled hardware, based on the chassis. For liquid-cooled chassis, the only things considered failures are Chassis BMCs that are not discovered in HSM and undiscovered BMCs in populated slots. For air-cooled chassis (if present), any hardware found in SLS but not discovered by HSM is considered a failure.
Any failures need to be investigated by the admin for rectification. Any warnings should also be examined by the administrator to ensure they are accurate and expected.
For each of the BMCs that show up as not being present in HSM Components or Redfish Endpoints, use the following notes to determine whether the issue with the BMC can be safely ignored or needs to be addressed before proceeding.
- The node BMCs for HPE Apollo XL645D nodes may report as a mismatch, depending on the state of the system when the verify_hsm_discovery.py script is run. If the system is currently going through the process of installation, then this is an expected mismatch, because the Prepare Compute Nodes procedure required to configure the BMC of the HPE Apollo 6500 XL645D node may not have been completed yet. Refer to Configure HPE Apollo 6500 XL645D Gen10 Plus Compute Nodes for the additional required configuration for this type of BMC.
  Example mismatch for the BMC of an HPE Apollo XL645D:
  Nodes: FAIL - x3000c0s30b1n0 (Compute, NID 5) - Not found in HSM Components.
  NodeBMCs: FAIL - x3000c0s19b1 - Not found in HSM Components; Not found in HSM Redfish Endpoints.
- Chassis Management Controllers (CMCs) may show up as not being present in HSM. A Gigabyte node blade CMC that is not found in HSM is not normal and should be investigated. If a Gigabyte CMC is expected to not be connected to the HMN network, then it can be ignored. Otherwise, verify that the root service account is configured for the CMC and add it if needed by following the steps outlined in Add Root Service Account for Gigabyte Controllers.
  CMCs have component names (xnames) in the form xXc0sSb999, where X is the cabinet and S is the rack U of the compute node chassis.
  Example mismatch for a CMC on an Intel node blade:
  ChassisBMCs/CMCs: FAIL - x3000c0s10b999 - Not found in HSM Components; Not found in HSM Redfish Endpoints; No mgmt port connection.
- Cabinet PDU Controllers have component names (xnames) in the form xXmM, where X is the cabinet and M is the ordinal of the Cabinet PDU Controller.
  Example mismatch for a PDU:
  CabinetPDUControllers: WARNING - x3000m0 - Not found in HSM Components; Not found in HSM Redfish Endpoints
  (ncn#) If the PDU is accessible over the network, the following can be used to determine the vendor of the PDU:
  PDU=x3000m0
  curl -k -s --compressed https://$PDU -i | grep Server:
  - Example ServerTech output:
    Server: ServerTech-AWS/v8.0v
  - Example HPE output:
    Server: HPE/1.4.0
  - ServerTech PDUs may need their passwords changed from the defaults to become functional. See Change Credentials on ServerTech PDUs.
  - HPE PDUs are supported and should show up as being found in HSM. If they are not, this should be investigated, because it may indicate that configuration steps required for the PDUs to be discovered have not yet been executed. Refer to HPE PDU Admin Procedures for additional configuration for this type of PDU. The steps to run will depend on whether the PDU has already been set up, and on whether an upgrade or a fresh install of CSM is being performed.
- River BMCs having no association with a management switch port will be annotated as such, and should be investigated.
- In Hill configurations, SLS assumes that BMCs in chassis 1 and 3 are fully populated (32 Node BMCs), and in Mountain configurations, SLS assumes that all BMCs are fully populated (128 Node BMCs). EX2500 cabinets will have either 1, 2, or 3 fully populated chassis, depending on how the cabinet is configured. BMCs from non-populated chassis slots will not show up in the mismatch list. Any BMCs missing from populated chassis slots with no HSM data will show up in the mismatch list.
If it is determined that the mismatch cannot be ignored, then proceed to 2.2.2 Known Issues below to troubleshoot the mismatched BMCs.
Known issues that may prevent hardware from getting discovered by Hardware State Manager:
- Switches in River cabinets require SNMP to be enabled for discovery to work. To configure SNMP, see Configure SNMP.
- HMS Discovery job not creating Redfish Endpoints in Hardware State Manager
Optionally, these checks may be executed to detect problems with hardware in the system. Hardware check failures are not blockers for system installations and upgrades, and it is typically safe to postpone the investigation and resolution of any such failures until after the CSM installation or upgrade has completed.
These checks may be executed on any one worker or master NCN (but not ncn-m001 if it is still the PIT node).
(ncn-mw#) Run the hardware checks.
/opt/cray/csm/scripts/hms_verification/run_hardware_checks.sh
The return code of the script is zero if all hardware checks run and pass, non-zero if not. On errors or failures, the script will print the path to the hardware checks log file for the administrator to inspect. See the Flags Set For Nodes In HSM documentation for more information about common types of hardware check failures.
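Because the script signals success only through its return code, acting on that code can be scripted. This is a minimal sketch, assuming a Bash shell.

```bash
# Minimal sketch, assuming a Bash shell: act on the hardware check return code.
if /opt/cray/csm/scripts/hms_verification/run_hardware_checks.sh; then
    echo "Hardware checks passed"
else
    echo "Hardware checks failed -- inspect the log file path printed above" >&2
fi
```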
This test requires that the Cray CLI is configured on nodes where the test is executed. See Cray command line interface.
(ncn-mw#) To validate all SMS services, run the following:
/usr/local/bin/cmsdev test -q all
Successful output ends with a line similar to the following:
SUCCESS: All 6 service tests passed: bos, cfs, conman, ims, tftp, vcs
For more details, including known issues and other command line options, see Software Management Services health checks.
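If only one service needs to be re-checked, the linked documentation describes per-service invocations. The following is an assumed form based on the service names in the success line above; verify the exact syntax against the Software Management Services health checks documentation.

```bash
# Assumed example (verify against the SMS health checks documentation): test a single
# service by naming it instead of "all". "bos" is taken from the success line above.
/usr/local/bin/cmsdev test -q bos
```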
The gateway tests check the health of the API Gateway on all of the relevant networks. The gateway tests check that the gateway is accessible on all networks where it should be accessible, and NOT accessible on all networks where it should NOT be accessible. They also check several service endpoints to verify that they return the proper response on each accessible network.
The test will complete with an overall test status based on the result of the individual health checks on all of the networks.
Overall Gateway Test Status: PASS
For more detailed information on the tests results and examples, see Gateway Testing.
The gateway tests can be run from various locations. For this part of the CSM validation, check gateway access from the NCNs and from outside the system. Externally, the API gateway is accessible on the CMN and either the CAN or CHN, depending on the configuration of the system. On NCNs, the API gateway is accessible on the same networks (CMN and CAN/CHN) and it is also accessible on the NMNLB network.
The gateway tests may be run on any NCN with the docs-csm RPM installed. For details on installing the docs-csm RPM, see Check for Latest Documentation.
To execute the tests from an NCN, see Running Gateway Tests on an NCN Management Node.
To execute the tests from a device outside the system, see Running Gateway Tests on a Device Outside the System.
The internal SSH access tests may be run on any NCN with the docs-csm RPM installed. For details on installing the docs-csm RPM, see Check for Latest Documentation.
(ncn#) Execute the tests by running the following command:
/usr/share/doc/csm/scripts/operations/pyscripts/start.py test_bican_internal
By default, SSH access will be tested on all relevant networks between master nodes and spine switches. It is possible to customize which nodes and networks will be tested. For example, it is possible to include UANs, to exclude master nodes, or to exclude the HMN. See the test usage statement for details.
(ncn#) The test usage statement is displayed by calling the test with the --help argument:
/usr/share/doc/csm/scripts/operations/pyscripts/start.py test_bican_internal --help
The test will complete with an overall pass/failure status such as the following:
Overall status: PASSED (Passed: 40, Failed: 0)
The external SSH access tests may be run on any system external to the cluster. The tests should not be run from another system running the Cray System Management software if that system was configured with the same internal network ranges as the system being tested, because this will cause some tests to fail.
- (external#) Python version 3 must be installed (if it is not already).
- (external#) Obtain the test code. There are two options for doing this:
  - Install the docs-csm RPM.
  - Copy over the following folder from a system where the docs-csm RPM is installed:
    /usr/share/doc/csm/scripts/operations/pyscripts
- (external#) Install the Python dependencies (see the optional virtual environment sketch after this procedure). Run the following command from the pyscripts directory in order to install the required Python dependencies:
  cd /usr/share/doc/csm/scripts/operations/pyscripts && pip install .
- (ncn# or pit#) Obtain the admin client secret. Because kubectl will not work outside of the cluster, obtain the admin client secret by running the following command on an NCN or the PIT node:
  kubectl get secrets admin-client-auth -o jsonpath='{.data.client-secret}' | base64 -d
  Example output:
  26947343-d4ab-403b-14e937dbd700
- (external#) On the external system, execute the tests.
  cd /usr/share/doc/csm/scripts/operations/pyscripts && ./start.py test_bican_external
  By default, SSH access will be tested on all relevant networks between master nodes and spine switches. It is possible to customize which nodes and networks will be tested. For example, it is possible to include compute nodes, to exclude spine switches, or to exclude the NMN. See the test usage statement for details.
  The test usage statement is displayed by calling the test with the --help argument:
  cd /usr/share/doc/csm/scripts/operations/pyscripts && ./start.py test_bican_external --help
- When prompted by the test, enter the system domain and the admin client secret. The test will complete with an overall pass/failure status such as the following:
  Overall status: PASSED (Passed: 20, Failed: 0)
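If installing the dependencies into the external system's default Python environment is undesirable, a Python virtual environment can be used instead. This is an optional sketch; it assumes python3 with the venv module is available on the external system.

```bash
# Optional sketch: install the test dependencies into a virtual environment rather than
# the system Python. Assumes python3 and its venv module are available.
cd /usr/share/doc/csm/scripts/operations/pyscripts
python3 -m venv .venv
source .venv/bin/activate
pip install .
./start.py test_bican_external
```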
This test is very important to run, particularly during the CSM install prior to rebooting the PIT node, because it validates all of the services required for nodes to PXE boot from the cluster.
By default the test automatically chooses an enabled compute node and a barebones IMS image to use for the test. This default behavior can be overridden, however. For additional information and troubleshooting related to the barebones image or the test, see Troubleshoot the CMS Barebones Image Boot Test.
This test can be run on any master or worker NCN, but not the PIT node.
(ncn-mw#) The script is executable and can be run without any arguments. It returns zero on success and non-zero on failure.
/opt/cray/tests/integration/csm/barebonesImageTest
Successful output looks similar to the following:
cray.barebones-boot-test: INFO Barebones image boot test starting
cray.barebones-boot-test: INFO For complete logs look in the file /tmp/cray.barebones-boot-test.log
cray.barebones-boot-test: INFO Creating bos session with template:csm-barebones-image-test, on node:x3000c0s10b1n0
cray.barebones-boot-test: INFO Starting boot on compute node: x3000c0s10b1n0
cray.barebones-boot-test: INFO Found dracut message in console output - success!!!
cray.barebones-boot-test: INFO Successfully completed barebones image boot test.
The commands in this section require that the Cray CLI is configured on nodes where the commands are being executed.
The procedures below use the CLI as an authorized user and run on two separate node types. The first part runs on the LiveCD node, while the second part runs on a non-LiveCD Kubernetes master or worker node. In either case, the CLI configuration needs to be initialized on the node and the user running the procedure needs to be authorized.
The following procedures run on separate nodes of the system.
- Validate the basic UAS installation
- Validate UAI creation
- Test UAI gateway health
- UAS/UAI troubleshooting
This section can be run on any NCN or the PIT node.
- (ncn# or pit#) Show information about cray-uas-mgr.
  cray uas mgr-info list --format toml
  Expected output looks similar to the following:
  service_name = "cray-uas-mgr"
  version = "1.11.5"
  This example output shows that UAS is installed and running version 1.11.5. If the error "Token not valid for UAS" occurs, see Authorization issues.
- (ncn# or pit#) List UAIs on the system.
  cray uas list --format toml
  Expected output looks similar to the following:
  results = []
  This example output shows that there are no currently running UAIs. It is possible, if someone else has been using the UAS, that there could be UAIs in the list. That is acceptable too from a validation standpoint.
- (ncn# or pit#) Verify that the pre-made UAI images are registered with UAS.
  cray uas images list --format toml
  Expected output looks similar to the following:
  default_image = "artifactory.algol60.net/csm-docker/stable/cray-uai-sles15sp3:1.6.0"
  image_list = [
    "artifactory.algol60.net/csm-docker/stable/cray-uai-sles15sp3:1.6.0",
    "artifactory.algol60.net/csm-docker/stable/cray-uai-gateway-test:1.6.0",
    "artifactory.algol60.net/csm-docker/stable/cray-uai-broker:1.6.0",
  ]
  This example output shows that the pre-made end-user UAI images (artifactory.algol60.net/csm-docker/stable/cray-uai-sles15sp3:1.6.0, artifactory.algol60.net/csm-docker/stable/cray-uai-gateway-test:1.6.0, and artifactory.algol60.net/csm-docker/stable/cray-uai-broker:1.6.0) are registered with UAS. This does not necessarily mean these images are installed in the container image registry, but they are configured for use. If other UAI images have been created and registered, they may also show up here, which is acceptable.
IMPORTANT: If the site does not use UAIs, skip UAS and UAI validation. If UAIs are used, there are products that configure UAS, such as Cray Analytics and the Cray Programming Environment, that must be working correctly with UAIs and should be validated before validating UAS and UAI (the procedures for this are beyond the scope of this document). Failures in UAI creation that result from incorrect or incomplete installation of these products will generally take the form of UAIs stuck in a waiting state while trying to set up volume mounts. See the UAI Troubleshooting section for more information.
IMPORTANT: If the site is configured to use the CHN, and the high speed network has not been installed and configured, this procedure cannot be completed. The UAI that is created will be inaccessible until the high speed network is available.
This procedure must run on a master or worker node (not the PIT node).
- (ncn-mw#) Verify that a UAI can be created:
  cray uas create --publickey ~/.ssh/id_rsa.pub --format toml
  Expected output looks similar to the following:
  uai_connect_string = "ssh vers@10.16.234.10"
  uai_host = "ncn-w001"
  uai_img = "registry.local/cray/cray-uai-sles15sp3:1.0.11"
  uai_ip = "10.16.234.10"
  uai_msg = ""
  uai_name = "uai-vers-a00fb46b"
  uai_status = "Pending"
  username = "vers"
  [uai_portmap]
  This creates the UAI, which then initializes and starts running. The uai_status in the output from this command may instead be Waiting, which is also acceptable.
, which is also acceptable. -
(
ncn#
orpit#
) SetUAINAME
to the value of theuai_name
field in the previous command output (uai-vers-a00fb46b
in our example):UAINAME=uai-vers-a00fb46b
- (ncn-mw#) Check the current status of the UAI:
  cray uas list --format toml
  Expected output looks similar to the following:
  [[results]]
  uai_age = "0m"
  uai_connect_string = "ssh vers@10.16.234.10"
  uai_host = "ncn-w001"
  uai_img = "registry.local/cray/cray-uai-sles15sp3:1.0.11"
  uai_ip = "10.16.234.10"
  uai_msg = ""
  uai_name = "uai-vers-a00fb46b"
  uai_status = "Running: Ready"
  username = "vers"
  If the uai_status field is Running: Ready, proceed to the next step. Otherwise, wait and repeat this command until that is the case. It normally should not take more than a minute or two.
(
ncn#
orpit#
) The UAI is ready for use. Log into it with the command in theuai_connect_string
field in the previous command output:ssh vers@10.16.234.10 vers@uai-vers-a00fb46b-6889b666db-4dfvn:~>
- (uai#) Run a command on the UAI:
  vers@uai-vers-a00fb46b-6889b666db-4dfvn:~> ps -afe
  Expected output looks similar to the following:
  UID          PID    PPID  C STIME TTY          TIME CMD
  root           1       0  0 18:51 ?        00:00:00 /bin/bash /usr/bin/uai-ssh.sh
  munge         36       1  0 18:51 ?        00:00:00 /usr/sbin/munged
  root          54       1  0 18:51 ?        00:00:00 su vers -c /usr/sbin/sshd -e -f /etc/uas/ssh/sshd_config -D
  vers          55      54  0 18:51 ?        00:00:00 /usr/sbin/sshd -e -f /etc/uas/ssh/sshd_config -D
  vers          62      55  0 18:51 ?        00:00:00 sshd: vers [priv]
  vers          67      62  0 18:51 ?        00:00:00 sshd: vers@pts/0
  vers          68      67  0 18:51 pts/0    00:00:00 -bash
  vers         120      68  0 18:52 pts/0    00:00:00 ps -afe
- (uai#) Log out from the UAI.
  vers@uai-vers-a00fb46b-6889b666db-4dfvn:~> exit
  ncn#
- (ncn-mw#) Clean up the UAI.
  cray uas delete --uai-list $UAINAME --format toml
  Expected output looks similar to the following:
  results = [ "Successfully deleted uai-vers-a00fb46b",]
If the commands ran with similar results, then the basic functionality of the UAS and UAI is working.
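As an optional alternative to copying the UAI name by hand in the step above, the name can be captured programmatically. This is only a sketch; it assumes the cray CLI supports JSON output for this command and that jq is installed on the node.

```bash
# Optional sketch: create a UAI and capture its name in one step.
# Assumes "cray uas create" supports --format json and that jq is installed.
UAINAME=$(cray uas create --publickey ~/.ssh/id_rsa.pub --format json | jq -r '.uai_name')
echo "Created UAI: ${UAINAME}"
```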
Like the NCN gateway health check, the gateway tests check the health of the API Gateway on all of the relevant networks. On UAIs, the API gateway should only be accessible on the user network (either CAN or CHN depending on the configuration of the system). The gateway tests check that the gateway is accessible on all networks where it should be accessible, and NOT accessible on all networks where it should NOT be accessible. They also check several service endpoints to verify that they return the proper response on each accessible network.
The UAI gateway tests may be run on any NCN with the docs-csm RPM installed. For details on installing the docs-csm RPM, see Check for Latest Documentation.
(ncn#) The UAI gateway tests are executed by running the following command.
/usr/share/doc/csm/scripts/operations/gateway-test/uai-gateway-test.sh
The test will launch a UAI with the gateway-test image, execute the gateway tests, and then delete the UAI that was launched.
The test will complete with an overall test status based on the result of the individual health checks on all of the networks.
Overall Gateway Test Status: PASS
For more detailed information on the tests results and examples, see Gateway Testing.
The following subsections include common failure modes seen with UAS / UAI operations and how to resolve them.
An error will be returned when running CLI commands if the user is not logged in as a valid Keycloak user, or if the CRAY_CREDENTIALS environment variable is unintentionally set; when that variable is set, it is used regardless of the credentials of the logged-in user.
For example:
cray uas list
The symptom of this problem is output similar to the following:
Usage: cray uas list [OPTIONS]
Try 'cray uas list --help' for help.
Error: Bad Request: Token not valid for UAS. Attributes missing: ['gidNumber', 'loginShell', 'homeDirectory', 'uidNumber', 'name']
Fix this by using cray auth login to log in as a Keycloak user that has the above attributes defined, and make sure that CRAY_CREDENTIALS is unset.
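A sketch of that fix is shown below; the username is illustrative.

```bash
# Sketch of the fix: clear the override variable, then log in as a valid Keycloak user.
# The username is illustrative; cray auth login prompts for the password.
unset CRAY_CREDENTIALS
cray auth login --username myuser
```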
When running CLI commands, a Keycloak error may be returned.
For example:
cray uas list
The symptom of this problem is output similar to the following:
Usage: cray uas list [OPTIONS]
Try 'cray uas list --help' for help.
Error: Internal Server Error: An error was encountered while accessing Keycloak
If the wrong hostname was used to reach the API gateway, re-run the CLI initialization steps and try again. There may also be a problem with the Istio service mesh inside the system.
Troubleshooting this is beyond the scope of this section, but there may be useful information in the UAS pod logs in Kubernetes. There are generally two UAS pods, so the user may need to look at logs from both to find the specific failure. The logs tend to have a very large number of GET events listed as part of the liveness checking.
The following shows an example of looking at UAS logs effectively (this example shows only one UAS manager, normally there would be two):
- (ncn-mw# or pit#) Determine the pod name of the uas-mgr pod.
  kubectl get po -n services | grep "^cray-uas-mgr" | grep -v etcd
Expected output looks similar to:
cray-uas-mgr-6bbd584ccb-zg8vx 2/2 Running 0 12d
- (ncn-mw# or pit#) Set PODNAME to the name of the manager pod whose logs are going to be viewed.
  export PODNAME=cray-uas-mgr-6bbd584ccb-zg8vx
- (ncn-mw# or pit#) View the last 25 log entries of the cray-uas-mgr container in that pod, excluding GET events:
  kubectl logs -n services $PODNAME cray-uas-mgr | grep -v 'GET ' | tail -25
Example output:
2021-02-08 15:32:41,211 - uas_mgr - INFO - getting deployment uai-vers-87a0ff6e in namespace user
2021-02-08 15:32:41,225 - uas_mgr - INFO - creating deployment uai-vers-87a0ff6e in namespace user
2021-02-08 15:32:41,241 - uas_mgr - INFO - creating the UAI service uai-vers-87a0ff6e-ssh
2021-02-08 15:32:41,241 - uas_mgr - INFO - getting service uai-vers-87a0ff6e-ssh in namespace user
2021-02-08 15:32:41,252 - uas_mgr - INFO - creating service uai-vers-87a0ff6e-ssh in namespace user
2021-02-08 15:32:41,267 - uas_mgr - INFO - getting pod info uai-vers-87a0ff6e
2021-02-08 15:32:41,360 - uas_mgr - INFO - No start time provided from pod
2021-02-08 15:32:41,361 - uas_mgr - INFO - getting service info for uai-vers-87a0ff6e-ssh in namespace user
127.0.0.1 - - [08/Feb/2021 15:32:41] "POST /v1/uas?imagename=registry.local%2Fcray%2Fno-image-registered%3A1.0.11 HTTP/1.1" 200 -
2021-02-08 15:32:54,455 - uas_auth - INFO - UasAuth lookup complete for user vers
2021-02-08 15:32:54,455 - uas_mgr - INFO - UAS request for: vers
2021-02-08 15:32:54,455 - uas_mgr - INFO - listing deployments matching: host None, labels uas=managed,user=vers
2021-02-08 15:32:54,484 - uas_mgr - INFO - getting pod info uai-vers-87a0ff6e
2021-02-08 15:32:54,596 - uas_mgr - INFO - getting service info for uai-vers-87a0ff6e-ssh in namespace user
2021-02-08 15:40:25,053 - uas_auth - INFO - UasAuth lookup complete for user vers
2021-02-08 15:40:25,054 - uas_mgr - INFO - UAS request for: vers
2021-02-08 15:40:25,054 - uas_mgr - INFO - listing deployments matching: host None, labels uas=managed,user=vers
2021-02-08 15:40:25,085 - uas_mgr - INFO - getting pod info uai-vers-87a0ff6e
2021-02-08 15:40:25,212 - uas_mgr - INFO - getting service info for uai-vers-87a0ff6e-ssh in namespace user
2021-02-08 15:40:51,210 - uas_auth - INFO - UasAuth lookup complete for user vers
2021-02-08 15:40:51,210 - uas_mgr - INFO - UAS request for: vers
2021-02-08 15:40:51,210 - uas_mgr - INFO - listing deployments matching: host None, labels uas=managed,user=vers
2021-02-08 15:40:51,261 - uas_mgr - INFO - deleting service uai-vers-87a0ff6e-ssh in namespace user
2021-02-08 15:40:51,291 - uas_mgr - INFO - delete deployment uai-vers-87a0ff6e in namespace user
127.0.0.1 - - [08/Feb/2021 15:40:51] "DELETE /v1/uas?uai_list=uai-vers-87a0ff6e HTTP/1.1" 200 -
When listing or describing a UAI, an error in the uai_msg field may be returned. For example:
cray uas list --format toml
There may be something similar to the following output:
[[results]]
uai_age = "0m"
uai_connect_string = "ssh vers@10.103.13.172"
uai_host = "ncn-w001"
uai_img = "registry.local/cray/cray-uai-sles15sp3:1.0.11"
uai_ip = "10.103.13.172"
uai_msg = "ErrImagePull"
uai_name = "uai-vers-87a0ff6e"
uai_status = "Waiting"
username = "vers"
This means the pre-made end-user UAI image is not in the local registry (or whatever registry it is being pulled from; see the uai_img value for details). To correct this, locate and push/import the image to the registry.
Various packages install volumes in the UAS configuration. All of those volumes must also have the underlying resources available, sometimes on the host node where the UAI is running and sometimes from within Kubernetes. If a UAI gets stuck with a ContainerCreating uai_msg field for an extended time, this is a likely cause. UAIs run in the user Kubernetes namespace, and they are pods that can be examined using kubectl describe.
- (ncn-mw# or pit#) Locate the pod.
  kubectl get po -n user | grep <uai-name>
- (ncn-mw# or pit#) Investigate the problem using the pod name from the previous step.
  kubectl describe pod -n user <pod-name>
If volumes are missing, they will show up in the Events: section of the output. Other problems may show up there as well. The names of the missing volumes or other issues should indicate what needs to be fixed to make the UAI run.
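To focus on just that section, the describe output can be filtered. This is a small convenience sketch using standard kubectl and grep behavior.

```bash
# Convenience sketch: show only the Events section of the UAI pod description.
# Replace <pod-name> with the pod name found in the previous step.
kubectl describe pod -n user <pod-name> | grep -A 20 'Events:'
```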