Anytime after the installation of the CSM services, the health of the management nodes and all CSM services can be validated.
The following are examples of when to run health checks:
- After CSM install.sh completes
- Before and after NCN reboots
- After the system is brought back up
- Any time there is unexpected behavior observed
- In order to provide relevant information to create support tickets
The areas should be tested in the order they are listed on this page. Errors in an earlier check may cause errors in later checks because of dependencies.
- Validate CSM Health
- Topics:
- 1. Platform Health Checks
- 1.1 ncnHealthChecks
- 1.2 ncnPostgresHealthChecks
- 1.3 BGP Peering Status and Reset
- 1.4 Verify that KEA has active DHCP leases
- 1.5 Verify ability to resolve external DNS
- 1.6 Verify Spire Agent is Running on Kubernetes NCNs
- 1.7 Verify the Vault Cluster is Healthy
- 1.8 Automated Goss Testing
- 1.9 OPTIONAL Check of System Management Monitoring Tools
- 2. Hardware Management Services Health Checks
- 3. Software Management Services Health Checks
- 4. Booting CSM Barebones Image
- 5. UAS / UAI Tests
Scripts do not verify results. Script output includes analysis needed to determine pass/fail for each check. All health checks are expected to pass.
Health Check scripts can be run:
- After CSM install.sh has been run (not before)
- Before and after one of the NCNs reboots
- After the system or a single node goes down unexpectedly
- After the system is gracefully shut down and brought up
- Any time there is unexpected behavior on the system to get a baseline of data for CSM services and components
- In order to provide relevant information to support tickets that are being opened after CSM install.sh has been run
Available Platform Health Checks:
- ncnHealthChecks
- ncnPostgresHealthChecks
- BGP Peering Status and Reset
- KEA / DHCP
- External DNS
- Spire Agent
- Vault Cluster
- Automated Goss Testing
Health Check scripts can be found and run on any worker or master node (not on PIT node), from any directory.
ncn# /opt/cray/platform-utils/ncnHealthChecks.sh
The ncnHealthChecks script reports the following health information:
- Kubernetes status for master and worker NCNs
- Ceph health status
- Health of etcd clusters
- Number of pods on each worker node for each etcd cluster
- Alarms set for any of the etcd clusters
- Health of each etcd cluster's database
- List of automated etcd backups for the Boot Orchestration Service (BOS), Boot Script Service (BSS), Compute Rolling Upgrade Service (CRUS), Domain Name Service (DNS), and Firmware Action Service (FAS) clusters
- NCN node uptimes
- NCN master and worker node resource consumption
- NCN node xnames and metal.no-wipe status
- NCN worker node pod counts
- Pods yet to reach the running state
Execute the ncnHealthChecks script and analyze the output of each individual check.
IMPORTANT: When the PIT node is booted, the NCN node metal.no-wipe status is not available and is correctly reported as 'unavailable'. Once ncn-m001 has been booted, the NCN metal.no-wipe status is expected to be reported as metal.no-wipe=1.
IMPORTANT: Only when ncn-m001 has been booted, if the output of the ncnHealthChecks.sh script shows that there are nodes that do not have the metal.no-wipe=1 status, then do the following:
ncn# csi handoff bss-update-param --set metal.no-wipe=1 --limit <SERVER_XNAME>
IMPORTANT: If the output of pod statuses indicates that there are pods in the Evicted state, the cause may be that the /root file system has filled up on the affected Kubernetes node. Kubernetes begins evicting pods once root file system usage reaches 85% and continues until usage is back under 80%. This commonly happens on ncn-m001, because install and doc files are often downloaded there. If this is the root cause of the pod evictions, it may be necessary to clean up space in the /root directory. The following commands can be used to determine whether files under /root need to be analyzed to free up space.
ncn# df -h /root
Filesystem Size Used Avail Use% Mounted on
LiveOS_rootfs 280G 245G 35G 88% /
ncn# du -h -s /root/
225G /root/
ncn# du -ah -B 1024M /root | sort -n -r | head -n 10
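The threshold check described above can be sketched as follows. This is a minimal illustration, not part of CSM: the function names fs_use_pct and over_eviction_threshold are hypothetical, and the default of 85 is simply the eviction threshold quoted above.

```shell
# Hypothetical helper (not a CSM tool): read `df -P <path>` output on stdin
# and print the usage percentage of the reported file system as a bare integer.
fs_use_pct() {
    awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: return 0 when the usage ($1) is at or above the
# threshold ($2, defaulting to the 85% eviction threshold quoted above).
over_eviction_threshold() {
    [ "$1" -ge "${2:-85}" ]
}

# Example use on an NCN (assumed invocation):
#   pct=$(df -P /root | fs_use_pct)
#   over_eviction_threshold "$pct" && echo "root file system at ${pct}%: clean up /root"
```

The -P flag keeps each file system on a single line, which makes the fifth column safe to parse.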
Note: The cray-crus- pod is expected to be in the Init state until slurm and munge are installed. In particular, this will be the case if executing this as part of the validation after completing the Install CSM Services.
If in doubt, validate the CRUS service using the CMS Validation Tool. If the CRUS check passes using that tool, do not worry about the cray-crus- pod state.
Additionally, hmn-discovery and unbound manager cronjob pods may be in a 'NotReady' state. This is expected as these pods are periodically started and transition to the completed state.
Postgres Health Check scripts can be found and run on any worker or master node (not on PIT node), from any directory. The ncnPostgresHealthChecks script reports the following postgres health information:
- The status of each postgresql resource
- The number of cluster members
- The node which is the Leader
- The state of each cluster member
- Replication Lag for any cluster member
- Kubernetes postgres pod status
Execute the ncnPostgresHealthChecks script and analyze the output of each individual check.
ncn# /opt/cray/platform-utils/ncnPostgresHealthChecks.sh
- Check the STATUS of the postgresql resources which are managed by the operator:
NAMESPACE   NAME                TEAM       VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE   STATUS
services    cray-sls-postgres   cray-sls   11        3      1Gi                                     12d   Running
If any postgresql resources remain in a STATUS other than Running (such as SyncFailed), refer to Troubleshoot Postgres Database.
- For a particular Postgres cluster, the expected output is similar to the following:
--- patronictl, version 1.6.5, list for services leader pod cray-sls-postgres-0 ---
+ Cluster: cray-sls-postgres (6938772644984361037) ---+----+-----------+
|        Member       |    Host    |  Role  |  State  | TL | Lag in MB |
+---------------------+------------+--------+---------+----+-----------+
| cray-sls-postgres-0 | 10.47.0.35 | Leader | running |  1 |           |
| cray-sls-postgres-1 | 10.36.0.33 |        | running |  1 |         0 |
| cray-sls-postgres-2 | 10.44.0.42 |        | running |  1 |         0 |
+---------------------+------------+--------+---------+----+-----------+
The points below will cover the data in the table above for Member, Role, State, and Lag in MB columns.
For each Postgres cluster:
- Verify there are three cluster members (with the exception of sma-postgres-cluster where there should be only two cluster members). If the number of cluster members is not correct, refer to Troubleshoot Postgres Database.
- Verify there is one cluster member with the Leader Role and that the log output indicates the expected status, such as:
i am the leader with the lock
For example:
--- Logs for services Leader Pod cray-sls-postgres-0 ---
ERROR: get_cluster
INFO: establishing a new patroni connection to the postgres cluster
INFO: initialized a new cluster
INFO: Lock owner: cray-sls-postgres-0; I am cray-sls-postgres-0
INFO: Lock owner: None; I am cray-sls-postgres-0
INFO: no action. i am the leader with the lock
INFO: No PostgreSQL configuration items changed, nothing to reload.
INFO: postmaster pid=87
INFO: running post_bootstrap
INFO: trying to bootstrap a new cluster
Errors reported prior to the lock status, such as ERROR: get_cluster or ERROR: ObjectCache.run ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read)) can be ignored. If there is no Leader, refer to Troubleshoot Postgres Database.
- Verify the State of each cluster member is 'running'. If any cluster members are found to be in a non-'running' state (such as 'start failed'), refer to Troubleshoot Postgres Database.
- Verify there is no large or growing lag. If any cluster members are found to have lag or lag is 'unknown', refer to Troubleshoot Postgres Database.
- Check that all Kubernetes Postgres pods have a STATUS of Running.
ncn# kubectl get pods -A -o wide -l application=spilo
NAMESPACE   NAME                  READY   STATUS    RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
services    cray-sls-postgres-0   3/3     Running   3          6d      10.38.0.102   ncn-w002   <none>           <none>
services    cray-sls-postgres-1   3/3     Running   3          5d20h   10.42.0.89    ncn-w001   <none>           <none>
services    cray-sls-postgres-2   3/3     Running   0          5d20h   10.36.0.31    ncn-w003   <none>           <none>
If any Postgres pods have a STATUS other than Running, gather more information from the pod and refer to Troubleshoot Postgres Database.
ncn# kubectl describe pod <pod name> -n <pod namespace>
ncn# kubectl logs <pod name> -n <pod namespace> -c <pod container name>
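The manual checks above (member count, single Leader, member state, and pod status) can be sketched as two small filters. Both function names are hypothetical and not part of CSM; the sketch assumes the patronictl table layout shown above and the column order that kubectl prints with -A.

```shell
# Hypothetical helper: read one patronictl table (as printed by
# ncnPostgresHealthChecks.sh) on stdin and print three counts:
# "<members> <leaders> <members not in the running state>".
patroni_summary() {
    awk -F'|' '
        # Only the header row and member rows contain "|" separators.
        NF >= 7 && $2 !~ /^ *Member *$/ {
            role = $4; state = $5
            gsub(/^ +| +$/, "", role)
            gsub(/^ +| +$/, "", state)
            members++
            if (role == "Leader") leaders++
            if (state != "running") bad++
        }
        END { printf "%d %d %d\n", members, leaders, bad }
    '
}

# Hypothetical helper: read `kubectl get pods -A --no-headers` output on stdin
# and print every pod whose STATUS column (field 4) is not Running.
not_running_pods() {
    awk '$4 != "Running" { print $1 "/" $2 ": " $4 }'
}

# Example use (assumed invocation):
#   kubectl get pods -A -o wide -l application=spilo --no-headers | not_running_pods
```

For a healthy three-member cluster, patroni_summary is expected to print 3 1 0; for sma-postgres-cluster the first number would be 2.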
Verify that Border Gateway Protocol (BGP) peering sessions are established for each worker node on the system.
Check the Border Gateway Protocol (BGP) status on the Aruba or Mellanox switches. Verify that all sessions are in an Established state. If the state of any session in the table is Idle, reset the BGP sessions.
On an NCN, determine the IP addresses of switches:
ncn-m001# kubectl get cm config -n metallb-system -o yaml | head -12
Expected output looks similar to the following:
apiVersion: v1
data:
config: |
peers:
- peer-address: 10.252.0.2
peer-asn: 65533
my-asn: 65533
- peer-address: 10.252.0.3
peer-asn: 65533
my-asn: 65533
address-pools:
- name: customer-access
Using the first peer-address (10.252.0.2 here), log in using ssh as the administrator to the first switch and note in the returned output whether a Mellanox or Aruba switch is indicated.
ncn-m001# ssh admin@10.252.0.2
- On a Mellanox switch, "Mellanox Onyx Switch Management" or "Mellanox Switch" may be displayed after logging in to the switch with ssh. In this case, proceed to the Mellanox steps.
- On an Aruba switch, "Please register your products now at: https://asp.arubanetworks.com" may be displayed after logging in to the switch with ssh. In this case, proceed to the Aruba steps.
- Enable:
sw-spine-001# enable
- Verify BGP is enabled:
sw-spine-001# show protocols | include bgp
Expected output looks similar to the following:
bgp: enabled
- Check peering status:
sw-spine-001# show ip bgp summary
Expected output looks similar to the following:
VRF name                  : default
BGP router identifier     : 10.252.0.2
local AS number           : 65533
BGP table version         : 3
Main routing table version: 3
IPV4 Prefixes             : 59
IPV6 Prefixes             : 0
L2VPN EVPN Prefixes       : 0
------------------------------------------------------------------------------------------------------------------
Neighbor          V    AS           MsgRcvd   MsgSent   TblVer    InQ    OutQ   Up/Down       State/PfxRcd
------------------------------------------------------------------------------------------------------------------
10.252.1.10       4    65533        2945      3365      3         0      0      1:00:21:33    ESTABLISHED/20
10.252.1.11       4    65533        2942      3356      3         0      0      1:00:20:49    ESTABLISHED/19
10.252.1.12       4    65533        2945      3363      3         0      0      1:00:21:33    ESTABLISHED/20
- If one or more BGP sessions are reported in an Idle state, reset BGP to re-establish the sessions:
sw-spine-001# clear ip bgp all
- It may take several minutes for all sessions to become Established. Wait a minute or so, and then verify that all sessions are now reported as Established. If some sessions remain in an Idle state, re-run the clear ip bgp all command and check again.
- If one or more BGP sessions remain Idle after several tries, see Check BGP Status and Reset Sessions.
- Repeat the above Mellanox procedure using the second peer-address (10.252.0.3 here).
On an Aruba switch, the prompt may include sw-spine or sw-agg.
- Check BGP peering status.
sw-agg01# show bgp ipv4 unicast summary
Expected output looks similar to the following:
VRF : default
BGP Summary
-----------
 Local AS               : 65533        BGP Router Identifier  : 10.252.0.4
 Peers                  : 7            Log Neighbor Changes   : No
 Cfg. Hold Time         : 180          Cfg. Keep Alive        : 60
 Confederation Id       : 0

 Neighbor        Remote-AS MsgRcvd MsgSent   Up/Down Time State        AdminStatus
 10.252.0.5      65533       19579   19588   20h:40m:30s  Established   Up
 10.252.1.7      65533       34137   39074   20h:41m:53s  Established   Up
 10.252.1.8      65533       34134   39036   20h:36m:44s  Established   Up
 10.252.1.9      65533       34104   39072   00m:01w:04d  Established   Up
 10.252.1.10     65533       34105   39029   00m:01w:04d  Established   Up
 10.252.1.11     65533       34099   39042   00m:01w:04d  Established   Up
 10.252.1.12     65533       34101   39012   00m:01w:04d  Established   Up
- If one or more BGP sessions are reported in an Idle state, reset BGP to re-establish the sessions:
sw-agg01# clear bgp *
- It may take several minutes for all sessions to become Established. Wait a minute or so, and then verify that all sessions are now reported as Established. If some sessions remain in an Idle state, re-run the clear bgp * command and check again.
- If one or more BGP sessions remain Idle after several tries, see Check BGP Status and Reset Sessions.
- Repeat the above Aruba procedure using the second peer-address (10.252.0.5 in this example).
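When the summary output of either switch type has been saved to a file, the scan for sessions that are not Established can be sketched as below. The function name bgp_not_established is hypothetical; the sketch only assumes that each neighbor row starts with an IPv4 address and contains the word Established (in any letter case) when the session is up, as in both example outputs above.

```shell
# Hypothetical helper: read saved `show ip bgp summary` (Mellanox) or
# `show bgp ipv4 unicast summary` (Aruba) output on stdin and print the
# neighbor address of every row that does not report an established session.
bgp_not_established() {
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && tolower($0) !~ /established/ { print $1 }'
}

# Example use (the file name is an assumption):
#   bgp_not_established < bgp-summary.txt
```

No output means every listed neighbor is Established; any printed address is a candidate for the reset steps above.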
Verify that KEA has active DHCP leases. Right after a fresh install of CSM, it is important to verify that KEA is currently handing out DHCP leases on the system. The following commands can be run on any of the master nodes or worker nodes.
Get an API Token:
ncn# export TOKEN=$(curl -s -S -d grant_type=client_credentials \
-d client_id=admin-client \
-d client_secret=`kubectl get secrets admin-client-auth \
-o jsonpath='{.data.client-secret}' | base64 -d` \
https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token | jq -r '.access_token')
Retrieve all the leases currently in KEA:
ncn# curl -H "Authorization: Bearer ${TOKEN}" -X POST -H "Content-Type: application/json" -d '{ "command": "lease4-get-all", "service": [ "dhcp4" ] }' https://api-gw-service-nmn.local/apis/dhcp-kea | jq
If a non-zero number of DHCP leases for air-cooled hardware is returned, that is a good indication that KEA is working.
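A rough way to turn the lease listing into a single number is sketched below. The function name count_leases is hypothetical, and the sketch rests on an assumption about the response: that each lease object in the JSON carries exactly one "ip-address" key. It counts all returned leases and does not distinguish air-cooled hardware from anything else.

```shell
# Hypothetical helper: read the JSON response of the lease4-get-all request
# on stdin and print how many "ip-address" keys it contains.
# Assumption: one "ip-address" key per lease object.
count_leases() {
    { grep -o '"ip-address"' || true; } | wc -l | tr -d ' '
}

# Example use (assumed invocation, reusing the curl command above):
#   curl -s -H "Authorization: Bearer ${TOKEN}" -X POST -H "Content-Type: application/json" \
#     -d '{ "command": "lease4-get-all", "service": [ "dhcp4" ] }' \
#     https://api-gw-service-nmn.local/apis/dhcp-kea | count_leases
```

A result of 0 suggests that KEA is not handing out leases and warrants a closer look at the full response.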
If unbound is configured to resolve outside hostnames, then the following check should be performed. If unbound is not configured to resolve outside hostnames, then this check may be skipped.
Run the following on one of the master or worker nodes (not the PIT node):
ncn# nslookup cray.com ; echo "Exit code is $?"
Expected output looks similar to the following:
Server: 10.92.100.225
Address: 10.92.100.225#53
Non-authoritative answer:
Name: cray.com
Address: 52.36.131.229
Exit code is 0
Verify that the command has exit code 0, reports no errors, and resolves the address.
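The "resolves the address" part of this check can be sketched as a filter over the nslookup output. The function name resolved_addresses is hypothetical, and the sketch assumes the output layout shown above, where the answer follows a Name: line.

```shell
# Hypothetical helper: read nslookup output on stdin and print the addresses
# listed after the first "Name:" line, skipping the DNS server's own address.
resolved_addresses() {
    awk '/^Name:/ { seen = 1; next } seen && /^Address/ { print $2 }'
}

# Example use (assumed invocation):
#   addrs=$(nslookup cray.com | resolved_addresses)
#   [ -n "$addrs" ] && echo "resolved to: $addrs"
```

An empty result together with a non-zero nslookup exit code indicates that external resolution is not working.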
Execute the following command on all Kubernetes NCNs (i.e. all worker nodes and master nodes, excluding the PIT):
ncn# goss -g /opt/cray/tests/install/ncn/tests/goss-spire-agent-service-running.yaml validate
Known failures and how to recover:
- K8S Test: Verify spire-agent is enabled and running
  - The spire-agent service may fail to start on Kubernetes NCNs, logging errors (via journalctl) similar to "join token does not exist or has already been used" or the last logs containing multiple lines of "systemd[1]: spire-agent.service: Start request repeated too quickly.". Deleting the request-ncn-join-token daemonset pod running on the node may clear the issue. Even though the spire-agent systemctl service on the Kubernetes node should eventually restart cleanly, the user may have to log in to the impacted nodes and restart the service. The following recovery procedure can be run from any Kubernetes node in the cluster.
    - Set NODE to the NCN which is experiencing the issue. In this example, ncn-w002.
    ncn# export NODE=ncn-w002
    - Define the following function:
    ncn# function renewncnjoin() {
        for pod in $(kubectl get pods -n spire | grep request-ncn-join-token | awk '{print $1}'); do
            if kubectl describe -n spire pods "$pod" | grep -q "Node:.*$1"; then
                echo "Restarting $pod running on $1"
                kubectl delete -n spire pod "$pod"
            fi
        done
    }
    - Run the function as follows:
    ncn# renewncnjoin $NODE
  - The spire-agent service may also fail if an NCN was powered off for too long and its tokens expired. If this happens, delete /root/spire/agent_svid.der, /root/spire/bundle.der, and /root/spire/data/svid.key off the NCN before deleting the request-ncn-join-token daemonset pod.
Execute the following commands on ncn-m002:
ncn-m002# goss -g /opt/cray/tests/install/ncn/tests/goss-k8s-vault-cluster-health.yaml validate
Check the output to verify no failures are reported:
Count: 2, Failed: 0, Skipped: 0
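Reading the failure count out of the summary line can be sketched as below. The function name goss_failed_count is hypothetical; the sketch assumes the summary line has the "Count: N, Failed: N, Skipped: N" shape shown above.

```shell
# Hypothetical helper: read goss output on stdin and print the number that
# follows "Failed:" on the last summary line found.
goss_failed_count() {
    sed -n 's/.*Failed: \([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Example use (assumed invocation):
#   failed=$(goss -g /opt/cray/tests/install/ncn/tests/goss-k8s-vault-cluster-health.yaml validate | goss_failed_count)
#   [ "$failed" = "0" ] && echo "Vault cluster checks passed"
```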
There are multiple Goss test suites available that cover a variety of sub-systems.
Run the NCN health checks against the three different types of nodes with the following commands:
IMPORTANT: These tests may only be successful while booted into the PIT node. Do not run these as part of upgrade testing. This includes the Kubernetes check in the next block.
pit# /opt/cray/tests/install/ncn/automated/ncn-healthcheck-master
pit# /opt/cray/tests/install/ncn/automated/ncn-healthcheck-worker
pit# /opt/cray/tests/install/ncn/automated/ncn-healthcheck-storage
And the Kubernetes test suite via:
pit# /opt/cray/tests/install/ncn/automated/ncn-kubernetes-checks
- These tests can only reliably be executed from the PIT node. Should be addressed in a future release.
- K8S Test: Kubernetes Query BSS Cloud-init for ca-certs
- May fail immediately after platform install. Should pass after the TrustedCerts Operator has updated BSS (Global cloud-init meta) with CA certificates.
- K8S Test: Kubernetes Velero No Failed Backups
- Because of a known issue with Velero, a backup may be attempted immediately upon the deployment of a backup schedule (for example, vault). It may be necessary to use the velero command to delete backups from a Kubernetes node to clear this situation.
If all designated prerequisites are met, the availability of system management health services may optionally be validated by accessing the URLs listed in Access System Management Health Services.
It is very important to check the Prerequisites section of this document.
If one or more of the URLs listed in the procedure are inaccessible, it does not necessarily mean that the system is not healthy. It may simply mean that not all of the prerequisites have been met to allow access to the system management health tools via URL.
Information to assist with troubleshooting some of the components mentioned in the prerequisites can be accessed here:
- Troubleshoot CAN Issues
- Troubleshoot DNS Configuration Issues
- Check BGP Status and Reset Sessions
- Troubleshoot BGP not Accepting Routes from MetalLB
- Troubleshoot Services without an Allocated IP Address
Execute the HMS smoke and functional tests after the CSM install to confirm that the Hardware Management Services are running and operational.
These tests should be executed as root on at least one worker NCN and one master NCN (but not ncn-m001 if it is still the PIT node).
Run the HMS CT smoke tests. This is done by running the run_hms_ct_tests.sh script:
ncn# /opt/cray/csm/scripts/hms_verification/run_hms_ct_tests.sh
The return value of the script is 0 if all CT tests ran successfully, non-zero if not.
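Because the result is carried in the exit code, a small wrapper can make it visible. The following is a minimal sketch; the function name run_check and its PASS/FAIL wording are hypothetical and not part of the HMS tooling.

```shell
# Hypothetical wrapper: run a command, print PASS or FAIL with the given
# label, and hand the command's exit code back to the caller.
run_check() {
    local name="$1"
    shift
    if "$@"; then
        echo "PASS: $name"
    else
        local rc=$?
        echo "FAIL: $name (exit code $rc)"
        return "$rc"
    fi
}

# Example use (assumed invocation):
#   run_check "HMS CT tests" /opt/cray/csm/scripts/hms_verification/run_hms_ct_tests.sh
```

The same wrapper applies to any other script in this document that reports pass/fail through its exit code.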
To run the tests manually:
ncn# /opt/cray/tests/ncn-resources/hms/hms-test/hms_run_ct_smoke_tests_ncn-resources.sh
Examine the output. If one or more failures occur, investigate the cause of each failure. See the interpreting_hms_health_check_results documentation for more information.
Otherwise, run the HMS functional tests.
ncn# /opt/cray/tests/ncn-resources/hms/hms-test/hms_run_ct_functional_tests_ncn-resources.sh
Examine the output. If one or more failures occur, investigate the cause of each failure. See the interpreting_hms_health_check_results documentation for more information.
Systems with Aruba leaf switches sometimes have issues with a known SNMP bug which prevents HSM discovery from discovering all HW. At this stage of the installation process, a script can be run to detect if this issue is currently affecting the system, and if so, correct it.
Refer to Air cooled hardware is not getting properly discovered with Aruba leaf switches for details.
By this point in the installation process, the Hardware State Manager (HSM) should have done its discovery of the system.
The foundational information for this discovery is from the System Layout Service (SLS). Thus, a comparison needs to be done to verify that what is specified in SLS (focusing on BMC components and Redfish endpoints) is present in HSM.
To perform this comparison, execute the verify_hsm_discovery.py script on a Kubernetes master or worker NCN. The result is pass/fail (returns 0 or non-zero):
ncn# /opt/cray/csm/scripts/hms_verification/verify_hsm_discovery.py
Ideally, the output appears as follows. If there are mismatches, they are displayed in the appropriate section of the output. Refer to 2.3.1 Interpreting results and 2.3.2 Known Issues below to troubleshoot any mismatched BMCs.
ncn# /opt/cray/csm/scripts/hms_verification/verify_hsm_discovery.py
HSM Cabinet Summary
===================
x1000 (Mountain)
Discovered Nodes: 50
Discovered Node BMCs: 25
Discovered Router BMCs: 32
Discovered Chassis BMCs: 8
x3000 (River)
Discovered Nodes: 23 (12 Mgmt, 7 Application, 4 Compute)
Discovered Node BMCs: 24
Discovered Router BMCs: 2
Discovered Cab PDU Ctlrs: 0
River Cabinet Checks
====================
x3000
Nodes: PASS
NodeBMCs: PASS
RouterBMCs: PASS
ChassisBMCs: PASS
CabinetPDUControllers: PASS
Mountain/Hill Cabinet Checks
============================
x1000 (Mountain)
ChassisBMCs: PASS
Nodes: PASS
NodeBMCs: PASS
RouterBMCs: PASS
The script will have an exit code of 0 if there are no failures. If there is any FAIL information displayed, the script will exit with a non-zero exit code. Failure information interpretation is described in the next section.
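To pull only the problem lines out of a long report, a filter along the following lines may help. The function name hsm_discovery_problems is hypothetical; the sketch assumes that result lines end in ": FAIL" or ": WARNING" and that each cabinet block starts with a line beginning with the cabinet xname, as in the example output.

```shell
# Hypothetical helper: read verify_hsm_discovery.py output on stdin and print
# each FAIL or WARNING result line, prefixed with the cabinet it belongs to.
hsm_discovery_problems() {
    awk '
        # Remember the most recent cabinet heading, for example "x3000".
        /^[ \t]*x[0-9]+/ { cabinet = $1 }
        /: (FAIL|WARNING)[ \t]*$/ {
            sub(/^[ \t]+/, "")
            print cabinet ": " $0
        }
    '
}

# Example use (assumed invocation):
#   /opt/cray/csm/scripts/hms_verification/verify_hsm_discovery.py | hsm_discovery_problems
```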
The Cabinet Checks output is divided into three sections:
- Summary information for each cabinet
- Detail information for River cabinets
- Detail information for Mountain/Hill cabinets
In the River section, any hardware found in SLS and not discovered by HSM is considered a failure, with the exception of PDU controllers, which are reported as a warning. Also, the BMC of one of the management NCNs (typically 'ncn-m001') will not be connected to the HSM HW network and thus will show up as not discovered and/or not having any mgmt network connection. This is treated as a warning.
In the Mountain section, only Chassis BMCs that are not discovered in HSM are considered failures. All other items (nodes, node BMCs, and router BMCs) that are not discovered are considered warnings.
Any failures need to be investigated and rectified by the admin. Any warnings should also be examined by the admin to ensure they are accurate and expected.
For each of the BMCs that show up as not being present in HSM components or Redfish Endpoints, use the following notes to determine if the issue with the BMC can be safely ignored, or if there is a legitimate issue with the BMC.
- The node BMC of 'ncn-m001' will not typically be present in HSM component data, as it is typically connected to the site network instead of the HMN network.
- Chassis Management Controllers (CMC) may show up as not being present in HSM. CMCs for Intel server blades can be ignored. A Gigabyte server blade CMC that is not found in HSM is not normal and should be investigated. If a Gigabyte CMC is expected to not be connected to the HMN network, then it can be ignored.
CMCs have xnames in the form of xXc0sSb999, where X is the cabinet and S is the rack U of the compute node chassis.
Example mismatch for the CMC of an Intel server blade:
...
ChassisBMCs/CMCs: FAIL
- x3000c0s10b999 - Not found in HSM Components; Not found in HSM Redfish Endpoints; No mgmt port connection.
...
- HPE PDUs are not supported at this time and will likely show up as not being found in HSM. They can be ignored.
Cabinet PDU Controllers have xnames in the form of xXmM, where X is the cabinet and M is the ordinal of the Cabinet PDU Controller.
Example mismatch for an HPE PDU:
...
CabinetPDUControllers: WARNING
- x3000m0 - Not found in HSM Components ; Not found in HSM Redfish Endpoints
...
- BMCs having no association with a management switch port will be annotated as such, and should be investigated. Exceptions to this are in Mountain or Hill configurations where Mountain BMCs will show this condition on SLS/HSM mismatches, which is normal.
- In Hill configurations, SLS assumes BMCs in chassis 1 and 3 are fully populated (32 Node BMCs), and in Mountain configurations SLS assumes all BMCs are fully populated (128 Node BMCs). Any non-populated BMCs will have no HSM data and will show up in the mismatch list.
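The two xname forms described in the notes above (xXc0sSb999 for CMCs and xXmM for Cabinet PDU Controllers) can be recognized with simple patterns. The function name classify_xname and its output labels are hypothetical; the sketch assumes that X, S, and M are decimal numbers, as in the examples.

```shell
# Hypothetical helper: classify a mismatched xname according to the two
# forms described above. Anything else is reported as "other".
classify_xname() {
    if printf '%s\n' "$1" | grep -Eq '^x[0-9]+c0s[0-9]+b999$'; then
        echo "CMC"
    elif printf '%s\n' "$1" | grep -Eq '^x[0-9]+m[0-9]+$'; then
        echo "CabinetPDUController"
    else
        echo "other"
    fi
}

# Example use:
#   classify_xname x3000c0s10b999
```

Whether a CMC or PDU mismatch can be ignored still depends on the hardware vendor, as described in the notes above; the classification only narrows down which note applies.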
If it was determined that the mismatch cannot be ignored, then proceed to 2.3.2 Known Issues below to troubleshoot any mismatched BMCs.
Known issues that may prevent hardware from getting discovered by Hardware State Manager:
- Air cooled hardware is not getting properly discovered with Aruba leaf switches
- HMS Discovery job not creating RedfishEndpoints in Hardware State Manager
The Software Management Services health checks are run using /usr/local/bin/cmsdev.
- The tool logs to /opt/cray/tests/cmsdev.log
- The -q (quiet) and -v (verbose) flags can be used to decrease or increase the amount of information sent to the screen.
- The same amount of data is written to the log file in either case.
The following test can be run on any Kubernetes node (any master or worker node, but not the PIT node).
ncn# /usr/local/bin/cmsdev test -q all
If all checks passed:
- The return code will be 0
- The final line of output will begin with SUCCESS
- For example:
ncn# /usr/local/bin/cmsdev test -q all
...
SUCCESS: All 7 service tests passed: bos, cfs, conman, crus, ims, tftp, vcs
ncn# echo $?
0
If one or more checks failed:
- The return code will be non-zero
- The final line of output will begin with FAILURE and will list which checks failed
- For example:
ncn# /usr/local/bin/cmsdev test -q all
...
FAILURE: 2 service tests FAILED (conman, ims), 5 passed (bos, cfs, crus, tftp, vcs)
ncn# echo $?
1
Additional test execution details can be found in /opt/cray/tests/cmsdev.log.
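Extracting the list of failed services from the final line can be sketched as below. The function name cmsdev_failed_services is hypothetical; the sketch assumes the FAILURE line has the shape shown in the example above.

```shell
# Hypothetical helper: read cmsdev output on stdin and print the
# comma-separated list of failed services from the FAILURE line, if any.
cmsdev_failed_services() {
    sed -n 's/^FAILURE: .*FAILED (\([^)]*\)).*$/\1/p'
}

# Example use (assumed invocation):
#   /usr/local/bin/cmsdev test -q all | cmsdev_failed_services
```

The function prints nothing when the run ends with a SUCCESS line.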
Included with the Cray System Management (CSM) release is a pre-built node image that can be used to validate that core CSM services are available and responding as expected. The CSM barebones image contains only the minimal set of RPMs and configuration required to boot an image and is not suitable for production usage. To run production workloads, it is suggested that an image from the Cray OS (COS) product, or similar, be used.
NOTES
- The CSM Barebones image included with the release will not successfully complete
beyond the dracut stage of the boot process. However, if the dracut stage is reached, the
boot can be considered successful and shows that the necessary CSM services needed to
boot a node are up and available.
- This inability to boot the barebones image fully will be resolved in future releases of the CSM product.
- In addition to the CSM Barebones image, the release also includes an IMS Recipe that
can be used to build the CSM Barebones image. However, the CSM Barebones recipe currently requires
RPMs that are not installed with the CSM product. The CSM Barebones recipe can be built after the
Cray OS (COS) product stream is also installed on to the system.
- In future releases of the CSM product, work will be undertaken to resolve these dependency issues.
- This procedure can be followed on any NCN or the PIT node.
- The Cray CLI must be configured on the node where this procedure is being performed. See Configure the Cray Command Line Interface for details on how to do this.
- Locate CSM Barebones Image in IMS
- Create a BOS Session Template for the CSM Barebones Image
- Find an available compute node
- Reboot the node using a BOS session template
- Watch Boot on Console
Locate the CSM Barebones image and note the etag and path fields in the output.
ncn# cray ims images list --format json | jq '.[] | select(.name | contains("barebones"))'
Expected output is similar to the following:
{
"created": "2021-01-14T03:15:55.146962+00:00",
"id": "293b1e9c-2bc4-4225-b235-147d1d611eef",
"link": {
"etag": "6d04c3a4546888ee740d7149eaecea68",
"path": "s3://boot-images/293b1e9c-2bc4-4225-b235-147d1d611eef/manifest.json",
"type": "s3"
},
"name": "cray-shasta-csm-sles15sp1-barebones.x86_64-shasta-PRODUCT_VERSION"
}
The session template below can be copied and used as the basis for the BOS Session Template. As noted below, make sure the S3 path for the manifest matches the S3 path shown in the Image Management Service (IMS).
- Create sessiontemplate.json
ncn# vi sessiontemplate.json
The session template should contain the following:
{
  "boot_sets": {
    "compute": {
      "boot_ordinal": 2,
      "etag": "etag_value_from_cray_ims_command",
      "kernel_parameters": "console=ttyS0,115200 bad_page=panic crashkernel=340M hugepagelist=2m-2g intel_iommu=off intel_pstate=disable iommu=pt ip=dhcp numa_interleave_omit=headless numa_zonelist_order=node oops=panic pageblock_order=14 pcie_ports=native printk.synchronous=y rd.neednet=1 rd.retry=10 rd.shell turbo_boost_limit=999 spire_join_token=${SPIRE_JOIN_TOKEN}",
      "network": "nmn",
      "node_roles_groups": [
        "Compute"
      ],
      "path": "path_value_from_cray_ims_command",
      "rootfs_provider": "cpss3",
      "rootfs_provider_passthrough": "dvs:api-gw-service-nmn.local:300:nmn0",
      "type": "s3"
    }
  },
  "enable_cfs": false,
  "name": "shasta-PRODUCT_VERSION-csm-bare-bones-image"
}
NOTE: The rootfs provider shown above references the dvs provider. DVS is not provided as part of the CSM distribution and is not expected to work until the COS product is installed and configured. As noted above, the barebones image is not expected to boot at this time. Work is being done to enable a fully functional and bootable barebones image in a future release of the CSM product. Until that work is complete, the use of the dvs rootfs provider is suggested.
NOTE: Be sure to replace the values of the etag and path fields with the ones you noted earlier in the cray ims images list command.
- Create the BOS session template using the following file as input:
ncn# cray bos sessiontemplate create --file sessiontemplate.json --name shasta-PRODUCT_VERSION-csm-bare-bones-image
The expected output is:
/sessionTemplate/shasta-PRODUCT_VERSION-csm-bare-bones-image
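Replacing the two placeholder values in the template can also be scripted instead of edited by hand. The following is a minimal sketch; the function name fill_session_template is hypothetical, and it assumes the template still contains the literal placeholders etag_value_from_cray_ims_command and path_value_from_cray_ims_command and that neither value contains a "|" or "&" character.

```shell
# Hypothetical helper: read the session template on stdin and write it to
# stdout with the etag ($1) and S3 path ($2) placeholders filled in.
fill_session_template() {
    local etag="$1" s3_path="$2"
    sed -e "s|etag_value_from_cray_ims_command|${etag}|" \
        -e "s|path_value_from_cray_ims_command|${s3_path}|"
}

# Example use with the values from the IMS example output above
# (the input file name is an assumption):
#   fill_session_template 6d04c3a4546888ee740d7149eaecea68 \
#       s3://boot-images/293b1e9c-2bc4-4225-b235-147d1d611eef/manifest.json \
#       < sessiontemplate.in.json > sessiontemplate.json
```

The "|" delimiter is used in the sed expressions because the S3 path contains "/" characters.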
ncn# cray hsm state components list --role Compute --enabled true
Example output:
[[Components]]
ID = "x3000c0s17b1n0"
Type = "Node"
State = "On"
Flag = "OK"
Enabled = true
Role = "Compute"
NID = 1
NetType = "Sling"
Arch = "X86"
Class = "River"
[[Components]]
ID = "x3000c0s17b2n0"
Type = "Node"
State = "On"
Flag = "OK"
Enabled = true
Role = "Compute"
NID = 2
NetType = "Sling"
Arch = "X86"
Class = "River"
If it is noticed that compute nodes are missing from Hardware State Manager, refer to 2.3.2 Known Issues to troubleshoot any Node BMCs that have not been discovered.
Choose a node from those listed and set XNAME to its ID. In this example, x3000c0s17b2n0:
ncn# export XNAME=x3000c0s17b2n0
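Picking the ID out of the component listing can be sketched as a filter. The function name component_ids is hypothetical; the sketch assumes the TOML-style output shown above, where each component has a line of the form ID = "xname".

```shell
# Hypothetical helper: read `cray hsm state components list` output on stdin
# and print one component ID (xname) per line.
component_ids() {
    awk -F'"' '/^ID = / { print $2 }'
}

# Example use (assumed invocation; picks the first listed compute node):
#   export XNAME=$(cray hsm state components list --role Compute --enabled true | component_ids | head -n 1)
```

Any listed node can be chosen; taking the first one is only a convenience.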
Create a BOS session to reboot the chosen node using the BOS session template that was created:
ncn# cray bos session create --template-uuid shasta-PRODUCT_VERSION-csm-bare-bones-image --operation reboot --limit $XNAME
Expected output looks similar to the following:
limit = "x3000c0s17b2n0"
operation = "reboot"
templateUuid = "shasta-PRODUCT_VERSION-csm-bare-bones-image"
[[links]]
href = "/v1/session/8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1"
jobId = "boa-8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1"
rel = "session"
type = "GET"
[[links]]
href = "/v1/session/8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1/status"
rel = "status"
type = "GET"
See Manage Node Consoles for information on how to connect to the node's console (and for instructions on how to close it later).
The boot may take up to 10 or 15 minutes. The image being booted does not support a complete boot, so the node will not boot fully into an operating system. This test is merely to verify that the CSM services needed to boot a node are available and working properly.
This boot test is considered successful if the boot reaches the dracut stage. You know this has happened if the console output has something similar to the following somewhere within the final 20 lines of its output:
[ 7.876909] dracut: FATAL: Don't know how to handle 'root=craycps-s3:s3://boot-images/e3ba09d7-e3c2-4b80-9d86-0ee2c48c2214/rootfs:c77c0097bb6d488a5d1e4a2503969ac0-27:dvs:api-gw-service-nmn.local:300:nmn0'
[ 7.898169] dracut: Refusing to continue
NOTE: As long as the preceding text is found near the end of the console output, the test is considered successful. It is normal (and not indicative of a test failure) to see something similar to the following at the very end of the console output:
Starting Dracut Emergency Shell...
[ 11.591948] device-mapper: uevent: version 1.0.3
[ 11.596657] device-mapper: ioctl: 4.40.0-ioctl (2019-01-18) initialised: dm-devel@redhat.com
Warning: dracut: FATAL: Don't know how to handle
Press Enter for maintenance
(or press Control-D to continue):
After the node has reached this point, close the console session. The test is complete.
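The success criterion above can be checked mechanically once the console output is available as text. The following is a minimal sketch under that assumption; how the console output is captured is not covered here. The name boot_reached_dracut and the marker string chosen for matching are illustrative choices, not part of CSM.

```python
# Text that the console prints when the boot reaches the dracut stage.
DRACUT_MARKER = "dracut: FATAL: Don't know how to handle"


def boot_reached_dracut(console_text, tail_lines=20):
    """Return True if the dracut marker appears in the final tail_lines lines."""
    lines = console_text.splitlines()
    return any(DRACUT_MARKER in line for line in lines[-tail_lines:])


# Made-up console excerpt; the root= value is a shortened placeholder.
sample = "\n".join(
    ["[    1.000000] early boot message"] * 5
    + [
        "[    7.876909] dracut: FATAL: Don't know how to handle 'root=example'",
        "[    7.898169] dracut: Refusing to continue",
        "Starting Dracut Emergency Shell...",
    ]
)

print(boot_reached_dracut(sample))
```

The default of 20 lines mirrors the guidance above. The emergency shell messages that may follow the marker do not affect the result as long as the marker stays within that window.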
The procedures below use the CLI as an authorized user and run on two separate node types. The first part runs on the LiveCD node, while the second part runs on a non-LiveCD Kubernetes master or worker node. When using the CLI on either node, the CLI configuration needs to be initialized and the user running the procedure needs to be authorized.
The following procedures run on different nodes of the system and are therefore split into separate subsections.
This section can be run on any NCN or the PIT node.
-
Initialize the Cray CLI on the node where you are running this section. See Configure the Cray Command Line Interface for details on how to do this.
-
Basic UAS installation is validated using the following:
1.
ncn# cray uas mgr-info list
Expected output looks similar to the following:
service_name = "cray-uas-mgr"
version = "1.11.5"
This example output shows that UAS is installed and running version 1.11.5.
2.
ncn# cray uas list
Expected output looks similar to the following:
results = []
This example output shows that there are no currently running UAIs. It is possible, if someone else has been using the UAS, that there could be UAIs in the list. That is acceptable too from a validation standpoint.
-
Verify that the pre-made UAI images are registered with UAS:
ncn# cray uas images list
Expected output looks similar to the following:
default_image = "dtr.dev.cray.com/cray/cray-uai-sles15sp1:latest"
image_list = [ "dtr.dev.cray.com/cray/cray-uai-sles15sp1:latest",]
This example output shows that the pre-made end-user UAI image (cray/cray-uai-sles15sp1:latest) is registered with UAS. This does not necessarily mean this image is installed in the container image registry, but it is configured for use. If other UAI images have been created and registered, they may also show up here, which is acceptable.
IMPORTANT:
If you are upgrading CSM and your site does not use UAIs, skip UAS and UAI validation. If you do use UAIs, there are products that configure UAS like Cray Analytics and Cray Programming Environment. These must be working correctly with UAIs and should be validated and corrected (the procedures for this are beyond the scope of this document) prior to validating UAS and UAI. Failures in UAI creation that result from incorrect or incomplete installation of these products will generally take the form of UAIs stuck in 'waiting' state trying to set up volume mounts. See the UAI Troubleshooting section for more information.
This procedure must run on a master or worker node (not the PIT node and not ncn-w001) on the system. (It is also possible to do this from an external host, but the procedure for that is not covered here.)
-
Initialize the Cray CLI on the node where you are running this section. See Configure the Cray Command Line Interface for details on how to do this.
-
Verify that a UAI can be created:
ncn# cray uas create --publickey ~/.ssh/id_rsa.pub
Expected output looks similar to the following:
uai_connect_string = "ssh vers@10.16.234.10"
uai_host = "ncn-w001"
uai_img = "registry.local/cray/cray-uai-sles15sp1:latest"
uai_ip = "10.16.234.10"
uai_msg = ""
uai_name = "uai-vers-a00fb46b"
uai_status = "Pending"
username = "vers"

[uai_portmap]
The UAI has been created and is now initializing.
-
Set UAINAME to the value of the uai_name field in the previous command output (uai-vers-a00fb46b in this example):
ncn# export UAINAME=uai-vers-a00fb46b
-
Check the current status of the UAI:
ncn# cray uas list
Expected output looks similar to the following:
[[results]]
uai_age = "0m"
uai_connect_string = "ssh vers@10.16.234.10"
uai_host = "ncn-w001"
uai_img = "registry.local/cray/cray-uai-sles15sp1:latest"
uai_ip = "10.16.234.10"
uai_msg = ""
uai_name = "uai-vers-a00fb46b"
uai_status = "Running: Ready"
username = "vers"
If the uai_status field is Running: Ready, proceed to the next step. Otherwise, wait and repeat this command until that is the case. It normally should not take more than a minute or two.
-
The UAI is ready for use. Log into it with the command in the uai_connect_string field in the previous command output:
ncn# ssh vers@10.16.234.10
vers@uai-vers-a00fb46b-6889b666db-4dfvn:~>
-
Run a command on the UAI:
vers@uai-vers-a00fb46b-6889b666db-4dfvn:~> ps -afe
Expected output looks similar to the following:
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 18:51 ?        00:00:00 /bin/bash /usr/bin/uai-ssh.sh
munge       36     1  0 18:51 ?        00:00:00 /usr/sbin/munged
root        54     1  0 18:51 ?        00:00:00 su vers -c /usr/sbin/sshd -e -f /etc/uas/ssh/sshd_config -D
vers        55    54  0 18:51 ?        00:00:00 /usr/sbin/sshd -e -f /etc/uas/ssh/sshd_config -D
vers        62    55  0 18:51 ?        00:00:00 sshd: vers [priv]
vers        67    62  0 18:51 ?        00:00:00 sshd: vers@pts/0
vers        68    67  0 18:51 pts/0    00:00:00 -bash
vers       120    68  0 18:52 pts/0    00:00:00 ps -afe
-
Log out from the UAI:
vers@uai-vers-a00fb46b-6889b666db-4dfvn:~> exit
ncn#
-
Clean up the UAI.
ncn# cray uas delete --uai-list $UAINAME
Expected output looks similar to the following:
results = [ "Successfully deleted uai-vers-a00fb46b",]
If the commands ran with similar results, then the basic functionality of the UAS and UAI is working.
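The wait-and-repeat status check in the procedure above can be expressed as a polling loop. The following is a minimal sketch, not part of UAS. The callable get_status is hypothetical and is assumed to return the current uai_status string, for example by running cray uas list and extracting that field. The 180 second timeout and 5 second interval are assumptions based on the "minute or two" guidance above.

```python
import time


def wait_for_ready(get_status, timeout_s=180, interval_s=5, sleep=time.sleep):
    """Poll get_status() until it returns 'Running: Ready' or the timeout elapses.

    Returns True if the UAI became ready, False if the timeout was reached.
    """
    waited = 0
    while True:
        if get_status() == "Running: Ready":
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Demonstration with canned statuses and no real sleeping.
statuses = iter(["Pending", "Pending", "Running: Ready"])
print(wait_for_ready(lambda: next(statuses), sleep=lambda s: None))
```

Passing the sleep function as a parameter keeps the loop easy to exercise without waiting. In real use the default time.sleep applies.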
The following subsections include common failure modes seen with UAS / UAI operations and how to resolve them.
An error will be returned when running CLI commands if the user is not logged in as a valid Keycloak user or is accidentally using the CRAY_CREDENTIALS environment variable. This variable is set regardless of the user credentials being used.
For example:
ncn# cray uas list
The symptom of this problem is output similar to the following:
Usage: cray uas list [OPTIONS]
Try 'cray uas list --help' for help.
Error: Bad Request: Token not valid for UAS. Attributes missing: ['gidNumber', 'loginShell', 'homeDirectory', 'uidNumber', 'name']
Fix this by logging in as a real user (someone with actual Linux credentials) and making sure that CRAY_CREDENTIALS is unset.
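Whether CRAY_CREDENTIALS is set in the current environment can be checked before running the UAS commands. The following is a minimal sketch. The function name cray_credentials_override is hypothetical, and the check only inspects the environment; it does not validate the Keycloak login itself.

```python
import os


def cray_credentials_override(env=None):
    """Return the value of CRAY_CREDENTIALS if set and non-empty, else None."""
    env = os.environ if env is None else env
    return env.get("CRAY_CREDENTIALS") or None


if cray_credentials_override():
    print("CRAY_CREDENTIALS is set; unset it before running cray uas commands")
else:
    print("CRAY_CREDENTIALS is not set")
```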
When running CLI commands, a Keycloak error may be returned.
For example:
ncn# cray uas list
The symptom of this problem is output similar to the following:
Usage: cray uas list [OPTIONS]
Try 'cray uas list --help' for help.
Error: Internal Server Error: An error was encountered while accessing Keycloak
One possible cause is that the wrong hostname was used to reach the API gateway. To check this, re-run the CLI initialization steps above and try again. There may also be a problem with the Istio service mesh inside the system. Troubleshooting this is beyond the scope of this section, but there may be useful information in the UAS pod logs in Kubernetes. There are generally two UAS pods, so the user may need to look at logs from both to find the specific failure. The logs tend to have a very large number of GET events listed as part of the liveness checking.
The following shows an example of looking at UAS logs effectively (this example shows only one UAS manager; normally there would be two):
-
Determine the pod name of the uas-mgr pod
ncn# kubectl get po -n services | grep "^cray-uas-mgr" | grep -v etcd
Expected output looks similar to:
cray-uas-mgr-6bbd584ccb-zg8vx 2/2 Running 0 12d
-
Set PODNAME to the name of the manager pod whose logs are being viewed.
ncn# export PODNAME=cray-uas-mgr-6bbd584ccb-zg8vx
-
View the last 25 log entries of the cray-uas-mgr container in that pod, excluding GET events:
ncn# kubectl logs -n services $PODNAME cray-uas-mgr | grep -v 'GET ' | tail -25
Example output:
2021-02-08 15:32:41,211 - uas_mgr - INFO - getting deployment uai-vers-87a0ff6e in namespace user
2021-02-08 15:32:41,225 - uas_mgr - INFO - creating deployment uai-vers-87a0ff6e in namespace user
2021-02-08 15:32:41,241 - uas_mgr - INFO - creating the UAI service uai-vers-87a0ff6e-ssh
2021-02-08 15:32:41,241 - uas_mgr - INFO - getting service uai-vers-87a0ff6e-ssh in namespace user
2021-02-08 15:32:41,252 - uas_mgr - INFO - creating service uai-vers-87a0ff6e-ssh in namespace user
2021-02-08 15:32:41,267 - uas_mgr - INFO - getting pod info uai-vers-87a0ff6e
2021-02-08 15:32:41,360 - uas_mgr - INFO - No start time provided from pod
2021-02-08 15:32:41,361 - uas_mgr - INFO - getting service info for uai-vers-87a0ff6e-ssh in namespace user
127.0.0.1 - - [08/Feb/2021 15:32:41] "POST /v1/uas?imagename=registry.local%2Fcray%2Fno-image-registered%3Alatest HTTP/1.1" 200 -
2021-02-08 15:32:54,455 - uas_auth - INFO - UasAuth lookup complete for user vers
2021-02-08 15:32:54,455 - uas_mgr - INFO - UAS request for: vers
2021-02-08 15:32:54,455 - uas_mgr - INFO - listing deployments matching: host None, labels uas=managed,user=vers
2021-02-08 15:32:54,484 - uas_mgr - INFO - getting pod info uai-vers-87a0ff6e
2021-02-08 15:32:54,596 - uas_mgr - INFO - getting service info for uai-vers-87a0ff6e-ssh in namespace user
2021-02-08 15:40:25,053 - uas_auth - INFO - UasAuth lookup complete for user vers
2021-02-08 15:40:25,054 - uas_mgr - INFO - UAS request for: vers
2021-02-08 15:40:25,054 - uas_mgr - INFO - listing deployments matching: host None, labels uas=managed,user=vers
2021-02-08 15:40:25,085 - uas_mgr - INFO - getting pod info uai-vers-87a0ff6e
2021-02-08 15:40:25,212 - uas_mgr - INFO - getting service info for uai-vers-87a0ff6e-ssh in namespace user
2021-02-08 15:40:51,210 - uas_auth - INFO - UasAuth lookup complete for user vers
2021-02-08 15:40:51,210 - uas_mgr - INFO - UAS request for: vers
2021-02-08 15:40:51,210 - uas_mgr - INFO - listing deployments matching: host None, labels uas=managed,user=vers
2021-02-08 15:40:51,261 - uas_mgr - INFO - deleting service uai-vers-87a0ff6e-ssh in namespace user
2021-02-08 15:40:51,291 - uas_mgr - INFO - delete deployment uai-vers-87a0ff6e in namespace user
127.0.0.1 - - [08/Feb/2021 15:40:51] "DELETE /v1/uas?uai_list=uai-vers-87a0ff6e HTTP/1.1" 200 -
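When the logs have already been saved to a file, the same filtering as the grep -v 'GET ' | tail -25 pipeline above can be applied offline. The following is a minimal sketch under that assumption. The function name filter_log is hypothetical, and the GET lines in the sample are made up for illustration.

```python
def filter_log(text, exclude="GET ", tail=25):
    """Drop lines containing `exclude`, then keep only the last `tail` lines."""
    kept = [line for line in text.splitlines() if exclude not in line]
    return kept[-tail:]


# Two real-looking entries from the example above, interleaved with
# made-up liveness-check GET lines.
sample = "\n".join([
    '127.0.0.1 - - [08/Feb/2021 15:32:40] "GET /example HTTP/1.1" 200 -',
    "2021-02-08 15:32:41,211 - uas_mgr - INFO - getting deployment uai-vers-87a0ff6e in namespace user",
    '127.0.0.1 - - [08/Feb/2021 15:32:42] "GET /example HTTP/1.1" 200 -',
    "2021-02-08 15:32:41,225 - uas_mgr - INFO - creating deployment uai-vers-87a0ff6e in namespace user",
])

for line in filter_log(sample):
    print(line)
```

As with the shell pipeline, filtering happens before the tail is taken, so the result holds up to 25 non-GET entries rather than whatever remains of the last 25 raw lines.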
When listing or describing a UAI, an error in the uai_msg field may be returned. For example:
ncn# cray uas list
The output may look similar to the following:
[[results]]
uai_age = "0m"
uai_connect_string = "ssh vers@10.103.13.172"
uai_host = "ncn-w001"
uai_img = "registry.local/cray/cray-uai-sles15sp1:latest"
uai_ip = "10.103.13.172"
uai_msg = "ErrImagePull"
uai_name = "uai-vers-87a0ff6e"
uai_status = "Waiting"
username = "vers"
This means the pre-made end-user UAI image is not in the local registry (or whatever registry it is being pulled from; see the uai_img value for details). To correct this, locate and push/import the image to the registry.
Various packages install volumes in the UAS configuration. All of those volumes must also have the underlying resources available, sometimes on the host node where the UAI is running and sometimes from within Kubernetes. If a UAI remains stuck with ContainerCreating in the uai_msg field for an extended time, missing volume resources are a likely cause. UAIs run in the user Kubernetes namespace and are pods that can be examined using kubectl describe.
-
Locate the pod.
ncn# kubectl get po -n user | grep <uai-name>
-
Investigate the problem using the pod name from the previous step.
ncn# kubectl describe pod -n user <pod-name>
If volumes are missing they will show up in the Events: section of the output. Other problems may show up there as well. The names of the missing volumes or other issues should indicate what needs to be fixed to make the UAI run.
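The two uai_msg failure modes described in this section can be summarized as a small lookup. The following is a minimal sketch. The names HINTS and diagnose_uai are hypothetical, the input is assumed to be one UAI entry from cray uas list already converted to a dictionary, and only the two messages discussed above are covered.

```python
# Hints paraphrased from the troubleshooting text above.
HINTS = {
    "ErrImagePull": (
        "UAI image is not in the registry named in uai_img; "
        "locate and push/import the image to the registry"
    ),
    "ContainerCreating": (
        "if this persists, a volume resource is likely missing; "
        "check the Events: section of kubectl describe pod -n user <pod-name>"
    ),
}


def diagnose_uai(uai):
    """Return a hint for a known uai_msg value, or None if there is none."""
    return HINTS.get(uai.get("uai_msg", ""))


print(diagnose_uai({"uai_name": "uai-vers-87a0ff6e", "uai_msg": "ErrImagePull"}))
```

A result of None means only that the message is not one of the two covered here, not that the UAI is healthy.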