Troubleshooting Guide

This section describes how to resolve reported issues. It is updated continuously.

SPGW-C or UPF is not working

Check the logs with the commands below:

$ kubectl logs spgwc-0 -n riab -c spgwc # for SPGW-C log
$ kubectl logs upf-0 -n riab -c bess # for UPF log

If the log reports an unsupported CPU type or a missing CPU flag (e.g., AES), check the CPU microarchitecture. RiaB requires an Intel Haswell or more recent CPU microarchitecture. If the CPU meets this requirement, build the SPGW-C or UPF image on the machine where RiaB will run.
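
A quick way to inspect the CPU flags is sketched below, assuming a Linux host that exposes `/proc/cpuinfo`. The helper name `has_cpu_flag` is hypothetical, and treating `avx2` as a rough indicator of Haswell or newer is an assumption (AVX2 first appeared with Haswell, but some low-end parts omit it), so this is a hint rather than a definitive check.

```shell
#!/bin/sh
# has_cpu_flag FLAG [CPUINFO_FILE]
# Succeeds if FLAG appears as a whole word in the first "flags" line.
# Hypothetical helper; CPUINFO_FILE defaults to /proc/cpuinfo.
has_cpu_flag() {
    flag="$1"
    cpuinfo="${2:-/proc/cpuinfo}"
    # Split the flags line into one token per line, then match exactly.
    grep -m1 '^flags' "$cpuinfo" | tr ' \t' '\n\n' | grep -qx "$flag"
}

# aes is the flag named in the text; avx2 is an assumed Haswell proxy.
for f in aes avx2; do
    if has_cpu_flag "$f"; then
        echo "$f: present"
    else
        echo "$f: MISSING"
    fi
done
```

If a flag is reported as missing on a machine that should support it, the flag may be disabled in the BIOS or hidden by a hypervisor.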

To build SPGW-C, first clone the SPGW-C repository on that machine with git clone https://github.com/omec-project/spgw. Then, set the following line in the Makefile:

DOCKER_BUILD_ARGS        ?= --build-arg RTE_MACHINE='native'

Then, run make in the spgw directory.

Likewise, to build the UPF image, clone the UPF repository with git clone https://github.com/omec-project/upf-epc. Then, set the following line in the Makefile:

CPU                      ?= native

Then, run make in the upf-epc directory.

After building the images, modify the override values YAML file (i.e., sdran-in-a-box-values.yaml) to include the following:

images:
  tags:
    spgwc: <spgwc_image_tag>
    bess: <bess_upf_image_tag>
    pfcpiface: <pfcpiface_upf_image_tag>
  pullPolicy: IfNotPresent

Then, run the commands below:

$ cd /path/to/sdran-in-a-box
$ make reset-test
# after all OMEC pods are deleted, run make again
$ make
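
After the redeployment, it can help to confirm that the pods actually run the rebuilt images. The following is a minimal sketch, assuming `kubectl` is configured for the RiaB cluster; `pod_images` is a hypothetical helper name, and the pod names and `riab` namespace are taken from the log commands earlier in this section.

```shell
#!/bin/sh
# pod_images POD [NAMESPACE]
# Prints the container images a pod is running, space-separated.
# Hypothetical helper; NAMESPACE defaults to riab.
pod_images() {
    pod="$1"
    ns="${2:-riab}"
    kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[*].image}'
    echo
}
```

For example, `pod_images spgwc-0` and `pod_images upf-0` should list the tags written into sdran-in-a-box-values.yaml.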

ETCD is not working

Sometimes, the following output appears when building RiaB:

TASK [etcd : Configure | Ensure etcd is running] ***********************************************************************
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).

If this happens, run the commands below:

$ sudo systemctl restart docker
$ cd /path/to/sdran-in-a-box
$ make

Atomix controllers cannot be deleted/reset

Sometimes, the Atomix controllers cannot be deleted when running make reset-test; the script may get stuck while deleting the Atomix controller pods, with output like the following:

rm -f /tmp/build/milestones/oai-enb-cu
rm -f /tmp/build/milestones/oai-enb-du
rm -f /tmp/build/milestones/oai-ue
helm delete -n riab sd-ran || true
release "sd-ran" uninstalled
cd /tmp/build/milestones; rm -f ric
kubectl delete -f https://raw.githubusercontent.com/atomix/kubernetes-controller/master/deploy/atomix-controller.yaml || true
customresourcedefinition.apiextensions.k8s.io "databases.cloud.atomix.io" deleted
customresourcedefinition.apiextensions.k8s.io "partitions.cloud.atomix.io" deleted
customresourcedefinition.apiextensions.k8s.io "members.cloud.atomix.io" deleted
customresourcedefinition.apiextensions.k8s.io "primitives.cloud.atomix.io" deleted
serviceaccount "atomix-controller" deleted
clusterrole.rbac.authorization.k8s.io "atomix-controller" deleted
clusterrolebinding.rbac.authorization.k8s.io "atomix-controller" deleted
service "atomix-controller" deleted
deployment.apps "atomix-controller" deleted

If the script stops here, run:

# Press Ctrl+C first to stop the Makefile script if make reset-test is stuck. Then run the commands below.
$ make reset-atomix # Manually delete Atomix controller pods
$ make atomix # Manually install Atomix controller pods
$ make reset-test # Then, make reset-test again
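
The manual recovery above could be automated roughly as follows. This is only a sketch under stated assumptions: `reset_with_fallback` is a hypothetical helper, the 600-second limit is arbitrary, the `timeout` utility (GNU coreutils or BusyBox) must be installed, and sending SIGINT only approximates pressing Ctrl+C. It also triggers the recovery when make reset-test fails for any other reason.

```shell
#!/bin/sh
# reset_with_fallback [SECONDS]
# Runs "make reset-test"; if it fails or runs longer than SECONDS, it is
# interrupted and the manual Atomix recovery sequence is run instead.
# Hypothetical helper; MAKE may be overridden, 600 s is an assumed limit.
reset_with_fallback() {
    limit="${1:-600}"
    make_cmd="${MAKE:-make}"
    if timeout -s INT "$limit" "$make_cmd" reset-test; then
        return 0
    fi
    echo "reset-test failed or timed out; resetting Atomix manually" >&2
    "$make_cmd" reset-atomix &&
        "$make_cmd" atomix &&
        "$make_cmd" reset-test
}
```

Run it from the sdran-in-a-box directory, for example `reset_with_fallback 900` to allow 15 minutes before falling back.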

Alternatively, the following error sometimes appears when deploying RiaB:

Error from server (AlreadyExists): error when creating "https://raw.githubusercontent.com/atomix/kubernetes-controller/master/deploy/atomix-controller.yaml": object is being deleted: customresourcedefinitions.apiextensions.k8s.io "members.cloud.atomix.io" already exists
Makefile:231: recipe for target '/tmp/build/milestones/atomix' failed

In this case, delete Atomix manually with the command make atomix || make reset-atomix, and then resume deploying RiaB.
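
Because the error indicates that an Atomix CRD is still being deleted, it may be useful to check whether any of them remain before resuming. The sketch below assumes `kubectl` access to the cluster; `atomix_crds` and `wait_atomix_crds_gone` are hypothetical helpers, and the default of 30 tries at 2-second intervals is an arbitrary choice.

```shell
#!/bin/sh
# atomix_crds: list Atomix CRDs still registered in the cluster.
# Matches the cloud.atomix.io group seen in the output above.
atomix_crds() {
    kubectl get crd -o name | grep 'cloud\.atomix\.io' || true
}

# wait_atomix_crds_gone [TRIES]
# Polls until no Atomix CRD remains; fails after TRIES attempts.
# POLL_INTERVAL (seconds between polls) defaults to 2.
wait_atomix_crds_gone() {
    tries="${1:-30}"
    while [ "$tries" -gt 0 ]; do
        if [ -z "$(atomix_crds)" ]; then
            return 0
        fi
        sleep "${POLL_INTERVAL:-2}"
        tries=$((tries - 1))
    done
    return 1
}
```

For example, `wait_atomix_crds_gone && make` resumes the deployment only once the old CRDs have disappeared.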

Pod onos-consensus-db-1-0 initialization failed

On Ubuntu 20.04 (kernel 5.4.0-65-generic), the Kubernetes pod named onos-consensus-db-1-0 might fail because of a bug that occurs when Go and Alpine are used together (e.g., docker-library/golang#320).

The failure appears in the output of kubectl logs -n riab onos-consensus-db-1-0 as:

runtime: mlock of signal stack failed: 12
runtime: increase the mlock limit (ulimit -l) or
runtime: update your kernel to 5.3.15+, 5.4.2+, or 5.5+
fatal error: mlock failed

This pod uses the Docker image atomix/raft-storage-node:v0.5.3, which is tagged from a build of the image atomix/dragonboat-raft-storage-node:latest, available at https://github.com/atomix/dragonboat-raft-storage-node.

A quick fix (allowing the pod to lock an unlimited amount of memory) is to clone the repository https://github.com/atomix/dragonboat-raft-storage-node and change the Makefile:

# Before change
image: build
	docker build . -f build/dragonboat-raft-storage-node/Dockerfile -t atomix/dragonboat-raft-storage-node:${RAFT_STORAGE_NODE_VERSION}

# After change: unlimited maximum locked-in-memory address space
image: build
	docker build --ulimit memlock=-1 . -f build/dragonboat-raft-storage-node/Dockerfile -t atomix/dragonboat-raft-storage-node:${RAFT_STORAGE_NODE_VERSION}

Then, run make image in the source directory of that repository, and tag the built image:

docker tag atomix/dragonboat-raft-storage-node:latest  atomix/raft-storage-node:v0.5.3
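
Before redeploying, it may be worth confirming that the retagged image is present in the local Docker image store. A minimal sketch, assuming the Docker CLI is available on the host; `image_exists` is a hypothetical helper name.

```shell
#!/bin/sh
# image_exists IMAGE
# Succeeds if IMAGE (name:tag) is present in the local Docker image store.
# Hypothetical helper built on "docker image inspect".
image_exists() {
    docker image inspect "$1" >/dev/null 2>&1
}
```

For example, `image_exists atomix/raft-storage-node:v0.5.3 && echo "image ready"` prints a confirmation only when the tag exists locally.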

After that, run the RiaB setup again.

Other issues?

If you see any other issue, please contact the ONF SD-RAN team. Issue reports from users are very welcome. In most cases, redeploying with make reset-test followed by make [option] resolves the issue.