<Arktos Mizar Integration> Enable adding worker nodes into RP clusters in a scale-out 1 X 1 environment #1230

Conversation


@q131172019 q131172019 commented Nov 23, 2021

What type of PR is this?
/kind documentation
/kind feature

What this PR does / why we need it:
As part of the Arktos Mizar Integration project, the Mizar team requested that the Arktos team provide a scale-out environment in which worker nodes can be added into the RP cluster.

This PR enables adding worker nodes into the RP cluster in a scale-out 1 TP X 1 RP environment. It has been tested in the following two scale-out 1 TP X 1 RP (1 master node + 1 worker node) environments, which is a prerequisite for the scale-out 2 TPs X 2 RPs environment. All nodes are AWS EC2 t2.2xlarge instances running Ubuntu 18.04.

TP1: 172.31.3.192
RP1: 172.31.5.191
Worker node-1: 172.31.4.110
Worker node-2: 172.31.29.26

TP2: 172.31.5.56
RP1: 172.31.13.237
Worker node: 172.31.2.149

The script ./hack/arktos-worker-up.sh, run on the worker node, starts kubelet and kube-proxy and installs flannel in process mode on the worker node so that the node can successfully join the RP cluster.
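For reviewers who want the join flow at a glance, here is a minimal sketch of what starting the worker components in process mode can look like. This is not the contents of ./hack/arktos-worker-up.sh; the argument handling, file locations (e.g. /tmp/arktos), and flags shown are illustrative assumptions only.

     #!/usr/bin/env bash
     # Illustrative sketch only -- not the actual ./hack/arktos-worker-up.sh.
     # Assumes the RP master IP is passed as the first argument and that kubeconfigs
     # and binaries have already been staged on the worker node (e.g. under /tmp/arktos).
     set -euo pipefail
     RP_MASTER_IP="$1"

     # Run flanneld in process mode against the RP cluster's etcd so the worker leases a pod subnet.
     sudo ./flanneld --etcd-endpoints="http://${RP_MASTER_IP}:2379" \
          --iface="$(hostname -I | awk '{print $1}')" > /tmp/flanneld.log 2>&1 &

     # Run kubelet and kube-proxy against the RP master's API server.
     sudo ./kubelet --kubeconfig=/tmp/arktos/kubelet.kubeconfig > /tmp/kubelet.log 2>&1 &
     sudo ./kube-proxy --kubeconfig=/tmp/arktos/kube-proxy.kubeconfig > /tmp/kube-proxy.log 2>&1 &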

The code in this PR also works in a scale-up environment (master node + n worker nodes) and has been tested in the multi-worker scale-up environment below.

Master node : 172.31.22.85
Worker node-1: 172.31.29.128
Worker node-2: 172.31.24.185
Worker node-3: 172.21.5.205

Which issue(s) this PR fixes:
N/A

Special notes for your reviewer:
In the scale-out 1 TP X 1 RP cluster, when the 2nd worker node attempts to join the RP cluster, you will currently see the following errors in its flannel log /tmp/flanneld.log:

E1203 06:00:52.242285 1298 route_network.go:115] Error adding route to 10.244.0.0/24 via 172.31.5.191 dev index 2: network is unreachable
I1203 06:00:52.242309 1298 route_network.go:86] Subnet added: 10.244.1.0/24 via 172.31.4.110
E1203 06:00:52.242497 1298 route_network.go:115] Error adding route to 10.244.1.0/24 via 172.31.4.110 dev index 2: network is unreachable

In the scale-up cluster, when the 3rd worker node attempts to join the cluster, you will currently see the following errors in its flannel log /tmp/flanneld.log:

E1203 04:15:33.103839 18746 route_network.go:115] Error adding route to 10.244.0.0/24 via 172.31.22.85 dev index 2: network is unreachable
I1203 04:15:33.103858 18746 route_network.go:86] Subnet added: 10.244.1.0/24 via 172.31.29.128
E1203 04:15:33.103956 18746 route_network.go:115] Error adding route to 10.244.1.0/24 via 172.31.29.128 dev index 2: network is unreachable
I1203 04:15:33.103975 18746 route_network.go:86] Subnet added: 10.244.2.0/24 via 172.31.24.185
E1203 04:15:33.104110 18746 route_network.go:115] Error adding route to 10.244.2.0/24 via 172.31.24.185 dev index 2: network is unreachable

Possible network limits of the AWS EC2 instance type t2.2xlarge need further investigation.
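For anyone digging into the "network is unreachable" errors above, a small triage sketch is below. It is a generic flannel/route check, not a procedure from this PR; the interface name eth0 and the reading that the failing route add requires an on-link next hop are illustrative assumptions.

     # Subnet leased to this node by flannel (standard file written by flanneld).
     cat /run/flannel/subnet.env

     # Interfaces and existing routes; "network is unreachable" typically means the kernel
     # cannot reach the next-hop gateway (e.g. 172.31.5.191) as on-link via the named device.
     ip -o link show
     ip route show

     # Reproduce the failing route add by hand to see the kernel error directly (illustrative).
     sudo ip route add 10.244.0.0/24 via 172.31.5.191 dev eth0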

Does this PR introduce a user-facing change?:
YES.

======== Scale out environment ===================
0.  Follow the procedure at https://github.com/q131172019/arktos/blob/CarlXie_singleNodeArktosCluster/docs/setup-guide/setup-dev-env.md to set up the development environment on each node.

1.  Follow the procedure in [set up scale-out 1 X 1 environment](https://github.com/CentaurusInfra/arktos/blob/master/docs/setup-guide/scale-out-local-dev-setup.md) and run the following scripts to automatically start TP1 and RP1:
1.1) On TP1: 
        ./hack/scale-out-1x1-rp1-multi-nodes/scale-out-TP1-node.sh <RP1_IP>
1.2) On RP1: 
        ./hack/scale-out-1x1-rp1-multi-nodes/scale-out-RP1-node.sh <TP1_IP>

2. On each worker node that will join the RP1 cluster, run the following script to join automatically:
     ./hack/scale-out-1x1-rp1-multi-nodes/scale-out-RP1-worker-node-join.sh <RP1_IP>

3.  On the RP1 node, check the node status, and check the flannel log on each node of the RP1 cluster:
     ./cluster/kubectl.sh get nodes
      cat /tmp/flanneld.log

4. Test whether the nginx application can be deployed successfully (a cross-node connectivity sketch follows after step 5):
     ./cluster/kubectl.sh run nginx --image=nginx --replicas=10
     ./cluster/kubectl.sh get pod -n default -o wide
     ./cluster/kubectl.sh delete deployment/nginx

5. Follow the steps at https://github.com/CentaurusInfra/arktos/issues/1143 to perform end-to-end verification of a service in the scale-out cluster.
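Referenced from step 4: a hedged sketch of checking cross-node pod connectivity once the nginx pods are running. The run=nginx label is what older versions of kubectl run apply to the deployment's pods, and the pod indices are placeholders; adjust if your pods carry different labels.

     # Confirm the pods are spread across the RP1 master and worker nodes.
     ./cluster/kubectl.sh get pod -n default -l run=nginx -o wide

     # Curl one pod from another; pick two pods scheduled on different nodes.
     POD1=$(./cluster/kubectl.sh get pod -n default -l run=nginx -o jsonpath='{.items[0].metadata.name}')
     POD2_IP=$(./cluster/kubectl.sh get pod -n default -l run=nginx -o jsonpath='{.items[1].status.podIP}')
     ./cluster/kubectl.sh exec -ti "$POD1" -- curl -s "http://${POD2_IP}"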

======== Scale up environment ===================
0.  Follow the procedure at https://github.com/q131172019/arktos/blob/CarlXie_singleNodeArktosCluster/docs/setup-guide/setup-dev-env.md to set up the development environment on each node.

1.  On the master node, run the following script to start a single-node scale-up cluster:
     ./hack/scale-up-multi-nodes/scale-up-master-node.sh

2.  On each worker node, run the following script to join the scale-up cluster:
     ./hack/scale-up-multi-nodes/scale-up-worker-node-join.sh <MASTER_NODE_IP>

3.  On the master node, check the node status, and check the flannel log /tmp/flanneld.log on each node:
     ./cluster/kubectl.sh get nodes
      cat /tmp/flanneld.log

4. Test whether the nginx application can be deployed successfully (a pod-IP helper sketch follows after step 5):
     ./cluster/kubectl.sh run nginx --image=nginx --replicas=10
     ./cluster/kubectl.sh get pod -n default -o wide
     ./cluster/kubectl.sh exec -ti <1st pod> -- curl <IP of another nginx pod>
     ./cluster/kubectl.sh exec -ti <2nd pod> -- curl <IP of another nginx pod>
     ./cluster/kubectl.sh delete deployment/nginx

5. Follow the steps at https://github.com/CentaurusInfra/arktos/issues/1142 to perform end-to-end verification of a service in the scale-up cluster.
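Referenced from step 4: a small helper sketch for finding the pod names and IPs used in the curl commands, so they do not have to be copied by hand. The run=nginx label and the custom-columns fields are assumptions based on older kubectl run behavior.

     # One view of nginx pod name, IP, and node for picking the curl source and target.
     ./cluster/kubectl.sh get pod -n default -l run=nginx \
          -o custom-columns=NAME:.metadata.name,IP:.status.podIP,NODE:.spec.nodeName

     FIRST_POD=$(./cluster/kubectl.sh get pod -n default -l run=nginx -o jsonpath='{.items[0].metadata.name}')
     SECOND_IP=$(./cluster/kubectl.sh get pod -n default -l run=nginx -o jsonpath='{.items[1].status.podIP}')
     ./cluster/kubectl.sh exec -ti "$FIRST_POD" -- curl -s "http://${SECOND_IP}"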

@centaurus-cloud-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign dingyin
You can assign the PR to them by writing /assign @dingyin in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@q131172019 q131172019 closed this Nov 23, 2021
@q131172019 q131172019 reopened this Nov 23, 2021

Sindica commented Nov 23, 2021

Questions about the setup instructions:

  1. The setup instructions should be a .md file checked in with this commit so we can find them easily later.
  2. Step 2: "On worker node to join into RP cluster, copy the following files from RP node to the directory /tmp/arktos". Why /tmp/arktos? Can we use the same config folder for master and worker?
  3. Step 3: "Clean up the directories - /opt/cni/bin and /etc/cni/net.d". Can this be incorporated into arktos-worker-up.sh?
  4. Step 5: too many manual steps. Can this be done automatically with a single flag?


q131172019 commented Nov 23, 2021

Questions about the setup instructions:

The setup instructions should be a .md file checked in with this commit so we can find them easily later.
Yes, I agree to write the .md file.

Step 2: "On worker node to join into RP cluster, copy the following files from RP node to the directory /tmp/arktos". Why >>/tmp/arktos? Can we use same config folder for master and worker?

We can use same config folder for master and worker if it does not create confusion between master and worker.

Step 3: "Clean up the directories - /opt/cni/bin and /etc/cni/net.d" Can this be incorporated into arktos-worker-up.sh?
Yes.

Step 5: too many manual steps. Can this be done automatically with a single flag?
Yes.

@q131172019 q131172019 closed this Nov 23, 2021
@q131172019 q131172019 reopened this Nov 23, 2021
@q131172019 q131172019 closed this Dec 3, 2021
@q131172019 q131172019 reopened this Dec 3, 2021
@q131172019

This PR can be closed because it has been replaced by the larger PR 1382.

@q131172019 q131172019 closed this Feb 25, 2022