
Bricks are failing to connect to the volume post gluster node reboot #1457

Open
PrasadDesala opened this issue Jan 3, 2019 · 7 comments

Labels: brick-multiplexing-issue, bug, priority: high

@PrasadDesala

Bricks are failing to connect to the volume post gluster node reboot.

Observed behavior

On a system having 102 PVCs with brick-mux enabled, I rebooted the gluster-kube1-0 pod. After some time the gluster pod came back online and reconnected to the trusted pool, but the bricks on that gluster node are failing to connect to the volume.

[root@gluster-kube1-0 /]# ps -ef | grep -i glusterfsd
root 30332 59 0 09:52 pts/3 00:00:00 grep --color=auto -i glusterfsd
[root@gluster-kube1-0 /]# glustercli volume status pvc-db2b6e88-0f29-11e9-aaf6-525400933534
Volume : pvc-db2b6e88-0f29-11e9-aaf6-525400933534
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| BRICK ID | HOST | PATH | ONLINE | PORT | PID |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| 129ac9de-9e60-4227-99df-48d7e17238f9 | gluster-kube3-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-db2b6e88-0f29-11e9-aaf6-525400933534/subvol1/brick1/brick | true | 35692 | 4034 |
| 46a34351-19a2-4fd2-b692-ea07fbe4f71d | gluster-kube1-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-db2b6e88-0f29-11e9-aaf6-525400933534/subvol1/brick2/brick | false | 0 | 0 |
| 0935a101-2e0d-4c5f-914f-0e4562602950 | gluster-kube2-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-db2b6e88-0f29-11e9-aaf6-525400933534/subvol1/brick3/brick | true | 39067 | 4115 |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
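
A brick that is merely slow to sign in should eventually flip to true in the ONLINE column. A minimal shell loop to keep polling the status (a sketch; the volume name is the one from the output above):

# Poll the volume status every 30s for up to 15 minutes and report
# whether the gluster-kube1-0 brick ever comes back online.
VOL=pvc-db2b6e88-0f29-11e9-aaf6-525400933534
for i in $(seq 1 30); do
    if glustercli volume status "$VOL" | grep gluster-kube1-0 | grep -q true; then
        echo "brick is back online"; break
    fi
    sleep 30
done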

I am continuously seeing the below messages in the glusterd2 logs:
time="2019-01-03 09:52:57.982317" level=error msg="failed to connect to brick, aborting volume profile operation" brick="6257213e-de5c-4ae5-867d-38e0fd5abc0e:/var/run/glusterd2/bricks/pvc-81d554b4-0f27-11e9-aaf6-525400933534/subvol1/brick1/brick" error="dial unix /var/run/glusterd2/e70300fdb0bea4a4.socket: connect: connection refused" reqid=63bce8cc-c403-4978-8137-bb3ae361b496 source="[volume-profile.go:246:volumes.txnVolumeProfile]" txnid=e763af77-19f2-4935-bd02-9c65be68657a
time="2019-01-03 09:52:57.982371" level=error msg="Step failed on node." error="dial unix /var/run/glusterd2/e70300fdb0bea4a4.socket: connect: connection refused" node=6257213e-de5c-4ae5-867d-38e0fd5abc0e reqid=63bce8cc-c403-4978-8137-bb3ae361b496 source="[step.go:120:transaction.runStepFuncOnNodes]" step=volume.Profile txnid=e763af77-19f2-4935-bd02-9c65be68657a
time="2019-01-03 09:52:57.997172" level=info msg="client connected" address="10.233.64.5:48521" server=sunrpc source="[server.go:148:sunrpc.(*SunRPC).acceptLoop]" transport=tcp
time="2019-01-03 09:52:57.998020" level=error msg="registry.SearchByBrickPath() failed for brick" brick=/var/run/glusterd2/bricks/pvc-82196ac3-0f27-11e9-aaf6-525400933534/subvol1/brick1/brick error="SearchByBrickPath: port for brick /var/run/glusterd2/bricks/pvc-82196ac3-0f27-11e9-aaf6-525400933534/subvol1/brick1/brick not found" source="[rpc_prog.go:104:pmap.(*GfPortmap).PortByBrick]"
time="2019-01-03 09:52:57.998383" level=info msg="client disconnected" address="10.233.64.5:48521" server=sunrpc source="[server.go:109:sunrpc.(*SunRPC).pruneConn]"

Expected/desired behavior

Post gluster pod reboot, bricks should reconnect to the volume without any issues.

Details on how to reproduce (minimal and precise)

  1. Create a 3-node GCS system using vagrant.
  2. Create 102 PVCs with brick-mux enabled (see the sketch after this list).
  3. Reboot a gluster pod.
  4. Once the pod is back online, check glustercli volume status.
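
For step 2, a loop of roughly this shape can create the PVCs. The StorageClass name (glusterfs-csi here) is an assumption; adjust it to whatever your GCS deployment provisions:

# Create 102 1Gi PVCs against an assumed StorageClass named "glusterfs-csi".
for i in $(seq 1 102); do
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-$i
spec:
  storageClassName: glusterfs-csi
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
EOF
done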

Information about the environment:

  • Glusterd2 version used (e.g. v4.1.0 or master): v6.0-dev.97.gita6fc33c
  • Operating system used: CentOS 7.6
  • Glusterd2 compiled from sources, as a package (rpm/deb), or container:
  • Using External ETCD: (yes/no, if yes ETCD version): yes, 3.3.8
  • If container, which container image:
  • Using kubernetes, openshift, or direct install:
  • If kubernetes/openshift, is gluster running inside kubernetes/openshift or outside: kubernetes
@atinmu added the bug, GCS/1.0, and priority: high labels on Jan 3, 2019
@vpandey-RH
Contributor

@atinmu I believe this is due to a delay in brick SignIn. @PrasadDesala Can you give the bricks some more time and check after a while whether the brick still shows 0?

@PrasadDesala
Author

> @atinmu I believe this is due to a delay in brick SignIn. @PrasadDesala Can you give the bricks some more time and check after a while whether the brick still shows 0?

@vpandey-RH It's been more than 45 minutes, and the bricks are still trying to reconnect.

@vpandey-RH
Contributor

Is there any change in the number of bricks that were previously showing port 0?
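
One quick way to count them, assuming the table layout shown above (PORT is field 6 when splitting on '|'):

# Count bricks whose PORT column is still 0 across all volumes.
glustercli volume status | awk -F'|' '$6 ~ /^[[:space:]]*0[[:space:]]*$/ {n++} END {print n+0}'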

@vpandey-RH
Contributor

@PrasadDesala It seems there is no glusterfsd running on the node that was rebooted. Can you check it once?

@PrasadDesala
Author

PrasadDesala commented Jan 3, 2019

> @PrasadDesala It seems there is no glusterfsd running on the node that was rebooted. Can you check it once?

Yes, it seems the brick process is not running after the gluster node reboot, which is why the port and PID show as '0' for that node.

Below is a snippet of the volume status output for one volume.
Before node reboot:
[root@gluster-kube1-0 /]# glustercli volume status
Volume : pvc-30622ade-0f26-11e9-aaf6-525400933534
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| BRICK ID | HOST | PATH | ONLINE | PORT | PID |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| 2841d69f-8d1d-4013-bd6a-4aaea9031f9b | gluster-kube1-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick1/brick | true | 46726 | 7886 |
| 5d7814b5-3ba8-4bc0-b3ea-74fa7168c416 | gluster-kube2-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick2/brick | true | 39067 | 4115 |
| 2ea8fca7-e7e2-47e5-8f2f-8e6c399c50f4 | gluster-kube3-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick3/brick | true | 35692 | 4034 |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+

After node reboot:
[root@gluster-kube1-0 /]# glustercli volume status pvc-30622ade-0f26-11e9-aaf6-525400933534
Volume : pvc-30622ade-0f26-11e9-aaf6-525400933534
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| BRICK ID | HOST | PATH | ONLINE | PORT | PID |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| 2841d69f-8d1d-4013-bd6a-4aaea9031f9b | gluster-kube1-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick1/brick | false | 0 | 0 |
| 5d7814b5-3ba8-4bc0-b3ea-74fa7168c416 | gluster-kube2-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick2/brick | true | 39067 | 4115 |
| 2ea8fca7-e7e2-47e5-8f2f-8e6c399c50f4 | gluster-kube3-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick3/brick | true | 35692 | 4034 |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
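
To see at a glance which bricks did not come back, a sketch that filters the ONLINE column (field 5 when splitting on '|') of the same output:

# Print host and brick path for every brick reporting ONLINE=false.
glustercli volume status | awk -F'|' '$5 ~ /false/ {print $3, $4}'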

@atinmu added the brick-multiplexing-issue label and removed the GCS/1.0 label on Jan 17, 2019
@atinmu
Contributor

atinmu commented Jan 17, 2019

Taking this out of the GCS/1.0 tag, considering we're not going to make brick multiplexing a default option in the GCS/1.0 release.
