List:       gluster-users
Subject:    [Gluster-users] New Gluster volume (10.3) not healing symlinks after brick offline
From:       Matt Rubright <mrubrigh () uncc ! edu>
Date:       2023-01-23 20:57:48
Message-ID: CA+J2L0LYeJViju0buvQAVHbJn-n4MtoL=9e7pM09BKFVmOHVnQ () mail ! gmail ! com


Hi friends,

I have recently built a new replica 3 arbiter 1 volume on 10.3 servers and
have been putting it through its paces before getting it ready for
production use. The volume will ultimately contain about 200G of web
content files shared among multiple frontends. Each will use the gluster
fuse client to connect.

What I am experiencing sounds very much like this post from 9 years ago:
https://lists.gnu.org/archive/html/gluster-devel/2013-12/msg00103.html

In short, if I perform these steps I can reliably end up with symlinks on
the volume that will not heal, either by initiating a 'full heal' on the
cluster or by reading each file through a fuse client:

1) Verify that all nodes are healthy, the volume is healthy, and there are
no items needing to be healed
2) Cleanly shut down one server hosting a brick
3) Copy data, including some symlinks, from a fuse client to the volume
4) Bring the brick back online and observe the number and type of items
needing to be healed
5) Initiate a full heal from one of the nodes
6) Confirm that while files and directories are healed, symlinks are not

Please help me determine if I have improper expectations here. I have some
basic knowledge of managing gluster volumes, but I may be misunderstanding
intended behavior.

Here is the volume info and heal data at each step of the way:

*** Verify that all nodes are healthy, the volume is healthy, and there are
no items needing to be healed ***

# gluster vol info cwsvol01

Volume Name: cwsvol01
Type: Replicate
Volume ID: 7b28e6e6-4a73-41b7-83fe-863a45fd27fc
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: glfs02-172-20-1:/data/brick01/cwsvol01
Brick2: glfs01-172-20-1:/data/brick01/cwsvol01
Brick3: glfsarb01-172-20-1:/data/arb01/cwsvol01 (arbiter)
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
cluster.granular-entry-heal: on

# gluster vol status
Status of volume: cwsvol01
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick glfs02-172-20-1:/data/brick01/cwsvol01  50253     0          Y       1397
Brick glfs01-172-20-1:/data/brick01/cwsvol01  56111     0          Y       1089
Brick glfsarb01-172-20-1:/data/arb01/cwsvol01 54517     0          Y       118704
Self-heal Daemon on localhost                 N/A       N/A        Y       1413
Self-heal Daemon on glfs01-172-20-1           N/A       N/A        Y       3490
Self-heal Daemon on glfsarb01-172-20-1        N/A       N/A        Y       118720

Task Status of Volume cwsvol01
------------------------------------------------------------------------------
There are no active volume tasks

# gluster vol heal cwsvol01 info summary
Brick glfs02-172-20-1:/data/brick01/cwsvol01
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick glfs01-172-20-1:/data/brick01/cwsvol01
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick glfsarb01-172-20-1:/data/arb01/cwsvol01
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

*** Cleanly shut down one server hosting a brick ***

*** Copy data, including some symlinks, from a fuse client to the volume ***

# gluster vol heal cwsvol01 info summary
Brick glfs02-172-20-1:/data/brick01/cwsvol01
Status: Transport endpoint is not connected
Total Number of entries: -
Number of entries in heal pending: -
Number of entries in split-brain: -
Number of entries possibly healing: -

Brick glfs01-172-20-1:/data/brick01/cwsvol01
Status: Connected
Total Number of entries: 810
Number of entries in heal pending: 810
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick glfsarb01-172-20-1:/data/arb01/cwsvol01
Status: Connected
Total Number of entries: 810
Number of entries in heal pending: 810
Number of entries in split-brain: 0
Number of entries possibly healing: 0

*** Bring the brick back online and observe the number and type of items
needing to be healed ***

# gluster vol heal cwsvol01 info summary
Brick glfs02-172-20-1:/data/brick01/cwsvol01
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick glfs01-172-20-1:/data/brick01/cwsvol01
Status: Connected
Total Number of entries: 769
Number of entries in heal pending: 769
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick glfsarb01-172-20-1:/data/arb01/cwsvol01
Status: Connected
Total Number of entries: 769
Number of entries in heal pending: 769
Number of entries in split-brain: 0
Number of entries possibly healing: 0

*** Initiate a full heal from one of the nodes ***

# gluster vol heal cwsvol01 info summary
Brick glfs02-172-20-1:/data/brick01/cwsvol01
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick glfs01-172-20-1:/data/brick01/cwsvol01
Status: Connected
Total Number of entries: 148
Number of entries in heal pending: 148
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick glfsarb01-172-20-1:/data/arb01/cwsvol01
Status: Connected
Total Number of entries: 148
Number of entries in heal pending: 148
Number of entries in split-brain: 0
Number of entries possibly healing: 0

# gluster vol heal cwsvol01 info
Brick glfs02-172-20-1:/data/brick01/cwsvol01
Status: Connected
Number of entries: 0

Brick glfs01-172-20-1:/data/brick01/cwsvol01
/web01-etc
/web01-etc/nsswitch.conf
/web01-etc/swid/swidtags.d
/web01-etc/swid/swidtags.d/redhat.com
/web01-etc/os-release
/web01-etc/system-release
< truncated >

*** Verify that one brick contains the symlink while the previously-offline
one does not ***

[root@cws-glfs01 ~]# ls -ld /data/brick01/cwsvol01/web01-etc/nsswitch.conf
lrwxrwxrwx 2 root root 29 Jan  4 16:00 /data/brick01/cwsvol01/web01-etc/nsswitch.conf -> /etc/authselect/nsswitch.conf

[root@cws-glfs02 ~]# ls -ld /data/brick01/cwsvol01/web01-etc/nsswitch.conf
ls: cannot access '/data/brick01/cwsvol01/web01-etc/nsswitch.conf': No such file or directory
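
The check above can be scripted by inventorying symlinks under each brick
root with `find -type l` and comparing the lists. This sketch uses mktemp
directories as stand-ins for the two real brick paths, with only the first
holding the symlink, as on glfs01 vs. the previously-offline glfs02:

```shell
# Mock brick roots: b1 plays the healthy brick, b2 the one that was offline
set -e
b1=$(mktemp -d)
b2=$(mktemp -d)
mkdir -p "$b1/web01-etc" "$b2/web01-etc"
ln -s /etc/authselect/nsswitch.conf "$b1/web01-etc/nsswitch.conf"
# Inventory every symlink under each brick root
links_b1=$(cd "$b1" && find . -type l | sort)
links_b2=$(cd "$b2" && find . -type l | sort)
echo "brick1 symlinks: ${links_b1:-<none>}"
echo "brick2 symlinks: ${links_b2:-<none>}"
rm -rf "$b1" "$b2"
```

Run against the real brick paths on each node, any line present on one
brick but absent on the other is a candidate for the stuck heals.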

*** Note entries in /var/log/gluster/glustershd.log ***

[2023-01-23 20:34:40.939904 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2457:client4_0_link_cbk] 0-cwsvol01-client-1: remote operation failed. [{source=<gfid:3cade471-8aba-492a-b981-d63330d2e02e>}, {target=(null)}, {errno=116}, {error=Stale file handle}]
[2023-01-23 20:34:40.945774 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2457:client4_0_link_cbk] 0-cwsvol01-client-1: remote operation failed. [{source=<gfid:35102340-9409-4d88-a391-da43c00644e7>}, {target=(null)}, {errno=116}, {error=Stale file handle}]
[2023-01-23 20:34:40.749715 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2457:client4_0_link_cbk] 0-cwsvol01-client-1: remote operation failed. [{source=<gfid:874406a9-9478-4b83-9e6a-09e262e4b85d>}, {target=(null)}, {errno=116}, {error=Stale file handle}]
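
For anyone reading along: errno=116 in these messages is ESTALE, the same
"Stale file handle" the log spells out. A quick decode (this assumes
python3 is available on the node):

```shell
# Translate errno 116 to its symbolic name and message
estale_name=$(python3 -c 'import errno; print(errno.errorcode[116])')
estale_text=$(python3 -c 'import os; print(os.strerror(116))')
echo "errno 116 = $estale_name ($estale_text)"
```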




________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

