
List:       gluster-users
Subject:    Re: [Gluster-users] [ovirt-users] Ovirt/Gluster replica 3 distributed-replicated problem
From:       Ravishankar N <ravishankar@redhat.com>
Date:       2016-09-29 15:02:57
Message-ID: 80ccc63e-4850-c5fd-f1c2-1418353a56e1@redhat.com


On 09/29/2016 08:03 PM, Davide Ferrari wrote:
> It's strange. I've tried to trigger the error again by putting vm04 in 
> maintenance and stopping the gluster service (from the oVirt GUI), and now 
> the VM starts correctly. Maybe the arbiter indeed blamed the brick 
> that was still up before, but how is that possible?

A write from the client on that file (vm image) could have succeeded 
only on vm04 even before you brought it down.

> The only (maybe big) difference from the previous, erroneous 
> situation is that before, I did maintenance (+ reboot) on 3 of my 4 
> hosts. Maybe I should have left more time between one reboot and the next?

If you did not do anything after the previous run other than bring the 
node back up, and things worked, then the file is not in split-brain. 
Split-brained files need to be resolved before they can be accessed again, 
which apparently did not happen in your case.
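If the files had been in actual split-brain, recent gluster releases (3.7 
and later) can resolve them from the CLI without hand-editing xattrs on the 
bricks. A sketch, using the volume name from this thread; the file path and 
source brick are placeholders to adapt:

```shell
# List files the volume currently considers split-brained
gluster volume heal data_ssd info split-brain

# Resolve one file by keeping the replica with the newest mtime
gluster volume heal data_ssd split-brain latest-mtime /path/to/file

# Or explicitly pick the brick that holds the good copy
gluster volume heal data_ssd split-brain source-brick \
  vm01.storage.billy:/gluster/ssd/data/brick /path/to/file
```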

-Ravi
> 
> 2016-09-29 14:16 GMT+02:00 Ravishankar N <ravishankar@redhat.com>:
> 
> On 09/29/2016 05:18 PM, Sahina Bose wrote:
> > Yes, this is a GlusterFS problem. Adding gluster users ML
> > 
> > On Thu, Sep 29, 2016 at 5:11 PM, Davide Ferrari
> > <davide@billymob.com> wrote:
> > 
> > Hello
> > 
> > maybe this is more GlusterFS- than oVirt-related, but since
> > oVirt integrates Gluster management and I'm experiencing the
> > problem in an oVirt cluster, I'm writing here.
> > 
> > The problem is simple: I have a data domain mapped on a
> > replica 3 arbiter 1 Gluster volume with 6 bricks, like this:
> > 
> > Status of volume: data_ssd
> > Gluster process                                           TCP Port  RDMA Port  Online  Pid
> > ------------------------------------------------------------------------------
> > Brick vm01.storage.billy:/gluster/ssd/data/brick          49153     0          Y       19298
> > Brick vm02.storage.billy:/gluster/ssd/data/brick          49153     0          Y       6146
> > Brick vm03.storage.billy:/gluster/ssd/data/arbiter_brick  49153     0          Y       6552
> > Brick vm03.storage.billy:/gluster/ssd/data/brick          49154     0          Y       6559
> > Brick vm04.storage.billy:/gluster/ssd/data/brick          49152     0          Y       6077
> > Brick vm02.storage.billy:/gluster/ssd/data/arbiter_brick  49154     0          Y       6153
> > Self-heal Daemon on localhost                             N/A       N/A        Y       30746
> > Self-heal Daemon on vm01.storage.billy                    N/A       N/A        Y       196058
> > Self-heal Daemon on vm03.storage.billy                    N/A       N/A        Y       23205
> > Self-heal Daemon on vm04.storage.billy                    N/A       N/A        Y       8246
> > 
> > 
> > Now, I've put the vm04 host into maintenance from oVirt,
> > ticking the "Stop gluster" checkbox, and oVirt didn't
> > complain about anything. But when I tried to run a new VM it
> > complained about a "storage I/O problem", while the storage
> > data status was always UP.
> > 
> > Looking in the gluster logs I can see this:
> > 
> > [2016-09-29 11:01:01.556908] I
> > [glusterfsd-mgmt.c:1596:mgmt_getspec_cbk] 0-glusterfs: No
> > change in volfile, continuing
> > [2016-09-29 11:02:28.124151] E [MSGID: 108008]
> > [afr-read-txn.c:89:afr_read_txn_refresh_done]
> > 0-data_ssd-replicate-1: Failing READ on gfid
> > bf5922b7-19f3-4ce3-98df-71e981ecca8d: split-brain observed.
> > [Input/output error]
> > [2016-09-29 11:02:28.126580] W [MSGID: 108008]
> > [afr-read-txn.c:244:afr_read_txn] 0-data_ssd-replicate-1:
> > Unreadable subvolume -1 found with event generation 6 for
> > gfid bf5922b7-19f3-4ce3-98df-71e981ecca8d. (Possible split-brain)
> > [2016-09-29 11:02:28.127374] E [MSGID: 108008]
> > [afr-read-txn.c:89:afr_read_txn_refresh_done]
> > 0-data_ssd-replicate-1: Failing FGETXATTR on gfid
> > bf5922b7-19f3-4ce3-98df-71e981ecca8d: split-brain observed.
> > [Input/output error]
> > [2016-09-29 11:02:28.128130] W [MSGID: 108027]
> > [afr-common.c:2403:afr_discover_done] 0-data_ssd-replicate-1:
> > no read subvols for (null)
> > [2016-09-29 11:02:28.129890] W
> > [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 8201:
> > READ => -1 gfid=bf5922b7-19f3-4ce3-98df-71e981ecca8d
> > fd=0x7f09b749d210 (Input/output error)
> > [2016-09-29 11:02:28.130824] E [MSGID: 108008]
> > [afr-read-txn.c:89:afr_read_txn_refresh_done]
> > 0-data_ssd-replicate-1: Failing FSTAT on gfid
> > bf5922b7-19f3-4ce3-98df-71e981ecca8d: split-brain observed.
> > [Input/output error]
> > 
> 
> Does `gluster volume heal data_ssd info split-brain` report that
> the file is in split-brain, with vm04 still being down?
> If yes, could you provide the extended attributes of this gfid
> from all 3 bricks:
> getfattr -d -m . -e hex
> /path/to/brick/bf/59/bf5922b7-19f3-4ce3-98df-71e981ecca8d
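
For anyone reading along: each trusted.afr.<volname>-client-N value that 
getfattr prints is 12 bytes, i.e. three big-endian 32-bit counters of 
pending data, metadata and entry operations. A quick way to decode one 
value; the hex string below is made up for illustration, not taken from 
this cluster:

```shell
# Hypothetical trusted.afr value: 2 pending data ops, 1 metadata, 0 entry
xattr=0x000000020000000100000000
hex=${xattr#0x}
data=$((16#${hex:0:8}))       # first 4 bytes: pending data operations
metadata=$((16#${hex:8:8}))   # next 4 bytes: pending metadata operations
entry=$((16#${hex:16:8}))     # last 4 bytes: pending entry operations
echo "data=$data metadata=$metadata entry=$entry"
```

A non-zero counter on brick A under trusted.afr.*-client-B means A is 
blaming B; it is only real split-brain when the copies blame each other.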
> 
> If no, then I'm guessing that it is not in actual split-brain
> (hence the 'Possible split-brain' message). If the node you
> brought down contains the only good copy of the file (i.e. the
> other data brick and arbiter are up, and the arbiter 'blames' this
> other brick), all I/O is failed with EIO to prevent the file from
> getting into actual split-brain. The heals will happen when the
> good node comes back up, and I/O should be allowed again in that case.
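
Once the good node is back, the pending heals can be checked and nudged 
from any server; a sketch, again using this thread's volume name:

```shell
# Entries each brick still needs to heal
gluster volume heal data_ssd info

# Trigger an index heal now instead of waiting for the
# self-heal daemon's next periodic crawl
gluster volume heal data_ssd
```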
> 
> -Ravi
> 
> 
> > [2016-09-29 11:02:28.133879] W
> > [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 8202:
> > FSTAT()
> > /ba2bd397-9222-424d-aecc-eb652c0169d9/images/f02ac1ce-52cd-4b81-8b29-f8006d0469e0/ff4e49c6-3084-4234-80a1-18a67615c527
> >  => -1 (Input/output error)
> > The message "W [MSGID: 108008]
> > [afr-read-txn.c:244:afr_read_txn] 0-data_ssd-replicate-1:
> > Unreadable subvolume -1 found with event generation 6 for
> > gfid bf5922b7-19f3-4ce3-98df-71e981ecca8d. (Possible
> > split-brain)" repeated 11 times between [2016-09-29
> > 11:02:28.126580] and [2016-09-29 11:02:28.517744]
> > [2016-09-29 11:02:28.518607] E [MSGID: 108008]
> > [afr-read-txn.c:89:afr_read_txn_refresh_done]
> > 0-data_ssd-replicate-1: Failing STAT on gfid
> > bf5922b7-19f3-4ce3-98df-71e981ecca8d: split-brain observed.
> > [Input/output error]
> > 
> > Now, how is it possible to have a split-brain if I stopped
> > just ONE server, which had just ONE of six bricks, and it was
> > cleanly shut down with maintenance mode from oVirt?
> > 
> > I created the volume originally this way:
> > # gluster volume create data_ssd replica 3 arbiter 1
> > vm01.storage.billy:/gluster/ssd/data/brick
> > vm02.storage.billy:/gluster/ssd/data/brick
> > vm03.storage.billy:/gluster/ssd/data/arbiter_brick
> > vm03.storage.billy:/gluster/ssd/data/brick
> > vm04.storage.billy:/gluster/ssd/data/brick
> > vm02.storage.billy:/gluster/ssd/data/arbiter_brick
> > # gluster volume set data_ssd group virt
> > # gluster volume set data_ssd storage.owner-uid 36 && gluster
> > volume set data_ssd storage.owner-gid 36
> > # gluster volume start data_ssd
> > 
> 
> 
> 
> 
> > 
> > 
> > -- 
> > Davide Ferrari
> > Senior Systems Engineer
> > 
> > _______________________________________________
> > Users mailing list
> > Users@ovirt.org
> > http://lists.ovirt.org/mailman/listinfo/users
> > 
> > 
> 
> 
> 
> 
> -- 
> Davide Ferrari
> Senior Systems Engineer





_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users
