[prev in list] [next in list] [prev in thread] [next in thread] 

List:       redhat-linux-cluster
Subject:    [Linux-cluster] VM locking in EL6 clusters
From:       "Devin A. Bougie" <devin.bougie () cornell ! edu>
Date:       2014-05-15 15:16:52
Message-ID: 3CD8C126-28A3-4526-8D20-28C280BCA1E8 () cornell ! edu
[Download RAW message or body]

Hello,

We are running three EL6 clusters using the Red Hat High Availability Add-On, and are \
encountering issues with locking our clustered VM’s.

We followed the virtual machine disk locking documentation \
(http://libvirt.org/locking.html#sanlockstorage), specifically its recommendation to \
have sanlock's disk_lease_dir set to a shared GFS2 file system.  Ideally we would \
store the leases on a clustered logical volume block device, but haven’t yet figured \
out to configure this.  Our VM XML definitions are on the same GFS2 file system, and \
our KVM virtual machines are using clustered logical volumes for their block devices.

For example, here are a few configuration excerpts.
———
- From /etc/cluster/cluster.conf:
<clusterfs device="/dev/vgift1/cluster" force_unmount="0" fstype="gfs2" \
                mountpoint="/gfs/cluster" name="cluster" options="noatime"/>
...
<vm autostart="0" domain="fd14" migrate="live" name="lnx91" \
path="/gfs/cluster/libvirt/qemu" recovery="relocate”/>

- From /etc/fstab:
/dev/vgift1/cluster	/gfs/cluster	gfs2	noatime	0	0

- From /gfs/cluster/libvirt/qemu/lnx91.xml:
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/vgift1/lnx91'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>

- From /etc/libvirt/quemu-sanlock.conf:
host_id = 3
auto_disk_leases = 1
disk_lease_dir = "/gfs/cluster/libvirt/sanlock"
———

In general this works well, but periodically we see warnings and eventual errors that \
break sanlock and subsequently each clustered VM.  After the problem occurs, recovery \
involves `service libvirtd restart ; clusvcadm -d vm:lnx91 ; clusvcadm -e vm:lnx91`.  \
This is consistent across all three of our clusters (two at 6.4, one at fully updated \
6.5).  We see no other indications of problems with our iSCSI setup or GFS2 file \
systems.  Please see below for an example of what we see in /var/log/messages.

I would greatly appreciate any suggestions for resolving or debugging this problem on \
our EL6 clusters.  I would be happy to provide entire configuration or log files, or \
open a bugzilla report if necessary.

Many thanks,
Devin

———
…
May 14 23:32:43 lnx903 sanlock[8476]: 2014-05-14 23:32:43-0400 5909646 [8476]: s2 \
                check_our_lease warning 60 last_success 5909586
May 14 23:32:44 lnx903 sanlock[8476]: 2014-05-14 23:32:44-0400 5909647 [8476]: s2 \
                check_our_lease warning 61 last_success 5909586
May 14 23:32:45 lnx903 sanlock[8476]: 2014-05-14 23:32:45-0400 5909648 [8476]: s2 \
                check_our_lease warning 62 last_success 5909586
May 14 23:32:46 lnx903 sanlock[8476]: 2014-05-14 23:32:46-0400 5909649 [8476]: s2 \
                check_our_lease warning 63 last_success 5909586
May 14 23:32:47 lnx903 sanlock[8476]: 2014-05-14 23:32:47-0400 5909650 [8476]: s2 \
                check_our_lease warning 64 last_success 5909586
May 14 23:32:48 lnx903 sanlock[8476]: 2014-05-14 23:32:48-0400 5909651 [8476]: s2 \
                check_our_lease warning 65 last_success 5909586
May 14 23:32:49 lnx903 sanlock[8476]: 2014-05-14 23:32:49-0400 5909652 [8476]: s2 \
                check_our_lease warning 66 last_success 5909586
May 14 23:32:50 lnx903 sanlock[8476]: 2014-05-14 23:32:50-0400 5909653 [8476]: s2 \
                check_our_lease warning 67 last_success 5909586
May 14 23:32:51 lnx903 sanlock[8476]: 2014-05-14 23:32:51-0400 5909654 [8476]: s2 \
                check_our_lease warning 68 last_success 5909586
May 14 23:32:52 lnx903 sanlock[8476]: 2014-05-14 23:32:52-0400 5909655 [8476]: s2 \
                check_our_lease warning 69 last_success 5909586
May 14 23:32:53 lnx903 sanlock[8476]: 2014-05-14 23:32:53-0400 5909656 [8476]: s2 \
                check_our_lease warning 70 last_success 5909586
May 14 23:32:54 lnx903 sanlock[8476]: 2014-05-14 23:32:54-0400 5909657 [8476]: s2 \
                check_our_lease warning 71 last_success 5909586
May 14 23:32:55 lnx903 sanlock[8476]: 2014-05-14 23:32:55-0400 5909658 [8476]: s2 \
                check_our_lease warning 72 last_success 5909586
May 14 23:32:56 lnx903 sanlock[8476]: 2014-05-14 23:32:56-0400 5909659 [8476]: s2 \
                check_our_lease warning 73 last_success 5909586
May 14 23:32:57 lnx903 wdmd[3627]: test warning now 5909660 ping 5909650 close \
                5210877 renewal 5909586 expire 5909666 client 8476 \
                sanlock___LIBVIRT__DISKS__:3
May 14 23:32:57 lnx903 wdmd[3627]: /dev/watchdog closed unclean
May 14 23:32:57 lnx903 kernel: iTCO_wdt: Unexpected close, not stopping watchdog!
May 14 23:32:57 lnx903 sanlock[8476]: 2014-05-14 23:32:57-0400 5909660 [8476]: s2 \
                check_our_lease warning 74 last_success 5909586
May 14 23:32:58 lnx903 wdmd[3627]: test warning now 5909661 ping 5909650 close \
                5909660 renewal 5909586 expire 5909666 client 8476 \
                sanlock___LIBVIRT__DISKS__:3
May 14 23:32:58 lnx903 sanlock[8476]: 2014-05-14 23:32:58-0400 5909661 [8476]: s2 \
                check_our_lease warning 75 last_success 5909586
May 14 23:32:59 lnx903 wdmd[3627]: test warning now 5909662 ping 5909650 close \
                5909660 renewal 5909586 expire 5909666 client 8476 \
                sanlock___LIBVIRT__DISKS__:3
May 14 23:32:59 lnx903 sanlock[8476]: 2014-05-14 23:32:59-0400 5909662 [8476]: s2 \
                check_our_lease warning 76 last_success 5909586
May 14 23:33:00 lnx903 wdmd[3627]: test warning now 5909663 ping 5909650 close \
                5909660 renewal 5909586 expire 5909666 client 8476 \
                sanlock___LIBVIRT__DISKS__:3
May 14 23:33:00 lnx903 sanlock[8476]: 2014-05-14 23:33:00-0400 5909663 [8476]: s2 \
                check_our_lease warning 77 last_success 5909586
May 14 23:33:01 lnx903 wdmd[3627]: test warning now 5909664 ping 5909650 close \
                5909660 renewal 5909586 expire 5909666 client 8476 \
                sanlock___LIBVIRT__DISKS__:3
May 14 23:33:01 lnx903 sanlock[8476]: 2014-05-14 23:33:01-0400 5909664 [8476]: s2 \
                check_our_lease warning 78 last_success 5909586
May 14 23:33:02 lnx903 wdmd[3627]: test warning now 5909665 ping 5909650 close \
                5909660 renewal 5909586 expire 5909666 client 8476 \
                sanlock___LIBVIRT__DISKS__:3
May 14 23:33:02 lnx903 sanlock[8476]: 2014-05-14 23:33:02-0400 5909665 [8476]: s2 \
                check_our_lease warning 79 last_success 5909586
May 14 23:33:03 lnx903 wdmd[3627]: test failed rem 54 now 5909666 ping 5909650 close \
                5909660 renewal 5909586 expire 5909666 client 8476 \
                sanlock___LIBVIRT__DISKS__:3
May 14 23:33:03 lnx903 sanlock[8476]: 2014-05-14 23:33:03-0400 5909666 [8476]: s2 \
                check_our_lease failed 80
May 14 23:33:03 lnx903 sanlock[8476]: 2014-05-14 23:33:03-0400 5909666 [8476]: s2 \
                kill 19407 sig 15 count 1
May 14 23:33:03 lnx903 sanlock[8476]: 2014-05-14 23:33:03-0400 5909666 [8476]: s2 \
                kill 24035 sig 15 count 1
May 14 23:33:03 lnx903 sanlock[8476]: 2014-05-14 23:33:03-0400 5909667 [8476]: s2 \
                kill 19407 sig 15 count 2
May 14 23:33:03 lnx903 sanlock[8476]: 2014-05-14 23:33:03-0400 5909667 [8476]: s2 \
                kill 24035 sig 15 count 2
May 14 23:33:04 lnx903 kernel: br0: port 3(vnet1) entering disabled state
May 14 23:33:04 lnx903 kernel: device vnet1 left promiscuous mode
May 14 23:33:04 lnx903 kernel: br0: port 3(vnet1) entering disabled state
May 14 23:33:04 lnx903 wdmd[3627]: test failed rem 53 now 5909667 ping 5909650 close \
                5909660 renewal 5909586 expire 5909666 client 8476 \
                sanlock___LIBVIRT__DISKS__:3
May 14 23:33:04 lnx903 kernel: br0: port 2(vnet0) entering disabled state
May 14 23:33:04 lnx903 kernel: device vnet0 left promiscuous mode
May 14 23:33:04 lnx903 kernel: br0: port 2(vnet0) entering disabled state
May 14 23:33:04 lnx903 sanlock[8476]: 2014-05-14 23:33:04-0400 5909668 [8476]: s2 \
                kill 19407 sig 15 count 3
May 14 23:33:04 lnx903 sanlock[8476]: 2014-05-14 23:33:04-0400 5909668 [8476]: s2 \
                kill 24035 sig 15 count 3
May 14 23:33:05 lnx903 wdmd[3627]: test failed rem 52 now 5909668 ping 5909650 close \
                5909660 renewal 5909586 expire 5909666 client 8476 \
                sanlock___LIBVIRT__DISKS__:3
May 14 23:33:05 lnx903 ntpd[3431]: Deleting interface #47 vnet1, \
fe80::fc54:ff:febd:58b2#123, interface stats: received=0, sent=0, dropped=0, \
                active_time=30984 secs
May 14 23:33:05 lnx903 ntpd[3431]: Deleting interface #46 vnet0, \
fe80::fc54:ff:fe14:d476#123, interface stats: received=0, sent=0, dropped=0, \
                active_time=31076 secs
May 14 23:33:05 lnx903 ntpd[3431]: peers refreshed
May 14 23:33:05 lnx903 sanlock[8476]: 2014-05-14 23:33:05-0400 5909669 [8476]: s2 \
                kill 19407 sig 15 count 4
May 14 23:33:05 lnx903 sanlock[8476]: 2014-05-14 23:33:05-0400 5909669 [8476]: s2 \
                kill 24035 sig 15 count 4
May 14 23:33:06 lnx903 sanlock[8476]: 2014-05-14 23:33:06-0400 5909669 [8476]: dead \
                19407 ci 2 count 4
May 14 23:33:06 lnx903 wdmd[3627]: test failed rem 51 now 5909669 ping 5909650 close \
                5909660 renewal 5909586 expire 5909666 client 8476 \
                sanlock___LIBVIRT__DISKS__:3
May 14 23:33:06 lnx903 sanlock[8476]: 2014-05-14 23:33:06-0400 5909669 [8476]: dead \
                24035 ci 3 count 4
May 14 23:33:06 lnx903 sanlock[8476]: 2014-05-14 23:33:06-0400 5909669 [8476]: s2 all \
                pids clear
May 14 23:33:07 lnx903 wdmd[3627]: /dev/watchdog reopen
May 14 23:33:35 lnx903 rgmanager[3795]: status on vm "lnx99" returned 1 (generic \
                error)
May 14 23:33:35 lnx903 rgmanager[3795]: status on vm "lnx91" returned 1 (generic \
                error)
May 14 23:33:45 lnx903 rgmanager[3795]: Stopping service vm:lnx99
May 14 23:33:45 lnx903 rgmanager[3795]: Stopping service vm:lnx91
May 14 23:33:46 lnx903 rgmanager[3795]: Service vm:lnx99 is recovering
May 14 23:33:46 lnx903 rgmanager[3795]: Service vm:lnx91 is recovering
May 14 23:34:34 lnx903 sanlock[8476]: 2014-05-14 23:34:34-0400 5909758 [6907]: s2 \
                renewed 5909607 delta_length 151 too long
May 14 23:36:01 lnx903 rgmanager[3795]: Service vm:lnx99 is stopped
May 14 23:36:06 lnx903 rgmanager[3795]: Service vm:lnx91 is stopped
...
———


-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic