
List:       gfs-bugs
Subject:    [gfs-bugs] [Bug 226] New - nodes hang after reading journal
From:       bugzilla-daemon () sistina ! com
Date:       2001-03-20 0:14:21

http://bugzilla.sistina.com/show_bug.cgi?id=226

*** shadow/226	Mon Mar 19 18:14:21 2001
--- shadow/226.tmp.8442	Mon Mar 19 18:14:21 2001
***************
*** 0 ****
--- 1,250 ----
+ Bug#: 226
+ Product: GFS
+ Version: 4.0
+ Platform: 
+ OS/Version: All
+ Status: NEW   
+ Resolution: 
+ Severity: normal
+ Priority: P4
+ Component: __unknown__
+ AssignedTo: gfs-bugs@sistina.com                            
+ ReportedBy: annis@fnal.gov               
+ URL: 
+ Summary: nodes hang after reading journal
+ 
+ Hi folks,
+ I seem to be in a state I cannot recover from, and could use help.
+ 
+ It seems to have started last Friday, when the lock server went down.
+ That led to the need for meatware recovery, plus a problem where
+ stomithd could not exec the agent. The exec failure was because
+ /usr/local/sbin was not in the PATH, which in turn was because I had
+ added stomithd to the /etc/rc.d/init.d/pool script, firing it with
+ "daemon stomithd". (I've since removed that...) But the cluster came
+ up after that affair, once I ran "wait_meatware -s ip" and
+ "do_meatware -s ip" by hand. During this process I stopped memexpd a
+ few times.
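+ 
+ For reference, a minimal sketch of the init-script change involved,
+ assuming the meatware agent lives under /usr/local/sbin; the PATH
+ lines are a suggested way to make "daemon stomithd" work from an init
+ script, not something taken from this report:
+ 
+         # in /etc/rc.d/init.d/pool, inside the start) case
+         PATH=$PATH:/usr/local/sbin   # init scripts don't get /usr/local/sbin,
+         export PATH                  # so stomithd can exec the meatware agent
+         daemon stomithd              # the line that was added (and since removed)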
+ 
+ Fine.
+ 
+ But today one of the machines just hung when someone tried looking
+ at a GFS file system.
+ 
+ And now, when I try to mount the GFS file system, the machine goes
+ into a hard hang. The story goes like this:
+ 
+ I power down the 3 GFS nodes. I power one back up. I run
+         /etc/rc.d/init.d/pool start
+ and get success. Then I try to mount the file system with:
+         /etc/rc.d/init.d/gfs start
+ and get a soft hang: the logs report the need for "do_meatware".
+ So I do the "do_meatware -s ip#1" and "do_meatware -s ip#2"
+ against the other machines, and then try the
+         /etc/rc.d/init.d/gfs start
+ again. The kernel goes off and looks at the journals (there are 9,
+ but only 3 machines: room to grow, as suggested by the howto) and
+ reports Done for each. (The log doesn't report Done for the last
+ journal, but a Done does appear on the serial console.) The machine
+ then goes into a hard hang that only a power cycle clears.
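+ 
+ For clarity, here is the sequence above on a single node, consolidated
+ (ip#1 and ip#2 stand for the addresses of the two powered-off peers,
+ as in the report; this is only a restatement of the steps already
+ described, not a new procedure):
+ 
+         /etc/rc.d/init.d/pool start    # succeeds
+         /etc/rc.d/init.d/gfs start     # soft hang; logs ask for do_meatware
+         do_meatware -s ip#1            # acknowledge first peer as reset
+         do_meatware -s ip#2            # acknowledge second peer as reset
+         /etc/rc.d/init.d/gfs start     # looks at each journal, then hard hang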
+ 
+ This same story is repeated on any of the 3 GFS nodes that I try
+ to bring up.
+ 
+ At one point in the affair, I remember one of the nodes hanging while
+ acquiring a journal lock, which I rebooted out of, but this no longer
+ occurs.
+ 
+ Any ideas as to what is happening?
+ 
+ thanks,
+ Jim Annis
+ 
+ This was a very vanilla setup:
+         Fresh install of RH 7.0.
+         Kernel 2.2.18 from source.
+         qlogic 2200 drivers from source
+         GFS v4.0 from source
+ 
+ Following the recipe in the howto for GFS on fibre channel with DMEP.
+ (Dot Hill RAID, 3 GFS nodes (tam02, tam03, tam04), 1 lock server node
+ (tam01), QLogic 2200 HBA, Capellix switch)
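+ 
+ Roughly what the two init scripts cover on this setup (the details
+ below are inferred from the howto and the logs that follow, so treat
+ them as assumptions rather than a transcript):
+ 
+         /etc/rc.d/init.d/pool start   # loads the pool driver and assembles
+                                       # /dev/pool/pool0 and /dev/pool/pool0_cidev
+         /etc/rc.d/init.d/gfs start    # brings up the memexp lock module and
+                                       # mounts the GFS file system per gfscf.cf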
+ 
+ #
+ # From /var/log/messages on a cluster host computer
+ #
+ Mar 19 17:26:37 tam02 kernel: GFS:  Done
+ Mar 19 17:28:36 tam04 kernel: memexp:  Cluster Information Device summary
+ Mar 19 17:28:36 tam04 kernel:   jid: 2, cid: 3, ipaddr: 131.225.7.67, nstomith: 1
+ Mar 19 17:28:36 tam04 kernel:   jid: 1, cid: 2, ipaddr: 131.225.7.66, nstomith: 1
+ Mar 19 17:28:36 tam04 kernel:   jid: 0, cid: 1, ipaddr: 131.225.7.51, nstomith: 1
+ Mar 19 17:28:36 tam04 kernel:   Node timeout: 100
+ Mar 19 17:28:39 tam01 memexpd[13096]: New connection: fd 10 from 4307e183:104
+ Mar 19 17:28:39 tam04 kernel: IP: 4307e183
+ Mar 19 17:29:17 tam01 ypbind[476]: broadcast: RPC: Timed out.
+ Mar 19 17:29:39 tam04 kernel: IPparam 83e10733
+ Mar 19 17:29:39 tam04 kernel: Banning 3307e183
+ Mar 19 17:29:39 tam04 wait_meatware: Node 131.225.7.51 requires hard reset.  Run do_meatware after power cycling the machine.
+ Mar 19 17:30:11 tam04 stomithd: successful stomith - wait_meatware - meatware reply 131.225.7.51
+ Mar 19 17:30:19 tam04 kernel: GFS:  Trying to acquire journal lock 0...
+ Mar 19 17:30:19 tam04 kernel: GFS:  Looking at journal 0...
+ Mar 19 17:30:19 tam04 kernel: memexp:  client recovery reset expired cid 1 (called)
+ Mar 19 17:30:19 tam04 kernel: GFS:  Done
+ Mar 19 17:30:19 tam04 kernel: GFS:  Trying to acquire journal lock 1...
+ Mar 19 17:30:19 tam04 kernel: GFS:  Looking at journal 1...
+ Mar 19 17:30:19 tam04 kernel: GFS:  Done
+ Mar 19 17:30:19 tam04 kernel: GFS:  Trying to acquire journal lock 2...
+ Mar 19 17:30:19 tam04 kernel: GFS:  Looking at journal 2...
+ Mar 19 17:30:19 tam04 kernel: GFS:  Done
+ Mar 19 17:30:19 tam04 kernel: GFS:  Trying to acquire journal lock 3...
+ Mar 19 17:30:19 tam04 kernel: GFS:  Looking at journal 3...
+ Mar 19 17:30:19 tam04 kernel: GFS:  Done
+ Mar 19 17:30:19 tam04 kernel: GFS:  Trying to acquire journal lock 4...
+ Mar 19 17:30:19 tam04 kernel: GFS:  Looking at journal 4...
+ Mar 19 17:30:20 tam04 kernel: GFS:  Done
+ Mar 19 17:30:20 tam04 kernel: GFS:  Trying to acquire journal lock 5...
+ Mar 19 17:30:20 tam04 kernel: GFS:  Looking at journal 5...
+ Mar 19 17:30:20 tam04 kernel: GFS:  Done
+ Mar 19 17:30:20 tam04 kernel: GFS:  Trying to acquire journal lock 6...
+ Mar 19 17:30:20 tam04 kernel: GFS:  Looking at journal 6...
+ Mar 19 17:30:20 tam04 kernel: GFS:  Done
+ Mar 19 17:30:20 tam04 kernel: GFS:  Trying to acquire journal lock 7...
+ Mar 19 17:30:20 tam04 kernel: GFS:  Looking at journal 7...
+ Mar 19 17:30:20 tam04 kernel: GFS:  Done
+ Mar 19 17:30:20 tam04 kernel: GFS:  Trying to acquire journal lock 8...
+ Mar 19 17:30:20 tam04 kernel: GFS:  Looking at journal 8...
+ Mar 19 17:30:20 tam04 kernel: GFS:  Done
+ Mar 19 17:30:20 tam04 kernel: GFS:  Trying to acquire journal lock 9...
+ Mar 19 17:30:20 tam04 kernel: GFS:  Looking at journal 9...
+ 
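+ #
+ # Note on the hex values above: they are IP addresses.  83e10733 reads
+ # directly as 131.225.7.51, while 4307e183 and 3307e183 are the same
+ # kind of value with the bytes reversed (131.225.7.67 and 131.225.7.51
+ # respectively) -- consistent with the wait_meatware line that follows
+ # the ban.  A quick decode (the printf line is only an illustration,
+ # not taken from the report):
+ #
+         printf '%d.%d.%d.%d\n' 0x83 0xe1 0x07 0x33    # -> 131.225.7.51
+ 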
+ #
+ # From a serial console, you can see that the journal is actually read
+ #
+ Mar 19 17:26:37 tam02 kernel: GFS:  Done
+ Mar 19 17:26:37 tam02 kernel: GFS:  Trying to acquire journal lock 8...
+ Mar 19 17:26:37 tam02 kernel: GFS:  Looking at journal 8...
+ Mar 19 17:26:37 tam02 kernel: GFS:  Done
+ Mar 19 17:26:37 tam02 kernel: GFS:  Trying to acquire journal lock 9...
+ Mar 19 17:26:37 tam02 kernel: GFS:  Looking at journal 9...
+ Mar 19 17:26:37 tam02 kernel: GFS:  Done
+ 
+ 
+ [root@tam02 gfs-4.0]# uname -a
+ Linux tam02.fnal.gov 2.2.18 #1 SMP Thu Mar 1 07:03:24 MST 2001 i686 unknown
+ [root@tam02 gfs-4.0]# scripts/dump_configuration
+ -----System Info found:-----------------------------------
+ Operating System: linux-gnu
+ Platform: linux_2_2
+ CPU type: i686
+ Vendor: pc
+ 
+ Shell: /bin/sh
+ C Preprocessor: gcc -E
+ Kernel-space Compiler: gcc
+ Kernel-space Compiler flags:
+ Kernel-space Linker: ld
+ User-land Compiler: gcc
+ User-land Compiler flags: -g -O2
+ Kernel source directory: /usr/src/linux
+ Kernel module directory: /lib/modules/2.2.18
+ return type of signal handlers is void
+ 
+ Installed components:
+ 
+  +/- |   Description
+ ------------------------------------------------
+   +  |   pthread libaries
+   +  |   directory entry headers
+   +  |   standard C headers
+   +  |   posix complient sys/wait.h
+   +  |   fctnl.h header
+   +  |   paths.h header
+   +  |   sys/ioctl.h header
+   +  |   sys/time.h header
+   +  |   unistd.h header
+   +  |   `struct stat' contains an `st_rdev' member
+   +  |   program may include both `time.h' and `sys/time.h'
+   +  |   unistd.h header
+   +  |   getpagesize function
+   +  |   mmap function working correctly
+   +  |   vprintf function
+   +  |   gethostname function
+   +  |   gettimeofday function
+   +  |   mkdir function
+   +  |   select function
+   +  |   socket function
+   +  |   strdup function
+   +  |   strerror function
+   +  |   strstr function
+   +  |   strtol function
+   +  |   strtoul function
+   +  |   uname function
+ -----End System Info--------------------------------------
+ <snip>
+ [root@tam02 gfs-4.0]# gfs_tool dbdump /gfs
+ gfs_tool: error doing ioctl:  Inappropriate ioctl for device
+ 
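+ #
+ # The ioctl error above is consistent with /gfs not being a mounted GFS
+ # file system at the time the command was run.  A quick check (my
+ # suggestion, not from the report):
+ #
+         grep gfs /proc/mounts      # empty output if no GFS mount is up
+ 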
+ [root@tam02 gfs-4.0]# pinfo -s
+ 
+ ----------------------------------------------------------------------
+ Pool List  (2 pools)
+   (1) Pool name:    pool0_cidev
+        Pool device:  121,1
+        Pool ID:      80135a56e76
+        In use:       No
+        Capacity:     20081208 kB
+        Subpool List (1 subpools) for 121,1 (/dev/pool/pool0_cidev)
+         [ 1] offset       :         0
+              blocks       :  40162416
+              striping     :         0
+              total weight :         0
+              Type         : gfs_data
+           subpool devices:
+                 sda1    weight: 0
+ 
+   (2) Pool name:    pool0
+        Pool device:  121,2
+        Pool ID:      8027d893408
+        In use:       No
+        Capacity:     266454080 kB
+        Subpool List (1 subpools) for 121,2 (/dev/pool/pool0)
+         [ 1] offset       :         0
+              blocks       : 532908160
+              striping     :         0
+              total weight :         1
+              Type         : gfs_data
+           subpool devices:
+                 sda2    weight: 1
+ 
+ 
+ ----------------------------------------------------------------------
+ 
+ [root@tam02 gfs-4.0]# pinfo -f pool0
+ poolname        pool0
+ 
+ subpools        1
+ 
+ subpool 0       0       1       gfs_data
+ 
+ pooldevice      0       0       /dev/sda2       1
+ 
+ 
+ [root@tam02 gfs-4.0]# cat gfscf.cf
+ datadev:  /dev/pool/pool0
+ cidev:  /dev/pool/pool0_cidev
+ lockdev:  131.225.7.47:15697
+ cbport:  3001
+ timeout:  100
+ 
+ STOMITH:  meatware
+ name:     meatware
+ 
+ #      IP addr     CID  STOMITH method
+ node: 131.225.7.51  1   SM:  meatware
+ node: 131.225.7.66  2   SM:  meatware
+ node: 131.225.7.67  3   SM:  meatware
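+ 
+ #
+ # Cross-check (an addition for clarity, not part of the original dump):
+ # the node/CID entries above match the "Cluster Information Device
+ # summary" the kernel printed at mount time (cid 1 -> 131.225.7.51,
+ # cid 2 -> 131.225.7.66, cid 3 -> 131.225.7.67).  To compare quickly:
+ #
+         awk '/^node:/ { print $3, $2 }' gfscf.cf
+         grep 'jid:' /var/log/messages | tail -3
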
gfs-bugs mailing list
gfs-bugs@sistina.com
http://lists.sistina.com/mailman/listinfo/gfs-bugs
Read the GFS Howto:  http://www.sistina.com/gfs/Pages/howto.html
