List: gfs-bugs
Subject: [gfs-bugs] [Bug 226] New - nodes hang after reading journal
From: bugzilla-daemon@sistina.com
Date: 2001-03-20 0:14:21
http://bugzilla.sistina.com/show_bug.cgi?id=226
*** shadow/226 Mon Mar 19 18:14:21 2001
--- shadow/226.tmp.8442 Mon Mar 19 18:14:21 2001
***************
*** 0 ****
--- 1,250 ----
+ Bug#: 226
+ Product: GFS
+ Version: 4.0
+ Platform:
+ OS/Version: All
+ Status: NEW
+ Resolution:
+ Severity: normal
+ Priority: P4
+ Component: __unknown__
+ AssignedTo: gfs-bugs@sistina.com
+ ReportedBy: annis@fnal.gov
+ URL:
+ Summary: nodes hang after reading journal
+
+ Hi folks,
+ I seem to be in a state I cannot recover from, and could use help.
+
+ It seems to have started last Friday, when the lock server went down.
+ That led to having to do meatware, and to a problem with stomithd
+ not being able to exec the agent. The exec failure was due to
+ /usr/local/sbin not being in the path, which in turn was because I had
+ added stomithd to the /etc/rc.d/init.d/pool script, firing it with
+ "daemon stomithd". (I've since removed that...) But the cluster came
+ up after that affair, once I ran "wait_meatware -s ip" and
+ "do_meatware -s ip" by hand. During this process, I stopped memexpd a
+ few times.
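+ 
+ (For the record, a sketch of how the pool init script could have set
+ the path before firing stomithd. This is hypothetical, RH 7.0 SysV
+ style, and assumes the stomith agents live in /usr/local/sbin:
+ 
+     #!/bin/sh
+     # /etc/rc.d/init.d/pool (fragment) -- hypothetical sketch.
+     # Init scripts run with a minimal PATH, so agents installed
+     # under /usr/local/sbin are invisible to stomithd unless the
+     # PATH is extended before the daemon is started.
+     . /etc/rc.d/init.d/functions
+ 
+     PATH=/usr/local/sbin:$PATH
+     export PATH
+ 
+     case "$1" in
+       start)
+             daemon stomithd
+             ;;
+     esac
+ 
+ Again, I've since taken stomithd out of that script entirely.)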
+
+ Fine.
+
+ But today one of the machines just hung when someone tried looking
+ at a GFS file system.
+
+ And now, when I try to mount the GFS file system, the machine goes
+ into a hard hang. The story goes like this:
+
+ I power down the 3 GFS nodes. I power one back up. I run
+ /etc/rc.d/init.d/pool start
+ and get success. Then I try to mount the file system with:
+ /etc/rc.d/init.d/gfs start
+ and the machine soft-hangs: the logs report the need for "do_meatware".
+ So I run "do_meatware -s ip#1" and "do_meatware -s ip#2"
+ against the other machines, and then try
+ /etc/rc.d/init.d/gfs start
+ again. The kernel goes off and looks at the journals (there are 9,
+ but only 3 machines: room to grow, as suggested by the howto) and
+ reports Done. (The log doesn't report Done for the last journal, but
+ a Done does appear on the serial console.) The machine then goes into
+ a hard hang that only a power cycle clears.
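+ 
+ Condensed into commands (ip#1 and ip#2 standing in for the other two
+ nodes' addresses, as above), the sequence on the freshly powered-up
+ node is:
+ 
+     /etc/rc.d/init.d/pool start    # succeeds
+     /etc/rc.d/init.d/gfs start     # soft hang; logs ask for do_meatware
+     do_meatware -s ip#1            # acknowledge stomith of the peers
+     do_meatware -s ip#2
+     /etc/rc.d/init.d/gfs start     # reads all 9 journals, then hard hang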
+
+ This same story is repeated on any of the 3 GFS nodes that I try
+ to bring up.
+
+ At one point in the affair, I remember one of the nodes hanging
+ while acquiring a journal lock, which I rebooted out of, but this no
+ longer occurs.
+
+ Any ideas as to what is happening?
+
+ thanks,
+ Jim Annis
+
+ This was a very vanilla setup:
+ Fresh install of RH 7.0.
+ Kernel 2.2.18 from source.
+ QLogic 2200 drivers from source.
+ GFS v4.0 from source.
+ 
+ Following the recipe in the howto for GFS on fibre channel with DMEP.
+ (Dot Hill RAID, 3 GFS nodes (tam02 tam03 tam04), 1 lock server node (tam01),
+ QLogic 2200 HBA, Capellix switch)
+
+ #
+ # From /var/log/messages on a cluster host computer
+ #
+ Mar 19 17:26:37 tam02 kernel: GFS: Done
+ Mar 19 17:28:36 tam04 kernel: memexp: Cluster Information Device summary
+ Mar 19 17:28:36 tam04 kernel: jid: 2, cid: 3, ipaddr: 131.225.7.67, nstomith: 1
+ Mar 19 17:28:36 tam04 kernel: jid: 1, cid: 2, ipaddr: 131.225.7.66, nstomith: 1
+ Mar 19 17:28:36 tam04 kernel: jid: 0, cid: 1, ipaddr: 131.225.7.51, nstomith: 1
+ Mar 19 17:28:36 tam04 kernel: Node timeout: 100
+ Mar 19 17:28:39 tam01 memexpd[13096]: New connection: fd 10 from 4307e183:104
+ Mar 19 17:28:39 tam04 kernel: IP: 4307e183
+ Mar 19 17:29:17 tam01 ypbind[476]: broadcast: RPC: Timed out.
+ Mar 19 17:29:39 tam04 kernel: IPparam 83e10733
+ Mar 19 17:29:39 tam04 kernel: Banning 3307e183
+ Mar 19 17:29:39 tam04 wait_meatware: Node 131.225.7.51 requires hard reset. Run do_meatware after power cycling the machine.
+ Mar 19 17:30:11 tam04 stomithd: successful stomith - wait_meatware - meatware reply 131.225.7.51
+ Mar 19 17:30:19 tam04 kernel: GFS: Trying to acquire journal lock 0...
+ Mar 19 17:30:19 tam04 kernel: GFS: Looking at journal 0...
+ Mar 19 17:30:19 tam04 kernel: memexp: client recovery reset expired cid 1 (called)
+ Mar 19 17:30:19 tam04 kernel: GFS: Done
+ Mar 19 17:30:19 tam04 kernel: GFS: Trying to acquire journal lock 1...
+ Mar 19 17:30:19 tam04 kernel: GFS: Looking at journal 1...
+ Mar 19 17:30:19 tam04 kernel: GFS: Done
+ Mar 19 17:30:19 tam04 kernel: GFS: Trying to acquire journal lock 2...
+ Mar 19 17:30:19 tam04 kernel: GFS: Looking at journal 2...
+ Mar 19 17:30:19 tam04 kernel: GFS: Done
+ Mar 19 17:30:19 tam04 kernel: GFS: Trying to acquire journal lock 3...
+ Mar 19 17:30:19 tam04 kernel: GFS: Looking at journal 3...
+ Mar 19 17:30:19 tam04 kernel: GFS: Done
+ Mar 19 17:30:19 tam04 kernel: GFS: Trying to acquire journal lock 4...
+ Mar 19 17:30:19 tam04 kernel: GFS: Looking at journal 4...
+ Mar 19 17:30:20 tam04 kernel: GFS: Done
+ Mar 19 17:30:20 tam04 kernel: GFS: Trying to acquire journal lock 5...
+ Mar 19 17:30:20 tam04 kernel: GFS: Looking at journal 5...
+ Mar 19 17:30:20 tam04 kernel: GFS: Done
+ Mar 19 17:30:20 tam04 kernel: GFS: Trying to acquire journal lock 6...
+ Mar 19 17:30:20 tam04 kernel: GFS: Looking at journal 6...
+ Mar 19 17:30:20 tam04 kernel: GFS: Done
+ Mar 19 17:30:20 tam04 kernel: GFS: Trying to acquire journal lock 7...
+ Mar 19 17:30:20 tam04 kernel: GFS: Looking at journal 7...
+ Mar 19 17:30:20 tam04 kernel: GFS: Done
+ Mar 19 17:30:20 tam04 kernel: GFS: Trying to acquire journal lock 8...
+ Mar 19 17:30:20 tam04 kernel: GFS: Looking at journal 8...
+ Mar 19 17:30:20 tam04 kernel: GFS: Done
+ Mar 19 17:30:20 tam04 kernel: GFS: Trying to acquire journal lock 9...
+ Mar 19 17:30:20 tam04 kernel: GFS: Looking at journal 9...
+
+ #
+ # From a serial console, you can see that the journal is actually read
+ #
+ Mar 19 17:26:37 tam02 kernel: GFS: Done
+ Mar 19 17:26:37 tam02 kernel: GFS: Trying to acquire journal lock 8...
+ Mar 19 17:26:37 tam02 kernel: GFS: Looking at journal 8...
+ Mar 19 17:26:37 tam02 kernel: GFS: Done
+ Mar 19 17:26:37 tam02 kernel: GFS: Trying to acquire journal lock 9...
+ Mar 19 17:26:37 tam02 kernel: GFS: Looking at journal 9...
+ Mar 19 17:26:37 tam02 kernel: GFS: Done
+
+
+ [root@tam02 gfs-4.0]# uname -a
+ Linux tam02.fnal.gov 2.2.18 #1 SMP Thu Mar 1 07:03:24 MST 2001 i686 unknown
+ [root@tam02 gfs-4.0]# scripts/dump_configuration
+ -----System Info found:-----------------------------------
+ Operating System: linux-gnu
+ Platform: linux_2_2
+ CPU type: i686
+ Vendor: pc
+
+ Shell: /bin/sh
+ C Preprocessor: gcc -E
+ Kernel-space Compiler: gcc
+ Kernel-space Compiler flags:
+ Kernel-space Linker: ld
+ User-land Compiler: gcc
+ User-land Compiler flags: -g -O2
+ Kernel source directory: /usr/src/linux
+ Kernel module directory: /lib/modules/2.2.18
+ return type of signal handlers is void
+
+ Installed components:
+
+ +/- | Description
+ ------------------------------------------------
+ + | pthread libaries
+ + | directory entry headers
+ + | standard C headers
+ + | posix complient sys/wait.h
+ + | fctnl.h header
+ + | paths.h header
+ + | sys/ioctl.h header
+ + | sys/time.h header
+ + | unistd.h header
+ + | `struct stat' contains an `st_rdev' member
+ + | program may include both `time.h' and `sys/time.h'
+ + | unistd.h header
+ + | getpagesize function
+ + | mmap function working correctly
+ + | vprintf function
+ + | gethostname function
+ + | gettimeofday function
+ + | mkdir function
+ + | select function
+ + | socket function
+ + | strdup function
+ + | strerror function
+ + | strstr function
+ + | strtol function
+ + | strtoul function
+ + | uname function
+ -----End System Info--------------------------------------
+ <snip>
+ [root@tam02 gfs-4.0]# gfs_tool dbdump /gfs
+ gfs_tool: error doing ioctl: Inappropriate ioctl for device
+
+ [root@tam02 gfs-4.0]# pinfo -s
+
+ ----------------------------------------------------------------------
+ Pool List (2 pools)
+ (1) Pool name: pool0_cidev
+ Pool device: 121,1
+ Pool ID: 80135a56e76
+ In use: No
+ Capacity: 20081208 kB
+ Subpool List (1 subpools) for 121,1 (/dev/pool/pool0_cidev)
+ [ 1] offset : 0
+ blocks : 40162416
+ striping : 0
+ total weight : 0
+ Type : gfs_data
+ subpool devices:
+ sda1 weight: 0
+
+ (2) Pool name: pool0
+ Pool device: 121,2
+ Pool ID: 8027d893408
+ In use: No
+ Capacity: 266454080 kB
+ Subpool List (1 subpools) for 121,2 (/dev/pool/pool0)
+ [ 1] offset : 0
+ blocks : 532908160
+ striping : 0
+ total weight : 1
+ Type : gfs_data
+ subpool devices:
+ sda2 weight: 1
+
+
+ ----------------------------------------------------------------------
+
+ [root@tam02 gfs-4.0]# pinfo -f pool0
+ poolname pool0
+
+ subpools 1
+
+ subpool 0 0 1 gfs_data
+
+ pooldevice 0 0 /dev/sda2 1
+
+
+ [root@tam02 gfs-4.0]# cat gfscf.cf
+ datadev: /dev/pool/pool0
+ cidev: /dev/pool/pool0_cidev
+ lockdev: 131.225.7.47:15697
+ cbport: 3001
+ timeout: 100
+
+ STOMITH: meatware
+ name: meatware
+
+ # IP addr CID STOMITH method
+ node: 131.225.7.51 1 SM: meatware
+ node: 131.225.7.66 2 SM: meatware
+ node: 131.225.7.67 3 SM: meatware
Read the GFS HOWTO http://www.sistina.com/gfs/Pages/howto.html
gfs-bugs mailing list
gfs-bugs@sistina.com
http://lists.sistina.com/mailman/listinfo/gfs-bugs