'NFS client deadlock on SMP machines'

List:       linux-smp
Subject:    NFS client deadlock on SMP machines
From:       "Brian O'Keefe" <okeefe () spinnakernet ! com>
Date:       2001-01-09 16:56:31

I'm not sure yet if this is a true bug, but it sure seems like one...

I've got a 2-processor machine that I'm using as an NFS client. I've
written some code that is doing a boatload of NFS reads from this
client, locking whole files as read-only as I do each read. I've got
multiple processes running the same code. Pretty regularly, I can get
this client machine to lock up. I've scoured the web looking for hints
about what might be wrong, and I'm using kgdb to debug this from a
remote machine.

I've been using linux-2.4.0-prerelease, rebuilt with SMP support, and
the necessary devices built into the kernel (no modules at all).

Using kgdb, I stop the machine when it hangs. The backtrace for the
current thread is:

(gdb) bt
#0  breakpoint () at gdbstub.c:1235
#1  0xc01907ac in gdb_interrupt (irq=4, dev_id=0x0, regs=0xcd917e5c)
    at gdbserial.c:143
#2  0xc010a989 in handle_IRQ_event (irq=4, regs=0xcd917e5c, action=0xcdf8ef60)
    at irq.c:439
#3  0xc010ab8d in do_IRQ (regs={ebx = 51246, ecx = -840260512, edx = 51246,
      esi = -840260512, edi = -812463616, ebp = -846102896, eax = -812463616,
      xds = -839974888, xes = -1072300008, orig_eax = -252, eip = -1071305391,
      xcs = 16, eflags = 646, esp = -846102872, xss = -1072275952})
    at irq.c:609
#4  0xc01090d4 in ret_from_intr () at af_packet.c:1876
#5  0xc0165e10 in inode_schedule_scan (inode=0xcdeaa460, time=51248)
    at flushd.c:225
#6  0xc015f87e in nfs_readpage_async (file=0xcdeff6c0, inode=0xcdeaa460,
    page=0xc139ced4) at read.c:197
#7  0xc0160182 in nfs_readpage (file=0xcdeff6c0, page=0xc139ced4)
    at read.c:503
#8  0xc0126bc3 in do_generic_file_read (filp=0xcdeff6c0, ppos=0xcdeff6e0,
    desc=0xcd917f60, actor=0xc0126da4 <file_read_actor>) at filemap.c:1156
#9  0xc0126e6d in generic_file_read (filp=0xcdeff6c0, buf=0x804efe8 "",
    count=24, ppos=0xcdeff6e0) at filemap.c:1259
#10 0xc015ef90 in nfs_file_read (file=0xcdeff6c0, buf=0x804efe8 "", count=24,
    ppos=0xcdeff6e0) at file.c:103
#11 0xc0133b65 in sys_read (fd=3, buf=0x804efe8 "", count=24)
    at read_write.c:133
#12 0xc0109013 in system_call () at af_packet.c:1876
#13 0x804a578 in ?? () at af_packet.c:1876
#14 0x80496d4 in ?? () at af_packet.c:1876
#15 0x400969cb in ?? () at af_packet.c:1876    

I believe that frame 5 is where I really am, but looking at the
assembly for it, I see no reason why I'd be hung there (I'm sitting on
a return from a routine, about to execute an "add" instruction). I'm
assuming that I'm actually hung on the other CPU. Can I see the other
CPU by looking at the other threads? Or does kgdb simply not allow me to
see the other CPU?

I looked at the other threads, and most of them are sitting in
schedule(), which seems normal to me:

(gdb) bt
#0  0xc0114182 in schedule () at sched.c:648
#1  0xcd8e8000 in ?? ()
#2  0xc01262b8 in __lock_page (page=0xc13a1b54) at filemap.c:642
#3  0xc0126301 in lock_page (page=0xc13a1b54) at filemap.c:660
#4  0xc0126af2 in do_generic_file_read (filp=0xcea3f1e0, ppos=0xcea3f200,
    desc=0xcd8e9f60, actor=0xc0126da4 <file_read_actor>) at filemap.c:1139
#5  0xc0126e6d in generic_file_read (filp=0xcea3f1e0,
    buf=0x8055108
"ÿgHÑ \n\004\e\032MfßN\221É¨ù\003ol¥År¥\024æÝürPõ\223ý\226c\001r\226h\237ö(\201\177\215TµÆ\0258m\030\bLÞ6*t9ëiºr\213",
 count=10000,
    ppos=0xcea3f200) at filemap.c:1259
#6  0xc015ef90 in nfs_file_read (file=0xcea3f1e0,
    buf=0x8055108
"ÿgHÑ \n\004\e\032MfßN\221É¨ù\003ol¥År¥\024æÝürPõ\223ý\226c\001r\226h\237ö(\201\177\215TµÆ\0258m\030\bLÞ6*t9ëiºr\213",
 count=10000,
    ppos=0xcea3f200) at file.c:103
#7  0xc0133b65 in sys_read (fd=3,
    buf=0x8055108
"ÿgHÑ \n\004\e\032MfßN\221É¨ù\003ol¥År¥\024æÝürPõ\223ý\226c\001r\226h\237ö(\201\177\215TµÆ\0258m\030\bLÞ6*t9ëiºr\213",
 count=10000)
    at read_write.c:133
#8  0xc0109013 in system_call () at af_packet.c:1876
#9  0x804a578 in ?? () at af_packet.c:1876
#10 0x80496d4 in ?? () at af_packet.c:1876
#11 0x400969cb in ?? () at af_packet.c:1876   

One of the threads, however, has this backtrace:

(gdb) bt
#0  0xc0114182 in schedule () at sched.c:648
#1  0x0 in ?? ()

This seems suspect to me, but I don't know the kernel well enough to say.

Can anyone give me some pointers on what I should be looking at to debug
this?

-- 

Brian O'Keefe