[prev in list] [next in list] [prev in thread] [next in thread] 

List:       lustre-discuss
Subject:    [Lustre-discuss] NFS Crash during heavy writes
From:       dgucker () choopa ! com (David Gucker)
Date:       2010-07-22 2:59:49
Message-ID: 4C47B425.6090306 () choopa ! com
[Download RAW message or body]

We experienced an nfs lockup on the Lustre client exporting the 
filesystem.   nfs is tuned to run 256 daemons in anticipation of heavy 
load.  The Client is connected with bonded 10Gbps ethernet to the NFS 
export network, and also bonded 10Gbps ethernet to the Lustre network.  
At the time of the crash, approx 800Mbps of NFS writes to the Lustre 
filesystem was taking place.  Any input on on this issue and most 
importantly how to prevent it is appreciated!

Vitals:
2.6.18-164.11.1.el5_lustre.1.8.2
CentOS release 5.4
96GB RAM
Dual intel quad

Messages:

Jul 21 14:25:36 ID6316-Client1 kernel: LustreError: 
5464:0:(lov_merge.c:74:lov_merge_lvb()) 
ASSERTION(spin_is_locked(&lsm->lsm_lock)) failed
Jul 21 14:25:36 ID6316-Client1 kernel: LustreError: 
5464:0:(lov_merge.c:74:lov_merge_lvb()) LBUG
Jul 21 14:25:36 ID6316-Client1 kernel: Lustre: 
5464:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for 
process 5464
Jul 21 14:25:36 ID6316-Client1 kernel: nfsd          R  running 
task       0  5464      1          5465  5463 (L-TLB)
Jul 21 14:25:36 ID6316-Client1 kernel:  0000000000000000 
0000000000000000 0000000000000001 0000000000000000
Jul 21 14:25:36 ID6316-Client1 kernel:  0000000000000000 
0000000000000000 0000000000000086 ffffffff80047152
Jul 21 14:25:36 ID6316-Client1 kernel:  0000000000000001 
0000000000000001 0000000000000000 0000000000000001
Jul 21 14:25:36 ID6316-Client1 kernel: Call Trace:
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff80047152>] 
try_to_wake_up+0x472/0x484
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8016d94c>] 
vgacon_scroll+0x21e/0x23f
Jul 21 14:25:36 ID6316-Client1 last message repeated 2 times
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff80091d2d>] 
printk+0x52/0xbd
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8008ac95>] 
__wake_up_common+0x3e/0x68
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8008ac95>] 
__wake_up_common+0x3e/0x68
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8002e2f3>] 
__wake_up+0x38/0x4f
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff800a7b2e>] 
kallsyms_lookup+0xe6/0x1ae
Jul 21 14:25:36 ID6316-Client1 last message repeated 3 times
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8006bc3b>] 
printk_address+0x9f/0xab
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff80091d2d>] 
printk+0x52/0xbd
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff80091d2d>] 
printk+0x52/0xbd
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff800a54d2>] 
module_text_address+0x33/0x3c
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8009e65b>] 
kernel_text_address+0x1a/0x26
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8006b921>] 
dump_trace+0x206/0x22f
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff88768c8f>] 
> osc:osc_enqueue_fini+0x1af/0x240
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8006b97e>] 
show_trace+0x34/0x47
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8006ba83>] 
_show_stack+0xdb/0xea
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff88551b1a>] 
> libcfs:lbug_with_loc+0x7a/0xd0
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff88559e90>] 
> libcfs:tracefile_init+0x0/0x110
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8881ac2e>] 
> lov:lov_merge_lvb+0x4e/0x220
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff886b86d1>] 
> ptlrpc:lustre_swab_buf+0x81/0x170
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff888747cd>] 
> lustre:ll_inode_size_lock+0x5d/0x160
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8885d2ca>] 
> lustre:ll_extent_lock+0x8ea/0xaf0
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff885522b8>] 
> libcfs:cfs_alloc+0x68/0xc0
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8876cd50>] 
> osc:osc_extent_blocking_cb+0x0/0x2b0
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff88692e50>] 
> ptlrpc:ldlm_completion_ast+0x0/0x880
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff88868510>] 
> lustre:ll_glimpse_callback+0x0/0x440
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8889ef0c>] 
> lustre:ll_tree_lock_iov+0x14c/0x310
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8889e987>] 
> lustre:ll_node_from_inode+0xc7/0x210
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8885da6a>] 
> lustre:ll_file_get_tree_lock_iov+0x59a/0x740
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff886754f4>] 
> ptlrpc:ldlm_lock_add_to_lru+0x74/0xe0
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8867219a>] 
> ptlrpc:lock_res_and_lock+0xba/0xd0
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8885e6f4>] 
> lustre:ll_file_writev+0xae4/0x1750
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8867736a>] 
> ptlrpc:ldlm_lock_decref+0x9a/0xc0
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff887e0d9a>] 
> mdc:mdc_set_lock_data+0x1da/0x250
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff88692e50>] 
> ptlrpc:ldlm_completion_ast+0x0/0x880
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff88884ebd>] 
> lustre:ras_reset+0x2d/0x110
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8885f360>] 
> lustre:ll_file_write+0x0/0x20
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff800df4cd>] 
do_readv_writev+0x172/0x291
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8885f360>] 
> lustre:ll_file_write+0x0/0x20
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff888698fb>] 
> lustre:ll_file_open+0xbab/0xd10
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8008ac95>] 
__wake_up_common+0x3e/0x68
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8850b753>] 
> nfsd:nfsd_acceptable+0x0/0xd8
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8850c711>] 
> nfsd:nfsd_vfs_write+0xf2/0x2e1
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff88868d50>] 
> lustre:ll_file_open+0x0/0xd10
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8001e8b7>] 
__dentry_open+0x101/0x1dc
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8850d027>] 
> nfsd:nfsd_write+0xb5/0xd5
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff88513ae2>] 
> nfsd:nfsd3_proc_write+0xea/0x109
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff885091db>] 
> nfsd:nfsd_dispatch+0xd8/0x1d6
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8843d529>] 
> sunrpc:svc_process+0x454/0x71b
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff80064644>] 
__down_read+0x12/0x92
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff885095a1>] 
> nfsd:nfsd+0x0/0x2cb
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff88509746>] 
> nfsd:nfsd+0x1a5/0x2cb
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8005dfb1>] 
child_rip+0xa/0x11
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff885095a1>] 
> nfsd:nfsd+0x0/0x2cb
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff885095a1>] 
> nfsd:nfsd+0x0/0x2cb
Jul 21 14:25:36 ID6316-Client1 kernel:  [<ffffffff8005dfa7>] 
child_rip+0x0/0x11
Jul 21 14:25:36 ID6316-Client1 kernel:
Jul 21 14:25:36 ID6316-Client1 kernel: LustreError: dumping log to 
/tmp/lustre-log.1279736736.5464


And subsequently:
Jul 21 15:40:10 ID6316-Client1 kernel: BUG: soft lockup - CPU#6 stuck 
for 10s! [nfsd:5324]
Jul 21 15:40:10 ID6316-Client1 kernel: CPU 6:
Jul 21 15:40:10 ID6316-Client1 kernel: Modules linked in: 
iptable_filter(U) ip_tables(U) mgc(U) lustre(U) lov(U) mdc(U) lquota(U) 
osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) 
nfsd(U) exportfs(U) nfs_acl(U) auth_rpcgss(U) autofs4(U) hidp(U) 
l2cap(U) bluetooth(U) lockd(U) sunrpc(U) bonding(U) ipt_REJECT(U) 
ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U) 
ipv6(U) xfrm_nalgo(U) crypto_api(U) cpufreq_ondemand(U) acpi_cpufreq(U) 
freq_table(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) 
backlight(U) sbs(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) 
acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) igb(U) ixgbe(U) 
8021q(U) joydev(U) sg(U) i2c_i801(U) serio_raw(U) i2c_core(U) pcspkr(U) 
dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) 
dm_mem_cache(U) usb_storage(U) ata_piix(U) libata(U) shpchp(U) 
3w_9xxx(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) 
ehci_hcd(U)
Jul 21 15:40:10 ID6316-Client1 kernel: Pid: 5324, comm: nfsd Tainted: 
G      2.6.18-164.11.1.el5_lustre.1.8.2 #1
Jul 21 15:40:10 ID6316-Client1 kernel: RIP: 0010:[<ffffffff80064bfc>]  
[<ffffffff80064bfc>] .text.lock.spinlock+0x2/0x30
Jul 21 15:40:10 ID6316-Client1 kernel: RSP: 0018:ffff810827163478  
EFLAGS: 00000286
Jul 21 15:40:10 ID6316-Client1 kernel: RAX: ffff81083ceea860 RBX: 
ffff8113b7703cc0 RCX: ffff8112e855dd10
Jul 21 15:40:10 ID6316-Client1 kernel: RDX: ffff810826a28040 RSI: 
ffff810827163760 RDI: ffff810f59ad4a40
Jul 21 15:40:10 ID6316-Client1 kernel: RBP: ffffffff88677168 R08: 
000000000000012d R09: ffff810c8e7e4ae0
Jul 21 15:40:10 ID6316-Client1 kernel: R10: 0000000000000000 R11: 
0000000000000000 R12: ffff8112e855dd18
Jul 21 15:40:10 ID6316-Client1 kernel: R13: 0000000000000000 R14: 
ffff81131fc73c00 R15: ffffffff8867219a
Jul 21 15:40:10 ID6316-Client1 kernel: FS:  00002b0c882886e0(0000) 
GS:ffff81086a2e1140(0000) knlGS:0000000000000000
Jul 21 15:40:10 ID6316-Client1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
000000008005003b
Jul 21 15:40:10 ID6316-Client1 kernel: CR2: 00002aaaae52c000 CR3: 
0000000000201000 CR4: 00000000000006e0
Jul 21 15:40:10 ID6316-Client1 kernel:
Jul 21 15:40:10 ID6316-Client1 kernel: Call Trace:
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff88805e0d>] 
> lov:lov_stripe_lock+0x3d/0x80
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8882260e>] 
> lov:lov_update_enqueue_set+0x27e/0x4a0
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8860a1b0>] 
> obdclass:class_handle2object+0xe0/0x170
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff88768c8f>] 
> osc:osc_enqueue_fini+0x1af/0x240
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff88768dd8>] 
> osc:osc_enqueue_interpret+0xb8/0x180
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff886bb240>] 
> ptlrpc:lustre_swab_ost_lvb+0x0/0x40
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff886ab181>] 
> ptlrpc:ptlrpc_check_set+0x11a1/0x1450
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8004b344>] 
try_to_del_timer_sync+0x51/0x5a
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff886ae34a>] 
> ptlrpc:ptlrpc_set_wait+0x36a/0x660
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff88808442>] 
> lov:lov_enqueue+0x612/0x8b0
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8008c86b>] 
default_wake_function+0x0/0xe
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff886ab9b9>] 
> ptlrpc:ptlrpc_prep_set+0x1e9/0x290
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff88860eda>] 
> lustre:ll_glimpse_size+0x63a/0xc30
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8889b440>] 
> lustre:ll_mdc_blocking_ast+0x0/0x520
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8889adc6>] 
> lustre:ll_lookup_it+0x776/0x7c0
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8889b440>] 
> lustre:ll_mdc_blocking_ast+0x0/0x520
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8876cd50>] 
> osc:osc_extent_blocking_cb+0x0/0x2b0
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff88692e50>] 
> ptlrpc:ldlm_completion_ast+0x0/0x880
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff88868510>] 
> lustre:ll_glimpse_callback+0x0/0x440
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8886160a>] 
> lustre:ll_inode_revalidate_it+0x13a/0x1d0
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff888616c4>] 
> lustre:ll_getattr_it+0x24/0x110
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff888617e4>] 
> lustre:ll_getattr+0x34/0x40
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff885147fb>] 
> nfsd:encode_post_op_attr+0x3f/0x213
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8000d3a4>] dput+0x2c/0x113
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff88514d50>] 
> nfsd:compose_entry_fh+0x113/0x121
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff88514f8d>] 
> nfsd:encode_entry+0x22f/0x53c
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff886754f4>] 
> ptlrpc:ldlm_lock_add_to_lru+0x74/0xe0
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8867219a>] 
> ptlrpc:lock_res_and_lock+0xba/0xd0
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff88676bb8>] 
> ptlrpc:ldlm_lock_decref_internal+0x538/0x7f0
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8867219a>] 
> ptlrpc:lock_res_and_lock+0xba/0xd0
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff88677168>] 
> ptlrpc:__ldlm_handle2lock+0x2f8/0x360
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8867736a>] 
> ptlrpc:ldlm_lock_decref+0x9a/0xc0
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8885443d>] 
> lustre:ll_get_dir_page+0x63d/0x680
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff887e0d9a>] 
> mdc:mdc_set_lock_data+0x1da/0x250
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff88692e50>] 
> ptlrpc:ldlm_completion_ast+0x0/0x880
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff885152a5>] 
> nfsd:nfs3svc_encode_entry_plus+0xb/0x10
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff88855cdb>] 
> lustre:ll_readdir+0x87b/0x9e0
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8850b753>] 
> nfsd:nfsd_acceptable+0x0/0xd8
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8851529a>] 
> nfsd:nfs3svc_encode_entry_plus+0x0/0x10
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8001e8b7>] 
__dentry_open+0x101/0x1dc
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8851529a>] 
> nfsd:nfs3svc_encode_entry_plus+0x0/0x10
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff80035265>] 
vfs_readdir+0x77/0xa9
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8851529a>] 
> nfsd:nfs3svc_encode_entry_plus+0x0/0x10
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8850cea0>] 
> nfsd:nfsd_readdir+0x6d/0xc5
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff88514122>] 
> nfsd:nfsd3_proc_readdirplus+0xf8/0x220
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff885091db>] 
> nfsd:nfsd_dispatch+0xd8/0x1d6
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8843d529>] 
> sunrpc:svc_process+0x454/0x71b
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff80064644>] 
__down_read+0x12/0x92
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff885095a1>] 
> nfsd:nfsd+0x0/0x2cb
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff88509746>] 
> nfsd:nfsd+0x1a5/0x2cb
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8005dfb1>] 
child_rip+0xa/0x11
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff885095a1>] 
> nfsd:nfsd+0x0/0x2cb
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff885095a1>] 
> nfsd:nfsd+0x0/0x2cb
Jul 21 15:40:10 ID6316-Client1 kernel:  [<ffffffff8005dfa7>] 
child_rip+0x0/0x11
Jul 21 15:40:10 ID6316-Client1 kernel:

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20100721/bc50bda5/attachment.htm>



[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic