
List:       lustre-discuss
Subject:    Re: [lustre-discuss] Repeated ZFS panics on MDT
From:       Andreas Dilger via lustre-discuss <lustre-discuss@lists.lustre.org>
Date:       2023-03-17 14:18:28
Message-ID: 24C69E4C-5891-4472-ADC3-1F325582CA72@whamcloud.com

It's been a while since I've worked with ZFS servers, but one old chestnut that caused problems with ZFS 0.7 on the MDTs was the variable dnode size feature.

I believe there was a tunable, something like "dnodesize=auto", that caused problems, and this could be changed to "dnodesize=legacy" or a fixed size such as "dnodesize=1k" to avoid the issue. You'll have to check the lustre-discuss archives and/or the ZFS docs to confirm, but that was the most common issue.
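
For reference, a rough sketch of how you might check and pin that property on the MDT dataset. The dataset name "ahomemdt00/mdt0" below is just a placeholder for your actual MDT dataset, and the valid values are legacy, auto, 1k, 2k, 4k, 8k and 16k:

  # show the current dnode size policy on the MDT dataset (placeholder name)
  zfs get dnodesize ahomemdt00/mdt0
  # pin it to a fixed size; note this only affects dnodes allocated after the change
  zfs set dnodesize=1k ahomemdt00/mdt0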

Alternatively, you could try updating to ZFS 0.8 or later to see if that avoids the issue.
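
If you go the upgrade route, it may also be worth confirming which ZFS version the MDS is actually running and whether the large_dnode pool feature is active, before and after. A minimal sketch, using the pool name from your zpool status output:

  # ZFS kernel module version currently loaded on the MDS
  cat /sys/module/zfs/version
  # state of the large_dnode pool feature (disabled/enabled/active)
  zpool get feature@large_dnode ahomemdt00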

Cheers, Andreas

> On Mar 17, 2023, at 05:39, Mountford, Christopher J. (Dr.) via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:
> 
> Unfortunately this problem seems to be getting worse, to the point where ZFS panics immediately after Lustre recovery completes when the system is under load.
> 
> Luckily this happened on our /home filesystem, which is relatively small. We are rebuilding onto spare hardware so we can return the system to production whilst we investigate.
> 
> The panics seem to happen during writes - under low load we see one every few hours. One trigger appears to be loading or reloading a page in Firefox on a login node, but this is definitely not the only trigger (we've also seen the panic when the login nodes were all down).
> 
> Stack trace seems fairly consistent across all panics we've seen:
> 
> Mar 17 09:44:46 amds01b kernel: PANIC at zfs_vfsops.c:584:zfs_space_delta_cb()
> Mar 17 09:44:46 amds01b kernel: Showing stack for process 10494
> Mar 17 09:44:46 amds01b kernel: CPU: 8 PID: 10494 Comm: mdt00_012 Tainted: P OE ------------ 3.10.0-1160.49.1.el7_lustre.x86_64 #1
> Mar 17 09:44:46 amds01b kernel: Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 02/09/2023
> Mar 17 09:44:46 amds01b kernel: Call Trace:
> Mar 17 09:44:46 amds01b kernel: [<ffffffff8d783539>] dump_stack+0x19/0x1b
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc08e3f24>] spl_dumpstack+0x44/0x50 [spl]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc08e3ff9>] spl_panic+0xc9/0x110 [spl]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc0ea213c>] ? dbuf_rele_and_unlock+0x34c/0x4c0 [zfs]
> Mar 17 09:44:46 amds01b kernel: [<ffffffff8d107cf4>] ? getrawmonotonic64+0x34/0xc0
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc0ebfaa3>] ? dmu_zfetch+0x393/0x520 [zfs]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc0ea2073>] ? dbuf_rele_and_unlock+0x283/0x4c0 [zfs]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc08e5ff1>] ? __cv_init+0x41/0x60 [spl]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc0f4753c>] zfs_space_delta_cb+0x9c/0x200 [zfs]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc0eb2944>] dmu_objset_userquota_get_ids+0x154/0x440 [zfs]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc0ec1e98>] dnode_setdirty+0x38/0xf0 [zfs]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc0ec221c>] dnode_allocate+0x18c/0x230 [zfs]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc0eaed2b>] dmu_object_alloc_dnsize+0x34b/0x3e0 [zfs]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1bac052>] __osd_object_create+0x82/0x170 [osd_zfs]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc08df238>] ? spl_kmem_zalloc+0xd8/0x180 [spl]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1bac3bd>] osd_mkreg+0x7d/0x210 [osd_zfs]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1ba88f6>] osd_create+0x336/0xb10 [osd_zfs]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1dc6fb5>] lod_sub_create+0x1f5/0x480 [lod]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1db7729>] lod_create+0x69/0x340 [lod]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1b9e690>] ? osd_trans_create+0x410/0x410 [osd_zfs]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1e31993>] mdd_create_object_internal+0xc3/0x300 [mdd]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1e1aa4b>] mdd_create_object+0x7b/0x820 [mdd]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1e24fd8>] mdd_create+0xdd8/0x14a0 [mdd]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1cef118>] mdt_reint_open+0x2588/0x3970 [mdt]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc177d2b9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1ccee52>] ? ucred_set_audit_enabled.isra.15+0x22/0x60 [mdt]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1ce1f23>] mdt_reint_rec+0x83/0x210 [mdt]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1cbd413>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1cc9ec6>] ? mdt_intent_fixup_resent+0x36/0x220 [mdt]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1cca132>] mdt_intent_open+0x82/0x3a0 [mdt]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1cc074a>] mdt_intent_opc+0x1ba/0xb50 [mdt]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc13c76c0>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1cca0b0>] ? mdt_intent_fixup_resent+0x220/0x220 [mdt]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1cc89e4>] mdt_intent_policy+0x1a4/0x360 [mdt]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc13764e6>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc0a152b7>] ? cfs_hash_bd_add_locked+0x67/0x90 [libcfs]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc0a18a4e>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc139daa6>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc13c7740>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1427092>] tgt_enqueue+0x62/0x210 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc142deea>] tgt_request_handle+0xada/0x1570 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc1407601>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc0a09bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc13d2bcb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc13cf6e5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [<ffffffff8d0d3233>] ? __wake_up+0x13/0x20
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc13d6534>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [<ffffffffc13d5a00>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [<ffffffff8d0c5e61>] kthread+0xd1/0xe0
> Mar 17 09:44:46 amds01b kernel: [<ffffffff8d0c5d90>] ? insert_kthread_work+0x40/0x40
> Mar 17 09:44:46 amds01b kernel: [<ffffffff8d795ddd>] ret_from_fork_nospec_begin+0x7/0x21
> Mar 17 09:44:46 amds01b kernel: [<ffffffff8d0c5d90>] ? insert_kthread_work+0x40/0x40
> 
> Has anyone seen anything similar when using Lustre on ZFS, and were you able to recover the filesystem to a workable state? I've found a few similar problems in the ZFS mailing lists over the last few years, but no solutions.
> 
> Kind Regards,
> Christopher.
> 
> ________________________________________
> From: Mountford, Christopher J. (Dr.) <cjm14@leicester.ac.uk>
> Sent: 15 March 2023 20:35
> To: Colin Faber; Mountford, Christopher J. (Dr.)
> Cc: lustre-discuss
> Subject: Re: [lustre-discuss] Repeated ZFS panics on MDT
> 
> The ZFS scrub completed without any errors/corrections.
> 
> Following the scrub (and clearing all users from our cluster login nodes) I remounted the MDT and it appears to be running fine (just running the remaining batch jobs). I'm now able to get onto our monitoring system - hopefully a look at the Lustre jobstats might reveal if there was unusually high filesystem load.
> 
> zpool status following scrub:
> 
> [root@amds01b ~]# zpool status
>   pool: ahomemdt00
>  state: ONLINE
>   scan: scrub repaired 0B in 0h36m with 0 errors on Wed Mar 15 20:16:53 2023
> config:
> 
> NAME                  STATE     READ WRITE CKSUM
> ahomemdt00            ONLINE       0     0     0
>   mirror-0            ONLINE       0     0     0
>     A12-amds01j1-001  ONLINE       0     0     0
>     A12-amds01j1-002  ONLINE       0     0     0
>   mirror-1            ONLINE       0     0     0
>     A12-amds01j1-003  ONLINE       0     0     0
>     A12-amds01j1-004  ONLINE       0     0     0
>   mirror-2            ONLINE       0     0     0
>     A12-amds01j1-005  ONLINE       0     0     0
>     A12-amds01j1-006  ONLINE       0     0     0
>   mirror-3            ONLINE       0     0     0
>     A12-amds01j1-007  ONLINE       0     0     0
>     A12-amds01j1-008  ONLINE       0     0     0
>   mirror-4            ONLINE       0     0     0
>     A12-amds01j1-009  ONLINE       0     0     0
>     A12-amds01j1-010  ONLINE       0     0     0
> 
> errors: No known data errors
> [root@amds01b ~]#
> 
> We did see the following error in the MDS syslog after the Lustre recovery completed (Mar 15 20:19:55 amds01b kernel: LustreError: 6052:0:(mdd_orphans.c:375:mdd_orphan_key_test_and_delete()) ahome3-MDD0000: error unlinking orphan [0x20001bb76:0xd86e:0x0]: rc = -2) - not sure if this could be related to the problem.
> 
> We also had one login node locked up with a load of >4000 and many ps commands hung in uninterruptible sleep. Again, not sure if this could be a cause of the problems we have been seeing.
> 
> Kind Regards,
> Christopher.
> 
> ________________________________________
> From: lustre-discuss <lustre-discuss-bounces@lists.lustre.org> on behalf of Mountford, Christopher J. (Dr.) via lustre-discuss <lustre-discuss@lists.lustre.org>
> Sent: 15 March 2023 19:21
> To: Colin Faber
> Cc: lustre-discuss
> Subject: Re: [lustre-discuss] Repeated ZFS panics on MDT
> 
> Hi Colin,
> 
> Not yet, we last scrubbed the pool ~2 weeks ago when we first saw this problem. I've got a few additional tests to run now to see if we can track the cause to a particular job/process, but kicking off a scrub is my next thing to do (it should only take ~40 minutes; it's a fairly small SSD-based MDT).
> 
> Thanks,
> Chris
> 
> ________________________________________
> From: Colin Faber <cfaber@gmail.com>
> Sent: 15 March 2023 18:41
> To: Mountford, Christopher J. (Dr.)
> Cc: lustre-discuss
> Subject: Re: [lustre-discuss] Repeated ZFS panics on MDT
> 
> Have you tried resilvering the pool?
> 
> On Wed, Mar 15, 2023, 11:57 AM Mountford, Christopher J. (Dr.) via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:
> 
> I'm hoping someone can offer some suggestions.
> 
> We have a problem on our production Lustre/ZFS filesystem (CentOS 7, ZFS 0.7.13, Lustre 2.12.9); so far I've drawn a blank trying to track down the cause.
> 
> We see the following ZFS panic message in the logs (in every case the VERIFY3/panic lines are identical):
> 
> Mar 15 17:15:39 amds01a kernel: VERIFY3(sa.sa_magic == 0x2F505A) failed (8 == 3100762)
> Mar 15 17:15:39 amds01a kernel: PANIC at zfs_vfsops.c:584:zfs_space_delta_cb()
> Mar 15 17:15:39 amds01a kernel: Showing stack for process 15381
> Mar 15 17:15:39 amds01a kernel: CPU: 31 PID: 15381 Comm: mdt00_020 Tainted: P OE ------------ 3.10.0-1160.49.1.el7_lustre.x86_64 #1
> Mar 15 17:15:39 amds01a kernel: Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 02/09/2023
> Mar 15 17:15:39 amds01a kernel: Call Trace:
> Mar 15 17:15:39 amds01a kernel: [<ffffffff99d83539>] dump_stack+0x19/0x1b
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0b76f24>] spl_dumpstack+0x44/0x50 [spl]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0b76ff9>] spl_panic+0xc9/0x110 [spl]
> Mar 15 17:15:39 amds01a kernel: [<ffffffff996e482c>] ? update_curr+0x14c/0x1e0
> Mar 15 17:15:39 amds01a kernel: [<ffffffff99707cf4>] ? getrawmonotonic64+0x34/0xc0
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c87aa3>] ? dmu_zfetch+0x393/0x520 [zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c6a073>] ? dbuf_rele_and_unlock+0x283/0x4c0 [zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0b78ff1>] ? __cv_init+0x41/0x60 [spl]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0d0f53c>] zfs_space_delta_cb+0x9c/0x200 [zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c7a944>] dmu_objset_userquota_get_ids+0x154/0x440 [zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c89e98>] dnode_setdirty+0x38/0xf0 [zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c8a21c>] dnode_allocate+0x18c/0x230 [zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c76d2b>] dmu_object_alloc_dnsize+0x34b/0x3e0 [zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d73052>] __osd_object_create+0x82/0x170 [osd_zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d7ce23>] ? osd_declare_xattr_set+0xb3/0x190 [osd_zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d733bd>] osd_mkreg+0x7d/0x210 [osd_zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffff99828f01>] ? __kmalloc_node+0x1d1/0x2b0
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d6f8f6>] osd_create+0x336/0xb10 [osd_zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc2016fb5>] lod_sub_create+0x1f5/0x480 [lod]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc2007729>] lod_create+0x69/0x340 [lod]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d65690>] ? osd_trans_create+0x410/0x410 [osd_zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc2081993>] mdd_create_object_internal+0xc3/0x300 [mdd]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc206aa4b>] mdd_create_object+0x7b/0x820 [mdd]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc2074fd8>] mdd_create+0xdd8/0x14a0 [mdd]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1f0e118>] mdt_reint_open+0x2588/0x3970 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc16f82b9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1eede52>] ? ucred_set_audit_enabled.isra.15+0x22/0x60 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1f00f23>] mdt_reint_rec+0x83/0x210 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1edc413>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee8ec6>] ? mdt_intent_fixup_resent+0x36/0x220 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee9132>] mdt_intent_open+0x82/0x3a0 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1edf74a>] mdt_intent_opc+0x1ba/0xb50 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a0d6c0>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee90b0>] ? mdt_intent_fixup_resent+0x220/0x220 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee79e4>] mdt_intent_policy+0x1a4/0x360 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc19bc4e6>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc10a22b7>] ? cfs_hash_bd_add_locked+0x67/0x90 [libcfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc10a5a4e>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc19e3aa6>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a0d740>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a6d092>] tgt_enqueue+0x62/0x210 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a73eea>] tgt_request_handle+0xada/0x1570 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a4d601>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1096bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a18bcb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a156e5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffff99d7dcf3>] ? queued_spin_lock_slowpath+0xb/0xf
> Mar 15 17:15:39 amds01a kernel: [<ffffffff99d8baa0>] ? _raw_spin_lock+0x20/0x30
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a1c534>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a1ba00>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffff996c5e61>] kthread+0xd1/0xe0
> Mar 15 17:15:39 amds01a kernel: [<ffffffff996c5d90>] ? insert_kthread_work+0x40/0x40
> Mar 15 17:15:39 amds01a kernel: [<ffffffff99d95ddd>] ret_from_fork_nospec_begin+0x7/0x21
> Mar 15 17:15:39 amds01a kernel: [<ffffffff996c5d90>] ? insert_kthread_work+0x40/0x40
> 
> At this point all ZFS I/O freezes completely and the MDS has to be fenced. This has happened ~4 times in the last hour.
> 
> I'm at a loss as to how to correct this - I'm currently thinking that we may have to rebuild and recover our entire filesystem from backups (thankfully this is our home filesystem, which is small and entirely SSD-based, so it should not take too long to recover).
> 
> May be related to this bug seen on FreeBSD (with a much more recent ZFS version): https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=216586
> 
> The problem was first seen 3 weeks ago, but went away after a couple of reboots. This time it seems to be more serious.
> Kind Regards,
> Christopher.
> 
> ------------------------------------
> Dr. Christopher Mountford,
> System Specialist,
> RCS,
> Digital Services,
> University Of Leicester.
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


