
List:       linux-bcache
Subject:    Re: bcache: workqueue lockup
From:       Stefan Priebe - Profihost AG <s.priebe () profihost ! ag>
Date:       2018-08-22 7:31:49
Message-ID: c79c08ea-3b2d-fa76-03e4-3c7a2b1b4e4b () profihost ! ag


On 21.08.2018 at 11:55, Coly Li wrote:
> On 2018/8/21 5:46 PM, Stefan Priebe - Profihost AG wrote:
>> Hi Coly,
>>
>> today i had again the following deadlock where the whole systems get
>> stuck and can't recover:
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666200] BUG: workqueue
>> lockup - pool cpus=2 node=0 flags=0x1 nice=0 stuck for 35s!
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666228] BUG: workqueue
>> lockup - pool cpus=4 node=0 flags=0x1 nice=0 stuck for 47s!
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666248] BUG: workqueue
>> lockup - pool cpus=11 node=0 flags=0x1 nice=0 stuck for 47s!
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666267] Showing busy
>> workqueues and worker pools:
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666279] workqueue events:
>> flags=0x0
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666289]   pwq 22: cpus=11
>> node=0 flags=0x1 nice=0 active=4/256
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666305]     pending:
>> mlx5e_rx_am_work [mlx5_core], kernfs_notify_workfn, mlx5e_rx_am_work
>> [mlx5_core], free_work
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666346]   pwq 18: cpus=9
>> node=0 flags=0x1 nice=0 active=3/256
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666362]     pending:
>> free_work, mlx5e_rx_am_work [mlx5_core], mlx5e_rx_am_work [mlx5_core]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666390]   pwq 8: cpus=4
>> node=0 flags=0x1 nice=0 active=8/256
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666405]     pending:
>> journal_write [bcache], mlx5e_rx_am_work [mlx5_core], mlx5e_rx_am_work
>> [mlx5_core], free_work, rht_deferred_worker, rht_deferred_worker,
>> rht_deferred_worker, rht_deferred_worker
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666468]   pwq 4: cpus=2
>> node=0 flags=0x1 nice=0 active=8/256
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666501]     pending:
>> mlx5e_rx_am_work [mlx5_core], mlx5e_rx_am_work [mlx5_core],
>> rht_deferred_worker, rht_deferred_worker, rht_deferred_worker,
>> rht_deferred_worker, vmpressure_work_fn, free_work
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666600] workqueue bcache:
>> flags=0x8
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666610]   pwq 22: cpus=11
>> node=0 flags=0x1 nice=0 active=9/256 MAYDAY
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666626]     in-flight:
>> 375565:bch_data_insert_keys [bcache], 375566:bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666653]     pending:
>> bch_data_insert_keys [bcache], bch_data_insert_keys [bcache],
>> bch_data_insert_keys [bcache], bch_data_insert_keys [bcache],
>> bch_data_insert_keys [bcache], bch_data_insert_keys [bcache],
>> bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666710]   pwq 18: cpus=9
>> node=0 flags=0x1 nice=0 active=12/256 MAYDAY
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666726]     in-flight:
>> 375501:bch_data_insert_keys [bcache], 373349:bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666753]     pending:
>> bch_data_insert_keys [bcache], bch_data_insert_keys [bcache],
>> bch_data_insert_keys [bcache], bch_data_insert_keys [bcache],
>> bch_data_insert_keys [bcache], bch_data_insert_keys [bcache],
>> bch_data_insert_keys [bcache], bch_data_insert_keys [bcache],
>> bch_data_insert_keys [bcache], bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666837]   pwq 16: cpus=8
>> node=0 flags=0x0 nice=0 active=1/256
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666854]     in-flight:
>> 375294:bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666873]   pwq 10: cpus=5
>> node=0 flags=0x0 nice=0 active=44/256
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666889]     in-flight:
>> 376496:bch_data_insert_keys [bcache], 376501:bch_data_insert_keys
>> [bcache], 376520:bch_data_insert_keys [bcache],
>> 376503:bch_data_insert_keys [bcache], 376492:bch_data_insert_keys
>> [bcache], 376516:bch_data_insert_keys [bcache],
>> 376526:bch_data_insert_keys [bcache], 376504:bch_data_insert_keys
>> [bcache], 376524:bch_data_insert_keys [bcache],
>> 376497:bch_data_insert_keys [bcache], 376507:bch_data_insert_keys
>> [bcache], 376523:bch_data_insert_keys [bcache],
>> 376522:bch_data_insert_keys [bcache], 376510:bch_data_insert_keys
>> [bcache], 376498:bch_data_insert_keys [bcache],
>> 376518:bch_data_insert_keys [bcache], 376506:bch_data_insert_keys
>> [bcache], 376493:bch_data_insert_keys [bcache],
>> 376513:bch_data_insert_keys [bcache], 376530:bch_data_insert_keys
>> [bcache], 376499:bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.669998] ,
>> 376519:bch_data_insert_keys [bcache], 376512:bch_data_insert_keys
>> [bcache], 376532:bch_data_insert_keys [bcache],
>> 376490:bch_data_insert_keys [bcache], 376527:bch_data_insert_keys
>> [bcache], 376514:bch_data_insert_keys [bcache],
>> 376491:bch_data_insert_keys [bcache], 376500:bch_data_insert_keys
>> [bcache], 376502:bch_data_insert_keys [bcache],
>> 376508:bch_data_insert_keys [bcache], 376525:bch_data_insert_keys
>> [bcache], 376494:bch_data_insert_keys [bcache],
>> 376509:bch_data_insert_keys [bcache], 376528:bch_data_insert_keys
>> [bcache], 376517:bch_data_insert_keys [bcache],
>> 376521:bch_data_insert_keys [bcache], 376495:bch_data_insert_keys
>> [bcache], 376515:bch_data_insert_keys [bcache],
>> 376531:bch_data_insert_keys [bcache], 376505:bch_data_insert_keys
>> [bcache], 376529:bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.674767] ,
>> 376533:bch_data_insert_keys [bcache], 376511:bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.675501]   pwq 8: cpus=4
>> node=0 flags=0x1 nice=0 active=256/256
>> Aug 21 00:28:37 cloud1-1472 kernel: [1150884.676303]     in-flight:
>> 377020:bch_data_insert_keys [bcache], 375498:bch_data_insert_keys
>> [bcache], 377015:bch_data_insert_keys [bcache],
>> 377019:bch_data_insert_keys [bcache], 377010:bch_data_insert_keys
>> [bcache], 377004:bch_data_insert_keys [bcache],
>> 377016:bch_data_insert_keys [bcache], 377012:bch_data_insert_keys
>> [bcache], 377007:bch_data_insert_keys [bcache],
>> 377014:bch_data_insert_keys [bcache], 377017:bch_data_insert_keys
>> [bcache], 377013:bch_data_insert_keys [bcache],
>> 377018:bch_data_insert_keys [bcache], 375496:bch_data_insert_keys
>> [bcache], 377005:bch_data_insert_keys [bcache],
>> 377009:bch_data_insert_keys [bcache], 362(RESCUER):bch_data_insert_keys
>> [bcache] bch_data_insert_keys [bcache] bch_data_insert_keys [bcache]
>> bch_data_insert_keys [bcache] bch_data_insert_keys [bcache]
>> bch_data_insert_keys [bcache]
>> kern
> 
> Hi Stefan,
> 
> Thanks for the informative log. I see journal write and key insert
> keywords in the log, so I continue to suspect a potential locking issue
> in the bcache journal code.
> 
> This issue is on my todo list, and I am looking into it now.

OK great and thanks. I hope you can fix it.

> 
> Coly Li
> 
> 
>>
>> On 09.08.2018 at 12:22, Coly Li wrote:
>>> On 2018/8/9 6:12 PM, Stefan Priebe - Profihost AG wrote:
>>>> Hi,
>>>> On 09.08.2018 at 08:37, Coly Li wrote:
>>>>> On 2018/8/8 7:52 PM, Stefan Priebe - Profihost AG wrote:
>>>>>> Hi,
>>>>>> On 07.08.2018 at 16:35, Coly Li wrote:
>>>>>>> On 2018/8/7 3:41 AM, Stefan Priebe - Profihost AG wrote:
>>>>>>>> On 06.08.2018 at 16:21, Coly Li wrote:
>>>>>>>>> On 2018/8/6 9:33 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>> Hi Coly,
>>>>>>>>>> On 06.08.2018 at 15:06, Coly Li wrote:
>>>>>>>>>>> On 2018/8/6 2:33 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>> Hi Coly,
>>>>>>>>>>>>
>>>>>>>>>>>> while running the SLES15 kernel I observed a workqueue lockup and a
>>>>>>>>>>>> totally crashed system today.
>>>>>>>>>>>>
>>>>>>>>>>>> The dmesg output is about 3.5 MB, but it seems to just repeat the
>>>>>>>>>>>> bch_data_insert_keys message.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Stefan,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your information!
>>>>>>>>>>>
>>>>>>>>>>> Could you please give me a hint on how to reproduce it? Even if it is
>>>>>>>>>>> not reliably reproducible, a detailed procedure may help me a lot.
>>>>>>>>>>
>>>>>>>>>> I'm sorry, but I can't reproduce it. It just happens out of nowhere in
>>>>>>>>>> our Ceph production cluster.
>>>>>>>>>
>>>>>>>>> I see. Could you please share some configuration information? E.g.:
>>>>>>>>
>>>>>>>> sure.
>>>>>>>>
>>>>>>>>> - How many CPU cores
>>>>>>>> 12
>>>>>>>>
>>>>>>>>> - How much physical memory
>>>>>>>> 64GB
>>>>>>>>
>>>>>>>>> - How large the SSD is, and whether it is NVMe or SATA
>>>>>>>> bcache cache size: 250GB SATA SSD
>>>>>>>>
>>>>>>>>> - How many SSDs
>>>>>>>> 1x
>>>>>>>>
>>>>>>>>> - How large (and how many) the backing hard drives are
>>>>>>>> 2x 1TB
>>>>>>>>
>>>>>>>>> I will try to simulate a similar workload with fio and see how lucky I am.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Generally, the workload in that timeframe was mostly reads, fsync inside
>>>>>>>> the guests, and fstrim.
>>>>>>>
>>>>>>> Hi Stefan,
>>>>>>>
>>>>>>> From your information, I suspect this was a journal-related deadlock.
>>>>>>>
>>>>>>> If there are too many small I/Os that make the btree inside bcache grow
>>>>>>> too fast, and in turn exhaust the journal space, a dead-lock-like hang
>>>>>>> can probably happen.
>>>>>>>
>>>>>>> Junhui has tried to fix it by increasing the journal slot size, but the
>>>>>>> root cause is not fixed yet. The journal operation in bcache is not
>>>>>>> atomic: a btree node write goes into the journal first, and is then
>>>>>>> inserted into the btree node by journal replay. If the btree node has to
>>>>>>> be split during the journal replay, the split metadata needs to go into
>>>>>>> the journal first; if the journal space is already exhausted, a deadlock
>>>>>>> may happen.
>>>>>>>
>>>>>>> A real fix is to make the bcache journal operation atomic; that means:
>>>>>>> 1. Reserve the estimated journal slots before a journal I/O.
>>>>>>> 2. If the reservation succeeds, go ahead; if it fails, wait and try again.
>>>>>>> 3. If journal replay results in a btree split, the journal slot for the
>>>>>>> new metadata is already reserved in the journal and can never fail.
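[Editorial note: as an illustration of the reserve-before-write idea in the
three steps above, a minimal sketch of the flow might look like the code
below. All helper names (estimate_key_slots, journal_reserve,
journal_wait_for_space, btree_insert_reserved, journal_commit) are
hypothetical, and the struct types are only assumed to come from bcache's
internal headers; this is not actual bcache code.]

/*
 * Sketch only: reserve enough journal slots, including the worst-case
 * split metadata, before the write, so the insert path can no longer
 * block on an exhausted journal.
 */
static int journal_insert_atomic(struct journal *j, struct keylist *keys)
{
	/* Step 1: reserve slots for the keys plus possible btree splits. */
	unsigned int slots = estimate_key_slots(keys) +
			     estimate_split_slots(keys);

	/* Step 2: if the reservation fails, wait for free space and retry. */
	while (!journal_reserve(j, slots))
		journal_wait_for_space(j);

	/*
	 * Step 3: the insert (and any split done during replay) cannot fail
	 * for lack of journal space, because the slots are already reserved.
	 */
	btree_insert_reserved(keys);
	journal_commit(j, keys);

	return 0;
}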
>>>>>>>
>>>>>>> This fix is not simple, and I am currently working on other fixes (4Kn
>>>>>>> hard drives and big endian...). If no one else helps with this fix, it
>>>>>>> will be a while before I can focus on it.
>>>>>>>
>>>>>>> Because you mentioned fstrim happened in your guests: if the backing
>>>>>>> device of bcache supports DISCARD/TRIM, bcache will also invalidate the
>>>>>>> fstrim range in its internal btree, which may generate more btree
>>>>>>> metadata I/O. Therefore I guess it might be related to the journal.
>>>>>>>
>>>>>>> Hmm, how about I compose a patch to display the number of free journal
>>>>>>> slots? If such an issue happens again and you can still access sysfs, we
>>>>>>> can check whether this is a journal issue. Maybe I am wrong, but it's
>>>>>>> worth a try.
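[Editorial note: for the sysfs display mentioned above, a rough sketch of a
generic read-only attribute could look like the following. The free-slot
counter and the attribute name are assumptions for illustration; bcache's own
sysfs code uses its internal SHOW()/STORE() macro framework, so this snippet
is not taken from it.]

#include <linux/kobject.h>
#include <linux/sysfs.h>

/* Sketch only: would be read from the journal state in a real patch. */
static unsigned int journal_free_slots;

static ssize_t journal_free_slots_show(struct kobject *kobj,
				       struct kobj_attribute *attr, char *buf)
{
	/* Report the hypothetical count of unused journal slots. */
	return snprintf(buf, PAGE_SIZE, "%u\n", journal_free_slots);
}

static struct kobj_attribute journal_free_slots_attr =
	__ATTR_RO(journal_free_slots);

[Registering the attribute with sysfs_create_file() on an existing kobject
would then expose the value under sysfs for inspection when the hang occurs.]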
>>>>>>
>>>>>> I don't believe the journal was full - the workload at that time (02:00
>>>>>> AM) is mostly read-only plus deleting and truncating files, and the
>>>>>> journal is pretty big at 250GB. Ceph handles fstrim inside the guests as
>>>>>> truncates and file deletes outside the guest. So the real workload for
>>>>>> bcache was:
>>>>>> - reads (backup time)
>>>>>> - file deletes (xfs)
>>>>>> - file truncates (xfs)
>>>>>
>>>>> Hi Stefan,
>>>>>
>>>>> I guess we were talking about different journals. I should say explicitly:
>>>>> the bcache journal, sorry for misleading you.
>>>>>
>>>>> There is a bcache journal too, which is around 500MB in size and holds
>>>>> metadata. If there are too many metadata operations, it is quite easy to
>>>>> fill it up completely.
>>>>
>>>> Ah OK, perfect. The only problem is that I wasn't able to connect to the
>>>> server anymore when this happened, so I'm unable to get data from sysfs.
>>>> Would it be possible to add the value to the printk line we already get?
>>>
>>> Hi Stefan,
>>>
>>> I see. Now I will try to compose some patches to display the information,
>>> and also try to reduce the possibility of such a locking issue. Give me
>>> some time.
>>>
>>> Maybe rebooting the server, reloading bcache, and waiting for quite a long
>>> time gives a chance to escape from such an awkward situation. Sorry that I
>>> have no idea how to fix such a deadlock on a running system...
>>>
>>> Coly Li
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
> 