
List:       linux-nfs
Subject:    Re: decant_cull_table intermittently aborting cachefilesd
From:       Chris Chilvers <chilversc () gmail ! com>
Date:       2023-03-28 17:22:49
Message-ID: CAAmbk-dAD65xUNQ5C004rc_AU4qXhYj5NTLzwm7khQr-KV1LYg () mail ! gmail ! com

On Fri, 3 Feb 2023 at 11:17, Chris Chilvers <chilversc@gmail.com> wrote:
> 
> I have been having an issue where cachefilesd will randomly crash causing the
> cache to be withdrawn. The crash is intermittent and can sometimes happen
> within minutes, other times it can take hours, or never.
> 
> Fortunately it has produced a crash dump so I've been able to analyse what
> happened.
> 
> From the stack trace (and debug logging) the last operation it was running is
> the decant_cull_table. The code fails in the check block at the end of the
> function when it calls abort().
> 
> (gdb) bt
> #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140614334650176) at ./nptl/pthread_kill.c:44
> #1  __pthread_kill_internal (signo=6, threadid=140614334650176) at ./nptl/pthread_kill.c:78
> #2  __GI___pthread_kill (threadid=140614334650176, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
> #3  0x00007fe353442476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
> #4  0x00007fe3534287f3 in __GI_abort () at ./stdlib/abort.c:79
> #5  0x0000556d6c9f0965 in decant_cull_table () at cachefilesd.c:1571
> #6  cachefilesd () at cachefilesd.c:780
> #7  0x0000556d6c9f140b in main (argc=<optimized out>, argv=<optimized out>) at cachefilesd.c:581
> 
> For reference the code at frame 5 from the decant_cull_table function is:
> 
> check:
>         for (loop = 0; loop < nr_in_ready_table; loop++)
>                 if (((long)cullready[loop] & 0xf0000000) == 0x60000000)
>                         abort();
> 
> Checking the cull table, the first object in the cull table appears to be
> valid.
> 
> (gdb) p nr_in_ready_table
> $1 = 242
> 
> (gdb) p cullready[0]
> $2 = (struct object *) 0x556d6d7382a0
> 
> (gdb) p -pretty -- *cullready[0]
> $3 = {
>   parent = 0x556d6d7352b0,
>   children = 0x0,
>   next = 0x0,
>   prev = 0x0,
>   dir = 0x0,
>   ino = 13631753,
>   usage = 1,
>   empty = false,
>   new = false,
>   cullable = true,
>   type = OBJTYPE_DATA,
>   atime = 1675349423,
>   name = "E"
> }
> 
> The inode number from the struct matches a file in the fscache.
> 
> $ sudo find /var/cache/fscache -inum 13631753
> /var/cache/fscache/cache/Infs,3.0,2,,300000a,e5e9b1269df2b0d,,,d0,100000,100000,249f0,249f0,249f0,249f0,1/@00/E210w114Hg92Az0HAMYCClFMVmkMY050002w1qO200
>  
> However, the memory address of the struct matches the bit pattern the check
> tests for, so the check fails and abort() is called.
> 
> (gdb) p (((long)cullready[0] & 0xf0000000) == 0x60000000)
> $4 = 1
> 
>     0000 556d 6d73 82a0
>   & 0000 0000 f000 0000
>   = 0000 0000 6000 0000
> 
> $ file /sbin/cachefilesd
> /sbin/cachefilesd: ELF 64-bit LSB pie executable, x86-64
> 
> Looking at the code, I suspect that this magic 0x60000000 number is supposed
> to be some kind of sentinel value that's used as a bug check for errors such
> as use after free? This would make sense when the application was 32 bit, as
> address pattern 0110 in the highest nibble either cannot occur, or lies within
> the kernel address space. However, when compiled as 64 bit this assumption is
> no longer true and the bit pattern can appear in perfectly valid addresses.
> 
> This would also explain the random nature of the crashes, as cachefilesd is
> at the whim of the OS and the calloc function for the addresses it receives.
> 
> --
> Chris

Any thoughts on this issue? I think the main question to be answered is whether
the debug checks such as "(0x6b000000 | __LINE__)" still have any value. If
not, the code can be simplified by setting the pointer to NULL and updating
the check to look for NULLs.
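
As a rough illustration of the NULL-based variant, something along these
lines might work. This is only a sketch: clear_slot() and
count_null_entries() are hypothetical helpers, and struct object is left
opaque here rather than copied from cachefilesd.c.

```c
#include <stddef.h>

struct object;	/* opaque stand-in for cachefilesd's struct object */

/* Clear a table slot once its object has been handed off, instead of
 * storing a poison value such as (0x6b000000 | __LINE__). */
static void clear_slot(struct object **table, size_t idx)
{
	table[idx] = NULL;
}

/* Count entries in the first nr slots that are unexpectedly NULL.
 * A non-zero result means a cleared slot is still inside the live range. */
static size_t count_null_entries(struct object *const *table, size_t nr)
{
	size_t loop, bad = 0;

	for (loop = 0; loop < nr; loop++)
		if (table[loop] == NULL)
			bad++;
	return bad;
}
```

The check block could then abort() when the count is non-zero, at the cost
of no longer knowing which line cleared the slot.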

If __LINE__ still has value then there are two questions to answer:

1. How to make this safe for 64 bit architectures?
2. Should __LINE__ only be included in debug builds, and null used normally?
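
On question 1, one possible approach is to move the marker out of the high
bits and into the low bit of the pointer. The sketch below assumes that every
real entry comes from malloc()/calloc() and that struct object has an
alignment greater than 1, so a real pointer never has its low bit set.
POISON_PTR(), is_poisoned() and poison_line() are hypothetical names, not
existing cachefilesd code, and the integer/pointer casts are
implementation-defined (though they behave as expected on common Linux
targets).

```c
#include <stdint.h>

struct object;	/* opaque stand-in for cachefilesd's struct object */

/* Encode a source line as a poison pointer. The low bit is always set, so
 * the value cannot equal a suitably aligned heap pointer, regardless of
 * whether pointers are 32 or 64 bits wide. */
#define POISON_PTR(line) \
	((struct object *)((((uintptr_t)(line)) << 1) | 1u))

/* True if the pointer is a poison marker rather than a real object. */
static int is_poisoned(const struct object *p)
{
	return ((uintptr_t)p & 1u) != 0;
}

/* Recover the source line stored in a poison marker. */
static unsigned long poison_line(const struct object *p)
{
	return (unsigned long)((uintptr_t)p >> 1);
}
```

A slot would be cleared with POISON_PTR(__LINE__), and the check block would
call abort() for any entry where is_poisoned() is true; poison_line() then
says which line cleared it when inspecting a core dump.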

