'Re: High frequency garbage collection leading to High load average'

List:       cassandra-user
Subject:    Re: High frequency garbage collection leading to High load average
From:       Bowen Song <bowen () bso ! ng>
Date:       2022-03-09 12:51:58
Message-ID: bfa6b8ac-a76d-2812-788b-9551e3f8d071 () bso ! ng

It sounds like you either have hot partition(s) or a hardware issue on 
that node. I'm mentioning a hardware issue because I once had a server 
with a faulty CPU fan: the CPU overheated and throttled its frequency. 
The result was a single server with much higher load than the rest of 
the nodes in the cluster, and the symptoms looked very similar to hot 
partitions.
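
If you want to rule out frequency throttling on the suspect node, here is a rough Python sketch. It assumes a Linux host that exposes the cpufreq sysfs files (scaling_cur_freq and cpuinfo_max_freq); the function names and the 0.6 threshold are my own illustrative choices, not anything from Cassandra. Only trust the result while the node is under load, because an idle CPU scales its frequency down anyway.

```python
from pathlib import Path


def read_cpu_freqs(root="/sys/devices/system/cpu"):
    """Return {cpu_name: (current_khz, max_khz)} from Linux cpufreq sysfs files."""
    freqs = {}
    for cpu_dir in sorted(Path(root).glob("cpu[0-9]*")):
        cur = cpu_dir / "cpufreq" / "scaling_cur_freq"
        top = cpu_dir / "cpufreq" / "cpuinfo_max_freq"
        # Skip CPUs without cpufreq support (common in VMs and containers).
        if cur.is_file() and top.is_file():
            freqs[cpu_dir.name] = (int(cur.read_text()), int(top.read_text()))
    return freqs


def throttled_cpus(freqs, min_ratio=0.6):
    """List CPUs running below min_ratio of their maximum frequency.

    min_ratio is an arbitrary illustrative threshold, not a recommended value.
    """
    return sorted(
        name
        for name, (cur, top) in freqs.items()
        if top > 0 and cur / top < min_ratio
    )
```

Running throttled_cpus(read_cpu_freqs()) on the busy node and on a healthy peer at the same time gives a quick comparison; a long list on the busy node only is a hint to check cooling and hardware sensors.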

To answer your questions:

1. Slow queries can be a cause of GC pressure, a result of GC pressure, 
or both. In my experience, it is more often GC pressure that leads to 
slow queries than the other way around.

2. This is very suspicious. I can think of a few causes, such as hot 
partitions + token aware load balancing, a client that only connects to 
a single node in the cluster, heavy streaming activity within some token 
ranges, bad retry policies, etc., and it's pretty hard to know what 
exactly happened without digging deeper into it. If this happens 
frequently, you should properly investigate and fix it. This certainly 
can lead to higher GC pressure on the affected node.

3. Were there just more cache hits, while the overall number of queries 
did not change much, or even went down? That would be an indicator of 
bad retry policies. More cache hits alone won't cause much GC pressure, 
but the underlying issue that leads to the higher cache hits may.
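
The check in point 3 can be sketched in Python as below. This is only an illustration: the dictionary keys, the function name and both thresholds are assumptions I made up for the example, and you would fill in the numbers from whatever metrics you already collect for the normal day and the bad day.

```python
def cache_hits_outpace_queries(baseline, incident, hit_growth=1.5, query_growth=1.1):
    """Return True when cache hits grew a lot while query volume stayed flat or fell.

    baseline and incident are dicts with hypothetical keys "queries" and
    "cache_hits" covering the same time window on a normal day and the bad day.
    The growth thresholds are arbitrary illustrative values.
    """
    hit_ratio = incident["cache_hits"] / baseline["cache_hits"]
    query_ratio = incident["queries"] / baseline["queries"]
    return hit_ratio >= hit_growth and query_ratio < query_growth


normal_day = {"queries": 1000, "cache_hits": 400}
bad_day = {"queries": 1020, "cache_hits": 900}
suspicious = cache_hits_outpace_queries(normal_day, bad_day)
```

A True result does not prove anything by itself; it just tells you the retry policy on the client side is worth a look.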


On 08/03/2022 21:44, Inquistive allen wrote:
> Hello team,
>
> On a given day , a node in 27 node cluster observed higher frequency 
> of garbage collection. Mostly young gc.
>
> I have found below issues:
> 1. Higher number of slow queries being observed on that particular 
> node for that particular day compared to other days
>
> 2. Higher outgoing traffic observed from the node , 10 times the 
> average outbound traffic on that particular day
>
> 3. Higher number of cache requests hitting the key cache and chunk 
> cache than other days on the particular node
>
> The cluster has large partition warning as well.
>
> My query is, which of the above is a likely cause of higher frequency 
> of GC leading to High load average on the system.