'Re: GMFD messages'

List:       cassandra-dev
Subject:    Re: GMFD messages
From:       Anthony Molinaro <anthonym () alumni ! caltech ! edu>
Date:       2010-05-26 23:52:36
Message-ID: 20100526235236.GB63843 () alumni ! caltech ! edu

Hi,

  I still haven't heard from anyone about this. While the restart helped
temporarily, it now seems to be broken again.  I restarted one node
with TRACE logging on, and I see this:

INFO [GMFD:1] 2010-05-26 23:44:44,260 GossipDigestSynMessage.java (line 129)
  @@@@ Breaking out to respect the MTU size in EPS. Estimate is 56 @@@@
TRACE [GMFD:1] 2010-05-26 23:44:44,260 Gossiper.java (line 293) @@@@ Size of
  GossipDigestAckMessage is 1374
TRACE [GMFD:1] 2010-05-26 23:44:44,261 Gossiper.java (line 937) Sending a
  GossipDigestAckMessage to /10.192.63.127

So it seems like Cassandra needs 56 bytes for each server in a gossip packet,
which with a maximum packet size of 1428 bytes means you can have at most
24 servers?
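
As a rough sanity check of that estimate, the arithmetic can be sketched in
Python. The 56 and 1428 are the numbers quoted above and 1374 is the logged
GossipDigestAckMessage size; the function name and the overhead parameter are
hypothetical, the real header overhead is unknown to me, and this is not how
Cassandra itself computes the limit.

```python
def max_endpoints(packet_budget, per_endpoint_estimate, overhead=0):
    """Return how many per-endpoint entries fit in one gossip packet.

    packet_budget: maximum packet size in bytes (1428 in the text above).
    per_endpoint_estimate: bytes per endpoint (the logged "Estimate is 56").
    overhead: bytes assumed to be taken by message headers (hypothetical;
              the real value is not given in the logs).
    """
    usable = packet_budget - overhead
    if usable <= 0:
        return 0
    return usable // per_endpoint_estimate


# With no overhead assumed, 1428 bytes holds 25 entries of 56 bytes.
print(max_endpoints(1428, 56))

# Using the logged message size of 1374 bytes as the budget gives 24 entries,
# with 30 bytes left over.
print(max_endpoints(1374, 56), 1374 % 56)
```

Either way the result is below the 27 nodes in this cluster, so not every
endpoint could fit in a single message under this estimate.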

Which sort of sucks, since I have 27 right now.  However, I'm not certain
how this would explain what I see: after a complete cluster restart,
everything works fine for about 14 hours, then suddenly stops working.

Ideas?

-Anthony

On Wed, May 26, 2010 at 08:59:30AM -0700, Anthony Molinaro wrote:
> Hi,
> 
>   I noticed yesterday that I have lots of these messages:
> 
> INFO [GMFD:1] 2010-05-25 23:21:04,070 GossipDigestSynMessage.java (line 152)
>   Remaining bytes zero. Stopping deserialization in EndPointState.
> INFO [GMFD:1] 2010-05-25 23:21:05,224 GossipDigestSynMessage.java (line 129)
>   @@@@ Breaking out to respect the MTU size in EPS. Estimate is 56 @@@@
> 
> The first message only occurs on some machines in my cluster.  The second
> on all of them.
> 
> The ones with the first message seem to be building up quite a backlog
> in their MessageDeserializer PendingTasks.
> 
> I assume there is a correlation. What could be causing this sort of thing?
> 
> This cluster is now at 27 m1.xlarge boxes on ec2 running 0.6.2 of some flavor.
> 
> I ended up restarting one of the boxes that was behind, and when it came
> back it only had some parts of the ring. So I shut down everything, brought
> the seed nodes back, then brought the rest back, and that seemed to fix
> things. But this definitely seems like some sort of bug with gossip?
> 
> Thanks,
> 
> -Anthony
> 
> -- 
> ------------------------------------------------------------------------
> Anthony Molinaro                           <anthonym@alumni.caltech.edu>

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <anthonym@alumni.caltech.edu>