List:       freebsd-stable
Subject:    Re: mlx4en, timer irq @100%... (11.0 stuck on high network load ???)
From:       Ben RUBSON <ben.rubson () gmail ! com>
Date:       2017-08-31 16:04:27
Message-ID: 82EFBD5E-8FC2-4156-A030-AF70D97A37BA () gmail ! com

> On 28 Aug 2017, at 11:27, Julien Charbon <jch@freebsd.org> wrote:
> 
> On 8/28/17 10:25 AM, Ben RUBSON wrote:
> > > On 16 Aug 2017, at 11:02, Ben RUBSON <ben.rubson@gmail.com> wrote:
> > > 
> > > > On 15 Aug 2017, at 23:33, Julien Charbon <jch@freebsd.org> wrote:
> > > > 
> > > > On 8/11/17 11:32 AM, Ben RUBSON wrote:
> > > > > > On 08 Aug 2017, at 13:33, Julien Charbon <jch@freebsd.org> wrote:
> > > > > > 
> > > > > > On 8/8/17 10:31 AM, Hans Petter Selasky wrote:
> > > > > > > 
> > > > > > > Suggested fix attached.
> > > > > > 
> > > > > > I agree with your conclusion.  Just for the record, more precisely this
> > > > > > regression seems to have been introduced with:
> > > > > > (...)
> > > > > > Thus good catch, and your patch looks good.  I am going to just verify
> > > > > > the other in_pcbrele_wlocked() calls in TCP stack.
> > > > > 
> > > > > Julien, do you plan to make this fix reach 11.0-p12 ?
> > > > 
> > > > I am checking if your issue is another flavor of the issue fixed by:
> > > > 
> > > > https://svnweb.freebsd.org/base?view=revision&revision=307551
> > > > https://reviews.freebsd.org/D8211
> > > > 
> > > > This fix is not in 11.0 but is in 11.1.  So far I have not found how an
> > > > inp in INP_TIMEWAIT state could have been INP_FREED without already having
> > > > its tw set to NULL, except through the issue fixed by r307551.
> > > > 
> > > > Thus could you try to apply this patch:
> > > > 
> > > > https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch
> > > >  
> > > > and see if you can still reproduce this issue?
> > > 
> > > Thank you for your answer Julien.
> > > Unfortunately, I'm not sure at all how to reproduce the issue.
> > > I have other servers which are 100% identical to this one, same workload,
> > > same some-months uptime, but they did not trigger the bug yet.
> > > 
> > > If other network stack experts (I'm not) agree with your analysis,
> > > we could then certainly go further with D8211 / r307551.
> > > 
> > > One thing that perhaps might help :
> > > # netstat -an | grep TIME_WAIT$ | wc -l
> > > 468
> > > 
> > > Note that due to this running bug, sendmail has a lot of difficulty sending
> > > outgoing mail. As soon as I run the above netstat command, I receive a lot
> > > of stacked mails (more than 20 this time), as if netstat were somehow able
> > > to help...
> > > The number of TIME_WAIT connections, however, does not decrease but increases.
> > > 
> > > > And in the spirit of the r307551 fix and based on Hans' patch I will also
> > > > propose to add a kernel log describing the issue instead of starting an
> > > > infinite loop when INVARIANTS is not set.
> > > 
> > > Which should then never be triggered :)
> > > Good idea I think !
> > 
> > What about :
> > D8211/r307551
> > + Hans' patch
> > + Julien's idea of a kernel log (sort of "We should not be here but we are")
> 
> I did this change and I am testing it

Good news !

> on your side did you try this patch applied on 11.0?
> 
> https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch
> 

Yes, patch applied and running correctly,
however hard to say whether or not it solves this issue,
as there is no easy way to reproduce it.

> > And backporting all this to 11.0 (and so to 11.1 too) ?
> > 
> > As this bug can impact every FreeBSD machine / server,
> > leading to an unavailable / unreachable system (this is how mine ended),
> > backporting sounds like it could only be a good thing for production
> > stability.
> 
> The main fix for your issue is (I believe):
> 
> Fix a double-free when an inp transitions to INP_TIMEWAIT state
> after having been dropped.
> https://svnweb.freebsd.org/base?view=revision&revision=307551
> 
> This fix has been MFC-ed to both stable/11 and stable/10; it is already
> included in 11.1 and will be in 10.4.  To push it into the 11.0 release
> directly, I guess you have to promote this change to an Errata (I never did
> that myself):
> 
> https://www.freebsd.org/security/notices.html
> https://www.freebsd.org/security/security.html#reporting

Mail sent to FreeBSD Security Team !

Many thanks, let's stay tuned !

Ben

_______________________________________________
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"

