
List:       freebsd-net
Subject:    RE: Followup from Verisign after last week's developer summit
From:       "Bentkofsky, Michael" <MBentkofsky@verisign.com>
Date:       2013-05-23 16:44:00
Message-ID: <080FBD5B7A09F845842100A6DE79623321F703B5@BRN1WNEXMBX01.vcorp.ad.vrsn.com>

I am adding freebsd-net to this and will re-summarize to get additional
input. Thanks for all of the initial suggestions.

For the benefit of those on freebsd-net@, we are noticing significant
locking contention on the V_tcpinfo lock under moderately high
connection establishment and teardown rates (around 45-50k connections
per second). Our profiling suggests the contention on V_tcpinfo
effectively single-threads all TCP connections. Similar testing on
Linux with equivalent hardware does not show this contention and
achieves a much higher connection establishment rate. We can attach
profiling and test details if anyone would like.

JHB recommends:
- He has seen similar results in other kinds of testing.
- Linux uses RCU for the locking on the equivalent table (we've
  confirmed this to be the case).
- Looking into a lock per bucket on the PCB lookup.
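As a rough illustration of the lock-per-bucket idea, here is a minimal
userspace sketch (hypothetical names, pthread mutexes standing in for
kernel mutexes; this is not the actual inpcb code): each hash chain
carries its own lock, so lookups and insertions that hash to different
buckets never contend.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 256

struct conn {
	uint32_t faddr;		/* foreign address (part of hash key) */
	uint16_t fport;		/* foreign port (part of hash key) */
	struct conn *next;	/* bucket chain */
};

struct bucket {
	pthread_mutex_t lock;	/* protects only this chain */
	struct conn *head;
};

static struct bucket hashtbl[NBUCKETS];

void
conn_tbl_init(void)
{
	size_t i;

	for (i = 0; i < NBUCKETS; i++) {
		pthread_mutex_init(&hashtbl[i].lock, NULL);
		hashtbl[i].head = NULL;
	}
}

static size_t
conn_hash(uint32_t faddr, uint16_t fport)
{
	return (((faddr ^ fport) * 2654435761u) % NBUCKETS);
}

void
conn_insert(struct conn *c)
{
	struct bucket *b = &hashtbl[conn_hash(c->faddr, c->fport)];

	pthread_mutex_lock(&b->lock);	/* bucket-local, not global */
	c->next = b->head;
	b->head = c;
	pthread_mutex_unlock(&b->lock);
}

struct conn *
conn_lookup(uint32_t faddr, uint16_t fport)
{
	struct bucket *b = &hashtbl[conn_hash(faddr, fport)];
	struct conn *c;

	pthread_mutex_lock(&b->lock);
	for (c = b->head; c != NULL; c = c->next)
		if (c->faddr == faddr && c->fport == fport)
			break;
	pthread_mutex_unlock(&b->lock);
	return (c);
}
```

Note the caveat JHB raises below: connection setups all arriving at one
listening socket hash to the same bucket, so per-bucket locks alone do
not help a single hot listener.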

Jeff recommends:
- Changing the lock strategy so the hash lookup can be effectively
  pushed further down into the stack.
- Making the [list] iterators more complex, like those in use in the
  hash lookup now.

We are starting down these paths to try to break the locking down.
We'll post some initial patch ideas soon. Meanwhile, any additional
suggestions are certainly welcome.

Finally, I will mention that we have enabled PCBGROUPS in some of our
testing with 9.1 and found no change for our particular workload with
high connection establishment rates.

Thanks,
Mike

-----Original Message-----
From: Jeff Roberson [mailto:jroberson@jroberson.net] 
Sent: Wednesday, May 22, 2013 12:50 AM
To: John Baldwin
Cc: Bentkofsky, Michael; rwatson@freebsd.org; jeff@freebsd.org; Charbon, Julien
Subject: Re: Followup from Verisign after last week's developer summit

On Tue, 21 May 2013, Jeff Roberson wrote:

> On Tue, 21 May 2013, John Baldwin wrote:
> 
> > On Monday, May 20, 2013 9:48:02 am Bentkofsky, Michael wrote:
> > > Greetings gentlemen,
> > > 
> > > It was a pleasure to meet you all last week at the FreeBSD
> > > developer summit. I would like to thank you for spending the time
> > > to discuss all the wonderful internals of the network stack. We
> > > also thoroughly enjoyed the discussion on receive side scaling.
> > > 
> > > I'm sure you will remember both Julien Charbon and me asking
> > > questions regarding the TCP stack implementation, specifically
> > > around the locking internals. I am hoping to follow up with a path
> > > forward so we might be able to enhance the connection rate
> > > performance. Our internal testing has found that the V_tcpinfo
> > > lock prevents TCP scaling under high connection setup and teardown
> > > rates. In fact, we surmise that a new "FIN flood" attack may be
> > > possible to degrade server connections significantly.
> > > 
> > > In short, we are interested in changing this locking strategy and
> > > hope to get input from someone with more familiarity with the
> > > implementation. We're willing to be part of the coding effort and
> > > are willing to submit our suggestions to the community. I think we
> > > might just need some occasional input.
> > > 
> > > Also, I will point out that our similar testing on Linux shows a
> > > significant performance gap between the two operating systems on
> > > the same multi-core hardware. We're able to drive over 200,000
> > > connections per second on a Linux server compared to fewer than
> > > 50,000 on the FreeBSD server. We have kernel profiling details
> > > that we can share if you'd like.
> > 
> > I have seen similar results with a redis cluster at work (we ended up 
> > deploying proxies to allow applications to reuse existing connections 
> > to avoid this).  I believe Linux uses RCU for this table.  You could 
> > perhaps use an rm lock instead of an rw lock.  One idea I considered 
> > was to split the pcbhash lock up further so you had one lock per 
> > hash bucket so that you could allow concurrent connection 
> > setup/teardown so long as they were referencing different buckets.  
> > However, I did not think this would have been useful for the case at 
> > work since those connections were insane (single packet request 
> > followed by single packet reply with all the setup/teardown overhead) 
> > and all going to the same listening socket (so all the setups would 
> > hash to the same bucket).  Handling concurrent setup on the same 
> > listen socket is a PITA but is in fact the common case.
> 
> I don't think it's simply a synchronization primitive problem.  It 
> looks to me like the fundamental issue is that the lock order for the 
> tables is prior to the inp lock which means we have to grab it very 
> early. Presumably this is the classic sort of container -> 
> datastructure, datastructure -> container lock order problem.  This 
> seems to be made more complex by protecting the list of all pcbs, the 
> port allocation, and parts of the hash by the same lock.
> 
> Have we tried to further decompose this lock?  I would experiment with 
> that as a first step.  Is this grabbed in so many places just due to 
> the complex lock order issue?  That seems to be the case.  There are 
> only a handful of fields marked as protected by the inp info lock.  Do 
> we know that this list is complete?
> 
> My second step would be to attempt to turn the locking on its head. 
> Change the lock order from inp lock to inp info lock.  You can resolve 
> the lookup problem by adding an atomic reference count that holds the 
> datastructure while you drop the hash lock and before you acquire the 
> inp lock.  Then you could re-validate the inp after lookup.  I suspect 
> it's not that simple and there are higher level races that you'll 
> discover are being serialized by this big lock but that's just a hunch.
> 
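The lookup/ref/revalidate dance described above might look roughly like
this userspace sketch (hypothetical names and a toy direct-mapped table;
pthread mutexes and C11 atomics stand in for the kernel primitives):
pin the entry with an atomic refcount under the hash lock, drop the
hash lock, take the per-entry lock, then re-check that the entry is
still the one looked up.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct inp {
	pthread_mutex_t inp_lock;	/* per-connection lock */
	atomic_int	refcount;	/* holds the structure alive */
	bool		valid;		/* cleared on teardown */
	int		key;
};

static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
static struct inp *table[16];		/* toy direct-mapped hash table */

/* Returns a locked, referenced inp on success; NULL if gone or stale. */
struct inp *
inp_lookup_locked(int key)
{
	struct inp *inp;

	pthread_mutex_lock(&hash_lock);
	inp = table[key % 16];
	if (inp == NULL) {
		pthread_mutex_unlock(&hash_lock);
		return (NULL);
	}
	atomic_fetch_add(&inp->refcount, 1);	/* pin before unlock */
	pthread_mutex_unlock(&hash_lock);

	pthread_mutex_lock(&inp->inp_lock);	/* refcount holds it */
	if (!inp->valid || inp->key != key) {	/* revalidate the gap */
		pthread_mutex_unlock(&inp->inp_lock);
		atomic_fetch_sub(&inp->refcount, 1);
		return (NULL);
	}
	return (inp);
}

void
inp_release(struct inp *inp)
{
	pthread_mutex_unlock(&inp->inp_lock);
	atomic_fetch_sub(&inp->refcount, 1);
}
```

The point of the revalidation is that the entry may have been torn down
(and the slot reused) during the window when no lock was held.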

I read some more.  We have already done this lookup/ref/etc. dance for
the hash lock.  It handles the hard cases of multiple inp_* calls and
synchronizing the ports, bind, connect, etc.  It looks like the list
locks have been optimized to make the iterators simple.  I think this
is backwards now.  We should make the iterators complex and the normal
setup/teardown path simple.  The iterators can follow a model like the
hash lock, using sentinels to hold their place.  We have the same
pattern elsewhere.  It would allow you to acquire the INP_INFO lock
after the INP lock and push it much deeper into the stack.
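A minimal sketch of the sentinel technique (hypothetical names; a
pthread mutex stands in for the list lock, and the sketch assumes
elements are not freed while the iterator is unlocked): the iterator
parks a marker node in the list so it can drop the list lock around the
per-element work, then reacquires the lock and resumes from the marker.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	bool is_marker;		/* distinguishes sentinels from entries */
	int  value;
	struct node *next;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node *list_head;

static int visit_sum;		/* accumulator for the sample callback */

static void
sum_cb(struct node *n)
{
	visit_sum += n->value;
}

/* Apply cb() to every real element without holding list_lock across it. */
void
list_foreach_unlocked(void (*cb)(struct node *))
{
	struct node marker = { .is_marker = true };
	struct node *prev, *cur;

	pthread_mutex_lock(&list_lock);
	cur = list_head;
	while (cur != NULL) {
		if (cur->is_marker) {	/* skip other iterators' markers */
			cur = cur->next;
			continue;
		}
		/* Park the marker after cur to remember our place. */
		marker.next = cur->next;
		cur->next = &marker;
		pthread_mutex_unlock(&list_lock);

		cb(cur);		/* heavy work, no list lock held */

		pthread_mutex_lock(&list_lock);
		/* Unlink the marker and resume from what now follows it. */
		for (prev = list_head; prev->next != &marker;
		    prev = prev->next)
			;
		prev->next = marker.next;
		cur = marker.next;
	}
	pthread_mutex_unlock(&list_lock);
}
```

This is what makes the iterators "complex" in exchange for a simple
setup/teardown path: concurrent inserts and removals can proceed while
an iterator is mid-list, at the cost of every list walker having to
tolerate and skip sentinels.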

Jeff


> What do you think Robert?  If it would make improving the tcb locking 
> simpler it may fall under the umbrella of what Isilon needs but I'm 
> not sure that's the case.  Certainly my earlier attempts at deferred 
> processing were made more complex by this arrangement.
> 
> Thanks,
> Jeff
> 
> > 
> > The best forum for discussing this is probably on net@ as there are 
> > likely other interested parties who might have additional ideas.  
> > Also, it might be interesting to look at how connection groups try to 
> > handle this.  I believe they use an alternate method of decomposing 
> > the global lock into smaller chunks, and I think they might do 
> > something to help mitigate the listen socket problem (perhaps they 
> > duplicate listen sockets in all groups)?  Robert would be able to 
> > chime in on that, but I believe he is not really back home until next 
> > week.
> > 
> > --
> > John Baldwin
> > 
> 
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

