List: freebsd-net
Subject: RE: Followup from Verisign after last week's developer summit
From: "Bentkofsky, Michael" <MBentkofsky () verisign ! com>
Date: 2013-05-23 16:44:00
Message-ID: 080FBD5B7A09F845842100A6DE79623321F703B5 () BRN1WNEXMBX01 ! vcorp ! ad ! vrsn ! com
I am adding freebsd-net to this and will re-summarize to get additional input. Thanks for all of the initial suggestions.
For the benefit of those on freebsd-net@, we are noticing significant lock contention on the V_tcpinfo lock under moderately high connection establishment and teardown rates (around 45-50k connections per second). Our profiling suggests the contention on V_tcpinfo effectively single-threads all TCP connections. Similar testing on Linux with equivalent hardware does not show this contention and achieves a much higher connection establishment rate. We can attach profiling and test details if anyone would like.
JHB recommends:
- He has seen similar results in other kinds of testing.
- Linux uses RCU for the locking on the equivalent table (we've confirmed this to be the case).
- Looking into a lock per bucket on the PCB lookup.
Jeff recommends:
- Changing the lock strategy so the hash lookup can be effectively pushed further down into the stack.
- Making the [list] iterators more complex like those in use in the hash lookup now.
We are starting down these paths to try to break the locking down. We'll post some initial patch ideas soon. Meanwhile, any additional suggestions are certainly welcome.
Finally, I will mention that we have enabled PCBGROUPS in some of our testing with 9.1 and found no change for our particular workload with high connection establishment rates.
Thanks,
Mike
-----Original Message-----
From: Jeff Roberson [mailto:jroberson@jroberson.net]
Sent: Wednesday, May 22, 2013 12:50 AM
To: John Baldwin
Cc: Bentkofsky, Michael; rwatson@freebsd.org; jeff@freebsd.org; Charbon, Julien
Subject: Re: Followup from Verisign after last week's developer summit
On Tue, 21 May 2013, Jeff Roberson wrote:
> On Tue, 21 May 2013, John Baldwin wrote:
>
> > On Monday, May 20, 2013 9:48:02 am Bentkofsky, Michael wrote:
> > > Greetings gentlemen,
> > >
> > > It was a pleasure to meet you all last week at the FreeBSD developer
> > > summit. I would like to thank you for spending the time to discuss
> > > all the wonderful internals of the network stack. We also thoroughly
> > > enjoyed the discussion on receive side scaling.
> > >
> > > I'm sure you will remember both Julien Charbon and me asking
> > > questions regarding the TCP stack implementation, specifically around
> > > the locking internals. I am hoping to follow up with a path forward
> > > so we might be able to enhance the connection rate performance. Our
> > > internal testing has found that the V_tcpinfo lock prevents TCP
> > > scaling under high connection setup and teardown rates. In fact, we
> > > surmise that a new "FIN flood" attack may be possible to degrade
> > > server connections significantly.
> > >
> > > In short, we are interested in changing this locking strategy and
> > > hope to get input from someone with more familiarity with the
> > > implementation. We're willing to be part of the coding effort and are
> > > willing to submit our suggestions to the community. I think we might
> > > just need some occasional input.
> > >
> > > Also, I will point out that our similar testing on Linux shows a
> > > significant performance difference between the two operating systems
> > > on the same multi-core hardware. We're able to drive over 200,000
> > > connections per second on a Linux server compared to fewer than
> > > 50,000 on the FreeBSD server. We have kernel profiling details that
> > > we can share if you'd like.
> >
> > I have seen similar results with a redis cluster at work (we ended up
> > deploying proxies to allow applications to reuse existing connections
> > to avoid this). I believe Linux uses RCU for this table. You could
> > perhaps use an rm lock instead of an rw lock. One idea I considered
> > was to split the pcbhash lock up further so you had one lock per hash
> > bucket, allowing concurrent connection setup/teardown as long as the
> > connections reference different buckets. However, I did not think this
> > would have been useful for the case at work, since those connections
> > were insane (a single-packet request followed by a single-packet
> > reply, with all the setup/teardown overhead) and all going to the same
> > listening socket (so all the setups would hash to the same bucket).
> > Handling concurrent setup on the same listen socket is a PITA but is
> > in fact the common case.
>
> I don't think it's simply a synchronization primitive problem. It
> looks to me like the fundamental issue is that the lock order places
> the table lock before the inp lock, which means we have to grab it very
> early. Presumably this is the classic container -> datastructure,
> datastructure -> container lock-order problem. This seems to be made
> more complex by the fact that the list of all pcbs, the port
> allocation, and parts of the hash are protected by the same lock.
>
> Have we tried to further decompose this lock? I would experiment with
> that as a first step. Is this grabbed in so many places just due to
> the complex lock order issue? That seems to be the case. There are
> only a handful of fields marked as protected by the inp info lock. Do
> we know that this list is complete?
>
> My second step would be to attempt to turn the locking on its head.
> Change the lock order from inp lock to inp info lock. You can resolve
> the lookup problem by adding an atomic reference count that holds the
> datastructure while you drop the hash lock and before you acquire the
> inp lock. Then you could re-validate the inp after lookup. I suspect
> it's not that simple and there are higher level races that you'll
> discover are being serialized by this big lock but that's just a hunch.
>
I read some more. We have already done this lookup/ref/etc. dance for the hash lock. It handles the hard cases of multiple inp_* calls and of synchronizing the ports, bind, connect, etc. It looks like the list locks have been optimized to make the iterators simple. I think this is backwards now. We should make the iterators complex and the normal setup/teardown path simple. The iterators can follow a model like the hash lock, using sentinels to hold their place. We have the same pattern elsewhere. It would allow you to acquire the INP_INFO lock after the INP lock and push it much deeper into the stack.
Jeff
> What do you think Robert? If it would make improving the tcb locking
> simpler it may fall under the umbrella of what Isilon needs but I'm
> not sure that's the case. Certainly my earlier attempts at deferred
> processing were made more complex by this arrangement.
>
> Thanks,
> Jeff
>
> >
> > The best forum for discussing this is probably net@ as there are
> > likely other interested parties who might have additional ideas.
> > Also, it might be interesting to look at how connection groups try to
> > handle this. I believe they use an alternate method of decomposing
> > the global lock into smaller chunks, and I think they might do
> > something to help mitigate the listen socket problem (perhaps they
> > duplicate listen sockets in all groups)? Robert would be able to
> > chime in on that, but I believe he is not really back home until next
> > week.
> >
> > --
> > John Baldwin
> >
>
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"