List:       linux-nfs
Subject:    [NFS] Re: [PATCH] resend: knfsd multiple UDP sockets
From:       Greg Banks <gnb () sgi ! com>
Date:       2004-05-28 7:42:32
Message-ID: 20040528074232.GC9014 () sgi ! com

On Fri, May 28, 2004 at 03:14:10PM +1000, Neil Brown wrote:
> I have two concerns, that can possibly be allayed.
> 
> Firstly, the reply to any request is going to go out the socket that
> the request came in on.  However some network configurations can have
> requests arrive on one interface that need to be replied to on a
> different interface(*).  Does setting sk_bound_dev_if cause packets sent
> to go out that interface (I'm not familiar enough with the networking
> code to be sure)?  If it does, then this will cause problems in those
> network configurations.

As near as I can tell from inspecting the (amazingly convoluted)
routing code, and without running a test, the output routing algorithm
is short-circuited when the socket is bound to a device, and will
send the packet out the bound interface regardless.  This comment
by Alexei K in ip_route_output_slow() explains:

> [...] When [output interface] is specified, routing
> tables are looked up with only one purpose:
> to catch if destination is gatewayed, rather than
> direct. 

So, yes, we could have a problem if the routing is asymmetric.
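
For illustration only, here's a rough userspace sketch of the same
mechanism (not part of the patch; the device name is made up).
SO_BINDTODEVICE is what sets sk->sk_bound_dev_if inside the kernel,
and once it is set, sends from that socket leave through the bound
interface no matter what the routing table says:

/* Rough userspace sketch, not the patch itself: bind a UDP socket
 * to one interface via SO_BINDTODEVICE.  "eth1" is just an example
 * name, and this needs CAP_NET_RAW (i.e. root) to succeed. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <net/if.h>

int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct ifreq ifr;

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth1", IFNAMSIZ - 1);

        if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
                       &ifr, sizeof(ifr)) < 0) {
                perror("SO_BINDTODEVICE");
                return 1;
        }

        /* any sendto() on fd now leaves via eth1, regardless of
         * what the routing table says about the destination */
        return 0;
}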

A similar issue (which I haven't tested either) is what happens
with virtual interfaces like tunnels or bonding.

> Secondly, while the need for multiple udp sockets is clear,

Agreed.

> it isn't
> clear that they should be per-device.
> Other options are per-CPU and per-thread.  Alternately there could
> simply be a pool of free sockets (or a per-cpu pool).

Ok, there are a number of separate issues here.

Firstly, my performance numbers show that the current serialisation of
svc_sendto() gives about 1.5 NICs' worth of performance per socket, so
whatever method governs the number of sockets needs to ensure that
the number of sockets grows at least as fast as the number of NICs.
On at least some of the SGI hardware the number of NICs can grow
faster than the number of CPUs.  For example, the minimal Altix
3000 has 4 CPUs and can have 10 NICs.  Similarly, a good building
block for scalable NFS servers is an Altix 350 with 2 CPUs and 4 NICs
(we can't do this yet due to tg3 driver limitations).  So this makes
per-CPU sockets unattractive.

Secondly, for the traffic levels I'm trying to reach I need lots of
nfsd threads.  I haven't done the testing to find the exact number,
but it's somewhere above 32.  I run with 128 threads to make
sure.  If we had per-thread sockets that would be a *lot* of sockets.
Under many loads an nfsd thread spends most of its time waiting for
disk IO, and having its own socket would just be wasteful.

Thirdly, where there are enough CPUs I see significant performance
advantage when each NIC directs irqs to a separate dedicated CPU.
In this case having one socket per NIC will mean all the cachelines
for that socket will tend to stay on the interrupt CPU (currently this
doesn't happen because of the way the tg3 driver handles interrupts,
but that will change).

What all of the above means is that I think having one socket per
NIC is very close to the right scaling ratio.

What I'm not sure about is the precise way in which multiple
sockets should be achieved.  Using device-bound sockets just seemed
like a really easy way (read: no changes to the network stack) to get
exactly the right scaling.  Having a global (or per NUMA node, say)
pool of sockets which scaled by the number of NICs would be fine too,
assuming it could be made to work and handle the routing corner cases.

> Having multiple sockets that are not bound differently is not
> currently possible without setting sk_reuse, and doing this allows
> user programs to steal NFS requests.

Yes, this was another good thing about using device-bound sockets:
it doesn't open that security hole.
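
Just to spell out the hole for the archives (a hypothetical sketch,
not something I've tested): if the server's UDP socket had sk_reuse
set, an ordinary unprivileged program could also bind port 2049 with
SO_REUSEADDR and end up receiving datagrams that were meant for knfsd:

/* Hypothetical sketch of the sk_reuse problem: this bind only
 * succeeds if the server's socket was also created with
 * SO_REUSEADDR, which is exactly why we don't want to set it. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int on = 1;
        struct sockaddr_in sin;

        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(2049);             /* NFS */
        sin.sin_addr.s_addr = htonl(INADDR_ANY);

        if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
                perror("bind");
                return 1;
        }

        /* recvfrom(fd, ...) can now see requests that clients
         * intended for the NFS server */
        return 0;
}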

> However if we could create a socket that was bound to an address only
> for sending and not for receiving, we could use a pool of such sockets
> for sending.

Interesting idea.  Alternatively, we could use the single UDP socket
for receive as now, and a pool of connected UDP sockets for sending.
That should work without needing to modify the network stack.
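
Roughly what I have in mind, sketched in userspace terms (the real
thing would live in svcsock.c; the function name, address and port
below are made up): connect() on a datagram socket pins the peer, so
the reply path can send() on a socket of its own while the existing
bound socket keeps doing all the receiving:

/* Userspace sketch only; error handling is hand-waved.  A pool of
 * sockets like this could spread sends across several sockets
 * instead of funnelling them all through the single receive socket. */
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

static int make_reply_sock(const char *client_ip, unsigned short client_port)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in peer;

        memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_port = htons(client_port);
        inet_pton(AF_INET, client_ip, &peer.sin_addr);

        /* after this, send(fd, buf, len, 0) always goes to that
         * client, and the route is chosen per-socket */
        if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0)
                return -1;
        return fd;
}

int main(void)
{
        /* hypothetical client address, for illustration only */
        int fd = make_reply_sock("192.168.1.10", 700);

        return fd < 0;
}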

> This might be as easy as adding a "sk_norecv" field to struct sock,
> and skipping the sk->sk_prot->get_port call in inet_bind if it is set.
> Then all incoming requests would arrive on the one udp socket (there
> is very little contention between incoming packets on the one socket),

Sure, the limit is the send path.

> and replies could go out one of the sk_norecv ports.

Aha.

> Does this make sense?  Would you be willing to try it?

I think it's an intriguing idea and I'll try it as soon as I can.

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.

