
List:       tru64-unix-managers
Subject:    SUMMARY: Tru64 server can't handle 900 network clients
From:       Ole Holm Nielsen <Ole.H.Nielsen () fysik ! dtu ! dk>
Date:       2005-12-22 8:36:42
Message-ID: 43AA659A.8000303 () fysik ! dtu ! dk

This is an old question, but anyone with more than 512 machines on the
local network needs to know how to increase the Ethernet ARP cache size
in Tru64 UNIX.  I received a resolution to the problem from an HP Denmark
consultant:

You need to look at, and possibly increase, the Tru64 kernel's internal
variable "arpqmaxlen", which unfortunately cannot be set through the
usual /etc/sysconfigtab method.  This variable sets the maximum number
of Ethernet MAC addresses kept in the cache, and should be somewhat
larger than 2 times the number of nodes on your network.  The kernel
variables related to the ARP cache are defined in
/usr/sys/include/netinet/inet_config.h.
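
As a rough illustration of the sizing rule above, here is a minimal
Python sketch.  The function name and the rounding up to the next power
of two are my own assumptions for illustration; the only requirement
stated by the consultant is "somewhat larger than 2 times the number of
nodes".

```python
def suggested_arpqmaxlen(nodes: int, factor: int = 2) -> int:
    """Return the smallest power of two strictly greater than factor * nodes.

    Rounding to a power of two is an illustrative choice, not a
    documented Tru64 requirement.
    """
    if nodes < 1:
        raise ValueError("nodes must be a positive integer")
    target = factor * nodes
    size = 1
    while size <= target:
        size *= 2
    return size


if __name__ == "__main__":
    # Node counts chosen to resemble the network described in this thread.
    for n in (450, 950):
        print(n, "nodes ->", suggested_arpqmaxlen(n))
```

With about 950 nodes this gives 2048, which happens to match the value
used in the dbx commands below.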

To display the "arpqmaxlen" value use /usr/bin/dbx on the kernel:
    # dbx -k /vmunix
    (dbx) p arpqmaxlen
    1024
To assign a new value until next reboot:
    (dbx) assign arpqmaxlen = 2048
To assign a new value permanently in /vmunix:
    (dbx) patch arpqmaxlen = 2048
Then exit dbx with the "quit" command.  If a new kernel gets installed,
for example by installing a new Patch Kit, you will need to modify
/vmunix again as described.

We've been running a local network with about 950 nodes without ARP
cache problems for over a year now, so this solution seems to be well
tested.

Additional note in case anyone is interested:
On Linux hosts the equivalent change can be made at boot time via the
/etc/sysctl.conf file (Red Hat RHEL4 with kernel 2.6.9):

# Don't allow the arp table to become bigger than this
net.ipv4.neigh.default.gc_thresh3 = 4096
# Tell the gc when to become aggressive with arp table cleaning.
# Adjust this based on size of the LAN.
net.ipv4.neigh.default.gc_thresh2 = 2048
# Below this number of entries the gc will leave the arp table alone
net.ipv4.neigh.default.gc_thresh1 = 1024
# Interval in seconds between arp table gc runs
net.ipv4.neigh.default.gc_interval = 3600
# ARP cache entry timeout
net.ipv4.neigh.default.gc_stale_time = 3600
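
To see how close a Linux host is to these thresholds, something like
the following Python sketch may help.  It assumes the usual Linux
locations /proc/net/arp (one header line, then one line per IPv4 entry)
and /proc/sys/net/ipv4/neigh/default/gc_thresh*; the function names and
the wording of the status strings are hypothetical, made up for this
example.

```python
from pathlib import Path

# Standard Linux procfs locations (assumed present; Linux only).
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")
ARP_TABLE = Path("/proc/net/arp")


def count_arp_entries(arp_text: str) -> int:
    """Count entries in /proc/net/arp style text, skipping the header line."""
    lines = [ln for ln in arp_text.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)


def classify(entries: int, thresh1: int, thresh2: int, thresh3: int) -> str:
    """Describe where the entry count sits relative to the gc thresholds."""
    if entries >= thresh3:
        return "at hard limit (gc_thresh3)"
    if entries > thresh2:
        return "above soft limit (gc_thresh2)"
    if entries >= thresh1:
        return "gc active (at or above gc_thresh1)"
    return "below gc_thresh1"


def read_int(path: Path) -> int:
    """Read a single integer from a procfs file."""
    return int(path.read_text().strip())


if __name__ == "__main__":
    try:
        entries = count_arp_entries(ARP_TABLE.read_text())
        t1, t2, t3 = (read_int(NEIGH_DIR / f"gc_thresh{i}") for i in (1, 2, 3))
    except (OSError, ValueError) as exc:
        print("could not read ARP information:", exc)
    else:
        print(f"{entries} entries; thresholds {t1}/{t2}/{t3}:",
              classify(entries, t1, t2, t3))
```

Note that /proc/net/arp also lists incomplete entries, so the count is
an upper bound on the number of resolved MAC addresses.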


Ole Holm Nielsen wrote:
> I'm stumped by an apparent limit in the Tru64 UNIX kernel (v5.1A pk6) to
> handle client node MAC-addresses for close to 1000 NFS clients.
> We expanded our Linux cluster to 900+ nodes, and suddenly the
> Tru64 UNIX NFS file-server randomly loses network communication with
> many (or most) of the new nodes.  A "ping" doesn't work at either end of
> the server-client connection.  Communication between Linux servers and
> nodes works perfectly, however, so we do not believe there to be a
> problem with the network setup.
> 
> What happens is, I believe, "ARP cache thrashing":  The Tru64 kernel
> apparently can't cope with close to 1000 MAC-addresses simultaneously
> because a fixed-size ARP cache fills up, and the kernel starts deleting
> MAC-addresses from the ARP cache randomly.  See "man 7 arp"
> on a Linux box about the cache.  On the Linux boxes we solve the ARP
> cache problem by loading a static cache from the /etc/ethers file, but
> on Tru64 UNIX this causes a dead-sure communications failure :-(
> 
> Browsing the Tru64 UNIX manuals and the "dxkerneltuner" tool, I haven't
> been able to find any kernel parameter which may increase the maximum
> size of the ARP cache.  Can anyone help?
> Note: The 900 nodes are divided about equally between two Gigabit
> interfaces on the Tru64 UNIX server.

-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark