
List:       linux-smp
Subject:    Detailed report on SMP-build lockups [seems that it is locking problem in networking code] (2.4.0-te
From:       Alexander Demenshin <aldem-nf () aldem ! net>
Date:       2000-07-11 10:50:32

Hello folks,

	So here it is... I spent nearly 20 hours non-stop tracking this down
	and testing under different circumstances, and finally found
	_where_ it is, but (unfortunately) not _why_...
	
	Conditions to reproduce:
	
		- Very heavy traffic load (> 8000 pkts/sec);
		- Netfilter (built as modules or statically)
		- ip_queue and QUEUE target(s) in effect
		  (it is the only essential part of Netfilter for this case)
		- A user-space module which ACCEPTs all queued packets;
		- Traffic generator used on _local_ interface:
		
			> A lot of fragmented packets:
			
				ifconfig lo mtu 256
				ping -f -s 8192 127.0.0.1
				
			> A lot of TCP traffic (connect/transfer/disconnect);
			> MTU does not matter.
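
	To give a feeling for how much the fragmented-ping case multiplies
	the packet count, here is a rough sketch of the arithmetic. It
	assumes a 20-byte IP header without options and an 8-byte ICMP
	header; the function name is made up for illustration and is not
	part of any tool mentioned in this report.

```c
#include <assert.h>

enum { IP_HEADER_LEN = 20, ICMP_HEADER_LEN = 8 };

/*
 * Estimate how many IP fragments one ICMP echo request with `payload`
 * data bytes produces on a link with the given MTU.
 * Assumption: IP header without options (20 bytes).
 */
unsigned fragments_per_ping(unsigned payload, unsigned mtu)
{
    /* The MTU must leave room for at least one 8-byte fragment unit. */
    assert(mtu >= IP_HEADER_LEN + 8);

    unsigned total = payload + ICMP_HEADER_LEN;  /* bytes carried by IP */
    unsigned room = mtu - IP_HEADER_LEN;         /* bytes per packet    */

    if (total <= room)
        return 1;                                /* fits unfragmented   */

    /* Every fragment but the last carries a multiple of 8 bytes. */
    unsigned per_frag = room & ~7u;

    return (total + per_frag - 1) / per_frag;    /* round up            */
}
```

	Under these assumptions fragments_per_ping(8192, 256) gives 36, so
	every "ping -s 8192" packet with "mtu 256" on lo becomes 36 fragments
	in each direction.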
			
	In my tests I used the following rules for iptables:
	
		iptables -t mangle -A PREROUTING -j QUEUE
		iptables -t mangle -A OUTPUT     -j QUEUE
		
	I assume there are no other rules, but the problem occurs _only_
	when a QUEUE target is in effect. Other rules do not matter, and the
	lockup does not occur if there are no QUEUE targets or if the packets
	are not accepted in userspace. It also occurs if I use the 'filter'
	table, so there is nothing magical about the 'mangle' table.
	
	So, with the rules above in effect and the userspace module running,
	the system locks up after the traffic generator has been running for
	a while (in my case after ca. 300K packets have been processed, but
	it varies, so be patient :).
	
	No Oops, no other kernel messages, _nothing_; only SysRq still responds.
	
	Examining the code under EIP shows that the lockup occurs at:
	
		- In case of TCP traffic:
		
			src/net/ipv4/tcp_timer.c:690
			
--- src/net/ipv4/tcp_timer.c:690 tcp_synack_timer() ---
                                /* Drop this request */
                                write_lock(&tp->syn_wait_lock);		/* <<< AT THIS PLACE */
                                *reqp = req->dl_next;
                                write_unlock(&tp->syn_wait_lock);

--- CUT ---

		- In case of ICMP (fragmented) traffic:
		
--- src/net/ipv4/ip_fragment.c:202 ip_expire() ---
        spin_lock(&ipfrag_lock);					/* <<< AT THIS PLACE */
        if(!qp->fragments)
        {       
#ifdef IP_EXPIRE_DEBUG 
                printk("warning: possible ip-expire attack\n");
#endif
                goto out;
        }

--- CUT ---

	Again, the problem occurs _only_ on an SMP build, and it does not
	matter whether it runs on an SMP box or not (it occurs on both the
	single-CPU and the 2-CPU box).
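
	One reason a locking bug can show up only on an SMP build, even on
	a single CPU: on a non-SMP 2.4 build spin_lock() compiles to
	(almost) nothing, while on an SMP build it really spins. The toy
	userspace sketch below illustrates only that difference. All names
	in it are made up, the "lock" is not atomic, and the double
	acquisition is just an example of a locking error; it is not a
	claim about what actually goes wrong in the kernel here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock for a single-threaded demo; NOT atomic, NOT kernel code. */
typedef struct { volatile int locked; } toy_lock_t;

/*
 * SMP-build style: spin until the lock is free. The spin is bounded by
 * max_spins so the demo terminates; a real spin_lock() would spin
 * forever. Returns true if the lock was acquired.
 */
bool toy_lock_smp(toy_lock_t *l, unsigned max_spins)
{
    for (unsigned i = 0; i <= max_spins; i++) {
        if (!l->locked) {
            l->locked = 1;
            return true;
        }
    }
    return false;   /* this is where a real SMP build would hang */
}

/* Non-SMP-build style: the lock is compiled away and always "succeeds". */
bool toy_lock_up(toy_lock_t *l, unsigned max_spins)
{
    (void)l;
    (void)max_spins;
    return true;
}

/* Take the same lock twice without releasing it, as a buggy path might. */
bool double_acquire(bool (*lock)(toy_lock_t *, unsigned))
{
    assert(lock);
    toy_lock_t l = { 0 };
    bool first = lock(&l, 1000);
    bool second = lock(&l, 1000);
    return first && second;
}
```

	Here double_acquire(toy_lock_up) returns true (the error goes
	unnoticed), while double_acquire(toy_lock_smp) returns false, which
	stands for the hang seen with a real spinlock.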
	
	During the tests there is no other network or disk activity
	apart from the usual sleeping daemons such as getty.
	
	The problem still persists in 2.4.0-test3-pre8.
	
	I have no idea where to look now, or even whether it is related to
	netfilter itself (or ip_queue)... Logically, it has to be related in
	some way to ip_queue (or at least nf_reinject()), because it occurs
	only when a QUEUE target is active and packets are reinjected. It
	_does not_ occur when there is no QUEUE target or if packets are
	_not_ accepted.
	
	But (there is always a but) nf_reinject() and ip_queue themselves make
	a lot of calls into related networking parts, including the netlink code.
	
	Hardware used for tests:
	
		- Single-CPU ASUS board:
		
			PII-350 Deschutes (512K L2 Cache)
			256M RAM
			
		- Double-CPU ASUS board:
		
			PIII-550 Coppermine (256K L2 Cache)
			256M RAM
			
	(I am not sure whether this matters; perhaps only the CPU frequency
	does, since it determines whether enough traffic can be generated).
	
	Software:
	
		- kernel 2.4.0-test2-ac2 (and later 2.4.0-test3-pre8)
		- iptables 1.1.0
		- ab ("apache bench" from apache distribution; used
		  as traffic generator: ab -n 1000000 http://localhost/)
		- any fast enough web server (to run "ab" at)
		- ping -f (for ICMP fragmented traffic).
	
	I suspect that the problem will never occur under normal circumstances,
	but it _does exist_, so it _may_ occur (sooner or later).
	
	Here is the link to my demo userspace ip_queue handling program (which
	can be used for testing if someone wants to hunt this bug):
	
		http://aldem.net/netfilter/
		
	That's all. Any ideas? Suggestions? Comments? (Flames?) :)
	
	(I am only on the netfilter list, so please CC me if any additional
	information is needed).

	Good luck!
	
/Al
-
Linux SMP list: FIRST see FAQ at http://www.irisa.fr/prive/dmentre/smp-howto/
To Unsubscribe: send "unsubscribe linux-smp" to majordomo@vger.rutgers.edu
