List:       quagga-dev
Subject:    [quagga-dev 4854] bgpd 'router force open' bug
From:       "Paul H. Anderson" <pha () arbor ! net>
Date:       2007-05-18 19:16:14
Message-ID: Pine.OSX.4.64.0705181514420.3362 () duckman


Describing timing- and state-sensitive issues is tricky, but here goes.

My test case was Quagga 0.99.6 on a fairly recent Red Hat Linux talking to a 
Cisco router configured as a BGP peer to my Quagga bgpd host.

In some situations, ISPs drop inbound SYN packets for bgpd in order to help 
protect internal core routers.  So the Quagga host isn't allowed to establish a 
peering connection to the router, but the router is able to establish one with 
the Quagga host.

When the Quagga host bgpd is on the outside of that firewall, bgpd will never 
establish a connection with routers inside the firewall that refuse connections 
(or where the firewall filters incoming SYN packets on the bgpd port).

The sequence goes something like this:

1) Quagga host bgpd sends a SYN packet for the peer TCP connection to the 
router.

2) a firewall intended to protect the router drops this packet (or the router 
is configured to ignore the request)

3) Quagga host bgpd waits a long while for the TCP stack to time out, all the 
while bgpd has the peering connection in the 'Connect' state.

4) the router attempts to establish a connection, is able to do so, and sends an 
OPEN message.

5) Quagga host bgpd sees the OPEN request, notes that the peer connection is in 
the 'Connect' state, and tells the router to close its connection.

6) this process continues indefinitely.  For some reason, Quagga bgpd either

   a) never gets to 'Active' state or

   b) the timers aren't being set correctly, so that the state machine is
      immediately going back into 'Connect' state (i.e. immediately
      re-trying when the TCP connect times out), or

   c) it is sensitive to how the TCP connection fails.  When I used
      iptables to return an ICMP 'connection refused' then the Quagga bgpd
      would go to the 'Active' state, and the router was then able to
      establish the connection.  If the SYN packet is dropped, Quagga
      bgpd appears to stay in 'Connect' state indefinitely, preventing
      the router from establishing a connection.
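To make case (c) concrete, here is a minimal sketch of how the two failure modes look to a program on a POSIX system. A rejected attempt is normally reported as ECONNREFUSED almost immediately, while a silently dropped SYN only surfaces as ETIMEDOUT once the kernel gives up retransmitting. The enum and function names are hypothetical and are not taken from the Quagga source; how bgpd actually maps these errors to state transitions is not shown here.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical labels for the outcome of a TCP connect attempt. */
enum connect_outcome {
    CONNECT_OK,        /* handshake completed */
    CONNECT_REFUSED,   /* peer or firewall actively rejected the SYN */
    CONNECT_TIMED_OUT, /* SYN was dropped; the kernel gave up retrying */
    CONNECT_OTHER      /* any other error */
};

/* Classify an errno value obtained from connect(), or from
 * getsockopt(SO_ERROR) after a non-blocking connect finishes. */
static enum connect_outcome classify_connect_errno(int err)
{
    switch (err) {
    case 0:
        return CONNECT_OK;
    case ECONNREFUSED:
        return CONNECT_REFUSED;
    case ETIMEDOUT:
        return CONNECT_TIMED_OUT;
    default:
        return CONNECT_OTHER;
    }
}

/* Small self-check that doubles as a usage example. */
static void check_classification(void)
{
    assert(classify_connect_errno(ECONNREFUSED) == CONNECT_REFUSED);
    assert(classify_connect_errno(ETIMEDOUT) == CONNECT_TIMED_OUT);
}
```

The refused case fails fast, which fits the observation that bgpd then reaches 'Active'. The timed-out case leaves the attempt pending for as long as the TCP stack keeps retrying.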

Cisco IOS 12.4 now has a new command to force this behaviour: "neighbor 
192.168.1.2 transport connection-mode [active|passive]", which allows 
configuring the router so it will initiate the peering connection and not 
accept externally initiated connections.  I happen to be simulating this with 
Linux iptables rules that drop the outbound SYN packet.

I can't tell if this is a timer problem (too short a timeout, or none at all, 
before moving from "Active" back to "Connect" after a failed connection 
attempt), or if it is an RFC collision-related state machine issue.

It appears to me that the collision resolution discussion in RFC 1771 section 
6.8 covers the "OpenSent", "OpenConfirm" and "Active" states, but not the 
"Connect" state.

In bgp_packet.c, the /* Hack part. */ code basically closes the opposite 
peering attempt when the Quagga host is in the "Connect" state, which can last 
for a potentially long time if the SYN packet is dropped.

The fix I applied is to modify bgpd/bgp_packet.c in Quagga 0.99.6 as follows:

--- bgp_packet.c        (revision 851)
+++ bgp_packet.c        (working copy)
@@ -1240,10 +1240,13 @@
           SET_FLAG (realpeer->sflags, PEER_STATUS_NSF_WAIT);
         }
         else if (ret == 0 && realpeer->status != Active
+                && realpeer->status != Connect
                  && realpeer->status != OpenSent
                  && realpeer->status != OpenConfirm)

         {
+         /* close connection - pre-existing realpeer connection is good enough */
+         /* note the conditions for "good enough" now exclude the Connect state */
           if (BGP_DEBUG (events, EVENTS))
             zlog_debug ("%s peer status is %s close connection",
                         realpeer->host, LOOKUP (bgp_status_msg,

I don't know if this is the correct fix.  However, it does solve the problem we 
face here in our lab testing.
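The effect of the one-line change can be checked in isolation with a small model. This is only a sketch: the enum below is a hypothetical stand-in for bgpd's peer status values, and the `ret == 0` part of the real condition is assumed to hold and is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for bgpd's peer status values; the names follow
 * the states discussed above but are not copied from the Quagga headers. */
enum peer_status {
    ST_IDLE, ST_CONNECT, ST_ACTIVE, ST_OPENSENT, ST_OPENCONFIRM,
    ST_ESTABLISHED
};

/* Original test: the incoming connection is closed unless the existing
 * peer is in Active, OpenSent or OpenConfirm. */
static bool closes_incoming_before(enum peer_status s)
{
    return s != ST_ACTIVE && s != ST_OPENSENT && s != ST_OPENCONFIRM;
}

/* Patched test: Connect joins the states that let the incoming
 * connection proceed. */
static bool closes_incoming_after(enum peer_status s)
{
    return closes_incoming_before(s) && s != ST_CONNECT;
}

/* Count the states for which the two rules give different answers. */
static int count_changed_states(void)
{
    int changed = 0;
    for (int s = ST_IDLE; s <= ST_ESTABLISHED; s++) {
        if (closes_incoming_before((enum peer_status)s) !=
            closes_incoming_after((enum peer_status)s))
            changed++;
    }
    return changed;
}

/* Self-check: only the Connect state changes behaviour. */
static void check_patch_effect(void)
{
    assert(closes_incoming_before(ST_CONNECT));
    assert(!closes_incoming_after(ST_CONNECT));
    assert(count_changed_states() == 1);
}
```

Under this model the patch changes the outcome for 'Connect' only, so a router-initiated connection that arrives while the outbound SYN is still pending is no longer closed.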

As I look over the possible reasons, I now suspect that timers may play a 
role - that is, Quagga bgpd may need to put the peering connection into the 
'Active' state for some longer period of time.

Please let me know if you would like any additional information.

Thank you!

Paul Anderson
SW developer, Arbor Networks