[prev in list] [next in list] [prev in thread] [next in thread]
List: activemq-dev
Subject: [jira] Resolved: (AMQ-1855) bridge reconnection stops because of
From: "Gary Tully (JIRA)" <jira () apache ! org>
Date: 2009-08-31 9:07:21
Message-ID: 1689123945.1251709641854.JavaMail.jira () brutus
[Download RAW message or body]
[ https://issues.apache.org/activemq/browse/AMQ-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel \
]
Gary Tully resolved AMQ-1855.
-----------------------------
Resolution: Fixed
though I could not reproduce, the changes in revision \
http://svn.apache.org/viewcvs?view=rev&rev=808890 will fix this issue for both the \
simple discovery and multicast discovery providers.
> bridge reconnection stops because of race in SimpleDiscoveryAgent
> -----------------------------------------------------------------
>
> Key: AMQ-1855
> URL: https://issues.apache.org/activemq/browse/AMQ-1855
> Project: ActiveMQ
> Issue Type: Bug
> Components: Connector
> Affects Versions: 4.1.2
> Reporter: Mario Lukica
> Assignee: Gary Tully
> Fix For: 5.3.0
>
>
> I believe there is a race condition in SimpleDiscoveryAgent which can cause \
> subsequent bridge restart to fail, without starting new thread that should restart \
> a bridge. As a consequence, network bridge is never restarted. Following scenario \
> leads to this: 1. bridge is disconnected (e.g. local error: \
> org.apache.activemq.transport.InactivityIOException: Channel was inactive for too \
> long) 2. bridge is disposed in separate thread in \
> DemandForwardingBridge.serviceLocalException 3. SimpleDiscoveryAgent.serviceFailed \
> is called which starts up another thread which calls \
> DiscoveryNetworkConnector.onServiceAdd which tries to restart bridge 4. bridge \
> startup can cause javax.jms.InvalidClientIDException: Broker: some_broker2 - \
> Client: NC_some_broker1_inboundlocalhost already connected (this one is caused by \
> race condition with thread disposing the bridge, since given client subscription \
> should be removed by thread disposing the bridge (step 2) 5. this causes invocation \
> of DemandForwardingBridge.serviceLocalException (this call can be made \
> asynchronously, while previous bridge startup is still in progress) As a \
> consequence, multiple threads can end up calling SimpleDiscoveryAgent.serviceFailed \
> simultaneously. serviceFailed will call DiscoveryNetworkConnector.onServiceAdd \
> which will try to reconnect bridge. Reconnect logic is guarded by if( \
> event.failed.compareAndSet(false, true) ) which tries to ensure that only a single \
> thread is reconnecting bridge at some point. {code}
> public void serviceFailed(DiscoveryEvent devent) throws IOException {
>
> final SimpleDiscoveryEvent event = (SimpleDiscoveryEvent) devent;
> if( event.failed.compareAndSet(false, true) ) {
>
> listener.onServiceRemove(event);
> Thread thread = new Thread() {
> public void run() {
>
>
> // We detect a failed connection attempt because the service fails right
> // away.
> if( event.connectTime + minConnectTime > System.currentTimeMillis() ) {
>
> event.connectFailures++;
>
> if( maxReconnectAttempts>0 && event.connectFailures >= \
> maxReconnectAttempts ) { // Don' try to re-connect
> return;
> }
>
> synchronized(sleepMutex){
> try{
> if( !running.get() )
> return;
>
> sleepMutex.wait(event.reconnectDelay);
> }catch(InterruptedException ie){
> Thread.currentThread().interrupt();
> return;
> }
> }
>
> if (!useExponentialBackOff) {
> event.reconnectDelay = initialReconnectDelay;
> } else {
> // Exponential increment of reconnect delay.
> event.reconnectDelay*=backOffMultiplier;
> if(event.reconnectDelay>maxReconnectDelay)
> event.reconnectDelay=maxReconnectDelay;
> }
>
> } else {
> event.connectFailures = 0;
> event.reconnectDelay = initialReconnectDelay;
> }
>
> if( !running.get() )
> return;
>
> event.connectTime = System.currentTimeMillis();
> event.failed.set(false);
>
> listener.onServiceAdd(event);
> }
> };
> thread.setDaemon(true);
> thread.start();
> }
> }
> {code}
> Prior to calling DiscoveryNetworkConnector.onServiceAdd, event.failed is set to \
> false (T1), and it's possible for some other thread (T2) to enter block guarded by \
> if( event.failed.compareAndSet(false, true) ) , while reconnect process has already \
> begun by first thread. T2 can satisfy condition: if( event.connectTime + \
> minConnectTime > System.currentTimeMillis() ) and will enter \
> sleepMutex.wait(event.reconnectDelay), but still holding event.failed == true \
> (causing all other calls to serviceFailed not to start thread that will reconnect \
> bridge). If first thread (T1) fails to reconnect bridge (e.g because of \
> InvalidClientIDException described in step 4), it will not schedule new thread to \
> restart broker (and call DiscoveryNetworkConnector.onServiceRemove, and cleanup \
> DiscoveryNetworkConnector.bridges) because of event.failed == true, and T2 still \
> waiting (default 5 sec). When T2 wakes up from wait, it will try to restart broker \
> and fail because of following condition in DiscoveryNetworkConnector: {code}
> if ( bridges.containsKey(uri)
> > > localURI.equals(uri)
> > > (connectionFilter!=null && !connectionFilter.connectTo(uri))
> )
> return;
> {code}
> bridges.containsKey(uri) will be true (thread T1 added it while unsuccessfully \
> trying to reconnect bridge), and T2 will return from \
> DiscoveryNetworkConnector.onServiceAdd and will not start bridge. No additional \
> attempt to reconnect bridge will be made, since T2 held event.failed == true, \
> effectively ignoring SimpleDiscoveryAgent.serviceFailed calls from other threads \
> processing local or remote bridge exceptions. End result:
> - DiscoveryNetworkConnector.bridges contains bridge that is disposed and prevents \
> all other attempts to restart bridge (onServiceAdd always returns because \
> bridges.containsKey(uri) == true)
> - SimpleDiscoveryAgent doesn't try to reconnect the bridge (T2 was a last attempt \
> which returned without restarting the bridge - SimpleDiscoveryAgent.serviceFailed \
> is not called again, since bridge is not started I think that synchronization of \
> threads processing bridge exceptions and entering \
> SimpleDiscoveryAgent.serviceFailed should be verified and/or improved. Also, \
> InvalidClientIDException is relatively common (at least on multicore machines, e.g. \
> Solaris T2000), maybe ConduitBridge.serviceLocalException (which starts another \
> thread doing ServiceSupport.dispose(DemandForwardingBridgeSupport.this)), should be \
> changed to wait a bit for bridge disposal to finish (e.g. sleep for some time) and \
> then try to restart a bridge - waiting for a second more to restart a bridge is \
> better then not to start it at all I've seen this problem in 4.1.0 and 4.1.2, but I \
> think it can occur in 5.1 and 5.2 trunk (SimpleDiscoveryAgent.serviceFailed and \
> DiscoveryNetworkConnector.onServiceAdd are more or less the same, just using \
> ASYNC_TASKS to execute asynchronous calls, instead of starting new threads \
> directly.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic