List:       avro-dev
Subject:    [jira] [Created] (AVRO-1407) NettyTransceiver can cause a infinite loop when slow to connect
From:       "Gareth Davis (JIRA)" <jira () apache ! org>
Date:       2013-11-29 11:01:36
Message-ID: JIRA.12681865.1385722840261.46697.1385722896201 () arcas

Gareth Davis created AVRO-1407:
----------------------------------

             Summary: NettyTransceiver can cause a infinite loop when slow to connect
                 Key: AVRO-1407
                 URL: https://issues.apache.org/jira/browse/AVRO-1407
             Project: Avro
          Issue Type: Bug
          Components: java
    Affects Versions: 1.7.5, 1.7.6
            Reporter: Gareth Davis



When a new {{NettyTransceiver}} is created, it forces the channel to be allocated
and connected to the remote host. It waits up to connectTimeout ms on the
[connect channel future|https://github.com/apache/avro/blob/1579ab1ac95731630af58fc303a07c9bf28541d6/lang/java/ipc/src/main/java/org/apache/avro/ipc/NettyTransceiver.java#L271].
This is obviously a good thing. The trouble is that when the connect is
unsuccessful, i.e. {{!channelFuture.isSuccess()}}, an exception is thrown and the
call to the constructor fails with an {{IOException}}, but this has the potential
to leave an active channel associated with the {{ChannelFactory}}.
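
A minimal sketch of the pattern described above, and of one way to avoid the leak: close the channel before the {{IOException}} leaves the connect path. This is not the Avro code and not the eventual patch. {{SketchChannel}} and {{ConnectSketch}} are hypothetical stand-ins that model the connect future with a plain {{CompletableFuture}}, so that the example runs without Netty.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a Netty channel; not the real org.jboss.netty API.
class SketchChannel {
    private final AtomicBoolean open = new AtomicBoolean(true);
    // Completed by whoever finishes the (possibly very slow) connect.
    final CompletableFuture<Void> connectFuture = new CompletableFuture<>();

    boolean isOpen() {
        return open.get();
    }

    void close() {
        open.set(false);
        connectFuture.cancel(false);
    }
}

class ConnectSketch {
    // Waits up to timeoutMs for the connect to finish. On timeout or failure
    // the channel is closed before the IOException propagates, so no
    // half-connected channel stays associated with a shared factory.
    static SketchChannel connect(SketchChannel channel, long timeoutMs) throws IOException {
        boolean connected = false;
        try {
            channel.connectFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
            connected = true;
            return channel;
        } catch (TimeoutException | ExecutionException e) {
            throw new IOException("Error connecting", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while connecting", e);
        } finally {
            if (!connected) {
                channel.close();
            }
        }
    }

    public static void main(String[] args) {
        SketchChannel slow = new SketchChannel(); // its connect never completes in time
        try {
            connect(slow, 50);
        } catch (IOException expected) {
            System.out.println("connect failed, channel open = " + slow.isOpen());
        }
    }
}
```

Without the {{finally}} block the slow channel would still report itself as open after the exception, which is the situation the report describes.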

The problem is that a Netty {{NioClientSocketChannelFactory}} will not shut down
if there are active channels still around. If you have supplied the
{{ChannelFactory}} to the {{NettyTransceiver}}, you will not be able to cancel it
by calling {{ChannelFactory.releaseExternalResources()}}, as the
[Flume Avro RPC client does|https://github.com/apache/flume/blob/b8cf789b8509b1e5be05dd0b0b16c5d9af9698ae/flume-ng-sdk/src/main/java/org/apache/flume/api/NettyAvroRpcClient.java#L158].
To recreate this you need a very laggy network, where the connect attempt takes
longer than the connect timeout but does eventually succeed. This is very hard to
organise in a test case, although I do have a test setup using Vagrant VMs that
recreates it every time, using the Flume RPC client and server.

The following stack trace is from a production system. The thread never recovers
until the channel is disconnected (by forcing a disconnect at the remote host) or
the JVM is restarted.

{noformat:title=Production stack trace}
"TLOG-0" daemon prio=10 tid=0x00007f581c7be800 nid=0x39a1 waiting on condition [0x00007f57ef9f2000]
  java.lang.Thread.State: TIMED_WAITING (parking)
  at sun.misc.Unsafe.park(Native Method)
  parking to wait for <0x00000007218b16e0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
  at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1253)
  at org.jboss.netty.util.internal.ExecutorUtil.terminate(ExecutorUtil.java:103)
  at org.jboss.netty.channel.socket.nio.AbstractNioWorkerPool.releaseExternalResources(AbstractNioWorkerPool.java:80)
  at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.releaseExternalResources(NioClientSocketChannelFactory.java:181)
  at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:142)
  at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:101)
  at org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:564)
  locked <0x00000006c30ae7b0> (a org.apache.flume.api.NettyAvroRpcClient)
  at org.apache.flume.api.RpcClientFactory.getInstance(RpcClientFactory.java:88)
  at org.apache.flume.api.LoadBalancingRpcClient.createClient(LoadBalancingRpcClient.java:214)
  at org.apache.flume.api.LoadBalancingRpcClient.getClient(LoadBalancingRpcClient.java:205)
  locked <0x00000006a97b18e8> (a org.apache.flume.api.LoadBalancingRpcClient)
  at org.apache.flume.api.LoadBalancingRpcClient.appendBatch(LoadBalancingRpcClient.java:95)
  at com.ean.platform.components.tlog.client.service.AvroRpcEventRouter$1.call(AvroRpcEventRouter.java:45)
  at com.ean.platform.components.tlog.client.service.AvroRpcEventRouter$1.call(AvroRpcEventRouter.java:43)
 {noformat}
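
The trace above is parked in {{ThreadPoolExecutor.awaitTermination}}, called from {{ExecutorUtil.terminate}}. The following is a simplified model of why such a wait may never finish. It is not Netty's implementation. It assumes the worker thread refuses to exit while a channel is still open, and {{ShutdownSketch}}, {{worker}} and {{tryTerminate}} are hypothetical names. Unlike the loop in the trace, {{tryTerminate}} gives up after a bounded number of attempts so that the example ends.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ShutdownSketch {
    // Hypothetical worker: keeps running, swallowing interrupts,
    // until its "channel" is closed (modelled by the latch).
    static Runnable worker(CountDownLatch channelClosed) {
        return () -> {
            for (;;) {
                try {
                    channelClosed.await();
                    return; // channel closed, the worker may exit
                } catch (InterruptedException ignored) {
                    // still serving an open channel, so keep going
                }
            }
        };
    }

    // Polls for termination, giving up after maxAttempts rounds.
    static boolean tryTerminate(ExecutorService pool, int maxAttempts)
            throws InterruptedException {
        for (int i = 0; i < maxAttempts; i++) {
            pool.shutdownNow();
            if (pool.awaitTermination(100, TimeUnit.MILLISECONDS)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch channelClosed = new CountDownLatch(1);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.execute(worker(channelClosed));

        // The leaked channel is still open: termination does not complete.
        System.out.println("terminated with open channel: " + tryTerminate(pool, 3));

        // Closing the channel lets the worker exit and the pool terminate.
        channelClosed.countDown();
        System.out.println("terminated after close: " + tryTerminate(pool, 3));
    }
}
```

In this model the only ways out are the ones the report lists: something closes the channel, or the process is restarted.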

The solution is very simple, and a patch should be along in a moment.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

