[prev in list] [next in list] [prev in thread] [next in thread] 

List:       openjdk-serviceability-dev
Subject:    Re: Review request: 8049303: Transient network problems cause JMX thread to fail silenty
From:       shanliang <shanliang.jiang () oracle ! com>
Date:       2014-09-10 17:18:09
Message-ID: 541087D1.9080109 () oracle ! com
[Download RAW message or body]

Daniel Fuchs wrote:
> Hi,
>
> Thanks for the detailed explanations.
>
> The fact that the server doesn't store any client state
> and can arbitrarily close the connection, leaving it to
> the client to reestablish the connection, makes all this
> code quite tricky and hard to follow.
Yes it is complicated, we allowed a server to close a client after a 
specific timeout because in some case a client was dead but the server 
needed long long time to be informed, that could make memory problem.
>
> I believe what you propose - making sure that the
> notification thread doesn't stop without closing the
> connection, is the right thing to do.
>
> I wonder however if the code that closes the connection
> should better be moved to ClientNotifForwarder::fetchNotifs?
>
> ClientNotifForwarder::fetchNotifs has the following statement:
>
> 601         if (!shouldStop()) {
> 602             logger.error("NotifFetcher-run",
> 603                          "Failed to fetch notification, " +
> 604                          "stopping thread. Error is: " + ioe, ioe);
> 605             logger.debug("NotifFetcher-run",ioe);
> 607         }
>
> Then it proceeds to return null, which causes the thread to die.
ClientNotifForwarder is an abstract super class and does not know how to 
close a connection, this class is extended by 
RMIConnector.RMINotifClient and JMXMP, if we modify the class to have 
connection reference, that might make problem for JMXMP.

>
> It looks as if that's the place where the connection should ideally
> be closed if it is not already closed, because it would ensure
> that the thread never dies silently.
>
> Otherwise I'd suggest improving the comment below:
>
> 1369                             // JDK-8049303
> 1370                             // possible again transient or a
> 1371                             // non-deserialization exception, not 
> know how
> 1372                             // to treat, close the connection
>
> May I suggest something like:
>
> // JDK-8049303
> // We received an IOException - but the communicatorAdmin
> // did not close the connection - possibly because the
> // the original exception was raised by a transient network
> // problem?
> // We already know that this exception is not due to a deserialization
> // issue as we already took care of that before involving the
> // communicatorAdmin. Moreover - we already made one retry attempt
> // at fetching the next batch of notifications - and the
> // problem persisted.
> // Since trying again doesn't seem to solve the issue, we will now
> // close the connection. Doing otherwise might cause the
> // NotifFetcher thread to die silently.
Yes more clear, here is the new webrev:

http://cr.openjdk.java.net/~sjiang/JDK-8049303/01/

Thanks Daniel!
Shanliang
>
> best regards,
>
> -- daniel
>
> On 9/10/14 5:42 PM, shanliang wrote:
>> Hi,
>>
>> The issue could happen like this:
>> 1) RMIConnector.RMINotifClient.fetchNotifs got an IOException
>> 2) communicatorAdmin.gotIOException(ioe) was called to check the
>> connection, it did not close the connection because the connection was
>> now OK.
>> 3) RMIConnector.RMINotifClient.fetchNotifs analyzed the original
>> exception and found it was not a dersialization exception, it re-threw
>> the original IOException
>> 4) the caller ClientNotifForwarder did not know how to treat this
>> exception, decided to end silently.
>>
>> The fix is to modify RMIConnector.RMINotifClient.fetchNotifs:
>>
>> if the fetchNotifs request gets an IOException, we examine the chain of
>> exceptions to determine whether this is a deserialization issue. If so -
>> we propagate the appropriate exception to the caller, who will then
>> proceed with fetching notifications one by one, otherwise we call
>> communicatorAdmin.gotIOException(ioe), there are 2 kinds of response:
>>     1) the call returns OK, means the connection is re-established, we
>> re-call the fetchNotifs;
>>     2) the call throws IOException, we check the connection status:
>>         2-1) "terminated", that means the connection is closed, we
>> re-throw the original IOException, the caller will end silently.
>>         2-2) not "terminated", we add a flag "retried" for this
>> situation, if the flag is false, we set the flag to true and re-do the
>> fetchNotifs request, this is useful for a transient network problem,
>> otherwise we close the connection and re-throw the original IOException,
>> it is here we fix the bug.
>>
>> We do not modify communicatorAdmin.gotIOException(ioe), it is called too
>> by all other remote requests.
>>
>> It is not easy to have a test reproducing the bug.
>>
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8049303
>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8049303/00/
>> <http://cr.openjdk.java.net/%7Esjiang/JDK-8049303/00/>
>>
>> Thanks,
>> Shanliang
>

[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic