List:       hadoop-user
Subject:    problem with completion notification from block movement
From:       Karl Kleinpaste <karl () conviva ! com>
Date:       2009-01-30 16:59:59
Message-ID: 1233334799.5164.80.camel () awol ! kleinpaste ! org

We have a small test cluster running 0.18.1: one master host doubling as
NameNode and JobTracker, plus 2 slaves.  We are seeing an intermittent
problem where our application logs failures out of DFSClient, like this:

2009-01-30 01:59:42,072 WARN org.apache.hadoop.dfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_7603130349014268849_2349933
java.net.SocketTimeoutException: 66000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.10.102:54700 remote=/10.0.10.108:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:162)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
        at java.io.DataInputStream.readFully(DataInputStream.java:176)
        at java.io.DataInputStream.readLong(DataInputStream.java:380)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2044)

(Apologies for paste formatting.  I hate Evolution.)
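
For what it's worth, 66000 ms looks like the client's default 60 s socket
read timeout plus a small per-datanode extension for the two-node
pipeline.  If the DataNode really is just slow to ack, one thing we may
try is simply giving the client longer to wait.  A minimal sketch,
assuming the 0.18-era "dfs.socket.timeout" key (milliseconds) is what
DFSClient consults for this wait; treat the key name and its effect as
assumptions, not gospel:

// Hypothetical client-side tweak, not our production code: raise the DFS
// socket read timeout so the write pipeline gets more than ~66 s to ack.
// The "dfs.socket.timeout" key and its effect here are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SlowAckWorkaround {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.socket.timeout", 180 * 1000);   // 3 minutes, in ms
    FileSystem fs = FileSystem.get(conf);
    // Writes made through this FileSystem instance should pick up the
    // longer timeout while waiting on pipeline acks.
    fs.copyFromLocalFile(new Path("/tmp/job.jar"),
                         new Path("/user/test/job.jar"));
    fs.close();
  }
}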

Our application here is our "JobConsole," which takes notifications from
an external data-generating application: the external app scribbles
files into DFS and then tells JobConsole about them, and JobConsole
responds by submitting jobs to crunch that data.  JobConsole runs on the
master node.
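
In code terms, that handoff amounts to something like the sketch below.
The class and method names are invented for illustration and the paths
are placeholders; the JobConf/JobClient calls are just the stock 0.18
mapred API:

// Illustrative sketch only: the external app has already written dfsPath
// into DFS, and JobConsole submits a job over it when notified.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class JobConsoleSketch {
  // Called when the external app notifies us that dfsPath is ready.
  public static void onDataAvailable(String dfsPath) throws Exception {
    JobConf job = new JobConf(JobConsoleSketch.class);
    job.setJobName("crunch " + dfsPath);
    FileInputFormat.setInputPaths(job, new Path(dfsPath));
    FileOutputFormat.setOutputPath(job, new Path(dfsPath + ".out"));
    // Submission is the step that copies job.jar into the DFS system
    // directory; that copy is the block we trace below.
    new JobClient(job).submitJob(job);
  }
}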

Chasing that block identifier through our JobConsole log plus the
DataNode logs on the slaves, we get this odd timeline:
01:58:32	slave (.108, above): receiving blk from master (.102)
01:58:35	other slave (.107): receiving blk from .108
01:58:36	.107: received blk
01:58:38	.107: terminate PacketResponder
01:59:42	JobConsole (.102): 66 s timeout + Error Recovery (the exception above)
01:59:42	.107: invoke recoverBlock on that blk
02:01:15	.108: received blk + terminate PacketResponder
03:03:24	.108: deleting blk (log gives its Linux pathname in DFS storage)

What's clear from this is that .108 got the block quickly, because it
was immediately in a position to send a copy to .107, which responded
promptly to say it had the block.  But .108's DataNode then sat on the
block for nearly three minutes before logging what appears to have been
ordinary completion and responder termination.  A minute-plus into that
long silence, JobConsole gave up and ran a recovery operation, which
appears to have worked.  If .108's DataNode sent a notification when it
finally logged completed reception, no doubt nobody was listening for it
any more.
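
To make the failure mode concrete: as we read the stack trace, the
client is just blocked in a timed read on the ack stream from .108, and
when nothing arrives within 66 s it bails into recovery.  A stand-alone
toy that mimics that wait (plain java.net, not the actual DFSClient
code; the host, port, and timeout are the ones from the log above):

import java.io.DataInputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Toy model of the ack wait, not Hadoop source: connect to the first
// datanode in the pipeline and block on a readLong() of ack data, giving
// up after the same 66 s the client used.
public class AckWaitToy {
  public static void main(String[] args) throws IOException {
    Socket s = new Socket();
    s.connect(new InetSocketAddress("10.0.10.108", 50010), 10 * 1000);
    s.setSoTimeout(66 * 1000);
    DataInputStream in = new DataInputStream(s.getInputStream());
    try {
      long seqno = in.readLong();   // same readLong() frame as the trace
      System.out.println("ack for packet " + seqno);
    } catch (SocketTimeoutException e) {
      System.out.println("no ack within 66 s; client would start recovery");
    } finally {
      s.close();
    }
  }
}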

What's of particular interest to us is that the NameNode log shows that
the data being moved is job.jar:

2009-01-30 01:58:32,353 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /usr/local/rinera/hadoop/hadoop-runtime/0/mapred/system/job_200901291752_3021/job.jar. blk_7603130349014268849_2349933

Note the block name and timestamp: it's the same block, allocated at
01:58:32, exactly when the timeline above begins.

Does anyone else have knowledge of, or history with, such glitches?
We've recently begun seeing a number of problems in communication
between task management processes and DFS that we had not seen before,
and we're trying, so far without success, to nail down where they're
coming from.

