
List:       hadoop-dev
Subject:    [jira] Updated: (HADOOP-5361) Tasks freeze with "No live nodes
From:       "Matei Zaharia (JIRA)" <jira () apache ! org>
Date:       2009-02-28 2:43:13
Message-ID: 373522696.1235788993064.JavaMail.jira () brutus


     [ https://issues.apache.org/jira/browse/HADOOP-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-5361:
----------------------------------

    Description: 
Running a recent version of trunk on 100 nodes, I occasionally see some tasks freeze at startup and hang the job. These tasks are not speculatively executed either. Here's sample output from one of them:

{noformat}
2009-02-27 15:19:10,229 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2009-02-27 15:19:10,486 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2009-02-27 15:21:20,952 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2086525142250101885_39076 from any node:  java.io.IOException: No live nodes contain current block
2009-02-27 15:23:23,972 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2086525142250101885_39076 from any node:  java.io.IOException: No live nodes contain current block
2009-02-27 15:25:26,992 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2086525142250101885_39076 from any node:  java.io.IOException: No live nodes contain current block
2009-02-27 15:27:30,012 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 file=/user/root/rand2/part-00864
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)

2009-02-27 15:27:30,018 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 file=/user/root/rand2/part-00864
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
{noformat}

Note how the DFS client fails multiple times to retrieve the block, with a two-minute wait between attempts, without ever giving up. During this time, the task is *not* speculated. However, once this task finally failed, a new attempt of it ran successfully. Fetching the input file in question with bin/hadoop fs -get also worked fine.

There is no mention of the task attempt in question in the NameNode logs, but my guess is that something to do with RPC queues is causing its connection to be lost, and the DFSClient does not recover.
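To illustrate the shape of a possible fix: if the client bounded its retries with an overall deadline and short backoffs, a stuck read would fail fast enough for the framework to reschedule the task instead of hanging the job for minutes. The sketch below is hypothetical, not Hadoop's actual API; BlockReader, readBlock, and all parameter names are invented for illustration.

```java
import java.io.IOException;

public class BoundedRetryRead {
    // Hypothetical reader interface standing in for a DFS block read.
    interface BlockReader {
        byte[] readBlock(String blockId) throws IOException;
    }

    // Retry the read up to maxAttempts times, but never past deadlineMillis
    // total, sleeping briefly between attempts instead of minutes at a time.
    static byte[] readWithDeadline(BlockReader reader, String blockId,
                                   int maxAttempts, long deadlineMillis)
            throws IOException, InterruptedException {
        long start = System.currentTimeMillis();
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts
                && System.currentTimeMillis() - start < deadlineMillis; attempt++) {
            try {
                return reader.readBlock(blockId);
            } catch (IOException e) {
                last = e;  // remember the latest failure for the final error
                // Short exponential backoff, capped at 2 seconds.
                Thread.sleep(Math.min(100L << attempt, 2_000L));
            }
        }
        throw new IOException("Could not obtain block: " + blockId, last);
    }
}
```

With a deadline like this, the attempt dies quickly and the JobTracker can launch a replacement, rather than the task sitting un-speculated while the client retries indefinitely.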

  was:
Running a recent version of trunk on 100 nodes, I occasionally see some tasks freeze at startup and hang the job. These tasks are not speculatively executed either. Here's sample output from one of them:

{noformat}
2009-02-27 15:19:09,856 WARN org.apache.hadoop.conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
2009-02-27 15:19:10,229 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2009-02-27 15:19:10,486 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2009-02-27 15:21:20,952 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2086525142250101885_39076 from any node:  java.io.IOException: No live nodes contain current block
2009-02-27 15:23:23,972 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2086525142250101885_39076 from any node:  java.io.IOException: No live nodes contain current block
2009-02-27 15:25:26,992 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2086525142250101885_39076 from any node:  java.io.IOException: No live nodes contain current block
2009-02-27 15:27:30,012 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 file=/user/root/rand2/part-00864
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)

2009-02-27 15:27:30,018 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 file=/user/root/rand2/part-00864
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
{noformat}

Note how the DFS client fails multiple times to retrieve the block, with a two-minute wait between attempts, without ever giving up. During this time, the task is *not* speculated. However, once this task finally failed, a new attempt of it ran successfully. Fetching the input file in question with bin/hadoop fs -get also worked fine.

There is no mention of the task attempt in question in the NameNode logs, but my guess is that something to do with RPC queues is causing its connection to be lost, and the DFSClient does not recover.


Updated description to remove an insanely long line.

> Tasks freeze with "No live nodes contain current block", job takes long time to recover
> ---------------------------------------------------------------------------------------
>  
> Key: HADOOP-5361
> URL: https://issues.apache.org/jira/browse/HADOOP-5361
> Project: Hadoop Core
> Issue Type: Bug
> Affects Versions: 0.21.0
> Reporter: Matei Zaharia

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


