
List:       hadoop-user
Subject:    Reducers stuck in Shuffle ...
From:       Miles Osborne <miles () inf ! ed ! ac ! uk>
Date:       2009-01-30 18:05:07
Message-ID: 73e5a5310901301005t495d1f8ag14365e472dd8333f () mail ! gmail ! com

i've been seeing a lot of jobs where large numbers of reducers keep
failing in the shuffle phase due to connect timeouts (see the sample
reducer syslog below).  our setup consists of 8-core machines, with one
box acting as both a slave and the namenode.  the namenode box is not
running at full capacity, so load there doesn't appear to be the
problem.  we are running hadoop 0.18.1

reducers which run on the namenode box are fine; only those running
on slaves seem to be affected.

note that i get this even when i vary the number of reducers, so
it doesn't appear to be a function of the shard size

is there some flag i should modify to increase the timeout value?  or,
is this fixed in the latest release?

(i found one thread on this which talked about DNS entries and another
which mentioned HADOOP-3155)
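
to test the DNS theory, something like the rough python sketch below
could be run on an affected slave.  it times name resolution and a TCP
connect to the host named in a failed fetch.  the function name
check_connect is made up for illustration, and port 50060 is only an
assumption (the stock tasktracker http port, as far as i know); a
site config may use something else.

```python
import socket
import time


def check_connect(host, port, timeout=5.0):
    """Resolve host, then try a TCP connect.

    Returns (address, elapsed_seconds, error); error is None on success.
    """
    start = time.monotonic()
    try:
        # Name resolution is timed together with the connect attempt.
        addr = socket.gethostbyname(host)
    except OSError as exc:
        return None, time.monotonic() - start, "dns: %s" % exc
    try:
        sock = socket.create_connection((addr, port), timeout=timeout)
        sock.close()
    except OSError as exc:  # includes socket.timeout and connection refused
        return addr, time.monotonic() - start, "connect: %s" % exc
    return addr, time.monotonic() - start, None


# Hypothetical usage (host taken from the log, port 50060 is an assumption):
#   print(check_connect("crom.inf.ed.ac.uk", 50060))
```

a slow or failing "dns:" result would point at name resolution, while a
"connect:" error with a fast lookup would point at the network or the
tasktracker's http server instead.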

thanks

Miles
2009-01-30 10:26:14,085 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=SHUFFLE, sessionId=
2009-01-30 10:26:14,229 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/disk2/hadoop/mapred/local/taskTracker/jobcache/job_200901301017_0001/attempt_200901301017_0001_r_000011_0/work/./r-compute-ngram-counts]
2009-01-30 10:26:14,368 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=78643200, MaxSingleShuffleLimit=19660800
2009-01-30 10:26:14,488 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0 Thread started: Thread for merging on-disk files
2009-01-30 10:26:14,488 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0 Thread waiting: Thread for merging on-disk files
2009-01-30 10:26:14,488 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0 Thread started: Thread for merging in memory files
2009-01-30 10:26:14,489 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0 Need another 3895 map output(s) where 0 is already in progress
2009-01-30 10:26:14,495 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0: Got 6 new map-outputs & number of known map outputs is 6
2009-01-30 10:26:14,496 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0 Scheduled 1 of 6 known outputs (0 slow hosts and 5 dup hosts)
2009-01-30 10:26:44,566 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0 copy failed: attempt_200901301017_0001_m_000003_0 from crom.inf.ed.ac.uk
2009-01-30 10:26:44,567 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1296)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1290)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:944)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1143)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1084)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:997)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:946)
Caused by: java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
        at java.net.Socket.connect(Socket.java:519)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:152)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
        at sun.net.www.http.HttpClient.New(HttpClient.java:306)
        at sun.net.www.http.HttpClient.New(HttpClient.java:323)
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:788)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:729)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:654)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:977)
        ... 4 more

2009-01-30 10:26:45,493 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_200901301017_0001_r_000011_0: Failed fetch #1 from attempt_200901301017_0001_m_000003_0
2009-01-30 10:26:45,494 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0 adding host crom.inf.ed.ac.uk to penalty box, next contact in 4 seconds
2009-01-30 10:26:46,493 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200901301017_0001_r_000011_0: Got 6 map-outputs from previous fai
[log truncated here; the pager showed syslog lines 8-44 of 190]
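
to see whether the failed fetches cluster on particular hosts, the
"copy failed" lines in a reducer syslog can be tallied.  the following
is only a rough python sketch: the function name failed_hosts is made
up, and the pattern is based solely on the wording of the WARN line
shown above, so other versions may log it differently.

```python
import re
from collections import Counter

# Matches "... copy failed: <map attempt> from <host>" as in the WARN line above.
COPY_FAILED = re.compile(r"copy failed:\s+(\S+)\s+from\s+(\S+)")


def failed_hosts(log_text):
    """Count failed map-output copies per source host."""
    # Entries may be wrapped across lines (as in this mail), so collapse
    # all whitespace before matching.
    flat = " ".join(log_text.split())
    return Counter(host for _attempt, host in COPY_FAILED.findall(flat))


# Hypothetical usage, with the path to a reducer's syslog filled in:
#   with open("syslog") as f:
#       print(failed_hosts(f.read()).most_common())
```

if nearly all failures name the same one or two hosts, those boxes are
the place to look first; an even spread would suggest a cluster-wide
cause.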



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

