'Re: How does Hadoop choose machines for Reducers?'


List:       hadoop-user
Subject:    Re: How does Hadoop choose machines for Reducers?
From:       Nathan Marz <nathan () rapleaf ! com>
Date:       2009-01-30 19:08:43
Message-ID: 8B087C48-3FE1-45DA-809D-83BF9F2550D7 () rapleaf ! com


This is a huge problem for my application. I tried setting  
mapred.tasktracker.reduce.tasks.maximum to 1 in the job's JobConf, but  
that didn't have any effect. I'm using a custom output format and it's  
essential that Hadoop distribute the reduce tasks to make use of all
the machines, as there is contention when multiple reduce tasks run on
one machine. Since my number of reduce tasks is guaranteed to be less  
than the number of machines in the cluster, there's no reason for  
Hadoop not to make use of the full cluster.

Does anyone know of a way to force Hadoop to distribute reduce tasks  
evenly across all the machines?
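
To make the slot behaviour described in the quoted reply below concrete, here is a toy model in plain Java. It is only a sketch under stated assumptions, not Hadoop's actual scheduler: the class ReduceSlotModel, its methods, and the heartbeat order are all hypothetical, and the only rule it encodes is "a machine that reports in gets one pending reducer if it has a free slot". As far as I know, mapred.tasktracker.reduce.tasks.maximum is read by each TaskTracker from its own configuration when the daemon starts, which would explain why setting it in a job's JobConf changes nothing; the slotsPerMachine parameter stands in for that per-node value.

```java
import java.util.Arrays;

// Toy model only: NOT Hadoop's scheduler. All names here are hypothetical.
class ReduceSlotModel {

    // Hands one pending reducer to each reporting machine that still has a
    // free reduce slot. Returns the number of reducers placed per machine.
    static int[] assign(int machines, int slotsPerMachine, int reducers, int[] heartbeats) {
        int[] placed = new int[machines];
        int pending = reducers;
        for (int m : heartbeats) {
            if (pending == 0) {
                break; // nothing left to place
            }
            if (placed[m] < slotsPerMachine) {
                placed[m]++;
                pending--;
            }
        }
        return placed;
    }

    // Number of machines that received at least one reducer.
    static int machinesUsed(int[] placed) {
        int used = 0;
        for (int p : placed) {
            if (p > 0) {
                used++;
            }
        }
        return used;
    }

    // Total number of reducers that were placed.
    static int total(int[] placed) {
        return Arrays.stream(placed).sum();
    }

    // Assumed report order for 15 machines: machines 0-3 report once,
    // then every machine 0-14 reports in turn.
    static int[] exampleHeartbeats() {
        int[] hb = new int[4 + 15];
        for (int i = 0; i < 4; i++) {
            hb[i] = i;
        }
        for (int i = 0; i < 15; i++) {
            hb[4 + i] = i;
        }
        return hb;
    }

    public static void main(String[] args) {
        for (int slots = 2; slots >= 1; slots--) {
            int[] placed = assign(15, slots, 16, exampleHeartbeats());
            System.out.println("slots per machine = " + slots
                    + ", machines used = " + machinesUsed(placed)
                    + ", reducers placed = " + total(placed)
                    + ", per machine = " + Arrays.toString(placed));
        }
    }
}
```

With two slots per machine, this assumed report order places all 16 reducers on 12 machines, with 4 machines running 2 each. With one slot per machine, all 15 machines are used, but only 15 reducers fit at once, so the 16th has to wait for a slot to free up.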


On Jan 30, 2009, at 7:32 AM, jason hadoop wrote:

> Hadoop just distributes to the available reduce execution slots. I don't
> believe it pays attention to what machine they are on.
> I believe the plan is to take account of data locality in the future (i.e.,
> distribute tasks to machines that are considered topologically closer to
> their input split first), but I don't think this is available to most users.
>
>
> On Thu, Jan 29, 2009 at 7:05 PM, Nathan Marz <nathan@rapleaf.com> wrote:
>
>> I have a MapReduce application in which I configure 16 reducers to run on
>> 15 machines. My mappers output exactly 16 keys, IntWritables from 0 to 15.
>> However, only 12 of the 15 machines are used to run the 16 reducers (4
>> machines have 2 reducers running on each). Is there a way to get Hadoop to
>> use all the machines for reducing?
>>


