
List:       hadoop-user
Subject:    Re: Find reducer for a key
From:       Alberto Cordioli <cordioli.alberto () gmail ! com>
Date:       2013-03-30 13:28:09
Message-ID: CAFnReZbdtxpHOmiTsWzvv+-gRvfwXO2TL4p0uMq3kQ_aOEHCCA () mail ! gmail ! com

You understood the scenario correctly.
I see your rationale, and thanks for your suggestions.
To better explain the problem and my point of view, let me give an example.

I want to read two files. In the first one, each row has the
following form:
Airport_Id, User_Id, Time
and records a user's position in an airport at a specific time. This file
is very large.

The second file contains tuples of this form:
Flight_Id,Airport_From,Airport_To,Time
and summarizes the whole flight timetable with the respective airports.

Now, I want a job that takes the first file as input and computes all
the possible flights a user may have taken.
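
As a rough sketch of what "possible flights" could mean here: the rule
below (a flight counts if it departs from the airport where the user was
seen, at or after the time of the sighting) is only an assumption made
for illustration, as is treating Time as a comparable number. The class
and method names (FlightMatcher, Flight, possibleFlights) are
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; class, field, and method names are hypothetical.
final class FlightMatcher {

    // One row of the second file: Flight_Id,Airport_From,Airport_To,Time.
    static final class Flight {
        final String id;
        final String from;
        final String to;
        final long time; // assumed: departure time as a comparable number

        Flight(String id, String from, String to, long time) {
            this.id = id;
            this.from = from;
            this.to = to;
            this.time = time;
        }
    }

    // Assumed rule: a flight is "possible" if it leaves the airport where the
    // user was seen, at or after the time of that sighting.
    static List<Flight> possibleFlights(String airportId, long seenAt,
                                        List<Flight> timetable) {
        List<Flight> result = new ArrayList<>();
        for (Flight f : timetable) {
            if (f.from.equals(airportId) && f.time >= seenAt) {
                result.add(f);
            }
        }
        return result;
    }
}
```

In the job, this check would run once per (Airport_Id, User_Id, Time)
tuple, against whatever part of the timetable is available to the task.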

My solution, following what I wrote in the previous mails, would be
to emit tuples from the first file partitioned by Airport_Id.
Then we know that all the tuples with the same Airport_Id go to the same
reducer, and each reducer can load into memory only the part of the second
file related to the airports whose keys it will receive.
I think this is much faster than performing an MR join, right?
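
A minimal sketch of that filtering step, in plain Java with no Hadoop
dependency. It reuses the HashPartitioner formula quoted further down in
this thread. The names PartitionFilter, partitionFor and
flightsForReducer are hypothetical, and it assumes the job partitions on
Airport_Id and that only the Airport_From column of the second file
matters for the lookup.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical helper, not part of Hadoop.
final class PartitionFilter {

    // Same arithmetic as the HashPartitioner code quoted in this thread.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Keeps only the timetable lines whose Airport_From (second column of
    // Flight_Id,Airport_From,Airport_To,Time) maps to this reducer.
    static List<String> flightsForReducer(List<String> flightLines,
                                          int myPartition,
                                          int numReduceTasks) {
        List<String> kept = new ArrayList<>();
        for (String line : flightLines) {
            String[] fields = line.split(",");
            if (fields.length < 4) {
                continue; // skip malformed lines
            }
            String airportFrom = fields[1].trim();
            if (partitionFor(airportFrom, numReduceTasks) == myPartition) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

In a real reducer, myPartition would presumably come from
context.getConfiguration().getInt("mapred.task.partition", -1) and
numReduceTasks from context.getNumReduceTasks(), both read in setup().
One caveat: the object hashed here must be of the same class as the map
output key. Text.hashCode() and String.hashCode() give different values
for the same characters, so hashing a String while the job emits Text
keys would select the wrong lines.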


Thanks,
Alberto



On 29 March 2013 04:47, Hemanth Yamijala <yhemanth@thoughtworks.com> wrote:
> Hi,
>
> The way I understand your requirement - you have a file that contains a set
> of keys. You want to read this file on every reducer and take only those
> entries of the set, whose keys correspond to the current reducer.
>
> If the above summary is correct, can I assume that you are potentially
> reading the entire intermediate output key space on every reducer? Would
> that even work (considering memory constraints, etc.)?
>
> It seemed to me that your solution is implementing what the framework can
> already do for you. That was the rationale behind my suggestion. Maybe you
> should try and implement both approaches to see which one works better for
> you.
>
> Thanks
> hemanth
>
>
> On Thu, Mar 28, 2013 at 6:37 PM, Alberto Cordioli
> <cordioli.alberto@gmail.com> wrote:
>>
>> Yes, that is a possible solution.
>> But the MR job has a different purpose: the mappers already read other
>> (very large) files and output tuples.
>> You cannot control the number of mappers, so the risk is that a lot of
>> mappers will be created and each of them will also read the other file,
>> instead of only a small number of reducers reading it.
>>
>> Do you think that the solution I proposed is not so elegant or efficient?
>>
>> Alberto
>>
>> On 28 March 2013 13:12, Hemanth Yamijala <yhemanth@thoughtworks.com>
>> wrote:
>> > Hmm. That feels like a join. Can't you read the input file on the map
>> > side and output those keys along with the original map output keys? That
>> > way the reducer would automatically get both together.
>> >
>> >
>> > On Thu, Mar 28, 2013 at 5:20 PM, Alberto Cordioli
>> > <cordioli.alberto@gmail.com> wrote:
>> >>
>> >> Hi Hemanth,
>> >>
>> >> thanks for your reply.
>> >> Yes, this partially answers my question. I know how the hash
>> >> partitioner works and I guessed something similar.
>> >> The piece that I missed was that mapred.task.partition returns the
>> >> partition number of the reducer.
>> >> So, putting all the pieces together, I understand that for each key in
>> >> the file I have to call the HashPartitioner.
>> >> Then I have to compare the returned index with the one retrieved by
>> >> Configuration.getInt("mapred.task.partition").
>> >> If they are equal, that key will be served by that reducer. Is this
>> >> correct?
>> >>
>> >>
>> >> To answer your question:
>> >> On the reduce side of an MR job, I want to load some data from a file
>> >> into an in-memory structure. Actually, I don't need to store the whole
>> >> file for each reducer, but only the lines related to the keys that
>> >> particular reducer will receive.
>> >> So, my intention is to know the keys in the setup method to store only
>> >> the needed lines.
>> >>
>> >> Thanks,
>> >> Alberto
>> >>
>> >>
>> >> On 28 March 2013 11:01, Hemanth Yamijala <yhemanth@thoughtworks.com>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > Not sure if I am answering your question, but this is the background.
>> >> > Every MapReduce job has a partitioner associated with it. The default
>> >> > partitioner is a HashPartitioner. You can, as a user, write your own
>> >> > partitioner as well and plug it into the job. The partitioner is
>> >> > responsible for splitting the map output key space among the reducers.
>> >> >
>> >> > So, the reducer a key will go to is basically the value returned by
>> >> > the partitioner's getPartition method. For example, this is the code
>> >> > in the HashPartitioner:
>> >> >
>> >> >   public int getPartition(K2 key, V2 value,
>> >> >                           int numReduceTasks) {
>> >> >     return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>> >> >   }
>> >> >
>> >> > mapred.task.partition is the key that defines the partition number of
>> >> > this reducer.
>> >> >
>> >> > I guess you can piece together these bits into what you'd want.
>> >> > However, I am interested in understanding why you want to know this.
>> >> > Can you share some info?
>> >> >
>> >> > Thanks
>> >> > Hemanth
>> >> >
>> >> >
>> >> > On Thu, Mar 28, 2013 at 2:17 PM, Alberto Cordioli
>> >> > <cordioli.alberto@gmail.com> wrote:
>> >> >>
>> >> >> Hi everyone,
>> >> >>
>> >> >> how can I know the keys that are associated with a particular reducer
>> >> >> in the setup method?
>> >> >> Let's assume that the setup method reads from a file where each line
>> >> >> is a string that will become a key emitted from the mappers.
>> >> >> For each of these lines I would like to know if the string will be a
>> >> >> key associated with the current reducer or not.
>> >> >>
>> >> >> I read something about mapred.task.partition and mapred.task.id, but
>> >> >> I didn't understand the usage.
>> >> >>
>> >> >>
>> >> >> Thanks,
>> >> >> Alberto



-- 
Alberto Cordioli