List: hadoop-user
Subject: Re: 'Combining' input files for maps
From: Alexandre Rochette <alexroch () yahoo-inc ! com>
Date: 2007-06-21 21:36:02
Message-ID: 467AEF42.20405 () yahoo-inc ! com
Thanks!
It seems to work very well for my particular use.
Runping Qi wrote:
> You can use the data_join lib in contrib to do your job.
>
> Runping
>
>> -----Original Message-----
>> From: Alexandre Rochette [mailto:alexroch@yahoo-inc.com]
>> Sent: Wednesday, June 20, 2007 5:44 PM
>> To: hadoop-user@lucene.apache.org
>> Subject: 'Combining' input files for maps
>>
>> Hello Hadoop users,
>>
>> I've been scratching my head over this one and wondered whether anybody
>> has ever encountered something similar:
>>
>> I have 2 output files from a MapReduce job.
>>
>> Now I want to use these files as the input for a second MapReduce
>> job, but not by randomly taking lines from each one. I want to combine
>> each item in the first file with every item in the second.
>>
>> For instance, if I have file 1 with:
>> a
>> b
>> c
>> and file 2 with:
>> 1
>> 2
>> 3
>> I want the input of my new MapReduce job to be a1, a2, a3, b1 ... c3.
>>
>> To do it, I first thought about accumulating all the data in a single
>> reducer and outputting it correctly in the close() method, but that
>> requires too much memory, as I have to keep every item in my working
>> set to be able to iterate over them in close().
>>
>> So right now I'm stuck with going through the files sequentially,
>> reading them with the filesystem API and constructing a new file that
>> combines both. I do that in my driver's main() before running my
>> second MapReduce job.
>>
>> Is there anything already available that would let me do what I want in
>> a distributed fashion? Maybe a way to generate InputSplits by
>> reading from two files at once?
>>
>> Thank you,
>> alex.r.
>>
>
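
For readers following the thread, the sequential workaround described in the original question (walk both files and write every pairing to a new file) might look roughly like the sketch below. It is only an illustration under several assumptions: the class name CrossProduct and its write() method are hypothetical, it reads plain local files rather than going through the Hadoop filesystem API, and it joins items with no separator to match the a1, a2, ... example. It is not the data_join approach suggested in the reply, and it is not distributed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of Hadoop: pairs every line of one local
// file with every line of another.
class CrossProduct {

    // Streams `left` once and re-reads `right` for each of its lines, so only
    // one line from each file is held in memory at a time.
    // Returns the number of combined lines written to `out`.
    static long write(Path left, Path right, Path out) throws IOException {
        long count = 0;
        try (BufferedReader l = Files.newBufferedReader(left, StandardCharsets.UTF_8);
             BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String a;
            while ((a = l.readLine()) != null) {
                // Reopen the second file for every item of the first.
                try (BufferedReader r = Files.newBufferedReader(right, StandardCharsets.UTF_8)) {
                    String b;
                    while ((b = r.readLine()) != null) {
                        w.write(a + b); // assumption: plain concatenation, e.g. "a" + "1"
                        w.newLine();
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Demo on the example from the question, using temporary files.
        Path dir = Files.createTempDirectory("cross");
        Path file1 = Files.write(dir.resolve("file1"), Arrays.asList("a", "b", "c"));
        Path file2 = Files.write(dir.resolve("file2"), Arrays.asList("1", "2", "3"));
        Path combined = dir.resolve("combined");
        long n = write(file1, file2, combined);
        System.out.println(n + " lines: " + Files.readAllLines(combined));
        // prints: 9 lines: [a1, a2, a3, b1, b2, b3, c1, c2, c3]
    }
}
```

Re-reading the second file for each item of the first trades extra I/O for constant memory, which avoids the working-set problem of the single-reducer idea. The combined file grows as the product of the two line counts, which is why a distributed alternative was being asked for.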