List:       hadoop-user
Subject:    Re: 'Combining' input files for maps
From:       Alexandre Rochette <alexroch () yahoo-inc ! com>
Date:       2007-06-21 21:36:02
Message-ID: 467AEF42.20405 () yahoo-inc ! com

Thanks!

It seems to work very well for my particular use.

Runping Qi wrote:
> You can use the data_join lib in contrib to do your job.
>
> Runping
>
>> -----Original Message-----
>> From: Alexandre Rochette [mailto:alexroch@yahoo-inc.com]
>> Sent: Wednesday, June 20, 2007 5:44 PM
>> To: hadoop-user@lucene.apache.org
>> Subject: 'Combining' input files for maps
>>
>> Hello Hadoop users,
>>
>> I've been scratching my head over this one and wondered whether anybody
>> has ever encountered something similar:
>>
>> I have 2 output files from a MapReduce job.
>>
>> Now I want to use these files as input for a second MapReduce
>> job, but not by randomly taking lines from each one. I want to combine
>> each item in the first file with every item in the second.
>>
>> For instance, if I have file 1 with:
>> a
>> b
>> c
>> and file 2 with:
>> 1
>> 2
>> 3
>> I want the input of my new MapReduce job to be a1, a2, a3, b1 ... c3.
>>
>> To do it, I first thought about accumulating all the data in a single
>> reducer and outputting it correctly in the close() method, but that
>> requires too much memory, as I have to keep every item in my working set
>> to be able to iterate over them on close().
>>
>> So right now I'm stuck with sequentially going through the files,
>> reading them with the filesystem API and constructing a new file that
>> combines both. I do that in my driver's main() before running my
>> second MapReduce job.
>>
>> Is there anything already available that would let me do what I want in
>> a distributed fashion? Like maybe a way to generate InputSlices by
>> reading from two files at once?
>>
>> Thank you,
>> alex.r.
>>     
>
>
>   
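
For readers of the archive, here is a rough sketch of the block-wise idea
behind the question above: hold only a bounded block of file 1 in memory
and stream file 2 past it, so that no single worker needs the whole data
set in its working set. This is plain Python for illustration only. It is
not the data_join library and does not use any Hadoop API. The function
names (chunks, cross_chunk, cross_product) and the block_size parameter are
made up for this example.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunks(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items from `lines`."""
    it = iter(lines)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def cross_chunk(left_block: List[str], right_path: str) -> Iterator[str]:
    """Pair every item of one block of file 1 with every line of file 2.

    Only `left_block` is held in memory; file 2 is streamed line by line.
    """
    with open(right_path) as right:
        for r in right:
            r = r.rstrip("\n")
            for l in left_block:
                yield l + r


def cross_product(left_path: str, right_path: str,
                  block_size: int = 1000) -> Iterator[str]:
    """Stream the full cross product of two line-oriented files.

    Assumption: in a distributed setting each block could be handed to a
    separate task; here the blocks are simply processed one after another.
    """
    with open(left_path) as left:
        stripped = (line.rstrip("\n") for line in left)
        for block in chunks(stripped, block_size):
            yield from cross_chunk(block, right_path)
```

With the two example files from the original message, consuming
cross_product(file1, file2) yields all nine combinations. The order within
a block follows file 2 first, so it matches the "a1, a2, a3, b1 ... c3"
order only when block_size is 1. Since the result feeds a second MapReduce
job, the order should not matter. The trade-off is that file 2 is read once
per block, so a larger block_size means fewer passes at the cost of more
memory.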
