
List:       hadoop-user
Subject:    Re: A new way to merge up those small files!
From:       Edward Capriolo <edlinuxguru@gmail.com>
Date:       2010-09-27 14:04:07
Message-ID: AANLkTi=NU-7eaTo-grjJA1rknviySohG_6s2-VRe63z7@mail.gmail.com

Ted,

Good point. Patches are welcome :) I will add it onto my to-do list.

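The check itself would just sniff the first few bytes of each input
file, since a SequenceFile header starts with the bytes 'S', 'E', 'Q'.
A rough, untested sketch (the class and method names are placeholders,
not anything that exists in filecrush today):

import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FormatSniffer {
  /** True if the file begins with the SequenceFile magic bytes 'S','E','Q'. */
  public static boolean isSequenceFile(Configuration conf, Path path) throws IOException {
    FileSystem fs = path.getFileSystem(conf);
    FSDataInputStream in = fs.open(path);
    try {
      byte[] magic = new byte[3];
      in.readFully(magic);
      return magic[0] == 'S' && magic[1] == 'E' && magic[2] == 'Q';
    } catch (EOFException shortFile) {
      // A file shorter than three bytes cannot be a SequenceFile; treat it as text.
      return false;
    } finally {
      in.close();
    }
  }
}
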
Edward

On Sat, Sep 25, 2010 at 12:05 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> Edward:
> Thanks for the tool.
>
> I think the last parameter can be omitted if you follow what hadoop fs -text
> does.
> It looks at a file's magic number so that it can attempt to *detect* the
> type of the file.
>
> Cheers
>
> On Fri, Sep 24, 2010 at 11:41 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>
>> Many times a Hadoop job produces one file per reducer and the job has
>> many reducers, or a map-only job produces one output file per input
>> file and you have many input files, or you simply have many small
>> files from some external process. Hadoop handles small files poorly.
>> There are ways to deal with this inside a map reduce program, for
>> example an IdentityMapper + IdentityReducer pass or MultipleOutputs
>> (a bare-bones identity job is sketched below). However, we wanted a
>> tool that could be used by people working with Hive, Pig, or plain
>> map reduce. We wanted to let people combine a directory containing
>> multiple files, or a hierarchy of directories like the root of a Hive
>> partitioned table, and we wanted to be able to combine both text and
>> sequence files.
>>
>> What we came up with is the filecrusher.
>>
>> Usage:
>> /usr/bin/hadoop jar filecrush.jar crush.Crush /directory/to/compact
>> /user/edward/backup 50 SEQUENCE
>> (50 is the number of mappers here)
>>
>> Code is Apache V2 and you can get it here:
>> http://www.jointhegrid.com/hadoop_filecrush/index.jsp
>>
>> Enjoy,
>> Edward
>>
>
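For reference, the "inside a map reduce program" approach mentioned in
the original message is essentially an identity pass-through job that
rewrites many small files into a handful of reducer outputs. A minimal,
untested sketch against the old mapred API, assuming the inputs are
sequence files with Text keys and values (the class name and the
reducer count are made up for the example; note that the shuffle will
reorder records by key):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SmallFileMerge {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(SmallFileMerge.class);
    job.setJobName("merge small sequence files");

    // Pass every record through unchanged; the only goal is fewer,
    // larger output files.
    job.setMapperClass(IdentityMapper.class);
    job.setReducerClass(IdentityReducer.class);
    job.setNumReduceTasks(5);  // 5 output files instead of thousands of inputs

    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputFormat(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);    // assumes <Text, Text> sequence files
    job.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    JobClient.runJob(job);
  }
}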