List:       hadoop-user
Subject:    Re: How hdfs splits blocks on record boundaries
From:       Harsh J <harsh@cloudera.com>
Date:       2012-06-21 6:19:32
Message-ID: <CAOcnVr1N4uretuZdPPCTZeLffgrx3KJaK_n=B7a9LoaB3cjgAw@mail.gmail.com>

Sachin,

That would require knowledge of the record boundaries in the file, a
solution that wouldn't scale for very large files or for a large
number of files. You don't really have to do that; it's the hard way.
Please see my previous response (quoted below) for a proper MR way of
doing this.
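
For reference, here is a minimal driver sketch of that approach,
assuming a release that carries MAPREDUCE-2254 (such as Apache Hadoop
2.0.0); the class name and the input/output path arguments are
illustrative, not from this thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DollarDelimitedJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Make TextInputFormat/LineRecordReader treat '$' as the record
        // delimiter instead of '\n' (the MAPREDUCE-2254 feature).
        conf.set("textinputformat.record.delimiter", "$");

        Job job = Job.getInstance(conf, "dollar-delimited");
        job.setJarByClass(DollarDelimitedJob.class);
        job.setInputFormatClass(TextInputFormat.class);
        // No mapper/reducer set, so the identity classes run; each map
        // input value is then one '$'-terminated record.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

If the driver goes through ToolRunner, the same setting can instead be
passed on the command line as -D textinputformat.record.delimiter='$'.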

On Thu, Jun 21, 2012 at 10:45 AM, Sachin Aggarwal
<different.sachin@gmail.com> wrote:
> When you store data in HDFS it is split into 64 MB blocks
> automatically.
>
> Use these to control the number of mappers you get, by split size in
> bytes:
>
>    FileInputFormat.setMaxInputSplitSize(job, 2097152);
>    FileInputFormat.setMinInputSplitSize(job, 1048576);
>
> Then you can read each line and use the split function:
>    String[] fields = line.split(",");
>
> On Thu, Jun 14, 2012 at 10:56 AM, Harsh J <harsh@cloudera.com> wrote:
>
>> You may use TextInputFormat with the "textinputformat.record.delimiter"
>> config set to the character you use. This feature is available in the
>> Apache Hadoop 2.0.0 release (and perhaps in other distributions that
>> carry backports).
>>
>> In case you don't have a Hadoop cluster with this feature
>> (MAPREDUCE-2254), you can read up on how \n is handled and handle your
>> files in the same way (swapping \n in LineReader with your character,
>> essentially what the above feature does):
>> http://wiki.apache.org/hadoop/HadoopMapReduce (See the Map section for
>> the logic)
>>
>> Does this help?
>>
>> On Thu, Jun 14, 2012 at 6:41 AM, prasenjit mukherjee
>> <prasen.bea@gmail.com> wrote:
>> > I have a text file in which records are not newline-terminated;
>> > they are separated by a special character (e.g. $). If I push a
>> > single 5 GB file to HDFS, how will it identify the boundaries on
>> > which the file should be split?
>> >
>> > What are the options I have in such a scenario so that I can run
>> > MapReduce jobs:
>> >
>> > 1. Replace the record separator with a newline? (Not very
>> > convincing, as I have newlines in the data.)
>> >
>> > 2. Create 64 MB chunks by some preprocessing? (Would love to know
>> > if it can be avoided.)
>> >
>> > 3. I can definitely write my custom loader for my MapReduce jobs,
>> > but even then, is it possible to reach out across HDFS nodes if the
>> > blocks are not aligned with record boundaries?
>> >
>> > Thanks,
>> > Prasenjit
>> >
>> > --
>> > Sent from my mobile device
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
>
> Thanks & Regards
>
> Sachin Aggarwal
> 7760502772
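
For clusters without MAPREDUCE-2254, here is a rough (untested) sketch
of the fallback described in the quoted message above: a RecordReader
that treats '$' the way LineRecordReader treats '\n'. Every split but
the first skips its partial leading record, and a reader may run past
its split's end to finish the last record it starts, which is how
records crossing block boundaries get handled. Class names here are
illustrative:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class DollarRecordReader extends RecordReader<LongWritable, Text> {
      private FSDataInputStream in;
      private long start, end, pos;
      private final LongWritable key = new LongWritable();
      private final Text value = new Text();
      private final byte[] one = new byte[1];

      @Override
      public void initialize(InputSplit split, TaskAttemptContext ctx)
          throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        Path file = fileSplit.getPath();
        start = fileSplit.getStart();
        end = start + fileSplit.getLength();
        in = file.getFileSystem(ctx.getConfiguration()).open(file);
        in.seek(start);
        // Unless this split starts the file, discard the partial record
        // here; the previous split's reader finishes it for us.
        if (start != 0) {
          readUntilDollar(null);
        }
        pos = in.getPos();
      }

      // Appends bytes up to (and excluding) the next '$' into 'into'
      // (null means discard); returns false only if nothing was read.
      private boolean readUntilDollar(Text into) throws IOException {
        boolean sawAny = false;
        while (in.read(one, 0, 1) != -1) {
          sawAny = true;
          if (one[0] == '$') break;
          if (into != null) into.append(one, 0, 1);
        }
        return sawAny;
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        // Only start a record at or before the split end; reading the
        // record itself may legitimately run past 'end'.
        if (pos > end) return false;
        key.set(pos);
        value.clear();
        boolean got = readUntilDollar(value);
        pos = in.getPos();
        return got;
      }

      @Override public LongWritable getCurrentKey() { return key; }
      @Override public Text getCurrentValue() { return value; }
      @Override public float getProgress() {
        return end == start ? 1.0f
            : Math.min(1.0f, (pos - start) / (float) (end - start));
      }
      @Override public void close() throws IOException {
        if (in != null) in.close();
      }
    }

A FileInputFormat subclass would hand this reader out from
createRecordReader() and leave isSplitable() as-is. Reading past the
split end simply becomes a (possibly remote) read through the
FSDataInputStream, which also answers question 3 in the original mail;
the byte-at-a-time loop is kept simple here, where LineReader does a
buffered scan.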



-- 
Harsh J