
List:       hadoop-user
Subject:    Re: CombineInputFormat for mix of small and large files.
From:       Ravindra <ravindra.bajpai@gmail.com>
Date:       2017-02-24 9:28:53
Message-ID: CANuZ=14sS_73_m4YJXyzg2NQqzPWwUAg80KtcVSZcrYiQgyvqg@mail.gmail.com

Also, to add: my test input file has fewer records than the count I see
going to the mappers (the Map Input Records counter), and the input file is
more than double the block size.
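A Map Input Records count higher than the file's actual record count usually means the record reader is re-emitting records that straddle split boundaries. Below is a minimal plain-Java sketch (not Hadoop API; the class and method names are illustrative) of the convention Hadoop's own LineRecordReader follows for newline-delimited records:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (no Hadoop dependencies) of the split-boundary
// convention for text records: a split owns every record that *starts*
// inside it. A reader whose split does not begin at byte 0 discards the
// partial record it lands in, and the last record it owns may be read
// past the split's end.
public class SplitReaderSketch {

    public static List<String> readSplit(byte[] data, long start, long end) {
        List<String> records = new ArrayList<>();
        int pos = (int) start;
        if (start != 0) {
            // The record in progress belongs to the previous split, which
            // reads past its own end to finish it: skip ahead to the byte
            // after the next newline.
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++;
        }
        // Accept a record only if it starts at or before `end`; read it
        // in full even if it extends past `end`.
        while (pos < data.length && pos <= end) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            records.add(new String(data, lineStart, pos - lineStart,
                    StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        // Split the file mid-record: byte 4 falls inside "bb".
        System.out.println(readSplit(data, 0, 4));           // [aa, bb]
        System.out.println(readSplit(data, 4, data.length)); // [cc]
    }
}
```

If a custom reader instead starts emitting from the raw split offset, the record straddling the boundary is produced by both adjacent mappers, which inflates Map Input Records in exactly this way.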


On Fri, Feb 24, 2017 at 4:25 PM Ravindra <ravindra.bajpai@gmail.com> wrote:

> Hi All,
>
> I have implemented CombineInputFormat for my job, and it works well for
> small files, i.e. it combines them up to the block boundary. But there are
> a few very large files that arrive from the input source along with the
> small files, so the mapper that gets to work on such a large file becomes
> a laggard.
>
> I had overridden isSplitable to return false. I guessed that was the
> reason, so I removed the override (i.e. let Hadoop keep its default
> behaviour). Hadoop splits the big files now, fine, but then I see
> inconsistency in the output records.
>
> Is there anything related to my CustomRecordReader that I need to take
> care of? I'm not sure.
>
> Please advise!
>
> Thanks.
>




