
List:       flume-user
Subject:    Re: hdfs.fileType = CompressedStream
From:       Jimmy <jimmyjack () gmail ! com>
Date:       2014-01-30 22:30:10
Message-ID: CAE0GdZVAvmrcEv2ie7Gbd9YbvSv_R-Odd1Jqg3bEYEs08MkL5A () mail ! gmail ! com

Snappy is not splittable either; combined with sequence files it gives an
identical result: the whole file is bulk-dumped into HDFS at once.

I feel a bit uneasy keeping a 120MB file (almost 1GB uncompressed) open for
an hour.....
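For what it's worth, one way around keeping the file open that long would be to roll on size as well as time, so the sink closes the file as soon as it fills a block. A minimal, untested sketch of the relevant HDFS sink settings (agent name `a1` and sink name `k1` are placeholders; whether rollSize counts pre- or post-compression bytes under CompressedStream is something I'd verify first):

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.fileType = CompressedStream
    a1.sinks.k1.hdfs.codeC = gzip
    # roll every 10 minutes OR at ~120MB, whichever comes first
    a1.sinks.k1.hdfs.rollInterval = 600
    # 125829120 bytes = 120MB; 0 would disable size-based rolling
    a1.sinks.k1.hdfs.rollSize = 125829120
    # disable event-count-based rolling
    a1.sinks.k1.hdfs.rollCount = 0
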



On Thu, Jan 30, 2014 at 1:59 PM, Jeff Lord <jlord@cloudera.com> wrote:

> You are using gzip, so the files won't be splittable.
> You may be better off using snappy and sequence files.
>
>
> On Thu, Jan 30, 2014 at 10:51 AM, Jimmy <jimmyjack@gmail.com> wrote:
>
>> I am running a few tests and would like to confirm whether this is true...
>>
>> hdfs.codeC = gzip
>> hdfs.fileType = CompressedStream
>> hdfs.writeFormat = Text
>> hdfs.batchSize = 100
>>
>>
>> Now let's assume I have a large number of transactions and I roll the file
>> every 10 minutes.
>>
>> It seems the tmp file stays at 0 bytes and flushes all at once after 10
>> minutes, whereas if I don't use compression, the file grows as data are
>> written to HDFS.
>>
>> is this correct?
>>
>> Do you see any drawback in using CompressedStream with very large files?
>> In my case the 120MB compressed file (one block size) is 10x smaller than
>> the uncompressed data.
>>
>>
>
