
List:       flume-user
Subject:    Fwd: best way to make all hdfs records in one file under a folder?
From:       Jimmy <jimmyjack@gmail.com>
Date:       2014-01-20 20:35:13
Message-ID: CAE0GdZWCCO5FQdOfB_QA=AoaFxcRhNOsWRSnBemO9zwmG0YRTA@mail.gmail.com

It seems like the only reason to do this is the "too many small files" issue, correct?

Running File Crusher regularly might be a better option than trying to tune
this in Flume:

http://www.jointhegrid.com/hadoop_filecrush/index.jsp
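For illustration, a periodic compaction pass along those lines might look like the sketch below. The jar name, main class, paths, and argument order here are assumptions based on the filecrush project page, not a verified invocation; check the project's own docs before running it.

```
# Merge the small files under one partition directory into larger files.
# filecrush.jar and the Crush main class come from the project linked above;
# the trailing timestamp tags this crush run. All paths are hypothetical.
hadoop jar filecrush.jar com.m6d.filecrush.crush.Crush \
    /flume/events/20140120 \
    /flume/events_crushed/20140120 \
    20140120203500
```

Scheduled from cron (or Oozie) once per hour or per day, this keeps the per-partition file count down without fighting Flume's roll settings.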



---------- Forwarded message ----------
From: Chen Wang <chen.apache.solr@gmail.com>
Date: Mon, Jan 20, 2014 at 11:21 AM
Subject: Re: best way to make all hdfs records in one file under a folder?
To: user@flume.apache.org


Chris,
It rolls every 6 minutes (that's why I set the roll interval to 60*6 = 360).
The data size is around 15 MB per interval, so I want it all in one file.
Chen


On Mon, Jan 20, 2014 at 10:57 AM, Christopher Shannon <cshannon108@gmail.com
> wrote:

> How is your data partitioned, by date?
>
>
> On Monday, January 20, 2014, Chen Wang <chen.apache.solr@gmail.com> wrote:
>
>> Guys,
>> I have Flume set up to flow partitioned data to HDFS; each partition has
>> its own folder. Is there a way to force all the data under one
>> partition into a single file?
>> I am currently using
>> MyAgent.sinks.HDFS.hdfs.batchSize = 10000
>> MyAgent.sinks.HDFS.hdfs.rollSize = 15000000
>> MyAgent.sinks.HDFS.hdfs.rollCount = 10000
>> MyAgent.sinks.HDFS.hdfs.rollInterval = 360
>>
>> to make the file roll at 15 MB of data or after 6 minutes.
>>
>> Is this the best way to achieve my goal?
>> Thanks,
>> Chen
>>
>>
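If the goal is exactly one file per partition per interval, a minimal sketch of the sink configuration is below. It reuses the agent and sink names from the snippet above (which are this thread's names, not defaults), and relies on the HDFS sink convention that setting rollSize and rollCount to 0 disables those triggers, leaving only the time-based roll:

```
# Hypothetical agent/sink names matching the thread; adjust to your topology.
MyAgent.sinks.HDFS.type = hdfs
MyAgent.sinks.HDFS.hdfs.path = /flume/events/%Y%m%d/%H%M
MyAgent.sinks.HDFS.hdfs.batchSize = 10000
# 0 disables size- and count-based rolling, so only rollInterval applies
MyAgent.sinks.HDFS.hdfs.rollSize = 0
MyAgent.sinks.HDFS.hdfs.rollCount = 0
# roll once per 6-minute window (value is in seconds)
MyAgent.sinks.HDFS.hdfs.rollInterval = 360
```

Note that this guarantees at most one open file per directory per interval only for a single sink; multiple sinks writing the same partition, or HDFS write failures forcing a re-open, can still produce extra files.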




