
List:       flume-user
Subject:    Re: best way to make all hdfs records in one file under a folder?
From:       Jeff Lord <jlord () cloudera ! com>
Date:       2014-01-20 21:46:13
Message-ID: CAJmzdX=D_sZTpC3QjpiDNHxoU5UZiZuMeTFH0+OtZWY_9=sS_w () mail ! gmail ! com

If you don't intend to roll based on the number of events, then you will want
to set rollCount to 0.
MyAgent.sinks.HDFS.hdfs.rollCount = 0
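
As a rough sketch of how that fits with the other roll settings from the original post (the agent and sink names and the size/interval values are taken from the thread, not recommendations), the sink section might then look like this. A value of 0 disables the corresponding roll trigger, so the same approach would apply to rollSize if size-based rolling is not wanted either:

```properties
# Sketch only: names and values copied from the thread for illustration.
MyAgent.sinks.HDFS.hdfs.batchSize = 10000
# 0 = never roll based on the number of events
MyAgent.sinks.HDFS.hdfs.rollCount = 0
# roll once the file reaches about 15 MB (set to 0 to disable size-based rolling)
MyAgent.sinks.HDFS.hdfs.rollSize = 15000000
# roll after 360 seconds (6 minutes)
MyAgent.sinks.HDFS.hdfs.rollInterval = 360
```

With both the size and time triggers left on, the file rolls on whichever limit is hit first, so a partition slightly over 15 MB could still end up split across two files.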


On Mon, Jan 20, 2014 at 12:35 PM, Jimmy <jimmyjack@gmail.com> wrote:

> Seems like the only reason is the "too many files" issue, correct?
>
> File Crusher, run regularly, might be a better option than trying to tune
> it in Flume.
>
> http://www.jointhegrid.com/hadoop_filecrush/index.jsp
>
>
>
> ---------- Forwarded message ----------
> From: Chen Wang <chen.apache.solr@gmail.com>
> Date: Mon, Jan 20, 2014 at 11:21 AM
> Subject: Re: best way to make all hdfs records in one file under a folder?
> To: user@flume.apache.org
>
>
> Chris,
> It's partitioned every 6 minutes (that's why I set the roll interval to
> 60*6=360). The data size is around 15 MB, so I want it all in one file.
> Chen
>
>
> On Mon, Jan 20, 2014 at 10:57 AM, Christopher Shannon <
> cshannon108@gmail.com> wrote:
>
>> How is your data partitioned, by date?
>>
>>
>> On Monday, January 20, 2014, Chen Wang <chen.apache.solr@gmail.com>
>> wrote:
>>
>>> Guys,
>>> I have Flume set up to flow partitioned data to HDFS; each partition has
>>> its own folder. Is there a way to make all the data under one
>>> partition go into one file?
>>> I am currently using
>>> MyAgent.sinks.HDFS.hdfs.batchSize = 10000
>>> MyAgent.sinks.HDFS.hdfs.rollSize = 15000000
>>> MyAgent.sinks.HDFS.hdfs.rollCount = 10000
>>> MyAgent.sinks.HDFS.hdfs.rollInterval = 360
>>>
>>> to make the file roll at 15 MB of data or after 6 minutes.
>>>
>>> Is this the best way to achieve my goal?
>>> Thanks,
>>> Chen
>>>
>>>
>
>
>



