
List:       flume-user
Subject:    Re: .tmp in hdfs sink
From:       Mohit Anchlia <mohitanchlia () gmail ! com>
Date:       2012-11-29 16:46:47
Message-ID: CAOT3TWriNcqpy3pnKtvie5hK+CGseLvDEbTA0RZpru91H8xurA () mail ! gmail ! com

Thanks for your response so far. I checked out flume-1.3.0 and have built
it. My next question: is hdfs.closeIdleTimeout the correct property name? Do I
need to set any other property? My current config is below. I write to a
YYYY/MM/DD/HH path, so I essentially get 1-2 files per hour.


webanalytics.sinks.hdfsSink.hdfs.filePrefix = web
webanalytics.sinks.hdfsSink.hdfs.rollInterval = 4000
webanalytics.sinks.hdfsSink.hdfs.rollCount = 20000000
#webanalytics.sinks.hdfsSink.hdfs.rollCount = 40000
webanalytics.sinks.hdfsSink.hdfs.rollSize = 15000000000
webanalytics.sinks.hdfsSink.hdfs.fileType = SequenceFile
webanalytics.sinks.hdfsSink.hdfs.writeFormat = Text
webanalytics.sinks.hdfsSink.hdfs.codeC = snappy
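
A minimal sketch of the idle-close setting added to the sink config above.
It assumes the property name hdfs.idleTimeout, which is how the Flume 1.3.0
user guide lists the FLUME-1660 option (seconds of inactivity before a file
is closed; 0 disables it), rather than hdfs.closeIdleTimeout. The value 300
is an illustrative assumption taken from the "5 minutes or something"
mentioned further down the thread. Note that hdfs.rollInterval is also in
seconds, so 4000 is roughly 67 minutes, which would be consistent with .tmp
files lingering for more than an hour.

```properties
# Sketch only, not a verified production config.
# Assumed property name: hdfs.idleTimeout (seconds; 0 = never close idle files).
# 300 seconds (5 minutes) is an example value, not a recommendation.
webanalytics.sinks.hdfsSink.hdfs.idleTimeout = 300
```

With a setting like this, a late event for an already-closed hour would
simply open a new file in the same path, as described in the quoted reply
below.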


On Wed, Nov 28, 2012 at 9:20 PM, Juhani Connolly <
juhani_connolly@cyberagent.co.jp> wrote:

>  The changes are in both the 1.3 RC5 and in the 1.4 trunk
>
>
> On 11/29/2012 01:26 PM, Mohit Anchlia wrote:
>
> If I grab the last snapshot would I get these changes?
>
> On Tue, Nov 20, 2012 at 3:24 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
>
>> that's awesome!
>>
>>
>> On Tue, Nov 20, 2012 at 3:11 PM, Mike Percy <mpercy@apache.org> wrote:
>>
>>> Mohit,
>>> No problem, but Juhani did all the work. :)
>>>
>>> The behavior is that you can configure an HDFS sink to close a file if
>>> it hasn't gotten any writes in some time. After it's been idle for 5
>>> minutes or something, it gets closed. If you get a "late" event that goes
>>> to the same path after the file is closed, it will just create a new file
>>> in the same path as usual.
>>>
>>> Regards,
>>> Mike
>>>
>>>
>>> On Tue, Nov 20, 2012 at 12:56 PM, Brock Noland <brock@cloudera.com> wrote:
>>>
>>>> We are currently voting on a 1.3.0 RC on the dev@ list:
>>>>
>>>> http://s.apache.org/OQ0W
>>>>
>>>> You don't have to be a committer to vote! :)
>>>>
>>>> Brock
>>>>
>>>> On Tue, Nov 20, 2012 at 2:53 PM, Mohit Anchlia <mohitanchlia@gmail.com>
>>>> wrote:
>>>> > Thanks a lot!! Now with this what should be the expected behaviour?
>>>> > After file is closed a new file is created for writes that come after
>>>> > closing the file?
>>>> >
>>>> > Thanks again for committing this change. Do you know when 1.3.0 is
>>>> > out? I am currently using the snapshot version of 1.3.0
>>>> >
>>>> > On Tue, Nov 20, 2012 at 11:16 AM, Mike Percy <mpercy@apache.org>
>>>> > wrote:
>>>> >>
>>>> >> Mohit,
>>>> >> FLUME-1660 is now committed and it will be in 1.3.0. In the case
>>>> >> where you are using 1.2.0, I suggest running with hdfs.rollInterval
>>>> >> set so the files will roll normally.
>>>> >>
>>>> >> Regards,
>>>> >> Mike
>>>> >>
>>>> >>
>>>> >> On Thu, Nov 15, 2012 at 11:23 PM, Juhani Connolly
>>>> >> <juhani_connolly@cyberagent.co.jp> wrote:
>>>> >>>
>>>> >>> I am actually working on a patch for exactly this, refer to
>>>> >>> FLUME-1660
>>>> >>>
>>>> >>> The patch is on review board right now, I fixed a corner case issue
>>>> >>> that came up with unit testing, but the implementation is not really
>>>> >>> to my satisfaction. If you are interested please have a look and add
>>>> >>> your opinion.
>>>> >>>
>>>> >>> https://issues.apache.org/jira/browse/FLUME-1660
>>>> >>> https://reviews.apache.org/r/7659/
>>>> >>>
>>>> >>>
>>>> >>> On 11/16/2012 01:16 PM, Mohit Anchlia wrote:
>>>> >>>
>>>> >>> Another question I had was about rollover. What's the best way to
>>>> >>> rollover files in reasonable timeframe? For instance our path is
>>>> >>> YY/MM/DD/HH so every hour there is new file and the -1 hr is just
>>>> >>> sitting with .tmp and it takes sometimes even hour before .tmp is
>>>> >>> closed and renamed to .snappy. In this situation is there a way to
>>>> >>> tell flume to rollover files sooner based on some idle time limit?
>>>> >>>
>>>> >>> On Thu, Nov 15, 2012 at 8:14 PM, Mohit Anchlia <mohitanchlia@gmail.com>
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> Thanks Mike it makes sense. Anyway I can help?
>>>> >>>>
>>>> >>>>
>>>> >>>> On Thu, Nov 15, 2012 at 11:54 AM, Mike Percy <mpercy@apache.org>
>>>> >>>> wrote:
>>>> >>>>>
>>>> >>>>> Hi Mohit, this is a complicated issue. I've filed
>>>> >>>>> https://issues.apache.org/jira/browse/FLUME-1714 to track it.
>>>> >>>>>
>>>> >>>>> In short, it would require a non-trivial amount of work to
>>>> >>>>> implement this, and it would need to be done carefully. I agree
>>>> >>>>> that it would be better if Flume handled this case more gracefully
>>>> >>>>> than it does today. Today, Flume assumes that you have some job
>>>> >>>>> that would go and clean up the .tmp files as needed, and that you
>>>> >>>>> understand that they could be partially written if a crash occurred.
>>>> >>>>>
>>>> >>>>> Regards,
>>>> >>>>> Mike
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Sun, Nov 11, 2012 at 8:32 AM, Mohit Anchlia <mohitanchlia@gmail.com>
>>>> >>>>> wrote:
>>>> >>>>>>
>>>> >>>>>> What we are seeing is that if flume gets killed either because of
>>>> >>>>>> server failure or other reasons, it keeps around the .tmp file.
>>>> >>>>>> Sometimes for whatever reasons .tmp file is not readable. Is there
>>>> >>>>>> a way to rollover .tmp file more gracefully?
>>>> >>>>>
>>>> >>>>>
>>>> >>>>
>>>> >>>
>>>> >>>
>>>> >>
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Apache MRUnit - Unit testing MapReduce -
>>>> http://incubator.apache.org/mrunit/
>>>>
>>>
>>>
>>
>
>




