List: flume-user
Subject: Re: .tmp in hdfs sink
From: Mohit Anchlia <mohitanchlia () gmail ! com>
Date: 2012-11-29 16:46:47
Message-ID: CAOT3TWriNcqpy3pnKtvie5hK+CGseLvDEbTA0RZpru91H8xurA () mail ! gmail ! com
Thanks for your response so far. I checked out flume-1.3.0 and have built
it. My next question: is hdfs.closeIdleTimeout the correct property name? Do I
need to set any other property? My current config is below. I write to paths
in YYYY/MM/DD/HH format, so essentially I get 1-2 files per hour.
webanalytics.sinks.hdfsSink.hdfs.filePrefix = web
webanalytics.sinks.hdfsSink.hdfs.rollInterval = 4000
webanalytics.sinks.hdfsSink.hdfs.rollCount = 20000000
#webanalytics.sinks.hdfsSink.hdfs.rollCount = 40000
webanalytics.sinks.hdfsSink.hdfs.rollSize = 15000000000
webanalytics.sinks.hdfsSink.hdfs.fileType = SequenceFile
webanalytics.sinks.hdfsSink.hdfs.writeFormat = Text
webanalytics.sinks.hdfsSink.hdfs.codeC = snappy
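A minimal sketch of the idle-close setting added to the config above. It assumes the property introduced by FLUME-1660 is named hdfs.idleTimeout and is given in seconds; the name hdfs.closeIdleTimeout is not confirmed anywhere in this thread, so check the HDFS sink section of the user guide for the build in use. The 300-second value only mirrors the "5 minutes or something" example in the quoted reply below.

```properties
# Assumption: the idle-close property is hdfs.idleTimeout, in seconds (0 would disable it).
# 300 s matches the "5 minutes" example from the quoted reply; it is not a recommendation.
webanalytics.sinks.hdfsSink.hdfs.idleTimeout = 300
```

As far as I know hdfs.rollInterval is also in seconds, so the value of 4000 above rolls a file roughly every 67 minutes, which would fit the .tmp files lingering for about an hour.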
On Wed, Nov 28, 2012 at 9:20 PM, Juhani Connolly <
juhani_connolly@cyberagent.co.jp> wrote:
> The changes are in both the 1.3 RC5 and in the 1.4 trunk
>
>
> On 11/29/2012 01:26 PM, Mohit Anchlia wrote:
>
> If I grab the last snapshot would I get these changes?
>
> On Tue, Nov 20, 2012 at 3:24 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
>
>> that's awesome!
>>
>>
>> On Tue, Nov 20, 2012 at 3:11 PM, Mike Percy <mpercy@apache.org> wrote:
>>
>>> Mohit,
>>> No problem, but Juhani did all the work. :)
>>>
>>> The behavior is that you can configure an HDFS sink to close a file if
>>> it hasn't gotten any writes in some time. After it's been idle for 5
>>> minutes or something, it gets closed. If you get a "late" event that goes
>>> to the same path after the file is closed, it will just create a new file
>>> in the same path as usual.
>>>
>>> Regards,
>>> Mike
>>>
>>>
>>> On Tue, Nov 20, 2012 at 12:56 PM, Brock Noland <brock@cloudera.com> wrote:
>>>
>>>> We are currently voting on a 1.3.0 RC on the dev@ list:
>>>>
>>>> http://s.apache.org/OQ0W
>>>>
>>>> You don't have to be a committer to vote! :)
>>>>
>>>> Brock
>>>>
>>>> On Tue, Nov 20, 2012 at 2:53 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
>>>> > Thanks a lot!! Now with this what should be the expected behaviour? After
>>>> > file is closed a new file is created for writes that come after closing the
>>>> > file?
>>>> >
>>>> > Thanks again for committing this change. Do you know when 1.3.0 is out? I am
>>>> > currently using the snapshot version of 1.3.0
>>>> >
>>>> > On Tue, Nov 20, 2012 at 11:16 AM, Mike Percy <mpercy@apache.org> wrote:
>>>> >>
>>>> >> Mohit,
>>>> >> FLUME-1660 is now committed and it will be in 1.3.0. In the case where you
>>>> >> are using 1.2.0, I suggest running with hdfs.rollInterval set so the files
>>>> >> will roll normally.
>>>> >>
>>>> >> Regards,
>>>> >> Mike
>>>> >>
>>>> >>
>>>> >> On Thu, Nov 15, 2012 at 11:23 PM, Juhani Connolly
>>>> >> <juhani_connolly@cyberagent.co.jp> wrote:
>>>> >>>
>>>> >>> I am actually working on a patch for exactly this, refer to FLUME-1660
>>>> >>>
>>>> >>> The patch is on review board right now, I fixed a corner case issue that
>>>> >>> came up with unit testing, but the implementation is not really to my
>>>> >>> satisfaction. If you are interested please have a look and add your opinion.
>>>> >>>
>>>> >>> https://issues.apache.org/jira/browse/FLUME-1660
>>>> >>> https://reviews.apache.org/r/7659/
>>>> >>>
>>>> >>>
>>>> >>> On 11/16/2012 01:16 PM, Mohit Anchlia wrote:
>>>> >>>
>>>> >>> Another question I had was about rollover. What's the best way to
>>>> >>> rollover files in reasonable timeframe? For instance our path is YY/MM/DD/HH
>>>> >>> so every hour there is new file and the -1 hr is just sitting with .tmp and
>>>> >>> it takes sometimes even hour before .tmp is closed and renamed to .snappy.
>>>> >>> In this situation is there a way to tell flume to rollover files sooner
>>>> >>> based on some idle time limit?
>>>> >>>
>>>> >>> On Thu, Nov 15, 2012 at 8:14 PM, Mohit Anchlia <mohitanchlia@gmail.com>
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> Thanks Mike it makes sense. Anyway I can help?
>>>> >>>>
>>>> >>>>
>>>> >>>> On Thu, Nov 15, 2012 at 11:54 AM, Mike Percy <mpercy@apache.org> wrote:
>>>> >>>>>
>>>> >>>>> Hi Mohit, this is a complicated issue. I've filed
>>>> >>>>> https://issues.apache.org/jira/browse/FLUME-1714 to track it.
>>>> >>>>>
>>>> >>>>> In short, it would require a non-trivial amount of work to implement
>>>> >>>>> this, and it would need to be done carefully. I agree that it would be
>>>> >>>>> better if Flume handled this case more gracefully than it does today. Today,
>>>> >>>>> Flume assumes that you have some job that would go and clean up the .tmp
>>>> >>>>> files as needed, and that you understand that they could be partially
>>>> >>>>> written if a crash occurred.
>>>> >>>>>
>>>> >>>>> Regards,
>>>> >>>>> Mike
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Sun, Nov 11, 2012 at 8:32 AM, Mohit Anchlia <mohitanchlia@gmail.com>
>>>> >>>>> wrote:
>>>> >>>>>>
>>>> >>>>>> What we are seeing is that if flume gets killed either because of
>>>> >>>>>> server failure or other reasons, it keeps around the .tmp file. Sometimes
>>>> >>>>>> for whatever reasons .tmp file is not readable. Is there a way to rollover
>>>> >>>>>> .tmp file more gracefully?
>>>> >>>>>
>>>> >>>>>
>>>> >>>>
>>>> >>>
>>>> >>>
>>>> >>
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Apache MRUnit - Unit testing MapReduce -
>>>> http://incubator.apache.org/mrunit/
>>>>
>>>
>>>
>>
>
>