List:       flume-user
Subject:    Re: how spooling directory source identifies the complete file
From:       SaravanaKumar TR <saran0081986 () gmail ! com>
Date:       2014-07-23 7:50:59
Message-ID: CAK4iNRFwx7WHEyxiKuvcVp9HQxCPZhaqu1_+H8C8hTXW8AoOLQ () mail ! gmail ! com

Thanks a lot.

This answer is exactly what I was looking for. Let me try mv
instead of cp.
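
For anyone scripting the "mv instead of cp" approach, here is a minimal
sketch in Python. It is only an illustration, not anything Flume ships:
the function name drop_into_spool and the staging_dir argument are
hypothetical, and it assumes the staging directory sits on the same
filesystem as the spooling directory (otherwise the rename is not a single
atomic step). The slow copy happens outside the spooling directory, and a
timestamp suffix keeps file names unique, as the user guide suggests.

```python
import os
import shutil
import time


def drop_into_spool(src_path, spool_dir, staging_dir):
    """Copy src_path into staging_dir, then rename it into spool_dir.

    staging_dir is a hypothetical scratch directory; it is assumed to be on
    the same filesystem as spool_dir so that the final rename is atomic.
    """
    base = os.path.basename(src_path)
    # Timestamp suffix so a name is never reused in the spooling directory.
    unique_name = "%s.%d" % (base, time.time_ns())

    staged = os.path.join(staging_dir, unique_name)
    shutil.copyfile(src_path, staged)  # the incremental write happens here

    final = os.path.join(spool_dir, unique_name)
    os.rename(staged, final)  # the file appears in spool_dir fully written
    return final
```

On Unix, os.rename raises OSError if the two directories are on different
filesystems, which is arguably the safer outcome here: the script fails
instead of silently falling back to an incremental copy.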


On Wed, Jul 23, 2014 at 1:16 PM, Needham, Guy <Guy.Needham@virginmedia.co.uk
> wrote:

> Hi Saravana,
> 
> Flume records the size and last-modified time of the file when it starts
> reading it and checks them again when it has finished reading. If either
> value has changed between the start and end of the read, Flume will fail
> noisily. This means that you must move a fully written file into the
> directory or it will not be ingested into your workflow. If you're running
> on a Unix system, you shouldn't use cp to drop the file into the directory:
> cp writes the destination incrementally, whereas mv (when source and
> destination are on the same filesystem) renames the file in one go.
> 
> 
> Regards,
> Guy Needham | Data Discovery
> Virgin Media | Enterprise Data, Design & Management
> Bartley Wood Business Park, Hook, Hampshire RG27 9UP
> D 01256 75 3362
> I welcome VSRE emails. Learn more at http://vsre.info/
> 
> 
> ------------------------------
> *From:* SaravanaKumar TR [mailto:saran0081986@gmail.com]
> *Sent:* 23 July 2014 06:38
> *To:* user@flume.apache.org
> *Subject:* Re: how spooling directory source identifies the complete file
> 
> Thanks Ashish, I had already read that section.
> 
> But I couldn't find any explanation in the Flume user guide of how Flume
> distinguishes a file whose copy is still in progress from a fully copied file.
> 
> 
> On Wed, Jul 23, 2014 at 10:59 AM, Ashish <paliwalashish@gmail.com> wrote:
> 
> > This is specified in Flume's User Guide
> > 
> > "Unlike the Exec source, this source is reliable and will not miss
> > data, even if Flume is restarted or killed. In exchange for this
> > reliability, only immutable, uniquely-named files must be dropped into the
> > spooling directory. Flume tries to detect these problem conditions and will
> > fail loudly if they are violated:
> > 
> > 1. If a file is written to after being placed into the spooling
> > directory, Flume will print an error to its log file and stop processing.
> > 2. If a file name is reused at a later time, Flume will print an
> > error to its log file and stop processing.
> > 
> > To avoid the above issues, it may be useful to add a unique identifier
> > (such as a timestamp) to log file names when they are moved into the
> > spooling directory."
> > 
> > 
> > On Wed, Jul 23, 2014 at 10:17 AM, SaravanaKumar TR <
> > saran0081986@gmail.com> wrote:
> > 
> > > Hi Jeff,
> > > 
> > > Thanks for your comments. But what I am really asking is this: suppose
> > > we are copying a 1 GB file into the spool directory and the copy is
> > > still in progress. How does Flume recognize that the complete file has
> > > been copied into the spool directory and is ready for processing?
> > > 
> > > How does Flume make sure it doesn't start processing a partially copied
> > > file?
> > > 
> > > 
> > > On Tue, Jul 22, 2014 at 11:15 PM, Jeff Lord <jlord@cloudera.com> wrote:
> > > 
> > > > I believe the way this works is that flume creates a meta directory to
> > > > track which file is being read.
> > > > In the event of a restart of the agent the entire file will be re-read
> > > > which will create some duplicate events.
> > > > 
> > > > 
> > > > https://github.com/apache/flume/blob/flume-1.5/flume-ng-core/src/main/java/org/apache/flume/client/avro/ReliableSpoolingFileEventReader.java#L474
> > > >  
> > > > 
> > > > On Tue, Jul 22, 2014 at 6:15 AM, SaravanaKumar TR <
> > > > saran0081986@gmail.com> wrote:
> > > > 
> > > > > Hi,
> > > > > 
> > > > > I am planning to use the spooling directory source to move log files
> > > > > to an HDFS sink.
> > > > > 
> > > > > I would like to know how Flume determines whether a file we are
> > > > > moving into the spool directory is complete or only partially
> > > > > transferred.
> > > > > 
> > > > > If a file is large and we have started moving it into the spool
> > > > > directory, how does Flume tell whether the transfer has finished or
> > > > > is still in progress?
> > > > > 
> > > > > Please help me out here.
> > > > > 
> > > > > Thanks,
> > > > > saravana
> > > > > 
> > > > 
> > > > 
> > > 
> > 
> > 
> > --
> > thanks
> > ashish
> > 
> > Blog: http://www.ashishpaliwal.com/blog
> > My Photo Galleries: http://www.pbase.com/ashishpaliwal
> > 
> 
> 
> --------------------------------------------------------------------
> Save Paper - Do you really need to print this e-mail?
> 
> Visit www.virginmedia.com for more information, and more fun.
> 
> This email and any attachments are or may be confidential and legally
> privileged
> and are sent solely for the attention of the addressee(s). If you have
> received this
> email in error, please delete it from your system: its use, disclosure or
> copying is
> unauthorised. Statements and opinions expressed in this email may not
> represent
> those of Virgin Media. Any representations or commitments in this email are
> subject to contract.
> 
> Registered office: Media House, Bartley Wood Business Park, Hook,
> Hampshire, RG27 9UP
> Registered in England and Wales with number 2591237
> 
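
The size and last-modified check Guy describes can be pictured with a short
Python sketch. This is an illustration of the idea only, not Flume's actual
Java code; the helper names snapshot and read_if_unchanged are made up for
this example.

```python
import os


def snapshot(path):
    """Return (size in bytes, modification time in ns) for path."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def read_if_unchanged(path):
    """Read the whole file, failing loudly if it changed during the read."""
    before = snapshot(path)
    with open(path, "rb") as f:
        data = f.read()
    after = snapshot(path)
    if before != after:
        # Someone was still writing to the file, e.g. a cp in progress.
        raise RuntimeError("file changed while being read: %s" % path)
    return data
```

A check like this only detects a file that changed during the read; it
cannot tell in advance whether a file is complete, which is why the file
has to arrive in the directory already fully written.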




