
List:       avro-user
Subject:    Re: Reading large AVRO files
From:       Lewis John Mcgibbney <lewis.mcgibbney () gmail ! com>
Date:       2014-05-24 17:39:41
Message-ID: CAGaRif1sPgdJzJnNg+5Qb0hQ05CKJnd0L8dKsF8TxZtHxs=hGA () mail ! gmail ! com

Thank you very much Mike.
I am looking at the Avro C API right now and this is extremely helpful.
Lewis


On Sat, May 24, 2014 at 6:00 AM, Mike Stanley <mike@mikestanley.org> wrote:

> While I haven't benchmarked Java performance, I have looked closely at Ruby
> vs. C with regard to reading large Avro files. With C, I have processed
> ~900 MB files with 25+M rows in ~42 s, and I routinely process 270 MB / 7.5M
> record files in 15 s on average. These numbers were observed on a 2012
> MacBook Pro (exact specs elude me at the moment). Not scientific, but it
> may help give you a ballpark of what is possible.
>
> I am using Java. I did play with the size of the buffer reader, but I
> found that the default size of 8K gave me the best performance.
> thanks, Yael
>
>
> On Fri, May 23, 2014 at 4:14 AM, Martin Kleppmann <mkleppmann@linkedin.com
> > wrote:
>
>> Which language are you using? AFAIK, most language implementations of
>> Avro only have an interface for reading one record at a time, but they do
>> buffer the input file internally, so there shouldn't be a performance
>> disadvantage to reading one record at a time.
>>
>> If you have an example that is particularly slow, you could be a great
>> help to the Avro community by getting out a profiler and finding the
>> bottleneck :)
>>
>> Thanks,
>> Martin
>>
>> On 14 May 2014, at 20:13, yael aharon <yael.aharon.m@gmail.com> wrote:
>> > I am building a Java utility that reads large Avro files and does some
>> > processing. These files have millions of records in them and it can take
>> > minutes to read them using DataFileReader.next().
>> > Is there a way to read more than one record at a time?
>> > thanks, Yael
>>
>>
>


-- 
*Lewis*
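
For a rough point of comparison, the figures Mike quotes for the C reader can be turned into per-second rates. The sketch below only does that arithmetic; the class and method names (ReadThroughput, perSecond, describe) are made up for illustration and are not part of any Avro API. It treats "25+M rows" as exactly 25 million, so the first rate is a lower bound, and it says nothing about what the Java reader should achieve.

```java
import java.util.Locale;

/** Converts the (records, megabytes, seconds) figures quoted in the thread into rates. */
final class ReadThroughput {

    /** Returns units processed per second; seconds must be positive. */
    static double perSecond(double units, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return units / seconds;
    }

    /** Formats one measurement as records/s and MB/s. */
    static String describe(String label, double records, double megabytes, double seconds) {
        return String.format(Locale.ROOT, "%s: %.0f records/s, %.1f MB/s",
                label, perSecond(records, seconds), perSecond(megabytes, seconds));
    }

    public static void main(String[] args) {
        // ~900 MB, 25+M rows, ~42 s (row count taken as exactly 25M: an assumption)
        System.out.println(describe("large file", 25_000_000, 900, 42));
        // 270 MB, 7.5M records, 15 s on average
        System.out.println(describe("typical file", 7_500_000, 270, 15));
    }
}
```

Both measurements work out to roughly half a million records per second or more, at around 18 to 21 MB/s. A reader that needs minutes for a few million records is well below that ballpark, which is the kind of gap a profiler run, as Martin suggests, could help explain.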



