'[jira] [Commented] (AVRO-1720) Add an avro-tool to count records in an avro file'

List:       avro-dev
Subject:    [jira] [Commented] (AVRO-1720) Add an avro-tool to count records in an avro file
From:       "Janosch Woschitz (JIRA)" <jira () apache ! org>
Date:       2015-08-31 10:58:46
Message-ID: JIRA.12858275.1440417270000.209893.1441018726020 () Atlassian ! JIRA


    [ https://issues.apache.org/jira/browse/AVRO-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723296#comment-14723296 ]

Janosch Woschitz commented on AVRO-1720:
----------------------------------------

Maybe it got lost in my last message: I incorporated your proposed changes and
uploaded a new patch (AVRO-1720-with-extended-unittests.patch).

It would be great if you could take a look at it and decide how to proceed.

Thanks!

> Add an avro-tool to count records in an avro file
> -------------------------------------------------
> 
> Key: AVRO-1720
> URL: https://issues.apache.org/jira/browse/AVRO-1720
> Project: Avro
> Issue Type: New Feature
> Components: java
> Reporter: Janosch Woschitz
> Priority: Minor
> Attachments: AVRO-1720-with-extended-unittests.patch, AVRO-1720.patch
> 
> 
> If you're dealing with larger Avro files (>100MB), it would be nice to have a way to
> quickly count the number of records contained in the file. With the current
> state of avro-tools, the only way to achieve this (to my current knowledge) is to
> dump the data to JSON and count the records. For larger files this might
> take a while due to the serialization overhead and because every record needs to be
> looked at. I added a new tool that is optimized for counting records: it does not
> serialize the records and reads only the block count for each block.
> {panel:title=Naive benchmark}
> {noformat}
> # the input file had a size of ~300MB
> $ du -sh sample.avro 
> 323M    sample.avro
> # using the new count tool
> $ time java -jar avro-tools.jar count sample.avro
> 331439
> real    0m4.670s
> user    0m6.167s
> sys     0m0.513s
> # the current way of counting records
> $ time java -jar avro-tools.jar tojson sample.avro | wc
> 331439 54904484 1838231743
> real    0m52.760s
> user    1m42.317s
> sys     0m3.209s
> # the overhead of wc is rather minor
> $ time java -jar avro-tools.jar tojson sample.avro > /dev/null
> real    0m47.834s
> user    0m53.317s
> sys     0m1.194s
> {noformat}
> {panel}
> This tool uses the HDFS API to handle files from any supported filesystem. I added
> the unit tests to the already existing TestDataFileTools since it provided
> convenient utility functions which I could reuse for my test scenarios.
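
For readers who want to see why reading only the block counts is cheap, here is a
minimal sketch of the idea. It is not the code from the attached patch (which is
Java and, per the description, goes through the HDFS API). It is a standard-library
Python illustration that assumes the Avro object container file layout: a header
(magic bytes, a metadata map, a 16-byte sync marker) followed by blocks, each
starting with a record count and a payload size. The names read_long, skip_header
and count_records are made up for this example.

```python
import io

MAGIC = b"Obj\x01"   # first four bytes of an Avro object container file
SYNC_SIZE = 16       # length of the sync marker that ends the header and each block


def read_long(stream, eof_ok=False):
    """Decode one Avro long (zigzag-encoded varint) from a binary stream.

    Returns None if eof_ok is set and the stream is already at end of file.
    """
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)
        if not byte:
            if eof_ok and shift == 0:
                return None
            raise EOFError("unexpected end of file inside a long")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)  # undo zigzag encoding


def skip_header(stream):
    """Skip the file header and return the file's sync marker."""
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not an Avro object container file")
    # The metadata is an Avro map: groups of key/value pairs, ended by a 0 count.
    while True:
        pairs = read_long(stream)
        if pairs == 0:
            break
        if pairs < 0:
            read_long(stream)  # a negative count is followed by a byte size
            pairs = -pairs
        for _ in range(pairs):
            stream.seek(read_long(stream), io.SEEK_CUR)  # key (string)
            stream.seek(read_long(stream), io.SEEK_CUR)  # value (bytes)
    sync = stream.read(SYNC_SIZE)
    if len(sync) != SYNC_SIZE:
        raise ValueError("truncated header")
    return sync


def count_records(stream):
    """Sum the per-block record counts without decoding any record."""
    sync = skip_header(stream)
    total = 0
    while True:
        count = read_long(stream, eof_ok=True)
        if count is None:  # clean end of file
            return total
        size = read_long(stream)
        stream.seek(size, io.SEEK_CUR)  # skip the (possibly compressed) payload
        if stream.read(SYNC_SIZE) != sync:
            raise ValueError("sync marker mismatch, file may be corrupt")
        total += count
```

To try it, open a file in binary mode and pass the handle to count_records, for
example open("sample.avro", "rb"). Because the payload of each block is skipped
rather than decoded, the sketch should not care which codec or schema the file uses.
It does need a seekable stream.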



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

