List:       avro-user
Subject:    Re: Schema Registry?
From:       Eric Wasserman <ewasserman@247-inc.com>
Date:       2013-08-22 1:41:47
Message-ID: 975796BD-D717-4A65-98CB-D96E77111C71@247-inc.com

Yes, we have a Kafka event consumer that creates the files in HDFS. There are
other non-Hadoop consumers as well.

On Aug 21, 2013, at 2:23 PM, "Mark" <static.void.dev@gmail.com> wrote:

> Some final questions.
> 
> Since there is no need for the schema in each Kafka event, do you just output the
> message without the container file (file header, metadata, sync markers)? If so,
> how do you get this working with the Kafka Hadoop consumers? Doing it this way,
> does it require you to write your own consumer to write to Hadoop?
> Thanks
> 
> On Aug 20, 2013, at 11:01 AM, Eric Wasserman <ewasserman@247-inc.com> wrote:
> 
> > You may want to check out this Avro feature request:
> > https://issues.apache.org/jira/browse/AVRO-1124, which has a lot of nice
> > motivation and usage patterns. Unfortunately, it's not yet resolved.
> > There are really two broad use cases. 
> > 
> > 1) The data are "small" compared to the schema (perhaps because it's a
> >    collection or stream of records encoded by different schemas).
> > 2) The data are "big" compared to the schema (very big records, or lots of
> >    records that share a schema).
> > Case (1) is often a candidate for a schema registry. Case (2) not as much.
> > 
> > Examples from my own usage:
> > 
> > For Kafka we include an MD5 digest of the writer's schema with each Message. It
> > is serialized as a concatenation of the fixed-length MD5 and the binary
> > Avro-encoded data. To decode, we read off the MD5, look up the schema, and use it
> > to decode the remainder of the Message. [You could also segregate data written
> > with different schemas into different Kafka topics. By knowing which topic a
> > message is under, you can then arrange a way to look up the writer's schema. That
> > lets you avoid even the cost of including the MD5 in the Messages.]
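> > 
> > For concreteness, here is a rough sketch of the encode side in Java. It is a
> > sketch only: one plausible choice (an assumption, not necessarily what we do)
> > is to compute the MD5 over the schema's JSON text, and the schema/record
> > construction and error handling are elided:
> > 
> >   import java.io.ByteArrayOutputStream;
> >   import java.nio.charset.Charset;
> >   import java.security.MessageDigest;
> >   import org.apache.avro.Schema;
> >   import org.apache.avro.generic.GenericDatumWriter;
> >   import org.apache.avro.generic.GenericRecord;
> >   import org.apache.avro.io.BinaryEncoder;
> >   import org.apache.avro.io.EncoderFactory;
> > 
> >   public class Md5TaggedEncoder {
> >     // Message layout: [16-byte MD5 of writer's schema][Avro binary body]
> >     static byte[] encode(Schema schema, GenericRecord record) throws Exception {
> >       byte[] fingerprint = MessageDigest.getInstance("MD5")
> >           .digest(schema.toString().getBytes(Charset.forName("UTF-8")));
> >       ByteArrayOutputStream out = new ByteArrayOutputStream();
> >       out.write(fingerprint);                  // fixed-length 16-byte prefix
> >       BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
> >       new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
> >       encoder.flush();                         // push buffered bytes into `out`
> >       return out.toByteArray();                // payload for one Kafka Message
> >     }
> >   }
> > 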
> > In either case, consumer code needs to look up the full schema from a "registry"
> > in order to actually decode the Avro-encoded data. The registry serves the full
> > schema that corresponds to the specified MD5 digest.
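> > 
> > The decode side is the mirror image. The SchemaRegistryClient below is a
> > hypothetical stand-in for whatever service maps an MD5 digest to the full
> > writer's schema:
> > 
> >   import java.util.Arrays;
> >   import org.apache.avro.Schema;
> >   import org.apache.avro.generic.GenericDatumReader;
> >   import org.apache.avro.generic.GenericRecord;
> >   import org.apache.avro.io.BinaryDecoder;
> >   import org.apache.avro.io.DecoderFactory;
> > 
> >   public class Md5TaggedDecoder {
> >     static GenericRecord decode(byte[] msg, SchemaRegistryClient registry)
> >         throws Exception {
> >       // The first 16 bytes are the fingerprint; the rest is the Avro body.
> >       byte[] fingerprint = Arrays.copyOfRange(msg, 0, 16);
> >       Schema writerSchema = registry.lookup(fingerprint); // hypothetical call
> >       BinaryDecoder decoder = DecoderFactory.get()
> >           .binaryDecoder(msg, 16, msg.length - 16, null);
> >       return new GenericDatumReader<GenericRecord>(writerSchema).read(null, decoder);
> >     }
> >   }
> > 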
> > We use a similar technique for storing MD5-tagged Avro data in "columns" of
> > Cassandra, and so on.
> > Case (2) is pretty well handled by just embedding the full schema itself.
> > 
> > For example, for Hadoop you can just use Avro data files, which include the
> > actual schema in a header. All the records in the file then adhere to that same
> > schema. In this case, using a registry to get the writer's schema is not
> > necessary.
> > Note: As described in the feature request linked above, some people use a schema
> > registry as a way of coordinating schema evolution, rather than just as a way of
> > making schema access "economical".
> > 
> > 
> > On Aug 20, 2013, at 9:19 AM, Mark wrote:
> > 
> > > Can someone break down how message serialization would work with Avro and a
> > > schema registry? We are planning to use Avro with Kafka, and I've read that
> > > instead of adding a schema to every single event, it would be wise to add some
> > > sort of fingerprint to each message to identify which schema should be used.
> > > What I'm having trouble understanding is: how do we read the fingerprint
> > > without a schema? Don't we need the schema to deserialize? The same question
> > > goes for working with Hadoop: how does the input format know which schema to
> > > use?
> > > Thanks
> > 
> > 
> 
> 

