List:       hadoop-dev
Subject:    [jira] Updated: (HADOOP-1883) Adding versioning to Record I/O
From:       "Vivek Ratan (JIRA)" <jira () apache ! org>
Date:       2007-12-31 11:46:43
Message-ID: 32490593.1199101603029.JavaMail.jira () brutus

     [ https://issues.apache.org/jira/browse/HADOOP-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Ratan updated HADOOP-1883:
--------------------------------

    Status: Patch Available  (was: Open)

> Adding versioning to Record I/O
> -------------------------------
> 
> Key: HADOOP-1883
> URL: https://issues.apache.org/jira/browse/HADOOP-1883
> Project: Hadoop
> Issue Type: New Feature
> Reporter: Vivek Ratan
> Assignee: Vivek Ratan
> Attachments: 1883_patch01, 1883_patch02, 1883_patch03, example.zip, example.zip
> 
> 
> There is a need to add versioning support to Record I/O. Users frequently update
> DDL files, usually by adding/removing fields, but do not want to change the name of
> the data structure. They would like older & newer deserializers to read as much
> data as possible. For example, suppose Record I/O is used to serialize/deserialize
> log records, each of which contains a message and a timestamp. An initial data
> definition could be as follows:
> {code}
> class MyLogRecord {
>     ustring msg;
>     long timestamp;
> }
> {code}
> Record I/O creates a class, _MyLogRecord_, which represents a log record and can
> serialize/deserialize itself. Now, suppose newer log records additionally contain a
> severity level. A user would want to update the definition for a log record but use
> the same class name. The new definition would be:
> {code}
> class MyLogRecord {
>     ustring msg;
>     long timestamp;
>     int severity;
> }
> {code}
> Users would want a new deserializer to read old log records (and perhaps use a
> default value for the severity field), and an old deserializer to read newer log
> records (and skip the severity field). This requires some concept of versioning in
> Record I/O, or rather, the additional ability to read/write the type information of
> a record. The following is a proposal to do this.
> 
> Every Record I/O Record will have type information which represents how the record
> is structured (what fields it has, what types, etc.). This type information,
> represented by the class _RecordTypeInfo_, is itself serializable/deserializable.
> Every Record supports a method _getRecordTypeInfo()_, which returns a
> _RecordTypeInfo_ object. Users are expected to serialize this type information (by
> calling _RecordTypeInfo.serialize()_) in an appropriate fashion (in a separate
> file, for example, or at the beginning of a file). Using the same DDL as above,
> here's how we could serialize log records:
> {code}
> FileOutputStream fOut = new FileOutputStream("data.log");
> CsvRecordOutput csvOut = new CsvRecordOutput(fOut);
> ...
> // get the type information for MyLogRecord
> RecordTypeInfo typeInfo = MyLogRecord.getRecordTypeInfo();
> // ask it to write itself out
> typeInfo.serialize(csvOut);
> ...
> // now, serialize a bunch of records
> while (...) {
>     MyLogRecord log = new MyLogRecord();
>     // fill up the MyLogRecord object
>     ...
>     // serialize
>     log.serialize(csvOut);
> }
> {code}
> In this example, the type information of a Record is serialized first, followed by
> the contents of various records, all into the same file.
> 
> Every Record also supports a method that allows a user to set a filter for
> deserializing. A method _setRTIFilter()_ takes a _RecordTypeInfo_ object as a
> parameter. This filter represents the type information of the data that is being
> deserialized. When deserializing, the Record uses this filter (if one is set) to
> figure out what to read. Continuing with our example, here's how we could
> deserialize records:
> {code}
> FileInputStream fIn = new FileInputStream("data.log");
> // we know the record was written in CSV format
> CsvRecordInput csvIn = new CsvRecordInput(fIn);
> ...
> // we know the type info is written at the beginning of the file; read it
> RecordTypeInfo typeInfoFilter = new RecordTypeInfo();
> // deserialize it
> typeInfoFilter.deserialize(csvIn);
> // let MyLogRecord know what to expect
> MyLogRecord.setRTIFilter(typeInfoFilter);
> // deserialize each record
> while (there is data in file) {
>     MyLogRecord log = new MyLogRecord();
>     log.read(csvIn);
>     ...
> }
> {code}
> The filter is optional. If not provided, the deserializer expects data to be in the
> same format as it would serialize. (Note that a filter can also be provided for
> serializing, forcing the serializer to write information in the format of the
> filter, but there is no use case for this functionality yet.)
> 
> What goes into the type information for a record? The type information for each
> field in a Record is made up of:
> 1. a unique field ID, which is the field name
> 2. a type ID, which denotes the type of the field (int, string, map, etc.)
> The type information for a composite type contains the type information for each of
> its fields.
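> 
> As an illustration, the per-field type information could be modeled along these
> lines (a sketch only; the names _FieldTypeInfo_, _fieldId_, _typeId_, and
> _nestedTypes_ are illustrative, not a committed API):
> {code}
> // Illustrative sketch: one entry of a record's type information.
> class FieldTypeInfo {
>     String fieldId;                  // the field name doubles as the unique field ID
>     byte typeId;                     // denotes the field's type: int, ustring, map, ...
>     List<FieldTypeInfo> nestedTypes; // for composite types (structs, vectors, maps),
>                                      // the type information of the contained fields
> }
> {code}
> A _RecordTypeInfo_ would then essentially be the list of _FieldTypeInfo_ entries
> for a record's fields, and serializing it amounts to writing out that list.
> 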
> This approach is somewhat similar to the one taken by [Facebook's
> Thrift|http://developers.facebook.com/thrift/], as well as by Google's Sawzall. The
> main difference is that we use field names as field IDs, whereas Thrift and Sawzall
> use user-defined field numbers. While field names take more space, they have the
> big advantage that no change is needed to support existing DDLs.
> 
> When deserializing, a Record looks at the filter and compares it with its own set
> of {field name, field type} tuples. If there is a field in the data that it doesn't
> know about, it skips it (it knows how many bytes to skip, based on the filter). If
> the deserialized data does not contain values for some fields, the Record gives
> them default values. Additionally, we could allow users to optionally specify
> default values in the DDL. The location of a field in a structure does not matter;
> this lets us support reordering of fields. Note that there is no change required to
> the DDL syntax, and only very minimal changes to client code (clients just need to
> read/write type information in addition to record data).
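> 
> As a concrete example (a sketch that reuses the classes and calls from the code
> above), an application generated from the old two-field DDL could read data written
> with the new three-field DDL as follows:
> {code}
> // Data was written by the new DDL (msg, timestamp, severity); we read it
> // with the old generated MyLogRecord class (msg and timestamp only).
> FileInputStream fIn = new FileInputStream("newdata.log");
> CsvRecordInput csvIn = new CsvRecordInput(fIn);
> 
> // The type information at the head of the file describes the new layout,
> // including the 'severity' field the old class knows nothing about.
> RecordTypeInfo writtenTypeInfo = new RecordTypeInfo();
> writtenTypeInfo.deserialize(csvIn);
> 
> // With the filter set, the old deserializer knows to skip 'severity'.
> MyLogRecord.setRTIFilter(writtenTypeInfo);
> 
> MyLogRecord log = new MyLogRecord();
> log.read(csvIn);  // reads msg and timestamp, skips the severity bytes
> {code}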
> 
> This scheme gives us an additional powerful feature: we can build a generic
> serializer/deserializer, so that users can read all kinds of data without having
> access to the original DDL or the original stubs. As long as you know where the
> type information of a record is serialized, you can read all kinds of data. One can
> also build a simple UI that displays the structure of the data serialized in any
> file. This is very useful for handling data across lots of versions.
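> 
> Such a generic structure dump could look roughly like this (purely hypothetical: it
> reuses the illustrative _FieldTypeInfo_ sketch from above and assumes the field
> entries are accessible, which the final API may expose differently):
> {code}
> // Hypothetical: print the structure of any serialized record, given only the
> // type information read from the stream -- no generated stubs required.
> void printStructure(List<FieldTypeInfo> fields, int indent) {
>     for (FieldTypeInfo f : fields) {
>         for (int i = 0; i < indent; i++) {
>             System.out.print(' ');
>         }
>         System.out.println(f.fieldId + " : " + f.typeId);  // e.g. "severity : 3"
>         if (f.nestedTypes != null) {
>             printStructure(f.nestedTypes, indent + 2);      // recurse into composites
>         }
>     }
> }
> {code}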

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

