'Re: Mapreduce Strings from reader, when Avro is clearly Utf8'


List:       avro-user
Subject:    Re: Mapreduce Strings from reader, when Avro is clearly Utf8
From:       Marshall Bockrath-Vandegrift <llasram () gmail ! com>
Date:       2013-08-27 21:11:36
Message-ID: 878uzmsut3.fsf () zeno ! atl ! damballa

Anna Lahoud <annalahoud@gmail.com> writes:

> I am experiencing a problem and I found that another user wrote in
> about this same issue in March 2013 but there were no replies to his
> question. I am really hoping that there is someone who can explain
> this or offer suggestions. I cut and paste his message in since I
> could only find it in an archive.
>
> I have Avro files that clearly contain Utf8 and if I run
> non-mapreduce, I get Utf8 out. However, with the same files, I get
> String objects back from the mapper. Help!?!?!

There are some confusing differences between the "data models" (as they
are now called) used by the `mapred` and `mapreduce` APIs.

The Generic{Data,Datum{Reader,Writer}} and Specific implementations
generate `Utf8` instances by default.  The Reflect implementation, as
far as I can tell, generates only `String` instances.

In 1.7.4 and earlier: The `mapred` API defaults to using the Specific
implementations (producing `Utf8`s), but may be configured to use the
Reflect implementations via the `...mapred.AvroJob.setReflect()` method.
The `mapreduce` API uses the Reflect implementations and cannot be
configured – and thus always produces `String` instances.  So no dice.
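
If you are stuck on 1.7.4 or earlier, one possible workaround is to
write the mapper logic against `CharSequence` rather than against either
concrete type, since Avro's `Utf8` implements `CharSequence` and a `Utf8`
never compares equal to a `String`.  The sketch below is only an
illustration using the standard library; the `TextValues` class and its
method names are hypothetical and not part of Avro.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Avro): normalizes text values so the
// calling code does not care whether it was handed a String or a Utf8.
final class TextValues {
    private TextValues() {}

    // Converts any CharSequence (String, Utf8, StringBuilder, ...) to a String.
    static String asString(CharSequence value) {
        return value == null ? null : value.toString();
    }

    // Copies a map whose keys may be any CharSequence into a map keyed by
    // String, so that lookups with String keys behave as expected.
    static <V> Map<String, V> withStringKeys(Map<? extends CharSequence, V> source) {
        Map<String, V> result = new HashMap<>();
        for (Map.Entry<? extends CharSequence, V> entry : source.entrySet()) {
            result.put(asString(entry.getKey()), entry.getValue());
        }
        return result;
    }
}
```

Inside a mapper you would then call something like
`TextValues.asString(record.get("name"))`, where the field name is
made up for the example, and the result is a `String` under either
data model.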

In 1.7.5 (and I hope later): Both the APIs allow you to specify the data
model as a sub-class of `GenericData`.  For example:

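    import org.apache.avro.generic.GenericData;
    import org.apache.avro.mapreduce.AvroJob;
    ....
    // `job` is the org.apache.hadoop.mapreduce.Job being configured
    AvroJob.setDataModelClass(job, GenericData.class);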

Setting the job's data model this way should yield the `Utf8` instances
you're hoping for.

HTH,

-Marshall
