List: avro-dev
Subject: [jira] [Created] (AVRO-946) GenericData.resolveUnion() performance improvement
From: "Hernan Otero (Created) (JIRA)" <jira () apache ! org>
Date: 2011-10-27 21:30:32
Message-ID: 195985141.27168.1319751032275.JavaMail.tomcat () hel ! zones ! apache ! org
GenericData.resolveUnion() performance improvement
--------------------------------------------------
Key: AVRO-946
URL: https://issues.apache.org/jira/browse/AVRO-946
Project: Avro
Issue Type: Improvement
Components: java
Affects Versions: 1.6.0
Reporter: Hernan Otero
Due to the sequential nature of today's implementation of
GenericData.resolveUnion() (used when serializing an object):
{code}
public int resolveUnion(Schema union, Object datum) {
  int i = 0;
  // Linear scan: test each branch of the union in declaration order.
  for (Schema type : union.getTypes()) {
    if (instanceOf(type, datum))
      return i;
    i++;
  }
  throw new UnresolvedUnionException(union, datum);
}
{code}
it showed up when we were doing some serialization performance analysis.
A simple optimization is to keep a map within the UnionSchema object (in
fact, this could be a perfect hash map, since the potential values in the
map are known in advance). The optimization is most noticeable when a
union within the schema contains many types (more than 40 in some of our
cases). In this scenario, we observed a 25% improvement by using an
identity hash map.
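As a rough illustration of the idea (not the actual patch), the sketch
below contrasts the sequential scan with a lookup in an IdentityHashMap
built once per union. UnionIndexSketch, SimpleSchema and SimpleRecord are
hypothetical stand-ins so that the example runs without Avro. It assumes
each datum can report its own schema instance, and it simplifies the
instanceOf() check to an identity comparison.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class UnionIndexSketch {
  /** Hypothetical stand-in for an Avro schema; only object identity matters here. */
  public static final class SimpleSchema {
    final String name;
    public SimpleSchema(String name) { this.name = name; }
  }

  /** Hypothetical stand-in for a datum that knows its own schema. */
  public static final class SimpleRecord {
    final SimpleSchema schema;
    public SimpleRecord(SimpleSchema schema) { this.schema = schema; }
  }

  private final List<SimpleSchema> types;
  private final Map<SimpleSchema, Integer> indexBySchema = new IdentityHashMap<>();

  public UnionIndexSketch(List<SimpleSchema> types) {
    this.types = types;
    // Build the index once, when the union is created.
    for (int i = 0; i < types.size(); i++) {
      indexBySchema.put(types.get(i), i);
    }
  }

  /** Sequential scan: cost grows with the position of the matching branch. */
  public int resolveSequential(SimpleRecord datum) {
    int i = 0;
    for (SimpleSchema type : types) {
      if (type == datum.schema) return i;
      i++;
    }
    return -1;
  }

  /** Map lookup: one hash probe regardless of the number of branches. */
  public int resolveIndexed(SimpleRecord datum) {
    Integer index = indexBySchema.get(datum.schema);
    return index == null ? -1 : index;
  }

  public static void main(String[] args) {
    List<SimpleSchema> types = new ArrayList<>();
    for (int i = 0; i < 40; i++) types.add(new SimpleSchema("Type" + i));
    UnionIndexSketch union = new UnionIndexSketch(types);
    SimpleRecord datum = new SimpleRecord(types.get(39)); // worst case for the scan
    System.out.println(union.resolveSequential(datum)); // 39
    System.out.println(union.resolveIndexed(datum));    // 39
  }
}
```

Both methods return the same index; only the amount of work per call
differs, which is why the gain grows with the number of branches.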
Even though using an identity map provides a significant boost, we have
observed a further improvement (and removed some of the restrictions of
relying on object identity) by using a perfect hash map on the schema
names (an extra 15% on top of that in some cases). Unfortunately, that
implementation is not something we can contribute at this point, but we
thought it would be a good idea to let users provide alternative
implementations of the indexing behavior, for example by adding the
following static method to Schema:
{code}
public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
{
  unionIndexCacheFactory = factory;
}
{code}
This is what the interface and the identity hash map-based implementation
would look like:
{code}
/**
 * A factory interface for creating UnionIndexCache instances.
 */
public static interface UnionIndexCacheFactory
{
  UnionIndexCache createUnionIndexCache(List<Schema> types);

  /**
   * Used for caching schema indices within a union.
   */
  public static interface UnionIndexCache
  {
    void setTypeIndex(Schema schema, int index);

    /** Returns the index of the given schema, or -1 if it is not cached. */
    int getTypeIndex(Schema schema);
  }
}

private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory {
  @Override
  public UnionIndexCache createUnionIndexCache(List<Schema> types)
  {
    return new UnionIndexCache()
    {
      // Keyed on object identity, so only the exact Schema instances
      // registered through setTypeIndex() are found.
      private final IdentityHashMap<Schema, Integer> schemaToIndex =
          new IdentityHashMap<Schema, Integer>();

      @Override
      public void setTypeIndex(Schema schema, int index)
      {
        schemaToIndex.put(schema, index);
      }

      @Override
      public int getTypeIndex(Schema schema)
      {
        Integer index = schemaToIndex.get(schema);
        return index == null ? -1 : index;
      }
    };
  }
}
{code}
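To show how an alternative implementation could plug into such a factory,
here is a minimal sketch keyed on schema names rather than object
identity. It uses a plain HashMap, not the perfect hash map mentioned
above (which is not part of this issue). NameIndexCacheSketch,
NamedSchema and NameMapUnionIndexCacheFactory are hypothetical names, and
the two interfaces are copied from the proposal with NamedSchema in place
of Schema so that the example compiles without Avro.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NameIndexCacheSketch {
  /** Hypothetical stand-in for Schema; only the full name is used. */
  public static final class NamedSchema {
    private final String fullName;
    public NamedSchema(String fullName) { this.fullName = fullName; }
    public String getFullName() { return fullName; }
  }

  /** Mirrors the proposed UnionIndexCache, with NamedSchema in place of Schema. */
  public interface UnionIndexCache {
    void setTypeIndex(NamedSchema schema, int index);
    int getTypeIndex(NamedSchema schema);
  }

  /** Mirrors the proposed UnionIndexCacheFactory. */
  public interface UnionIndexCacheFactory {
    UnionIndexCache createUnionIndexCache(List<NamedSchema> types);
  }

  /** Assumed alternative: index branches by name instead of identity. */
  public static final class NameMapUnionIndexCacheFactory implements UnionIndexCacheFactory {
    @Override
    public UnionIndexCache createUnionIndexCache(List<NamedSchema> types) {
      final Map<String, Integer> nameToIndex = new HashMap<>();
      return new UnionIndexCache() {
        @Override
        public void setTypeIndex(NamedSchema schema, int index) {
          nameToIndex.put(schema.getFullName(), index);
        }

        @Override
        public int getTypeIndex(NamedSchema schema) {
          Integer index = nameToIndex.get(schema.getFullName());
          return index == null ? -1 : index;
        }
      };
    }
  }

  public static void main(String[] args) {
    List<NamedSchema> types =
        Arrays.asList(new NamedSchema("example.A"), new NamedSchema("example.B"));
    UnionIndexCache cache = new NameMapUnionIndexCacheFactory().createUnionIndexCache(types);
    for (int i = 0; i < types.size(); i++) cache.setTypeIndex(types.get(i), i);
    // A distinct instance with the same name still resolves.
    System.out.println(cache.getTypeIndex(new NamedSchema("example.B"))); // 1
    System.out.println(cache.getTypeIndex(new NamedSchema("example.C"))); // -1
  }
}
```

Keying on the name means an equal schema parsed twice resolves to the
same index, which the identity-based cache cannot do.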
I will attach a patch later today or early tomorrow.
Thanks in advance,
Hernan Otero
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira