
List:       avro-dev
Subject:    [jira] [Created] (AVRO-2147) Proto to Avro serialization is unnecessarily slow due to repeated schem
From:       "Tobi Vollebregt (JIRA)" <jira () apache ! org>
Date:       2018-02-17 20:42:00
Message-ID: JIRA.13139267.1518900065000.223595.1518900120047 () Atlassian ! JIRA

Tobi Vollebregt created AVRO-2147:
-------------------------------------

             Summary: Proto to Avro serialization is unnecessarily slow due to repeated schema creation
                 Key: AVRO-2147
                 URL: https://issues.apache.org/jira/browse/AVRO-2147
             Project: Avro
          Issue Type: Improvement
          Components: java
    Affects Versions: 1.8.2, 1.8.1
            Reporter: Tobi Vollebregt


Hi,

I discovered that Protobuf-to-Avro serialization is unnecessarily slow in
certain cases due to repeated schema creation. Specifically, the slowness shows
up when serializing protocol buffer messages that contain nested protocol buffer
messages that in turn contain enums with many possible values. Profiling showed
that this is because the {{Schema}} objects for the nested message/enum are not
cached in this case.

An example that reproduces this is to add the following to {{test.proto}}:

{{message Foo {}}
{{   ...}}
{{   optional MessageWithLargeEnum bar = 21;}}
{{}}}
{{message MessageWithLargeEnum {}}
{{   optional LargeEnum enum = 1;}}
{{}}}
{{enum LargeEnum {}}
{{   AA = 1;}}
{{   AB = 2;}}
{{   AC = 3;}}
{{   ...}}
{{   ZZ = 676;}}
{{}}}

Then, a test like the following will exhibit the slow behavior:

{{@Test public void perf() throws Exception {}}
{{   Foo.Builder builder = Foo.newBuilder();}}
{{   builder.setInt32(0);}}
{{   builder.setInt64(2);}}
{{   builder.setUint32(3);}}
{{   builder.setUint64(4);}}
{{   builder.setSint32(5);}}
{{   builder.setSint64(6);}}
{{   builder.setFixed32(7);}}
{{   builder.setFixed64(8);}}
{{   builder.setSfixed32(9);}}
{{   builder.setSfixed64(10);}}
{{   builder.setFloat(1.0F);}}
{{   builder.setDouble(2.0);}}
{{   builder.setBool(true);}}
{{   builder.setString("foo");}}
{{   builder.setBytes(ByteString.copyFromUtf8("bar"));}}
{{   builder.setEnum(org.apache.avro.protobuf.Test.A.X);}}
{{   builder.addIntArray(27);}}
{{   builder.addSyms(org.apache.avro.protobuf.Test.A.Y);}}
{{   builder.setBar(MessageWithLargeEnum.newBuilder().setEnum(LargeEnum.AA));}}

{{   Foo objToConvert = builder.build();}}

{{   Schema schema = ProtobufData.get().getSchema(Foo.class);}}
{{   ByteArrayOutputStream bao = new ByteArrayOutputStream();}}
{{   Encoder e = EncoderFactory.get().binaryEncoder(bao, null);}}
{{   ProtobufDatumWriter<Foo> w = new ProtobufDatumWriter<Foo>(schema);}}
{{   GenericDatumReader gdr = new GenericDatumReader(schema, schema);}}
{{   BinaryDecoder d = null;}}

{{   long startTime = System.nanoTime();}}
{{   for (int i = 0; i < 1000000; ++i) {}}
{{      bao.reset();}}
{{      w.write(objToConvert, e);}}
{{      e.flush();}}
{{      d = DecoderFactory.get().binaryDecoder(bao.toByteArray(), d);}}
{{      gdr.read(null, d);}}
{{   }}}
{{   long endTime = System.nanoTime();}}
{{   System.out.println("Elapsed: " + (endTime - startTime) / 1000000 + " ms");}}
{{}}}

I will attach a patch that optimizes this.
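
As a rough illustration of why caching matters here, the following is a minimal
sketch that uses only the JDK. It is not the attached patch and does not use
Avro's API: the class {{SchemaCacheSketch}}, the methods {{buildSchema}} and
{{getSchema}}, and the use of a plain string as the "schema" are all
hypothetical stand-ins. It only shows that memoizing the result by type name
turns one expensive build per record into one build in total.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names exist in Avro.
class SchemaCacheSketch {
    // Counts how often the expensive build actually runs.
    static final AtomicInteger builds = new AtomicInteger();

    // Cache keyed by the type's full name (stand-in for a protobuf descriptor).
    static final Map<String, String> cache = new ConcurrentHashMap<>();

    // Stand-in for constructing a schema for an enum with 676 symbols (AA..ZZ).
    static String buildSchema(String fullName) {
        builds.incrementAndGet();
        StringBuilder sb = new StringBuilder(fullName).append('[');
        for (char a = 'A'; a <= 'Z'; a++) {
            for (char b = 'A'; b <= 'Z'; b++) {
                sb.append(a).append(b).append(',');
            }
        }
        return sb.append(']').toString();
    }

    // Builds the schema on first use; later calls return the cached object.
    static String getSchema(String fullName) {
        return cache.computeIfAbsent(fullName, SchemaCacheSketch::buildSchema);
    }

    public static void main(String[] args) {
        // Simulates looking up the nested enum's schema once per written record.
        for (int i = 0; i < 1000; i++) {
            getSchema("LargeEnum");
        }
        System.out.println("builds = " + builds.get()); // prints "builds = 1"
    }
}
```

Without the cache, the loop above would call {{buildSchema}} 1000 times, which
mirrors the per-record cost described in this report.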



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

