List: avro-dev
Subject: [jira] [Created] (AVRO-2147) Proto to Avro serialization is unnecessarily slow due to repeated schem
From: "Tobi Vollebregt (JIRA)" <jira () apache ! org>
Date: 2018-02-17 20:42:00
Message-ID: JIRA.13139267.1518900065000.223595.1518900120047 () Atlassian ! JIRA
Tobi Vollebregt created AVRO-2147:
-------------------------------------
Summary: Proto to Avro serialization is unnecessarily slow due to repeated schema creation
Key: AVRO-2147
URL: https://issues.apache.org/jira/browse/AVRO-2147
Project: Avro
Issue Type: Improvement
Components: java
Affects Versions: 1.8.2, 1.8.1
Reporter: Tobi Vollebregt
Hi,
I discovered that proto to Avro serialization is unnecessarily slow in certain cases
due to repeated schema creation. Specifically, the slowness shows up when serializing
protocol buffer messages that contain nested protocol buffer messages, which in turn
contain enums with many possible values. Profiling showed that this happens because the
{{Schema}} objects for the nested message/enum are not cached in this case.
To reproduce this, add the following to {{test.proto}}:
{{message Foo {}}
{{ ...}}
{{ optional MessageWithLargeEnum bar = 21;}}
{{}}}
{{message MessageWithLargeEnum {}}
{{ optional LargeEnum enum = 1;}}
{{}}}
{{enum LargeEnum {}}
{{ AA = 1;}}
{{ AB = 2;}}
{{ AC = 3;}}
{{ ...}}
{{ ZZ = 676;}}
{{}}}
Then, a test like the following will exhibit the slow behavior:
{{@Test public void perf() throws Exception {}}
{{ Foo.Builder builder = Foo.newBuilder();}}
{{ builder.setInt32(0);}}
{{ builder.setInt64(2);}}
{{ builder.setUint32(3);}}
{{ builder.setUint64(4);}}
{{ builder.setSint32(5);}}
{{ builder.setSint64(6);}}
{{ builder.setFixed32(7);}}
{{ builder.setFixed64(8);}}
{{ builder.setSfixed32(9);}}
{{ builder.setSfixed64(10);}}
{{ builder.setFloat(1.0F);}}
{{ builder.setDouble(2.0);}}
{{ builder.setBool(true);}}
{{ builder.setString("foo");}}
{{ builder.setBytes(ByteString.copyFromUtf8("bar"));}}
{{ builder.setEnum(org.apache.avro.protobuf.Test.A.X);}}
{{ builder.addIntArray(27);}}
{{ builder.addSyms(org.apache.avro.protobuf.Test.A.Y);}}
{{ builder.setBar(MessageWithLargeEnum.newBuilder().setEnum(LargeEnum.AA));}}
{{ Foo objToConvert = builder.build();}}
{{ Schema schema = ProtobufData.get().getSchema(Foo.class);}}
{{ ByteArrayOutputStream bao = new ByteArrayOutputStream();}}
{{ Encoder e = EncoderFactory.get().binaryEncoder(bao, null);}}
{{ ProtobufDatumWriter<Foo> w = new ProtobufDatumWriter<Foo>(schema);}}
{{ GenericDatumReader gdr = new GenericDatumReader(schema, schema);}}
{{ BinaryDecoder d = null;}}
{{ long startTime = System.nanoTime();}}
{{ for (int i = 0; i < 1000000; ++i) {}}
{{   bao.reset();}}
{{   w.write(objToConvert, e);}}
{{   e.flush();}}
{{   d = DecoderFactory.get().binaryDecoder(bao.toByteArray(), d);}}
{{   gdr.read(null, d);}}
{{ }}}
{{ long endTime = System.nanoTime();}}
{{ System.out.println("Elapsed: " + (endTime - startTime) / 1000000 + " ms");}}
{{}}}
I will attach a patch that optimizes this.
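
The patch itself is not reproduced in this message. As a rough illustration of the general idea only (caching a schema per descriptor instead of rebuilding it on every write), here is a minimal, self-contained sketch. It does not use the Avro or protobuf APIs: {{EnumDescriptorStub}}, {{SchemaStub}}, {{SchemaCacheSketch}} and the {{builds}} counter are all hypothetical stand-ins invented for this example, and keying the cache by the descriptor's full name is an assumption rather than what the patch necessarily does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SchemaCacheSketch {

    /** Hypothetical stand-in for a protobuf enum descriptor (not the real API). */
    static final class EnumDescriptorStub {
        final String fullName;
        final List<String> symbols;

        EnumDescriptorStub(String fullName, List<String> symbols) {
            this.fullName = fullName;
            this.symbols = symbols;
        }
    }

    /** Hypothetical stand-in for an Avro Schema (not the real API). */
    static final class SchemaStub {
        final String name;
        final List<String> symbols;

        SchemaStub(String name, List<String> symbols) {
            this.name = name;
            this.symbols = symbols;
        }
    }

    // Cache keyed by the descriptor's full name (an assumption of this sketch).
    private final Map<String, SchemaStub> cache = new ConcurrentHashMap<>();

    // Counts how often the expensive build path actually runs.
    final AtomicInteger builds = new AtomicInteger();

    /** Returns the cached schema, building it only on the first request. */
    SchemaStub getSchema(EnumDescriptorStub descriptor) {
        return cache.computeIfAbsent(descriptor.fullName, name -> buildSchema(descriptor));
    }

    /** The "expensive" part: copies every enum symbol into a new schema object. */
    private SchemaStub buildSchema(EnumDescriptorStub descriptor) {
        builds.incrementAndGet();
        return new SchemaStub(descriptor.fullName, new ArrayList<>(descriptor.symbols));
    }

    /** Builds a descriptor with the 676 symbols AA..ZZ, mirroring LargeEnum above. */
    static EnumDescriptorStub largeEnum() {
        List<String> symbols = new ArrayList<>();
        for (char a = 'A'; a <= 'Z'; a++) {
            for (char b = 'A'; b <= 'Z'; b++) {
                symbols.add(String.valueOf(a) + b);
            }
        }
        return new EnumDescriptorStub("LargeEnum", symbols);
    }

    public static void main(String[] args) {
        SchemaCacheSketch data = new SchemaCacheSketch();
        EnumDescriptorStub descriptor = largeEnum();
        // Simulates the write loop from the test: one schema lookup per record.
        for (int i = 0; i < 1000000; ++i) {
            data.getSchema(descriptor);
        }
        System.out.println("schemas built: " + data.builds.get());
    }
}
```

With the cache in place the build counter stays at 1 no matter how many records are written; without it, the 676-symbol schema would be rebuilt on every iteration of the loop.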
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)