
List:       avro-dev
Subject:    [jira] [Commented] (AVRO-1821) Avro (Java) Memory Leak in ReflectData Caching
From:       "Anbu Cheeralan (JIRA)" <jira () apache ! org>
Date:       2017-06-20 18:31:00
Message-ID: JIRA.12955792.1459783330000.61381.1497983460428 () Atlassian ! JIRA


    [ https://issues.apache.org/jira/browse/AVRO-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056246#comment-16056246 ]

Anbu Cheeralan commented on AVRO-1821:
--------------------------------------

Thanks [~nkollar]

> Avro (Java) Memory Leak in ReflectData Caching
> ----------------------------------------------
> 
> Key: AVRO-1821
> URL: https://issues.apache.org/jira/browse/AVRO-1821
> Project: Avro
> Issue Type: Bug
> Components: java
> Environment: OS X 10.11.3
> {code}
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
> {code}
> Reporter: Bryan Harclerode
> Assignee: Bryan Harclerode
> Fix For: 1.8.1
> 
> Attachments: 0001-AVRO-1821-Fix-memory-leak-of-Schemas-in-ReflectData.patch
> 
> 
> I think I have encountered one of the memory leaks described by AVRO-1283 in the way Java Avro implements field accessor caching in {{ReflectData}}. When a reflected object is serialized, the key of {{ClassAccessorData.bySchema}} (as retained by {{ReflectData.ACCESSOR_CACHE}}) retains a strong reference to the schema that was used to serialize the object, but no code path clears these references once a schema is no longer used. While in most cases a class will probably have only one schema associated with it (created and cached by {{ReflectData.getSchema(Type)}}), I experienced an {{OutOfMemoryError}} when serializing generic classes with dynamically generated schemas. The following minimal example exhausts a 50 MiB heap ({{-Xmx50m}}) after about 190K iterations:
> {code:title=AvroMemoryLeakMinimal.java|borderStyle=solid}
> import java.io.ByteArrayOutputStream;
> import java.io.IOException;
> import java.util.Collections;
> 
> import org.apache.avro.Schema;
> import org.apache.avro.io.BinaryEncoder;
> import org.apache.avro.io.EncoderFactory;
> import org.apache.avro.reflect.ReflectDatumWriter;
> 
> public class AvroMemoryLeakMinimal {
>   public static void main(String[] args) throws IOException {
>     long count = 0;
>     EncoderFactory encFactory = EncoderFactory.get();
>     try {
>       while (true) {
>         // Create a fresh, structurally identical schema on every iteration
>         Schema schema = Schema.createRecord("schema", null, null, false);
>         schema.setFields(Collections.<Schema.Field>emptyList());
>         // Serialize; ReflectData caches field accessors keyed by this schema instance
>         ByteArrayOutputStream baos = new ByteArrayOutputStream(1024);
>         BinaryEncoder encoder = encFactory.binaryEncoder(baos, null);
>         (new ReflectDatumWriter<Object>(schema)).write(new Object(), encoder);
>         byte[] result = baos.toByteArray();
>         count++;
>       }
>     } catch (OutOfMemoryError e) {
>       System.out.print("Memory exhausted after ");
>       System.out.print(count);
>       System.out.println(" schemas");
>       throw e;
>     }
>   }
> }
> {code}
> I was able to fix the bug in the latest 1.9.0-SNAPSHOT from git with the following patch, which changes {{ClassAccessorData.bySchema}} to use weak keys so that it properly releases the {{Schema}} objects once no other threads are still referencing them:
> {code:title=ReflectData.java.patch|borderStyle=solid}
> --- a/lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectData.java
> +++ b/lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectData.java
> @@ -57,6 +57,7 @@ import org.apache.avro.io.DatumWriter;
>  import org.apache.avro.specific.FixedSize;
>  import org.apache.avro.specific.SpecificData;
>  import org.apache.avro.SchemaNormalization;
> +import org.apache.avro.util.WeakIdentityHashMap;
>  import org.codehaus.jackson.JsonNode;
>  import org.codehaus.jackson.node.NullNode;
> 
> @@ -234,8 +235,8 @@ public class ReflectData extends SpecificData {
>      private final Class<?> clazz;
>      private final Map<String, FieldAccessor> byName =
>          new HashMap<String, FieldAccessor>();
> -    private final IdentityHashMap<Schema, FieldAccessor[]> bySchema =
> -        new IdentityHashMap<Schema, FieldAccessor[]>();
> +    private final WeakIdentityHashMap<Schema, FieldAccessor[]> bySchema =
> +        new WeakIdentityHashMap<Schema, FieldAccessor[]>();
> 
>      private ClassAccessorData(Class<?> c) {
>        clazz = c;
> {code}
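As an aside on why weak keys fix the leak: once no strong reference to a key remains, a weak-keyed map drops the entry after the key is collected. A minimal sketch using the JDK's plain {{WeakHashMap}} (not Avro's {{WeakIdentityHashMap}}; the retry loop is there only because garbage collection timing is not deterministic):

```java
import java.util.Map;
import java.util.WeakHashMap;

public class WeakKeyDemo {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for bySchema: weak keys, strong values
        Map<Object, String> cache = new WeakHashMap<Object, String>();
        Object key = new Object(); // stand-in for a dynamically generated Schema
        cache.put(key, "accessors");
        System.out.println("before GC: " + cache.size());

        key = null; // drop the only strong reference to the key
        // GC is not guaranteed to run promptly, so retry a few times
        for (int i = 0; i < 50 && !cache.isEmpty(); i++) {
            System.gc();
            Thread.sleep(10);
        }
        System.out.println("after GC: " + cache.size()); // usually 0 by now
    }
}
```

With a strong-keyed {{IdentityHashMap}}, the same sequence would leave the entry in place forever, which is exactly the retention pattern described above.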
> Additionally, I'm not sure why an {{IdentityHashMap}} was used instead of a standard {{HashMap}}, since two equivalent schemas have the same set of {{FieldAccessor}}s. Everything appears to work and all tests pass if I use a {{WeakHashMap}} instead of a {{WeakIdentityHashMap}}, but I don't know if there was some other reason object identity was important for this map. If a non-identity map can be used, that would reduce memory/CPU usage further by not regenerating all the field accessors for equivalent schemas. The following unit test appears to reliably catch this bug, but it is non-deterministic due to the nature of garbage collection (and I'm not sure there's a way around that):
> {code:title=TestReflectData.java|borderStyle=solid}
> package org.apache.avro.reflect;
> 
> import org.apache.avro.Schema;
> import org.junit.Test;
> 
> import java.io.IOException;
> import java.lang.reflect.Field;
> import java.util.Collections;
> import java.util.Map;
> 
> import static org.hamcrest.Matchers.lessThan;
> import static org.junit.Assert.assertThat;
> 
> public class TestReflectData {
>   /**
>    * Test if ReflectData is leaking {@link Schema} references
>    */
>   @SuppressWarnings("unchecked")
>   @Test public void testWeakSchemaCaching() throws IOException, NoSuchFieldException, IllegalAccessException {
>     for (int i = 0; i < 1000; i++) {
>       // Create schema
>       Schema schema = Schema.createRecord("schema", null, null, false);
>       schema.setFields(Collections.<Schema.Field>emptyList());
>       ReflectData.get().getRecordState(new Object(), schema);
>     }
>     // Reflect the number of schemas currently in the cache
>     Field cacheField = ReflectData.class.getDeclaredField("ACCESSOR_CACHE");
>     cacheField.setAccessible(true);
>     Map<Class<?>, ?> ACCESSOR_CACHE = (Map) cacheField.get(null);
>     Object classData = ACCESSOR_CACHE.get(Object.class);
>     Field bySchemaField = classData.getClass().getDeclaredField("bySchema");
>     bySchemaField.setAccessible(true);
>     Map<Schema, FieldAccessor[]> accessors = (Map) bySchemaField.get(classData);
>     System.gc(); // Not guaranteed reliable, but seems to be reliable enough for our purposes
>     // See if the number of schemas in the cache is less than the number we
>     // generated - if so, then they are being released.
>     assertThat("ReflectData cache should release references", accessors.size(), lessThan(1000));
>   }
> }
> {code}
> (Added the {{org.hamcrest:hamcrest-all}} dependency to the test scope for the built-in {{lessThan()}} matcher.)
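The identity-versus-equality question raised earlier ({{IdentityHashMap}} vs. {{HashMap}}) can be illustrated with plain JDK types. A minimal sketch, using {{new String}} instances as stand-ins for two structurally equivalent but distinct {{Schema}} objects:

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

public class IdentityVsEquality {
    public static void main(String[] args) {
        // Two distinct instances that are equal by equals()/hashCode(),
        // like two dynamically generated but equivalent Schemas
        String a = new String("schema");
        String b = new String("schema");

        Map<String, String> byEquality = new HashMap<String, String>();
        byEquality.put(a, "accessors");
        byEquality.put(b, "accessors");

        Map<String, String> byIdentity = new IdentityHashMap<String, String>();
        byIdentity.put(a, "accessors");
        byIdentity.put(b, "accessors");

        System.out.println(byEquality.size()); // 1: equivalent keys share one entry
        System.out.println(byIdentity.size()); // 2: every instance gets its own entry
    }
}
```

This is why an equality-based cache would regenerate the field accessors only once per equivalence class of schemas, while the identity-based cache holds a separate entry (and a separate strong reference) for every instance.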
> ----
> The current workaround I'm using to mitigate the leak is to cache schemas and re-use older instances whenever I'm about to serialize an equivalent schema. Since most of the generated schemas are equivalent, this limits the number of leaked schemas to a handful. A more permanent workaround would be to switch to a {{GenericRecord}} instead of a generic Java class for the object being serialized, since that cuts out the use of {{ReflectData}} entirely.
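The schema re-use workaround described above can be sketched as a small canonicalizing cache. This is a hypothetical helper, not Avro API; it works for {{Schema}} keys because {{Schema.equals()}} compares structure, so equivalent instances collapse onto the first one seen:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical canonicalizer: re-uses the first instance seen for each
 * equivalence class of keys, so equivalent Schemas never multiply in
 * identity-keyed caches downstream.
 */
public class Canonicalizer<T> {
    private final Map<T, T> canonical = new ConcurrentHashMap<T, T>();

    /** Returns a previously seen instance equal to {@code value}, or {@code value} itself. */
    public T intern(T value) {
        T prior = canonical.putIfAbsent(value, value);
        return prior != null ? prior : value;
    }
}
```

Passing every dynamically generated schema through {{intern()}} before handing it to {{ReflectDatumWriter}} bounds the number of distinct {{Schema}} instances the accessor cache can retain, at the cost of the canonical map itself growing with the number of distinct schema shapes.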



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


