Disabling Netty's Object Cache Mechanism

2017-09-13

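Recently the old generation of our Collector application stayed at nearly 8 GB, and full GC did nothing to bring it down. Investigation showed the cause was Netty's object cache mechanism. An excellent article, "Netty 踩坑记" ("Notes on Netty Pitfalls"), pointed me in the right direction. I hit a few smaller pitfalls of my own along the way, and the problem is now largely resolved.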

How It Started

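Our event-tracking collection application, Collector, had recently been going into severe Full GC at peak hours, with the old generation constantly occupying about 8 GB (exactly the configured total heap minus the young generation). Full GC did not reduce old-generation usage at all. Suspecting a memory leak, I looked at the output of jmap -histo: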

num     #instances         #bytes  class name
----------------------------------------------
 1:       2739813     9016322960  [B
 2:      45219840     1085276160  io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry
 3:      14581784      709154016  [C
 4:      14490197      347764728  java.lang.String
 5:         83484      240125528  [I
 6:        102400      182517760  [Lio.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry;
 7:        848659       60698184  [Ljava.lang.Object;
 8:       1174153       46966120  java.util.TreeMap$Entry
 9:       1523574       24377184  java.lang.Integer
10:        537647       21505880  java.util.HashMap$KeyIterator

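The second entry, PoolThreadCache$MemoryRegionCache$Entry, has more than 45 million instances and occupies about 1 GB of memory. Starting from that class, I found the excellent article "Netty 踩坑记" and learned the reason: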

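Netty has an object-pool cache mechanism, used mainly to reuse ByteBuf instances. If you use PooledByteBuf, allocate the ByteBuf on an IO thread, and then release it on a business thread, memory leaks. That is because ByteBuf.release() eventually calls Recycler.recycle():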

public final boolean recycle(T o, Handle handle) {
        DefaultHandle h = (DefaultHandle) handle;
        if (h.stack.parent != this) {
            return false;
        }
        if (o != h.value) {
            throw new IllegalArgumentException("o does not belong to handle");
        }
        h.recycle();
        return true;
}

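h.recycle() in turn pushes the handle onto its Stack when called from the thread that owns the Stack; when called from any other thread, it adds the handle to a WeakOrderQueue held in the calling thread's thread-local map: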

public void recycle() {
        Thread thread = Thread.currentThread();
        if (thread == stack.thread) {
            stack.push(this);
            return;
        }
        // we don't want to have a ref to the queue as the value in our weak map
        // so we null it out; to ensure there are no races with restoring it later
        // we impose a memory ordering here (no-op on x86)
        Map<Stack<?>, WeakOrderQueue> delayedRecycled = DELAYED_RECYCLED.get();
        WeakOrderQueue queue = delayedRecycled.get(stack);
        if (queue == null) {
            delayedRecycled.put(stack, queue = new WeakOrderQueue(stack, thread));
        }
        queue.add(this);
}

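The key point is that this thread-local belongs to our business thread, not to the original IO thread. Hence the article's conclusion: "If you use PooledByteBuf, make sure allocate and release happen on the same thread!"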
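One way to follow that advice is to hand the release back to the thread that did the allocation. The sketch below uses only the JDK and is not Netty code: TrackedBuffer and SameThreadRelease are hypothetical names, and a single-thread executor stands in for the IO thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for a pooled buffer; it records the allocating and releasing threads. */
final class TrackedBuffer {
    final Thread allocatedOn = Thread.currentThread();
    volatile Thread releasedOn;

    void release() {
        releasedOn = Thread.currentThread();
    }
}

final class SameThreadRelease {
    /** Allocates on the "IO" executor, works on the caller's thread, releases back on the "IO" executor. */
    static TrackedBuffer roundTrip() {
        ExecutorService io = Executors.newSingleThreadExecutor(); // plays the role of the IO thread
        try {
            Callable<TrackedBuffer> allocate = TrackedBuffer::new;
            TrackedBuffer buf = io.submit(allocate).get();

            // ... business logic would run here, on the calling (business) thread ...

            Runnable release = buf::release;
            io.submit(release).get(); // release is handed back to the allocating thread
            return buf;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            io.shutdown();
        }
    }

    public static void main(String[] args) {
        TrackedBuffer buf = roundTrip();
        System.out.println("same thread: " + (buf.allocatedOn == buf.releasedOn));
    }
}
```

In a real Netty handler the equivalent idea would probably be to schedule the release on the channel's event loop instead of releasing directly from a business thread; treat that as an assumption to check against the Netty version in use.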

One fix is to disable the cache, either by adding the startup parameter -Dio.netty.recycler.maxCapacity=0 or by setting the system property in code: System.setProperty("io.netty.recycler.maxCapacity", "0");

The small pitfall I hit was that the setting appeared to have no effect at all. I was using Netty 4.0.26.Final, and reading the source showed why:

static {
    // In the future, we might have different maxCapacity for different object types.
    // e.g. io.netty.recycler.maxCapacity.writeTask
    //      io.netty.recycler.maxCapacity.outboundBuffer
    int maxCapacity = SystemPropertyUtil.getInt("io.netty.recycler.maxCapacity.default", 0);
    if (maxCapacity <= 0) {
        // TODO: Some arbitrary large number - should adjust as we get more production experience.
        maxCapacity = 262144;
    }

    DEFAULT_MAX_CAPACITY = maxCapacity;
    if (logger.isDebugEnabled()) {
        logger.debug("-Dio.netty.recycler.maxCapacity.default: {}", DEFAULT_MAX_CAPACITY);
    }

    INITIAL_CAPACITY = Math.min(DEFAULT_MAX_CAPACITY, 256);
}

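In other words, the property this version reads is "io.netty.recycler.maxCapacity.default", and even setting it to 0 has no effect: any value <= 0 is replaced by the built-in default of 262144 (256 is only the initial capacity).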
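The effect of that static block can be reproduced in isolation. This is a minimal sketch that mirrors only the arithmetic of the excerpt above; RecyclerCapacityCheck and its method names are made up for illustration and are not part of Netty.

```java
final class RecyclerCapacityCheck {
    // Fallback value taken from the 4.0.26.Final excerpt above.
    static final int FALLBACK_MAX_CAPACITY = 262144;

    /** Mirrors the excerpt: any configured value <= 0 is replaced by the fallback. */
    static int effectiveMaxCapacity(int configured) {
        return configured <= 0 ? FALLBACK_MAX_CAPACITY : configured;
    }

    /** Mirrors INITIAL_CAPACITY = Math.min(DEFAULT_MAX_CAPACITY, 256). */
    static int initialCapacity(int configured) {
        return Math.min(effectiveMaxCapacity(configured), 256);
    }

    public static void main(String[] args) {
        // Reads the same property name as the excerpt; defaults to 0 when it is not set.
        int configured = Integer.getInteger("io.netty.recycler.maxCapacity.default", 0);
        System.out.println("configured=" + configured
                + " effectiveMax=" + effectiveMaxCapacity(configured)
                + " initial=" + initialCapacity(configured));
    }
}
```

Running it with -Dio.netty.recycler.maxCapacity.default=0 shows why the setting looked ignored: 0 is treated as "not configured", not as "disabled".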

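Browsing Netty's code on GitHub, I found that the change "[#4147] Allow to disable recycling" is what made this setting actually take effect. That PR was subsequently merged into 4.0.30.Final.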

The "io.netty.recycler.maxCapacity" property mentioned in that blog post applies to 4.1.x and later.
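Because the property name differs between the 4.0.x and 4.1.x lines, one defensive option is to set both names before any Netty class is loaded (the value is read in a static initializer, as the excerpt above shows). This is a sketch under that assumption; DisableRecycler is a hypothetical helper, and whether 0 is honored still depends on the Netty version on the classpath.

```java
final class DisableRecycler {
    // Property read by the 4.0.x code quoted in this post.
    static final String KEY_4_0 = "io.netty.recycler.maxCapacity.default";
    // Property named in the referenced blog post, which applies to 4.1.x and later.
    static final String KEY_4_1 = "io.netty.recycler.maxCapacity";

    /** Must be called before the first Netty class is touched. */
    static void apply() {
        System.setProperty(KEY_4_0, "0");
        System.setProperty(KEY_4_1, "0");
    }

    public static void main(String[] args) {
        apply();
        // Start the Netty server or client only after this point.
        System.out.println(KEY_4_0 + "=" + System.getProperty(KEY_4_0));
        System.out.println(KEY_4_1 + "=" + System.getProperty(KEY_4_1));
    }
}
```

Passing the same two names as -D startup parameters achieves the same thing without any ordering concerns inside the application.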

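In short, after upgrading Netty and adding the parameter, io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry has all but disappeared from the jmap output, dropping from about 45 million instances (about 1 GB) to about 2 million (about 65 MB):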

num     #instances         #bytes  class name
----------------------------------------------
 1:       1859867     6106441528  [B
 2:      16684162     1374674200  [C
 3:      15991878      383805072  java.lang.String
 4:        982953      270032264  [Ljava.lang.Object;
 5:        277338      102179640  [I
 6:         99840       65495040  io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue
 7:       2040514       65296448  io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry
 8:       1412136       56485440  java.util.TreeMap$Entry
 9:       2326912       55845888  [Lorg.apache.kafka.common.Node;
10:       1582306       50633792  java.util.HashMap$Node
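To put a number on the improvement, the Entry rows of the two histograms in this post can be compared directly. The sketch below only does arithmetic on the figures printed by jmap above; HistoDelta is a hypothetical helper name.

```java
final class HistoDelta {
    /** Average shallow size per instance, in bytes. */
    static long bytesPerInstance(long instances, long bytes) {
        return bytes / instances;
    }

    /** Percentage reduction from before to after. */
    static double reductionPercent(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // PoolThreadCache$MemoryRegionCache$Entry rows from the two jmap -histo outputs.
        long instancesBefore = 45219840L, bytesBefore = 1085276160L;
        long instancesAfter = 2040514L, bytesAfter = 65296448L;

        System.out.println("bytes/instance before: " + bytesPerInstance(instancesBefore, bytesBefore)); // 24
        System.out.println("bytes/instance after:  " + bytesPerInstance(instancesAfter, bytesAfter));   // 32
        System.out.printf("instances reduced by %.1f%%%n", reductionPercent(instancesBefore, instancesAfter));
        System.out.printf("bytes reduced by %.1f%%%n", reductionPercent(bytesBefore, bytesAfter));
    }
}
```

The per-instance size differs between the two runs (24 bytes versus 32 bytes), which is consistent with the Entry class itself having changed between the two Netty versions.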

Happy!