https://www.cnblogs.com/my_life/articles/5556939.html
Handling Selector empty polling
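In NIO, a Selector is polled to find out whether any IO events are ready. According to the JDK NIO API documentation, Selector.select() blocks until at least one IO event is ready or the timeout expires. On Linux, however, this sometimes goes wrong: in certain situations select() returns immediately even though the timeout has not expired and no IO event has arrived. This is the well-known epoll bug. It is a serious bug: the event-loop thread ends up spinning in an endless loop, CPU usage climbs to 100%, and the reliability of the whole system suffers badly. At the time of writing, the JDK still had not fully fixed it.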
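Netty, however, works around the problem effectively, and the workaround has proven itself in practice. Netty handles it as follows:
It counts how many times in a row select() returns early (no ready keys, no wakeup, no pending task, and the deadline not yet reached) and compares that count with a threshold. The threshold defaults to 512 and can be overridden by the application through the system property io.netty.selectorAutoRebuildThreshold. Once the count reaches the threshold, Netty builds a new Selector, moves every Channel registered on the old Selector to the new one, closes the old Selector, and uses the new Selector from then on. For the details, see the select and rebuildSelector methods of NioEventLoop: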
// Loop inside NioEventLoop.select(). selector, selectCnt, currentTimeNanos and
// selectDeadLineNanos are local variables initialized before the loop.
for (;;) {
    long timeoutMillis = (selectDeadLineNanos - currentTimeNanos + 500000L) / 1000000L;
    if (timeoutMillis <= 0) {
        if (selectCnt == 0) {
            selector.selectNow();
            selectCnt = 1;
        }
        break;
    }

    int selectedKeys = selector.select(timeoutMillis);
    selectCnt ++;

    if (selectedKeys != 0 || oldWakenUp || wakenUp.get() || hasTasks()) {
        // Selected something,
        // waken up by user, or
        // the task queue has a pending task.
        break;
    }
    if (selectedKeys == 0 && Thread.interrupted()) {
        // Thread was interrupted so reset selected keys and break so we do not run into a busy loop.
        // As this is most likely a bug in the handler of the user or its client library we will
        // also log it.
        //
        // See https://github.com/netty/netty/issues/2426
        if (logger.isDebugEnabled()) {
            logger.debug("Selector.select() returned prematurely because " +
                    "Thread.currentThread().interrupt() was called. Use " +
                    "NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.");
        }
        selectCnt = 1;
        break;
    }
    if (SELECTOR_AUTO_REBUILD_THRESHOLD > 0 &&
            selectCnt >= SELECTOR_AUTO_REBUILD_THRESHOLD) {
        // The selector returned prematurely many times in a row.
        // Rebuild the selector to work around the problem.
        logger.warn(
                "Selector.select() returned prematurely {} times in a row; rebuilding selector.",
                selectCnt);

        rebuildSelector();
        selector = this.selector;

        // Select again to populate selectedKeys.
        selector.selectNow();
        selectCnt = 1;
        break;
    }

    currentTimeNanos = System.nanoTime();
}
// NioEventLoop.rebuildSelector()
public void rebuildSelector() {
    if (!inEventLoop()) {
        execute(new Runnable() {
            @Override
            public void run() {
                rebuildSelector();
            }
        });
        return;
    }

    final Selector oldSelector = selector;
    final Selector newSelector;

    if (oldSelector == null) {
        return;
    }

    try {
        newSelector = openSelector();
    } catch (Exception e) {
        logger.warn("Failed to create a new Selector.", e);
        return;
    }

    // Register all channels to the new Selector.
    int nChannels = 0;
    for (;;) {
        try {
            for (SelectionKey key: oldSelector.keys()) {
                Object a = key.attachment();
                try {
                    if (!key.isValid() || key.channel().keyFor(newSelector) != null) {
                        continue;
                    }

                    int interestOps = key.interestOps();
                    key.cancel();
                    key.channel().register(newSelector, interestOps, a);
                    nChannels ++;
                } catch (Exception e) {
                    logger.warn("Failed to re-register a Channel to the new Selector.", e);
                    if (a instanceof AbstractNioChannel) {
                        AbstractNioChannel ch = (AbstractNioChannel) a;
                        ch.unsafe().close(ch.unsafe().voidPromise());
                    } else {
                        @SuppressWarnings("unchecked")
                        NioTask<SelectableChannel> task = (NioTask<SelectableChannel>) a;
                        invokeChannelUnregistered(task, key, e);
                    }
                }
            }
        } catch (ConcurrentModificationException e) {
            // Probably due to concurrent modification of the key set.
            continue;
        }

        break;
    }

    selector = newSelector;

    try {
        // time to close the old selector as everything else is registered to the new one
        oldSelector.close();
    } catch (Throwable t) {
        if (logger.isWarnEnabled()) {
            logger.warn("Failed to close the old Selector.", t);
        }
    }

    logger.info("Migrated " + nChannels + " channel(s) to the new Selector.");
}
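The two excerpts above lean on Netty internals (openSelector(), inEventLoop(), AbstractNioChannel, NioTask). As a rough, self-contained illustration of the same workaround using only java.nio, the sketch below counts consecutive premature select() returns and, at the threshold, re-registers every valid key on a fresh Selector before closing the old one. SelectorRebuildSketch, onSelectReturned and the rule used to classify a return as premature (no keys, and less time elapsed than the timeout) are assumptions for this example, not Netty's API.

```java
import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

final class SelectorRebuildSketch {
    // Assumed threshold, mirroring Netty's default of 512 consecutive premature returns.
    static final int REBUILD_THRESHOLD = 512;

    private Selector selector;
    private int prematureReturns;

    SelectorRebuildSketch(Selector selector) {
        this.selector = selector;
    }

    Selector selector() {
        return selector;
    }

    /**
     * Records the outcome of one select(timeout) call. A call counts as premature when it
     * returned no keys before the timeout elapsed. Returns true if the selector was rebuilt.
     */
    boolean onSelectReturned(int selectedKeys, long elapsedMillis, long timeoutMillis) throws IOException {
        if (selectedKeys > 0 || elapsedMillis >= timeoutMillis) {
            prematureReturns = 0; // a normal return: reset the counter
            return false;
        }
        if (++prematureReturns < REBUILD_THRESHOLD) {
            return false;
        }
        selector = rebuild(selector);
        prematureReturns = 0;
        return true;
    }

    /** Opens a new Selector, moves every valid registration to it, then closes the old one. */
    static Selector rebuild(Selector oldSelector) throws IOException {
        Selector newSelector = Selector.open();
        for (SelectionKey key : oldSelector.keys()) {
            SelectableChannel ch = key.channel();
            if (!key.isValid() || ch.keyFor(newSelector) != null) {
                continue; // already cancelled or already migrated
            }
            int interestOps = key.interestOps();
            Object attachment = key.attachment();
            key.cancel(); // drop the registration on the old selector
            ch.register(newSelector, interestOps, attachment); // same interest set and attachment
        }
        oldSelector.close(); // deregisters anything still attached to it
        return newSelector;
    }
}
```

In a real event loop the caller would time each select(timeout) call and pass the result in. Unlike Netty's version, the sketch is single-threaded, so it neither hands the work to the event-loop thread nor retries on ConcurrentModificationException.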
Keeping the event-loop thread from dying
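The thread is the core of the multiplexer and the carrier on which every IO event is executed. If it dies from an exception (that is, its run method returns), the whole multiplexer can become unusable, every connection attached to it goes dead, and large numbers of business requests fail. Because a Netty event loop handles both IO events and non-IO tasks, the thread must deal not only with IO exceptions but also with exceptions raised by business logic; mishandling either one can kill the thread. Netty's approach is to catch every Throwable in the run method, i.e. all Exceptions and Errors, log a warning and otherwise ignore it, sleep for 1 s, and then continue the loop. The 1 s sleep keeps the loop from immediately hitting the same exception again and spinning in an endless failure loop. The implementation is in NioEventLoop.run():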
@Override
protected void run() {
    for (;;) {
        oldWakenUp = wakenUp.getAndSet(false);
        try {
            ...
        } catch (Throwable t) {
            logger.warn("Unexpected exception in the selector loop.", t);

            // Prevent possible consecutive immediate failures that lead to
            // excessive CPU consumption.
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                // Ignore.
            }
        }
    }
}
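The pattern is independent of Netty and easy to isolate. A minimal sketch, assuming a hypothetical ResilientLoop whose iteration body is a Runnable and whose back-off is configurable (NioEventLoop hard-codes 1000 ms); it runs a fixed number of passes only so that it terminates.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ResilientLoop {
    private final Runnable iteration;  // one pass of the loop (hypothetical stand-in for select + process)
    private final long backoffMillis;  // NioEventLoop sleeps 1000 ms here
    final AtomicInteger failures = new AtomicInteger();

    ResilientLoop(Runnable iteration, long backoffMillis) {
        this.iteration = iteration;
        this.backoffMillis = backoffMillis;
    }

    /** Runs a fixed number of passes so the sketch terminates; a real event loop runs until shutdown. */
    void run(int passes) {
        for (int i = 0; i < passes; i++) {
            try {
                iteration.run();
            } catch (Throwable t) {
                // Catch Exceptions *and* Errors: letting either escape would end the thread.
                failures.incrementAndGet();
                System.err.println("Unexpected exception in the loop: " + t);
                try {
                    // Back off so a fault that repeats on every pass cannot pin the CPU.
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt status visible
                    return;
                }
            }
        }
    }
}
```

One deliberate deviation: the excerpt swallows InterruptedException, whereas the sketch restores the interrupt flag and leaves the loop, following the common Java convention for code that has no other shutdown path.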
Handling broken connections
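After a connection between client and server has been established, if it breaks unexpectedly, Netty promptly releases the connection's handle. (TCP is full-duplex, so both peers must close and release their socket handles; without special handling, the handle would leak.) It works as follows:
When reading data, Netty calls io.netty.buffer.AbstractByteBuf.writeBytes(ScatteringByteChannel, int), which calls io.netty.buffer.ByteBuf.setBytes(int, ScatteringByteChannel, int). setBytes calls read() on the underlying NIO channel. If the connection has already been closed, the JDK NIO layer throws ClosedChannelException; setBytes catches it and simply returns -1 (an orderly shutdown by the peer also makes read() return -1, so it takes the same path). When NioByteUnsafe.read sees that the number of bytes read is -1, it calls io.netty.channel.nio.AbstractNioByteChannel.NioByteUnsafe.closeOnRead(ChannelPipeline), which then calls io.netty.channel.AbstractChannel.AbstractUnsafe.close(ChannelPromise) to close the connection and release the handle. The relevant code: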
// NioByteUnsafe.read()
public void read() {
    ...
    boolean close = false;
    try {
        ...
        do {
            ...
            int localReadAmount = doReadBytes(byteBuf);
            if (localReadAmount <= 0) {
                ...
                close = localReadAmount < 0;
                break;
            }
            ...
        } while (...);
        ...
        if (close) {
            closeOnRead(pipeline);
            close = false;
        }
    } catch (Throwable t) {
        ...
    } finally {
        ...
    }
}

// NioSocketChannel.doReadBytes()
protected int doReadBytes(ByteBuf byteBuf) throws Exception {
    return byteBuf.writeBytes(javaChannel(), byteBuf.writableBytes());
}

// AbstractByteBuf.writeBytes()
public int writeBytes(ScatteringByteChannel in, int length) throws IOException {
    ensureWritable(length);
    int writtenBytes = setBytes(writerIndex, in, length);
    if (writtenBytes > 0) {
        writerIndex += writtenBytes;
    }
    return writtenBytes;
}

// UnpooledHeapByteBuf.setBytes()
public int setBytes(int index, ScatteringByteChannel in, int length) throws IOException {
    ensureAccessible();
    try {
        return in.read((ByteBuffer) internalNioBuffer().clear().position(index).limit(index + length));
    } catch (ClosedChannelException e) {
        return -1;
    }
}
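The contract these excerpts rely on is small enough to reproduce with plain NIO: a read that fails with ClosedChannelException is reported as -1, the same value read() returns at end-of-stream, and any negative count tells the caller to close. In the sketch below a java.nio Pipe stands in for the socket; ReadCloseSketch, readOrEof and shouldClose are hypothetical names.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.ReadableByteChannel;

final class ReadCloseSketch {
    /** Like setBytes above: a closed channel is reported as -1 instead of an exception. */
    static int readOrEof(ReadableByteChannel in, ByteBuffer dst) throws IOException {
        try {
            return in.read(dst); // -1 at end-of-stream, i.e. the peer closed its side
        } catch (ClosedChannelException e) {
            return -1;           // our side of the channel is already closed
        }
    }

    /** Like NioByteUnsafe.read: a negative count means the connection must be closed. */
    static boolean shouldClose(int localReadAmount) {
        return localReadAmount < 0;
    }
}
```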
Traffic shaping
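Large systems usually consist of several modules, and at deployment time different modules may run on different machines. Our own projects, for example, start at five components at minimum; with any fewer you would be embarrassed to show them off. At run time such a system involves a great deal of communication between upstream and downstream components. Servers differ both in hardware and in the business characteristics of the modules they run, so their processing capacity differs, and so does their load at different times. This can cause a problem: the rate at which upstream components send messages and the rate at which downstream components can process them fall out of balance. A downstream component then receives far more messages than it can handle, much of the work cannot be processed in time, and the downstream server may even be overwhelmed.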
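Netty provides traffic shaping for this situation: by limiting the number of bytes a server sends or receives per unit of time, it keeps upstream and downstream servers in a relatively balanced state. Netty offers two kinds of traffic shaping: per-connection shaping and global shaping across all connections. Both work the same way and differ only in the scope of the shaper: the global shaper is shared by all connections, while a per-connection shaper is created when the connection is established and discarded when the connection closes. GlobalTrafficShapingHandler implements global shaping and ChannelTrafficShapingHandler implements per-connection shaping. Traffic shaping has three important parameters:
- writeLimit: the maximum number of bytes that may be written per second (0 disables the limit).
- readLimit: the maximum number of bytes that may be read per second (0 disables the limit).
- checkInterval: the interval between traffic checks, 1 s by default.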
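Taking reads as an example, traffic shaping works roughly as follows:
- A scheduled task runs every checkInterval milliseconds. It resets the accumulated read and write byte counts to 0 and records the time of this check.
- Each read triggers channelRead, which adds the number of bytes just read to the total read since the last check.
- From the elapsed time and the bytes read, it decides whether this read has pushed the rate above readLimit bytes per second. The formula is (bytes * 1000 / limit - interval) / 10 * 10, where bytes is the total read since the last check, limit is readLimit, and interval is the number of milliseconds since the last check. bytes * 1000 / limit is how many milliseconds those bytes should have taken at the permitted rate; subtracting interval gives how far the channel is ahead of that rate, and / 10 * 10 truncates the result to a multiple of 10 ms. If the result is at least the fixed minimum of 10 ms, the traffic is over the limit and reading is delayed by that many milliseconds: if reads are not yet suspended, Netty passes the current message on but suspends further reads and schedules a task to resume them after the wait; if reads are already suspended, the message itself is handed to the next handler only after the wait.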
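The three steps fit in a small class. This is a sketch under stated assumptions, not Netty's AbstractTrafficShapingHandler: ReadThrottle and its methods are hypothetical, only the read side is modelled, and the current time is passed in explicitly instead of read from System.currentTimeMillis() so that the numbers can be checked by hand. getTimeToWait uses the formula given above, and MINIMAL_WAIT is the 10 ms threshold.

```java
final class ReadThrottle {
    static final long MINIMAL_WAIT = 10; // ms; shorter waits are not worth scheduling

    private final long readLimit;  // bytes per second, 0 = unlimited
    private long currentReadBytes; // bytes read since the last check
    private long lastTime;         // time of the last check, in ms

    ReadThrottle(long readLimit, long now) {
        this.readLimit = readLimit;
        this.lastTime = now;
    }

    /** Step 1: the periodic check (every checkInterval ms) resets the counters. */
    void resetAccounting(long now) {
        currentReadBytes = 0;
        lastTime = now;
    }

    /** Steps 2 and 3: account for a read and return how long (ms) to hold reading back, 0 if not at all. */
    long onRead(long size, long now) {
        currentReadBytes += size;
        if (readLimit == 0) {
            return 0; // shaping disabled
        }
        long wait = getTimeToWait(readLimit, currentReadBytes, lastTime, now);
        return wait >= MINIMAL_WAIT ? wait : 0;
    }

    static long getTimeToWait(long limit, long bytes, long lastTime, long curtime) {
        long interval = curtime - lastTime;
        if (interval <= 0) {
            return 0; // no time has passed since the check, nothing to compare against
        }
        // bytes * 1000 / limit = ms these bytes "should" take at the limit;
        // minus the ms actually elapsed, truncated to a multiple of 10 ms.
        return (bytes * 1000 / limit - interval) / 10 * 10;
    }
}
```

With readLimit = 1000 bytes/s, reading 500 bytes 250 ms after a check gives (500 * 1000 / 1000 - 250) / 10 * 10 = 250, so reading is held back for 250 ms; a cumulative 600 bytes after 900 ms gives a negative value, so no delay is needed.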
The relevant code:
// AbstractTrafficShapingHandler.channelRead()
public void channelRead(final ChannelHandlerContext ctx, final Object msg) throws Exception {
    long size = calculateSize(msg);
    long curtime = System.currentTimeMillis();

    if (trafficCounter != null) {
        // Add to the accumulated byte count
        trafficCounter.bytesRecvFlowControl(size);
        if (readLimit == 0) {
            // no action
            ctx.fireChannelRead(msg);
            return;
        }

        // compute the number of ms to wait before reopening the channel
        long wait = getTimeToWait(readLimit,
                trafficCounter.currentReadBytes(),
                trafficCounter.lastTime(), curtime);
        if (wait >= MINIMAL_WAIT) { // At least 10 ms seems a minimal time in order to try to limit the traffic
            if (!isSuspended(ctx)) {
                ctx.attr(READ_SUSPENDED).set(true);

                // Create a Runnable to reactivate the read if needed. If one was created before it will just be
                // reused to limit object creation
                Attribute<Runnable> attr = ctx.attr(REOPEN_TASK);
                Runnable reopenTask = attr.get();
                if (reopenTask == null) {
                    reopenTask = new ReopenReadTimerTask(ctx);
                    attr.set(reopenTask);
                }
                ctx.executor().schedule(reopenTask, wait,
                        TimeUnit.MILLISECONDS);
            } else {
                // Reads are already suspended: pass this message to the next handler
                // in the chain only after wait ms
                Runnable bufferUpdateTask = new Runnable() {
                    @Override
                    public void run() {
                        ctx.fireChannelRead(msg);
                    }
                };
                ctx.executor().schedule(bufferUpdateTask, wait, TimeUnit.MILLISECONDS);
                return;
            }
        }
    }
    ctx.fireChannelRead(msg);
}
// AbstractTrafficShapingHandler.getTimeToWait()
private static long getTimeToWait(long limit, long bytes, long lastTime, long curtime) {
    long interval = curtime - lastTime;
    if (interval <= 0) {
        // Time is too short, so just let's continue
        return 0;
    }
    return (bytes * 1000 / limit - interval) / 10 * 10;
}
// TrafficMonitoringTask: the periodic check, rescheduled every checkInterval ms
private static class TrafficMonitoringTask implements Runnable {
    ...
    @Override
    public void run() {
        if (!counter.monitorActive.get()) {
            return;
        }
        long endTime = System.currentTimeMillis();
        // Reset the accumulated byte counts, lastTime and related fields
        counter.resetAccounting(endTime);
        if (trafficShapingHandler1 != null) {
            trafficShapingHandler1.doAccounting(counter);
        }
        counter.scheduledFuture = counter.executor.schedule(this, counter.checkInterval.get(),
                TimeUnit.MILLISECONDS);
    }
}
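Memory pool
Netty 4's memory pool brings together the best ideas of earlier designs, in particular slab allocation and buddy allocation. Anyone who has worked with memcached will know slab allocation: memory is divided into blocks of several different sizes, and each request is served from the smallest block size that fits it, which reduces memory fragmentation while avoiding wasted memory.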
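A slab-style allocator comes down to two things: rounding each request up to a size class, and keeping a free list per class so that a released block can serve the next request of the same class. The sketch below shows both; SlabSketch is a hypothetical name, and its size classes (multiples of 16 below 512 bytes, powers of two from 512 up) are illustrative choices rather than a statement of Netty's exact boundaries.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

final class SlabSketch {
    /** Rounds a request up to its size class: multiples of 16 below 512, powers of two from 512 up. */
    static int sizeClass(int reqCapacity) {
        if (reqCapacity <= 0 || reqCapacity > (1 << 30)) {
            throw new IllegalArgumentException("unsupported size: " + reqCapacity);
        }
        if (reqCapacity < 512) {
            return (reqCapacity + 15) & ~15; // next multiple of 16
        }
        int highest = Integer.highestOneBit(reqCapacity);
        return highest == reqCapacity ? reqCapacity : highest << 1; // next power of two
    }

    // One free list per size class; freed blocks are reused for requests of the same class.
    private final Map<Integer, ArrayDeque<byte[]>> freeLists = new HashMap<>();

    byte[] allocate(int reqCapacity) {
        int size = sizeClass(reqCapacity);
        ArrayDeque<byte[]> list = freeLists.get(size);
        byte[] block = list == null ? null : list.poll();
        return block != null ? block : new byte[size];
    }

    /** Accepts only blocks obtained from allocate(), whose length is already a size class. */
    void free(byte[] block) {
        freeLists.computeIfAbsent(block.length, k -> new ArrayDeque<>()).push(block);
    }
}
```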
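Buddy allocation splits memory blocks into equal halves as needed during allocation and merges them back together on release, so that the system keeps as much large contiguous memory available as possible.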
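Buddy allocation is easiest to follow on a small region. The following is a generic textbook buddy allocator over offsets, assuming a region of minBlock << maxOrder bytes; it is not Netty's implementation, and BuddySketch and its parameters are hypothetical. Allocation splits a larger free block in halves until it reaches the requested order, and freeing merges a block with its buddy (the neighbour whose offset differs only in the bit for that block size) for as long as the buddy is also free.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class BuddySketch {
    private final int minBlock; // size of the smallest block (order 0)
    private final int maxOrder; // the whole region is minBlock << maxOrder bytes
    private final List<TreeSet<Integer>> freeByOrder = new ArrayList<>(); // free block offsets per order

    BuddySketch(int minBlock, int maxOrder) {
        this.minBlock = minBlock;
        this.maxOrder = maxOrder;
        for (int k = 0; k <= maxOrder; k++) {
            freeByOrder.add(new TreeSet<>());
        }
        freeByOrder.get(maxOrder).add(0); // initially one free block covering the whole region
    }

    /** Smallest order whose block size fits; maxOrder + 1 means "larger than the region". */
    private int orderFor(int size) {
        int order = 0;
        while (order <= maxOrder && (minBlock << order) < size) {
            order++;
        }
        return order;
    }

    /** Returns the offset of a block of at least size bytes, or -1 if nothing large enough is free. */
    int allocate(int size) {
        int order = orderFor(size);
        int k = order;
        while (k <= maxOrder && freeByOrder.get(k).isEmpty()) {
            k++; // look for the smallest free block that is large enough
        }
        if (k > maxOrder) {
            return -1;
        }
        int offset = freeByOrder.get(k).pollFirst();
        while (k > order) { // split: keep the left half, put the right half (its buddy) on a free list
            k--;
            freeByOrder.get(k).add(offset + (minBlock << k));
        }
        return offset;
    }

    /** Frees a block previously returned by allocate(size), merging it with free buddies. */
    void free(int offset, int size) {
        int k = orderFor(size);
        while (k < maxOrder) {
            int buddy = offset ^ (minBlock << k);
            if (!freeByOrder.get(k).remove(buddy)) {
                break; // buddy is in use (or split), so no merge at this order
            }
            offset = Math.min(offset, buddy); // the merged block starts at the lower offset
            k++;
        }
        freeByOrder.get(k).add(offset);
    }

    int freeBlocks(int order) {
        return freeByOrder.get(order).size();
    }
}
```

With minBlock = 16 and maxOrder = 3 (a 128-byte region), allocating 16 bytes splits the region into free blocks at offsets 64, 32 and 16; once every block has been freed again, the merges rebuild the single 128-byte block.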
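Thread-local allocation
To avoid contention between threads, memory is allocated from the current thread's own cache first.
Global allocation
When the pool starts out, threads have no cached memory, so the first allocations must be served from the shared global allocation area.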
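The allocation order above (the thread's own cache first, the shared area only on a miss) can be sketched with a ThreadLocal. ThreadCachedPool, maxCachedPerThread and the single fixed block size are simplifying assumptions, so the "block too large to cache" check is reduced to a cap on the cache length; a synchronized deque stands in for the global area, and release() caches blocks in the releasing thread while its cache has room, which is how thread caches get populated in the first place.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

final class ThreadCachedPool {
    private final int blockSize;
    private final int maxCachedPerThread;
    private final ArrayDeque<ByteBuffer> global = new ArrayDeque<>(); // shared area, guarded by "this"
    private final ThreadLocal<ArrayDeque<ByteBuffer>> local = ThreadLocal.withInitial(ArrayDeque::new);
    int globalAllocations; // requests that had to go to the shared area (guarded by "this")

    ThreadCachedPool(int blockSize, int maxCachedPerThread) {
        this.blockSize = blockSize;
        this.maxCachedPerThread = maxCachedPerThread;
    }

    ByteBuffer allocate() {
        ByteBuffer cached = local.get().poll(); // 1. this thread's own cache: no locking
        if (cached != null) {
            cached.clear();
            return cached;
        }
        synchronized (this) {                   // 2. shared area: contended between threads
            globalAllocations++;
            ByteBuffer b = global.poll();
            return b != null ? b : ByteBuffer.allocate(blockSize);
        }
    }

    void release(ByteBuffer buf) {
        ArrayDeque<ByteBuffer> cache = local.get();
        if (cache.size() < maxCachedPerThread) {
            cache.push(buf);                    // keep it for this thread's next allocation
        } else {
            synchronized (this) {
                global.push(buf);               // cache full: hand it back to the shared area
            }
        }
    }
}
```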
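Memory release
As mentioned above, the pool does not pre-fill thread caches with memory blocks. Instead, when a thread finishes with a block and returns it, the block is preferentially cached in that thread, unless it is unsuitable for thread caching (because it is too large). When the thread allocates very actively, this noticeably speeds up allocation; when the thread goes quiet, however, the cached memory is largely wasted. The pool therefore monitors each thread's cache and stands ready to evict memory from it at any time; see the trim method of the MemoryRegionCache class: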
private void trim() {
    int free = size() - maxEntriesInUse;
    entriesInUse = 0;
    maxEntriesInUse = 0;

    if (free <= maxUnusedCached) {
        return;
    }

    int i = head;
    for (; free > 0; free--) {
        if (!freeEntry(entries[i])) {
            // all freed
            return;
        }
        i = nextIdx(i);
    }
}
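How trim() decides what to drop is easier to see with concrete counts. The sketch below keeps the excerpt's bookkeeping (size(), entriesInUse, maxEntriesInUse, maxUnusedCached) but stores plain objects in a deque; RegionCacheSketch, add and allocate are hypothetical, and here maxEntriesInUse is simply the peak number of cache hits since the previous trim.

```java
import java.util.ArrayDeque;

final class RegionCacheSketch {
    private final ArrayDeque<Object> entries = new ArrayDeque<>();
    private final int maxUnusedCached; // idle entries tolerated before trimming kicks in
    private int entriesInUse;          // cache hits since the last trim
    private int maxEntriesInUse;       // peak of entriesInUse since the last trim

    RegionCacheSketch(int maxUnusedCached) {
        this.maxUnusedCached = maxUnusedCached;
    }

    void add(Object entry) {
        entries.addLast(entry);
    }

    Object allocate() {
        Object e = entries.pollFirst();
        if (e != null) {
            entriesInUse++;
            maxEntriesInUse = Math.max(maxEntriesInUse, entriesInUse);
        }
        return e;
    }

    int size() {
        return entries.size();
    }

    /** Same decision as MemoryRegionCache.trim(): free what is cached beyond recent demand. Returns how many were freed. */
    int trim() {
        int free = size() - maxEntriesInUse;
        entriesInUse = 0;
        maxEntriesInUse = 0;
        if (free <= maxUnusedCached) {
            return 0; // the surplus is small enough to keep
        }
        int freed = 0;
        for (; free > 0 && !entries.isEmpty(); free--) {
            entries.pollFirst(); // a real pool would return this block to its arena here
            freed++;
        }
        return freed;
    }
}
```

Suppose eight blocks are cached and three are handed out: five remain and recent demand was three, so only two are surplus and, with maxUnusedCached = 2, nothing is freed. If the next period sees a single hit, three of the remaining four are surplus and are freed, leaving one.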