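1. Theory
Zero-copy is a key technique in server-side network programming and shows up in most I/O performance work. In the Java world, the two commonly used zero-copy mechanisms are mmap (memory mapping) and sendFile. Zero-copy does not mean that nothing is copied; it means there are no CPU copies, while DMA copies remain unavoidable. From the operating system's point of view, no data is duplicated between kernel buffers (only the kernel buffer holds a copy of the data).
Besides fewer data copies, zero-copy brings other performance benefits, such as fewer context switches, less CPU cache false sharing, and no checksum computation on the CPU.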
2. Baseline test with BIO
The code is as follows:
import java.io.File;
import java.io.RandomAccessFile;
import java.net.ServerSocket;
import java.net.Socket;

public class NIOSocket {

    public static void main(String[] args) throws Exception {
        // Create the listening socket
        ServerSocket serverSocket = new ServerSocket(8080);
        System.out.println("serverSocket init 8080 ==== ");
        while (true) {
            Thread.sleep(1 * 1000);
            // Once a connection is accepted, write the file data back to it
            Socket socket = serverSocket.accept();
            System.out.println("socket; " + socket.getRemoteSocketAddress());
            // Read the file's data ("rw" matches the O_RDWR|O_CREAT flags seen in the trace below)
            File file = new File("index.html");
            RandomAccessFile raf = new RandomAccessFile(file, "rw");
            byte[] arr = new byte[(int) file.length()];
            raf.read(arr);
            socket.getOutputStream().write(arr);
            // Release the file and the connection so the client sees end of stream
            raf.close();
            socket.close();
        }
    }
}
1. The contents of index.html:
[root@192 zerocopy]# cat index.html
index hello
2. Run the program under strace on Linux
[root@192 zerocopy]# strace -ff -o out ../jdk8/jdk1.8.0_291/bin/java NIOSocket
serverSocket init 8080 ====
3. List the out files
[root@192 zerocopy]# ll
total 2568
-rw-r--r--. 1 root root      12 Jul 23 04:25 index.html
-rw-r--r--. 1 root root    1389 Jul 24 20:17 NIOSocket.class
-rw-r--r--. 1 root root     916 Jul 24 20:17 NIOSocket.java
-rw-r--r--. 1 root root   12828 Jul 24 20:21 out.51588
-rw-r--r--. 1 root root 1278064 Jul 24 20:21 out.51589
-rw-r--r--. 1 root root   13861 Jul 24 20:22 out.51590
-rw-r--r--. 1 root root    1614 Jul 24 20:21 out.51591
-rw-r--r--. 1 root root    1558 Jul 24 20:21 out.51592
-rw-r--r--. 1 root root    1145 Jul 24 20:21 out.51593
-rw-r--r--. 1 root root    9446 Jul 24 20:22 out.51594
-rw-r--r--. 1 root root    1190 Jul 24 20:21 out.51595
-rw-r--r--. 1 root root  387057 Jul 24 20:22 out.51596
4. Connect with nc to test
[root@192 zerocopy]# nc localhost 8080
index hello
5. Inspect the out.51589 file
...
socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 5
...
bind(5, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
listen(5, 50) = 0
...
...
accept(5, {sa_family=AF_INET6, sin6_port=htons(53122), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [28]) = 6
...
open("index.html", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 7
fstat64(7, {st_mode=S_IFREG|0644, st_size=12, ...}) = 0
stat64("index.html", {st_mode=S_IFREG|0644, st_size=12, ...}) = 0
read(7, "index hello ", 12) = 12
...
...
send(6, "index hello ", 12, 0) = 12
...
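6. We call read to load the contents of index.html into a byte array, then call write to send those bytes to the socket. Underneath these two calls, the OS performs the following operations:
The upper half of the diagram shows the context switches between user mode and kernel mode, and the lower half shows the data copies:
1. The read call causes a switch from user mode to kernel mode, and the first copy begins: the DMA engine (Direct Memory Access: the DMA engine, rather than the CPU, moves the data into memory, which frees the CPU) reads index.html from disk and places the data in the kernel buffer.
2. The second copy takes place: the data is copied from the kernel buffer to the user buffer, and a context switch from kernel mode back to user mode occurs.
3. The third copy takes place: we call write, and the system copies the data from the user buffer to the socket buffer. This triggers another switch from user mode to kernel mode.
4. The fourth copy: the DMA engine moves the data asynchronously from the socket buffer to the network protocol engine. This step requires no context switch.
5. write returns, switching from kernel mode back to user mode once more.
In total this path involves 4 copies (2 DMA copies and 2 CPU copies) and 4 mode switches. Doing better requires the kernel to offer more efficient system calls.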
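The read-then-write path described above is what an ordinary stream copy loop looks like in Java. The following is a minimal sketch, not the code used in the test: the class name BufferedCopy and the 8192-byte buffer size are illustrative choices, and in-memory streams stand in for the file and the socket so the example runs anywhere. Each pass through the loop pulls data into a user-space array and then pushes it back out, which is where the two CPU copies come from when the streams are a real file and a real socket.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical name): the classic copy through a user-space buffer.
final class BufferedCopy {

    // Copies everything from in to out and returns the number of bytes moved.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] userBuffer = new byte[8192]; // user-space buffer; the size is an arbitrary choice
        long total = 0;
        int n;
        // read(): kernel buffer -> user buffer; write(): user buffer -> destination buffer
        while ((n = in.read(userBuffer)) != -1) {
            out.write(userBuffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for the file and the socket.
        byte[] data = "index hello\n".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), sink);
        System.out.println("copied " + copied + " bytes");
    }
}
```

Unlike a single read into a file-sized array, the loop also copes with reads that return fewer bytes than requested.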
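3. Optimization with mmap
Through memory mapping, mmap maps the file into the kernel buffer, and user space shares the data held in kernel space. The principle is to map the memory address of the user buffer and the memory address of the kernel buffer onto each other, so that a program in user mode can read and operate on kernel-space data directly. As a result, a network transfer needs fewer copies from kernel space to user space.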
MMAP(2)                  Linux Programmer's Manual                  MMAP(2)

NAME
       mmap, munmap - map or unmap files or devices into memory

SYNOPSIS
       #include <sys/mman.h>

       void *mmap(void *addr, size_t length, int prot, int flags,
                  int fd, off_t offset);
       int munmap(void *addr, size_t length);

       See NOTES for information on feature test macro requirements.

DESCRIPTION
       mmap() creates a new mapping in the virtual address space of the
       calling process. The starting address for the new mapping is
       specified in addr. The length argument specifies the length of the
       mapping.
       ...

RETURN VALUE
       On success, mmap() returns a pointer to the mapped area. On error,
       the value MAP_FAILED (that is, (void *) -1) is returned, and errno
       is set appropriately. On success, munmap() returns 0, on failure -1,
       and errno is set (probably to EINVAL).
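The mmap flow is as follows:
The user buffer and the kernel buffer share index.html. To send index.html from disk to the network, the data no longer has to be copied into user space and then copied again from user space into the socket buffer.
Now it only has to be copied from the kernel buffer to the socket buffer. This removes one memory copy (from 4 down to 3) but does not reduce the number of context switches.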
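In Java, the mmap path is reached through FileChannel.map. The sketch below assumes a hypothetical helper named MmapSend.sendMapped; it maps a file read-only and writes the mapped buffer to any WritableByteChannel. In a server the destination would be a SocketChannel; here an in-memory channel stands in so the example runs without a network.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative helper (hypothetical name): send a file by mapping it instead of reading it.
final class MmapSend {

    // Maps the whole file and writes the mapping to out; returns the bytes written.
    static long sendMapped(Path file, WritableByteChannel out) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long written = 0;
            // A channel write may be partial, so keep writing until the buffer is drained.
            while (mapped.hasRemaining()) {
                written += out.write(mapped);
            }
            return written;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("index", ".html");
        file.toFile().deleteOnExit();
        Files.write(file, "index hello\n".getBytes(StandardCharsets.US_ASCII));

        // An in-memory channel stands in for the SocketChannel of a real server.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long sent = sendMapped(file, Channels.newChannel(sink));
        System.out.println("sent " + sent + " bytes via a mapped buffer");
    }
}
```

No explicit read into a Java array appears in the helper itself; the mapping gives the program a view of the file's pages. The in-memory sink used in main does copy bytes internally, which a real socket write would handle in the kernel instead.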
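4. Optimization with sendFile
Linux introduced the sendfile system call in the 2.1 development kernels (the man page lists it as new in Linux 2.2). The basic idea: the data never passes through user space and goes straight from the kernel buffer into the socket buffer. Because the whole transfer is one system call that never touches user space, it also needs fewer context switches than a read followed by a write.
The structure is shown in the figure below:
During the sendFile system call, the DMA engine copies the data from the file into the kernel buffer, and the kernel then copies it from the kernel buffer into the socket buffer. No context switch happens at this point, because both buffers are in kernel space.
Finally, the data moves from the socket buffer into the protocol stack.
At this point the data has gone through 3 copies and 2 context switches (one into the kernel when sendFile is called, one back when it returns).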
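5. Further optimization of sendFile
In version 2.4, Linux changed the implementation to avoid the copy from the kernel buffer to the socket buffer; the data goes directly to the protocol stack, which removes one more data copy.
The sendfile man page reads: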
SENDFILE(2)              Linux Programmer's Manual              SENDFILE(2)

NAME
       sendfile - transfer data between file descriptors

SYNOPSIS
       #include <sys/sendfile.h>

       ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

DESCRIPTION
       sendfile() copies data between one file descriptor and another.
       Because this copying is done within the kernel, sendfile() is more
       efficient than the combination of read(2) and write(2), which would
       require transferring data to and from user space.

       in_fd should be a file descriptor opened for reading and out_fd
       should be a descriptor opened for writing.

       If offset is not NULL, then it points to a variable holding the file
       offset from which sendfile() will start reading data from in_fd.
       When sendfile() returns, this variable will be set to the offset of
       the byte following the last byte that was read. If offset is not
       NULL, then sendfile() does not modify the current file offset of
       in_fd; otherwise the current file offset is adjusted to reflect the
       number of bytes read from in_fd.

       If offset is NULL, then data will be read from in_fd starting at the
       current file offset, and the file offset will be updated by the call.

       count is the number of bytes to copy between the file descriptors.

       The in_fd argument must correspond to a file which supports
       mmap(2)-like operations (i.e., it cannot be a socket).

       In Linux kernels before 2.6.33, out_fd must refer to a socket. Since
       Linux 2.6.33 it can be any file. If it is a regular file, then
       sendfile() changes the file offset appropriately.

RETURN VALUE
       If the transfer was successful, the number of bytes written to
       out_fd is returned. On error, -1 is returned, and errno is set
       appropriately.
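The details are shown in the figure below:
For index.html to get from the file to the network protocol stack, only 2 copies are needed: first the DMA engine copies the data from the file into the kernel buffer, then the data is copied from the kernel buffer to the network protocol stack. The kernel buffer only copies some offset and length information into the socket buffer, which costs next to nothing. In other words a CPU copy still exists, but it only covers the file's address, offset and similar metadata, so it can be ignored.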
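Differences between mmap and sendFile:
1. mmap suits reading and writing small amounts of data; sendFile suits transferring large files.
2. mmap needs 4 context switches and 3 data copies; sendFile needs 2 context switches (it is a single system call) and as few as 2 data copies.
3. sendFile can rely on DMA to eliminate the CPU copy; mmap cannot (it must copy from the kernel buffer to the socket buffer).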
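The copy counts quoted throughout this article can be tallied in a few lines. The sketch below is only bookkeeping, assuming the figures given in the text (2 DMA copies on every path, and 2, 1, 1 and 0 CPU copies respectively); the names CopyTally and Path are made up for the illustration, and the small metadata copy of the Linux 2.4 sendfile path is counted as zero, as the text suggests.

```java
// Illustrative bookkeeping (hypothetical names): copy counts per transfer path, as quoted in the text.
final class CopyTally {

    enum Path {
        READ_WRITE(2, 2),       // read() + write()
        MMAP_WRITE(2, 1),       // mmap() + write()
        SENDFILE(2, 1),         // original sendfile
        SENDFILE_LINUX_24(2, 0); // sendfile since Linux 2.4; the offset/length copy is ignored

        final int dmaCopies;
        final int cpuCopies;

        Path(int dmaCopies, int cpuCopies) {
            this.dmaCopies = dmaCopies;
            this.cpuCopies = cpuCopies;
        }

        int totalCopies() {
            return dmaCopies + cpuCopies;
        }
    }

    public static void main(String[] args) {
        for (Path p : Path.values()) {
            System.out.println(p + ": " + p.totalCopies() + " copies ("
                    + p.dmaCopies + " DMA, " + p.cpuCopies + " CPU)");
        }
    }
}
```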
6. Test with NIO
1. The code is as follows:
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.nio.channels.FileChannel;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NIOSocket {

    public static void main(String[] args) throws Exception {
        ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
        ServerSocket serverSocket = serverSocketChannel.socket();
        // Set SO_REUSEADDR before binding; setting it after bind has no defined effect
        serverSocket.setReuseAddress(true);
        serverSocket.bind(new InetSocketAddress(8080));
        System.out.println("serverSocketChannel init 8080 !!!");
        while (true) {
            // try-with-resources closes the connection once the file has been sent
            try (SocketChannel socketChannel = serverSocketChannel.accept()) {
                System.out.println("客户端连接成功: " + socketChannel.getRemoteAddress());
                // The file to send
                File file = new File("index.html");
                try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                    FileChannel channel = raf.getChannel();
                    long size = channel.size();
                    System.out.println("ready reansfer to !");
                    // Hand the transfer to the kernel (sendfile on Linux)
                    channel.transferTo(0, size, socketChannel);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
2. Test with nc
[root@192 zerocopy]# nc localhost 8080
index hello
3. Check the log printed by the main thread
strace -ff -o out ../jdk8/jdk1.8.0_291/bin/java NIOSocket
serverSocketChannel init 8080 !!!
客户端连接成功: /0:0:0:0:0:0:0:1:53126
ready reansfer to !
4. List the out files
[root@192 zerocopy]# ll
total 3064
-rw-r--r--. 1 root root      12 Jul 23 04:25 index.html
-rw-r--r--. 1 root root    1824 Jul 24 21:55 NIOSocket.class
-rw-r--r--. 1 root root    1410 Jul 24 21:54 NIOSocket.java
-rw-r--r--. 1 root root   13099 Jul 24 21:58 out.52182
-rw-r--r--. 1 root root 1490301 Jul 24 21:58 out.52183
-rw-r--r--. 1 root root   56765 Jul 24 21:58 out.52184
-rw-r--r--. 1 root root    1688 Jul 24 21:58 out.52185
-rw-r--r--. 1 root root    1626 Jul 24 21:58 out.52186
-rw-r--r--. 1 root root    5189 Jul 24 21:58 out.52187
-rw-r--r--. 1 root root   19040 Jul 24 21:58 out.52188
-rw-r--r--. 1 root root    1175 Jul 24 21:58 out.52189
-rw-r--r--. 1 root root 1509105 Jul 24 21:58 out.52190
-rw-r--r--. 1 root root    7216 Jul 24 21:58 out.52221
5. Inspect the out.52183 file
...
socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4
setsockopt(4, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
...
bind(4, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
listen(4, 50)
...
accept(4, {sa_family=AF_INET6, sin6_port=htons(53126), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [28]) = 6
fcntl64(6, F_GETFL) = 0x2 (flags O_RDWR)
...
open("index.html", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 7
fstat64(7, {st_mode=S_IFREG|0644, st_size=12, ...}) = 0
...
sendfile64(6, 7, [0] => [12], 12) = 12
...
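As the trace shows, the output is ultimately done by the sendfile64 call. What gets passed is just the two fds, the offset, and the size.
NIO implements zero-copy through the transferTo() method, which transfers data from a FileChannel to a writable byte channel (such as a SocketChannel). Internally it is implemented by the native method transferTo0(), which depends on support from the underlying operating system. On UNIX and Linux systems, calling this method results in a sendfile() system call.
The signature of sun.nio.ch.FileChannelImpl#transferTo0 is: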
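One detail worth knowing when calling transferTo(): it returns the number of bytes actually transferred, which may be fewer than requested, so robust code usually calls it in a loop. The following is a minimal sketch under that assumption; the helper name TransferAll.transferAll is made up for the illustration, and an in-memory channel replaces the socket, so on this destination the JDK falls back to an ordinary copy rather than sendfile.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative helper (hypothetical name): repeat transferTo until the whole file is sent.
final class TransferAll {

    // Transfers the entire file to dst and returns the number of bytes sent.
    static long transferAll(FileChannel src, WritableByteChannel dst) throws IOException {
        long position = 0;
        long size = src.size();
        while (position < size) {
            long sent = src.transferTo(position, size - position, dst);
            if (sent <= 0) {
                // No progress (for example the file shrank); stop instead of spinning.
                break;
            }
            position += sent;
        }
        return position;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("index", ".html");
        file.toFile().deleteOnExit();
        Files.write(file, "index hello\n".getBytes(StandardCharsets.US_ASCII));

        // An in-memory channel stands in for the SocketChannel used by the server above.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long sent = transferAll(channel, Channels.newChannel(sink));
            System.out.println("transferred " + sent + " bytes");
        }
    }
}
```

With a non-blocking SocketChannel as the destination, a return value of 0 can simply mean the socket buffer is full, so a real server would wait for writability instead of breaking out of the loop.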
// Transfers from src to dst, or returns -2 if kernel can't do that
private native long transferTo0(FileDescriptor src, long position, long count, FileDescriptor dst);
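Typical use cases:
1. The file is large, reads and writes are slow, and speed matters
2. Memory is limited, so large data cannot be loaded into it
3. Bandwidth is limited, i.e. other programs or threads are doing heavy I/O and leave little bandwidth
All of the above assumes the file data does not need to be modified. What if you need this speed and also need to operate on the data? Then use NIO direct memory!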
Supplement: modifying data through NIO direct memory
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NIOSocket {

    public static void main(String[] args) throws Exception {
        ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
        ServerSocket serverSocket = serverSocketChannel.socket();
        // Set SO_REUSEADDR before binding; setting it after bind has no defined effect
        serverSocket.setReuseAddress(true);
        serverSocket.bind(new InetSocketAddress(8080));
        System.out.println("serverSocketChannel init 8080 !!!");
        while (true) {
            try (SocketChannel socketChannel = serverSocketChannel.accept()) {
                System.out.println("客户端连接成功: " + socketChannel.getRemoteAddress());
                // The file to send (relative path, matching the open("index.html") call in the trace below)
                File file = new File("index.html");
                try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                    FileChannel channel = raf.getChannel();
                    System.out.println("channel.map start ... ");
                    // Mapping 2 bytes past the end grows the file by 2 bytes
                    MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, channel.size() + 2);
                    // Write one char (2 bytes) into the new tail; the change to the mapping goes straight to the underlying file
                    buffer.putChar((int) (channel.size() - 2), 'C');
                    System.out.println("channel.map end ... ");
                    long size = channel.size();
                    System.out.println("ready reansfer to !");
                    channel.transferTo(0, size, socketChannel);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
The test goes as follows:
(1) Start the program under strace
strace -ff -o out ../jdk8/jdk1.8.0_291/bin/java NIOSocket
serverSocketChannel init 8080 !!!
(2) Connect with nc
[root@192 zerocopy]# nc localhost 8080
index hello
C
(3) Inspect the out file
...
socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4
setsockopt(4, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
...
bind(4, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
listen(4, 50)
...
accept(4, {sa_family=AF_INET6, sin6_port=htons(53132), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [28]) = 6
fcntl64(6, F_GETFL) = 0x2 (flags O_RDWR)
...
open("index.html", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 7
fstat64(7, {st_mode=S_IFREG|0644, st_size=12, ...}) = 0
...
mmap2(NULL, 14, PROT_READ|PROT_WRITE, MAP_SHARED, 7, 0) = 0xf7714000
...
sendfile64(6, 7, [0] => [14], 14) = 14
The mmap2 man page reads:
MMAP2(2)                 Linux Programmer's Manual                 MMAP2(2)

NAME
       mmap2 - map files or devices into memory

SYNOPSIS
       #include <sys/mman.h>

       void *mmap2(void *addr, size_t length, int prot,
                   int flags, int fd, off_t pgoffset);

DESCRIPTION
       This is probably not the system call you are interested; instead,
       see mmap(2), which describes the glibc wrapper function that invokes
       this system call.

       The mmap2() system call provides the same interface as mmap(2),
       except that the final argument specifies the offset into the file in
       4096-byte units (instead of bytes, as is done by mmap(2)). This
       enables applications that use a 32-bit off_t to map large files (up
       to 2^44 bytes).

RETURN VALUE
       On success, mmap2() returns a pointer to the mapped area. On error
       -1 is returned and errno is set appropriately.
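Direct memory (the mmap technique) maps the file directly into kernel-space memory and returns an address to operate on. This removes the need to copy file data into the JVM before working on it: operations happen directly on the mapped memory, skipping the copy from kernel space to user space.
NIO direct memory is implemented by MappedByteBuffer. At its core is the map() method, which maps the file into memory, obtains the memory address addr, and then constructs a MappedByteBuffer from that addr, exposing the various file operation APIs.
Because MappedByteBuffer uses off-heap memory, it is not managed by Minor GC and can only be reclaimed when a Full GC occurs. DirectByteBuffer improves on this: it is a subclass of MappedByteBuffer, it also implements the DirectBuffer interface, and it maintains a Cleaner object to release the memory. Its memory can therefore be reclaimed either by a Full GC or by calling clean().
MappedByteBuffer has a sibling in NIO called HeapByteBuffer. As the name suggests, it allocates memory on the heap and is essentially a byte array. Because it lives on the heap, it is managed by the GC and is easy to reclaim.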
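To see map() in isolation, without the socket server around it, the following sketch edits a file through a mapping and reads it back. The class and method names (MappedEdit, overwriteFirstByte) are made up for the example, and it assumes the file is not empty, since writing to index 0 of an empty mapping would throw.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative helper (hypothetical name): change a file in place through a MappedByteBuffer.
final class MappedEdit {

    // Overwrites the first byte of a non-empty file via a read-write mapping.
    static void overwriteFirstByte(Path file, byte value) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, channel.size());
            mapped.put(0, value); // the write lands in the mapped page, not in a Java array
            mapped.force();       // ask the OS to flush the change to the storage device
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("index", ".html");
        file.toFile().deleteOnExit();
        Files.write(file, "index hello\n".getBytes(StandardCharsets.US_ASCII));

        overwriteFirstByte(file, (byte) 'I');

        // Reading the file back through the normal API shows the modified content.
        System.out.print(new String(Files.readAllBytes(file), StandardCharsets.US_ASCII));
    }
}
```

The file is never read into or written from a byte array by the helper; the only data access is the single put on the mapping.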
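The difference between heap and direct buffers can be observed from the public ByteBuffer API alone. The sketch below uses a made-up helper, BufferKinds.describe, to report whether a buffer is direct and whether it exposes a backing array; the buffer size of 16 is arbitrary.

```java
import java.nio.ByteBuffer;

// Illustrative helper (hypothetical name): tell heap buffers and direct buffers apart.
final class BufferKinds {

    // Describes where the buffer lives and whether a Java array backs it.
    static String describe(ByteBuffer buffer) {
        String kind = buffer.isDirect() ? "direct" : "heap";
        return kind + ", backing array: " + buffer.hasArray();
    }

    public static void main(String[] args) {
        // allocate() returns a heap buffer wrapped around a byte[]
        System.out.println(describe(ByteBuffer.allocate(16)));
        // allocateDirect() returns an off-heap buffer with no accessible byte[]
        System.out.println(describe(ByteBuffer.allocateDirect(16)));
    }
}
```

A buffer returned by FileChannel.map() also reports isDirect() as true, which matches the class relationship described above.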
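Supplement: the difference between read/write and recv/send
read() reads content from fd. On success it returns the number of bytes actually read; a return value of 0 means the end of the file has been reached, and a value less than 0 means an error occurred.
write() writes nbytes bytes from buf to the file descriptor fd. On success it returns the number of bytes written; on failure it returns -1 and sets errno.
recv and send provide much the same functionality as read and write, but the fd they read from or write to is a socket file descriptor, and they take a fourth argument, flags, to control the read or write operation.
On Linux, man 2 cmd shows the following for each call: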
1. read
READ(2)                  Linux Programmer's Manual                  READ(2)

NAME
       read - read from a file descriptor

SYNOPSIS
       #include <unistd.h>

       ssize_t read(int fd, void *buf, size_t count);

DESCRIPTION
       read() attempts to read up to count bytes from file descriptor fd
       into the buffer starting at buf.

       On files that support seeking, the read operation commences at the
       current file offset, and the file offset is incremented by the
       number of bytes read. If the current file offset is at or past the
       end of file, no bytes are read, and read() returns zero.

       If count is zero, read() may detect the errors described below. In
       the absence of any errors, or if read() does not check for errors, a
       read() with a count of 0 returns zero and has no other effects.

       If count is greater than SSIZE_MAX, the result is unspecified.

RETURN VALUE
       On success, the number of bytes read is returned (zero indicates end
       of file), and the file position is advanced by this number. It is
       not an error if this number is smaller than the number of bytes
       requested; this may happen for example because fewer bytes are
       actually available right now (maybe because we were close to
       end-of-file, or because we are reading from a pipe, or from a
       terminal), or because read() was interrupted by a signal. On error,
       -1 is returned, and errno is set appropriately. In this case it is
       left unspecified whether the file position (if any) changes.
2. write
WRITE(2)                 Linux Programmer's Manual                 WRITE(2)

NAME
       write - write to a file descriptor

SYNOPSIS
       #include <unistd.h>

       ssize_t write(int fd, const void *buf, size_t count);

DESCRIPTION
       write() writes up to count bytes from the buffer pointed buf to the
       file referred to by the file descriptor fd.

       The number of bytes written may be less than count if, for example,
       there is insufficient space on the underlying physical medium, or
       the RLIMIT_FSIZE resource limit is encountered (see setrlimit(2)),
       or the call was interrupted by a signal handler after having written
       less than count bytes. (See also pipe(7).)

       For a seekable file (i.e., one to which lseek(2) may be applied, for
       example, a regular file) writing takes place at the current file
       offset, and the file offset is incremented by the number of bytes
       actually written. If the file was open(2)ed with O_APPEND, the file
       offset is first set to the end of the file before writing. The
       adjustment of the file offset and the write operation are performed
       as an atomic step.

       POSIX requires that a read(2) which can be proved to occur after a
       write() has returned returns the new data. Note that not all file
       systems are POSIX conforming.

RETURN VALUE
       On success, the number of bytes written is returned (zero indicates
       nothing was written). On error, -1 is returned, and errno is set
       appropriately.
3. recv
RECV(2)                  Linux Programmer's Manual                  RECV(2)

NAME
       recv, recvfrom, recvmsg - receive a message from a socket

SYNOPSIS
       #include <sys/types.h>
       #include <sys/socket.h>

       ssize_t recv(int sockfd, void *buf, size_t len, int flags);

       ssize_t recvfrom(int sockfd, void *buf, size_t len, int flags,
                        struct sockaddr *src_addr, socklen_t *addrlen);

       ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags);

DESCRIPTION
       The recvfrom() and recvmsg() calls are used to receive messages from
       a socket, and may be used to receive data on a socket whether or not
       it is connection-oriented.

RETURN VALUE
       These calls return the number of bytes received, or -1 if an error
       occurred. In the event of an error, errno is set to indicate the
       error. The return value will be 0 when the peer has performed an
       orderly shutdown.
4. send
NAME
       send, sendto, sendmsg - send a message on a socket

SYNOPSIS
       #include <sys/types.h>
       #include <sys/socket.h>

       ssize_t send(int sockfd, const void *buf, size_t len, int flags);

       ssize_t sendto(int sockfd, const void *buf, size_t len, int flags,
                      const struct sockaddr *dest_addr, socklen_t addrlen);

       ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags);

DESCRIPTION
       The system calls send(), sendto(), and sendmsg() are used to
       transmit a message to another socket.