  • [SPDK/NVMe Storage Technology Analysis] 005: DPDK Overview

    Note: the article below was translated side by side in Chinese and English because SPDK relies heavily on the DPDK implementation.

    Introduction to DPDK: Architecture and Principles

    Linux network stack performance has become increasingly relevant over the past few years. This is perfectly understandable: the amount of data that can be transferred over a network and the corresponding workload has been growing not by the day, but by the hour.

    Not even the widespread use of 10 GE network cards has resolved this issue; this is because a lot of bottlenecks that prevent packets from being quickly processed are found in the Linux kernel itself.

    There have been many attempts to circumvent these bottlenecks with techniques known as kernel bypasses (a short description can be found here). They let you process packets without involving the Linux network stack, so that an application running in user space communicates directly with the network device. In today's article we'd like to discuss one of these solutions: the Intel DPDK (Data Plane Development Kit).

    A lot of posts have already been published about the DPDK and in a variety of languages. Although many of these are fairly informative, they don’t answer the most important questions: How does the DPDK process packets and what route does the packet take from the network device to the user?

    Finding the answers to these questions was not easy; since we couldn't find everything we needed in the official documentation, we had to look through a myriad of additional materials and thoroughly review their sources. But first things first: before talking about the DPDK and the issues it can help resolve, we should review how packets are processed in Linux.

    Processing Packets in Linux: Main Stages

    When a network card first receives a packet, it sends it to a receive queue, or RX. From there, it gets copied to the main memory via the DMA (Direct Memory Access) mechanism.

    Afterwards, the system needs to be notified of the new packet and pass the data on to a specially allocated buffer (Linux allocates such a buffer for every packet). To do this, Linux uses an interrupt mechanism: an interrupt is raised when a new packet enters the system. The packet then needs to be transferred to user space.

    One bottleneck is already apparent: as more packets have to be processed, more resources are consumed, which negatively affects the overall system performance.

    As we've already said, these packets are saved to specially allocated buffers, more specifically sk_buff structs. One such struct is allocated for each packet and freed once the packet reaches user space. This operation consumes a lot of bus cycles (i.e. cycles that transfer data between the CPU and main memory).

    There is another problem with the sk_buff struct: the Linux network stack was originally designed to be compatible with as many protocols as possible. As such, metadata for all of these protocols is included in the sk_buff struct, even though most of it is not needed for processing any specific packet. Because of this overly complicated struct, processing is slower than it could be.

    Another factor that negatively affects performance is context switching. When an application in the user space needs to send or receive a packet, it executes a system call. The context is switched to kernel mode and then back to user mode. This consumes a significant amount of system resources.

    To solve some of these problems, all Linux kernels since version 2.6 have included NAPI (New API), which combines interrupts with polling. Let's take a quick look at how this works.

    The network card first works in interrupt mode, but as soon as a packet enters the network interface, it registers itself in a poll queue and disables the interrupt. The system periodically checks the queue for new devices and gathers packets for further processing. As soon as the packets are processed, the card will be deleted from the queue and interrupts are again enabled.
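    The switch between interrupt mode and polling can be sketched as a small state machine. This is a simplified model in C, not kernel code: the struct nic type and the functions nic_packet_arrives and napi_poll are hypothetical names chosen for illustration, and the budget parameter is an assumed per-poll limit on the number of packets handled.

```c
#include <assert.h>

/* Hypothetical model of one network card under a NAPI-like scheme. */
struct nic {
    int irq_enabled;      /* 1 while the card may raise RX interrupts */
    int on_poll_list;     /* 1 while the card is registered for polling */
    unsigned pending;     /* packets waiting in the card's RX queue */
    unsigned irq_count;   /* interrupts actually raised so far */
};

/* A packet arrives. Only the first packet of a batch raises an interrupt;
 * the handler then disables interrupts and registers the card for polling. */
static void nic_packet_arrives(struct nic *n)
{
    n->pending++;
    if (n->irq_enabled) {
        n->irq_count++;
        n->irq_enabled = 0;
        n->on_poll_list = 1;
    }
}

/* Periodic poll: process up to `budget` packets. When the queue is drained,
 * the card leaves the poll list and interrupts are enabled again. */
static unsigned napi_poll(struct nic *n, unsigned budget)
{
    unsigned done = 0;

    assert(budget > 0);
    if (!n->on_poll_list)
        return 0;
    while (done < budget && n->pending > 0) {
        n->pending--;
        done++;
    }
    if (n->pending == 0) {
        n->on_poll_list = 0;
        n->irq_enabled = 1;
    }
    return done;
}
```

    In this model a burst of packets costs a single interrupt, no matter how many packets arrive before the queue is drained, which is the saving that NAPI is after.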

    This has been just a cursory description of how packets are processed. A more detailed look at this process can be found in an article series from Private Internet Access. However, even a quick glance is enough to see the problems that slow down packet processing. In the next section, we'll describe how these problems are solved using DPDK.

    DPDK: How It Works

    General Features

    Let's look at the following illustration:

    On the left you see the traditional way packets are processed, and on the right - with DPDK. As we can see, the kernel in the second example doesn’t step in at all: interactions with the network card are performed via special drivers and libraries.

    If you've already read about DPDK or have ever used it, then you know that the ports receiving incoming traffic on network cards need to be unbound from Linux (that is, from the kernel driver). This is done with the dpdk-devbind.py script, which was called dpdk_nic_bind.py in earlier versions.

    How are ports then managed by DPDK? Every driver in Linux has bind and unbind files. That includes network card drivers:

    ls /sys/bus/pci/drivers/ixgbe
    bind  module  new_id  remove_id  uevent  unbind

    To unbind a device from a driver, the device's bus ID (for a PCI device, its PCI address, such as 0000:01:00.0) needs to be written to the driver's unbind file. Similarly, to bind the device to another driver, the same bus ID needs to be written to that driver's bind file. More detailed information about this can be found here.
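    What the bind script does with these files can be sketched in a few lines of C. This is a minimal illustration, assuming the sysfs layout shown in the listing above; write_sysfs, driver_file, and rebind_device are hypothetical helper names, the root parameter exists only so the code can be pointed at a scratch directory, and the PCI address in the usage note is an example value.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write a string to a sysfs-style file. Returns 0 on success, -1 on error. */
static int write_sysfs(const char *path, const char *value)
{
    assert(path != NULL && value != NULL);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)           /* the write may only fail on flush */
        ok = 0;
    return ok ? 0 : -1;
}

/* Build "<root>/bus/pci/drivers/<driver>/<file>" into buf. */
static int driver_file(char *buf, size_t len, const char *root,
                       const char *driver, const char *file)
{
    int n = snprintf(buf, len, "%s/bus/pci/drivers/%s/%s", root, driver, file);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Detach the device from old_driver, then attach it to new_driver. */
static int rebind_device(const char *root, const char *pci_addr,
                         const char *old_driver, const char *new_driver)
{
    char path[512];

    if (driver_file(path, sizeof path, root, old_driver, "unbind") != 0 ||
        write_sysfs(path, pci_addr) != 0)
        return -1;
    if (driver_file(path, sizeof path, root, new_driver, "bind") != 0 ||
        write_sysfs(path, pci_addr) != 0)
        return -1;
    return 0;
}
```

    A call such as rebind_device("/sys", "0000:01:00.0", "ixgbe", "uio_pci_generic") would need root privileges. In practice the new driver usually also has to be told to accept the device (for example through its new_id file), a step this sketch leaves out.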

    The DPDK installation instructions tell us that our ports need to be managed by the vfio_pci, igb_uio, or uio_pci_generic driver. (We won't be getting into details here, but we suggest that interested readers look at the following articles on kernel.org: 1 and 2.)

    These drivers make it possible to interact with devices in the user space. Of course they include a kernel module, but that's just to initialize devices and assign the PCI interface.

    All further communication between the application and network card is organized by the DPDK poll mode driver (PMD). DPDK has poll mode drivers for all supported network cards and virtual devices.

    The DPDK also requires hugepages to be configured. They are needed for allocating large chunks of memory and writing data to them. Loosely speaking, hugepage memory plays the role in DPDK that kernel-allocated DMA buffers play in traditional packet processing: it is where the network card places incoming packets.

    We'll discuss all of its nuances in more detail, but for now, let's go over the main stages of packet processing with the DPDK:

    1. Incoming packets go to a ring buffer (we'll look at its setup in the next section). The application periodically checks this buffer for new packets.
    2. If the buffer contains new packet descriptors, the application uses the pointers in those descriptors to access the DPDK packet buffers, which reside in a specially allocated memory pool.
    3. If the ring buffer does not contain any packets, the application polls the network devices managed by DPDK and then checks the ring again.
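    The stages above can be sketched as a self-contained simulation in C. It does not use the DPDK API (the real receive call is rte_eth_rx_burst); struct pkt_buf, struct rx_desc, struct rx_ring, nic_deliver, and rx_burst are hypothetical names, and the ring size of 8 is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RX_RING_SIZE 8u                 /* power of two, so indices can be masked */
#define RX_RING_MASK (RX_RING_SIZE - 1u)

struct pkt_buf {                        /* stands in for a DPDK packet buffer */
    uint16_t len;
    uint8_t  data[64];
};

struct rx_desc {                        /* descriptor: a pointer into the buffer pool */
    struct pkt_buf *buf;
};

struct rx_ring {
    struct rx_desc desc[RX_RING_SIZE];
    uint32_t head;                      /* next slot the simulated NIC fills */
    uint32_t tail;                      /* next slot the application reads */
};

/* Simulated NIC side: publish a filled buffer. Returns 0, or -1 if the ring is full. */
static int nic_deliver(struct rx_ring *r, struct pkt_buf *b)
{
    if (r->head - r->tail == RX_RING_SIZE)
        return -1;
    r->desc[r->head & RX_RING_MASK].buf = b;
    r->head++;
    return 0;
}

/* Application side: collect up to `max` packets without blocking and
 * without interrupts. A return value of 0 means "nothing yet, poll again". */
static unsigned rx_burst(struct rx_ring *r, struct pkt_buf **out, unsigned max)
{
    unsigned n = 0;

    assert(out != NULL);
    while (n < max && r->tail != r->head) {
        out[n++] = r->desc[r->tail & RX_RING_MASK].buf;
        r->tail++;
    }
    return n;
}
```

    An application built this way spends its time in a loop that calls rx_burst, processes whatever came back, and calls it again. The packet data itself is never copied; only pointers move through the ring.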

    Let's take a closer look at the DPDK's internal structure.

    EAL: Environment Abstraction

    The EAL, or Environment Abstraction Layer, is the main concept behind the DPDK.

    The EAL is a set of programming tools that let the DPDK work in a specific hardware environment and under a specific operating system. In the official DPDK repository, libraries and drivers that are part of the EAL are saved in the rte_eal directory.

    Drivers and libraries for Linux and BSD systems are saved in this directory. It also contains a set of header files for various processor architectures: ARM, x86, TILE64, and PPC64.

    The EAL variant to use is selected when we build the DPDK from source:

    make config T=x86_64-native-linuxapp-gcc

    As the target name suggests, this command configures a DPDK build for Linux on the x86_64 architecture, using GCC.

    The EAL is what binds the DPDK to applications. All of the applications that use the DPDK (see here for examples) must include the EAL's header files.

    The most commonly used of these include:

    • rte_lcore.h -- manages processor cores and sockets;
    • rte_memory.h -- manages memory;
    • rte_pci.h -- provides the interface for accessing the PCI address space;
    • rte_debug.h -- provides trace and debug functions (logging, dump_stack, and more);
    • rte_interrupts.h -- processes interrupts.

    More details on this structure and EAL functions can be found in the official documentation.

    Managing Queues: rte_ring

    As we've already said, packets received by the network card are sent to a ring buffer, which acts as a receive queue. Packets received in the DPDK are also sent to a queue, one implemented with the rte_ring library. The description of the library below is based on the developer's guide and on comments in the source code.

    The rte_ring was derived from the FreeBSD ring buffer. If you look at the source code, you'll see the following comment: Derived from FreeBSD's bufring.c.

    The queue is a lockless ring buffer built on the FIFO (First In, First Out) principle. The ring buffer is a table of pointers to objects stored in memory. Four indices track positions in this table: prod_tail, prod_head, cons_tail, and cons_head.

    Prod is short for producer, and cons for consumer. The producer is the process that writes data to the buffer at a given time, and the consumer is the process that removes data from the buffer.

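    Both the producer and the consumer keep a head and a tail. The head marks where the next operation begins and is moved first, to reserve slots; the tail follows once the operation is complete, which makes the result visible to the other side.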

    The idea behind adding and removing elements is as follows: when a new object is added to the queue, ring->prod_head is first moved forward to reserve a slot, the object pointer is written into that slot, and finally ring->prod_tail is moved so that it points to the same location as ring->prod_head. Removing an object updates ring->cons_head and ring->cons_tail in the same way.
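    The order in which the indices move can be shown with a single-threaded sketch in C. This is not the rte_ring implementation: struct ring, ring_enqueue, and ring_dequeue are hypothetical names, the size of 4 is arbitrary, and the atomic operations that make the real ring safe for several producers or consumers are left out on purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4u                    /* must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

struct ring {
    uint32_t prod_head, prod_tail;      /* producer indices */
    uint32_t cons_head, cons_tail;      /* consumer indices */
    void *slots[RING_SIZE];             /* table of object pointers */
};

/* Enqueue one object. Returns 0 on success, -1 if the ring is full. */
static int ring_enqueue(struct ring *r, void *obj)
{
    uint32_t head = r->prod_head;
    uint32_t free_slots = RING_SIZE - (head - r->cons_tail);

    assert(obj != NULL);
    if (free_slots == 0)
        return -1;
    r->prod_head = head + 1;            /* 1. reserve the slot */
    r->slots[head & RING_MASK] = obj;   /* 2. write the pointer */
    r->prod_tail = r->prod_head;        /* 3. publish: the tail catches up */
    return 0;
}

/* Dequeue one object. Returns NULL if the ring is empty. */
static void *ring_dequeue(struct ring *r)
{
    uint32_t head = r->cons_head;
    void *obj;

    if (r->prod_tail == head)
        return NULL;
    r->cons_head = head + 1;            /* 1. reserve the entry */
    obj = r->slots[head & RING_MASK];   /* 2. read the pointer */
    r->cons_tail = r->cons_head;        /* 3. release the slot to producers */
    return obj;
}
```

    The indices are free-running 32-bit counters and are masked only when the table is accessed, which is why the size has to be a power of two. With several producers, step 1 is where a compare-and-set on prod_head would be needed.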

    This is just a brief description; a more detailed account of how the ring buffer behaves in different scenarios can be found in the developer's manual on the DPDK site.

    This approach has a number of advantages. Firstly, data is written to the buffer extremely quickly. Secondly, when adding or removing a large number of objects from the queue, cache misses occur much less frequently since pointers are saved in a table.

    The drawback of DPDK's ring buffer is its fixed size, which cannot be increased on the fly. Additionally, the ring structure uses much more memory than a linked queue, since the buffer always holds the maximum number of pointers.

    Memory Management: rte_mempool

    We mentioned above that DPDK requires hugepages. The installation instructions recommend creating 2MB hugepages.
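    To get a feel for the numbers, the following C sketch estimates how many hugepages a pool of packet buffers would occupy. The function name hugepages_needed and the figures in the usage note are assumptions for illustration; the estimate ignores per-pool headers and alignment padding.

```c
#include <assert.h>
#include <stdint.h>

#define HUGEPAGE_2MB (2ull * 1024 * 1024)

/* Number of hugepages needed to hold nb_objs objects of obj_size bytes,
 * rounded up to whole pages. */
static uint64_t hugepages_needed(uint64_t nb_objs, uint64_t obj_size,
                                 uint64_t page_size)
{
    assert(page_size > 0);
    uint64_t bytes = nb_objs * obj_size;
    return (bytes + page_size - 1) / page_size;
}
```

    For example, 8192 buffers of 2176 bytes each come to 17,825,792 bytes, so hugepages_needed(8192, 2176, HUGEPAGE_2MB) returns 9.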

    These pages are combined in segments, which are then divided into zones. Objects that are created by applications or other libraries, like queues and packet buffers, are placed in these zones.

    These objects include memory pools, which are created by the rte_mempool library. These are fixed size object pools that use rte_ring for storing free objects and can be identified by a unique name.

    Memory alignment techniques can be implemented to improve performance.

    Even though access to free objects is built on a lockless ring buffer, the cost may still be very high. As multiple cores have access to the ring, a compare-and-set (CAS) operation usually has to be performed each time it is accessed.

    To prevent this from becoming a bottleneck, every core is given an additional local cache in the memory pool. Each core has full access to its own cache of free objects, so most requests never touch the shared ring. Only when the cache is full or entirely empty does the memory pool exchange objects with the ring buffer. This gives the core fast access to frequently used objects.
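    The effect of the per-core cache can be illustrated with a simplified model in C. It is not the rte_mempool code: struct shared_pool stands in for the shared ring of free objects, struct local_cache for one core's cache, and the names, the sizes, and the accesses counter are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE  16u          /* objects managed by the shared pool */
#define CACHE_SIZE 4u           /* capacity of one core-local cache */

struct shared_pool {            /* stands in for the shared ring of free objects */
    void *objs[POOL_SIZE];
    unsigned count;
    unsigned accesses;          /* how often the shared structure was touched */
};

struct local_cache {            /* one per core, never shared between cores */
    void *objs[CACHE_SIZE];
    unsigned count;
};

/* Allocate: serve from the local cache, refilling in bulk only when it is empty. */
static void *pool_get(struct shared_pool *p, struct local_cache *c)
{
    if (c->count == 0) {
        p->accesses++;
        while (c->count < CACHE_SIZE && p->count > 0)
            c->objs[c->count++] = p->objs[--p->count];
        if (c->count == 0)
            return NULL;        /* pool exhausted */
    }
    return c->objs[--c->count];
}

/* Free: return to the local cache, flushing in bulk only when it is full. */
static void pool_put(struct shared_pool *p, struct local_cache *c, void *obj)
{
    assert(obj != NULL);
    if (c->count == CACHE_SIZE) {
        p->accesses++;
        while (c->count > 0 && p->count < POOL_SIZE)
            p->objs[p->count++] = c->objs[--c->count];
    }
    assert(c->count < CACHE_SIZE);  /* holds as long as only pool objects are freed */
    c->objs[c->count++] = obj;
}
```

    With a cache of four objects, the shared structure is touched on roughly every fourth request instead of on every one, which is where the expensive atomic operations are saved.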

    Buffer Management: rte_mbuf

    In the Linux network stack, all network packets are represented by the sk_buff data structure. In DPDK, this is done using the rte_mbuf struct, which is described in the rte_mbuf.h header file.

    The buffer management approach in DPDK is reminiscent of the approach used in FreeBSD: instead of one big sk_buff struct, there are many smaller rte_mbuf buffers. The buffers are created when the DPDK application starts up, before any packets are processed, and are stored in memory pools (memory is allocated by rte_mempool).

    In addition to its own packet data, each buffer contains metadata (message type, length, data segment starting address). The buffer also contains a pointer to the next buffer. This is needed when handling packets with large amounts of data. In such cases, several buffers can be chained together to hold a single packet (as is done in FreeBSD; more detailed information about this can be found here).
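    Buffer chaining can be sketched with a much reduced stand-in for rte_mbuf. In the C code below, struct seg is a hypothetical type that keeps only a next pointer and a data length; chain_len and chain_flatten are illustrative helpers and not DPDK functions, and the segment size of 2048 bytes is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SEG_ROOM 2048u

struct seg {                    /* simplified stand-in for a packet buffer */
    struct seg *next;           /* next segment of the same packet, or NULL */
    uint16_t data_len;          /* bytes of packet data held in this segment */
    uint8_t  data[SEG_ROOM];
};

/* Total length of a packet that is spread over a chain of segments. */
static uint32_t chain_len(const struct seg *s)
{
    uint32_t total = 0;

    for (; s != NULL; s = s->next)
        total += s->data_len;
    return total;
}

/* Copy a chained packet into one flat buffer.
 * Returns the number of bytes copied, or 0 if the packet does not fit. */
static uint32_t chain_flatten(const struct seg *s, uint8_t *out, uint32_t cap)
{
    uint32_t off = 0;

    if (chain_len(s) > cap)
        return 0;
    for (; s != NULL; s = s->next) {
        assert(s->data_len <= SEG_ROOM);
        memcpy(out + off, s->data, s->data_len);
        off += s->data_len;
    }
    return off;
}
```

    A 3200-byte packet, for instance, could be held by three such segments carrying 1500, 1500, and 200 bytes, linked through their next pointers.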

    Other Libraries: General Overview

    In previous sections, we talked about the most basic DPDK libraries. There's a great deal of other libraries, but one article isn't enough to describe them all. Thus, we'll be limiting ourselves to just a brief overview.

    With the LPM library, DPDK implements the Longest Prefix Match (LPM) algorithm, which can be used to forward packets based on their IPv4 address. The primary functions of this library are adding and deleting IP prefixes and looking up addresses using the LPM algorithm.

    A similar function can be performed for IPv6 addresses using the LPM6 library.
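    The matching rule itself is easy to show in C. The sketch below is a naive linear scan, not the table-based lookup the DPDK library uses; struct route, mask_for, lpm_lookup, and the IPV4 macro are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define IPV4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))

struct route {
    uint32_t prefix;        /* network address, host byte order */
    uint8_t  depth;         /* prefix length in bits, 0..32 */
    uint32_t next_hop;      /* opaque next-hop identifier */
};

/* Netmask for a given prefix length. A depth of 0 matches everything. */
static uint32_t mask_for(uint8_t depth)
{
    assert(depth <= 32);
    return depth == 0 ? 0u : 0xFFFFFFFFu << (32 - depth);
}

/* Find the matching route with the longest prefix.
 * Returns 0 and stores the next hop on success, -1 if nothing matches. */
static int lpm_lookup(const struct route *tbl, unsigned n, uint32_t ip,
                      uint32_t *next_hop)
{
    int best = -1;

    for (unsigned i = 0; i < n; i++) {
        uint32_t m = mask_for(tbl[i].depth);
        if ((ip & m) == (tbl[i].prefix & m) &&
            (best < 0 || tbl[i].depth > tbl[best].depth))
            best = (int)i;
    }
    if (best < 0)
        return -1;
    *next_hop = tbl[best].next_hop;
    return 0;
}
```

    With routes for 10.0.0.0/8 and 10.1.0.0/16 in the table, the address 10.1.2.3 matches both, and the /16 entry wins because its prefix is longer.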

    Other libraries offer similar functionality based on hash functions. With rte_hash, you can search through a large record set using a unique key. This library can be used for classifying and distributing packets, for example.
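    As an illustration of hash-based packet distribution, the C sketch below maps a flow key to one of several worker queues. It does not use rte_hash: the hash is plain 32-bit FNV-1a, swapped in here for simplicity, and struct flow_key, fnv1a, and flow_to_queue are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct flow_key {               /* the classic 5-tuple identifying a flow */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* 32-bit FNV-1a hash over a byte string. */
static uint32_t fnv1a(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Choose a queue for a flow. The key is serialized field by field so that
 * struct padding bytes never influence the hash. */
static unsigned flow_to_queue(const struct flow_key *k, unsigned nb_queues)
{
    uint8_t b[13];

    assert(nb_queues > 0);
    for (int i = 0; i < 4; i++) {
        b[i]     = (uint8_t)(k->src_ip >> (24 - 8 * i));
        b[4 + i] = (uint8_t)(k->dst_ip >> (24 - 8 * i));
    }
    b[8]  = (uint8_t)(k->src_port >> 8);
    b[9]  = (uint8_t)k->src_port;
    b[10] = (uint8_t)(k->dst_port >> 8);
    b[11] = (uint8_t)k->dst_port;
    b[12] = k->proto;
    return fnv1a(b, sizeof b) % nb_queues;
}
```

    Because the queue depends only on the key, all packets of one flow land on the same queue and are therefore handled in order by the same worker.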

    The rte_timer library lets you execute functions asynchronously. The timer can run once or periodically.

    Conclusion

    In this article we went over the internal structure and principles of DPDK. This is far from comprehensive, though; the subject is too complex and extensive to fit in one article. So sit tight: we will continue this topic in a future article, where we'll discuss the practical aspects of using DPDK.

    We'd be happy to answer your questions in the comments below. And if you've had any experience using DPDK, we'd love to hear your thoughts and impressions.

    For anyone interested in learning more, please visit the following links:

    • http://dpdk.org/doc/guides/prog_guide/ — a detailed (but confusing in some places) description of all the DPDK libraries;
    • https://www.net.in.tum.de/fileadmin/TUM/NET/NET-2014-08-1/NET-2014-08-1_15.pdf — a brief overview of DPDK's capabilities and comparison with other frameworks (netmap and PF_RING);
    • http://www.slideshare.net/garyachy/dpdk-44585840 — an introductory presentation to DPDK for beginners;
    • http://www.it-sobytie.ru/system/attachments/files/000/001/102/original/LinuxPiter-DPDK-2015.pdf — a presentation explaining the DPDK structure.

    Andrej Yemelianov 24 November 2016 Tags: DPDK, linux, network, network stacks, packet processing

  • Original article: https://www.cnblogs.com/vlhn/p/7754940.html