zoukankan      html  css  js  c++  java
  • [quote] Rough Notes on Linux Networking Stack

    From bing cache of blogspot

    cache URL: http://cc.bingj.com/cache.aspx?q=linux+netif+to+networking+stack&d=4954074211159852&mkt=en-US&setlang=en-US&w=i5H3dp0S6HHM2iHbpuKi2URA4nak-r4J

    original URL:  http://linuxinme.blogspot.com/2007/08/rough-notes-on-linux-networking-stack.html

    Rough Notes on Linux Networking Stack

     

    Table of Contents

    1. Existing Optimizations
    2. Packet Copies
    3. ICMP Ping/Pong : Function Calls
    4. Transmit Interrupts and Flow Control
    5. NIC driver callbacks and ifconfig
    6. Protocol Structures in the Kernel
    7. skb_clone() vs. skb_copy()
    8. NICs and Descriptor Rings
    9. How much networking work does the ksoftirqd do?
    10. Packet Requeues in Qdiscs
    11. Links
    12. Specific TODOs
    References

    1. Existing Optimizations

    A great deal of thought has gone into Linux networking
    implementation and many optmizations have made their way to the
    kernel over the years. Some prime examples include:
    * NAPI - Receive interrupts are coalesced to reduce changes of a
    livelock. Thus, now each packet receive does not generate an
    interrupt. Required modifications to device driver interface.
    Has been in the stable kernels since 2.4.20.
    * Zero-Copy TCP - Avoids the overhead of kernel-to-userspace and
    userspace-to-kernel packet copying.
    http://builder.com.com/5100-6372-1044112.html describes this is
    some detail.

    2. Packet Copies

    When a packet is received, the device uses DMA to put it in main
    memory (let's ignore non-DMA or non-NAPI code and drivers). An skb
    is constructed by the poll() function of the device driver. After
    this point, the same skb is used throughout the networking stack,
    i.e., the packet is almost never copied within the kernel (it is
    copied when delivered to user-space).

    This design is borrowed from BSD and UNIX SVR4 - the idea is to
    allocate memory for the packet only once. The packet has 4 primary
    pointers - head, end, data, tail into the packet data (character
    buffer). head points to the beginning of the packet - where the link
    layer header starts. end points to the end of the packet. data
    points to the location the current networking layer can start
    reading from (i.e., it changes as the packet moves up from the link
    layer, to IP, to TCP). Finally, tail is where the current protocol
    layer can begin writing data to (see alloc_skb(), which sets head,
    data, tail to the beginning of allocated memory block and end to
    data + size).

    Other implementations refer to head, end, data, tail as base, limit,
    read, write respectively.

    There are some instances where a packet needs to be duplicated. For
    example, when running tcpdump the packet needs to be sent to the
    userspace process as well as to the normal IP handler. Actually, int
    this case too, a copy can be avoided since the contents of the
    packet are not being modified. So instead of duplicating the packet
    contents, skb_clone() is used to increase the reference count of a
    packet. skb_copy() on the other hand actually duplicates the
    contents of the packet and creates a completely new skb.

    See also: http://oss.sgi.com/archives/netdev/2005-02/msg00125.html

    A related question: When a packet is received, are the tail and end
    pointers equal? Answer: NO. This is because memory for packets
    received is allocated before the packet is received, and the address
    and size of this memory is communicated to the NIC using receive
    descriptors - so that when it is actually received the NIC can use
    DMA to transfer the packet to main memory. The size allocated for a
    received packet is a function of the MTU of the device. The size of
    an Ethernet frame actually received could be anything less than the
    MTU. Thus, tail of a received packet will point to the end of the
    received data while end will point to the end of the memory
    allocated for the packet.

    3. ICMP Ping/Pong : Function Calls

    Code path (functions called) when an ICMP ping is received (and
    corresponding pong goes out), for linux 2.6.9: First the packet is
    received by the NIC and it's interrupt handler will ultimately call
    net_rx_action() to be called (NAPI, [1]). This will call the device
    driver's poll function which will submit packets (skb's) to the
    networking stack via netif_receive_skb. The rest is outlined below:
    1. ip_rcv() --> ip_rcv_finish()
    2. dst_input() --> skb->dst->input = ip_local_deliver()
    3. ip_local_deliver() --> ip_local_deliver_finish()
    4. ipprot->handler = icmp_rcv()
    5. icmp_pointers[ICMP_ECHO].handler == icmp_echo() -- At this point
    I guess you could say that the "receive" path is complete, the
    packet has reached the top. Now the outbound (down the stack)
    journey begins)
    6. icmp_reply() -- Might want to look into the checks this function
    does
    7. icmp_push_reply()
    8. ip_push_pending_frames()
    9. dst_output() --> skb->dst->output = ip_output()
    10. ip_output() --> ip_finish_output() --> ip_finish_output2()
    11. dst->neighbour->output ==

    4. Transmit Interrupts and Flow Control

    Transmit interrupts are generated after every packet transmission
    and this is key to flow control. However, this does have significant
    performance implications under heavy transmit-related I/O (imagine a
    packet forwarder where the number of transmitted packets is equal to
    the number of received oned). Each device provides a means to slow
    down transmit (Tx) interrupts. For example, Intel's e1000 driver
    exposes "TxIntDelay" that allows transmit interrupts to be delayed
    in units of 1.024 microseconds. The default value is 64, thus eavy
    under heavy transmissions an interrupt's are spaced 65.536
    microseconds apart. Imagine the number of transmissions that can
    take place in this time.

    5. NIC driver callbacks and ifconfig

    Interfaces are configured using the ifconfig command. Many of these
    commands will result in a function of the NIC driver being called.
    For example, ifconfig eth0 up should result in the device driver's
    open() function being called (open is a member of struct
    net_device). ifconfig communicates with the kernel through ioctl()
    on any socket. The requests are a struct ifreq (see
    /usr/include/net/if.h and
    http://linux.about.com/library/cmd/blcmdl7_netdevice.htm. Thus,
    ifconfig eth0 up will result in the following:
    1. A socket (of any kind) is opened using socket()
    2. A struct ifreq is prepared with ifr_ifname set to "eth0"
    3. An ioctl() with request SIOCGIFFLAGS is done to get the current
    flags and then the IFF_UP and IFF_RUNNING flags are set with
    another ioctl() (with request SIOCSIFFLAGS).
    4. Now we're inside the kernel. sock_ioctl() is called, which in
    turn calls dev_ioctl() (see net/socket.c and net/core/dev.c)
    5. dev_ioctl() --> ... --> dev_open() --> driver's open()
    implementation.

    6. Protocol Structures in the Kernel

    There are various structs in the kernel which consist of function
    pointers for protocol handling. Different structures correspond to
    different layers of protocols as well as whether the functions are
    for synchronous handling (e.g., when recv(), send() etc. system
    calls are made) or asynchronous handling (e.g., when a packet
    arrives at the interface and it needs to be handled). Here is what I
    have gathered about the various structures so far:
    * struct packet_type - includes instantiations such as
    ip_packet_type, ipv6_packet_type etc. These provide low-level,
    asynchronos packet handling. When a packet arrives at the
    interface, the driver ultimately submits it to the networking
    stack by a call to netif_receive_skb(), which iterates to the
    list of registered packet handlers and submits the skb to them.
    For example, ip_packet_type.func = ip_rcv, so ip_rcv() is where
    one can say the IP protocol first receives a packet that has
    arrived at the interface. Packet-types are registred with the
    networking stack by a call to dev_add_pack().
    * struct net_proto_family - includes instantiations such as
    inet_family_ops, packet_family_ops etc. Each net_proto_family
    structure handles one type of address family (PF_INET etc.).
    This structure is associated with a BSD socket (struct socket)
    and not the networking layer representation of sockets (struct
    sock). It essentially provdides a create() function which is
    called in response to the socket() system call. The
    implementation of create() for each family typically allocates
    the struct sock and also associates other synchronous operations
    (see struct proto_ops below) with the socket. To cut a long
    story short - net_proto_family provides the protocol-specific
    part of the socket() system call. (NOTE: Not all BSD sockets
    will have a networking socket associated with it. For example,
    unix sockets (the PF_UNIX address family).
    unix_family_ops.create = unix_create does not allocate a struct
    sock). The net_proto_family structure is registered with the
    networking stack by a call to sock_register().
    * struct proto_ops - includes instantiations such as
    inet_stream_ops, inet_dgram_ops, packet_ops etc. These provide
    implementations of networking layer synchronous calls
    (connect(), bind(), recvmsg(), ioctl() etc. system calls). The
    ops member of the BSD socket structure (struct socket) points to
    the proto_ops associated with the socket. Unlike the above two
    structures, there is no function that explicitly registers a
    struct proto_ops with the networking stack. Instead, the
    create() implementation of struct net_proto_family just sets the
    ops field of the BSD socket to the appropriate proto_ops
    structure.
    * struct proto - includes instantiations such as tcp_prot,
    udp_prot, raw_prot. These provide protocol handlers inside a
    network family. It seems that currently this means only over-IP
    protocols as I could find only the above three instantiations.
    These also provide implementations for synchronous calls. The
    sk_prot field of the networking socket (struct sock) points to
    such a structure. The sk_prot field would get set by the create
    function in struct net_proto_family and the functions provided
    will be called by the implementations of functions in the struct
    proto_ops structure. For example, inet_family_ops.create =
    inet_create allocates a struct sock and would set sk_prot =
    udp_prot in reponse to a socket(PF_INET, SOCK_DGRAM, 0); system
    call. A recvfrom() system call made on the socket would then
    invoke inet_dgram_ops.recvmsg = sock_common_recvmsg, which calls
    sk_prot->recvmsg = udp_recvmsg. Like proto_ops, struct protos
    aren't explicitly "registered" with the networking stack using a
    function, but are "regsitered" by the BSD socket create()
    implementation in the struct net_proto_family.
    * struct net_protocol - includes instantiations such as
    tcp_protocol, udp_protocol, icmp_protocol etc. These provide
    asynchronous packet receive routines for IP protocols. Thus,
    this structure is specific to the inet-family of protocols.
    Handlers are registered using inet_add_protocol(). This
    structure is used by the IP-layer routines to hand off to a
    layer 4 protocol. Specifically, the IP handler (ip_rcv()) will
    invoke ip_local_deliver_finish() for packets that are to be
    delivered to the local host. ip_local_deliver_finish() uses a
    hash table (inet_protos) to decide which function to pass the
    packet to based on the protocol field in the IP header. The hash
    table is populated by the call to inet_add_protocol().

    7. skb_clone() vs. skb_copy()

    When a packet needs to be delivered to two separate handlers (for
    example, the IP layer and tcpdump), then it is "cloned" by
    incrementing the reference count of the packet instead of being
    "copied". Now, though the two handlers are not expected to modify
    the packet contents, they can change the data pointer. So, how do we
    ensure that processing by one of the handlers doesn't mess up the
    data pointer for the other?

    A. Umm... skb_clone means that there are separate head, tail, data,
    end etc. pointers. The difference between skb_copy() and skb_clone()
    is precisely this - the former copies the packet completely, while
    the latter uses the same packet data but separate pointers into the
    packet.

    8. NICs and Descriptor Rings

    NOTE: Using the Intel e1000, driver source version 5.6.10.1, as an
    example. Each transmission/reception has a descriptor - a "handle"
    used to access buffer data somewhat like a file descriptor is a
    handle to access file data. The descriptor format would be NIC
    dependent as the hardware understands and reads/writes to the
    descriptor. The NIC maintains a circular ring of descriptors, i.e.,
    the number of descriptors for TX and RX is fixed (TxDescriptors,
    RxDescriptors module parameters for the e1000 kernel module) and the
    descriptors are used like a circular queue.

    Thus, there are three structures:
    * Descriptor Ring (struct e1000_desc_ring) - The list of
    descriptors. So, ring[0], ring[1] etc. are individual
    descriptors. The ring is typically allocated just once and thus
    the DMA mapping of the ring is "consistent". Each descriptor in
    the ring will thus have a fixed DMA and memory address. In the
    e1000, the device registers TDBAL, TDBAH, TDLEN stand for
    "Transmit Descriptors Base Address Low", "High" and "Length" (in
    bytes of all descriptors). Similarly, there are RDBAL, RDBAH,
    RDLEN
    * Descriptors (struct e1000_rx_desc and struct e1000_tx_desc) -
    Essentially, this stores the DMA address of the buffer which
    contains actual packet data, plus some other accounting
    information such as the status (transmission successsful?
    receive complete? etc.), errors etc.
    * Buffers - Now actual data cannot have a "consistent" DMA
    mapping, meaning we cannot ensure that all skbuffs for a
    particular device always have some specific memory addresses
    (those that have been setup for DMA). Instead, "streaming" DMA
    mappings need to be used. Each descriptor thus contains the DMA
    address of a buffer that has been setup for streaming mapping.
    The hardware uses that DMA address to pickup a packet to be sent
    or to place a received packet. Once the kernel's stack picks up
    the buffer, it can allocate new resources (a new buffer) and
    tell the NIC to use that buffer next time by setting up a new
    streaming mapping and putting the new DMA handle in the
    descriptor.
    The e1000 uses a struct e1000_buffer as a wrapper around the
    actual buffer. The DMA mapping however is setup only for
    skb->data, i.e., where raw packet data is to be placed.

    9. How much networking work does the ksoftirqd do?

    Consider what the NET_RX_SOFTIRQ does:
    1. Each softirq invokation (do_softirq()) processes up to
    net.core.netdev_max_backlog x MAX_SOFTIRQ_RESTART packets, if
    available. The default values lead to 300 x 10 = 3000 pkts.
    2. Every interrupt calls do_softirq() when exitting (irq_exit()) -
    including the timer interrupt and NMIs too?
    3. Default transmit/receive ring sizes on the NIC are less than
    3000 (the e1000 for example defaults to 256 and can have at most
    4096 descriptors on its ring)

    Thus, the number of times ksoftirqd will be switched in/out depends
    on how much processing is done by do_softirq() invokations on
    irq_exit(). If the softirq handling on interrupt is able to clean up
    the NIC ring faster than a new packet comes in, then ksoftirqd won't
    be doing anything. Specifically, if the inter-packet-gap is greater
    than the time it takes to pick-up and process a single packet from
    the NIC, then ksoftirq will not be scheduled (and if the number of
    descriptors on the NIC is less than 3000).

    Without going into details, some quick experimental verification:
    Machine A continuously generates UDP packets for Machine B which is
    running an "sink" application, i.e., it just loops on a recvfrom().
    When the size of the packet sent from A was 60 bytes (and
    inter-packet gap averaged 1.5µs), then the ksoftirqd thread on B
    observed a total of 375 context swithces (374 involuntary and 1
    voluntary). When the packet size was 1280 bytes (and now
    inter-packet gap increased almost 7 times to 10µs) then the
    ksoftirqd thread was NEVER scheduled (0 context switches). The
    single voluntary context switch in the former case probably happened
    after all packets were processed (i.e., the sender stopped sending
    and the receiver processed all that it got).

    10. Packet Requeues in Qdiscs

    The queueing discipline (struct Qdisc) provides a requeue().
    Typically, packets are dequeued from the qdisc and submitted to the
    device driver (the hard_start_xmit function in struct net_device).
    However, at times it is possible that the device driver is "busy",
    so the dequeued packet must be "requeued". "Busy" here means that
    the xmit_lock of the device was held. It seems that this lock is
    acquired at two places: (1) qdisc_restart() and (2) dev_watchdog().
    The former handles packet dequeueing from the qdisc, acquiring the
    xmit_lock and then submitting the packet to the device driver
    (hard_start_xmit()) or alternatively requeuing the packet if the
    xmit_lock was already held by someone else. The latter is invoked
    asynchronously, periodically - its part of the watchdog timer
    mechanism.

    My understanding is that two threads cannot be in qdisc_restart()
    for the same qdisc at the same time, however the xmit_lock may have
    been acquired by the watchdog timer function causing a requeue.

    11. Links

    This is just a dump of links that might be useful.
    * http://www.spec.org and SpecWeb http://www.spec.org/web99/
    * linux-net and netdev mailing lists:
    http://www.ussg.iu.edu/hypermail/linux/net/ and
    http://oss.sgi.com/projects/netdev/archive/
    * Linux Traffic Control HOWTO

    12. Specific TODOs

    * Study watchdog timer mechanism and figure out how flow control
    is implemented in the receive and transmit side

    References

    [3] Beyond Softnet. Jamal Hadi Salim, Robert Olsson, and Alexey
    Kuznetsov. Nov 2001. USENIX. 5. .

    [5] A Map of the Networking Code in Linux Kernel 2.4.20. Miguel Rio,
    Mathieu Goutelle, Tom Kelly, Richard Hugh-Jones, Jean-Phillippe
    Martin-Flatin, and Yee-Ting Li. Mar 2004.

    [4] Understanding the Linux Kernel. Daniel P. Bovet and Marco
    Cesati. O'Reilly & Associates. 2nd Edition. 81-7366-589-3.

  • 相关阅读:
    Google's Machine Learning Crash Course #01# Introducing ML & Framing & Fundamental terminology
    MySQL Crash Course #09# Chapter 17. Combining Queries: UNION
    MySQL笔记(二)数据库对象的创建和管理
    浅谈TCP/IP网络编程中socket的行为
    linux网络编程中的shutdown()与close()函数
    c++11中的线程、锁和条件变量
    多线程TcpServer
    TCP网络库:Acceptor、TcpServer、TcpConnection
    epoll 的accept , read, write
    线程安全函数
  • 原文地址:https://www.cnblogs.com/lake-of-embedded-system/p/3573440.html
Copyright © 2011-2022 走看看