网络设备是完成用户数据包在网络媒介上发送和接收的设备,它将上层协议传递下来的数据包以特定的媒介访问控制方式进行发送,并将接收到的数据包传递给上层协议。
Linux系统对网络设备驱动定义了4个层次,这4个层次分别为:
1)网络协议接口层;
2)网络设备接口层;
3)提供实际功能的设备驱动功能层;
4)网络设备与媒介层。
一、Linux网络设备驱动的结构
Linux网络设备驱动程序的体系结构如下图所示,从上到下可以分为4层,依次为网络协议接口层、网络设备接口层、提供实际功能的设备驱动功能层以及网络设备与媒介层,各层作用如下:
网络协议接口层: 向网络层协议提供统一的数据包收发接口,不论上层协议是ARP,还是IP,都通过dev_queue_xmit()函数发送数据,并通过netif_rx()函数接收数据。这一层的存在使得上层协议独立于具体的设备。
网络设备接口层: 向协议接口层提供统一的用于描述具体网络设备属性和操作的结构体net_device,该结构体是设备驱动功能层中各函数的容器。网络设备接口层从宏观上规划了具体操作硬件的设备驱动功能层的结构。
设备驱动功能层: 设备驱动功能层的各函数是网络设备接口层net_device数据结构的具体成员,是驱使网络设备硬件完成相应动作的程序,它通过hard_start_xmit()函数启动发送操作,并通过网络设备上的中断触发接收操作。
网络设备与媒介层: 完成数据包发送和接收的物理实体,包括网络适配器和具体的传输媒介,网络适配器被设备驱动功能层中的函数在物理上驱动。对于Linux系统而言,网络设备和媒介都可以是虚拟的。
在设计具体的网络设备驱动程序时,我们需要完成的主要工作是编写设备驱动功能层的相关函数以填充net_device数据结构的内容并将net_device注册进内核。
1.网络协议接口层
网络协议接口层最主要的功能是给上层协议提供透明的数据包发送和接收接口。当上层ARP或IP需要发送数据包时,它将调用网络协议接口层的dev_queue_xmit()函数发送该数据包,同时需传递给该函数一个指向 struct sk_buff
数据结构的指针。dev_queue_xmit( ) 函数的原型为:
int dev_queue_xmit(struct sk_buff *skb);
同样的,上层对数据包的接收也通过向netif_rx( )函数传递一个 struct sk_buff 数据结构的指针来完成。netif_rx( ) 函数的原型为:
int netif_rx(struct sk_buff *skb);
sk_buff结构体非常重要,它定义于 include/linux/skbuff.h 文件中, 含义为“套接字缓冲区”, 用于在Linux网络子系统中的各层之间传递数据,是Linux网络子系统数据传递的 “中枢神经”。
当发送数据包时,Linux内核的网络处理模块必须建立一个包含要传输的数据包的 sk_buff, 然后将 sk_buff 递交给下层,各层在 sk_buff中添加不同的协议头直至交给网络设备发送。同理, 当网络设备从网络媒介上接收到数据包后,它必须
将接收到的数据转换为 sk_buff 数据结构并传递给上层, 各层剥去相应的协议头直至交给用户。
/** * struct sk_buff - socket buffer * @next: Next buffer in list * @prev: Previous buffer in list * @tstamp: Time we arrived/left * @rbnode: RB tree node, alternative to next/prev for netem/tcp * @sk: Socket we are owned by * @dev: Device we arrived on/are leaving by * @cb: Control buffer. Free for use by every layer. Put private vars here * @_skb_refdst: destination entry (with norefcount bit) * @sp: the security path, used for xfrm * @len: Length of actual data * @data_len: Data length * @mac_len: Length of link layer header * @hdr_len: writable header length of cloned skb * @csum: Checksum (must include start/offset pair) * @csum_start: Offset from skb->head where checksumming should start * @csum_offset: Offset from csum_start where checksum should be stored * @priority: Packet queueing priority * @ignore_df: allow local fragmentation * @cloned: Head may be cloned (check refcnt to be sure) * @ip_summed: Driver fed us an IP checksum * @nohdr: Payload reference only, must not modify header * @nfctinfo: Relationship of this skb to the connection * @pkt_type: Packet class * @fclone: skbuff clone status * @ipvs_property: skbuff is owned by ipvs * @peeked: this packet has been seen already, so stats have been * done for it, don't do them again * @nf_trace: netfilter packet trace flag * @protocol: Packet protocol from driver * @destructor: Destruct function * @nfct: Associated connection, if any * @nf_bridge: Saved data about a bridged frame - see br_netfilter.c * @skb_iif: ifindex of device we arrived on * @tc_index: Traffic control index * @tc_verd: traffic control verdict * @hash: the packet hash * @queue_mapping: Queue mapping for multiqueue devices * @xmit_more: More SKBs are pending for this queue * @ndisc_nodetype: router type (from link layer) * @ooo_okay: allow the mapping of a socket to a queue to be changed * @l4_hash: indicate hash is a canonical 4-tuple hash over transport * ports. * @sw_hash: indicates hash was computed in software stack * @wifi_acked_valid: wifi_acked was set * @wifi_acked: whether frame was acked on wifi or not * @no_fcs: Request NIC to treat last 4 bytes as Ethernet FCS * @napi_id: id of the NAPI struct this skb came from * @secmark: security marking * @mark: Generic packet mark * @dropcount: total number of sk_receive_queue overflows * @vlan_proto: vlan encapsulation protocol * @vlan_tci: vlan tag control information * @inner_protocol: Protocol (encapsulation) * @inner_transport_header: Inner transport layer header (encapsulation) * @inner_network_header: Network layer header (encapsulation) * @inner_mac_header: Link layer header (encapsulation) * @transport_header: Transport layer header * @network_header: Network layer header * @mac_header: Link layer header * @tail: Tail pointer * @end: End pointer * @head: Head of buffer * @data: Data head pointer * @truesize: Buffer size * @users: User count - see {datagram,tcp}.c */ struct sk_buff { union { struct { /* These two members must be first. */ struct sk_buff *next; struct sk_buff *prev; union { ktime_t tstamp; struct skb_mstamp skb_mstamp; }; }; struct rb_node rbnode; /* used in netem & tcp stack */ }; struct sock *sk; struct net_device *dev; /* * This is the control buffer. It is free to use for every * layer. Please put your private variables there. If you * want to keep them across layers you have to do a skb_clone() * first. This is owned by whoever has the skb queued ATM. */ char cb[48] __aligned(8); unsigned long _skb_refdst; void (*destructor)(struct sk_buff *skb); #ifdef CONFIG_XFRM struct sec_path *sp; #endif #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE) struct nf_conntrack *nfct; #endif #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER) struct nf_bridge_info *nf_bridge; #endif unsigned int len, data_len; __u16 mac_len, hdr_len; /* Following fields are _not_ copied in __copy_skb_header() * Note that queue_mapping is here mostly to fill a hole. */ kmemcheck_bitfield_begin(flags1); __u16 queue_mapping; __u8 cloned:1, nohdr:1, fclone:2, peeked:1, head_frag:1, xmit_more:1; /* one bit hole */ kmemcheck_bitfield_end(flags1); /* fields enclosed in headers_start/headers_end are copied * using a single memcpy() in __copy_skb_header() */ /* private: */ __u32 headers_start[0]; /* public: */ /* if you move pkt_type around you also must adapt those constants */ #ifdef __BIG_ENDIAN_BITFIELD #define PKT_TYPE_MAX (7 << 5) #else #define PKT_TYPE_MAX 7 #endif #define PKT_TYPE_OFFSET() offsetof(struct sk_buff, __pkt_type_offset) __u8 __pkt_type_offset[0]; __u8 pkt_type:3; __u8 pfmemalloc:1; __u8 ignore_df:1; __u8 nfctinfo:3; __u8 nf_trace:1; __u8 ip_summed:2; __u8 ooo_okay:1; __u8 l4_hash:1; __u8 sw_hash:1; __u8 wifi_acked_valid:1; __u8 wifi_acked:1; __u8 no_fcs:1; /* Indicates the inner headers are valid in the skbuff. */ __u8 encapsulation:1; __u8 encap_hdr_csum:1; __u8 csum_valid:1; __u8 csum_complete_sw:1; __u8 csum_level:2; __u8 csum_bad:1; #ifdef CONFIG_IPV6_NDISC_NODETYPE __u8 ndisc_nodetype:2; #endif __u8 ipvs_property:1; __u8 inner_protocol_type:1; __u8 remcsum_offload:1; /* 3 or 5 bit hole */ #ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ #ifdef CONFIG_NET_CLS_ACT __u16 tc_verd; /* traffic control verdict */ #endif #endif union { __wsum csum; struct { __u16 csum_start; __u16 csum_offset; }; }; __u32 priority; int skb_iif; __u32 hash; __be16 vlan_proto; __u16 vlan_tci; #if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS) union { unsigned int napi_id; unsigned int sender_cpu; }; #endif #ifdef CONFIG_NETWORK_SECMARK __u32 secmark; #endif union { __u32 mark; __u32 dropcount; __u32 reserved_tailroom; }; union { __be16 inner_protocol; __u8 inner_ipproto; }; __u16 inner_transport_header; __u16 inner_network_header; __u16 inner_mac_header; __be16 protocol; __u16 transport_header; __u16 network_header; __u16 mac_header; /* private: */ __u32 headers_end[0]; /* public: */ /* These elements must be at the end, see alloc_skb() for details. */ sk_buff_data_t tail; sk_buff_data_t end; unsigned char *head, *data; unsigned int truesize; atomic_t users; };
如下图所示,尤其值得注意的是 head 和 end 指向缓冲区的头部和尾部,而 data 和 tail 指向实际数据的头部和尾部。每一层会在 head 和 data 之间填充协议头,或者在 tail 和 end 之间添加新的协议数据。
下面分析下套接字缓冲区涉及的操作函数,Linux套接字缓冲区支持分配、释放、变更等功能函数。
(1) 分配
Linux 内核中用于分配套接字缓冲区的函数有:
struct sk_buff *alloc_skb(unsigned int len, gfp_t priority); struct sk_buff *dev_alloc_skb(unsigned int len);
alloc_skb( )函数分配一个套接字缓冲区和一个数据缓冲区,参数 len 为数据缓冲区的空间大小,通常以L1_CACHE_BYTES字节(对于 ARM 为32)对齐,参数priority为内存分配的优先级。
dev_alloc_skb( )函数以 GFP_ATOMIC 优先级进行 skb 的分配,原因是该函数经常在设备驱动的接收中断里被调用。
(2)释放
Linux内核中用于释放套接字缓冲区的函数有:
void kfree_skb(struct sk_buff *skb); void dev_kfree_skb(struct sk_buff *skb); void dev_kfree_skb_irq(struct sk_buff *skb); void dev_kfree_skb_any(struct sk_buff *skb);
上述函数用于释放被alloc_skb( )函数分配的套接字缓冲区和数据缓冲区。
Linux内核内部使用 kfree_skb( ) 函数,但在网络设备驱动程序中最好用 dev_kfree_skb( )、dev_kfree_skb_irq( ) 或 dev_kfree_skb_any( )函数进行套接字缓冲区的释放。
dev_kfree_skb( )用于非中断上下文, dev_kfree_skb_irq( )用于中断上下文,dev_kfree_skb_any( )在中断和非中断上下文中皆可使用,它其实是做一个简单地上下文判断,然后再调用__dev_kfree_skb_irq( ) 或者 dev_kfree_skb( ),代码实现如下:
void __dev_kfree_skb_any(struct sk_buff *skb, enum skb_free_reason reason) { if (in_irq() || irqs_disabled()) __dev_kfree_skb_irq(skb, reason); else dev_kfree_skb(skb); }
(3)变更
在Linux内核中可以用如下函数在缓冲区尾部增加数据:
unsigned char *skb_put(struct sk_buff *skb, unsigned int len) ;
它会导致 skb->tail 后移 len (skb->tail += len),而skb->len会增加 len 的大小( skb->len += len )。通常在设备驱动的接收数据处理中会调用此函数。
在Linux内核中可以用如下函数在缓冲区开头增加数据:
unsigned char *skb_push(struct sk_buff *skb, unsigned len);
它会导致 skb->data 前移 len (skb->data -= len ),而 skb->len会增加 len 的大小(skb->len += len)。与 该函数功能相反的函数是 skb_pull( ),它可以在缓冲区开头移除数据,执行的动作是skb->len -= len、 skb->data += len。
对于一个空的缓冲区而言,调用如下函数可以调整缓冲区的头部:
static inline void skb_reserve(struct sk_buff *skb, int len);
它会将 skb->data 和 skb->tail 同时后移 len,执行 skb->data += len、skb->tail += len。内核里存在许多类似代码:
skb = alloc_skb(len + headspace, GFP_KERNEL); skb_reserve(skb, headspace); skb_put(skb, len); memcpy_fromfs(skb->data, data, len); pass_to_m_protocol(skb);
上述代码先分配一个全新的 sk_buff,接着调用 skb_reserve( )腾出头部空间,之后调用 skb_put( )腾出数据空间,然后把数据复制进来,最后把 sk_buff 传给协议栈。
2.网络设备接口层
网络设备接口层的主要功能是为千变万化的网络设备定义统一、抽象的数据结构 net_device 结构体,实现多种硬件在软件层次上的统一。
net_device 结构体在内核中指代一个网络设备,它定义于include/linux/netdevice.h 中,网络设备驱动程序只需通过填充 net_device 的具体成员并注册 net_device 即可实现硬件操作函数与内核的挂接。
net_device 是一个巨大的结构体,包含网络设备的属性描述和操作接口,下面介绍一些其中的关键成员。
(1)全局信息
char name[IFNAMESIZ];
name 是网络设备的名称。
(2)硬件信息
unsigned long mem_end; unsigned long mem_start;
mem_start 和 mem_end 分别定义了设备所使用的共享内存的起始和结束地址。
unsigned long base_addr; unsigned char irq; unsigned char if_port; unsigned char dma;
base_addr 为网络设备I/O基地址。
irq为设备使用的中断
if_port 指定多端口设备使用哪一个端口,该字段仅针对多端口设备。例如,如果设备同时支持 IF_PORT_10BASE2(同轴电缆) 和 IF_PORT_10BASET(双绞线),则可使用该字段。
dma指定分配给设备的DMA通道。
(3)接口信息
unsigned short hard_header_len;
hard_header_len 是网络设备的硬件头长度,在以太网设备的初始化函数中,该成员被赋为 ETH_HLEN,即14。
unsigned short type;
type是接口的硬件类型。
unsigned mtu;
mtu 指最大传输单元(MTU)。
unsigned char *dev_addr;
用于存放设备的硬件地址,驱动可能提供了设置 MAC 地址的接口,这会导致用户设置的 MAC 地址等存入该成员,如 drivers/net/ethernet/moxa/moxart_ether.c 中的 moxart_set_mac_address( ) 函数所示。
static int moxart_set_mac_address(struct net_device *ndev, void *addr) { struct sockaddr *address = addr; if (!is_valid_ether_addr(address->sa_data)) return -EADDRNOTAVAIL; memcpy (ndev->dev_addr, address->sa_data, ndev->addr_len); moxart_update_mac_address(ndev); return 0; }
上述代码完成了 memcpy() 以及最终硬件上的 MAC 地址变更。
unsigned short flags;
flags 指网络接口标志, 以 IFF_(Interface Flags)开头, 部分标志由内核来管理,其他的在接口初始化时被设置以说明设备接口的能力和特性。接口标志包括 IFF_UP(当设备被激活并可以开始发送数据包时,内核设置该标志)、 IFF_AUTOMEDIA(设备可在多种媒介间切换)、IFF_BROADCAST(允许广播)、IFF_DEBUG(调试模式,可用于控制 printk 调用的详细程度)、IFF_LOOPBACK(回环) 、IFF_MULTICAST(允许组播)、IFF_NOARP(接口不能执行ARP)和 IFF_POINTOPOINT(接口连接到点对点链路)等。
(4)设备操作函数
const struct net_device_ops *netdev_ops;
该结构体是网络设备的一系列硬件操作行数的集合,它也定义于 include/linux/netdevice.h 中,这个结构体很大,如下:
/* * This structure defines the management hooks for network devices. * The following hooks can be defined; unless noted otherwise, they are * optional and can be filled with a null pointer. * * int (*ndo_init)(struct net_device *dev); * This function is called once when network device is registered. * The network device can use this to any late stage initializaton * or semantic validattion. It can fail with an error code which will * be propogated back to register_netdev * * void (*ndo_uninit)(struct net_device *dev); * This function is called when device is unregistered or when registration * fails. It is not called if init fails. * * int (*ndo_open)(struct net_device *dev); * This function is called when network device transistions to the up * state. * * int (*ndo_stop)(struct net_device *dev); * This function is called when network device transistions to the down * state. * * netdev_tx_t (*ndo_start_xmit)(struct sk_buff *skb, * struct net_device *dev); * Called when a packet needs to be transmitted. * Must return NETDEV_TX_OK , NETDEV_TX_BUSY. * (can also return NETDEV_TX_LOCKED iff NETIF_F_LLTX) * Required can not be NULL. * * u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb, * void *accel_priv, select_queue_fallback_t fallback); * Called to decide which queue to when device supports multiple * transmit queues. * * void (*ndo_change_rx_flags)(struct net_device *dev, int flags); * This function is called to allow device receiver to make * changes to configuration when multicast or promiscious is enabled. * * void (*ndo_set_rx_mode)(struct net_device *dev); * This function is called device changes address list filtering. * If driver handles unicast address filtering, it should set * IFF_UNICAST_FLT to its priv_flags. * * int (*ndo_set_mac_address)(struct net_device *dev, void *addr); * This function is called when the Media Access Control address * needs to be changed. If this interface is not defined, the * mac address can not be changed. * * int (*ndo_validate_addr)(struct net_device *dev); * Test if Media Access Control address is valid for the device. * * int (*ndo_do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd); * Called when a user request an ioctl which can't be handled by * the generic interface code. If not defined ioctl's return * not supported error code. * * int (*ndo_set_config)(struct net_device *dev, struct ifmap *map); * Used to set network devices bus interface parameters. This interface * is retained for legacy reason, new devices should use the bus * interface (PCI) for low level management. * * int (*ndo_change_mtu)(struct net_device *dev, int new_mtu); * Called when a user wants to change the Maximum Transfer Unit * of a device. If not defined, any request to change MTU will * will return an error. * * void (*ndo_tx_timeout)(struct net_device *dev); * Callback uses when the transmitter has not made any progress * for dev->watchdog ticks. * * struct rtnl_link_stats64* (*ndo_get_stats64)(struct net_device *dev, * struct rtnl_link_stats64 *storage); * struct net_device_stats* (*ndo_get_stats)(struct net_device *dev); * Called when a user wants to get the network device usage * statistics. Drivers must do one of the following: * 1. Define @ndo_get_stats64 to fill in a zero-initialised * rtnl_link_stats64 structure passed by the caller. * 2. Define @ndo_get_stats to update a net_device_stats structure * (which should normally be dev->stats) and return a pointer to * it. The structure may be changed asynchronously only if each * field is written atomically. * 3. Update dev->stats asynchronously and atomically, and define * neither operation. * * int (*ndo_vlan_rx_add_vid)(struct net_device *dev, __be16 proto, u16 vid); * If device support VLAN filtering this function is called when a * VLAN id is registered. * * int (*ndo_vlan_rx_kill_vid)(struct net_device *dev, __be16 proto, u16 vid); * If device support VLAN filtering this function is called when a * VLAN id is unregistered. * * void (*ndo_poll_controller)(struct net_device *dev); * * SR-IOV management functions. * int (*ndo_set_vf_mac)(struct net_device *dev, int vf, u8* mac); * int (*ndo_set_vf_vlan)(struct net_device *dev, int vf, u16 vlan, u8 qos); * int (*ndo_set_vf_rate)(struct net_device *dev, int vf, int min_tx_rate, * int max_tx_rate); * int (*ndo_set_vf_spoofchk)(struct net_device *dev, int vf, bool setting); * int (*ndo_get_vf_config)(struct net_device *dev, * int vf, struct ifla_vf_info *ivf); * int (*ndo_set_vf_link_state)(struct net_device *dev, int vf, int link_state); * int (*ndo_set_vf_port)(struct net_device *dev, int vf, * struct nlattr *port[]); * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb); * int (*ndo_setup_tc)(struct net_device *dev, u8 tc) * Called to setup 'tc' number of traffic classes in the net device. This * is always called from the stack with the rtnl lock held and netif tx * queues stopped. This allows the netdevice to perform queue management * safely. * * Fiber Channel over Ethernet (FCoE) offload functions. * int (*ndo_fcoe_enable)(struct net_device *dev); * Called when the FCoE protocol stack wants to start using LLD for FCoE * so the underlying device can perform whatever needed configuration or * initialization to support acceleration of FCoE traffic. * * int (*ndo_fcoe_disable)(struct net_device *dev); * Called when the FCoE protocol stack wants to stop using LLD for FCoE * so the underlying device can perform whatever needed clean-ups to * stop supporting acceleration of FCoE traffic. * * int (*ndo_fcoe_ddp_setup)(struct net_device *dev, u16 xid, * struct scatterlist *sgl, unsigned int sgc); * Called when the FCoE Initiator wants to initialize an I/O that * is a possible candidate for Direct Data Placement (DDP). The LLD can * perform necessary setup and returns 1 to indicate the device is set up * successfully to perform DDP on this I/O, otherwise this returns 0. * * int (*ndo_fcoe_ddp_done)(struct net_device *dev, u16 xid); * Called when the FCoE Initiator/Target is done with the DDPed I/O as * indicated by the FC exchange id 'xid', so the underlying device can * clean up and reuse resources for later DDP requests. * * int (*ndo_fcoe_ddp_target)(struct net_device *dev, u16 xid, * struct scatterlist *sgl, unsigned int sgc); * Called when the FCoE Target wants to initialize an I/O that * is a possible candidate for Direct Data Placement (DDP). The LLD can * perform necessary setup and returns 1 to indicate the device is set up * successfully to perform DDP on this I/O, otherwise this returns 0. * * int (*ndo_fcoe_get_hbainfo)(struct net_device *dev, * struct netdev_fcoe_hbainfo *hbainfo); * Called when the FCoE Protocol stack wants information on the underlying * device. This information is utilized by the FCoE protocol stack to * register attributes with Fiber Channel management service as per the * FC-GS Fabric Device Management Information(FDMI) specification. * * int (*ndo_fcoe_get_wwn)(struct net_device *dev, u64 *wwn, int type); * Called when the underlying device wants to override default World Wide * Name (WWN) generation mechanism in FCoE protocol stack to pass its own * World Wide Port Name (WWPN) or World Wide Node Name (WWNN) to the FCoE * protocol stack to use. * * RFS acceleration. * int (*ndo_rx_flow_steer)(struct net_device *dev, const struct sk_buff *skb, * u16 rxq_index, u32 flow_id); * Set hardware filter for RFS. rxq_index is the target queue index; * flow_id is a flow ID to be passed to rps_may_expire_flow() later. * Return the filter ID on success, or a negative error code. * * Slave management functions (for bridge, bonding, etc). * int (*ndo_add_slave)(struct net_device *dev, struct net_device *slave_dev); * Called to make another netdev an underling. * * int (*ndo_del_slave)(struct net_device *dev, struct net_device *slave_dev); * Called to release previously enslaved netdev. * * Feature/offload setting functions. * netdev_features_t (*ndo_fix_features)(struct net_device *dev, * netdev_features_t features); * Adjusts the requested feature flags according to device-specific * constraints, and returns the resulting flags. Must not modify * the device state. * * int (*ndo_set_features)(struct net_device *dev, netdev_features_t features); * Called to update device configuration to new features. Passed * feature set might be less than what was returned by ndo_fix_features()). * Must return >0 or -errno if it changed dev->features itself. * * int (*ndo_fdb_add)(struct ndmsg *ndm, struct nlattr *tb[], * struct net_device *dev, * const unsigned char *addr, u16 vid, u16 flags) * Adds an FDB entry to dev for addr. * int (*ndo_fdb_del)(struct ndmsg *ndm, struct nlattr *tb[], * struct net_device *dev, * const unsigned char *addr, u16 vid) * Deletes the FDB entry from dev coresponding to addr. * int (*ndo_fdb_dump)(struct sk_buff *skb, struct netlink_callback *cb, * struct net_device *dev, struct net_device *filter_dev, * int idx) * Used to add FDB entries to dump requests. Implementers should add * entries to skb and update idx with the number of entries. * * int (*ndo_bridge_setlink)(struct net_device *dev, struct nlmsghdr *nlh, * u16 flags) * int (*ndo_bridge_getlink)(struct sk_buff *skb, u32 pid, u32 seq, * struct net_device *dev, u32 filter_mask) * int (*ndo_bridge_dellink)(struct net_device *dev, struct nlmsghdr *nlh, * u16 flags); * * int (*ndo_change_carrier)(struct net_device *dev, bool new_carrier); * Called to change device carrier. Soft-devices (like dummy, team, etc) * which do not represent real hardware may define this to allow their * userspace components to manage their virtual carrier state. Devices * that determine carrier state from physical hardware properties (eg * network cables) or protocol-dependent mechanisms (eg * USB_CDC_NOTIFY_NETWORK_CONNECTION) should NOT implement this function. * * int (*ndo_get_phys_port_id)(struct net_device *dev, * struct netdev_phys_item_id *ppid); * Called to get ID of physical port of this device. If driver does * not implement this, it is assumed that the hw is not able to have * multiple net devices on single physical port. * * void (*ndo_add_vxlan_port)(struct net_device *dev, * sa_family_t sa_family, __be16 port); * Called by vxlan to notiy a driver about the UDP port and socket * address family that vxlan is listnening to. It is called only when * a new port starts listening. The operation is protected by the * vxlan_net->sock_lock. * * void (*ndo_del_vxlan_port)(struct net_device *dev, * sa_family_t sa_family, __be16 port); * Called by vxlan to notify the driver about a UDP port and socket * address family that vxlan is not listening to anymore. The operation * is protected by the vxlan_net->sock_lock. * * void* (*ndo_dfwd_add_station)(struct net_device *pdev, * struct net_device *dev) * Called by upper layer devices to accelerate switching or other * station functionality into hardware. 'pdev is the lowerdev * to use for the offload and 'dev' is the net device that will * back the offload. Returns a pointer to the private structure * the upper layer will maintain. * void (*ndo_dfwd_del_station)(struct net_device *pdev, void *priv) * Called by upper layer device to delete the station created * by 'ndo_dfwd_add_station'. 'pdev' is the net device backing * the station and priv is the structure returned by the add * operation. * netdev_tx_t (*ndo_dfwd_start_xmit)(struct sk_buff *skb, * struct net_device *dev, * void *priv); * Callback to use for xmit over the accelerated station. This * is used in place of ndo_start_xmit on accelerated net * devices. * netdev_features_t (*ndo_features_check) (struct sk_buff *skb, * struct net_device *dev * netdev_features_t features); * Called by core transmit path to determine if device is capable of * performing offload operations on a given packet. This is to give * the device an opportunity to implement any restrictions that cannot * be otherwise expressed by feature flags. The check is called with * the set of features that the stack has calculated and it returns * those the driver believes to be appropriate. * * int (*ndo_switch_parent_id_get)(struct net_device *dev, * struct netdev_phys_item_id *psid); * Called to get an ID of the switch chip this port is part of. * If driver implements this, it indicates that it represents a port * of a switch chip. * int (*ndo_switch_port_stp_update)(struct net_device *dev, u8 state); * Called to notify switch device port of bridge port STP * state change. */ struct net_device_ops { int (*ndo_init)(struct net_device *dev); void (*ndo_uninit)(struct net_device *dev); int (*ndo_open)(struct net_device *dev); int (*ndo_stop)(struct net_device *dev); netdev_tx_t (*ndo_start_xmit) (struct sk_buff *skb, struct net_device *dev); u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb, void *accel_priv, select_queue_fallback_t fallback); void (*ndo_change_rx_flags)(struct net_device *dev, int flags); void (*ndo_set_rx_mode)(struct net_device *dev); int (*ndo_set_mac_address)(struct net_device *dev, void *addr); int (*ndo_validate_addr)(struct net_device *dev); int (*ndo_do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd); int (*ndo_set_config)(struct net_device *dev, struct ifmap *map); int (*ndo_change_mtu)(struct net_device *dev, int new_mtu); int (*ndo_neigh_setup)(struct net_device *dev, struct neigh_parms *); void (*ndo_tx_timeout) (struct net_device *dev); struct rtnl_link_stats64* (*ndo_get_stats64)(struct net_device *dev, struct rtnl_link_stats64 *storage); struct net_device_stats* (*ndo_get_stats)(struct net_device *dev); int (*ndo_vlan_rx_add_vid)(struct net_device *dev, __be16 proto, u16 vid); int (*ndo_vlan_rx_kill_vid)(struct net_device *dev, __be16 proto, u16 vid); #ifdef CONFIG_NET_POLL_CONTROLLER void (*ndo_poll_controller)(struct net_device *dev); int (*ndo_netpoll_setup)(struct net_device *dev, struct netpoll_info *info); void (*ndo_netpoll_cleanup)(struct net_device *dev); #endif #ifdef CONFIG_NET_RX_BUSY_POLL int (*ndo_busy_poll)(struct napi_struct *dev); #endif int (*ndo_set_vf_mac)(struct net_device *dev, int queue, u8 *mac); int (*ndo_set_vf_vlan)(struct net_device *dev, int queue, u16 vlan, u8 qos); int (*ndo_set_vf_rate)(struct net_device *dev, int vf, int min_tx_rate, int max_tx_rate); int (*ndo_set_vf_spoofchk)(struct net_device *dev, int vf, bool setting); int (*ndo_get_vf_config)(struct net_device *dev, int vf, struct ifla_vf_info *ivf); int (*ndo_set_vf_link_state)(struct net_device *dev, int vf, int link_state); int (*ndo_set_vf_port)(struct net_device *dev, int vf, struct nlattr *port[]); int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb); int (*ndo_setup_tc)(struct net_device *dev, u8 tc); #if IS_ENABLED(CONFIG_FCOE) int (*ndo_fcoe_enable)(struct net_device *dev); int (*ndo_fcoe_disable)(struct net_device *dev); int (*ndo_fcoe_ddp_setup)(struct net_device *dev, u16 xid, struct scatterlist *sgl, unsigned int sgc); int (*ndo_fcoe_ddp_done)(struct net_device *dev, u16 xid); int (*ndo_fcoe_ddp_target)(struct net_device *dev, u16 xid, struct scatterlist *sgl, unsigned int sgc); int (*ndo_fcoe_get_hbainfo)(struct net_device *dev, struct netdev_fcoe_hbainfo *hbainfo); #endif #if IS_ENABLED(CONFIG_LIBFCOE) #define NETDEV_FCOE_WWNN 0 #define NETDEV_FCOE_WWPN 1 int (*ndo_fcoe_get_wwn)(struct net_device *dev, u64 *wwn, int type); #endif #ifdef CONFIG_RFS_ACCEL int (*ndo_rx_flow_steer)(struct net_device *dev, const struct sk_buff *skb, u16 rxq_index, u32 flow_id); #endif int (*ndo_add_slave)(struct net_device *dev, struct net_device *slave_dev); int (*ndo_del_slave)(struct net_device *dev, struct net_device *slave_dev); netdev_features_t (*ndo_fix_features)(struct net_device *dev, netdev_features_t features); int (*ndo_set_features)(struct net_device *dev, netdev_features_t features); int (*ndo_neigh_construct)(struct neighbour *n); void (*ndo_neigh_destroy)(struct neighbour *n); int (*ndo_fdb_add)(struct ndmsg *ndm, struct nlattr *tb[], struct net_device *dev, const unsigned char *addr, u16 vid, u16 flags); int (*ndo_fdb_del)(struct ndmsg *ndm, struct nlattr *tb[], struct net_device *dev, const unsigned char *addr, u16 vid); int (*ndo_fdb_dump)(struct sk_buff *skb, struct netlink_callback *cb, struct net_device *dev, struct net_device *filter_dev, int idx); int (*ndo_bridge_setlink)(struct net_device *dev, struct nlmsghdr *nlh, u16 flags); int (*ndo_bridge_getlink)(struct sk_buff *skb, u32 pid, u32 seq, struct net_device *dev, u32 filter_mask); int (*ndo_bridge_dellink)(struct net_device *dev, struct nlmsghdr *nlh, u16 flags); int (*ndo_change_carrier)(struct net_device *dev, bool new_carrier); int (*ndo_get_phys_port_id)(struct net_device *dev, struct netdev_phys_item_id *ppid); void (*ndo_add_vxlan_port)(struct net_device *dev, sa_family_t sa_family, __be16 port); void (*ndo_del_vxlan_port)(struct net_device *dev, sa_family_t sa_family, __be16 port); void* (*ndo_dfwd_add_station)(struct net_device *pdev, struct net_device *dev); void (*ndo_dfwd_del_station)(struct net_device *pdev, void *priv); netdev_tx_t (*ndo_dfwd_start_xmit) (struct sk_buff *skb, struct net_device *dev, void *priv); int (*ndo_get_lock_subclass)(struct net_device *dev); netdev_features_t (*ndo_features_check) (struct sk_buff *skb, struct net_device *dev, netdev_features_t features); #ifdef CONFIG_NET_SWITCHDEV int (*ndo_switch_parent_id_get)(struct net_device *dev, struct netdev_phys_item_id *psid); int (*ndo_switch_port_stp_update)(struct net_device *dev, u8 state); #endif };
ndo_open( ) 函数的作用是打开网络接口设备,获得设备需要的I/O地址、IRQ、DMA通道等。stop( )函数的作用是停止网络接口设备,与 open( ) 函数的作用相反。
int(*ndo_start_xmit) (struct sk_buff *skb, struct net_device *dev);
ndo_start_xmit( ) 函数会启动数据包的发送, 当系统调用驱动程序的 xmit 函数时,需要向其传入一个 sk_buff 指针,使得驱动程序能获取从上层传递下来的数据包。
void (*ndo_tx_timeout) (struct net_device *dev);
当数据包的发送超时时,ndo_tx_timeout( )函数会被调用,该函数需采取重新启动数据包发送过程 或 重新启动硬件等措施来恢复网络设备到正常状态。
struct net_device_stats* (*ndo_get_stats) (struct net_device *dev);
ndo_get_stats( ) 函数用于获得网络设备的状态信息,它返回一个 net_device_stats结构体指针。net_device_stats 结构体保存了详细的网络设备流量统计信息,如发送和接收的数据包数、字节数等。
int (*ndo_do_ioctl) (struct net_device *dev, struct ifreq *ifr, int cmd); int (*ndo_set_config) (struct net_device *dev, struct ifmap *map); int (*ndo_set_mac_address) (struct net_device *dev, void *addr);
ndo_do_ioctl( ) 函数用于进行设备特定的 I/O 控制。
ndo_set_config( ) 函数用于配置接口,也可用于改变设备的 I/O 地址和中断号。
ndo_set_mac_address( ) 函数用于设置设备的 MAC 地址。
除了 netdev_ops 以外, 在 net_device 中还存在类似于 ethtool_ops、header_ops这样的操作集:
const struct ethtool_ops *ethtool_ops; const struct header_ops *header_ops;
ethtool_ops 成员函数与用户空间 ethtool 工具的各个命令选项对应, ethtool提供了网卡及网卡驱动管理能力,能够为 Linux 网络开发人员和管理人员提供对网卡硬件、驱动程序和网络协议栈的设置、查看以及调试等功能。
header_ops 对于硬件头部操作,主要是完成创建硬件头部和从给定的 sk_buff 分析出硬件头部等操作。
(5)辅助成员
unsigned long trans_start; unsigned long last_rx;
trans_start 记录最后的数据包开始发送时的时间戳, last_rx 记录最后一次接收到数据包时的时间戳,这两个时间戳记录的都是jiffies,驱动程序应维护这两个成员。
通常情况下,网络设备驱动以中断方式接收数据包,而 poll_controller( ) 则采用纯轮询方式,另外一种数据接收方式是 NAPI(New API),其数据接收流程为 “接收中断来临 -> 关闭接收中断 ->以轮询方式接收所有数据包直到收空->开启接收中断->接收中断来临······”。内核中提供了如下与 NAPI 相关的 API:
static inline void netif_napi_add(struct net_device *dev, struct napi_struct *napi, int (*poll) (struct napi_struct *, int), int weight); static inline void netif_napi_del(struct napi_struct *napi);
以上两个函数分别用于初始化和移除一个 NAPI, netif_napi_add( ) 的 poll 参数是 NAPI 要调度执行的轮询函数。
static inline void napi_enable(struct napi_struct *n); static inline void napi_disable(struct napi_struct *n);
以上两个函数分别用于使能和禁止 NAPI 调度。
static inline int napi_schedule_prep(struct napi_struct *n);
该函数用于检查 NAPI 是否可以调度, 而 napi_schedule( ) 函数用于调度轮询实例的运行,其原型为:
static inline void napi_schedule(struct napi_struct *n);
在 NAPI 处理完成的时候应该调用:
static inline void napi_complete(struct napi_struct *n);
3.设备驱动功能层
net_device 结构体的成员(属性和 net_device_ops 结构体中的函数指针)需要被设备驱动功能层赋予具体的数值和函数。对于具体的设备xxx,应该编写相应的设备驱动功能层的函数,这些函数形如 xxx_open( )、xxx_stop( )、xxx_tx( )、 xxx_hard_header( )、
xxx_get_stats( ) 和 xxx_tx_timeout( )等。
由于网络数据包的接收可由中断引发,设备驱动功能层中的另一个主体部分将是中断处理函数,它负责读取硬件上接收到的数据包并传送给上层协议,因此可能包含 xxx_interrupt( ) 和 xxx_rx( ) 函数,前者完成中断类型判断等基本工作,后者则需要完成数据包的生成及将其递交给上层等复杂工作。
对于特定的设备,我们还可以定义相关的私有数据和操作,并封装为一个私有信息结构体 xxx_private,让其指针赋值给 net_device 的私有成员。 在 xxx_private 结构体中可包含设备的特殊属性和操作、自旋锁与信号量、定时器以及统计信息等,由我们自定义。
在驱动中,要用到私有数据的时候,则使用在 netdevice.h 中定义的接口:
static inline void *netdev_priv(const struct net_device *dev);
例如在驱动 drivers/net/ethernet/davicom/dm900.c 的 dm9000_probe( ) 函数中,使用 alloc_etherdev(sizeof(struct board_info)) 分配网络设备, board_info 结构体就成了这个网络设备的私有数据,在其他函数中可以简单地提取这个私有数据,例如:
static int dm9000_start_xmit(struct sk_buff *skb, struct net_device *dev) { unsigned long flags; board_info_t *db = netdev_priv(dev); ``` }