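Each protocol family's transport layer keeps the information a socket requires in its own transport control block: the TCP transport control block, the UDP transport control block, the raw IP transport control block, and so on.
The Linux kernel defines its transport control blocks cleverly: following the characteristics of each protocol family and transport protocol, it defines several structures in layers and composes the transport control block from them. For the IPv4 protocol family these are sock_common, sock, inet_sock, inet_connection_sock, tcp_sock, request_sock, inet_request_sock, tcp_request_sock, inet_timewait_sock, tcp_timewait_sock, udp_sock, and raw_sock.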
sock_common
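This structure is the minimal set of transport control block information. It consists of the leading fields that sock and inet_timewait_sock have in common, factored out on their own, so it is used only to build those two structures.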
sock
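This is the generic network-layer descriptor block. It forms the basis of every transport control block and is independent of any particular protocol family. It holds only the information common to the transport protocols of all protocol families, so it cannot serve directly as a transport control block. Each protocol family's transport layer extends it to fit its own transport characteristics; for example, inet_sock consists of sock plus IPv4-specific fields and forms the basis of the IPv4 transport control blocks.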
inet_sock
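This is the generic descriptor block of the IPv4 protocol family. It holds the information shared by the basic IPv4 transport layers, that is, by the UDP, TCP, and raw transport control blocks.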
inet_connection_sock
This descriptor block supports connection-oriented features and forms the basis of the IPv4 TCP control block. It adds connection support on top of inet_sock.
tcp_sock
This is the TCP transport control block. It supports the full TCP feature set and holds all the per-connection information that TCP maintains.
inet_timewait_sock
This structure describes the TCP_TIME_WAIT state of a connection-oriented socket and is the basis of tcp_timewait_sock.
tcp_timewait_sock
This is the descriptor block for the TCP_TIME_WAIT state, a rather special transport control block: when a TCP connection enters TCP_TIME_WAIT, its tcp_sock is replaced by a tcp_timewait_sock.
udp_sock
This is the UDP transport control block, which supports the full UDP feature set. Most of the information UDP needs is already described in inet_sock.
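The layering described above relies on each structure embedding its base as the first member, so that a pointer to the generic sock can be converted to the more specific block. The following is a minimal user-space sketch of that idea, not the kernel definitions: the embedded member names follow the kernel, but the remaining fields (sk_sndbuf, sport, icsk_retransmits, snd_nxt, pending) are reduced to placeholders chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; the real kernel structures have many more fields. */
struct sock_common { unsigned short skc_family; unsigned char skc_state; };
struct sock { struct sock_common __sk_common; int sk_sndbuf; };
struct inet_sock { struct sock sk; unsigned short sport; };
struct inet_connection_sock { struct inet_sock icsk_inet; int icsk_retransmits; };
struct tcp_sock { struct inet_connection_sock inet_conn; unsigned int snd_nxt; };
struct udp_sock { struct inet_sock inet; int pending; };

/* Each base is the first member, so the addresses coincide and a pointer
 * to the generic block can be converted back to the specific one. */
static inline struct inet_sock *inet_sk(struct sock *sk)
{
    assert(sk != NULL);
    return (struct inet_sock *)sk;
}

static inline struct tcp_sock *tcp_sk(struct sock *sk)
{
    assert(sk != NULL);
    return (struct tcp_sock *)sk;
}
```

Generic code can therefore pass a struct sock pointer around, while TCP code recovers its own view with tcp_sk() at no cost.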
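The basic transport control blocks, the IPv4-specific transport control blocks, and the generic transport-layer functions are found in the following files:
include/net/sock.h defines the basic transport control block structures, macros, and function prototypes
include/net/inet_sock.h defines the IPv4-specific transport control blocks
net/core/sock.c implements the generic transport-layer functions
net/socket.c implements the socket-layer calls
Memory management of transport control blocks
Allocating and freeing transport control blocks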
sk_alloc()
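When a socket is created, TCP, UDP, and raw IP each allocate a transport control block.
The function that allocates it is sk_alloc().
When the transport control block reaches the end of its life, sk_free() releases it.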
/**
 *	sk_alloc - All socket objects are allocated here
 *	@net: the applicable net namespace
 *	@family: protocol family
 *	@priority: for allocation (%GFP_KERNEL, %GFP_ATOMIC, etc)
 *	@prot: struct proto associated with this new sock instance
 */
struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
		      struct proto *prot)
{
	struct sock *sk;

	sk = sk_prot_alloc(prot, priority | __GFP_ZERO, family);
	if (sk) {
		sk->sk_family = family;
		/*
		 * See comment in struct sock definition to understand
		 * why we need sk_prot_creator -acme
		 */
		sk->sk_prot = sk->sk_prot_creator = prot;
		sock_lock_init(sk);
		sock_net_set(sk, get_net(net));
		atomic_set(&sk->sk_wmem_alloc, 1);
	}

	return sk;
}
sk_free()
sk_free() releases the given transport control block. It is normally called from sock_put(), and only once the block's reference count has dropped to 0.
static void __sk_free(struct sock *sk)
{
	struct sk_filter *filter;

	if (sk->sk_destruct)
		sk->sk_destruct(sk);

	filter = rcu_dereference(sk->sk_filter);
	if (filter) {
		sk_filter_uncharge(sk, filter);
		rcu_assign_pointer(sk->sk_filter, NULL);
	}

	sock_disable_timestamp(sk, SOCK_TIMESTAMP);
	sock_disable_timestamp(sk, SOCK_TIMESTAMPING_RX_SOFTWARE);

	if (atomic_read(&sk->sk_omem_alloc))
		printk(KERN_DEBUG "%s: optmem leakage (%d bytes) detected.\n",
		       __func__, atomic_read(&sk->sk_omem_alloc));

	put_net(sock_net(sk));
	sk_prot_free(sk->sk_prot_creator, sk);
}

void sk_free(struct sock *sk)
{
	/*
	 * We substract one from sk_wmem_alloc and can know if
	 * some packets are still in some tx queue.
	 * If not null, sock_wfree() will call __sk_free(sk) later
	 */
	if (atomic_dec_and_test(&sk->sk_wmem_alloc))
		__sk_free(sk);
}
Allocating ordinary send buffers
sock_alloc_send_skb()
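This function mainly allocates output SKBs for UDP and raw sockets. Compared with sock_wmalloc(), it takes more into account during allocation: it checks for errors already pending on the transport control block, checks the socket's shutdown flag, and can block until send-buffer space becomes available. It is implemented as a direct call to sock_alloc_send_pskb().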
/*
 *	Generic send/receive buffer handlers
 */
struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
				     unsigned long data_len, int noblock,
				     int *errcode)
{
	struct sk_buff *skb;
	gfp_t gfp_mask;
	long timeo;
	int err;

	gfp_mask = sk->sk_allocation;
	if (gfp_mask & __GFP_WAIT)
		gfp_mask |= __GFP_REPEAT;

	timeo = sock_sndtimeo(sk, noblock);
	while (1) {
		err = sock_error(sk);
		if (err != 0)
			goto failure;

		err = -EPIPE;
		if (sk->sk_shutdown & SEND_SHUTDOWN)
			goto failure;

		if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf) {
			skb = alloc_skb(header_len, gfp_mask);
			if (skb) {
				int npages;
				int i;

				/* No pages, we're done... */
				if (!data_len)
					break;

				npages = (data_len + (PAGE_SIZE - 1)) >> PAGE_SHIFT;
				skb->truesize += data_len;
				skb_shinfo(skb)->nr_frags = npages;
				for (i = 0; i < npages; i++) {
					struct page *page;
					skb_frag_t *frag;

					page = alloc_pages(sk->sk_allocation, 0);
					if (!page) {
						err = -ENOBUFS;
						skb_shinfo(skb)->nr_frags = i;
						kfree_skb(skb);
						goto failure;
					}

					frag = &skb_shinfo(skb)->frags[i];
					frag->page = page;
					frag->page_offset = 0;
					frag->size = (data_len >= PAGE_SIZE ?
						      PAGE_SIZE :
						      data_len);
					data_len -= PAGE_SIZE;
				}

				/* Full success... */
				break;
			}
			err = -ENOBUFS;
			goto failure;
		}
		set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
		err = -EAGAIN;
		if (!timeo)
			goto failure;
		if (signal_pending(current))
			goto interrupted;
		timeo = sock_wait_for_wmem(sk, timeo);
	}

	skb_set_owner_w(skb, sk);
	return skb;

interrupted:
	err = sock_intr_errno(timeo);
failure:
	*errcode = err;
	return NULL;
}

struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
				    int noblock, int *errcode)
{
	return sock_alloc_send_pskb(sk, size, 0, noblock, errcode);
}
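When data_len is non-zero, sock_alloc_send_pskb() spreads the data over whole pages: it rounds the page count up and gives every fragment a full page except possibly the last. The sketch below reproduces only that arithmetic in user space. DEMO_PAGE_SHIFT (4 KiB pages) and the helper names frag_pages() and frag_size() are assumptions made for illustration and do not exist in the kernel.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12                       /* assumed: 4 KiB pages */
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)

/* Number of pages needed to hold data_len bytes, rounded up. */
static unsigned long frag_pages(unsigned long data_len)
{
    return (data_len + (DEMO_PAGE_SIZE - 1)) >> DEMO_PAGE_SHIFT;
}

/* Size of fragment i: a full page, except that the last fragment
 * carries whatever remains. */
static unsigned long frag_size(unsigned long data_len, unsigned long i)
{
    unsigned long left;

    assert(i < frag_pages(data_len));
    left = data_len - i * DEMO_PAGE_SIZE;
    return left >= DEMO_PAGE_SIZE ? DEMO_PAGE_SIZE : left;
}
```

For a 10000-byte payload this gives three fragments, the first two a full page each and the last holding the remainder.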
Allocating and freeing send buffers
sock_wmalloc()
sock_wmalloc() also allocates send buffers. In TCP it is used only when building a SYN+ACK; send buffers for user data are normally allocated with sk_stream_alloc_pskb().
/*
 * Allocate a skb from the socket's send buffer.
 */
struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
			     gfp_t priority)
{
	if (force || atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf) {
		struct sk_buff *skb = alloc_skb(size, priority);
		if (skb) {
			skb_set_owner_w(skb, sk);
			return skb;
		}
	}
	return NULL;
}
skb_set_owner_w()
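Every output SKB must be associated with a transport control block. This lets the kernel update the total size of all SKB data areas allocated for sending on that block, and it sets the SKB's destructor.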
/*
 * 	Queue a received datagram if it will fit. Stream and sequenced
 *	protocols can't normally use this as they need to fit buffers in
 *	and play with them.
 *
 * 	Inlined as it's very short and called for pretty much every
 *	packet ever received.
 */
static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
	skb_orphan(skb);
	skb->sk = sk;
	skb->destructor = sock_wfree;
	/*
	 * We used to take a refcount on sk, but following operation
	 * is enough to guarantee sk_free() wont free this sock until
	 * all in-flight packets are completed
	 */
	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
}
sock_wfree()
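sock_wfree() is normally installed as the destructor of an output SKB and is called when that SKB is freed. It updates the total size of all SKB data areas allocated for sending on the owning transport control block, calls the sk_write_space hook to wake processes sleeping while they wait for space on the socket, and drops the SKB's reference on the owning transport control block.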
/*
 * Simple resource managers for sockets.
 */

/*
 * Write buffer destructor automatically called from kfree_skb.
 */
void sock_wfree(struct sk_buff *skb)
{
	struct sock *sk = skb->sk;
	unsigned int len = skb->truesize;

	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) {
		/*
		 * Keep a reference on sk_wmem_alloc, this will be released
		 * after sk_write_space() call
		 */
		atomic_sub(len - 1, &sk->sk_wmem_alloc);
		sk->sk_write_space(sk);
		len = 1;
	}
	/*
	 * if sk_wmem_alloc reaches 0, we must finish what sk_free()
	 * could not do because of in-flight packets
	 */
	if (atomic_sub_and_test(len, &sk->sk_wmem_alloc))
		__sk_free(sk);
}
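Taken together, sk_alloc(), skb_set_owner_w(), sock_wfree(), and sk_free() use sk_wmem_alloc both as a byte counter and as a reference count: it starts at 1, every in-flight output buffer adds its truesize, and whoever brings it to 0 performs the real free. The following is a user-space sketch of that scheme using C11 atomics. The demo_* names are hypothetical, the freed flag stands in for the call to __sk_free(), and the sk_write_space() step of sock_wfree() is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_sock {
    atomic_int wmem_alloc;   /* bytes in flight, plus 1 while the socket lives */
    bool freed;              /* set where the kernel would call __sk_free() */
};

static void demo_sk_init(struct demo_sock *sk)
{
    atomic_init(&sk->wmem_alloc, 1);   /* the bias of one set in sk_alloc() */
    sk->freed = false;
}

/* Model of skb_set_owner_w(): charge the buffer to the socket. */
static void demo_charge(struct demo_sock *sk, int truesize)
{
    assert(truesize > 0);
    atomic_fetch_add(&sk->wmem_alloc, truesize);
}

/* Model of sock_wfree(): uncharge, and free if this was the last user. */
static void demo_uncharge(struct demo_sock *sk, int truesize)
{
    if (atomic_fetch_sub(&sk->wmem_alloc, truesize) == truesize)
        sk->freed = true;
}

/* Model of sk_free(): drop the bias; free only if nothing is in flight. */
static void demo_sk_free(struct demo_sock *sk)
{
    if (atomic_fetch_sub(&sk->wmem_alloc, 1) == 1)
        sk->freed = true;
}
```

With one buffer still queued, demo_sk_free() leaves the socket alive; the later demo_uncharge() for that buffer is what finally frees it, which mirrors the comment in sk_free().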
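Allocating and freeing receive buffers
Input SKBs are all allocated at the driver level with dev_alloc_skb() or alloc_skb(). Until an SKB is handed to the transport layer it does not belong to any particular transport control block, but once it enters the transport layer its owner must be set.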
skb_set_owner_r()
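When the SKB of a UDP datagram is delivered and added to the receive queue of a UDP transport control block, skb_set_owner_r() is called to set the SKB's owner and destructor and to update the total length of all packet data in the receive queue.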
static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
	skb_orphan(skb);
	skb->sk = sk;
	skb->destructor = sock_rfree;
	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
	sk_mem_charge(sk, skb->truesize);
}
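The receive side follows the same owner-plus-destructor pattern: setting the owner charges the buffer's truesize to the socket, and freeing the buffer runs the destructor, which is assumed here to undo that charge. This is a simplified, single-threaded sketch; the demo_* types and functions are hypothetical, plain integers replace the kernel's atomic counters, and sk_mem_charge() is not modelled.

```c
#include <assert.h>
#include <stddef.h>

struct demo_sock { int rmem_alloc; };

struct demo_skb {
    struct demo_sock *sk;
    int truesize;
    void (*destructor)(struct demo_skb *skb);
};

/* Assumed behaviour of the receive destructor: undo the charge. */
static void demo_rfree(struct demo_skb *skb)
{
    skb->sk->rmem_alloc -= skb->truesize;
}

/* Model of skb_set_owner_r(): bind the buffer to its owner and charge it. */
static void demo_set_owner_r(struct demo_skb *skb, struct demo_sock *sk)
{
    assert(skb->sk == NULL);     /* the kernel calls skb_orphan() first */
    skb->sk = sk;
    skb->destructor = demo_rfree;
    sk->rmem_alloc += skb->truesize;
}

/* Model of freeing a buffer: run the destructor, then detach the owner. */
static void demo_free_skb(struct demo_skb *skb)
{
    if (skb->destructor)
        skb->destructor(skb);
    skb->destructor = NULL;
    skb->sk = NULL;
}
```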
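Asynchronous I/O
Although combining blocking and non-blocking operations with select is sufficient for querying a device most of the time, some situations are not handled efficiently that way.
Consider a process that runs a long computational loop at low priority but must process incoming data as soon as possible. If the process obtains its data by responding to a peripheral, it should learn immediately when new data is available. The application could call select() at regular intervals to check for data, but to handle the data more promptly it can use asynchronous notification, so that it receives a signal instead of having to poll.
A user program must perform two steps to enable asynchronous notification from an input file. First, it designates a process as the owner of the file: when a process issues the F_SETOWN command through the fcntl system call, the owner's process ID is saved in filp->f_owner for later use. This step tells the kernel whom to notify. Second, to actually enable asynchronous notification, the program must set the FASYNC flag on the device with fcntl's F_SETFL command.
After these two calls, the process that handles asynchronous I/O can catch SIGIO, and from then on the signal is sent to the process stored in filp->f_owner whenever new data arrives.
For example, the following user-program code enables asynchronous notification from standard input to the current process: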
signal(SIGIO, &input_handler);
fcntl(STDIN_FILENO, F_SETOWN, getpid());
oflags = fcntl(STDIN_FILENO, F_GETFL);
fcntl(STDIN_FILENO, F_SETFL, oflags | FASYNC);
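The fragment above omits the handler and all error checking. A fuller sketch might wrap the two steps in one helper that works on any descriptor. It assumes Linux with glibc, uses sigaction() instead of signal(), and spells the flag O_ASYNC, which is the same flag as FASYNC; the helper name enable_sigio() and the got_sigio flag are introduced here for illustration.

```c
#define _GNU_SOURCE            /* exposes O_ASYNC in <fcntl.h> on glibc */
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigio;

static void input_handler(int signo)
{
    (void)signo;
    got_sigio = 1;             /* only async-signal-safe work belongs here */
}

/* Enable SIGIO delivery for fd. Returns 0 on success, -1 on failure. */
static int enable_sigio(int fd)
{
    struct sigaction sa;
    int oflags;

    assert(fd >= 0);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = input_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) < 0)
        return -1;
    if (fcntl(fd, F_SETOWN, getpid()) < 0)          /* step 1: set the owner */
        return -1;
    oflags = fcntl(fd, F_GETFL);
    if (oflags < 0)
        return -1;
    if (fcntl(fd, F_SETFL, oflags | O_ASYNC) < 0)   /* step 2: set FASYNC */
        return -1;
    return 0;
}
```

A caller would invoke enable_sigio(STDIN_FILENO) once and then check got_sigio from its main loop, reading the input only after the flag has been set.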
sk_wake_async()
It sends SIGIO or SIGURG to the processes registered on the socket, telling them that the file can now be read or written.
/* This function may be called only under socket lock or callback_lock */
int sock_wake_async(struct socket *sock, int how, int band)
{
	if (!sock || !sock->fasync_list)
		return -1;
	switch (how) {
	case SOCK_WAKE_WAITD:
		if (test_bit(SOCK_ASYNC_WAITDATA, &sock->flags))
			break;
		goto call_kill;
	case SOCK_WAKE_SPACE:
		if (!test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags))
			break;
		/* fall through */
	case SOCK_WAKE_IO:
call_kill:
		__kill_fasync(sock->fasync_list, SIGIO, band);
		break;
	case SOCK_WAKE_URG:
		__kill_fasync(sock->fasync_list, SIGURG, band);
	}
	return 0;
}

static inline void sk_wake_async(struct sock *sk, int how, int band)
{
	if (sk->sk_socket && sk->sk_socket->fasync_list)
		sock_wake_async(sk->sk_socket, how, band);
}
how
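enum {
	SOCK_WAKE_IO,		/* no check; send SIGIO to the waiting processes directly */
	SOCK_WAKE_WAITD,	/* check whether the application is already waiting for data in recv() or a similar call; send SIGIO only if it is not */
	SOCK_WAKE_SPACE,	/* check whether the send queue had reached its limit; send SIGIO only if it had */
	SOCK_WAKE_URG,		/* send SIGURG to the waiting processes */
};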
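The effect of each how value can be restated as a pure function that returns the signal sock_wake_async() would raise. This is an illustrative user-space model only: the DEMO_* constants and struct demo_flags are hypothetical stand-ins for the kernel's enum and the SOCK_ASYNC_WAITDATA and SOCK_ASYNC_NOSPACE flag bits.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_how { DEMO_WAKE_IO, DEMO_WAKE_WAITD, DEMO_WAKE_SPACE, DEMO_WAKE_URG };

struct demo_flags {
    bool waitdata;   /* models SOCK_ASYNC_WAITDATA */
    bool nospace;    /* models SOCK_ASYNC_NOSPACE */
};

/* Returns the signal that would be raised, or 0 if none is sent. */
static int demo_wake_signal(struct demo_flags *f, enum demo_how how)
{
    assert(f != NULL);
    switch (how) {
    case DEMO_WAKE_WAITD:
        /* A reader already blocked on the socket needs no signal. */
        return f->waitdata ? 0 : SIGIO;
    case DEMO_WAKE_SPACE:
        if (!f->nospace)
            return 0;
        f->nospace = false;      /* mirrors test_and_clear_bit() */
        return SIGIO;
    case DEMO_WAKE_IO:
        return SIGIO;
    case DEMO_WAKE_URG:
        return SIGURG;
    }
    return 0;
}
```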
band
/*
* SIGPOLL si_codes
*/
#define POLL_IN (__SI_POLL|1)/* data input available */
#define POLL_OUT (__SI_POLL|2)/* output buffers available */
#define POLL_MSG (__SI_POLL|3)/* input message available */
#define POLL_ERR (__SI_POLL|4)/* i/o error */
#define POLL_PRI (__SI_POLL|5)/* high priority input available */
#define POLL_HUP (__SI_POLL|6)/* device disconnected */
sock_def_wakeup()
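It wakes the processes sleeping on the transport control block's sk_sleep queue and is the default function for waking processes that wait on the socket.
It is installed as the transport control block's sk_state_change hook and is normally called when the block's state changes.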
/*
 *	Default Socket Callbacks
 */
static void sock_def_wakeup(struct sock *sk)
{
	read_lock(&sk->sk_callback_lock);
	if (sk_has_sleeper(sk))
		wake_up_interruptible_all(sk->sk_sleep);
	read_unlock(&sk->sk_callback_lock);
}
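Notifying processes after a FIN segment is received
Some other places in TCP also notify the processes on the socket's fasync_list.
For example, when TCP receives a FIN segment and the socket is not in the DEAD state, it wakes the processes waiting on the socket.
If both the send and receive directions have been shut down, or the transport control block is in the CLOSE state, it notifies the processes waiting asynchronously on the socket that the connection has terminated (POLL_HUP). Otherwise only the receive direction is closed, so it notifies them with POLL_IN and the connection can still be written to.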
/*
 * 	Process the FIN bit. This now behaves as it is supposed to work
 *	and the FIN takes effect when it is validly part of sequence
 *	space. Not before when we get holes.
 *
 *	If we are ESTABLISHED, a received fin moves us to CLOSE-WAIT
 *	(and thence onto LAST-ACK and finally, CLOSE, we never enter
 *	TIME-WAIT)
 *
 *	If we are in FINWAIT-1, a received FIN indicates simultaneous
 *	close and we go into CLOSING (and later onto TIME-WAIT)
 *
 *	If we are in FINWAIT-2, a received FIN moves us to TIME-WAIT.
 */
static void tcp_fin(struct sk_buff *skb, struct sock *sk, struct tcphdr *th)
{
	struct tcp_sock *tp = tcp_sk(sk);

	inet_csk_schedule_ack(sk);

	sk->sk_shutdown |= RCV_SHUTDOWN;
	sock_set_flag(sk, SOCK_DONE);

	switch (sk->sk_state) {
	case TCP_SYN_RECV:
	case TCP_ESTABLISHED:
		/* Move to CLOSE_WAIT */
		tcp_set_state(sk, TCP_CLOSE_WAIT);
		inet_csk(sk)->icsk_ack.pingpong = 1;
		break;

	case TCP_CLOSE_WAIT:
	case TCP_CLOSING:
		/* Received a retransmission of the FIN, do
		 * nothing.
		 */
		break;
	case TCP_LAST_ACK:
		/* RFC793: Remain in the LAST-ACK state. */
		break;

	case TCP_FIN_WAIT1:
		/* This case occurs when a simultaneous close
		 * happens, we must ack the received FIN and
		 * enter the CLOSING state.
		 */
		tcp_send_ack(sk);
		tcp_set_state(sk, TCP_CLOSING);
		break;
	case TCP_FIN_WAIT2:
		/* Received a FIN -- send ACK and enter TIME_WAIT. */
		tcp_send_ack(sk);
		tcp_time_wait(sk, TCP_TIME_WAIT, 0);
		break;
	default:
		/* Only TCP_LISTEN and TCP_CLOSE are left, in these
		 * cases we should never reach this piece of code.
		 */
		printk(KERN_ERR "%s: Impossible, sk->sk_state=%d\n",
		       __func__, sk->sk_state);
		break;
	}

	/* It _is_ possible, that we have something out-of-order _after_ FIN.
	 * Probably, we should reset in this case. For now drop them.
	 */
	__skb_queue_purge(&tp->out_of_order_queue);
	if (tcp_is_sack(tp))
		tcp_sack_reset(&tp->rx_opt);
	sk_mem_reclaim(sk);

	if (!sock_flag(sk, SOCK_DEAD)) {
		sk->sk_state_change(sk);

		/* Do not send POLL_HUP for half duplex close. */
		if (sk->sk_shutdown == SHUTDOWN_MASK ||
		    sk->sk_state == TCP_CLOSE)
			sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_HUP);
		else
			sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
	}
}
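The two decisions tcp_fin() makes, which state follows the FIN and which band is passed to sk_wake_async(), can be condensed into two small pure functions. This sketch models only the protocol-level transitions named in the function's comment. The DEMO_* names are hypothetical, the shutdown bit values are assumed to match the kernel's RCV_SHUTDOWN and SEND_SHUTDOWN, and the second helper expects the state the socket is in after the transition.

```c
#include <assert.h>

enum demo_state {
    DEMO_ESTABLISHED, DEMO_SYN_RECV, DEMO_FIN_WAIT1, DEMO_FIN_WAIT2,
    DEMO_CLOSE_WAIT, DEMO_CLOSING, DEMO_LAST_ACK, DEMO_TIME_WAIT, DEMO_CLOSE
};

#define DEMO_RCV_SHUTDOWN  1
#define DEMO_SEND_SHUTDOWN 2
#define DEMO_SHUTDOWN_MASK 3

/* State entered when a FIN arrives in state s. */
static enum demo_state demo_fin_next_state(enum demo_state s)
{
    switch (s) {
    case DEMO_SYN_RECV:
    case DEMO_ESTABLISHED:
        return DEMO_CLOSE_WAIT;
    case DEMO_FIN_WAIT1:
        return DEMO_CLOSING;       /* simultaneous close */
    case DEMO_FIN_WAIT2:
        return DEMO_TIME_WAIT;
    default:
        return s;                  /* retransmitted FIN: no change */
    }
}

/* 1 if POLL_HUP would be sent, 0 if POLL_IN (half-duplex close). */
static int demo_fin_sends_hup(int shutdown, enum demo_state after)
{
    assert((shutdown & ~DEMO_SHUTDOWN_MASK) == 0);
    shutdown |= DEMO_RCV_SHUTDOWN;     /* tcp_fin() always sets this bit */
    return shutdown == DEMO_SHUTDOWN_MASK || after == DEMO_CLOSE;
}
```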
sock_fasync()
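It adds entries to and removes entries from the socket's asynchronous notification list. The list is modified only in process context but may be read from softirq context, so an update takes the socket lock and also holds the transport control block's sk_callback_lock as a writer (write_lock_bh) while it changes the list.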
/*
 *	Update the socket async list
 *
 *	Fasync_list locking strategy.
 *
 *	1. fasync_list is modified only under process context socket lock
 *	   i.e. under semaphore.
 *	2. fasync_list is used under read_lock(&sk->sk_callback_lock)
 *	   or under socket lock.
 *	3. fasync_list can be used from softirq context, so that
 *	   modification under socket lock have to be enhanced with
 *	   write_lock_bh(&sk->sk_callback_lock).
 *							--ANK (990710)
 */
static int sock_fasync(int fd, struct file *filp, int on)
{
	struct fasync_struct *fa, *fna = NULL, **prev;
	struct socket *sock;
	struct sock *sk;

	if (on) {
		fna = kmalloc(sizeof(struct fasync_struct), GFP_KERNEL);
		if (fna == NULL)
			return -ENOMEM;
	}

	sock = filp->private_data;

	sk = sock->sk;
	if (sk == NULL) {
		kfree(fna);
		return -EINVAL;
	}

	lock_sock(sk);

	spin_lock(&filp->f_lock);
	if (on)
		filp->f_flags |= FASYNC;
	else
		filp->f_flags &= ~FASYNC;
	spin_unlock(&filp->f_lock);

	prev = &(sock->fasync_list);

	for (fa = *prev; fa != NULL; prev = &fa->fa_next, fa = *prev)
		if (fa->fa_file == filp)
			break;

	if (on) {
		if (fa != NULL) {
			write_lock_bh(&sk->sk_callback_lock);
			fa->fa_fd = fd;
			write_unlock_bh(&sk->sk_callback_lock);

			kfree(fna);
			goto out;
		}
		fna->fa_file = filp;
		fna->fa_fd = fd;
		fna->magic = FASYNC_MAGIC;
		fna->fa_next = sock->fasync_list;
		write_lock_bh(&sk->sk_callback_lock);
		sock->fasync_list = fna;
		write_unlock_bh(&sk->sk_callback_lock);
	} else {
		if (fa != NULL) {
			write_lock_bh(&sk->sk_callback_lock);
			*prev = fa->fa_next;
			write_unlock_bh(&sk->sk_callback_lock);
			kfree(fa);
		}
	}

out:
	release_sock(sock->sk);
	return 0;
}
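Apart from the locking, sock_fasync() is a singly linked list update that walks with a pointer to the previous link (prev), so removing the head and removing an interior entry are the same operation. The following user-space sketch shows that list handling without any locks. The demo_fasync type and demo_fasync_update() are hypothetical names, and malloc()/free() stand in for kmalloc()/kfree().

```c
#include <assert.h>
#include <stdlib.h>

struct demo_fasync {
    int fd;
    const void *file;              /* identifies the open file */
    struct demo_fasync *next;
};

/* Add (on != 0) or remove (on == 0) the entry for `file`.
 * Returns 0 on success, -1 if allocation fails. */
static int demo_fasync_update(struct demo_fasync **list, const void *file,
                              int fd, int on)
{
    struct demo_fasync *fa, **prev;

    assert(list != NULL);
    /* Walk with a pointer to the link so unlinking needs no special case. */
    for (prev = list; (fa = *prev) != NULL; prev = &fa->next)
        if (fa->file == file)
            break;

    if (on) {
        if (fa) {                  /* already registered: refresh the fd */
            fa->fd = fd;
            return 0;
        }
        fa = malloc(sizeof(*fa));
        if (!fa)
            return -1;
        fa->fd = fd;
        fa->file = file;
        fa->next = *list;          /* push at the head, as the kernel does */
        *list = fa;
    } else if (fa) {
        *prev = fa->next;          /* unlink through the saved link pointer */
        free(fa);
    }
    return 0;
}
```

In the kernel the two assignments that publish or unlink an entry are the only statements wrapped in write_lock_bh(), because they are the only points where a softirq reader could observe the list changing.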