# 理解 select、poll 和 epoll

2017-09-11

## 概述

IO 多路复用是指多个描述符复用同一个进程/线程。使用 IO 多路复用，我们可以在一个进程/线程同时监听多个描述符并查看是否有任何一个描述符就绪。在这里，描述符不限于套接字。只要有需要同时处理多个描述符的场景，就可以考虑使用 IO 多路复用。

IO 多路复用本质上还是同步IO，进程调用 select、poll 后，会进入阻塞状态，直到有描述符就绪才返回（不考虑超时和异常）。当有描述符就绪时，我们需要再调用 read 或者 write 等 IO 函数进行 IO。相比于阻塞 IO，IO 多路复用需要多一次系统调用（先调用 select、poll 再调用 read、write），但是允许我们等待多个描述符就绪。

IO 多路复用通常与非阻塞 IO 一起使用，原因在于：

• If multiple processes (or threads) are performing I/O on the same open file descriptions, then, from a particular process’s point of view, a descriptor’s readiness may change between the time the descriptor was notified as being ready and the time of the subsequent I/O call. Consequently, a blocking I/O call could block, thus preventing the process from monitoring other file descriptors.
• For socket, after we are informed a file descriptor for a stream socket is ready for writing, if we write a large enough block of data in a single write() or send(), then the call will nevertheless block.
• In rare cases, level-triggered APIs such as select() and poll() can return spurious readiness notifications—they can falsely inform us that a file descriptor is ready. This could be caused by a kernel bug or be expected behavior in an uncommon scenario.

Linux 提供了三类 IO 多路复用函数，分别是 select、poll 和 epoll 以及对应的 POSIX 标准实现 。以下描述 select、poll 和 epoll。

## select

select 函数定义为：

#include <sys/time.h>     /* For portability */
#include <sys/select.h>

int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);


• nfds 的值等于三个集合中最大描述符 + 1。比如如果我们关注了三个描述符：1, 4 和 5，那么 nfds 的值为 6。之所以要设为最大描述符 + 1，是因为描述符从 0 开始，并且 select 会检查从 0 到 nfds -1 的描述符。nfds 作用参考：What’s the purpose of the first argument to select system call?
• 当 readfds 中的描述符可读时返回，为 NULL 表示不关心
• 当 writefds 中的描述符可写时返回，为 NULL 表示不关心
• 当 exceptfds 中的描述符出现异常时返回，为 NULL 表示不关心
• timeout 表示 select 每次调用最长等待时间，为 NULL 表示永久等待

fd_set 底层是一个整数数组，整数的每一位对应一个描述符。比如，使用32位的整数，数组的第一个数对应第 0 个到第 31 个描述符；第二个数对应第 32 个到第 63 个描述符。在 Linux 实现中，fd_set 是一个 long 数组：

// sys/select.h
/* The fd_set member is required to be an array of longs.  */

#define __NFDBITS	(8 * (int) sizeof (__fd_mask))

typedef struct
{
#ifdef __USE_XOPEN
# define __FDS_BITS(set) ((set)->fds_bits)
#else
# define __FDS_BITS(set) ((set)->__fds_bits)
#endif
} fd_set;


__FD_SETSIZE 定义为：

//
/* Number of descriptors that can fit in an fd_set'.  */
#define __FD_SETSIZE		1024


void FD_CLR(int fd, fd_set *set);/* removes the file descriptor fd from the set pointed to by fdset. */
int  FD_ISSET(int fd, fd_set *set);/* returns true if the file descriptor fd is a member of the set pointed to by fdset. */
void FD_SET(int fd, fd_set *set); /* adds the file descriptor fd to the set pointed to by fdset. */
void FD_ZERO(fd_set *set); /* initializes the set pointed to by fdset to be empty. */

// Macro defintions
/*
#define	FD_ZERO(fdsetp)		__FD_ZERO (fdsetp)
#define	FD_SET(fd, fdsetp)	__FD_SET (fd, fdsetp)
#define	FD_CLR(fd, fdsetp)	__FD_CLR (fd, fdsetp)
#define	FD_ISSET(fd, fdsetp)	__FD_ISSET (fd, fdsetp)
*/


select 会一直阻塞，直到：

• at least one of the file descriptors specified in readfds, writefds, or exceptfds becomes ready;
• the call is interrupted by a signal handler;
• the amount of time specified by timeout has passed.

• A return value of –1 indicates that an error occurred. Possible errors include EBADF and EINTR . EBADF indicates that one of the file descriptors in readfds, writefds, or exceptfds is invalid (e.g., not currently open). EINTR , indicates that the call was interrupted by a signal handler. (Note that: select() is never automatically restarted if interrupted by a signal handler.)
• A return value of 0 means that the call timed out before any file descriptor became ready. In this case, each of the returned file descriptor sets will be empty.
• A positive return value indicates that one or more file descriptors is ready. The return value is the number of ready descriptors. In this case, each of the returned file descriptor sets must be examined (using FD_ISSET() ) in order to find out which I/O events occurred. If the same file descriptor is specified in more than one of readfds, writefds, and exceptfds, it is counted multiple times if it is ready for more than one event. In other words, select() returns the total number of file descriptors marked as ready in all three returned sets.

## 描述符就绪

IO 多路复用函数一般只在有描述符就绪时才返回，而 socket、普通文件、FIFO 和管道等有不同的就绪条件。

### socket 就绪

• The number of bytes of data in the socket receive buffer is greater than or equal to the current size of the low-water mark for the socket receive buffer. A read operation on the socket will not block and will return a value greater than 0 (i.e., the data that is ready to be read).
• The read half of the connection is closed (i.e., a TCP connection that has received a FIN). A read operation on the socket will not block and will return 0 (i.e., EOF).
• The socket is a listening socket and the number of completed connections is nonzero.
• A socket error is pending. A read operation on the socket will not block and will return an error (−1) with errno set to the specific error condition. These pending errors can also be fetched and cleared by calling getsockopt and specifying the SO_ERROR socket option.

• The number of bytes of available space in the socket send buffer is greater than or equal to the current size of the low-water mark for the socket send buffer and either: (i) the socket is connected, or (ii) the socket does not require a connection (e.g., UDP). This means that if we set the socket to non-blocking, a write operation will not block and will return a positive value (e.g., the number of bytes accepted by the transport layer).
• The write half of the connection is closed. A write operation on the socket will generate SIGPIPE.
• A socket using a non-blocking connect has completed the connection, or the connect has failed.
• A socket error is pending. A write operation on the socket will not block and will return an error (−1) with errno set to the specific error condition. These pending errors can also be fetched and cleared by calling getsockopt with the SO_ERROR socket option.

• A state change occurs on a pseudoterminal slave connected to a master that is in packet mode.
• Out-of-band data is received on a stream socket.
• 注意，当 socket 出错时，既会被标记为可读，又会被标记为可读
• socket 可读和可写时都提到了 low-water mark。对于读，low-water mark 定义了传输层每次向上层应用传递的最少数据大小，可通过 SO_RCVLOWAT 设置，对于 TCP 和 UDP socket，默认值为 1；对于写，low-water mark 定义了上层应用每次向 socket 写入的最少数据大小，可通过 SO_SNDLOWAT 设置，对于 TCP 和 UDP socket，默认值为 2048。参考：Low-water marks

### 普通文件描述符就绪

• A read() will always immediately return data, end-of-file, or an error (e.g., the file was not opened for reading).
• A write() will always immediately transfer data or fail with some error.

## poll

poll 函数定义为：

#include <poll.h>
int poll(struct pollfd *fds, nfds_t nfds, int timeout);


• fds 是指向数组首元素的指针
• nfds 指定该数组元素个数
• timeout 指定 poll 返回前等待时间

pollfd 定义为：

struct pollfd
{
int fd;	/* File descriptor to poll.  */
short int events; /* Types of events poller cares about.  */
short int revents; /* Types of events that actually occurred.  */
};


• Although defined as separate bits, POLLIN and POLLRDNORM are synonymous.
• Although defined as separate bits, POLLOUT and POLLWRNORM are synonymous.
• POLLRDBAND is generally unused; that is, it is ignored in the events field and not set in revents.
• Although set for sockets in certain circumstances, POLLWRBAND conveys no useful information. (There are no circumstances in which POLLWRBAND is set when POLLOUT and POLLWRNORM are not also set.)
• The _XOPEN_SOURCE feature test macro must be defined in order to obtain the definitions of the constants POLLRDNORM , POLLRDBAND , POLLWRNORM , and POLLWRBAND from .
• POLLRDHUP is a Linux-specific flag available since kernel 2.6.17. In order to obtain this definition from , the _GNU_SOURCE feature test macro must be defined.
• POLLNVAL is returned if the specified file descriptor was closed at the time of the poll() call.

timeout 参数取值定义为：

• If timeout equals –1, block until one of the file descriptors listed in the fds array is ready (as defined by the corresponding events field) or a signal is caught.
• If timeout equals 0, do not block, just perform a check to see which file descriptors are ready.
• If timeout is greater than 0, block for up to timeout milliseconds, until one of the file descriptors in fds is ready, or until a signal is caught.

poll 的返回值定义为：

• A return value of –1 indicates that an error occurred. One possible error is EINTR , indicating that the call was interrupted by a signal handler.
• A return of 0 means that the call timed out before any file descriptor became ready.
• A positive return value indicates that one or more file descriptors are ready. The returned value is the number of pollfd structures in the fds array that have a nonzero revents field. (Note that there is slightly different meaning of a positive return value from select() and poll(). The select() system call counts a file descriptor multiple times if it occurs in more than one returned file descriptor set. The poll() system call returns a count of ready file descriptors, and a file descriptor is counted only once, even if multiple bits are set in the corresponding revents field.)

## 对比 select() 与 poll()

### 实现

Within the Linux kernel, select() and poll() both employ the same set of kernel- internal poll routines. These poll routines are distinct from the poll() system call itself. Each routine returns information about the readiness of a single file descrip- tor. This readiness information takes the form of a bit mask whose values corre- spond to the bits returned in the revents field by the poll() system call (Table 63-2). The implementation of the poll() system call involves calling the kernel poll routine for each file descriptor and placing the resulting information in the corresponding revents field.

#define POLLIN_SET (POLLRDNORM | POLLRDBAND | POLLIN | POLLHUP | POLLERR) /* Ready for reading */
#define POLLOUT_SET (POLLWRBAND | POLLWRNORM | POLLOUT | POLLERR) /* Ready for writing */
#define POLLEX_SET (POLLPRI) /* Exceptional condition */


### API 区别

• The use of the fd_set data type places an upper limit ( FD_SETSIZE ) on the range of file descriptors that can be monitored by select(). By default, this limit is 1024 on Linux, and changing it requires recompiling the application. By contrast, poll() places no intrinsic limit on the range of file descriptors that can be monitored.
• Because the fd_set arguments of select() are value-result, we must reinitialize them if making repeated select() calls from within a loop. By using separate events (input) and revents (output) fields, poll() avoids this requirement.
• The timeout precision afforded by select() (microseconds) is greater than that afforded by poll() (milliseconds). (The accuracy of the timeouts of both of these system calls is nevertheless limited by the software clock granularity.)
• If one of the file descriptors being monitored was closed, then poll() informs us exactly which one, via the POLLNVAL bit in the corresponding revents field. By contrast, select() merely returns –1 with errno set to EBADF , leaving us to determine which file descriptor is closed by checking for an error when performing an I/O system call on the descriptor.

### 性能对比

With select(), we pass one or more file descriptor sets and an integer, nfds, which is one greater than the maximum file descriptor to be examined in each set. The nfds argument has the same value, regardless of whether we are monitoring all file descriptors in the range 0 to (nfds – 1) or only the descriptor (nfds – 1). In both cases, the kernel must examine nfds elements in each set in order to check exactly which file descriptors are to be monitored. By contrast, when using poll(), we specify only the file descriptors of interest to us, and the kernel checks only those descriptors.

### select 和 poll 的局限

• On each call to select() or poll(), the kernel must check all of the specified file descriptors to see if they are ready. When monitoring a large number of file descriptors that are in a densely packed range, the time required for this operation greatly outweighs the time required for the next two operations.
• In each call to select() or poll(), the program must pass a data structure to the kernel describing all of the file descriptors to be monitored, and, after checking the descriptors, the kernel returns a modified version of this data structure to the program. (Furthermore, for select(), we must initialize the data structure before each call.) For poll(), the size of the data structure increases with the number of file descriptors being monitored, and the task of copying it from user to kernel space and back again consumes a noticeable amount of CPU time when monitoring many file descriptors. For select(), the size of the data structure is fixed by FD_SETSIZE , regardless of the number of file descriptors being monitored.
• After the call to select() or poll(), the program must inspect every element of the returned data structure to see which file descriptors are ready.

## epoll

epoll 是 Linux 特有的 API，在 Linux 2.6 被引入。epoll 克服了 select() 和 poll() 的局限，还提供边缘触发机制。

### 水平触发和边缘触发

I/O 模型 水平触发 边缘触发
select(), poll()

epoll

### API

epoll 核心数据结构是 epoll instance，通过一个文件描述符引用。该文件描述符不用于做 IO，而是作为内核数据结构的一个句柄，并由该内核数据结构完成以下事情：

• recording a list of file descriptors that this process has declared an interest in monitoring—the interest list;
• maintaining a list of file descriptors that are ready for I/O—the ready list. (The membership of the ready list is a subset of the interest list)

epoll API 包含三类系统调用：

• epoll_create()、epoll_create1()
• epoll_ctl()
• epoll_wait()

### 创建 epoll 实例

epoll_create() 和 epoll_create1() 可以创建一个 epoll 实例，定义分别为：

#include <sys/epoll.h>

int epoll_create(int size);
int epoll_create1(int flags);


epoll_create 和 epoll_create1 都返回一个文件描述符，该描述符作用为：

This file descriptor is used to refer to the epoll instance in other epoll system calls. When the file descriptor is no longer required, it should be closed in the usual way, using close(). When all file descriptors referring to an epoll instance are closed, the instance is destroyed and its associated resources are released back to the system. (Multiple file descriptors may refer to the same epoll instance as a consequence of calls to fork() or descriptor duplication using dup() or similar.)

epoll_create1 在 2.6.27 被引入，如果 flags 为 0，那么 epoll_create1 完全等价于 epoll_create。除此之外， flags 目前有效取值只有 EPOLL_CLOEXEC，这会告诉内核对于新的文件描述符设置 close-on-exec 标志（FD_CLOEXEC）。FD_CLOEXEC 标志等价于 O_CLOEXEC，作用为：

Using the O_CLOEXEC flag allows a program to avoid additional fcntl() F_SETFD and F_SETFD operations to set the close-on-exec flag. It is also necessary in multithreaded programs to avoid the race conditions that could occur using the latter technique. These races can occur when one thread opens a file descriptor and then tries to mark it close-on-exec at the same time as another thread does a fork() and then an exec() of an arbitrary program. (Suppose that the second thread manages to both fork() and exec() between the time the first thread opens the file descriptor and uses fcntl() to set the close-on-exec flag.) Such races could result in open file descriptors being unintentionally passed to unsafe programs.

### 修改 epoll 队列

epoll_ctl() 可以修改 epoll 实例的 intereset list。定义为：

#include <sys/epoll.h>

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);


• epfd 是 epoll_create 或者 epoll_create1 返回的描述符，引用一个 epoll 实例
• fd 是需要被修改的描述符，可以为 pipe，FIFO，socket，POSIX message queue，inotify instance, terminal，device 甚至另一个 epoll 描述符。但是不能为一个普通文件或者文件夹的描述符（否则会产生 EPERM）
• op 指定了对 fd 的动作，可能取值和含义分别如下：
• EPOLL_CTL_ADD Add the file descriptor fd to the interest list for epfd. The set of events that we are interested in monitoring for fd is specified in the buffer pointed to event. If we attempt to add a file descriptor that is already in the interest list, epoll_ctl() fails with the error EEXIST .
• EPOLL_CTL_MOD Modify the events setting for the file descriptor fd, using the information specified in the buffer pointed to by event. If we attempt to modify the settings of a file descriptor that is not in the interest list for epfd, epoll_ctl() fails with the error ENOENT .
• EPOLL_CTL_DEL Remove the file descriptor fd from the interest list for epfd. The event argument is ignored for this operation. If we attempt to remove a file descriptor that is not in the interest list for epfd, epoll_ctl() fails with the error ENOENT . Closing a file descriptor automatically removes it from all of the epoll interest lists of which it is a member.

event 参数是一个 epoll_event 类型的的指针，epoll_event 定义为：

struct epoll_event
{
uint32_t events; /* epoll events (bit mask) */
epoll_data_t data; /* User data */
};


typedef union epoll_data
{
void *ptr;  /* Pointer to user-defined data */
int fd;   /* File descriptor */
uint32_t u32; /* 32-bit integer */
uint64_t u64; /* 64-bit integer */
} epoll_data_t;


event 参数指定了描述符 fd 的设置，含义如下：

• The events subfield is a bit mask specifying the set of events that we are interested in monitoring for fd.
• The data subfield is a union, one of whose members can be used to specify information that is passed back to the calling process (via epoll_wait()) if fd later becomes ready.

### 监听事件

epoll_wait() 用于监听事件，当有描述符就绪时返回。定义为：

#include <sys/epoll.h>

int epoll_wait(int epfd, struct epoll_event *events,
int maxevents, int timeout);
`

• A return value of –1 indicates that an error occurred and errno will be set to indicate the error.
• A return value of 0 indicates no file descriptors were ready within the interval specified by timeout.
• A positive value indicates the number of items that have been placed in the array events.

epoll 是线程安全的：

In a multithreaded program, it is possible for one thread to use epoll_ctl() to add file descriptors to the interest list of an epoll instance that is already being monitored by epoll_wait() in another thread. These changes to the interest list will be taken into account immediately, and the epoll_wait() call will return readiness information about the newly added file descriptors.

### epoll events

epoll events(bit mask) 类似于 poll 使用的 event bit，定义为： 唯一的例外是 EPOLLET 和 EPOLLONESHOT。

### epoll 与 select、poll 对比

• Each time we call select() or poll(), we pass a data structure to the kernel that identifies all of the file descriptors that are to be monitored, and, on return, the kernel passes back a data structure describing the readiness of all of these descriptors. By contrast, with epoll, we use epoll_ctl() to build up a data struc- ture in kernel space that lists the set of file descriptors to be monitored. Once this data structure has been built, each later call to epoll_wait() doesn’t need to pass any information about file descriptors to the kernel, and the call returns information about only those descriptors that are ready.
• On each call to select() or poll(), the kernel must check all of the file descriptors specified in the call. By contrast, when we mark a descriptor to be monitored with epoll_ctl(), the kernel records this fact in a list associated with the underlying open file description, and whenever an I/O operation that makes the file descriptor ready is performed, the kernel adds an item to the ready list for the epoll descriptor. (An I/O event on a single open file description may cause multiple file descriptors associated with that description to become ready.) Subse- quent epoll_wait() calls simply fetch items from the ready list.

### 例子

How to use epoll? A complete example in C

## 参考

1. Unix Network Programming(Volume 1 Thrid Edition).
2. The Linux Programming InTerface