Author: JD Retail Shi Chaoyang

Before talking about the IO multiplexing models, let’s first get a general picture of the Linux file system. In Linux, whether it is your mouse, keyboard, or printer, or even a socket client connected to the local machine, everything is represented in the system as a file descriptor; in other words, everything is a file. Let’s take a look at the file descriptors the system defines by default:

0: standard input (stdin)
1: standard output (stdout)
2: standard error (stderr)

As can be seen from the above list, file descriptors 0, 1, and 2 are already taken by the system; they exist as soon as the system starts, with 0 standing for standard input, 1 for standard output, and 2 for standard error. When we create a new file descriptor, it is allocated incrementally above 2 (the lowest unused number, normally starting at 3). In short, a file descriptor is an index the system creates to manage open files, and it serves as the identity of the file. If you come from Windows, you can think of it as roughly similar to a handle, which makes it easier to understand.
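
As a quick illustration, here is a minimal, self-contained sketch (the path /tmp/demo.txt is just a placeholder) showing that the first descriptor a fresh process opens normally comes back as 3, right after the three standard ones:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    // 0, 1 and 2 are already taken by stdin, stdout and stderr,
    // so the first descriptor we open is normally 3.
    int fd = open("/tmp/demo.txt", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    printf("new file descriptor: %d\n", fd);   // typically prints 3
    close(fd);
    return 0;
}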

Since there are already plenty of articles on the Internet describing how Linux files work, I will not repeat them here; interested readers can start from Wikipedia. The topic is fairly involved and not the focus of this article, so readers are encouraged to dig into it on their own. Mr. Ma Bingbing’s explanations of the Linux file system are also highly recommended.

select model

This model is one of the earliest IO multiplexing models. It has been around for decades, yet plenty of applications still use it, which says something about its staying power. Let’s first look at its definition (from section 2 of the man pages):

int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *errorfds, struct timeval *timeout);

The specific parameters are explained here:

Parameter one: nfds, i.e. maxfd, the highest-numbered file descriptor plus one. Passing the maximum descriptor here limits how far each fd_set has to be traversed.

Parameter two: readfds, a set of readable file descriptors.

Parameter three: writefds, set of writable file descriptors.

Parameter four: errorfds, the set of file descriptors to watch for exceptional conditions.

Parameter five: timeout, the timeout. select returns if no descriptor becomes ready within this time (passing NULL blocks indefinitely).
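
To make the return value and the timeout concrete, here is a minimal, self-contained sketch (it simply watches standard input, fd 0, for five seconds; the values are arbitrary) that distinguishes the three possible outcomes of select:

#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(0, &readfds);            // watch standard input (fd 0)

    struct timeval tv;
    tv.tv_sec = 5;                  // wait at most 5 seconds
    tv.tv_usec = 0;

    int ret = select(0 + 1, &readfds, NULL, NULL, &tv);
    if (ret < 0)
        perror("select");                                   // error
    else if (ret == 0)
        printf("timeout, nothing became readable\n");       // timed out
    else
        printf("%d descriptor(s) ready\n", ret);            // stdin is readable
    return 0;
}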

The following macros operate on the fd_set collection (strictly speaking it is a bitmap: when a descriptor becomes ready, the bit at that descriptor’s index is set to 1):

FD_CLR(int fd, fd_set *set) clears the bit for fd in the descriptor set, i.e. sets the bit at index fd in the bitmap to 0.

FD_ISSET(int fd, fd_set *set) tests whether fd is set in the descriptor set, i.e. whether the bit at index fd in the bitmap is 1.

FD_SET(int fd, fd_set *set) sets the bit for fd in the descriptor set, i.e. sets the bit at index fd in the bitmap to 1.

FD_ZERO(fd_set *set) clears all bits in the descriptor set, i.e. zeroes the entire bitmap.

Let’s first look at a sample code that uses the select model on the server side:

    // Create the server socket and get its file descriptor
    int listenfd = socket(PF_INET, SOCK_STREAM, 0);
    if (listenfd < 0) return -1;
    // Bind the server address (address is assumed to be a struct sockaddr_in filled in elsewhere)
    bind(listenfd, (struct sockaddr*)&address, sizeof(address));
    // Listen for incoming connections
    listen(listenfd, 5);
    struct sockaddr_in client;
    socklen_t addr_len = sizeof(client);
    // Accept a client connection
    int connfd = accept(listenfd, (struct sockaddr*)&client, &addr_len);
    // Read buffer
    char buff[1024];
    // Read descriptor set
    fd_set read_fds;
    int ret;
    while (1)
    {
        memset(buff, 0, sizeof(buff));
        // Note: the fd_set must be re-initialized with connfd before every call to select,
        // because the descriptor table is modified inside the kernel
        FD_ZERO(&read_fds);
        FD_SET(connfd, &read_fds);
        // Note: select copies the user-space descriptor table into the kernel, the kernel
        // updates it and copies it back to user space, which is relatively expensive
        ret = select(connfd + 1, &read_fds, NULL, NULL, NULL);
        if (ret < 0)
        {
            printf("Fail to select!\n");
            return -1;
        }
        // Check whether the descriptor of interest is readable
        if (FD_ISSET(connfd, &read_fds))
        {
            ret = recv(connfd, buff, sizeof(buff) - 1, 0);
            printf("receive %d bytes from client: %s \n", ret, buff);
        }
    }

I have added more detailed comments to the above code, and it should be easy for everyone to understand. To put it bluntly, the general process is actually as follows:

First, create the server socket; once it is created, you get its file descriptor.

Then, bind to the specified address for listening. In this way, the server starts and listens on a specific port.

After that, call accept to wait for client connection requests. Once a client connects, accept returns the connection file descriptor for that client.

After the two sides establish a connection, data can be exchanged. Note that at the top of the loop the file descriptor set for the current connection must be re-initialized every time, because the descriptor table is modified inside the kernel; if it is not reset, unexpected behavior will follow.

After resetting the descriptor set, select can be used to poll which descriptors in the table are ready. At this point the system copies the user-space descriptor table into kernel space, the kernel marks the ready descriptors, and the updated table is handed back to the application in user space.

The application then checks each descriptor with FD_ISSET, and if data is readable, it simply reads it.

For example, suppose 3 clients are connected and their connection descriptors are 4, 8, and 12. The read_fds descriptor table (a bitmap) then looks roughly like 0001000100010000…0, with the bits at indexes 4, 8, and 12 set. Since read_fds is 1024 bits long, at most 1024 connections can be monitored.
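
The following self-contained sketch plays that scenario through (descriptors 4, 8, and 12 are just the hypothetical values from the example above) and shows both the bitmap behaviour of fd_set and its FD_SETSIZE limit:

#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    fd_set read_fds;
    FD_ZERO(&read_fds);                 // clear the whole bitmap
    FD_SET(4, &read_fds);               // set the bits for the three example fds
    FD_SET(8, &read_fds);
    FD_SET(12, &read_fds);

    // Walk the bitmap the same way a select caller has to: O(n) over all bits
    for (int fd = 0; fd < FD_SETSIZE; fd++) {   // FD_SETSIZE is 1024 on typical Linux systems
        if (FD_ISSET(fd, &read_fds))
            printf("bit %d is set\n", fd);
    }
    return 0;
}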

Calling select involves switching between user mode and kernel mode: the descriptor table is copied from user space into the kernel, the kernel scans it and marks the ready descriptors, and the updated table is copied back to user space.

So, to sum up, select is reasonably efficient and very stable overall, but it also has a number of problems that limit its performance:

  1. The file descriptor table is a bitmap structure with a length limit of 1024.

  2. fdset cannot be reused and must be recreated each cycle.

  3. Frequent copying of user mode and kernel mode results in high performance overhead.

  4. The file descriptor table has to be traversed, giving a polling time complexity of O(n).

poll model

Given these limitations of the select model, it was later improved upon, and the result is the poll model. Since it is an improved version of select, it must have its bright spots. Let’s take a look; as before, we start with section 2 of the Linux man pages, since that is the official documentation with the most precise definition.

int poll(struct pollfd *fds, nfds_t nfds, int timeout);

In terms of operating mechanism, poll works essentially the same way as select: it waits for a set of file descriptors to become ready and then lets the caller perform the subsequent IO. The difference is that select uses a bitmap (the fd_set descriptor table) capped at 1024 bits, whereas poll uses fds, an array of pollfd structures. Precisely because poll uses an array, there is no 1024 limit, so it can handle more concurrent connections.

The content of the pollfd structure is as follows:

struct pollfd {
    int   fd;         /* file descriptor */
    short events;     /* requested events (what the caller cares about) */
    short revents;    /* returned events (filled in by the kernel) */
};

As the structure shows, fd is the file descriptor: when a client connects, the resulting descriptor is stored here. events holds the events the caller wants to watch for, while revents holds the events that actually occurred and is filled in by the kernel, so whenever the state of fd changes, the value of revents changes accordingly.

The events field can request flags such as POLLIN (data available to read), POLLOUT (writing will not block), and POLLPRI (urgent data to read). The revents field is filled in with the same flags, plus flags that are only ever returned, such as POLLERR (error condition), POLLHUP (the peer hung up), and POLLNVAL (invalid descriptor). In other words, revents can contain everything that events can, and more. Let’s look at an example:

    // Create the server socket and get its file descriptor
    int listenfd = socket(PF_INET, SOCK_STREAM, 0);
    if (listenfd < 0) return -1;
    // Bind the server address (address is assumed to be a struct sockaddr_in filled in elsewhere)
    bind(listenfd, (struct sockaddr*)&address, sizeof(address));
    // Listen for incoming connections
    listen(listenfd, 5);
    struct pollfd pollfds[1];
    struct sockaddr_in client;
    socklen_t addr_len = sizeof(client);
    // Accept a client connection
    int connfd = accept(listenfd, (struct sockaddr*)&client, &addr_len);
    // Put the connection into the pollfd array
    pollfds[0].fd = connfd;
    pollfds[0].events = POLLIN;
    // Read buffer
    char buff[1024];
    // Read descriptor set (only needed by the commented-out select variant below)
    fd_set read_fds;
    int ret;
    while (1)
    {
        memset(buff, 0, sizeof(buff));
        /**
         ** SELECT model only
         ** Note: the fd_set must be re-initialized with connfd before every call to select,
         ** because the descriptor table is modified inside the kernel
         ** FD_ZERO(&read_fds);
         ** FD_SET(connfd,&read_fds);
         ** Note: select copies the user-space descriptor table into the kernel for updating
         ** and copies it back to user space afterwards, which is relatively expensive
         ** ret = select(connfd+1,&read_fds,NULL,NULL,NULL);
         **/
        ret = poll(pollfds, 1, 1000);
        if (ret < 0)
        {
            printf("Fail to poll!\n");
            return -1;
        }
        /**
         ** SELECT model only
         ** Check whether the descriptor of interest is readable
         ** if(FD_ISSET(connfd, &read_fds))
         ** {
         **   ret = recv(connfd,buff,sizeof(buff)-1,0);
         **   printf("receive %d bytes from client: %s \n",ret,buff);
         ** }
         **/
        // Check the relevant entry in the pollfd array
        if (pollfds[0].revents & POLLIN) {
            pollfds[0].revents = 0;
            ret = recv(connfd, buff, sizeof(buff) - 1, 0);
            printf("receive %d bytes from client: %s \n", ret, buff);
        }
    }

Since the source code above carries detailed comments and spells out every difference from the select model, I won’t walk through it again here. Overall the poll model is an improvement over select and removes some of its restrictions, but the following problems remain unavoidable:

  1. User mode and kernel mode still switch frequently: revents is filled in in kernel mode and then pushed back to user mode, so, much like select, the overall overhead is considerable.

  2. The pollfd array still has to be traversed after every call, so readiness checking remains O(n); the sketch below makes this scan explicit.
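
Here is a minimal sketch of that O(n) scan for a server tracking many clients. The function name poll_and_scan is made up for illustration, and the fds array with its nfds count is assumed to be maintained elsewhere as clients connect and disconnect:

#include <stdio.h>
#include <poll.h>
#include <sys/socket.h>

// Scan every tracked descriptor after poll returns: O(n) in the number of fds.
// fds[] and nfds are assumed to be maintained elsewhere as clients come and go.
void poll_and_scan(struct pollfd *fds, nfds_t nfds)
{
    char buff[1024];
    int ready = poll(fds, nfds, 1000);        // wait up to 1 second
    if (ready <= 0)
        return;                               // timeout or error
    // Only `ready` entries actually fired, but the whole array has to be walked
    for (nfds_t i = 0; i < nfds; i++) {
        if (fds[i].revents & POLLIN) {
            ssize_t n = recv(fds[i].fd, buff, sizeof(buff) - 1, 0);
            if (n > 0) {
                buff[n] = '\0';
                printf("fd %d sent %zd bytes: %s\n", fds[i].fd, n, buff);
            }
            fds[i].revents = 0;
        }
    }
}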

epoll model

If the select and poll models are early products with plenty of unsatisfying aspects, then the epoll model added in Linux 2.6 solves the performance problem outright, making it almost effortless for a single machine to sustain on the order of a million concurrent connections: with a few simple configuration changes plus epoll’s capabilities, one million concurrent connections on one box is easily within reach. And because of epoll’s overall optimizations, the costs that used to hold the older models back are no longer a shackle, which is why epoll has become the model of choice for network communication on the Linux platform.

Before explaining it, let’s again put the Linux man pages front and center: read the epoll man pages (sections 4 and 7) together and you will get a general picture of epoll. The difference from the select and poll discussed above is that those two are system call functions, whereas epoll is not: it is a data structure that lives in the kernel and is manipulated through three functions, epoll_create, epoll_ctl, and epoll_wait.

Start with epoll_create: its job is to create an epoll instance inside the kernel and return the file descriptor of that instance. Attached to this instance is a list, the so-called interest list, in which the file descriptors of connected clients get registered.

Its simplified working mechanism: the application calls epoll_create, the kernel allocates the epoll instance with an empty interest list, and the returned file descriptor is what all subsequent epoll_ctl and epoll_wait calls operate on.

Next, epoll_ctl: its job is to add, delete, and modify entries in an epoll instance, somewhat like the familiar CRUD operations, and the object it operates on is the epoll data structure. When a new client connects, the client is registered in epoll’s interest list; this is done with the EPOLL_CTL_ADD flag. When an existing client drops offline or disconnects deliberately, it is removed from the interest list; this is done with the EPOLL_CTL_DEL flag. When the events watched on a client’s file descriptor need to change, the corresponding entry is updated; this is done with EPOLL_CTL_MOD. And when clients in the interest list become ready for IO, the kernel moves them onto a separate ready list.

Its simplified working mechanism: epoll_ctl keeps the interest list in sync with the connected clients; a small sketch of the three operations follows.
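
This is an illustrative fragment rather than the author’s original code; epfd is assumed to come from epoll_create1, connfd from accept, and <sys/epoll.h> to be included:

struct epoll_event ev;

// Register a newly accepted client for readable events (EPOLL_CTL_ADD)
ev.events = EPOLLIN;
ev.data.fd = connfd;
epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev);

// Later, also watch for writability on the same descriptor (EPOLL_CTL_MOD)
ev.events = EPOLLIN | EPOLLOUT;
epoll_ctl(epfd, EPOLL_CTL_MOD, connfd, &ev);

// When the client disconnects, remove it from the interest list (EPOLL_CTL_DEL)
epoll_ctl(epfd, EPOLL_CTL_DEL, connfd, NULL);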

Finally, epoll_wait: its job is to check the ready list and process the clients whose IO is ready; its return value is the number of such clients. By iterating over just these ready clients, IO handling becomes easy.

These three functions are the basic epoll operations. To fully understand epoll, though, you first need to understand three things: the inode, the linked list, and the red-black tree.

In the Linux kernel there is an open file table for the files currently open, which records the information behind every open file descriptor, and there is also an inode table, which records the underlying file information. If descriptor B is duplicated from descriptor A (for example via dup or fork), the open file table shows a newly added descriptor B, but in the inode table the underlying object behind A and B is exactly the same. It helps to think of the inode table’s contents roughly like file properties on Windows. The benefit of storing things this way is that no matter how the upper-level descriptors change, epoll always monitors the underlying data behind the inode table, so it can always observe every change to the file; this is part of the foundation of epoll’s efficiency. For more detail see these two articles: Nonblocking IO & The method to epoll’s madness.
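
To see that sharing in action, here is a minimal, self-contained sketch (the path /tmp/demo.txt is a placeholder): descriptor b is duplicated from a, and writing through a moves the file offset that b sees, because both entries point at the same underlying object:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int a = open("/tmp/demo.txt", O_CREAT | O_RDWR, 0644);
    int b = dup(a);                     // B is "forked" from A: a new entry in the
                                        // open file table, same underlying object
    write(a, "hello", 5);               // writing through A advances the shared offset
    printf("offset seen through B: %ld\n", (long)lseek(b, 0, SEEK_CUR)); // prints 5
    close(a);
    close(b);
    return 0;
}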

That, in simplified form, is how open file descriptors relate to the underlying inode table.

With the underlying storage settled, what data structure should hold the connected client sockets? The answer is a red-black tree. Client sockets are added and removed frequently, and a red-black tree keeps those operations at O(log N), which is quite efficient. Some may ask: why not a hash table? When a large number of connections arrive or drop in bursts, resizing triggers a lot of rehashing, and hash collisions also have to be considered; even though lookups can indeed reach O(1), rehashing and collisions are uncontrollable, so on balance the red-black tree is the better fit.

With management of the client sockets settled, the next question is what happens when a socket has data to read or write: the system adds the ready socket to a doubly linked list, which is then examined via epoll_wait. This doubly linked list is the crucial part: because it contains only ready entries, the whole set of client sockets never has to be traversed, which greatly improves overall efficiency. The overall flow is as follows (a code sketch follows the list):

First, use epoll_create to create an epoll object in the kernel. In fact, this epoll object is a data structure that can store client connections.

Then, when a client socket connects, epoll_ctl adds its descriptor to the red-black tree inside the epoll object.

Then, once an event occurs on a socket, a callback adds it to the ready-list doubly linked list.

Finally, epoll_wait walks that linked list to process the ready sockets, and data is then detected according to the preset level-triggered or edge-triggered mode.
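
Here is a minimal, self-contained sketch of that flow. It is illustrative only: the port 8888, the backlog of 128, and MAX_EVENTS are arbitrary placeholder values, and error handling is trimmed for brevity:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/epoll.h>
#include <sys/socket.h>

#define MAX_EVENTS 64

int main(void)
{
    int listenfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8888);                   // placeholder port
    bind(listenfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listenfd, 128);

    // Step 1: create the epoll object in the kernel
    int epfd = epoll_create1(0);

    // Step 2: register the listening socket in the interest list (red-black tree)
    struct epoll_event ev, events[MAX_EVENTS];
    ev.events = EPOLLIN;
    ev.data.fd = listenfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev);

    while (1) {
        // Step 4: epoll_wait returns only the sockets on the ready list
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listenfd) {
                // New connection: add the client to the interest list as well
                int connfd = accept(listenfd, NULL, NULL);
                ev.events = EPOLLIN;               // level-triggered by default
                ev.data.fd = connfd;
                epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev);
            } else {
                // Step 3 happened in the kernel: this socket was put on the ready list
                char buff[1024];
                ssize_t len = recv(events[i].data.fd, buff, sizeof(buff) - 1, 0);
                if (len <= 0) {
                    epoll_ctl(epfd, EPOLL_CTL_DEL, events[i].data.fd, NULL);
                    close(events[i].data.fd);
                } else {
                    buff[len] = '\0';
                    printf("receive %zd bytes from client: %s\n", len, buff);
                }
            }
        }
    }
}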

As these details show, because epoll internally monitors the underlying file information, a descriptor whose state changes can be placed on the ready list directly, without the user having to pass in the full descriptor set each time; and because epoll_wait scans only the ready descriptors, a great deal of pointless traversal is avoided. Together these give epoll its big performance edge, and it is fair to say that whenever IO multiplexing on the Linux platform comes up, epoll is now the undisputed choice, leaving select far behind.

Level trigger and edge trigger

The discussion of epoll above mainly covered how clients get connected, but did not explain in detail how epoll_wait is woken up. That is explained below.

Level trigger (Level Trigger, LT) and edge trigger (Edge Trigger, ET) are not the easiest terms to grasp from the words alone, but if you borrow the notions of level, rising edge, and falling edge from hardware design, they become much easier to understand.

Think of it in terms of a buffer: with level triggering, as long as the buffer still contains data you keep being notified; with edge triggering, you are notified only when the amount of data in the buffer changes. That simple picture is a good start, but the actual behaviour is a little more fine-grained, so let’s go through it in detail.
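
For reference, here is an illustrative fragment (epfd and connfd are assumed to already exist, and <sys/epoll.h> to be included) showing how the mode is chosen per descriptor: level-triggered is the default, and adding EPOLLET switches to edge-triggered:

struct epoll_event ev;
ev.data.fd = connfd;

// Level-triggered (the default): notified as long as data remains readable
ev.events = EPOLLIN;
epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev);

// Edge-triggered: add EPOLLET, notified only when the buffer state changes
ev.events = EPOLLIN | EPOLLET;
epoll_ctl(epfd, EPOLL_CTL_MOD, connfd, &ev);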

edge trigger

For read operations, i.e. the fd is registered for EPOLLIN and becomes readable, a trigger means new data has arrived and the receive buffer is readable. The buffers below all refer to the receive buffer:

  1. The buffer goes from empty to non-empty, i.e. data arrives; this triggers a notification.

  2. The buffer already holds some data and new data arrives, so the amount of data increases; this triggers a notification.

  3. The buffer holds data and the user re-registers the fd with an EPOLL_CTL_MOD operation; this triggers a notification.

For write operations, i.e. the fd is registered for EPOLLOUT and becomes writable, a trigger means the send buffer has room to write. The buffers below all refer to the send buffer:

  1. The buffer was full and some data gets sent out, so the amount of buffered data decreases; this triggers a notification.

  2. The buffer already held some data and part of it gets sent out, so the amount of data decreases; this triggers a notification.

These are the situations that produce notifications in ET mode; as you can see, they all revolve around state changes of the receive or send buffer.

Sounds abstract? Not at all once you see an example:

On the server side we enable edge-triggered mode and set the read buffer size to 10 bytes to observe the concrete behaviour.

The server is turned on, the client is connected, and the single character A is sent to the server. The output is as follows:

-->ET Mode: it was triggered once
get 1 bytes of content: A
-->wait to read!

As you can see, the buffer went from empty to non-empty, so the edge trigger fired a notification; the program then blocks in epoll_wait again, waiting for subsequent events.

Now change the input to ABCDEFGHIJKLMNOPQ: the client sends more characters than the server’s buffer can hold, so what does the output look like?

-->ET Mode: it was triggered once
get 9 bytes of content: ABCDEFGHI
get 8 bytes of content: JKLMNOPQ
-->wait to read!

As you can see, because the data sent is larger than the buffer, the content is split across two reads; and since edge triggering fires only on the empty-to-non-empty transition, only one notification is produced.
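
This is exactly why edge-triggered mode is normally paired with a non-blocking socket and a drain loop. A minimal illustrative fragment (connfd is assumed to be non-blocking, and <errno.h> plus the usual socket and stdio headers to be included) keeps reading until recv reports EAGAIN:

// After an EPOLLET notification, drain the receive buffer completely,
// because no further notification will arrive for the data already there.
char buff[10];
while (1) {
    ssize_t n = recv(connfd, buff, sizeof(buff) - 1, 0);
    if (n > 0) {
        buff[n] = '\0';
        printf("get %zd bytes of content: %s\n", n, buff);
    } else if (n == 0) {
        break;                       // peer closed the connection
    } else {
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                   // buffer drained, go back to epoll_wait
        break;                       // real error
    }
}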

level trigger

Level triggering is much simpler: it covers every scenario of edge triggering. In short, it behaves as follows:

When the receiving buffer is not empty and there is data to read, the read event will always be triggered.

When the send buffer is not full and more data can be written, the write event keeps being triggered.

Again, to make the comparison clear, let’s repeat the same inputs as above.

The server is turned on, the client connects and sends a single character A, and the output of the server can be seen as follows:

-->LT Mode: it was triggered once!
get 1 bytes of content: A

No surprise in this output: there is data in the buffer, so level-triggered mode fires and the content is printed.

The server is turned on, the client connects and sends ABCDEFGHIJKLMNOPQ, and the output of the server can be seen as follows:

-->LT Mode: it was triggered once!
get 9 bytes of content: ABCDEFGHI
-->LT Mode: it was triggered once!
get 8 bytes of content: JKLMNOPQ

As the result shows, after the first chunk of buffered data is read there is still unread data left, so level-triggered mode keeps firing, which is why it fires twice here.

With these two examples side by side, the difference between the two modes should be clear.

In day-to-day development LT is simply easier to use, since the system does most of the checking and notification work for us; the select and poll calls discussed above also behave in this level-triggered fashion. Be aware, though, that when thousands of clients start sending data, LT’s behaviour means the kernel handles notifications very frequently, consuming more system resources than ET, so its performance degrades as the number of clients grows.

Edge triggering, by contrast, reacts only to state changes of the fd, so notifications are far less frequent overall and performance under high concurrency is much better. In this mode, however, the user has to take care of consuming every piece of data, which adds a real maintenance burden and makes it easy to slip up, so use it with great care.

As for which of the two to choose, it is largely a matter of taste and circumstance.

That brings the explanation of epoll to a close; hopefully you got a lot out of it. Working down from netty to the Linux epoll internals is genuinely hard, and comparatively few articles explore it, so much of it has to be pieced together bit by bit from the man pages and the source code (see eventpoll.c in the Linux kernel source for the details). One correction to what search engines will happily tell you: a common claim is that epoll is fast because it uses mmap to share memory between user space and the kernel. I was misled by that view for a long time; after going through the Linux source I found no use of mmap anywhere in epoll, so the claim is simply wrong. Articles repeating it are plentiful both in China and abroad, so please weigh what you read carefully and avoid being led astray.

So the foundation of epoll’s high performance is its efficient handling of file descriptors, combined with its characteristic edge-triggered processing mode, which keeps switches between kernel mode and user mode to a minimum and achieves high concurrency in the true sense.
