In this section, we will see how Linux (and many other members of the Unix family) implement common network patterns, and how a user interacts with those while writing networking applications. All discussions in this section will be strictly based on a Linux-like OS with the standard C library (glibc). The Portable Operating System Interface (POSIX) standard includes all of these, making them portable to any POSIX-compliant OS. All functions and data structures here follow C (and C++) coding conventions, but as we will see later, some of these are available in Rust as well through libc bindings.
The most important networking primitive that the OS provides is a socket. Now, what is a socket? A socket is a glorified file descriptor, a unique ID that is assigned to each file in a Unix-like OS. This follows from the Unix philosophy that everything should be a file; treating the connection between two hosts over a network as a file enables the OS to expose it as a file descriptor. The programmer is then free to use traditional I/O-related syscalls to read from and write to that file.
Now, obviously, a socket needs to hold some more data than a regular file descriptor. For instance, it needs to track the remote IP and port (and also the local IP and port). Thus, a socket is a logical abstraction for the connection between two hosts, along with all information needed to transfer data between those hosts.
There are two major classes of sockets: UNIX sockets for communicating with processes on the same host, and internet sockets for communication over an IP network.
The OS also provides a number of system calls for interacting with sockets, exposed through the standard library. Some of those are socket-specific and some of them are generic I/O syscalls that can operate on any file descriptor. Since a socket is basically a file descriptor, those can be used to interact with sockets. Some of these are described in the next image. Note that not all applications will need to use all of these syscalls. A server, for instance, will need to call listen to start listening for incoming connections once it has created a socket. It will not need to call connect for that same connection:
Common networking system calls
Any Unix-like OS will have detailed documentation for each of these syscalls in the manpages. The docs for the socket syscall, for example, can be accessed using the command man 2 socket. The second argument to the man command is the section of the manpages; section 2 covers system calls.
Let's look at the signatures of these syscalls in more detail. Unless otherwise mentioned, all of these return 0 on success or -1 on failure, and set the value of errno accordingly.
int socket(int domain, int type, int protocol);
The first parameter of the socket syscall tells it what communication domain (address family) the socket will use. Common values are AF_INET for IPv4, AF_INET6 for IPv6, AF_UNIX for local inter-process communication, and so on. The second parameter tells it what type of socket should be created, common values being SOCK_STREAM for a TCP socket, SOCK_DGRAM for a UDP socket, and SOCK_RAW for a raw socket, which provides direct access to packets at the network layer, bypassing the transport protocols. The last parameter selects a protocol within the given family; passing 0 picks the default for the chosen type (TCP for SOCK_STREAM, UDP for SOCK_DGRAM). A complete list of supported protocols is available in the file /etc/protocols.
On success, this returns a new file descriptor that the kernel assigns to the socket created.
int bind(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
The first parameter for bind is a file descriptor, generally one returned by the socket system call. The second parameter is the address to be assigned to the given socket, passed as a pointer to a structure. The third parameter is the length of the given address.
int listen(int sockfd, int backlog);
listen takes in the file descriptor of a socket and marks it as ready to accept incoming connections. Note that an application might not be able to accept connections as fast as they arrive. To handle cases like this, the kernel maintains a queue of pending connections for each listening socket. The second parameter here is the maximum length of that queue for the given socket. If the queue is full, additional connection attempts may be rejected, with the client seeing a connection refused error.
int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);
This call is used to accept connections on TCP sockets. It takes the first pending connection off the queue for the given socket, creates a new socket for it, and returns the new socket's file descriptor back to the caller. The second argument is a pointer to a socket address struct that is filled in with the peer's information, and the third argument is a pointer to its length.
int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
This function connects the socket given by the first argument to the address specified in the second argument (the third argument being the length of the address struct).
ssize_t send(int sockfd, const void *buf, size_t len, int flags);
This is used to send data over a socket. The first argument tells it which socket to use. The second argument is a pointer to the data to be sent, and the third argument is its length. The last argument is a bitwise OR of a number of options that dictate how the data should be delivered over this connection.
This system call returns the number of bytes sent on success.
ssize_t recv(int sockfd, void *buf, size_t len, int flags);
This one is the counterpart of send. As usual, the first argument tells it which socket to read from. The second argument is a pointer to an allocated space where it should write the data it reads, and the third argument is its length. flags here has the same meaning as in the case of send.
This function returns the number of bytes received on success; note that this may be fewer than the number of bytes requested.
int shutdown(int sockfd, int how);
This function shuts down a socket. The first argument tells it which socket to shut down. The second argument dictates if any further transmission or reception should be allowed before the socket is shut down.
int close(int fd);
This system call is used to destroy file descriptors, so it can be used to close and clean up a socket as well, given its file descriptor. While shutdown lets the socket finish receiving pending data while refusing new transmissions, close simply releases the descriptor; once all descriptors referring to the socket are closed, the connection is torn down and its resources are cleaned up.
Other than the ones noted above, a host will also need to resolve the IP of a remote host using DNS. The getaddrinfo library function does that. There are some other functions that provide various useful information for writing applications: gethostname returns the host name of the current computer, setsockopt sets various control options on a socket, and so on.
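As a sketch, the following resolves localhost with getaddrinfo. The function returns a linked list of candidate addresses whose fields can be fed directly to socket and connect; the example restricts itself to IPv4 so the output is predictable:

```c
/* A minimal sketch of getaddrinfo: resolve "localhost" to IPv4 addresses. */
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void) {
    struct addrinfo hints, *res, *p;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;        /* restrict to IPv4 for a stable output */
    hints.ai_socktype = SOCK_STREAM;

    int err = getaddrinfo("localhost", "80", &hints, &res);
    if (err != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err));
        return 1;
    }
    for (p = res; p != NULL; p = p->ai_next) {
        char ip[INET_ADDRSTRLEN];
        struct sockaddr_in *sin = (struct sockaddr_in *)p->ai_addr;
        inet_ntop(AF_INET, &sin->sin_addr, ip, sizeof ip);
        printf("resolved to %s\n", ip);
    }
    freeaddrinfo(res); /* release the list when done */
    return 0;
}
```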
Note that a lot of the syscalls described above are blocking; they block the thread they are invoked in until the given operation finishes. For example, the read syscall will block on a socket if no data is available to read. Often, this is not desirable, especially in modern multithreaded environments, since a thread stuck in a blocking call cannot do any other useful work in the meantime.
Unix provides some more syscalls that enable asynchronous, non-blocking applications using the standard C library. There are two standard ways of doing this:
- Using the select system call: This syscall monitors a list of given sockets and lets the caller know if any of those has data to read from. The caller can then retrieve those file descriptors using some special macros and read from those.
- Using the poll system call: The high-level semantics here are similar to those of select: it takes in a list of socket file descriptors and a timeout, monitors them for the given timeout, and if any of them have some data, it lets the caller know. Unlike select, which checks all three conditions (readability, writability, and error) on every file descriptor and is limited to FD_SETSIZE (typically 1024) descriptors, poll only checks the specific conditions requested for each descriptor. This makes poll easier to work with and often faster than select.
In practice, however, select and poll are both very slow for applications that need to monitor a large number of sockets. For such applications, either epoll (a more scalable, Linux-specific interface) or an event-based networking library like libevent or libev might be more suitable. The gain in performance comes at the cost of portability: epoll is not available outside Linux, and the libraries are not part of the standard library on every system. The other cost is the complexity of writing and maintaining applications based on external libraries.
In the following section, we will walk through the state transitions of a TCP server and client that is communicating over a network. There are some idealistic assumptions here for the sake of simplicity: we assume that there are no intermediate errors or delays of any kind, that the server and the client can process data at the same rate, and that neither the server nor the client crash while communicating. We also assume that the client initiates the connection (Active open) and closes it down (Active close). We do not show all the possible states of the state machine since that will be way too cumbersome:
TCP state transition for a server and a client
Both the server and the client start from the CLOSED state. Assuming the server starts up first, it will first acquire a socket, bind an address to it, and start listening on it. The client starts up and calls connect to the server's address and port. When the server sees the connection, it calls accept on it. That call returns a new socket from which the server can read data. Before actual data transmission can occur, however, the server and the client must complete the three-way handshake; the kernel carries this out as part of connect and accept. The client initiates it by sending a SYN and goes to the SYN_SENT state. The server reads that, responds with a SYN + ACK message, and goes to the SYN_RCVD state.
When the client gets the SYN + ACK, it sends out a final ACK and goes to the ESTABLISHED state. The server goes to ESTABLISHED when it gets the final ACK. The actual connection is established only when both parties are in the ESTABLISHED state. At this point, both the server and the client can send and receive data. These operations do not cause a state change.

After some time, the client might want to close the connection. For that, it sends out a FIN packet and goes to the FIN_WAIT_1 state. The server receives that, sends an ACK, and goes to the CLOSE_WAIT state. When the client gets that, it goes to the FIN_WAIT_2 state. This concludes the first round of connection termination. The server then calls close, sends out a FIN, and goes to the LAST_ACK state. When the client gets that, it sends out an ACK and goes to the TIME_WAIT state. When the server receives the final ACK, it goes back to the CLOSED state. After this point, all server resources for this connection are released. The client, however, waits for a timeout before moving on to the CLOSED state where it releases all client-side resources.
Our assumptions here are pretty basic and idealistic. In the real world, communication will often be more complex. For example, the server might want to push data, and then it will have to initiate the connection. Packets might be corrupted in transit, causing either of the parties to request retransmission, and so on.
Maximum Segment Lifetime (MSL) is defined to be the maximum time a TCP segment can exist in the network; RFC 793 specifies it as two minutes. The TIME_WAIT timeout is nominally twice the MSL; Linux hardcodes this period to 60 seconds.