CIS 4307: Sockets

Online references:

Primer on Sockets by Jim Frost (Software Tool & Die)
Introductory tutorial on IPC in 4.4BSD-Unix (by S.Sechrest UC-Berkeley) (Postscript)
Advanced tutorial on IPC in 4.4BSD-Unix (by S. Leffler, R. Fabry, W. Joy, P. Lapsley UC-Berkeley, S. Miller, C. Torek U-Maryland) (Postscript)
Beej's Guide to Network Programming
Sockets Tutorial from RPI
Programming UNIX Sockets in C - Frequently Asked Questions
Another Socket FAQ


Introduction

We examine some functions for communication through sockets. [In practice you may be able to skip this software level and use middleware such as DCE RPC (Distributed Computing Environment Remote Procedure Call). Still, an understanding of the socket API provides a grounding in some of the issues and problems of distributed computation.] A socket is an endpoint used by a process for bidirectional communication with a socket associated with another process. Sockets, introduced in Berkeley Unix, are a basic mechanism for IPC within a computer system, or between different computer systems connected by local or wide area networks. In the following we will not be concerned with networks and data communications. We will take instead a strictly operational viewpoint: how to program with sockets to create communication channels.

The communication channel created with sockets can be like a telephone line (connection oriented), with the sockets as telephones over which a conversation can take place. Or the channel can be as when we send mail (datagram oriented), with the sockets as mailboxes. Connection oriented communication is reliable, i.e. the system takes care of errors, and order-preserving, i.e. the receiver will receive information in the order in which it was sent. Datagram oriented communication is unreliable, i.e. messages can be lost, or they may not be delivered in the order in which they were sent. Of course the convenience of connection oriented sockets comes at a price in performance.

A socket appears to the user to be like a file descriptor on which we can read, write, and ioctl. In the connection oriented mode, the file is like a sequence of characters that we can read with as many read operations as we like. In the connectionless mode we have to get a whole message in a single read operation. If we don't, what is left over of the message is lost. Though sockets can be used in a single computer system for interprocess communication (the Unix domain), we will only consider their use for communication across computer systems (the Internet domain). It is possible to send messages on a socket that take precedence over other undelivered messages. These priority messages are called out-of-band messages. We will not deal with them in this note. We will also not deal with composite messages (sendmsg and recvmsg).

A problem in communication is how to identify interlocutors. In the case of phones we have telephone numbers, for mail we have addresses. For communicating between sockets we [usually, since within a single computer we could use file names] identify an interlocutor with a pair: IP address and port. [In reality there is a third component, the protocol, but that will not be relevant to us since the protocol used will always be obvious in our cases.] This pair represents the address or name of the interlocutor.

IP addresses (things like 155.247.207.190) are 32 bit unsigned integers (155, 247, 207, 190 are the bytes, from low to high addresses, with 155 holding the most significant bits: this is the network representation of integers, called big-endian). We only consider Version 4 IP addresses. An IP address consists of two parts, one identifying a network (in the case of 155.247.207.190, the network is 155.247), the other identifying a computer within that network (in our case 207.190), and it can be written in a number of formats. IP addresses are more easily remembered as host names (things like snowhite.cis.temple.edu). [You may find out about IP to host conversions with the nslookup command and by looking in the /etc/hosts file.] [IP addresses, to be exact, identify the Network Interface Card between a computer and a network, and a computer might have a number of such cards connecting to a number of networks. But for brevity we will not worry about this distinction. Further, multiple IP addresses could be associated with one interface.]

A special IP address, 127.0.0.1, is used to refer to the local host: it is the loopback address. The IP address 0.0.0.0 is called INADDR_ANY and is tied to all the IP addresses of this machine (it is also used during bootstrap of the system). Another special IP address, 255.255.255.255, is used for broadcast to all hosts on the local network of the current machine. And the host id consisting of all 1s is used to broadcast to all the computers of a LAN.
Ports are 16 bit unsigned integers. The first 1024 port numbers are reserved for things like http (port 80). These ports are called well-known ports. You can look in the files /etc/services and /etc/inetd.conf to see standard uses (ftp, telnet, finger, ..) of these ports. From 1024 up things are not too well established. Certainly from 49152 to 65535 the ports are private and can be dynamically allocated (ephemeral ports). The interval 1024 to 49151 consists of registered ports. For reasons I am not sure of, it is recommended that the ports that you personally select be in the range 5000 to 49151. The port 0 is used as a wild card to request the kernel to find a port for us when we do not care which.

Client-Server Architecture

A standard way of using sockets and communication channels is between clients and servers. A server is a process that is able to carry out some function, called a service, like transferring files, translating host names to IP addresses, or inverting a matrix. A client is a process that requests a server to do a service (say, "translate snowhite.cis.temple.edu"). Typically the server will be at a known IP address and will respond to requests sent to a known port. In some cases that port is not universally known, so the server will advertise the port it is currently using (it may advertise the port by printing out its value, or sending email, or having inetd, a special process, know about it, etc.). In some cases the IP address of the server is not known and one may have a "standard" server that responds to requests of the form "where can I find service Moo" by responding with an appropriate IP address. The client requests the kernel to obtain a free port to be used for communication with the server. The server does not have to know in advance the identity of its clients. It is ready to accept a message from any interlocutor. When it receives a message from a client, the message itself contains the IP address and the port of the client, so that the server knows whom to answer.

An address, host+port, can be used for multiplexing more than one communication. So one server can communicate simultaneously with more than one client. Each communication channel on the server will have its own socket bound to the same address (assuming a singly homed server). In the case of the UDP protocol on the server a socket is identified by the pair [server IP, server port]. Thus messages from different clients to the same socket [IP+port pair] will go to the process that created that socket. In TCP each connection is identified by a socket pair: [client IP, client port] + [server IP, server port]. The number of concurrent connections is not limited by the number of available ports (64K). But the number of services is limited by the number of ports [assuming same protocol].

Summary On Socket Functions

The following is a summary of the basic socket functions as they are used for datagram and connection oriented service by clients and servers. In the following section we will go in greater detail over these functions.

Datagram Service

Client: socket => ([bind =>] [connect =>] {write => read}*) | {sendto => recvfrom}* => close | shutdown
In words: create a socket, then bind it to a local port [if bind is not used, the kernel will select a free local port], establish the address of the server, write to it and read from it, or just sendto and recvfrom it; then terminate. In the case that the client is not interested in a response, it does not need to use bind. Connect is worth using when we send many datagrams to the same server.
Server: socket => bind => {read | recvfrom => write | sendto}* => close | shutdown
In words: create a socket, bind it to a local port, receive and reply to messages from clients, terminate. In the case that the server does not need to reply to the client, it can just use read instead of recvfrom.
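The datagram sequences above can be tried out in a single process on the loopback address. The following is only a sketch under stated assumptions: the function name udp_echo_once is hypothetical, both the "server" and the "client" socket live in the same process, and the server lets the kernel choose its port (port 0) and then discovers it with getsockname.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: sends msg from a client socket to a server socket
   on 127.0.0.1, has the server echo it back, and stores the reply
   (NUL terminated) in reply. Returns the reply length, or -1 on failure. */
int udp_echo_once(const char *msg, char *reply, size_t cap) {
    struct sockaddr_in srv, from;
    socklen_t len;
    char buf[512];
    ssize_t n;
    int ssd = -1, csd = -1, rc = -1;

    if (cap == 0) return -1;
    ssd = socket(AF_INET, SOCK_DGRAM, 0);
    csd = socket(AF_INET, SOCK_DGRAM, 0);
    if (ssd < 0 || csd < 0) goto done;

    /* Server: socket => bind. Port 0 asks the kernel for a free port. */
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(0);
    srv.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(ssd, (struct sockaddr *)&srv, sizeof(srv)) < 0) goto done;

    /* Find out which port the kernel chose. */
    len = sizeof(srv);
    if (getsockname(ssd, (struct sockaddr *)&srv, &len) < 0) goto done;

    /* Client: no bind; the kernel assigns a local port at the first sendto. */
    if (sendto(csd, msg, strlen(msg), 0,
               (struct sockaddr *)&srv, sizeof(srv)) < 0) goto done;

    /* Server: recvfrom fills in the client's address, so we can reply. */
    len = sizeof(from);
    n = recvfrom(ssd, buf, sizeof(buf), 0, (struct sockaddr *)&from, &len);
    if (n < 0) goto done;
    if (sendto(ssd, buf, (size_t)n, 0, (struct sockaddr *)&from, len) < 0)
        goto done;

    /* Client: the whole datagram must be taken in one call; anything
       beyond cap-1 bytes would be discarded. */
    n = recvfrom(csd, reply, cap - 1, 0, NULL, NULL);
    if (n < 0) goto done;
    reply[n] = '\0';
    rc = (int)n;
done:
    if (ssd >= 0) close(ssd);
    if (csd >= 0) close(csd);
    return rc;
}
```

In a real application the two halves would of course run in different processes, usually on different machines, and the client would have to cope with lost datagrams.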

Connection Oriented service

Client: socket => [bind =>] connect => {write | sendto => read | recvfrom }* => close | shutdown
In words: create a socket, bind it to a local port (we usually do not call bind), establish the address of the server, communicate with it, terminate. If bind is not used, the kernel will select a free local port.
Server: socket => bind => listen => {accept => {read | recvfrom => write | sendto}* }* => close | shutdown
In words: create a socket, bind it to a local port, set up service with indication of the maximum number of connection requests that may be queued, accept requests from connection oriented clients, receive messages and reply to them, terminate.
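The connection oriented sequences can also be exercised in one process over the loopback address. This is a sketch, not a realistic server: the name tcp_echo_once is hypothetical, there is a single client, and we rely on the fact that the kernel completes the connection and queues it on the listening socket before accept is called.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: one client connects to one server on 127.0.0.1,
   sends msg, and receives it back. Returns the reply length or -1. */
int tcp_echo_once(const char *msg, char *reply, size_t cap) {
    struct sockaddr_in srv;
    socklen_t len = sizeof(srv);
    char buf[512];
    ssize_t n;
    size_t mlen = strlen(msg);
    int lsd = -1, csd = -1, asd = -1, rc = -1;

    if (cap == 0 || mlen == 0 || mlen > sizeof(buf)) return -1;

    /* Server: socket => bind => listen */
    lsd = socket(AF_INET, SOCK_STREAM, 0);
    if (lsd < 0) goto done;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(0);                      /* kernel picks the port */
    srv.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(lsd, (struct sockaddr *)&srv, sizeof(srv)) < 0) goto done;
    if (listen(lsd, 5) < 0) goto done;
    if (getsockname(lsd, (struct sockaddr *)&srv, &len) < 0) goto done;

    /* Client: socket => connect (no bind: the kernel picks a local port) */
    csd = socket(AF_INET, SOCK_STREAM, 0);
    if (csd < 0) goto done;
    if (connect(csd, (struct sockaddr *)&srv, sizeof(srv)) < 0) goto done;

    /* Server: accept returns a new, connected socket */
    asd = accept(lsd, NULL, NULL);
    if (asd < 0) goto done;

    /* Client writes, server reads and echoes, client reads. A short
       message normally arrives in one read; robust code loops. */
    if (write(csd, msg, mlen) < 0) goto done;
    n = read(asd, buf, sizeof(buf));
    if (n <= 0) goto done;
    if (write(asd, buf, (size_t)n) < 0) goto done;
    n = read(csd, reply, cap - 1);
    if (n < 0) goto done;
    reply[n] = '\0';
    rc = (int)n;
done:
    if (asd >= 0) close(asd);
    if (csd >= 0) close(csd);
    if (lsd >= 0) close(lsd);
    return rc;
}
```

Note that the server ends up with two sockets: the listening socket lsd and the connected socket asd returned by accept.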

Socket Functions

Creating a socket

    #include <sys/types.h>
    #include <sys/socket.h>

    int socket(int domain, int type, int protocol)
      domain is either AF_UNIX, AF_INET, or AF_OSI, or ..
         AF_UNIX is the Unix domain, it is used for communication within a 
            single computer system. [AF_LOCAL is the Posix name for AF_UNIX.]
         AF_INET is for communication on the internet to IP addresses.
            We will only use AF_INET.
      type is either SOCK_STREAM (TCP, connection oriented, reliable),
         or SOCK_DGRAM (UDP, datagram, unreliable), or SOCK_RAW (IP level).
         [In the domain AF_UNIX the address of a socket is the name of a file.]
      protocol specifies the protocol used. It is usually 0
         to say we want to use the default protocol for the chosen
         domain and type. We always use 0.
      It returns, if successful, a socket descriptor which is an int.
      It returns -1 in case of failure.

Here is a typical call to socket:

    if ((sd = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
       perror("socket"); exit(1);}

Socket Addresses

Here are the structures (OSF Unix) used to store socket addresses as used in the domain AF_INET:

    struct in_addr {
        uint32_t s_addr; //unsigned 32 bit integer also called in_addr_t
      };

    struct sockaddr_in {
        u_short        sin_family; /*address family; AF_INET for IPv4 */
        u_short        sin_port;   /*port number. 0 means let kernel choose */
        struct in_addr sin_addr;   /*the IP address. INADDR_ANY refers to */
                                   /*the IP addresses of the current host.*/
                                   /*It is considered a wildcard IP address.*/
        char           sin_zero[8];}; /*Unused, always zero */

    In order to use struct sockaddr_in you need to include in your program

	#include <netinet/in.h>

    The following structure sockaddr is more generic than but compatible
    with sockaddr_in (both are 16 bytes starting with the same field).

    struct sockaddr {
        u_short  sa_family;
        char     sa_data[14];};

    In the Unix domain we have a different address, sockaddr_un, which is
    also compatible with sockaddr. In order to use sockaddr_un you need to
    include in your program

	#include <sys/un.h>

Binding to a local port

    #include <sys/types.h>
    #include <sys/socket.h>
    int bind(int sd, const struct sockaddr *addr, socklen_t addrlen)
       sd: File descriptor of local socket, as created by the socket
          function.
       addr: Pointer to protocol address structure of this socket. 
       addrlen: Length in bytes of structure referenced by addr.
    It returns an integer, the return code (0=success, -1=failure)

Bind is used to specify for a socket the protocol port number where it will wait for messages. Here is a typical call to bind:

    struct sockaddr_in name;
    .....
    bzero((char *) &name, sizeof(name)); /*zeroes out sizeof(name) characters*/
    name.sin_family = AF_INET;           /*use internet domain*/
    name.sin_port = htons(0);            /*ask kernel to provide a port - since 0
					   is 0 on little and big endian machines
					   htons conversion is unnecessary - there
					   as reminder */
    name.sin_addr.s_addr = htonl(INADDR_ANY); /*use all IPs of host*/
    if (bind(sd, (struct sockaddr *)&name, sizeof(name)) < 0) {
       perror("bind"); exit(1);}

    A call to bind is optional on the client side, required on the server side.
    After a socket is bound we can retrieve its address structure, given the
    socket file descriptor (the int) by using the function getsockname.

We need to understand the reasons for the calls to htons and htonl. Numbers on different machines may be represented differently: in a little-endian machine the low order byte of an integer appears at the lower address; in a big-endian machine the low order byte appears at the higher address. For example, if the 16-bit integer 0x0102 is viewed as a byte array c[2], in a little-endian machine c[0] contains 2 and c[1] contains 1, while in a big-endian machine c[0] contains 1 and c[1] contains 2. Network order, the order in which numbers are sent on the internet, is big-endian. Sun-Sparc machines are big endian. i-386 PC and Digital Alpha are little endian. We need to make sure that the right representation is used on each machine. We use functions to convert from host to network form before transmission (htons for 16-bit short integers, and htonl for 32-bit long integers), and from network to host form after reception (ntohs for short integers, and ntohl for long integers).
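A few lines of C are enough to see which kind of machine we are on and what htons does. This is a minimal sketch; the function names is_little_endian and first_network_byte are made up for the illustration.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 on a big-endian host:
   we store 0x0102 in a 16-bit integer and look at the byte that
   sits at the lower address. */
int is_little_endian(void) {
    uint16_t x = 0x0102;
    unsigned char c[2];
    memcpy(c, &x, sizeof(x));
    return c[0] == 0x02;
}

/* Returns the byte at the lower address of htons(host_value).
   In network order this is always the high order byte. */
unsigned char first_network_byte(uint16_t host_value) {
    uint16_t net = htons(host_value);
    unsigned char c[2];
    memcpy(c, &net, sizeof(net));
    return c[0];
}
```

On any host first_network_byte(0x0102) is 1: on a big-endian machine htons changes nothing, on a little-endian machine it swaps the two bytes.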

The function bzero zeroes out a buffer of specified length. It is one of a group of BSD functions for dealing with arrays of bytes. bcopy copies a specified number of bytes from a source to a target buffer. bcmp compares a specified number of bytes of two byte buffers. Alternatively one can use the standard C functions memset, memcpy, memcmp, which are preferable in new code.

Connecting to a Server

A remote process, usually a server, is identified by an IP address and a port number. The connect operation is used on the client side to identify and, possibly, start the connection to the server. It is required in the case of connection oriented communication. In the datagram case it is not required, but, if used, it gives the default name of the interlocutor so that we do not need to repeat it in each message when using send, and it will not accept datagrams from other addresses. In the stream case it performs the active open of the connection (see socket states).

    #include <sys/types.h>
    #include <sys/socket.h>

    int connect(int sd, const struct sockaddr *addr, socklen_t addrlen)
       sd file descriptor of local socket
       addr pointer to protocol address of other socket
       addrlen length in bytes of address structure
    It returns an integer (0=success, -1=failure)
    If the socket had been placed in non-blocking mode, and the connection
    cannot be established immediately, the return is -1 and errno is
    EINPROGRESS. The select system call can then be used to determine
    when the connection is completed (sd becomes writable).

Here is a typical call to connect:

    #define SERV_NAME ...    /* say, "snowhite.cis.temple.edu" */
    #define SERV_PORT ...    /* say, 8001 */
    struct sockaddr_in servaddr;
    struct hostent *hp;      /* Here we store information about host*/
    int sd;                  /* File descriptor for socket */
    .......
    /* initialize servaddr */
    bzero((char *)&servaddr, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    servaddr.sin_port = htons(SERV_PORT);
    hp = gethostbyname(SERV_NAME);
    if (hp == 0) {
       fprintf(stderr, "failure to get address of %s\n", SERV_NAME); exit(1);}
    bcopy(hp->h_addr_list[0], (caddr_t)&servaddr.sin_addr, hp->h_length);
    if (connect(sd, (struct sockaddr *)&servaddr, sizeof(servaddr)) < 0) {
       perror("connect"); exit(1);}

The function gethostbyname is described below.

Gethostbyname

The function gethostbyname given a host name, like snowhite.cis.temple.edu, returns 0 in case of failure, or a pointer to a struct hostent object which gives information about the host names+aliases+IPaddresses:

    struct  hostent {
	char	*h_name;	/* official name of host */
	char	**h_aliases;	/* null terminated list of aliases for name*/
	int	h_addrtype;	/* host address type: AF_INET, AF_INET6 */
	int	h_length;	/* length in bytes of each address */
	char	**h_addr_list;	/* null terminated list of IP addresses */
	                        /* for this host, in network byte order */
    };
    #define	h_addr	h_addr_list[0] /*address,for backward compatibility*/

In this structure h_addr_list[0] is the first IP address associated with the host. In order to use this structure you must include in your program:

    #include <netdb.h>

The function prototype is

    struct hostent *gethostbyname(const char *hostname);

gethostbyname returns NULL in case of error, in which case the external int variable h_errno contains the code of the error. Other functions help us find out things about hosts, services, protocols, networks: getpeername, gethostbyaddr, getprotobyname, getprotobynumber, getprotoent, getservbyname, getservbyport, getservent, getnetbyname, getnetbynumber, getnetent.
Beware that gethostbyname is not threadsafe and not reentrant (why?).
Here is a small program that shows how to determine what is the hostname of the local machine, what is its IP address in network form (in hexadecimal), and what is the dotted decimal representation of that address.
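A program of this kind can be sketched as follows. The function names format_ipv4_hex_dotted and print_local_host are hypothetical, and we assume that the local host name resolves to an IPv4 address (which depends on how the machine is configured).

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Writes "<hex> <dotted decimal>" for addr into out.
   The hex digits are shown most significant byte first.
   Returns 0 on success, -1 if out is too small or conversion fails. */
int format_ipv4_hex_dotted(struct in_addr addr, char *out, size_t cap) {
    char dotted[INET_ADDRSTRLEN];
    int n;
    if (inet_ntop(AF_INET, &addr, dotted, sizeof(dotted)) == NULL)
        return -1;
    n = snprintf(out, cap, "%08lx %s",
                 (unsigned long)ntohl(addr.s_addr), dotted);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}

/* Prints the local host name and its first IPv4 address.
   Returns 0 on success, -1 on failure. */
int print_local_host(void) {
    char name[256], line[64];
    struct hostent *hp;
    struct in_addr addr;

    if (gethostname(name, sizeof(name)) < 0) return -1;
    name[sizeof(name) - 1] = '\0';     /* gethostname may not terminate */
    hp = gethostbyname(name);          /* not thread-safe, see above */
    if (hp == NULL || hp->h_addrtype != AF_INET ||
        hp->h_length != (int)sizeof(addr))
        return -1;
    memcpy(&addr, hp->h_addr_list[0], sizeof(addr));
    if (format_ipv4_hex_dotted(addr, line, sizeof(line)) < 0) return -1;
    printf("%s %s\n", name, line);
    return 0;
}
```

A main function would just call print_local_host and report a failure if it returns -1.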

Listening for a Client

The listen function is used on the server in the case of connection oriented communication to prepare a socket to accept connection requests from clients. It has the form:

    int listen(int fd, int qlen)
       fd file descriptor of a socket that has already been bound
       qlen specifies the maximum number of connection requests that
          can wait to be processed by the server while the server is 
          busy servicing another connection request.
       It returns an integer (0=success, -1=failure)

Here is a typical call to listen:

    if (listen(sd, 50) < 0) {
       perror("listen"); exit(1);}

Associated with a socket after listen has executed are two queues. The first queue contains information about incomplete connections, i.e. connections whose setup is still in progress. The second queue contains information about completed connections. An entry is moved from the first to the second queue when the corresponding connection is completed; it is dropped from the first queue if a timeout expires first. Entries move out of the second queue as the corresponding connections are accepted. 50 above indicates the total size of the two queues (or close to it; the exact interpretation depends on the system). Connection requests that cannot be fitted in the queues are thrown away.

Accepting a connection from a Client

The accept function is used on the server in the case of connection oriented communication (after a call to listen) to accept a connection request from a client.

    #include <sys/types.h>
    #include <sys/socket.h>

    int accept(int fd, struct sockaddr *addressp, socklen_t *addrlen)
       fd is an int, the file descriptor of the socket the
          server was listening on [in fact it is called the listening
          socket], i.e. on which the server has successfully 
	  completed socket, bind, and listen.
       addressp points to an address. It will be filled with
          address of the calling client. We can use this address to
          determine the IP address and port of the client.
       addrlen points to an integer that before the call must contain
          the size of the buffer addressp points to, and after the call
          will contain the actual length of address structure of client.
    It returns an integer representing a new socket (-1 in case of failure).
    It is the socket that the server will use from now on to communicate 
    with the client that requested connection [in fact it is called
    the connected socket]. Different calls to accept will result
    in different connected sockets.
    If the listening socket had been placed in non-blocking mode, then
    when we execute accept and there is no pending connection, it returns
    -1 and errno is set to EAGAIN. The select system call can be used to
    determine when the accept would be successful. At that time the listening
    socket will appear readable.

Here is a typical call to accept:

    struct sockaddr_in client_addr;
    int ssd, csd;
    socklen_t length;
    ...........
    length = sizeof(client_addr);  /* must be set before each call to accept */
    if ((csd = accept(ssd, (struct sockaddr *)&client_addr, &length)) < 0) {
       perror("accept"); exit(1);}
    /* here we give the new socket to a thread or a process that will */
    /* handle communication with this client. */

Successive calls to accept on the same listening socket return different connected sockets. These connected sockets are multiplexed on the same port of the server by the TCP software. [This software uses the quartet Client-IP-Address.Client-Port.Server-IP-Address.Server-Port (plus the protocol) to identify the various simultaneous connections.]

Read, Write

We can use a socket like a normal file descriptor and read from it or write to it. In order to do so the socket must be connected to an interlocutor. (Other commands we can use when a socket is connected are send and recv.) Beware that when we try to read n octets, we may have to execute the command repeatedly, each time retrieving fewer than the total n. For example, we may ask for 400 octets and retrieve 100 octets in the first read, and the remainder in a second read. A blocking read returns at least one octet, unless the socket has been closed by the partner (it returns 0) or there is an error (it returns a value <0).

Normally read and write operations are blocking. This means the following: in the write operation the data is moved from the buffer in the space of the calling process to a system buffer associated with the socket. In a blocking call, control returns to the caller when all the data in the user buffer has been moved to the system buffer. This does not usually mean that the data has been received by the intended destination, or even sent.

Sendto and Recvfrom

    #include <sys/types.h>
    #include <sys/socket.h>

    ssize_t sendto(int sd, const void *buff, size_t len, int flags, 
           const struct sockaddr *addressp, socklen_t addrlen)
       sd, socket file descriptor
       buff, address of buffer with the information to be sent
       len, size of the message
       flags, usually 0; could be used for priority messages, etc.
       addressp, address of process we are sending message to
       addrlen, length in bytes of the address structure
       It returns number of characters sent. It is -1 in case of failure.

The flags we can use with sendto include:

MSG_DONTROUTE: Bypass lookup of routing table, i.e. no gateway, only for local network
MSG_DONTWAIT: The operation will be non-blocking (i.e. it will return an error code if it cannot be completed immediately)
MSG_OOB: send out-of-band (urgent) data

    #include <sys/types.h>
    #include <sys/socket.h>

    ssize_t recvfrom(int sd, void *buff, size_t len, int flags, 
           struct sockaddr *addressp, socklen_t *addrlen)
       sd, socket file descriptor
       buff, address of buffer where message will be stored
       len, size of buffer
       flags, usually 0; used for priority messages, peeking etc.
       addressp, buffer that will receive address of process that 
            sent message
       addrlen, before the call contains size of addressp structure;
            after the call, the actual size of the sender's address

       It returns number of characters received. It is -1 in case of failure.

The flags we can use with recvfrom include:

MSG_DONTWAIT: The operation will be non-blocking (i.e. it will return an error code if it cannot be completed immediately)
MSG_OOB: receive out-of-band (urgent) data
MSG_PEEK: peek at incoming message
MSG_WAITALL: wait for all the data requested with len (but we may receive less, if there is a signal, an error, disconnection).

The use of flags is beyond the level considered in these notes [as always, see Stevens].
Other operations for reading and writing from sockets are: send, recv, sendmsg, recvmsg.

Shutdown

It is like close but more flexible. It allows us to close just read operations, or write operations, or both. For example, we can shut down the write side of a socket while keeping the read side open to read pending data from the interlocutor. Then, when the interlocutor recognizes our "shutdown" (because it forces an end-of-file at its read end) and closes the connection from its end, at our end we will read an end-of-file condition and we completely close the connection.
There is another distinction between shutdown and close. If a socket is open in more than one process (for instance by a parent and a child process), then close will not terminate the connection as long as the socket is still open in some process (it will just disable the socket for the process that executed the close call). Instead shutdown has the desired effect of closing the connection (by sending an appropriate FIN request to the computer and process where the associated socket is located) no matter the number of processes in which the socket is open. However shutdown does not close the socket, i.e. the socket descriptor remains in existence, and must still be closed, even if no other process is using it.

    int shutdown(int sd, int action)
       sd is a socket descriptor
       action is (0 = SHUT_RD, close for reads) (1 = SHUT_WR, close for writes)
          (2 = SHUT_RDWR, close for both reads and writes)
    It returns an integer (0=success, -1=failure)
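The half-close described above can be observed with a small experiment. This is only a sketch: the function half_close_demo is hypothetical, and, to stay within one process, we assume that a connected pair of Unix domain stream sockets (created with socketpair) behaves in this respect like the two ends of a TCP connection.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 if the half-close behaved as described, -1 otherwise. */
int half_close_demo(void) {
    int sv[2];
    char buf[16];
    int rc = -1;

    /* sv[0] is "our" end, sv[1] is the interlocutor's end. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;

    /* We send our last data and close only the write direction. */
    if (write(sv[0], "bye", 3) != 3) goto done;
    if (shutdown(sv[0], SHUT_WR) < 0) goto done;

    /* The interlocutor first reads the pending data, then sees
       end-of-file (read returns 0). */
    if (read(sv[1], buf, sizeof(buf)) != 3) goto done;
    if (read(sv[1], buf, sizeof(buf)) != 0) goto done;

    /* The other direction is still open: the interlocutor can answer
       and we can still read. */
    if (write(sv[1], "ack", 3) != 3) goto done;
    if (read(sv[0], buf, sizeof(buf)) != 3) goto done;
    if (memcmp(buf, "ack", 3) != 0) goto done;
    rc = 0;
done:
    /* shutdown did not release the descriptors; close still has to. */
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```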

An important use of shutdown is in the following situations. Suppose we have a multithreaded program and a thread is blocked in an accept call. If from another thread we would like to abort the accept call, we are discouraged from using pthread_kill on the accept thread. Instead it is possible to force return from the accept call by executing a shutdown from another thread on the listening socket. Similarly if a thread is blocked on a read from a connected socket, we can force the read to return by executing a shutdown on that socket from another thread.

Getsockname

It is used to determine the address to which a socket is bound.

    int getsockname(int sd, struct sockaddr *addrp, socklen_t *addrlen)
       sd is the socket descriptor of a bound socket.
       addrp points to a buffer. After the call it will have the
          address associated to the socket.
       addrlen gives the size of the buffer. After the call gives 
          size of address.
    It returns an integer (0=success, -1=failure)

Getpeername

It is used to obtain the address of the remote host connected to the current socket. It can be used for example by a server to determine the identity of the client trying to communicate with it and to decide whether to accept or refuse the connection based upon the identity of the client.

    int getpeername(int sd, struct sockaddr *addrp, socklen_t *addrlen)
       sd is the socket descriptor of a connected socket
          i.e. of a socket returned by accept.
       addrp points to a sockaddr buffer. After the call it 
          will have the address associated to peer of socket. It is the same
          structure and information as it is available in the client
          address structure after the accept call.
       addrlen gives the size of the buffer. After the call gives 
          size of address.
    It returns an integer (0=success, -1=failure)

Here is an example of use of getpeername:

    struct sockaddr_in name;
    socklen_t namelen = sizeof(name);
    .......
    if (getpeername(sd, (struct sockaddr *)&name, &namelen) < 0) {
       perror("getpeername"); exit(1);}
    printf("Connection from %s\n", inet_ntoa(name.sin_addr));

We see here a new function inet_ntoa: Translate an unsigned network byte-ordered integer (IP address) into a dot formatted character string such as 155.247.71.60. It requires the include files:

    #include <netinet/in.h>
    #include <arpa/inet.h>

The inverse of inet_ntoa is inet_addr which, given an IP address as a dot formatted character string returns its value as an unsigned network byte-ordered integer:

    #include <netinet/in.h>
    #include <arpa/inet.h>
    
    in_addr_t inet_addr(const char *IP_address_string);

Another inverse of inet_ntoa is inet_aton, which is not specified by POSIX and so may not be available on every Unix system.
The functions inet_pton and inet_ntop are being recommended instead of inet_ntoa and inet_aton since they work for both IPv4 and IPv6.
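Here is a small sketch of the recommended functions for IPv4 addresses. The wrapper names text_to_ipv4 and ipv4_to_text are made up for the illustration; only inet_pton and inet_ntop are library functions.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>

/* Converts dotted decimal text to a 32-bit address in HOST byte order.
   Returns 1 on success, 0 if text is not a valid IPv4 address. */
int text_to_ipv4(const char *text, uint32_t *host_order) {
    struct in_addr a;
    if (inet_pton(AF_INET, text, &a) != 1)
        return 0;
    *host_order = ntohl(a.s_addr);   /* inet_pton gives network order */
    return 1;
}

/* Converts a 32-bit address in host byte order to dotted decimal text.
   out should have room for INET_ADDRSTRLEN characters.
   Returns out on success, NULL on failure. */
const char *ipv4_to_text(uint32_t host_order, char *out, socklen_t cap) {
    struct in_addr a;
    a.s_addr = htonl(host_order);
    return inet_ntop(AF_INET, &a, out, cap);
}
```

Unlike inet_ntoa, inet_ntop writes into a buffer supplied by the caller, and unlike inet_addr, inet_pton reports errors with a return value that cannot be confused with a valid address.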

Setsockopt

Usually after a socket has been bound to a port, even after the socket has been closed, the port is unusable by other sockets for a while. That is, if a process has used port p with socket s and it closes s, then another process cannot bind a socket to p for a good number of seconds. To avoid this and make a port immediately reusable we can use the function setsockopt, which must be called before the socket is bound. (setsockopt is also used for many other things, such as choosing buffer sizes, the minimum amount of data to be sent/received in a message, ..)

    #include <sys/types.h>
    #include <sys/socket.h>
    int setsockopt(int socket, /* the socket created by a socket call */
                   int level,  /* use SOL_SOCKET or, better, see man page */
                   int option_name, /* SO_REUSEADDR, SO_LINGER, .. see below */
                   const void *option_value, /* address of the option value; */
                               /* for boolean options an integer set to 1 */
                               /* to enable the option, 0 to disable it */
                   socklen_t option_length); /* size of option_value buffer */
    /* Returns 0 in case of success, -1 otherwise */

    /* Some option names:
       SO_REUSEADDR  allows a new bind to a port that is still tied up
                     by a closed connection.
       SO_LINGER     is very interesting: it is used to specify what to
                     do when we close a socket with data left in the
                     system buffer. By default close returns to the
                     caller even though data may still have to be sent.
                     SO_LINGER forces close to wait until data has been
                     delivered to destination, or a timeout has expired.
                     Its value is a struct linger, not an integer.
       SO_KEEPALIVE  in some protocols like TCP asks the kernel to send
                     keep-alive messages to check that the peer is still
                     alive and if not raises SIGPIPE on a later write.
       SO_ACCEPTCONN (getsockopt only) determines if a socket is or is
                     not a listening socket.
       SO_DONTROUTE  only allows communication within local network.
       Then there are all sorts of defaults on size of buffers, minimum
       number of received bytes before passing data to next layer ... */

The function getsockopt is used to retrieve the value of options associated to a socket. Here is an example of use of setsockopt.

    option_value = 1;  /* int option_value; */
    if (setsockopt(sd,SOL_SOCKET, SO_REUSEADDR, (char *)&option_value,
		   sizeof(option_value)) < 0)
    {
	perror("setsockopt");
	exit(1);
    }

One may see the impact of this statement, and the impact of the SO_REUSEADDR, by first writing a server without this statement. Then running the server using a port, say 5194, and a client, then executing at the unix prompt

    netstat -a | grep 5194

This will display how 5194 is being used. The use will not change for a few seconds even if client and server are killed. If setsockopt is used instead the use will change and a new bind on 5194 will be accepted immediately. The reason is explained somewhat in the states section where we see the TIME_WAIT state in which the socket remains at the end of the close.

For a detailed treatment of socket options, see Stevens. Here is a program from Stevens to determine the current values of socket options and here is the corresponding output on digital unix and here on linux.
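
As a much smaller illustration of getsockopt, here is a minimal sketch that reads a few integer-valued options of a socket. The helper names getIntOption and printSomeOptions are assumptions made for this example; they are not from Stevens's program.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: fetch an integer-valued SOL_SOCKET option of
   socket sd into *value. Returns 0 on success, -1 on error. */
int getIntOption(int sd, int option_name, int *value) {
    socklen_t length = sizeof(*value);
    if (getsockopt(sd, SOL_SOCKET, option_name, value, &length) < 0) {
        perror("getsockopt");
        return -1;
    }
    return 0;
}

/* Hypothetical helper: print a few options of socket sd.
   Returns 0 on success, -1 if any option could not be read. */
int printSomeOptions(int sd) {
    int v;
    if (getIntOption(sd, SO_REUSEADDR, &v) < 0) return -1;
    printf("SO_REUSEADDR is %s\n", v ? "on" : "off");
    if (getIntOption(sd, SO_RCVBUF, &v) < 0) return -1;
    printf("SO_RCVBUF = %d bytes\n", v);
    if (getIntOption(sd, SO_SNDBUF, &v) < 0) return -1;
    printf("SO_SNDBUF = %d bytes\n", v);
    return 0;
}
```

Calling printSomeOptions(sd) before and after the setsockopt call shown earlier should show SO_REUSEADDR going from off to on.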

EINTR

If a process is blocked in a slow system call (for example, connect, accept, read, write) and a signal is delivered to this process (for example, an alarm goes off and the SIGALRM signal is delivered), then the slow system call is terminated, it returns -1, and the system external symbol errno is set to the value EINTR. In this case one may not want to terminate a function or a process, as is often done, but instead restart the slow system call. Here is an example from Stevens showing the function readn to read n characters from a socket and here is another example from Stevens showing the function writen to write n characters to a socket.
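
Restarting by hand can be sketched as follows. This is a minimal sketch in the spirit of Stevens's readn, not his code; the name readFully is an assumption made for this example.

```c
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>

/* Hypothetical helper: read up to n bytes from fd into buf, restarting
   read whenever it is interrupted by a signal. Returns the number of
   bytes read, which is less than n only if end of file was reached,
   or -1 on error. */
ssize_t readFully(int fd, void *buf, size_t n) {
    char *p = buf;
    size_t left = n;
    while (left > 0) {
        ssize_t got = read(fd, p, left);
        if (got < 0) {
            if (errno == EINTR) continue;  /* interrupted: restart the call */
            return -1;                     /* real error */
        }
        if (got == 0) break;               /* end of file */
        p += got;
        left -= (size_t)got;
    }
    return (ssize_t)(n - left);
}
```

A write loop can be built the same way, retrying write when it fails with EINTR and advancing the buffer pointer after each partial write.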
Some systems support the flag SA_RESTART, which can be set for a signal with the sigaction function. When SA_RESTART is in effect, a slow system call interrupted by that signal is automatically restarted.

Non-Blocking Sockets

We have seen that when we open a file we can specify that we want to use it in blocking mode (default) or in non-blocking mode. How do we set existing sockets in non-blocking mode? We use the fcntl system call.
Here is a function we may use to set a socket fd to non-blocking IO:

	# include <fcntl.h>
	/* Add the flags in flgs to the flags currently associated to 
	   the file descriptor fd. Return 0 iff success.
	*/
	int setFlags(int fd, int flgs) {
	    int val = fcntl(fd, F_GETFL, 0);
	    if (val < 0) return -1;
	    val |= flgs; /* add the new flags */
	    if (fcntl(fd, F_SETFL, val) < 0) return -1;
	    return 0;
	}
Then we use it as follows:
	if ( setFlags(fd, O_NONBLOCK) ) {
	    /* error setting non-blocking IO */
	}

setFlags could be modified to return the previous value of the flags so that later, if desired, the original flags can be restored.
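
A minimal sketch of that modification, assuming the hypothetical names setFlagsSaving and restoreFlags:

```c
#include <fcntl.h>

/* Hypothetical variant of setFlags: add the flags in flgs to those of fd
   and return the flags that were in effect before the call (a value
   >= 0), or -1 on error. */
int setFlagsSaving(int fd, int flgs) {
    int old = fcntl(fd, F_GETFL, 0);
    if (old < 0) return -1;
    if (fcntl(fd, F_SETFL, old | flgs) < 0) return -1;
    return old;
}

/* Hypothetical helper: put back flags saved by setFlagsSaving.
   Returns 0 on success, -1 on error. */
int restoreFlags(int fd, int old) {
    return fcntl(fd, F_SETFL, old) < 0 ? -1 : 0;
}
```

A caller would keep the value returned by setFlagsSaving(fd, O_NONBLOCK) and pass it to restoreFlags when blocking IO is wanted again.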

It is obvious how non-blocking IO and the select system call help in the case that we want to read or to write: with a select call we can wait on any number of file descriptors until some become ready for reading or writing. It is less obvious how to use non-blocking IO and the select system call with connect and accept, so that the client does not block for the round-trip time that connect requires and the server does not block in accept.
For connect assume that we have a socket sd already set to non-blocking. Say that readSet and writeSet are fd_set variables used as the read and write sets of the select call. Then we can use the following statements before the select call to start establishing the connection:

	if ( connect(sd, serverAddress, serverAddressSize) < 0) {
	    if (errno != EINPROGRESS) {
		/* error - do whatever you think appropriate */
	    } else {
		/* connection in progress: select will tell us when it is done */
		FD_SET(sd, &readSet);
		FD_SET(sd, &writeSet);
	    }
	} /* The outer else part is unnecessary: in that case the connect succeeded already*/
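When select reports sd as writable, the connect has completed, successfully or not; the SO_ERROR socket option, read with getsockopt, tells which. After a successful connect, when select tells us that sd is in the read or write set we can read
or write without blocking.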

For accept assume that we have a listening socket sd already set to non-blocking. Then sd will have been set in the read set readSet used for the select call. Then to see if accept is ready to execute without blocking we say

	if (FD_ISSET(sd, &readSet)) {
	    /* clientAddressSize is a socklen_t initialized to sizeof(clientAddress) */
	    connectedSocket = accept(sd, (struct sockaddr *)&clientAddress, &clientAddressSize);
	    /* here you should check if connectedSocket is negative */
	}

Kinds of Sockets

It may be useful to neglect for a moment the multitude of socket functions and think instead of how they are used (we talk only of stream sockets). There are three basic kinds of sockets:

Client Sockets: A Client Socket is what a client uses to send data to and receive data from a server. A typical way of setting up communication with a server, given its host name and the port on which it is listening, is given by the clientsocket function.
Listening (or Server) Socket: A Listening (Server) Socket is the socket used on the server that is associated with the port on which the server is listening. A typical way of setting up the listening socket of a server, given the port on which the socket will listen, and the intended maximum number of pending requests, is with the listensocket function. The only use of the listening socket is to create connected sockets in response to connections from clients.
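Connected Socket: A Connected Socket is the socket created on the server, by accept on the listening socket, in response to a connection from a client; the server uses it to communicate with that client.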

Code for these functions is here
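
For readers without access to that code, here is a minimal sketch of what such functions might look like. The names clientsocket and listensocket come from the text above, but the signatures and bodies below are assumptions, restricted to IPv4 and TCP, and not the course's actual code.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Assumed signature: return a listening TCP socket bound to port on all
   local IPv4 addresses, with room for backlog pending requests,
   or -1 on error. */
int listensocket(unsigned short port, int backlog) {
    struct sockaddr_in addr;
    int on = 1;
    int sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0) return -1;
    /* allow quick restarts of the server (SO_REUSEADDR) */
    if (setsockopt(sd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0) {
        close(sd);
        return -1;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(sd, backlog) < 0) {
        close(sd);
        return -1;
    }
    return sd;
}

/* Assumed signature: return a TCP socket connected to host:port,
   or -1 on error. */
int clientsocket(const char *host, unsigned short port) {
    struct addrinfo hints, *res, *p;
    char service[8];
    int sd = -1;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    snprintf(service, sizeof(service), "%u", (unsigned)port);
    if (getaddrinfo(host, service, &hints, &res) != 0) return -1;
    for (p = res; p != NULL; p = p->ai_next) {  /* try each address in turn */
        sd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (sd < 0) continue;
        if (connect(sd, p->ai_addr, p->ai_addrlen) == 0) break;
        close(sd);
        sd = -1;
    }
    freeaddrinfo(res);
    return sd;
}
```

A server would call listensocket once and then accept on the returned socket in a loop, obtaining one connected socket per client; a client would call clientsocket and then read and write on the returned socket.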

Examples

Example 1: Simple example where we use gethostname, gethostbyname, socket, bind, getsockname. The example does not do anything useful except show the use of these functions.

Example 2 (Datagram communication): a client and a server. In a loop, the client sends the current time to the server, waits for the reply, prints it out, and sleeps for a while. The server receives messages from clients and prints them out. It replies with its own current time. It also prints out information identifying the IP address and port of the client. No provision is made to cope with the unreliability of the communication channel.

Example 3 (TCP): Similar to Example 2, using TCP and runnable on both Unix and NT (from D.Comer: Computer Networks and Internets) client, and server
Here are other simple TCP client and server (an "echoserver"). Here is an echoserver that we can terminate gracefully with the SIGHUP signal. Here is a multithreaded echoserver.
And here, modified from Stevens, Network Programming, Vol. 1, a "daytime" server and client.

Example 4: (Datagram communication) a client and a server. The client is invoked with three parameters: the name of a user, of a host, and a port. It sends the user name to the server and prints out the response. The server when it receives a user name checks if the user is currently logged on the host. It replies with an appropriate response. No provision is made to cope with the unreliability of the communication channel.

Example 5: (Datagram communication): a client and a server. A client in a loop sends a message to the server and waits with timeout for reply. The server receives messages and gives them to threads to respond to.

Example 6: A TCP concurrent server that forks child processes to handle client connections.

Example 7: A program for testing TCP servers. It is an abridged, slightly modified version of the ab.c program that comes with the apache server distribution. This program establishes a number of concurrent connections to a TCP server (usually an HTTP server) and measures latency, response time, data rate, etc.

Example 8: Another TCP concurrent server. Now, using a technique seen in the Apache server, each child process executes an accept statement on the listening socket. You can use the program in Example 7 to test this server (you will need an "index.html" file in the directory where the server is being run).

For more examples, and a wonderful book on Network Programming, read R.Stevens: "Unix Network Programming", Prentice-Hall, 1998. Here is the code presented in that book.
You may download the tarred file of this code from ftp://ftp.kohala.com/pub/rstevens/unpv12e.tar.gz
You might look in particular at:

Using wrappers for system calls so as to avoid repeating error management code. Here is an example. For a more recent library of wrappers consider csapp.h and csapp.c by Professors Bryant and O'Hallaron of CMU.
Reading and Writing n characters from/to a socket while dealing with signals, and reading a line from a socket.
Setting a signal handler
A small library of functions from Snader's book: "Effective TCP/IP Programming".

Measurements

Here are some measurements on the performance of sockets.
We have used the programs tcpserver and tcpclient to determine the performance of the TCP protocol. We have tested the performance on the same computer (programs communicating through TCP), and between computers on the same LAN. For each arrangement we have tested a number of buffer sizes for the client.
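
The notes do not say how tcpserver and tcpclient compute the data rates they report. A plausible sketch, assuming timestamps taken with gettimeofday around the transfer and 1 Mbps = 10^6 bits per second (both assumptions, as are the helper names), is:

```c
#include <sys/time.h>

/* Hypothetical helper: seconds elapsed between two gettimeofday readings. */
double elapsedSeconds(const struct timeval *start, const struct timeval *end) {
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_usec - start->tv_usec) / 1e6;
}

/* Hypothetical helper: data rate in Mbps for bytes transferred in the
   given number of seconds. Returns 0 if no time has elapsed. */
double dataRateMbps(double bytes, double seconds) {
    if (seconds <= 0.0) return 0.0;
    return bytes * 8.0 / (seconds * 1e6);
}
```

The client would call gettimeofday before the first write and after the last read, and repeat the transfer several times to obtain an average and a standard deviation.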

TCP BETWEEN PROGRAMS ON SAME COMPUTER (an old alpha with Digital Unix)

	Buffer Size |  Average  | Standard Deviation
	  (Bytes)   |  (Mbps)   |       (Mbps)
	============================================
	     1024   |   0.046   |  0.000
	     2048   |   3.058   |  2.835
	     4096   |  40.313   |  1.229
	     8192   |  56.271   |  0.812
	    16384   |  60.934   |  7.706
	    32768   |  69.078   |  1.269
	    65536   |  71.621   |  1.554
	   131072   |  66.575   |  8.319
	   262144   |  65.037   |  5.481
	   524288   |  65.426   |  0.809

TCP BETWEEN PROGRAMS ON DIFFERENT COMPUTERS ON SAME LAN (both old alpha with Digital Unix - 10Mbps ethernet)

	Buffer Size |  Average  | Standard Deviation
	  (Bytes)   |  (Mbps)   |       (Mbps)
	============================================
	     1024   |   0.045   |  0.000
	     2048   |   2.157   |  0.181
	     4096   |   3.365   |  0.168
	     8192   |   4.988   |  0.222
	    16384   |   6.643   |  0.344
	    32768   |   6.747   |  0.182
	    65536   |   6.550   |  0.478
	   131072   |   6.875   |  0.266
	   262144   |   6.876   |  0.370
	   524288   |   6.745   |  0.271

Some observations:

The default buffer sizes of the sockets were 32KB. The experimental data confirms that the default is a good choice.
The variance in the results during a run is fairly high. It is even worse across runs.
The use of a buffer size of 1KB has disastrous consequences, both on a single computer and on a LAN: the data rate becomes less than 50Kbps.
The data rate increases with buffer size up to 32KB. Then it retains essentially the same value. Thus these measurements do not suggest the use of larger and larger buffers.
The performance of TCP on the same computer is an order of magnitude (a factor of about 10) better than across computers in a LAN.
The programs mentioned above make it possible to test the server with a number of concurrent clients.

We have also checked the bandwidth between processes that communicate through pipes. In this case we obtain an average bandwidth of 324.4 Mbps and a standard deviation of 6.2 Mbps.
Just to complete this series of tests, we have the following results for memory copy:

	Transfer Size  |  Data Rate  |      Data Rate
	   (Bytes)     |   Average   | Standard Deviation
	               |   (Mbps)    |       (Mbps)
	==================================================
	       4       |    68.327   |       0.963
	       8       |   120.254   |       1.952
	      16       |   218.575   |       1.873
	      32       |   313.049   |       5.371
	      64       |   435.295   |       1.331
	     128       |   503.564   |       1.422
	     256       |   531.316   |      10.307
	     512       |   541.813   |       2.403
	    1024       |   545.021   |       1.241

Testing Concurrent Servers

We tested the concurrent servers multiserver and multiserver1, described earlier as Examples 6 and 8, using the testing program tst introduced in Example 7. We ran the tests on a LAN between two old alpha workstations running Digital Unix, transmitting ~3KB files. There was high variance. On average we found:

	                |  time to  |  latency |  response |  data
	                |  connect  |          |  time     |  rate
	===========================================================
	multiserver     |     4     |   120    |    180    |  150
	-----------------------------------------------------------
	multiserver1    |     6     |    26    |     42    |  630
	-----------------------------------------------------------
	where times are in milliseconds and data rate in kilobits per second.

Testing an Apache webserver, we found a similar behavior, with TimeToConnect=10, Latency=99, ResponseTime=108, and DataRate = 98.

Socket States

The following diagram from R.Stevens "Unix Network Programming" describes the state transitions in the TCP protocol:

PS.

The transition from state CLOSED to state SYN_SENT should be labeled:
```
   appl: active open
   send: SYN
```
In the transition SYN_RECVD to LISTEN, RST means "Reset".
The transition from LAST_ACK to CLOSED should have the label:
```
   receive: ACK
   send: nothing
```
In the active close, after the process has executed "close", sent FIN, received ACK and thus entered state FIN_WAIT_2, if FIN is not received from the partner, the connection will time out after 2 MSL (see below) and transition to CLOSED.
Note that if a server crashes (say with control-C) while it has a connected socket, then TCP on the server's host will perform an active close on the connected socket. If instead the server's host crashes while the server has a connected socket, nothing is sent to the client; thus to the client the situation will be as if the communication link had been lost.
In the passive close, when the socket is in state CLOSE_WAIT, if the process does not use the socket, the socket state will remain unchanged indefinitely. If the process performs a write on the CLOSE_WAIT socket after the partner has fully closed its end, the write itself usually succeeds locally, but the partner's TCP will send back an RST; a subsequent read then fails with an ECONNRESET error, and a subsequent write raises SIGPIPE and gives an EPIPE error. If the process performs a read on the CLOSE_WAIT socket it will receive an EOF (i.e. the code returned by read is 0).
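
The EOF case can be checked without consuming data or blocking. Here is a minimal sketch, assuming a system that supports the MSG_PEEK and MSG_DONTWAIT flags of recv; the helper name peerHasClosed is an assumption made for this example.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: returns 1 if the partner has closed its end and
   all its data has been read (a read would return 0, i.e. EOF),
   0 if data is pending or the partner has not closed yet,
   -1 on error. */
int peerHasClosed(int sd) {
    char c;
    ssize_t n = recv(sd, &c, 1, MSG_PEEK | MSG_DONTWAIT);
    if (n == 0) return 1;   /* EOF: the partner's FIN has been received */
    if (n > 0) return 0;    /* data is still waiting to be read */
    if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
        return 0;           /* nothing to read yet, connection still open */
    return -1;
}
```

A process that finds peerHasClosed(sd) equal to 1 should normally close sd itself, which moves the socket from CLOSE_WAIT to LAST_ACK.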

You can recognise in this state diagram the handshakes taking place when a connection is started and when it is terminated.
The transition CLOSED -> SYN_SENT -> ESTABLISHED takes place on the client as a result of the successful completion of the connect operation (active open).
The transition CLOSED -> LISTEN takes place in the server upon successful completion of socket+bind+listen (passive open).
The transition LISTEN -> SYN_RECVD -> ESTABLISHED takes place in the server when a client's connection request arrives and the handshake completes; accept then returns the connected socket.
The transitions in the dotted boxes named "active close" and "passive close" take place upon successful completion of close operations.

The transitions ESTABLISHED -> FIN_WAIT_1 -> FIN_WAIT_2 -> TIME_WAIT -> CLOSED take place when a connected process closes the connection (active close).
The transitions ESTABLISHED -> CLOSE_WAIT -> LAST_ACK -> CLOSED take place in a connected process when it is told that the close is in progress (passive close).
The transitions ESTABLISHED -> FIN_WAIT_1 -> CLOSING -> TIME_WAIT -> CLOSED take place when both connected processes (active) close independently at the same time.

The TIME_WAIT state is required to make sure that all packets between the client and server have either been delivered or totally lost. A client will remain in the TIME_WAIT state for up to 2*MSL, where MSL is the Maximum Segment Lifetime, the maximum time a segment may remain alive in the network, say, going around because of a routing anomaly. MSL is up to one minute, so a client may remain in TIME_WAIT for up to 2 minutes.

Also from Stevens is the following diagram that displays a typical interaction between a client and a server. Notice in particular the 3-way handshake when the connection is established and the 4-way handshake when the connection is terminated. [mss stands for Maximum Segment Size.]

In the above diagram, the client, immediately after the write operation, should block on a read operation which will complete when the data reply arrives from the server.
The above diagram provides a second reason for the presence of the TIME_WAIT state: assume that the last ack from the active closer to the passive closer is lost. Then the passive closer upon timeout will resend the FIN N message. If the active closer is in the TIME_WAIT state it can resend the ack. If state TIME_WAIT did not exist and the active closer went directly to the CLOSED state, then upon receiving the second FIN it would treat it as an error and send an RST to the passive closer, which, in the absence of the final ack, would be unable to close the connection cleanly.

You may use the netstat command to determine the sockets currently used in the system, the protocol they use, the local host+port, the remote host+port, and the socket state. For example on my machine netstat printed out (truncated):

Active Internet connections
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp        0      0  joda.80                sp185047.sbm.tem.2129  ESTABLISHED
tcp        0      0  joda.80                sp185047.sbm.tem.2128  TIME_WAIT
tcp        0      0  joda.80                sp185047.sbm.tem.2127  TIME_WAIT
tcp        0      0  joda.80                sp185047.sbm.tem.2126  TIME_WAIT
tcp        0      0  joda.80                sp185047.sbm.tem.2123  ESTABLISHED
tcp        0      0  joda.80                sp185047.sbm.tem.2122  TIME_WAIT
tcp        0      0  joda.80                sp185047.sbm.tem.2121  TIME_WAIT
tcp        0      0  joda.80                ww-to06.proxy.ao.1280  TIME_WAIT
tcp        0      0  joda.pop3              bamboo.1690            TIME_WAIT
tcp        0      0  joda.80                dhcp40-98.netman.1631  FIN_WAIT_2
tcp        0      0  joda.80                hd1-220.hil.comp.1132  FIN_WAIT_2
tcp        0      0  joda.80                200.5.111.75.1205      FIN_WAIT_2
tcp        0      0  joda.80                hd1-220.hil.comp.1129  FIN_WAIT_2
tcp        0      0  joda.2197              wilma.flair.temp.6000  ESTABLISHED
tcp        0      0  joda.imap              cc16262-a.wlgrv1.1922  ESTABLISHED
tcp        0      0  joda.80                207.205.89.226.38539   FIN_WAIT_2
tcp        0      0  joda.80                207.86.17.198.3866     FIN_WAIT_2
tcp        0      0  joda.2105              joda.2104              CLOSE_WAIT
tcp        0      0  joda.telnet            bamboo.1445            ESTABLISHED
tcp        0      0  joda.telnet            mrgrump.2056           ESTABLISHED
tcp        0      0  joda.telnet            wilma.flair.temp.33079 ESTABLISHED
tcp        0      0  joda.1547              joda.80                CLOSE_WAIT
tcp        0      0  joda.1538              www.Sun.COM.80         CLOSE_WAIT
tcp        0      0  joda.1533              wilma.flair.temp.6000  ESTABLISHED
tcp        0      0  joda.telnet            warhog.2316            ESTABLISHED
tcp        0      0  joda.1482              wilma.flair.temp.6000  ESTABLISHED
tcp        0      0  joda.telnet            wilma.flair.temp.33077 ESTABLISHED
tcp        0      0  localhost.1032         *.*                    LISTEN
tcp        0      0  localhost.1031         *.*                    LISTEN
tcp        0      0  localhost.1030         *.*                    LISTEN

ingargio@joda.cis.temple.edu