%
% x-kernel v3.3
%
% Copyright (c) 1996,1993,1991,1990  Arizona Board of Regents
%

\documentstyle[times,fcanon,epsfig,html]{article}

\begin{document}

\bibliographystyle{abbrv}
\newcommand{\xk}{{\it x}-kernel}
\newcommand{\var}{\sf}
\begin{htmlonly}
   %these overwrite the above definitions when running latex2html:
   \newcommand{\var}{\tt}
\end{htmlonly}


\title{Message Library Design Notes}
\author{David Mosberger}
\date{January 1996}
\maketitle{}

\renewcommand{\baselinestretch}{1}
\setlength{\tolerance}{1000}
\small\normalsize

\begin{center}
{\bf\Large Abstract}
\end{center}
\begin{quote}
This document describes the current implementation of the {\xk}
message library. The focus is on its data structures and the
underlying principles.  This document does not describe the
message library's interface or how it is used. Please refer to the
{\xk} Programmer's Manual \cite{ProgMan_3.3} and the {\xk}
Tutorial \cite{Tutorial} for that purpose.
\end{quote}

\section{Introduction}

Conceptually, a message is simply a linear sequence of bytes. A
message therefore has contents (data) and a length (in bytes).  The
data in a message is never interpreted by the message library itself.
The message library is optimized for the processing that is encountered
in typical network protocols, such as TCP or IP.  There are two key
characteristics to network message processing: (a) on the outgoing
side, headers are prepended to the message, and on the incoming
side headers are removed from the beginning of the message, and (b) the
amount of data in a message is typically large so that data copying
should be avoided as much as possible.  Given (a), it is reasonable to
extend the simple message model to reserve space at the head of the
message.  Similarly, it is often necessary to discard a few bytes a
the end of the message (e.g., to discard a CRC sum or a trailer).
This leads to messages that end before the physical end of the buffer
memory. With these two extensions, Figure~\ref{fig1} depicts this
high-level view of a message.

\begin{figure}[ht]
\centering
\leavevmode\hbox{\epsfig{file=fig1.ps,height=1.0in}}
\caption{High-level view of a message.}\label{fig1}
\end{figure}

\subsection{Pushing Headers}

With the above model, it is now possible to prepend (push) a header
onto an existing message simply by decrementing {\var head} by the size of
the header and then copying the header contents to the memory starting
at offset {\var head.}  This is efficient because network headers are small
relative to the size of the message data.  Now, what happens if the
reserved space is all used up?  One solution would be to allocate a
new and bigger memory buffer, copy the existing data into that buffer
and then prepend the new header.  In practice this would be too slow
because it requires copying all the data in the message.  Instead, the
message library extends the message model by implementing messages as a
tree of memory buffers.  Thus, when a new header would overflow the
available header space, a new memory buffer is allocated and linked
into the existing message by creating a new interior (pair) node.
Thus, each header-space overflow adds two new nodes to the message
tree (one interior node and a leaf node).  For example, if the
existing message tree is depicted as a triangle, a push overflow leads
to the configuration shown in Figure~\ref{fig2}.

\begin{figure}[ht]
\centering
\leavevmode\hbox{\epsfig{file=fig2.ps,height=2.0in}}
\caption{New buffer attached to the front of a message.}\label{fig2}
\end{figure}

Notice that each push overflow leads to a new interior node whose left
subtree is a leaf.  The resulting tree is a degenerate tree that is
no more than a linear list. For searching, it would be better if
it were balanced because that would minimize the average distance from
the root.  However, with the message library, searching is not common and
in fact degenerate trees of the above kind (where the left subtree is
always a leaf) are ideal because they allow tree traversal without
recursion.

Notice that while the physical representation of the message now uses
a tree, the logical view of messages as presented in the previous
section is still valid.  The only externally visible difference is
that the message buffer now potentially consists of several smaller
memory buffers that may be scattered in the computer's address space.

\subsection{Popping Headers}

So far, the message representation allows efficient implementation of
pushing headers onto messages.  Let's investigate the opposite
operation: removing (popping) headers.

In the normal case, popping a header simply involves incrementing the
{\var head} index and returning a pointer to the old location.
Similar to the push case, this works, provided there is enough data in
the first leaf. If this is not the case, there are two possible
scenarios: (a) the header data comes from a single other leaf, or (b)
the header data comes from several other leaves.  In case (a), the
message library simply finds the correct leaf and returns a pointer to
the header data (this is the one case where the message tree has to be
searched).  For efficiency, the message library also caches the location
of the leaf in which it found the header data.  If another header is
popped (which is likely), then it is normally the case that the data
for the new header comes from the same leaf as the current header, so
this results in bypassing the step of having to find the leaf in the
message tree.  In case (b), there is no choice: the message library has
to copy the header data into a new, contiguous buffer. Conceptually, this
is simple.  The leaves in the old message are copied until enough data has
been accumulated for the header.  In the example below, the data in leaves
l1, l2, and l3 are copied into the new leaf called {\var header}, as
illustrated in Figure~\ref{fig3}.

\begin{figure}[ht]
\centering
\leavevmode\hbox{\epsfig{file=fig3.ps,height=1.0in}}
\caption{Accumulating leaves into a new buffer.}\label{fig3}
\end{figure}

After the data is copied, the message tree is updated such that the
new contiguous header leaf replaces the first three leaves in the
original tree.  Conceptually, the final message appears as shown
in Figure~\ref{fig4}.

\begin{figure}[ht]
\centering
\leavevmode\hbox{\epsfig{file=fig4.ps,height=1.0in}}
\caption{Message after leaves are accumulated.}\label{fig4}
\end{figure}

Notice that removing the original small leaves requires a recursive
node deletion algorithm.  Since this is slow, the current message
library implementation modifies the tree such that the small leaves
remain in the tree, but are ignored as far as message data is
concerned. In essence, the message library first discards all the bytes
in the small leaves and then creates the pair node that connects the
new header leaf into the tree.  A more accurate picture of the final
tree is given in Figure~\ref{fig5}. The right pair node is setup such
that the subtree containing l1, l2, and l3 will be ignored as far as
message data is concerned.

\begin{figure}[ht]
\centering
\leavevmode\hbox{\epsfig{file=fig5.ps,height=2.5in}}
\caption{Detailed view of message after leaves are accumulated.}\label{fig5}
\end{figure}

After the copying and adjusting the tree, the new header leaf contains
enough data for the header that is being popped.  The message library can
now increment the head index and return the pointer to the contiguous
header memory just as in the previous cases.

\subsection{Trimming Messages}

Other operations that are common include determining the length of the
data in a message and truncating/discarding data. Truncation involves
dropping the last $n$ bytes, whereas discarding involves dropping the
first $m$ bytes in a message.  To support these operations
efficiently, the message {\it fragment} abstraction is introduced. A
fragment refers to a subrange of the data in a message.  To that
effect, the fragment datastructure contains a pointer to the message
tree that contains the message data and two indices: {\var head}, the
index of the first byte in the subrange, and {\var tail}, the index of
the first byte outside of the subrange. With these definitions, the
length of a fragment is given by the expression {\var tail - head}.
Truncation is implemented by decrementing the {\var tail} index,
and the disard operation is implemented by incrementing the {\var
head} index. Notice that the {\var head} and {\var tail} fields in the
fragment data structure correspond exactly to the the {\var head} and
{\var tail} variables described in the beginning of this section.
Thus, as a first approximation, the message data structure is simply a
message fragment.

\subsection{Visiting Message Data}

The final operation, that is, unfortunately, often on the critical path
of network packet processing, is visiting all the data in a message
tree.  In general, visiting a tree requires a recursive algorithm.
For efficiency, the current implementation avoids recursive function
invocations by explicitly managing a recursion stack. In the common
case where a message consists of a single leaf, or in the degenerate
tree case where all left subtrees are leaves, this yields highly
efficient code.  The state that needs to be maintained during a
message traversal is a stack of fragments.  The stack is initialized
with the fragment in the message to be visited.  The fragment at the
top of the stack is the one to be visited next.  After popping a
fragment from the top of the stack, there are two possibilities: (a)
the fragment points to a pair node in which case it is simply
necessary to push the fragments for the right and the left subtrees
onto the stack (in this order).  In case (b), the fragment points to a
leaf node so the data in the leaf can be processed directly.

\section{Data Structures}

We are now at the point where we can present the message library's data
structures.  We start with the most fundamental structure: the message
fragment.

\begin{verbatim}
    struct MsgFrag {
        int      head;
        int      tail;
        MsgNode  tree;
    };
\end{verbatim}

As explained in the Introduction, a fragment refers to a subrange of
the data in a message tree.  The message tree is pointed to by field
{\var tree}. The fields {\var head} and {\var tail} are the indices of
the first byte in the subrange and the first byte outside of the
subrange, respectively. The data length is thus given by the
expression {\var tail - head}.


\subsection{The Message Structure}

The message structure is given as follows:

\begin{verbatim}
    struct Msg {
        struct MsgFrag  f;
        MsgNodeLeaf     first;         /* left-most leaf in tree */
        int             firstOffset;   /* offset to first leaf */
        bool            firstIsMine;   /* is first leaf writable? */
        struct Attrs    attrs;
    };
\end{verbatim}

The central field is {\var f}, the message fragment containing the
data for this message. Because the head of a message is manipulated
frequently, it makes sense to cache the state related to the first
leaf that is not empty. For this reason, the message structure also
contains the fields {\var first} and {\var firstOffset}. Similarly,
field {\var firstIsMine} is used to optimize the pushing of headers.
These fields are described in more detail below.  The final field,
{\var attrs}, is used to hold message attributes.  These are arbitrary
name/value pairs that can be used to associate other information with
messages.

Field {\var first} points to the first (left-most) leaf that is either
not empty or that is the next one to be written on a header push. To
be precise, {\var first} points to the leaf that contains the data
byte that index {\var f.head} refers to.  Field {\var firstOffset} is
the offset (relative to the first byte in the message tree) of the
first byte in leaf {\var first}. It is normally zero, but whenever
popping a header results in an underflow, this field gets incremented.
Field {\var firstIsMine} is true if the first leaf is owned by this
message.  The invariant here is that for any given leaf, there is at
most one message for which {\var firstIsMine} is TRUE.  There may be
leaves for which there is no owner, but there are never two or more
owners for the same leaf. Ownership gives the right to push header
data onto the first leaf. Ownership does not give the right to change
existing message data, however.  This is because message trees (and
therefore leaves) are shared by messages as much as possible. Due to
this sharing, it is necessary that any message byte is written at most
once. Figure~\ref{fig6} illustrates the key-fields in the message data
structure.

\begin{figure}[ht]
\centering
\leavevmode\hbox{\epsfig{file=fig6.ps,height=3.0in}}
\caption{Key fields in message structure.}\label{fig6}
\end{figure}

There are several points worth emphasizing.

\begin{itemize}

\item The number of bytes that are available for header data is given by
the expression {\var f.head - firstOffset}.

\item The {\var first} field does not necessarily point to the left-most 
leaf in the tree (e.g., consider header popping that causes an underflow).

\item {\var f.tail} is not necessarily the index of the last data byte in 
the message tree (e.g., consider message truncation).
\end{itemize}

\subsection{The Message Tree}

There are two types of nodes in the tree: interior nodes ({\var PAIR}
nodes) and leaves. For storage management reasons, leaves are divided
into three subtypes: {\var PAGE}, {\var BUF}, and {\var EMPTY}.

There is one and only one {\var EMPTY} leaf that is used for all
buffers of length zero. {\var PAGE} leaves are used to identify data
areas that were allocated by the message library.  The message library
allocates memory from the heap in units of pages. (This page size is
not necessarily related to the size of virtual memory pages; it is
often much smaller than a virtual memory page). Allocating memory at
the granularity of pages reduces internal fragmentation in the heap
manager.  Also, when creating a {\var PAGE} buffer of $n$ bytes, the
amount of memory requested is rounded up to ensure that at least $m$
bytes are available for headers, where $m$ is maximum size expected to
be taken up by protocol headers.  In the current implemention, $m$ is
given by constant {\var MAX\_HDR\_STACK\_SIZE}. In contrast to {\var
PAGE} leaves, {\var BUF} leaves are used for data whose memory was
allocated outside the message library.  Such nodes have a deallocator
function associated with them that is called as soon as the message
library determines that the memory is no longer referenced by any
of the messages in existence.

The enumeration type that indicates a node's type is given below:

\begin{verbatim}
    enum NodeType {
        MSG_NODE_JUNK = 0,  /* to catch programming errors */
        MSG_NODE_PAIR,      /* joins two subtrees */
        MSG_NODE_PAGE,      /* leaf with one or more pages of data */
        MSG_NODE_BUF,       /* leaf using a special deallocator */
        MSG_NODE_EMPTY      /* used by all zero-length leaves */
    };
\end{verbatim}

The {\var JUNK} node type is there because messages that are not
properly initialized are most likely to contain zeroes.  Thus,
choosing the value 0 for this dummy-type increases the likelihood of
catching programming errors.

The exact declaration of a node depends on its type. However, all
message node structures start with this common part.

\begin{verbatim}
    struct MsgNodePart {
        enum NodeType  type;
        u_int          refCnt;
    };
\end{verbatim}

The {\var type} field allows the full declaration of this node to be
determined.  The {\var refCnt} field is a reference count that gives the
number of permanent references to this node.  Reference counting is
needed for message tree nodes because they can be shared among
multiple messages.  When the reference count reaches zero, the node is
deleted.  If the node is an interior node, the subtrees are visited
recusively and all nodes that are no longer needed are deleted as
well. The current implementation avoids recursion as much as possible
and manages an explicit stack when recursion is unavoidable. To
maximize performance, memory for the stack is allocated via {\var
alloca()}, not the regular heap allocator.

\subsection{Interior Nodes}

An interior ({\var PAIR}) node is simply a pair of fragments. Field
{\var l} is the left fragment that is visited first in a message
traversal and {\var r} is the right fragment:

\begin{verbatim}
    struct MsgNodePairPart {
        struct MsgFrag  l;
        struct MsgFrag  r;
    };
\end{verbatim}

One important point is that the head field in the left fragment is
always zero. This is because the message structures that refer to this
tree also contain a fragment.  The offset of the first byte in a
message is therefore subsumed into the {\var f.head} field in the
message structures.  This opens up the possibility of allowing the
owner of a leaf to push header data onto a message without requiring
changes to any of the possibly existing {\var PAIR} nodes.  The fact
that the head field in the left fragment is always zero also implies
that the length of such a fragment is given directly by the value of
the {\var tail} field. (Recall that in general, the length is given by
{\var tail - head}). The message library makes heavy use of this invariant
and is optimized accordingly.

Now, suppose {\var p} is a {\var PAIR} node. Then the $n$-th byte in the
data represented by {\var p} is found as follows: if $n$ is less than
{\var l.tail}, the byte is located at offset $n$ in the left subtree
({\var l.tree}).  Otherwise, the byte is located at offset $n$
{\var - l.tail + r.head} in the right subtree ({\var r.tree}).

\subsection{Leaves}

The common state that the three leaf node types share is factored into
the following structure.

\begin{verbatim}
    struct MsgNodeLeafPart {
        int   size;     /* size of buffer */
        char  *buf;     /* the buffer */
    };
\end{verbatim}

The {\var buf} field points to the beginning of the memory buffer
represented by this node. The {\var size} field specifies the size of
the buffer in bytes.

A {\var PAGE} node contains the memory for its data in the node
structure itself. As C doesn't support variable length arrays the
structure given below declares the buffer to be only one byte long.
In reality, the buffer is {\var size} bytes long. (This is the {\var
size} field in the {\var MsgNodeLeafPart} structure). This works
correctly so long as this structure appears at the very end of the
enclosing data structure. The declaration is given below.

\begin{verbatim}
    struct MsgNodePagePart {
        char  buffer[1];      /* longer in actuality */
    };
\end{verbatim}

{\var BUF} type nodes contain a reference to the function that is used
to free (deallocate) the data buffer once it is no longer needed by
the message library. The address of the deallocation function is
stored in the {\var free} field in the following structure.

\begin{verbatim}
    struct MsgNodeBufPart {
        MsgDeallocator  free;
    };
\end{verbatim}

The full declarations of the node structures are given below. As
expected, there is one structure per node type. In addition, there are
two structures ({\var MsgNode} and {\var MsgNodeLeaf}) that factor
state that is common to more than one node type.

\begin{verbatim}
    struct MsgNode {
        struct MsgNodePart      c;
    };

    struct MsgNodePair {
        struct MsgNodePart      c;
        struct MsgNodePairPart  pair;
    };

    struct MsgNodeLeaf {
        struct MsgNodePart      c;
        struct MsgNodeLeafPart  leaf;
    };

    struct MsgNodePage {
        struct MsgNodePart      c;
        struct MsgNodeLeafPart  leaf;
        struct MsgNodePagePart  page;
    };

    struct MsgNodeBuf {
        struct MsgNodePart      c;
        struct MsgNodeLeafPart  leaf;
        struct MsgNodeBufPart   buf;
    };
\end{verbatim}

\noindent
Thus, the relationships illustrated in Figure~\ref{fig7} hold, where
each edge denotes the ``{\it is a}'' relation; e.g., {\var
MsgNodeLeaf} {\it is a} {\var MsgNode}.

\begin{figure}[ht]
\centering
\leavevmode\hbox{\epsfig{file=fig7.ps,height=1.5in}}
\caption{Relationship among various structures.}\label{fig7}
\end{figure}

\subsection{Message Walk Context}

The contents of a message can be visited using the {\var msgWalk}
functions.  As described in the previous section, visiting a message
tree requires maintaining a stack of fragments. The structure that
implements this conceptual stack is given below.

\begin{verbatim}
    struct MsgWalk {
        struct MsgFrag      f;       /* frag to be visited next */
        struct MsgFragStack stack;
    };
\end{verbatim}

To understand the reason for maintaining field {\var f} besides the
actual fragment stack stored in field stack, consider the case where a
message consists only of {\var PAIR} nodes whose left subtrees are leaves.
Such a message can be traversed using a single fragment that keeps
track of the position to be visited next.  In essence, this case
corresponds to traversing a linear list.  Memory for the actual stack
needs to be allocated on the heap, which is relatively expensive.
Thus, by maintaining field {\var f}, memory (de-)allocation can be
avoided for common case messages.

\section{Final Comments}

There are several points that are worth emphasizing. These points
capture the invariants that provide the underpinnings of the message
library. As such, they should be helpful in reading and understanding
the code.

\begin{itemize}

\item Message data can have arbitrary alignment. It is not safe to
assume, for example, that initializing a message to a length of 8 will
yield a pointer to memory that is aligned on an 8-byte boundary.

\item Due to sharing, message data must not be written more than once.

\item Pair nodes, and therefore the fragments contained therein, are
immutable. Once created and initialized, they don't change. Only the
fragment in the message structures can change over time.

\item The head field in the left fragment of a {\var PAIR} node must always be
zero. As a result, the {\var tail} field in such a fragment directly
gives the length of the fragment.
\end{itemize}

\bibliography{../manual/manual}
\end{document}