CIS 5511. Programming Techniques

Linear Structures and Hash Tables

### 1. Dynamic sets

Many data structures can be seen abstractly as dynamic sets, containing elements which can be moved into and out of the sets.

A dynamic set S usually supports the following "dictionary operations":

• Insert(S, x): An operation that adds a new element x into S.
• Delete(S, x): An operation that removes an element x from S (if it is there).
• Search(S, k): A query that locates an element in S with a given key k.
When there is an order defined among the elements, the following operations are also often supported:
• Minimum(S): A query that locates the element in S with the smallest key.
• Maximum(S): A query that locates the element in S with the largest key.
• Successor(S, x): A query that locates the element in S that comes immediately after x in the order.
• Predecessor(S, x): A query that locates the element in S that comes immediately before x in the order.
The query operations do not change the content of the set, and may return a NIL pointer if there is no element in S satisfying the search condition. The Insert and Delete operations change the content of the data structure.
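As a rough illustration of the operations above, here is a minimal sketch of a dynamic set backed by a sorted Python list. The class name `SortedArraySet` and its method names are hypothetical, elements are identified with their keys for simplicity, and `None` plays the role of NIL.

```python
import bisect


class SortedArraySet:
    """Dynamic set of distinct keys kept in a sorted list (illustrative sketch)."""

    def __init__(self):
        self._keys = []

    def insert(self, k):
        # Find the position that keeps the list sorted; skip duplicates.
        i = bisect.bisect_left(self._keys, k)
        if i == len(self._keys) or self._keys[i] != k:
            self._keys.insert(i, k)

    def delete(self, k):
        # Remove k if it is present; do nothing otherwise.
        i = bisect.bisect_left(self._keys, k)
        if i < len(self._keys) and self._keys[i] == k:
            self._keys.pop(i)

    def search(self, k):
        i = bisect.bisect_left(self._keys, k)
        return k if i < len(self._keys) and self._keys[i] == k else None

    def minimum(self):
        return self._keys[0] if self._keys else None

    def maximum(self):
        return self._keys[-1] if self._keys else None

    def successor(self, k):
        # Smallest key strictly greater than k, or None.
        i = bisect.bisect_right(self._keys, k)
        return self._keys[i] if i < len(self._keys) else None

    def predecessor(self, k):
        # Largest key strictly smaller than k, or None.
        i = bisect.bisect_left(self._keys, k)
        return self._keys[i - 1] if i > 0 else None
```

In this sketch the queries use binary search, while `insert` and `delete` may shift many elements; other data structures make a different trade-off.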

Different data structures handle these operations differently. When implemented, a data structure represents the relation/order among objects in a domain as relation/order among objects within a computer.

There are two basic ways to represent relations in a computer:

• Implicitly, using the sequential order among memory cells,
• Explicitly, using specific pointers/references.

### 2. Linear data structures

If the set is totally ordered (usually by key, but not always), then the data structure is linear, in the sense that every element in it has exactly one predecessor and one successor, except the minimum (which has no predecessor) and the maximum (which has no successor).

For a linear data structure, the order among elements is implicitly represented in an array, and explicitly in a linked list.

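Between the two, an array is more efficient for random access (any element can be reached by its index in constant time), while a linked list is more efficient for insertions and deletions (once the position is known, only a few pointers change and no elements need to be shifted).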

A linked list can be implemented in two arrays, one of which keeps the elements themselves, and the other keeps the index of the successor for each element. Similarly, multiple arrays can be used to represent an array of objects with multiple fields.
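The two-array representation can be sketched as follows. This is only one possible layout: the names `ArrayLinkedList`, `key`, `next`, and `free`, the use of `-1` as the NIL index, and the free list of unused cells are assumptions made for the example.

```python
NIL = -1  # sentinel index playing the role of a NIL pointer


class ArrayLinkedList:
    """Singly linked list stored in two parallel arrays (illustrative sketch)."""

    def __init__(self, capacity):
        self.key = [None] * capacity                    # key[i]: element in cell i
        self.next = [i + 1 for i in range(capacity)]    # next[i]: index of successor
        if capacity:
            self.next[-1] = NIL
        self.head = NIL                                 # first cell of the list
        self.free = 0 if capacity else NIL              # first unused cell

    def insert_front(self, x):
        if self.free == NIL:
            raise OverflowError("no free cell")
        i = self.free
        self.free = self.next[i]    # take a cell off the free list
        self.key[i] = x
        self.next[i] = self.head    # link it in front of the old head
        self.head = i

    def delete(self, x):
        prev, i = NIL, self.head
        while i != NIL and self.key[i] != x:
            prev, i = i, self.next[i]
        if i == NIL:
            return False            # x is not in the list
        if prev == NIL:
            self.head = self.next[i]
        else:
            self.next[prev] = self.next[i]
        self.key[i] = None
        self.next[i] = self.free    # return the cell to the free list
        self.free = i
        return True

    def to_list(self):
        out, i = [], self.head
        while i != NIL:
            out.append(self.key[i])
            i = self.next[i]
        return out
```

Note that the logical order of the list (followed through `next`) is independent of the physical order of the cells in `key`.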

To turn a linear structure into a circular structure, in an array we can let the index wrap around from A.length back to 1, and in a linked list we can let the last element point to the first element.

### 3. Stack and queue

Stacks and queues are abstract data structures in which the order to be maintained is the order in which elements are inserted into the structure. The deletion order is assumed to be either the same as the insertion order (in queues) or its reverse (in stacks). Therefore both insertion and deletion become "zero address" operations: the position where the operation is performed is always at the minimum/maximum element, and random access to the other elements is not allowed.

Both stacks and queues can be implemented with either an array or a linked list. For a stack, the implementation of the operations (called "push" and "pop") is relatively simple. For a queue, one additional pointer is needed, because the insert and delete operations (sometimes called "enqueue" and "dequeue") work on different ends ("rear" and "front", respectively) of the structure. For both data structures, an array implementation is more efficient, but a linked list implementation is more flexible. [Why?]

The insert and delete operations on stacks and queues run in Θ(1) time.
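A minimal array-based sketch of both structures is given below. The class names `ArrayStack` and `ArrayQueue` are hypothetical. The queue uses a fixed-size circular buffer and, as a variation on keeping two pointers, stores the front index plus the current size and derives the rear position from them. The stack relies on a Python list, whose `append` is constant time only in the amortized sense.

```python
class ArrayStack:
    """LIFO stack on top of a Python list (illustrative sketch)."""

    def __init__(self):
        self._items = []

    def push(self, x):
        self._items.append(x)          # insert at the top

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()       # delete from the top


class ArrayQueue:
    """FIFO queue in a fixed-size circular buffer (illustrative sketch)."""

    def __init__(self, capacity):
        self._buf = [None] * capacity
        self._front = 0                # index of the oldest element
        self._size = 0                 # number of stored elements

    def enqueue(self, x):
        if self._size == len(self._buf):
            raise OverflowError("queue is full")
        rear = (self._front + self._size) % len(self._buf)  # wrap around
        self._buf[rear] = x
        self._size += 1

    def dequeue(self):
        if self._size == 0:
            raise IndexError("dequeue from empty queue")
        x = self._buf[self._front]
        self._buf[self._front] = None
        self._front = (self._front + 1) % len(self._buf)    # wrap around
        self._size -= 1
        return x
```

Neither structure ever scans or shifts its elements, which is why each operation touches only a constant number of cells.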

### 4. Hash tables

Conceptually, a hash table is a dynamic set without internal order. It provides quick search/insert/delete by directly mapping a key value to an index in a table, but does not efficiently support the operations defined with respect to inter-element order or relation (such as Minimum, Maximum, Successor, and Predecessor).

Among the three operations, an insertion and deletion often require a search, therefore search is the major factor in efficiency analysis. Since search is basically a mapping from a key to an index in the table, the intuition behind hash table is to directly build such a mapping, without comparisons with the elements in the table.

The design of a hash table usually consists of two major tasks: (1) to define a hash function that calculates index from key, (2) to handle "collision", which is the situation where multiple keys are mapped into the same index.

Typically, the size of a hash table is proportional to the number of keys actually stored, while being much smaller than the range of possible keys.

A good hash function satisfies (approximately) the assumption that each key is equally likely to hash to any of the m index values, independently of where any other key has hashed to. Often, a hash function works in two steps: (1) to convert a key into an integer, (2) to calculate the index from the integer. Therefore, hash function discussion often assumes that the function is a mapping from an integer (key) to an integer (index).

The most common hash function is to take the remainder of the key divided by the size of the hash table, that is, h(k) = k mod m. A more complicated version is h(k) = f(k) mod m, where f(k) does additional calculation to reduce collisions, and the remainder operation makes the function values cover the whole table. A similar approach is h(k) = floor(g(k) * m), where g(k) maps k into the range [0, 1).
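The two steps and the two families of functions above might be sketched as follows. The function names are hypothetical, the radix-128 string encoding is just one possible choice for step (1), and the constant A = (√5 − 1)/2 is a commonly suggested choice for g(k) = (k · A) mod 1, not the only one.

```python
import math


def string_to_int(s, radix=128):
    """Step (1): interpret a string as an integer written in the given radix."""
    n = 0
    for ch in s:
        n = n * radix + ord(ch)
    return n


def hash_division(k, m):
    """Step (2), division method: h(k) = k mod m."""
    return k % m


def hash_multiplication(k, m, A=(math.sqrt(5) - 1) / 2):
    """Step (2), multiplication method: h(k) = floor(g(k) * m),
    where g(k) = fractional part of k * A lies in [0, 1)."""
    g = (k * A) % 1.0
    return math.floor(g * m)


k = string_to_int("pt")
print(k, hash_division(k, 701), hash_multiplication(k, 1024))
```

For the division method, m is often chosen as a prime not too close to a power of 2, so that the index depends on all the bits of the key.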

Common collision handling methods can be divided into two types: open addressing, where all elements are stored in the table itself, and separate chaining, where "overflow" elements are stored outside the table.

Separate chaining requires additional space, but is conceptually simpler than open addressing.
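A minimal sketch of separate chaining follows. The class name `ChainedHashTable` is hypothetical, each chain is kept as a Python list for brevity (a linked list would serve the same purpose), and Python's built-in `hash` is used for the key-to-integer step.

```python
class ChainedHashTable:
    """Hash table with separate chaining (illustrative sketch)."""

    def __init__(self, m=8):
        self._slots = [[] for _ in range(m)]   # one chain per table index

    def _index(self, key):
        return hash(key) % len(self._slots)    # division method

    def insert(self, key, value):
        chain = self._slots[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                       # key already stored: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self._slots[self._index(key)]:
            if k == key:
                return v
        return None                            # plays the role of NIL

    def delete(self, key):
        chain = self._slots[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False
```

With m = 8, the integer keys 1, 9, and 17 all map to index 1 and end up in the same chain, so a search for any of them compares keys only within that chain.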

In open addressing, the hash function generates a probe sequence that tells the element where to go if the slot indicated by the hash function is already occupied. The step between probes can depend on the key (as in double hashing) or follow a fixed pattern (as in linear and quadratic probing). Either way, element comparisons are necessary, as collisions can happen in multiple places. After a deletion, the released slot cannot simply be marked as empty: it needs a special "deleted" mark, so that later searches pass through it and still reach elements that were inserted after probing past that slot.
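The following sketch shows open addressing with linear probing, including the "deleted" mark. The class name `LinearProbingTable` and the sentinel objects `_EMPTY` and `_DELETED` are hypothetical; only keys are stored, to keep the example short.

```python
_EMPTY = object()     # slot that has never been used
_DELETED = object()   # slot whose element was deleted (a "tombstone")


class LinearProbingTable:
    """Open addressing with linear probing (illustrative sketch)."""

    def __init__(self, m=8):
        self._keys = [_EMPTY] * m

    def _probe(self, key):
        # Probe sequence: h(k), h(k)+1, h(k)+2, ... (mod m).
        m = len(self._keys)
        start = hash(key) % m
        for i in range(m):
            yield (start + i) % m

    def insert(self, key):
        first_free = None
        for j in self._probe(key):
            slot = self._keys[j]
            if slot is _EMPTY:
                if first_free is None:
                    first_free = j
                break                      # key cannot appear further on
            if slot is _DELETED:
                if first_free is None:
                    first_free = j         # remember it, but keep looking
            elif slot == key:
                return                     # already present
        if first_free is None:
            raise OverflowError("hash table is full")
        self._keys[first_free] = key

    def search(self, key):
        for j in self._probe(key):
            slot = self._keys[j]
            if slot is _EMPTY:
                return None                # an empty slot ends the search
            if slot is not _DELETED and slot == key:
                return j                   # index where the key is stored
        return None

    def delete(self, key):
        j = self.search(key)
        if j is None:
            return False
        self._keys[j] = _DELETED           # mark, do not empty
        return True
```

If `delete` reset the slot to `_EMPTY` instead, a later search for a key stored further along the same probe sequence would stop too early and wrongly report that the key is absent.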

Under the assumption that each possible probing sequence is equally probable, and that the load factor of the hash table α (number of elements / table size) is less than 1, there are the following conclusions:

• Theorem 11.6: The expected number of probes in an unsuccessful search is at most 1/(1 − α).
• Corollary 11.7: The expected number of probes in an insertion is at most 1/(1 − α).
• Theorem 11.8: The expected number of probes in a successful search is at most (1/α) ln (1/(1 − α)).
Therefore, as long as the load factor α is bounded by a constant less than 1, the average cost of the major operations of a hash table is Θ(1), though the worst-case cost is still Θ(n).
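To get a feeling for these bounds, the short script below evaluates them for a few load factors. The function names are hypothetical; the formulas are exactly the ones stated in the theorems above.

```python
import math


def unsuccessful_bound(alpha):
    """Upper bound on expected probes in an unsuccessful search or an insertion."""
    return 1.0 / (1.0 - alpha)


def successful_bound(alpha):
    """Upper bound on expected probes in a successful search."""
    return (1.0 / alpha) * math.log(1.0 / (1.0 - alpha))


for alpha in (0.25, 0.5, 0.75, 0.9):
    print(f"alpha={alpha:.2f}  "
          f"unsuccessful<={unsuccessful_bound(alpha):.3f}  "
          f"successful<={successful_bound(alpha):.3f}")
```

Both bounds depend only on α and not on the number of stored elements, but they grow quickly as α approaches 1, which is why open-addressing tables are usually resized before they become nearly full.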