3223-08

CIS 3223. Data Structures and Algorithms

Greedy algorithms

Greedy algorithm: always select the best step among the immediate choices, without looking ahead.

Advantage: simple and efficient. Disadvantage: may miss the best path.

1. Minimum spanning tree

In a connected, undirected, weighted graph <V, E>, a minimum spanning tree (MST) is a connected subgraph that keeps the same V but uses a subset of E with the smallest total weight. It is a tree and contains |V|-1 edges.

Application: connecting a number of computers with the shortest cable.

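Start with every vertex as its own tree. Each added edge joins two separate trees, reducing the number of trees by one in each step. The cut property: for any partition of V into two parts, the lightest edge crossing the partition belongs to some MST.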

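Prim's algorithm: grow a single tree from a start vertex by repeatedly adding the lightest edge that connects the tree to a vertex outside it (so no cycle is produced), using a priority queue (heap). Compare with Dijkstra's algorithm, which has the same structure but orders the queue by distance from the source rather than by edge weight.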
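A minimal sketch of Prim's algorithm in Python, using the standard heapq module as the priority queue. The graph representation (a dict from vertex to a list of (weight, neighbor) pairs) and the names prim_mst and example_graph are assumptions made for illustration, not part of the course material.

```python
import heapq

def prim_mst(graph, start):
    """graph: dict vertex -> list of (weight, neighbor) pairs, undirected."""
    visited = {start}
    # Heap of candidate edges (weight, u, v) that leave the current tree.
    frontier = [(w, start, v) for w, v in graph[start]]
    heapq.heapify(frontier)
    mst_edges = []
    while frontier and len(visited) < len(graph):
        w, u, v = heapq.heappop(frontier)   # lightest candidate edge
        if v in visited:
            continue                        # both ends in the tree: would form a cycle
        visited.add(v)
        mst_edges.append((u, v, w))
        for w2, x in graph[v]:
            if x not in visited:
                heapq.heappush(frontier, (w2, v, x))
    return mst_edges

# Hypothetical 4-vertex graph, each edge listed from both endpoints.
example_graph = {
    "A": [(1, "B"), (4, "C")],
    "B": [(1, "A"), (2, "C"), (6, "D")],
    "C": [(4, "A"), (2, "B"), (3, "D")],
    "D": [(6, "B"), (3, "C")],
}
tree = prim_mst(example_graph, "A")
total_weight = sum(w for _, _, w in tree)
```

Stale heap entries (edges whose far end has already joined the tree) are simply skipped when popped, which avoids needing a decrease-key operation.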

Kruskal's algorithm: repeatedly add the next lightest edge that doesn't produce a cycle.

Represent each disjoint set by a tree in which every node points to its parent; the set is identified by its root (which points to itself).

makeset: make each node point to itself
find: locate the root of the tree
union: make the root of the shorter tree point to the root of the taller tree.
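The three operations above can be sketched as follows. The class name DisjointSet and the use of a rank field to record tree height are assumptions for illustration; path compression is deliberately left out so that rank equals the actual height.

```python
class DisjointSet:
    def __init__(self):
        self.parent = {}   # node -> parent node
        self.rank = {}     # root -> height of its tree

    def makeset(self, x):
        self.parent[x] = x          # a new node is its own root
        self.rank[x] = 0

    def find(self, x):
        while self.parent[x] != x:  # follow parent pointers up to the root
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False            # already in the same set
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx         # make rx the root of the taller tree
        self.parent[ry] = rx        # shorter tree hangs under the taller one
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1      # equal heights: merged tree grows by one
        return True

ds = DisjointSet()
for node in (1, 2, 3, 4):
    ds.makeset(node)
ds.union(1, 2)
ds.union(3, 4)
same_before = ds.find(1) == ds.find(3)   # False: two separate sets
ds.union(2, 4)
same_after = ds.find(1) == ds.find(3)    # True: all four nodes share one root
```

Attaching the shorter tree under the taller one keeps every tree's height at most log2 of its size, so find takes O(log n) steps.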

The Set Interface in Java: no "find".

Kruskal's algorithm has complexity O(|E| log |V|).
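A minimal sketch of Kruskal's algorithm with an inline disjoint-set, to show where the bound comes from: sorting the edges costs O(|E| log |E|) = O(|E| log |V|), and each of the O(|E|) find calls costs O(log |V|) with union by height. The function name kruskal_mst and the edge format (weight, u, v) are assumptions for illustration.

```python
def kruskal_mst(vertices, edges):
    """edges: list of (weight, u, v) tuples for an undirected graph."""
    parent = {v: v for v in vertices}   # makeset for every vertex
    rank = {v: 0 for v in vertices}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):       # lightest edge first
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                    # same tree: this edge would close a cycle
        if rank[ru] < rank[rv]:
            ru, rv = rv, ru
        parent[rv] = ru                 # union: shorter tree under taller tree
        if rank[ru] == rank[rv]:
            rank[ru] += 1
        mst.append((u, v, w))
    return mst

# Hypothetical example graph.
example_edges = [(1, "A", "B"), (4, "A", "C"), (2, "B", "C"),
                 (6, "B", "D"), (3, "C", "D")]
mst = kruskal_mst("ABCD", example_edges)
mst_weight = sum(w for _, _, w in mst)
```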

2. Huffman encoding

Compression coding, as used in ZIP (lossless compression) and MP3 (lossy compression).

Basic assumption: the data is a sequence of basic symbols or signals from a finite set, each of which is coded into a binary codeword. The same code table is used in encoding and decoding.

Example: the most efficient way to code messages consisting of {A,B,C,D}.

What if the symbols have very different probabilities of appearing?

Idea: variable-length code. Average codeword length.
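A small worked example of average codeword length, comparing a fixed-length code with a variable-length one. The probabilities and both code tables are made up for illustration.

```python
# Hypothetical symbol probabilities (they sum to 1).
freq = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

fixed_code = {"A": "00", "B": "01", "C": "10", "D": "11"}
variable_code = {"A": "0", "B": "10", "C": "110", "D": "111"}

def average_length(freq, code):
    # Expected number of bits per symbol: sum of probability * codeword length.
    return sum(freq[s] * len(code[s]) for s in freq)

fixed_avg = average_length(freq, fixed_code)        # 2.0 bits per symbol
variable_avg = average_length(freq, variable_code)  # 1.75 bits per symbol
```

Giving the most frequent symbol the shortest codeword lowers the average from 2 bits to 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits per symbol.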

Condition for the code to be usable: the prefix-free property. Codeword table as full binary tree. Example. Encoding and decoding process. Why the prefix-free property is guaranteed.
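A minimal sketch of encoding and decoding with a prefix-free code table. The table CODE and the function names are hypothetical. Because no codeword is a prefix of another, the decoder can emit a symbol as soon as the bits read so far match a codeword.

```python
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}  # hypothetical code table

def is_prefix_free(code):
    words = list(code.values())
    return not any(i != j and words[j].startswith(words[i])
                   for i in range(len(words)) for j in range(len(words)))

def encode(message, code):
    return "".join(code[s] for s in message)

def decode(bits, code):
    inverse = {cw: s for s, cw in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:      # a full codeword has been read
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

bits = encode("BADCAB", CODE)
restored = decode(bits, CODE)
```

In tree terms, each codeword is the path from the root to a leaf (0 for left, 1 for right). Symbols sit only at leaves, so no codeword's path passes through another symbol, which is why the prefix-free property holds.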

Input: symbols with probability, frequency, or count in sample message. Output: encoding tree and/or code table.

Frequency of an intermediate node is the sum of those of its children. Total cost: the sum of the frequencies of all nodes except the root. Numbers of leaves and intermediate nodes: a full binary tree with n leaves has n-1 intermediate nodes. Greedy: merge the two nodes with the smallest frequencies first.

Huffman algorithm. Demo applet.

How to prove that the algorithm produces the shortest code?

Complexity: O(n log n) if the priority queue is a heap.
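A minimal sketch of the Huffman algorithm using heapq as the priority queue: n-1 merges, each with O(log n) heap operations, gives O(n log n). The function name huffman_code, the tuple representation of internal nodes, and the sample counts are assumptions for illustration; symbols are assumed to be strings, not tuples.

```python
import heapq
from itertools import count

def huffman_code(freq):
    """freq: dict symbol -> frequency. Returns (code table, total cost)."""
    tiebreak = count()  # unique counter so equal frequencies never compare nodes
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two smallest frequencies
        f2, _, right = heapq.heappop(heap)
        cost += f1 + f2                      # both children are non-root nodes
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    _, _, root = heap[0]

    table = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a symbol
            table[node] = prefix or "0"      # single-symbol input gets "0"
    walk(root, "")
    return table, cost

sample_counts = {"A": 70, "B": 3, "C": 20, "D": 37}  # illustrative counts
table, cost = huffman_code(sample_counts)
encoded_bits = sum(sample_counts[s] * len(table[s]) for s in sample_counts)
```

Here the cost accumulated during the merges (the sum of the frequencies of all non-root nodes) equals the total number of bits needed to encode the sample, 213 in both cases.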

3. Set cover

Greedy algorithms do not always produce optimal solutions.

Problem: Given some subsets of a set, find the smallest number of them whose union contains every element of the set.

Solution: Repeatedly select the subset containing the largest number of uncovered elements.

Example. Greedy solution: a,f,c,j; optimal solution: b,e,i.
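The greedy rule can be sketched as follows. The instance below is a made-up one (not the example from the lecture), chosen so that greedy picks three subsets while two suffice; the names greedy_set_cover, S1, S2, T1, T2, T3 are hypothetical.

```python
def greedy_set_cover(universe, subsets):
    """subsets: dict name -> set of elements. Returns the chosen names in order."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the subset covering the most still-uncovered elements.
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("the subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

universe = set(range(1, 15))
subsets = {
    "S1": {1, 2, 3, 4, 5, 6, 7},           # S1 and S2 together are optimal
    "S2": {8, 9, 10, 11, 12, 13, 14},
    "T1": {1, 2, 3, 4, 8, 9, 10, 11},      # largest subset, tempts greedy
    "T2": {5, 6, 12, 13},
    "T3": {7, 14},
}
picked = greedy_set_cover(universe, subsets)
```

Greedy takes T1 first (8 elements), after which T2 and then T3 each cover more of the remainder than S1 or S2, giving three subsets where the optimal cover {S1, S2} uses two.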

For n elements, if the optimal solution uses k subsets, the greedy solution uses at most k ln n subsets. Therefore, the approximation factor of this greedy algorithm is ln n. On the other hand, it runs in polynomial time, whereas no polynomial-time algorithm is known for finding the optimal solution (the problem is NP-hard).
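A short derivation of the k ln n bound, written in LaTeX. The symbol n_t (the number of elements still uncovered after t greedy picks, with n_0 = n) is notation introduced here for the argument.

```latex
% The n_t uncovered elements are covered by the k optimal subsets,
% so some subset covers at least n_t / k of them. Greedy does at least as well:
n_{t+1} \le n_t - \frac{n_t}{k} = n_t \left(1 - \frac{1}{k}\right)
% Unrolling the recurrence, and using 1 - x < e^{-x} for x \neq 0:
n_t \le n \left(1 - \frac{1}{k}\right)^{t} < n \, e^{-t/k}
% At t = k \ln n:
n_t < n \, e^{-\ln n} = 1
```

Since n_t is a whole number, n_t < 1 means every element is covered, so greedy stops within k ln n picks.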