1. What problem do we face?
/**
* Q: What problem do we face?
*
* A: Data compression is still an open problem, even now. We want data to
* take up less space, and that wish grows stronger whenever we have to
* transmit data. Before reading on, it may be helpful to try to come up
* with a workable way to compress data yourself. I tried, and failed.
* That is why I wrote this article. ^_^
*/
2. How can I solve it?
/**
* Q: How can I solve it?
*
* A: Where there is a problem there is an answer, although it is not always
* the best one. In 1951, David A. Huffman devised an algorithm for this.
* Unlike an ordinary fixed-length code, it is a variable-length code:
* different symbols get codes of different lengths. That raises two
* questions:
*
* No.1: Is a variable-length code even possible? How can we know where the
* code of the current symbol ends?
*
* The answer is a prefix code: a code in which no codeword is the prefix of
* another codeword. Consider the following tree:
*
*            O
*        1 /   \ 0
*         O     O
*     1 /   \ 0 c
*      O     O
*      a     b
*
* This is a simple binary tree with three leaf nodes: a, b, and c. We label
* every left branch 1 and every right branch 0. To reach the leaf node a,
* the path is 11. In the same way we can read off the code of every leaf:
* a : 11
* b : 10
* c : 0
*
* Almost by accident, we have a variable-length code. Because symbols sit
* only at the leaves, no code is the prefix of another, so a decoder always
* knows where one symbol ends and the next begins.
*
*
* No.2: How can we use a variable-length code to compress a series of symbols?
*
* Now that we have variable-length codes, something interesting becomes
* possible. Imagine data made up of a series of symbols, where some symbols
* occur frequently and others occur rarely. If we give the shorter codes
* to the frequent symbols, the data takes less space than it would with a
* fixed-length code. That is what we want.
*
* So we know that we can compress data with a variable-length code. The next
* question is which variable-length code to choose: what kind of code
* is optimal?
*/
3. What is Huffman coding?
/**
* Q: What is Huffman coding?
*
* A: Now the problem is how to build an optimal tree. Do you have any idea?
* Huffman introduced an algorithm for this. It is a greedy algorithm. It may
* look simple, but the result is valid (this is discussed in section 4). The
* simplest construction uses a priority queue in which the node with the
* lowest probability has the highest priority. The steps are as follows:
*
* 1. Create a leaf node for each symbol and add it to the priority queue.
* 2. While there is more than one node in the queue:
*    1. Remove the two nodes with the highest priority (lowest probability).
*    2. Create a new node as the parent of these two nodes. Its probability
*       is the sum of the two nodes' probabilities.
*    3. Add the new node to the queue.
* 3. The remaining node is the root of the tree. Read off each symbol's code
*    from the tree as we did above.
*
*/
4. Is it optimal?
/**
* Q: Is it optimal?
*
* A: Hard to say. I do not have a sound method to measure this myself, so on
* this question I would like to hear other people's advice; I believe there
* must be some interesting suggestions. By the way, this article only covers
* the compression of independent symbols. Another important issue is symbols
* that depend on one another, which may be a much harder problem.
*
*/
5. Source code
/**
* Here is a simple example.
*/
#include <stdio.h>
#include <iostream>
/**
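* In a Huffman tree, some nodes hold a valid symbol and others are combined
* nodes that hold no symbol of their own. We need to mark which is which in
* our nodes. The container below also uses this type to mark which of its
* slots are occupied.
*/
enum ELEM_TYPE {
    ET_VALID,
    ET_INVALID,
    ET_MAX,
};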
typedef int INDEX;
/**
* This is a container. We push elements into it and pop them by priority. It
* is a class template because we do not know the element type in advance.
*/
template <class ELEM>
class Container {
public:
    Container( int capacity);
    ~Container( );
    /*
    * Push an element into this container. Returns false when it is full.
    */
    bool push( ELEM item);
    /*
    * Pop an element from this container; the smallest one has the highest
    * priority. The element type must therefore overload operator '<'.
    */
    bool pop( ELEM &item );
private:
    bool _find_idle( INDEX &num);
    bool _set_elem( INDEX num, ELEM &elem);
    bool _get_elem( INDEX num, ELEM &elem);
    ELEM *ele;
    ELEM_TYPE *stat;
    int cap;
};
template <class ELEM>
Container<ELEM>::Container( int capacity)
{
    this->ele = new ELEM[capacity] ;
    this->stat = new ELEM_TYPE[capacity];
    int i;
    for( i=0; i<capacity; i++)
        this->stat[i] = ET_INVALID;
    this->cap = capacity ;
}
template <class ELEM>
Container<ELEM>::~Container( )
{
    if( this->ele!=NULL )
        delete []this->ele;
    if( this->stat!=NULL )
        delete []this->stat;
    this->cap = 0;
}
template <class ELEM>
bool Container<ELEM>::push( ELEM item)
{
    INDEX num = -1;
    if( (!this->_find_idle( num))
        ||(!this->_set_elem( num, item)))
        return false;
    return true;
}
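template <class ELEM>
bool Container<ELEM>::pop( ELEM &item )
{
    INDEX i = 0;
    INDEX Min;
    /*
    * Find the first valid element. The bounds check comes first so that
    * stat[] is never read past the end when the container is empty.
    */
    while( ( i<this->cap)
        &&(this->stat[i]!=ET_VALID))
        i++;
    /*
    * Scan the rest for the smallest valid element. If nothing was valid,
    * Min equals cap and _get_elem() rejects it.
    */
    for( Min = i ; i<this->cap; i++)
    {
        if( ( this->stat[i]==ET_VALID)
            &&( this->ele[i]<this->ele[Min]))
        {
            Min = i;
        }
    }
    return this->_get_elem( Min, item);
}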
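/*
* Find the first unoccupied slot. Returns false when every slot is in use.
*/
template <class ELEM>
bool Container<ELEM>::_find_idle( INDEX &num)
{
    INDEX i;
    for( i=0; i<this->cap; i++)
    {
        if( this->stat[i]==ET_INVALID )
        {
            num = i;
            return true;
        }
    }
    return false;
}
/*
* Store elem in slot num and mark the slot as occupied.
*/
template <class ELEM>
bool Container<ELEM>::_set_elem( INDEX num, ELEM &elem)
{
    if( (num>=this->cap)
        ||(num<0) )
        return false;
    this->stat[num] = ET_VALID;
    this->ele[num] = elem;
    return true;
}
/*
* Copy slot num into elem and mark the slot as free. Only the index range
* is checked here; the caller must pass a slot that is occupied.
*/
template <class ELEM>
bool Container<ELEM>::_get_elem( INDEX num, ELEM &elem)
{
    if( (num<0)
        ||(num>=this->cap))
        return false;
    this->stat[num] = ET_INVALID;
    elem = this->ele[num];
    return true;
}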
/**
* Define a symbol type. It records all the information about one symbol.
*/
typedef char SYMINDEX;
typedef int SYMFRE;
class Symbol {
public:
    /*
    * In the Huffman tree we need to compute the sum of two child symbols.
    * For convenience, we overload operator '+'.
    */
    Symbol operator + ( Symbol &s);
    SYMINDEX sym;
    SYMFRE freq;
};
Symbol Symbol::operator +( Symbol &s)
{
Symbol ret;
ret.sym = '