Entropy
Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 65536/32768 K (Java/Others)
Total Submission(s): 3100 Accepted Submission(s): 1182
Problem Description
An entropy encoder is a data encoding method that achieves lossless data compression by encoding a message with “wasted” or “extra” information removed. In other words, entropy encoding removes information that was not necessary in the first place to accurately encode the message. A high degree of entropy implies a message with a great deal of wasted information; English text encoded in ASCII is an example of a message type that has very high entropy. Already compressed messages, such as JPEG graphics or ZIP archives, have very little entropy and do not benefit from further attempts at entropy encoding.
English text encoded in ASCII has a high degree of entropy because all characters are encoded using the same number of bits, eight. It is a known fact that the letters E, L, N, R, S and T occur at a considerably higher frequency than do most other letters in English text. If a way could be found to encode just these letters with four bits, then the new encoding would be smaller, would contain all the original information, and would have less entropy. ASCII uses a fixed number of bits for a reason, however: it’s easy, since one is always dealing with a fixed number of bits to represent each possible glyph or character. How would an encoding scheme that used four bits for the above letters be able to distinguish between the four-bit codes and eight-bit codes? This seemingly difficult problem is solved using what is known as a “prefix-free variable-length” encoding.
In such an encoding, any number of bits can be used to represent any glyph, and glyphs not present in the message are simply not encoded. However, in order to be able to recover the information, no bit pattern that encodes a glyph is allowed to be the prefix of any other encoding bit pattern. This allows the encoded bitstream to be read bit by bit, and whenever a set of bits is encountered that represents a glyph, that glyph can be decoded. If the prefix-free constraint was not enforced, then such a decoding would be impossible.
Consider the text “AAAAABCD”. Using ASCII, encoding this would require 64 bits. If, instead, we encode “A” with the bit pattern “00”, “B” with “01”, “C” with “10”, and “D” with “11” then we can encode this text in only 16 bits; the resulting bit pattern would be “0000000000011011”. This is still a fixed-length encoding, however; we’re using two bits per glyph instead of eight. Since the glyph “A” occurs with greater frequency, could we do better by encoding it with fewer bits? In fact we can, but in order to maintain a prefix-free encoding, some of the other bit patterns will become longer than two bits. An optimal encoding is to encode “A” with “0”, “B” with “10”, “C” with “110”, and “D” with “111”. (This is clearly not the only optimal encoding, as it is obvious that the encodings for B, C and D could be interchanged freely for any given encoding without increasing the size of the final encoded message.) Using this encoding, the message encodes in only 13 bits to “0000010110111”, a compression ratio of 4.9 to 1 (that is, each bit in the final encoded message represents as much information as did 4.9 bits in the original encoding). Read through this bit pattern from left to right and you’ll see that the prefix-free encoding makes it simple to decode this into the original text even though the codes have varying bit lengths.
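To check the worked example by hand, here is a minimal sketch that encodes and decodes with a fixed code table. The function names `encodeWithTable` and `decodeWithTable` are made up for illustration, and the table is assumed to be prefix-free and to cover every character of the text.

```cpp
#include <map>
#include <string>

// Encode text by concatenating the code of each character.
// Assumes every character of text has an entry in table.
std::string encodeWithTable(const std::string& text,
                            const std::map<char, std::string>& table) {
    std::string bits;
    for (char c : text) bits += table.at(c);
    return bits;
}

// Decode by growing a candidate code one bit at a time. Because the table
// is assumed to be prefix-free, the first match is the only possible one.
std::string decodeWithTable(const std::string& bits,
                            const std::map<char, std::string>& table) {
    std::map<std::string, char> reverse;
    for (const auto& entry : table) reverse[entry.second] = entry.first;

    std::string text, current;
    for (char b : bits) {
        current += b;
        auto it = reverse.find(current);
        if (it != reverse.end()) {
            text += it->second;
            current.clear();
        }
    }
    return text;
}
```

With the table A→0, B→10, C→110, D→111, encoding “AAAAABCD” yields the 13-bit pattern from the statement, and decoding that pattern gives the text back.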
As a second example, consider the text “THE CAT IN THE HAT”. In this text, the letter “T” and the space character both occur with the highest frequency, so they will clearly have the shortest encoding bit patterns in an optimal encoding. The letters “C”, “I” and “N” only occur once, however, so they will have the longest codes.
There are many possible sets of prefix-free variable-length bit patterns that would yield the optimal encoding, that is, that would allow the text to be encoded in the fewest number of bits. One such optimal encoding is to encode spaces with “00”, “A” with “100”, “C” with “1110”, “E” with “1111”, “H” with “110”, “I” with “1010”, “N” with “1011” and “T” with “01”. The optimal encoding therefore requires only 51 bits compared to the 144 that would be necessary to encode the message with 8-bit ASCII encoding, a compression ratio of 2.8 to 1.
Input
The input file will contain a list of text strings, one per line. The text strings will consist only of uppercase alphanumeric characters and underscores (which are used in place of spaces). The end of the input will be signalled by a line containing only the word “END” as the text string. This line should not be processed.
Output
For each text string in the input, output the length in bits of the 8-bit ASCII encoding, the length in bits of an optimal prefix-free variable-length encoding, and the compression ratio accurate to one decimal point.
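The three output fields can be produced from just the character count and the optimal bit count. The helper below is a sketch of that formatting step only; `formatResult` is a hypothetical name, and the optimal bit count is assumed to be computed elsewhere and to be nonzero.

```cpp
#include <cstdio>
#include <string>

// Build one output line: 8-bit ASCII length, optimal length, and the ratio
// printed with one digit after the decimal point.
// Assumes optimalBits > 0.
std::string formatResult(int charCount, int optimalBits) {
    int asciiBits = charCount * 8;
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%d %d %.1f", asciiBits, optimalBits,
                  static_cast<double>(asciiBits) / optimalBits);
    return buf;
}
```

For the two sample strings (8 characters with 13 optimal bits, 18 characters with 51 optimal bits) this gives the two lines of the sample output.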
Sample Input
AAAAABCD
THE_CAT_IN_THE_HAT
END
Sample Output
64 13 4.9
144 51 2.8
Source
Greater New York 2000
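Huffman coding solves the minimum weighted path length problem. In this problem, the weight of each leaf is the number of times its character occurs.
Building a Huffman tree: first put all the nodes into a set. At each step, take the two nodes with the smallest weights out of the set, delete them from it, and create a new node. The new node's weight is the sum of the two nodes' weights, the two nodes become its children, and the new node is added to the set. Repeat until the set contains only one element.
At first I agonized over this problem for quite a while: Huffman coding uses variable-length codes, so wouldn't decoding turn into a mess? Later I realized I had been worrying over nothing. Decoding is actually simple: start from the root each time; for every 1 move the pointer to the current node's left child, and for every 0 move it to the right child, until a leaf is reached.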
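The construction above can be sketched with `std::priority_queue` standing in for the "set". Note that this is a swapped-in shortcut, not the explicit tree used in the solution that follows: it only tracks weights and relies on the fact that the total encoded length equals the sum of the weights of all newly created nodes. The name `huffmanBits` is made up for this sketch.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Total number of bits in an optimal prefix-free encoding of text.
int huffmanBits(const std::string& text) {
    int freq[256] = {0};
    for (unsigned char c : text) ++freq[c];

    // Min-heap holding the weights of all nodes currently in the "set".
    std::priority_queue<int, std::vector<int>, std::greater<int>> pq;
    for (int f : freq) if (f > 0) pq.push(f);

    if (pq.empty()) return 0;
    // A single distinct character still needs one bit per occurrence.
    if (pq.size() == 1) return pq.top();

    int total = 0;
    while (pq.size() > 1) {
        int a = pq.top(); pq.pop();   // smallest weight
        int b = pq.top(); pq.pop();   // second smallest weight
        total += a + b;               // every leaf below the new node gets one bit deeper
        pq.push(a + b);
    }
    return total;
}
```

For the sample strings this gives 13 for `AAAAABCD` and 51 for `THE_CAT_IN_THE_HAT`.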
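The decoding walk described above might look like the sketch below. It follows the same convention (1 goes to the left child, 0 to the right child). The `TrieNode` layout and the helpers `addCode` and `decodeWithTree` are assumptions made for illustration; the tree is built from a given code table rather than from frequencies.

```cpp
#include <string>
#include <vector>

struct TrieNode {
    int left = -1;   // child reached on bit '1'
    int right = -1;  // child reached on bit '0'
    char glyph = 0;  // meaningful only at leaves
};

// Insert one code into the tree (node 0 is the root), creating nodes as needed.
void addCode(std::vector<TrieNode>& tree, const std::string& code, char glyph) {
    int cur = 0;
    for (char bit : code) {
        int next = (bit == '1') ? tree[cur].left : tree[cur].right;
        if (next == -1) {
            next = static_cast<int>(tree.size());
            tree.push_back(TrieNode());
            if (bit == '1') tree[cur].left = next;
            else            tree[cur].right = next;
        }
        cur = next;
    }
    tree[cur].glyph = glyph;
}

// Walk from the root one bit at a time; emit a glyph at each leaf and restart.
std::string decodeWithTree(const std::vector<TrieNode>& tree, const std::string& bits) {
    std::string text;
    int cur = 0;
    for (char bit : bits) {
        cur = (bit == '1') ? tree[cur].left : tree[cur].right;
        if (cur == -1) return text;  // invalid bit sequence: stop early (assumed policy)
        if (tree[cur].left == -1 && tree[cur].right == -1) {
            text += tree[cur].glyph;
            cur = 0;
        }
    }
    return text;
}
```

Starting from a one-node tree and adding the codes 0, 10, 110 and 111 for A, B, C and D, the bit pattern `0000010110111` decodes to `AAAAABCD`.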
#include<stdio.h>
#include<string.h>
class hafft
{
public:
int d;
hafft *lson,*rson;
};
class node
{
public:
int d,tar;
};
hafft haffman[600];
node heap[600];
char ch[1024];
int num[300];
int N;
// Sift heap[x] down until the min-heap property (smallest d on top) holds again.
void adjust_down(int x)
{
if (x*2<=N && heap[x].d>heap[x*2].d)
{
node tmp=heap[x];
heap[x]=heap[x*2];
heap[x*2]=tmp;
adjust_down(x*2);
}
if (x*2+1<=N && heap[x].d>heap[x*2+1].d)
{
node tmp=heap[x];
heap[x]=heap[x*2+1];
heap[x*2+1]=tmp;
adjust_down(x*2+1);
}
}
// Sum of (leaf weight * leaf depth), i.e. the total encoded length in bits.
int count(hafft *x,int d)
{
if (x->lson==NULL && x->rson==NULL) return x->d*d;
int ans=0;
if (x->lson!=NULL) ans+=count(x->lson,d+1);
if (x->rson!=NULL) ans+=count(x->rson,d+1);
return ans;
}
int main()
{
while (scanf("%s",ch)!=EOF)
{
if (strcmp(ch,"END")==0) return 0;
int i,j;
memset(num,0,sizeof(num));
for (i=0;i<strlen(ch);i++) num[ch[i]+1]++;
// Sort the counts in descending order so the nonzero ones occupy num[1..N].
for (i=1;i<256;i++)
for (j=i+1;j<=256;j++)
if (num[i]<num[j])
{
int t=num[i];
num[i]=num[j];
num[j]=t;
}
for (i=1;i<=256;i++)
if (num[i]==0)
{
N=i-1;
break;
}
printf("%d ",(int)strlen(ch)*8);
// Only one distinct character: it still needs a 1-bit code, so the length is its count.
if (N==1)
{
printf("%d %.1lf\n",num[1],strlen(ch)*8.0/num[1]);
continue;
}
for (i=1;i<=N;i++)
{
haffman[i].d=num[i];
haffman[i].lson=haffman[i].rson=NULL;
heap[i].d=num[i];
heap[i].tar=i;
}
for (i=N;i>=1;i--) adjust_down(i);
int T=N;
for (i=1;i<T;i++)
{
node node1=heap[1];
heap[1]=heap[N];
N--;
adjust_down(1);
node node2=heap[1];
haffman[T+i].d=node1.d+node2.d;
haffman[T+i].lson=&haffman[node1.tar];
haffman[T+i].rson=&haffman[node2.tar];
heap[1].d=haffman[T+i].d;
heap[1].tar=T+i;
adjust_down(1);
}
int Min=count(&haffman[2*T-1],0);
printf("%d %.1lf\n",Min,strlen(ch)*8.0/Min);
}
return 0;
}
Once the idea is clear, the rest is easy: accepted on the first submission.
Here is another one for practice:
Safe Or Unsafe
Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others)
Total Submission(s): 1192 Accepted Submission(s): 465
Problem Description
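One day, while reading a computer science book, Javac++ came across something interesting! Any string of characters can be encoded into numbers in order to store it, but different encodings need different amounts of storage space! Moreover, once the storage space exceeds a certain value, the string is unsafe! So Javac++ wondered whether there is a way to obtain the minimum space needed to encode the characters. Clearly there is, because the book covers exactly this topic: Huffman coding. The weight of a letter equals the frequency of that letter in the string. So Javac++ asks for your help: given a safety value and a string, decide whether the string is safe.
Input
The input contains multiple cases. First comes a number n, the number of test cases. Each case then consists of a value m (an integer) and a string that contains no spaces and consists only of lowercase letters.
Output
If the encoded size of the string is less than or equal to the given value, output yes; otherwise output no.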
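The check itself is small once the optimal length is known. The sketch below uses a `std::multiset` as the pool of weights, which is a different container from the hand-written heap in the solution further down; `isSafe` is a hypothetical name, and the input is assumed to contain only lowercase letters, as the statement promises.

```cpp
#include <set>
#include <string>

// Returns true when the optimal Huffman-encoded length of text is at most limit.
bool isSafe(int limit, const std::string& text) {
    int freq[26] = {0};
    for (char c : text) ++freq[c - 'a'];  // assumes 'a'..'z' only

    std::multiset<int> weights;
    for (int f : freq) if (f > 0) weights.insert(f);

    int bits = 0;
    if (weights.size() == 1) {
        // One distinct letter: one bit per occurrence.
        bits = *weights.begin();
    } else {
        while (weights.size() > 1) {
            int a = *weights.begin(); weights.erase(weights.begin());
            int b = *weights.begin(); weights.erase(weights.begin());
            bits += a + b;            // cost added by merging the two lightest nodes
            weights.insert(a + b);
        }
    }
    return bits <= limit;
}
```

On the sample data, `helloworld` needs 27 bits, which exceeds 12, and `ithinkyoucandoit` needs 54 bits, which fits within 66.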
Sample Input
2
12
helloworld
66
ithinkyoucandoit
Sample Output
no
yes
Source
HDU 2008-10 Programming Contest
Recommend
gaojie
#include<stdio.h>
#include<string.h>
class hafft
{
public:
int d;
hafft *lson,*rson;
};
class node
{
public:
int d,tar;
};
hafft haffman[600];
node heap[600];
char ch[1024];
int num[300];
int N;
void adjust_down(int x)
{
if (x*2<=N && heap[x].d>heap[x*2].d)
{
node tmp=heap[x];
heap[x]=heap[x*2];
heap[x*2]=tmp;
adjust_down(x*2);
}
if (x*2+1<=N && heap[x].d>heap[x*2+1].d)
{
node tmp=heap[x];
heap[x]=heap[x*2+1];
heap[x*2+1]=tmp;
adjust_down(x*2+1);
}
}
int count(hafft *x,int d)
{
if (x->lson==NULL && x->rson==NULL) return x->d*d;
int ans=0;
if (x->lson!=NULL) ans+=count(x->lson,d+1);
if (x->rson!=NULL) ans+=count(x->rson,d+1);
return ans;
}
int main()
{
int C;
scanf("%d",&C);
while (C--)
{
int M;
scanf("%d",&M);
scanf("%s",ch);
int i,j;
memset(num,0,sizeof(num));
for (i=0;i<strlen(ch);i++) num[ch[i]+1]++;
for (i=1;i<256;i++)
for (j=i+1;j<=256;j++)
if (num[i]<num[j])
{
int t=num[i];
num[i]=num[j];
num[j]=t;
}
for (i=1;i<=256;i++)
if (num[i]==0)
{
N=i-1;
break;
}
// Only one distinct letter: one bit per occurrence, so the length is its count.
if (N==1)
{
if (num[1]>M) printf("no\n");
else printf("yes\n");
continue;
}
for (i=1;i<=N;i++)
{
haffman[i].d=num[i];
haffman[i].lson=haffman[i].rson=NULL;
heap[i].d=num[i];
heap[i].tar=i;
}
for (i=N;i>=1;i--) adjust_down(i);
int T=N;
for (i=1;i<T;i++)
{
node node1=heap[1];
heap[1]=heap[N];
N--;
adjust_down(1);
node node2=heap[1];
haffman[T+i].d=node1.d+node2.d;
haffman[T+i].lson=&haffman[node1.tar];
haffman[T+i].rson=&haffman[node2.tar];
heap[1].d=haffman[T+i].d;
heap[1].tar=T+i;
adjust_down(1);
}
int Min=count(&haffman[2*T-1],0);
if (Min>M) printf("no\n");
else printf("yes\n");
}
return 0;
}