  • Data Compression Algorithms: An Implementation of Huffman Coding

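    Huffman coding compresses data very effectively: savings of 20% to 90% are typical (Introduction to Algorithms). The exact ratio depends on the characteristics of the data, namely the symbol frequencies. If the code is built from a standard corpus, the results are generally satisfactory (a compromise that yields different compression ratios for different files).

    This article demonstrates the algorithm by encoding one file on its own.

    The process consists of the following steps:

    1. Count the character frequencies

    2. Convert the frequency data into a type the Huffman algorithm can work with (here HuffmanNode, which stores the frequency together with the tree-node links)

      (1) Build a min-priority queue from the input HuffmanNode[] array

      (2) Repeatedly take the two lowest-frequency nodes out of the queue, build a new node from them, and insert that node back into the queue, until only one node is left in the queue.

        That node is the root of the code tree.

      (3) Walk through each of the original input Huffman nodes to obtain the code for each character (used for compression).

      (4) To decode, feed the 0/1 characters to the algorithm in order; it walks the code tree, and whenever it returns a character, that character is the decoded output.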
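    Steps (1) to (3) can be sketched in a compact, self-contained form. This is only an illustration, not the article's implementation: it assumes .NET 6 or later for the built-in PriorityQueue<TElement, TPriority>, it uses a hypothetical Node class in place of HuffmanNode, and it derives the codes by walking the tree from the root down instead of following parent links.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the article's HuffmanNode, reduced to what this sketch needs.
sealed class Node
{
    public char? Symbol;      // null for internal nodes
    public int Weight;        // frequency, or the sum of both children's weights
    public Node Left, Right;
}

static class HuffmanSketch
{
    // Steps (1)-(2): fill a min-priority queue, then merge the two lightest
    // nodes until a single node (the root) remains.
    public static Node BuildTree(string text)
    {
        var queue = new PriorityQueue<Node, int>();
        foreach (var group in text.GroupBy(c => c))
        {
            int count = group.Count();
            queue.Enqueue(new Node { Symbol = group.Key, Weight = count }, count);
        }
        while (queue.Count > 1)
        {
            Node left = queue.Dequeue();
            Node right = queue.Dequeue();
            var parent = new Node { Weight = left.Weight + right.Weight, Left = left, Right = right };
            queue.Enqueue(parent, parent.Weight);
        }
        return queue.Count == 1 ? queue.Dequeue() : null;
    }

    // Step (3): a left edge contributes '0', a right edge contributes '1'.
    public static Dictionary<char, string> BuildCodes(Node root)
    {
        var codes = new Dictionary<char, string>();
        Walk(root, "", codes);
        return codes;
    }

    private static void Walk(Node node, string prefix, Dictionary<char, string> codes)
    {
        if (node == null) return;
        if (node.Symbol.HasValue)
        {
            // A tree with a single leaf would yield an empty code, so fall back to "0".
            codes[node.Symbol.Value] = prefix.Length > 0 ? prefix : "0";
            return;
        }
        Walk(node.Left, prefix + "0", codes);
        Walk(node.Right, prefix + "1", codes);
    }

    static void Main()
    {
        foreach (var pair in BuildCodes(BuildTree("aaaabbc")))
            Console.WriteLine($"{pair.Key} -> {pair.Value}");
    }
}
```

    For the input "aaaabbc" the most frequent character 'a' receives a 1-bit code while 'b' and 'c' receive 2-bit codes, so the seven characters need 10 bits in total.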

    Implementation of the frequency count:

    using System.Collections.Generic;
    using System.Linq;

    public class FrequencyCounter
    {
        // Returns one (character, count) pair per distinct character, ordered by ascending count.
        public IEnumerable<KeyValuePair<char, int>> MapReduce(string str)
        {
            // The GroupBy method acts as the map,
            // while the Select method reduces the intermediate results into the final list of results.
            var charOccurrences = str
                .GroupBy(c => c)
                .Select(intermediate => new
                    {
                        Key = intermediate.Key,
                        Value = intermediate.Count()
                    })
                .OrderBy(kvp => kvp.Value);
            // The query is lazy: it runs each time the returned sequence is enumerated.
            return charOccurrences.Select(co => new KeyValuePair<char, int>(co.Key, co.Value));
        }
    }
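    As a point of comparison, the same counts can be collected in a single pass with a plain dictionary. This is a minimal sketch; CharFrequency and CountChars are names made up for this example and are not part of the article's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CharFrequency
{
    // Counts every character in one pass, then orders the pairs by ascending count.
    public static List<KeyValuePair<char, int>> CountChars(string text)
    {
        var counts = new Dictionary<char, int>();
        foreach (char c in text)
        {
            counts.TryGetValue(c, out int seen); // seen stays 0 for a character not met before
            counts[c] = seen + 1;
        }
        return counts.OrderBy(kvp => kvp.Value).ToList();
    }

    static void Main()
    {
        foreach (var kvp in CountChars("abracadabra"))
            Console.WriteLine($"'{kvp.Key}': {kvp.Value}");
    }
}
```

    For "abracadabra" this reports 'a' five times, 'b' and 'r' twice each, and 'c' and 'd' once each. The returned list has the same element type as the sequence produced by MapReduce.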

    Implementation of the Huffman coding class:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;

    public class Huffman
    {
        private List<HuffmanNode> originalNodes;
        private HuffmanNode rootNode;

        public Huffman(IEnumerable<KeyValuePair<char, int>> kvps)
        {
            // Keep the original leaf nodes and feed them into a min-priority queue
            MinPriorityQueue<HuffmanNode> minQueue = new MinPriorityQueue<HuffmanNode>();
            originalNodes = new List<HuffmanNode>();
            foreach (var kvp in kvps)
            {
                HuffmanNode node = new HuffmanNode(kvp.Key, kvp.Value);
                originalNodes.Add(node);
                minQueue.Insert(node);
            }
            // Build the code tree and record its root
            while (!minQueue.IsEmpty)
            {
                HuffmanNode left = minQueue.ExtractMin();
                if (minQueue.IsEmpty)
                {
                    if (left.Key.HasValue)
                    {
                        // Only one distinct character in the input: give the lone leaf
                        // a parent so that its code is "0" rather than an empty string.
                        rootNode = new HuffmanNode(null, left.Value, left);
                        left.Parent = rootNode;
                    }
                    else
                    {
                        rootNode = left;
                    }
                    break;
                }
                HuffmanNode right = minQueue.ExtractMin();
                HuffmanNode newNode = new HuffmanNode(null, left.Value + right.Value, left, right);
                left.Parent = newNode;
                right.Parent = newNode;
                minQueue.Insert(newNode);
            }
        }

        // Encodes a single char; returns null if the char is not in the tree
        public string Encode(char sourceChar)
        {
            HuffmanNode hn = originalNodes.FirstOrDefault(n => n.Key == sourceChar);
            if (hn == null) return null;
            HuffmanNode parent = hn.Parent;
            StringBuilder rtn = new StringBuilder();
            // Walk from the leaf up to the root, prepending one bit per edge
            while (parent != null)
            {
                // A left child contributes 0, a right child contributes 1
                rtn.Insert(0, Object.ReferenceEquals(parent.Left, hn) ? '0' : '1');
                hn = parent;
                parent = parent.Parent;
            }
            return rtn.ToString();
        }

        // Decodes a single character; returns false if string01 is not a complete code
        public bool Decode(string string01, out char? output)
        {
            HuffmanNode tmpNode = rootNode;
            foreach (char c in string01.Trim())
            {
                if (tmpNode == null) break; // walked off the tree: not a valid code
                if (c == '0') tmpNode = tmpNode.Left;
                else if (c == '1') tmpNode = tmpNode.Right;
            }
            if (tmpNode != null && tmpNode.Left == null && tmpNode.Right == null)
            {
                output = tmpNode.Key;
                return true;
            }
            else
            {
                output = null;
                return false;
            }
        }

        class HuffmanNode : IHeapValue
        {
            public HuffmanNode(char? key, int value, HuffmanNode left = null, HuffmanNode right = null)
            {
                this.Left = left;
                this.Right = right;
                this.Key = key;
                this.Value = value;
            }
            public HuffmanNode Left { get; private set; }
            public HuffmanNode Right { get; private set; }
            public HuffmanNode Parent { get; set; }
            public char? Key { get; private set; }   // null for internal nodes
            public int Value { get; set; }           // frequency
        }
    }
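    The Huffman class relies on MinPriorityQueue<T> and IHeapValue, neither of which is shown in this article. The following is a minimal sketch of what they might look like, inferred only from how they are used (Insert, ExtractMin, IsEmpty, and an int Value on each element); the binary-heap layout and the Weighted demo type are assumptions, and the author's actual implementation may differ.

```csharp
using System;
using System.Collections.Generic;

// Assumed shape: the article only shows that HuffmanNode implements IHeapValue
// and exposes an int Value.
public interface IHeapValue
{
    int Value { get; set; }
}

// Minimal binary min-heap ordered by IHeapValue.Value.
public class MinPriorityQueue<T> where T : IHeapValue
{
    private readonly List<T> items = new List<T>();

    public bool IsEmpty => items.Count == 0;

    public void Insert(T item)
    {
        items.Add(item);
        int i = items.Count - 1;
        // Sift up: swap with the parent while the parent is larger
        while (i > 0)
        {
            int parent = (i - 1) / 2;
            if (items[parent].Value <= items[i].Value) break;
            Swap(i, parent);
            i = parent;
        }
    }

    public T ExtractMin()
    {
        if (items.Count == 0) throw new InvalidOperationException("The queue is empty.");
        T min = items[0];
        int last = items.Count - 1;
        items[0] = items[last];
        items.RemoveAt(last);
        // Sift down: swap with the smaller child while a child is smaller
        int i = 0;
        while (true)
        {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < items.Count && items[left].Value < items[smallest].Value) smallest = left;
            if (right < items.Count && items[right].Value < items[smallest].Value) smallest = right;
            if (smallest == i) break;
            Swap(i, smallest);
            i = smallest;
        }
        return min;
    }

    private void Swap(int a, int b)
    {
        T tmp = items[a];
        items[a] = items[b];
        items[b] = tmp;
    }
}

// Hypothetical element type used only to exercise the queue.
public class Weighted : IHeapValue
{
    public int Value { get; set; }
}

public static class QueueDemo
{
    static void Main()
    {
        var queue = new MinPriorityQueue<Weighted>();
        foreach (int v in new[] { 5, 1, 3 })
            queue.Insert(new Weighted { Value = v });
        while (!queue.IsEmpty)
            Console.WriteLine(queue.ExtractMin().Value);
    }
}
```

    Insert and ExtractMin each take O(log n) time, so building the code tree for n distinct characters takes O(n log n) overall.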

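    Encoding a piece of text:

    string ori = "your text here";   // placeholder: the text to compress
    FrequencyCounter fc = new FrequencyCounter();
    var kvps = fc.MapReduce(ori);
    Huffman hm = new Huffman(kvps);  // keep hm: the same tree is needed for decoding
    StringBuilder sb = new StringBuilder();
    char[] chararray = ori.ToCharArray();
    for (int i = 0; i < chararray.Length; i++)
    {
        sb.Append(hm.Encode(chararray[i]));
    }
    string encoded = sb.ToString();  // a string made up of '0' and '1' characters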

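    Decoding the encoded text:

    string bstr = "your encoded text";   // placeholder: the '0'/'1' string produced by encoding
    StringBuilder sb = new StringBuilder();
    char? outchar = null;
    string tmpStr = "";
    for (int i = 0; i < bstr.Length; i++)
    {
        tmpStr = tmpStr + bstr[i];
        // Huffman codes are prefix-free, so the first prefix that decodes is a complete code
        if (hm.Decode(tmpStr, out outchar))
        {
            tmpStr = "";
            sb.Append(outchar.Value);
        }
    }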

    In testing, the compression effect is clearly noticeable.
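    Note that Encode returns a string of '0' and '1' characters, so the size reduction only materializes once those characters are packed into real bits. A minimal sketch of that packing step follows; BitPacking, Pack and Unpack are hypothetical helpers that are not part of the article's code, and the most-significant-bit-first layout is an arbitrary choice.

```csharp
using System;
using System.Text;

static class BitPacking
{
    // Packs a string of '0'/'1' characters into bytes, most significant bit first.
    // The last byte is padded with zero bits when the length is not a multiple of 8.
    public static byte[] Pack(string bits)
    {
        var bytes = new byte[(bits.Length + 7) / 8];
        for (int i = 0; i < bits.Length; i++)
        {
            if (bits[i] == '1')
                bytes[i / 8] |= (byte)(0x80 >> (i % 8));
            else if (bits[i] != '0')
                throw new ArgumentException("Only '0' and '1' are allowed.", nameof(bits));
        }
        return bytes;
    }

    // Reverses Pack. bitCount must be stored alongside the bytes,
    // because the padding bits cannot be told apart from data.
    public static string Unpack(byte[] bytes, int bitCount)
    {
        var sb = new StringBuilder(bitCount);
        for (int i = 0; i < bitCount; i++)
            sb.Append((bytes[i / 8] & (0x80 >> (i % 8))) != 0 ? '1' : '0');
        return sb.ToString();
    }

    static void Main()
    {
        string bits = "0110100001";
        byte[] packed = Pack(bits);
        Console.WriteLine($"{bits.Length} bits -> {packed.Length} bytes");
        Console.WriteLine(Unpack(packed, bits.Length) == bits);
    }
}
```

    Comparing packed.Length with the byte length of the original text gives the actual compression ratio. A complete file format would also have to store the frequency table (or the tree) and the bit count so that the decoder can rebuild the same codes.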

    That's all.

    Author: Andy Zeng

    Reposting in any form is welcome, but please be sure to credit the source.

    http://www.cnblogs.com/andyzeng/p/3703321.html

  • Original article: https://www.cnblogs.com/andyzeng/p/3703321.html