zoukankan html css js c++ java

clojure 数据结构与算法第一篇 lz77算法的实现

LZ77压缩算法

本系列文章都来自于[Clojure Data Structures and Algorithms Cookbook]一书，我仅仅是照着树上的代码实现一下，做一下笔记。如有疑问请先阅读该书，然后我们可以一起讨论。

LZ77算法：在经过一个元素序列时，从即将出现（经过）的元素中，找出和过去（已经经过）的元素相同的元素序列，然后将该序列替换为一对数值：distance和length。其中，distance是指从当前元素开始，要找到之前出现的序列的起点，需要经过的距离。Length是指重复的序列长度是多少。
概念：
Input stream:字节输入流
Byte:输入流中基本的数据元素
Coding position:将要编码的下一个元素，即当前位置
Look ahead buffer: 从当前位置到输入流末尾的所有元素（但是实际搜索时，仅仅搜索window长度的字节数）
Window:用于后续元素编码的序列

LZ77的压缩过程：

1，在任何时刻，程序都是在处理一个特定的元素，这位于当前位置。考虑一个大小为n的窗格，在当前位置之前已经有了n个元素，然后将当前位置直到序列最后，就是程序接下来要处理的元素。
lz77

2，从字节序列的第一个元素开始（首先肯定是先填充好window，也就是说序列最开头window长度的字节序列是不会被编码的，因为不知道参考谁），直到填满窗口。
3，继续处理后面一个元素，同时window跟着向前走
4，从window中找到从当前位置起，相匹配的最长的字节序列
5，如果找到了，将其替换为之前说的距离-长度值对，然后向前移动匹配长度的距离，重复第四步。
6, 如果之前没有找到，继续第三步。

以下是代码：

(ns clojure-data-structures-and-argorithms.core
  (:require [clojure.set :as cset]))

;;首先是压缩过程

;;define testcase
(def testcase ["a" "b" "c" "f" "a" "d" "a" "b" "c" "e"])


(defn all-subvecs-from-beginning
  ;;取得从起点开始，向量v所有的前缀向量
  ;;如'abc'的所有前缀向量就是['a','ab','abc']
  [v]
  (set (map #(subvec v 0 %)
            (range 1 (inc (count v))))))
;;test
(all-subvecs-from-beginning testcase)
(all-subvecs-from-beginning ["a" "b" "c"])


(defn all-subvecs
  ;;取得向量v的所有后缀向量的所有前缀向量，就是取得v的所有不重复子序列，用于搜索匹配用
  [v]
  (loop [result #{}
         remaining v]
    (if (seq remaining)
      (recur (into result (all-subvecs-from-beginning remaining));;将当前向量的所有前缀向量存入结果
             (into [] (rest remaining)));;继续处理去掉第一个元素的向量，直到目标向量里没有元素
      result)))

;;定义函数：从look-ahead-buffer的所有匹配search-buffer的最长前缀序列
(defn longest-match-w-beginning
  ;;从当前的search-buffer中找出最长的匹配序列
  ;;search-buffer: 编码用的字典
  ;;look-ahead-buffer: 搜索用的序列，即需要编码的序列
  [search-buffer look-ahead-buffer]
  (let [all-left-chunks (all-subvecs search-buffer);;从当前的search-buffer找到所有子序列
        all-right-chunks-from-beginning (all-subvecs-from-beginning look-ahead-buffer);;从当前待编码区找到所有的前缀子序列（因为编码总是从当前位置开始，因此，总是得包含当前元素为第一个元素，因此，这里是所有的前缀序列
        all-matches (cset/intersection all-right-chunks-from-beginning
                                       all-left-chunks)];;最后，将这两个集合取交集，然后从里面找到最长的就是我们想要的最长匹配序列
    (->> all-matches
         (sort-by count >)
         first)))
    
;;test
(longest-match-w-beginning ["a" "b" "c" "d"]
                           ["b" "c" "d" "a"])

(defn pos-of-subvec
  ;;求出子序列sv在原序列中的位置
  [sv v]
  {:pre [(<= (count sv)
             (count v))]}
  (loop
      [cursor 0];;从序列v的起始位置0开始
    (if (or (empty? v)
            (empty? sv)
            (= cursor (count v)))
      (do  #_(prn "here?") nil)
      (if (= (subvec v cursor (+ (count sv) cursor))
             sv);;根据当前位置比较sv和同长度的v的子序列
        cursor
        (recur (inc cursor))))))

;;test
(pos-of-subvec ["b" "c"] ["a" "c" "b" "c" "d"])

(defn LZ77-STEP
  ;;
  [window look-ahead]
  (let [longest (longest-match-w-beginning window look-ahead)]
    (if-let [pos-subv-w (pos-of-subvec longest window)]
      (let [distance (- (count window) pos-subv-w);;可以想象一下，窗口的右边界到子序列的开始位置的距离就是distance
            pos-subv-l (pos-of-subvec longest look-ahead)
            the-char (first (subvec look-ahead
                                    (+ pos-subv-l
                                       (count longest))))];;look-ahead中已匹配序列之后的第一个字符，也就是下一步将要匹配的字符
        {:distance distance
         :length (count longest)
         :char the-char})
      {:distance 0
       :length 0
       :char (first look-ahead)})))

(defn LZ77
  [bytes-array window-size]
  (->> (loop [result []
              cursor 0
              window []
              look-ahead bytes-array]
         (if (empty? look-ahead)
           result
           (let [this-step-output (LZ77-STEP window look-ahead)
                 distance (:distance this-step-output)
                 length (:length this-step-output)
                 literal (:char this-step-output)
                 raw-new-cursor (+ cursor
                                   length
                                   1)
                 new-cursor (min raw-new-cursor
                                 (count bytes-array))
                 new-window (subvec bytes-array
                                    (max 0 (inc (- new-cursor
                                                   window-size)))
                                    new-cursor)
                 new-look-ahead (subvec bytes-array
                                        new-cursor)]
             (recur (conj result
                          [distance length]
                          literal)
                    new-cursor
                    new-window
                    new-look-ahead))))
       (filter (partial
                not=
                [0 0]))
       (filter (comp
                not
                nil?))
       (into [])))
(LZ77 testcase 5)

查看全文

相关阅读:
leetcode_09_Palindrome Number (easy)
JQuery与JavaScript与Ajax三者的区别与联系
 Blockly常用函数
 JavaWeb中四大域对象
 Thymeleaf知识
 列表、字典、元组、集合的区别
 Spring Boot拦截器
 AWS路由表
 Android调试工具DDMS的使用详解
 Android 列表单选对话框

原文地址：https://www.cnblogs.com/xiaojintao/p/6352085.html

clojure 数据结构与算法 第一篇 lz77算法的实现

LZ77压缩算法

LZ77的压缩过程：

clojure 数据结构与算法第一篇 lz77算法的实现