'real'词频分析

zoukankan html css js c++ java

'real'词频分析
写下来想法来自于无聊时写的代码.https://cryptopals.com/sets/1/challenges/3

The hex encoded string:

1b37373331363f78151b7f2b783431333d78397828372d363c78373e783a393b3736
... has been XOR'd against a single character. Find the key, decrypt the message.

You can do this by hand. But don't: write code to do it for you.

How? Devise some method for "scoring" a piece of English plaintext. Character frequency is a good metric. Evaluate each output and choose the one with the best score.

单字节密码加密

我们知道有一种简单的加密.真的很简单.

我给出的方案是这样定义的:

[Scheme := <Gen,Enc,Dec> ]
[Gen : key= random([0,1,..255]) ]
[Enc : ciphertext = xor(plaintext,key*plaintext.len) ]
[Dec: plaintext = xor(ciphertext,key*ciphertext.len) ]
攻击方案

很简单,搜索一下密钥空间,看看哪些有意义.就可以解决了.但是不够高效.指的是我们还需要看256次,如果要解密的很多,会很麻烦.

过滤

最简单的就是过滤非法字符,白名单可以是[0-9a-zA-Z'! ?s],这样可以过滤一些不合法的结果,直接减少了工作量.但是遇到大量混淆过的问题---指定是很多虽然字符合法但是不是英文语言的情况,却又无能为力了.

频率分析

我记得我上密码学的时候,老师讲过一种词频分析的攻击方法.这里的方法的含义就是这样.对每个可能的字符串进行合法判断,给出一个可能是英文句子的分数.然后每次只要最大的分数.再加上之前的白名单过滤.我们就可以快速判断大量数据中的可读的字符串.

实现

获得一份英文单词频率分布表.({x_a,x_b,..,x_z,x_{space}}),对于需要判定的串(w = w_1...w_n),最初始让(score = 0),分数为0.

对串中每个字符,如果(w_i)在分布表中,那么就(score = score + x_{w_i}),如果不在分布表中(score = score - 0.05).

代码如下:
```
def frequencyScore(b:bytes)->int:
    '''
    frequency data: https://web.archive.org/web/20170918020907/http://www.data-compression.com/english.html
    return a frequencyScore for a bytes string.
    '''
    score = 0
    letter_freqs = {
            'a': 0.0651738, 'b': 0.0124248, 'c': 0.0217339,
            'd': 0.0349835, 'e': 0.1041442, 'f': 0.0197881,
            'g': 0.0158610, 'h': 0.0492888, 'i': 0.0558094,
            'j': 0.0009033, 'k': 0.0050529, 'l': 0.0331490,
            'm': 0.0202124, 'n': 0.0564513, 'o': 0.0596302,
            'p': 0.0137645, 'q': 0.0008606, 'r': 0.0497563,
            's': 0.0515760, 't': 0.0729357, 'u': 0.0225134,
            'v': 0.0082903, 'w': 0.0171272, 'x': 0.0013692,
            'y': 0.0145984, 'z': 0.0007836, ' ': 0.1918182}

    for ele in b:
        if (ele>ord('a') and ele<ord('z')) or (ele>ord('A') and ele<ord('Z')) or ele==ord(' '):
            score += letter_freqs[chr(ele).lower()]
        else:
            score -= 0.05
    return score
```
这是完整的代码:https://github.com/zhuowangy2k/code-snippet/blob/master/frequency-single-xor/frequency-single-xor.py
查看全文

相关阅读:
Python学习札记(十五) 高级特性1 切片
 LeetCode Longest Substring Without Repeating Characters
Python学习札记(十四) Function4 递归函数 & Hanoi Tower
single number和变体
 tusen 刷题
 实验室网站
 leetcode 76. Minimum Window Substring
leetcode 4. Median of Two Sorted Arrays
leetcode 200. Number of Islands 、694 Number of Distinct Islands 、695. Max Area of Island 、130. Surrounded Regions 、434. Number of Islands II(lintcode) 并查集、178. Graph Valid Tree(lintcode)
刷题注意事项

原文地址：https://www.cnblogs.com/zhuowangy2k/p/10862814.html

'real'词频分析

单字节密码加密

攻击方案

过滤

频率分析

实现