
How does a computer tell gibberish from English?

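Once a long string has been encrypted it turns into gibberish: its literal meaning can no longer be read, so it has to be decrypted correctly.

But after decrypting, how does the computer know whether the decryption succeeded?

The isEnglish() function splits the decrypted string into words and checks whether each one appears in a file containing many thousands of English words, although that file cannot claim to contain every word.

If a sufficient share of the words are English words, we can say with high probability that the text is English. In other words, we can be confident that we found the correct key and that the decryption succeeded.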

The module below defines one way for a computer to tell gibberish from English:

# Detect English module
# http://inventwithpython.com/hacking (BSD Licensed)

# To use, type this code:
#   import detectEnglish
#   detectEnglish.isEnglish(someString) # returns True or False
# (There must be a "dictionary.txt" file in this directory with all English
# words in it, one word per line. You can download this from
# http://invpy.com/dictionary.txt)
# Good habit: name constants in uppercase. ' \t\n' contains a space and two
# escape characters: tab (\t) and newline (\n).
UPPERLETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
LETTERS_AND_SPACE = UPPERLETTERS + UPPERLETTERS.lower() + ' \t\n'

def loadDictionary():
    dictionaryFile = open('dictionary.txt')
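    # englishWords = {} creates an empty dictionary
    englishWords = {}
    # Make every word a key of this dictionary, each with the value None.
    # dictionaryFile.read() returns the whole file as one big string, and
    # split('\n') breaks that string into a list of words
    # (the file holds exactly one word per line).
    for word in dictionaryFile.read().split('\n'):
        # We don't care what value is stored under each key, so we use None
        # (the only value of type NoneType) as a placeholder meaning "no value".
        englishWords[word] = None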
    dictionaryFile.close()
    return englishWords

# Load the dictionary in the global scope of detectEnglish, so that any Python
# program that runs "import detectEnglish" can see and use it.
ENGLISH_WORDS = loadDictionary()

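# Takes a string and returns a float between 0.0 (no words recognized) and
# 1.0 (every word recognized): the fraction of words found in the dictionary.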
def getEnglishCount(message):
    message = message.upper()
    message = removeNonLetters(message)
    possibleWords = message.split()

    # message may contain no letters at all, e.g. '1234568'. removeNonLetters()
    # then returns an empty string, which split() turns into an empty list.
    # Return early in that case.
    if possibleWords == []:
        return 0.0 # no words at all, so return 0.0

    matches = 0
    for word in possibleWords:
        if word in ENGLISH_WORDS:
            matches += 1
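    # Division in Python must avoid divide-by-zero errors. That cannot happen
    # here: if possibleWords were empty we would already have returned above.
    # An early return is one way to guard against dividing by zero.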
    return float(matches) / len(possibleWords)
    
# Remove symbols and digits (any character not in LETTERS_AND_SPACE)
def removeNonLetters(message):
    lettersOnly = []
    for symbol in message:
        if symbol in LETTERS_AND_SPACE:
            lettersOnly.append(symbol)
    return ''.join(lettersOnly)

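# A message is judged to be English when the share of dictionary words and the
# share of letters and spaces both reach their thresholds.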
def isEnglish(message, wordPercentage=20, letterPercentage=85):
    # By default, 20% of the words must exist in the dictionary file, and
    # 85% of all the characters in the message must be letters or spaces
    # (not punctuation or numbers).
    if len(message) == 0:
        return False # an empty message would cause a division by zero below
    wordsMatch = getEnglishCount(message) * 100 >= wordPercentage
    numLetters = len(removeNonLetters(message))
    messageLettersPercentage = float(numLetters) / len(message) * 100
    lettersMatch = messageLettersPercentage >= letterPercentage
    return wordsMatch and lettersMatch
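
The two measurements that isEnglish() compares against its thresholds can be looked at in isolation. The following is only a minimal sketch: it assumes a tiny hand-picked word set (SAMPLE_WORDS) in place of dictionary.txt, and the names score and remove_non_letters are illustrative rather than part of the module above.

```python
# Hypothetical stand-in for dictionary.txt: a handful of words, for illustration only.
SAMPLE_WORDS = {'IS', 'THIS', 'SENTENCE', 'ENGLISH', 'TEXT'}
UPPERLETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
LETTERS_AND_SPACE = UPPERLETTERS + UPPERLETTERS.lower() + ' \t\n'

def remove_non_letters(message):
    # Keep only letters, spaces, tabs and newlines.
    return ''.join(ch for ch in message if ch in LETTERS_AND_SPACE)

def score(message):
    """Return (word_percentage, letter_percentage) for a non-empty message."""
    words = remove_non_letters(message.upper()).split()
    matches = sum(1 for word in words if word in SAMPLE_WORDS)
    word_pct = 100.0 * matches / len(words) if words else 0.0
    letter_pct = 100.0 * len(remove_non_letters(message)) / len(message)
    return word_pct, letter_pct

for text in ('is this sentence English text?',   # plain English
             'lv wklv vhqwhqfh Hqjolvk whaw?'):  # the same text, still encrypted
    word_pct, letter_pct = score(text)
    print(text, word_pct, round(letter_pct, 2))
```

Both strings consist of roughly 97% letters and spaces, so the letter threshold alone cannot separate them. Only the dictionary lookup does: every word of the first string is found, and none of the second.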

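This code is likely to be useful to our other hacking programs, so it is packaged as a separate module that any program needing isEnglish() can import.

It is used like this: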

>>> import detectEnglish
>>> detectEnglish.isEnglish('is this sentence English text?')
True
>>>
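
As an illustration of how a hacking program might rely on such a check, here is a rough sketch that tries every key of a simple Caesar shift and stops at the first candidate that looks like English. Everything in it is an assumption made for the example: the Caesar cipher itself, the small SAMPLE_WORDS set standing in for dictionary.txt, and the helper names looks_english, caesar_decrypt and brute_force. A real program would import detectEnglish and call detectEnglish.isEnglish() instead.

```python
ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
# Hypothetical stand-in for dictionary.txt, for illustration only.
SAMPLE_WORDS = {'ATTACK', 'AT', 'DAWN'}

def looks_english(message, word_percentage=20):
    # Simplified check: only the share of dictionary words is tested.
    cleaned = ''.join(ch for ch in message.upper() if ch in ALPHABET + ' \t\n')
    words = cleaned.split()
    if not words:
        return False  # nothing to test, and avoids dividing by zero
    matches = sum(1 for word in words if word in SAMPLE_WORDS)
    return 100.0 * matches / len(words) >= word_percentage

def caesar_decrypt(text, key):
    # Shift uppercase letters back by `key` positions; leave other characters alone.
    result = []
    for ch in text:
        if ch in ALPHABET:
            result.append(ALPHABET[(ALPHABET.index(ch) - key) % 26])
        else:
            result.append(ch)
    return ''.join(result)

def brute_force(ciphertext):
    # Try all 26 keys and return the first (key, plaintext) that looks like English.
    for key in range(26):
        candidate = caesar_decrypt(ciphertext, key)
        if looks_english(candidate):
            return key, candidate
    return None  # no key produced English-looking text

print(brute_force('DWWDFN DW GDZQ'))  # (3, 'ATTACK AT DAWN')
```

The hacking program never needs a human to read the candidates: the English check decides on its own which key to report.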


Original article: https://www.cnblogs.com/PiaYie/p/13472365.html