zoukankan      html  css  js  c++  java
  • python之验证码识别 特征向量提取和余弦相似性比较

    0.目录

    1.参考
    2.没事画个流程图
    3.完整代码
    4.改进方向

    1.参考

    https://en.wikipedia.org/wiki/Cosine_similarity

    https://zh.wikipedia.org/wiki/%E4%BD%99%E5%BC%A6%E7%9B%B8%E4%BC%BC%E6%80%A7

    Cosine similarity
    Given two vectors of attributes, A and B, the cosine similarity, cos(θ),
    is represented using a dot product and magnitude as...
    余弦相似性通过测量两个向量的夹角的余弦值来度量它们之间的相似性。0度角的余弦值是1,
    余弦相似度通常用于正空间,因此给出的值为0到1之间。
    范数(norm),是具有“长度”概念的函数。二维度的向量的欧氏范数就是箭号的长度。

    Python 破解验证码

    python3验证码机器学习

    这两篇文章在计算矢量大小的时候函数参数都写成 concordance调和, 而不用 coordinate坐标, 为何???

    欧氏距离和余弦相似度

    numpy中提供了范数的计算工具:linalg.norm()

    所以计算cosθ起来非常方便(假定A、B均为列向量):

    num = float(A.T * B) #若为行向量则 A * B.T
    denom = linalg.norm(A) * linalg.norm(B)
    cos = num / denom #余弦值

    2.没事画个流程图

    流程图 Graphviz - Graph Visualization Software

    3.完整代码

    #!/usr/bin/env python
    # -*- coding: UTF-8 -*
    import os
    import time
    import re
    from urlparse import urljoin
    
    import requests
    ss = requests.Session()
    ss.headers.update({'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'})
    
    from PIL import Image
    # https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000/001431918785710e86a1a120ce04925bae155012c7fc71e000
    # 和StringIO类似,可以用一个bytes初始化BytesIO,然后,像读文件一样读取:
    from io import BytesIO
    from string import ascii_letters, digits
    
    import numpy as np 
    
    # ip_port_type_tuple_list = []
      
    class Mimvp():
        def __init__(self, num_width=None, feature_vectors=None, white_before_black=2, threshhold=100, max_nums=None, filepath=None, page=None):
            self.ip_port_type_tuple_list = []
            
            #fluent p189
            if feature_vectors is None:
                self.feature_vectors = []
            else:
                self.feature_vectors = list(feature_vectors)
                 
            self.num_width = num_width
            self.white_before_black = white_before_black
            self.threshhold = threshhold
            self.max_nums = max_nums
    
            self.filepath = filepath
            
            if page is None:
                self.url = 'http://proxy.%s.com/free.php?proxy=in_hp'%'mimvp'
            else:
                self.url = 'http://proxy.%s.com/free.php?proxy=in_hp&sort=&page=%s' %('mimvp', page)   
    
        def get_mimvp(self):
    
            # 预处理提取特征组需要取得 self.port_src_list
            if self.feature_vectors == []:
                self.extract_features()
    
            self.load_mimvp()
            self.get_port_list()
            self.merge_result()
            return self.ip_port_type_tuple_list
            
        def load_mimvp(self):
            resp = ss.get(self.url)
            self.ip_list = re.findall(r"class='tbl-proxy-ip'.*?>(.*?)<", resp.text)
            self.port_src_list = re.findall(r"class='tbl-proxy-port'.*?src=(.*?)s*/>", resp.text)      #图片链接
            self.type_list = re.findall(r"class='tbl-proxy-type'.*?>(.*?)<", resp.text)
                     
        def get_port_list(self):
            self.port_list = []
            for src in self.port_src_list:
                port = self.get_port(src)
                self.port_list.append(port)
            
        def get_port(self, src):    
            img = self.load_image_from_src(src)
            split_imgs = self.split_image(img)
            
            port = ''
            for split_img in split_imgs:
                vector = self.build_vector(split_img)    
                compare_results = []
                for t in self.feature_vectors:
                    cos = self.cos_similarity(vector, t.values()[0])
                    compare_results.append((cos, t.keys()[0]))
                # print sorted(compare_results, reverse=True)
                port += sorted(compare_results, reverse=True)[0][1]
            print port
            return port
    
        def load_image_from_src(self, src):
            src = urljoin(self.url, src)
            print src,
            resp = ss.get(src)
    
            fp = BytesIO(resp.content)
            img = Image.open(fp)
            return img
            
        def split_image(self, img):
            gray = img.convert('L')
            
            if self.num_width is None:
                img.show()
                print gray.getcolors()
                self.num_width = int(raw_input('num_'))
                self.white_before_black = int(raw_input('white_before_black:'))
                self.threshhold = int(raw_input('BLACK < (threshhold) < WHITE:'))
                
            gray_array = np.array(gray)
            bilevel_array = np.where(gray_array<self.threshhold,1,0)  #标记黑点为1,方便后续扫描
            
            left_list = []
            # 从左到右按列求和
            vertical = bilevel_array.sum(0)
            # print vertical
            # 从左到右按列扫描,2白1黑确定为数字左边缘
            for i,c in enumerate(vertical[:-self.white_before_black]):
                if self.white_before_black == 1:
                    if vertical[i] == 0 and vertical[i+1] != 0:
                        left_list.append(i+1)
                else:                    
                    if vertical[i] == 0 and vertical[i+1] == 0 and vertical[i+2] != 0:
                        left_list.append(i+2)
                if len(left_list) == self.max_nums:
                    break
                    
            # 分割可见图片
            # bilevel = Image.fromarray(bilevel_array)    #0/1 手工提取特征 show显示黑块 还没保存gif     
            bilevel = Image.fromarray(np.where(gray_array<self.threshhold,0,255))   
            # the left, upper, right, and lower pixel
            split_imgs = [bilevel.crop((each_left, 0, each_left+self.num_width, img.height)) for each_left in left_list]
    
            return split_imgs
            
        def build_vector(self, img):
            # img = Image.open(img)
            img_array = np.array(img) 
            # 先遍历w,再遍历h,总共w+h维度,不需要/255,标记黑点个数等多余处理
            return list(img_array.sum(0)) + list(img_array.sum(1))
    
        def cos_similarity(self, a, b):
            A = np.array(a)
            B = np.array(b)    
            dot_product = float(np.dot(A, B))   # A*(B.T) 达不到目的
            magnitude_product = np.linalg.norm(A) * np.linalg.norm(B)
            cos = dot_product / magnitude_product
            return cos
            
        def merge_result(self):
            for ip, port, _type in zip(self.ip_list, self.port_list, self.type_list):
                if '/' in _type:
                    self.ip_port_type_tuple_list.append((ip, port, 'both'))        
                elif _type == 'HTTPS':
                    self.ip_port_type_tuple_list.append((ip, port, 'HTTPS'))
                else:
                    self.ip_port_type_tuple_list.append((ip, port, 'HTTP'))
      
        def extract_features(self):
            if self.filepath is not None:
                img_list = self.load_images_from_filepath()
            else:
                self.load_mimvp()
                img_list = self.load_images_from_src_list()
            for img in img_list:
                split_imgs = self.split_image(img)
                for split_img in split_imgs:
                    split_img.show()
                    print split_img.getcolors()
                    input = raw_input('input:')
                    vector = self.build_vector(split_img)
                    item = {input: vector}
                    if item not in self.feature_vectors:
                        print item
                        self.feature_vectors.append(item)
                
            for i in sorted(self.feature_vectors):        
                print i,','   
        
        def load_images_from_filepath(self):
            img_list = []
            postfix = ['jpg', 'png', 'gif', 'bmp']
            for filename in [i for i in os.listdir(self.filepath) if i[-3:] in postfix]:
                file = os.path.join(self.filepath, filename)
                img_list.append(Image.open(file))
            return img_list
            
        def load_images_from_src_list(self):
            img_list = []
            for src in self.port_src_list:
                img = self.load_image_from_src(src)
                img_list.append(img)
            return img_list
        
       
        
    if __name__ == '__main__':
       
        feature_vectors = [    
        {'0': [4845, 5865, 5865, 5865, 5865, 4845, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1020, 1020, 1020, 1020, 1020, 1020, 1020, 1020, 1020, 1020, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
        {'1': [5865, 5865, 3825, 6120, 6120, 6375, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1275, 1020, 1020, 1275, 1275, 1275, 1275, 1275, 1275, 255, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
        {'2': [5100, 5610, 5610, 5610, 5610, 5355, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 510, 1020, 1020, 1275, 1020, 1275, 1275, 1275, 1275, 0, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
        {'3': [5355, 5865, 5610, 5610, 5610, 4590, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 510, 1020, 1020, 1275, 765, 1275, 1275, 1020, 1020, 510, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
        {'4': [5610, 5865, 5865, 5865, 3825, 6120, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1275, 1020, 1020, 1020, 1020, 1020, 0, 1275, 1275, 1275, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
        {'5': [4845, 5610, 5610, 5610, 5610, 5100, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 0, 1275, 1275, 1275, 255, 1275, 1275, 1275, 1020, 510, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
        {'6': [4590, 5610, 5610, 5610, 5610, 5355, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 765, 1275, 1275, 1275, 255, 1020, 1020, 1020, 1020, 510, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
        {'7': [6120, 6120, 6120, 5100, 5355, 5610, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 0, 1275, 1275, 1275, 1275, 1275, 1275, 1275, 1275, 1275, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
        {'8': [4590, 5610, 5610, 5610, 5610, 4590, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 510, 1020, 1020, 1020, 510, 1020, 1020, 1020, 1020, 510, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
        {'9': [5610, 5610, 5610, 5610, 5610, 4590, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 510, 1020, 1020, 1020, 255, 1275, 1275, 1275, 1275, 765, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} , 
        ]   
        
        # def __init__(self, feature_vectors=None, filepath=None, page=None):
        
        obj = Mimvp(num_width=6, feature_vectors=feature_vectors)
        # obj = Mimvp()    
        # obj = Mimvp(filepath='temp/')   
        
        ip_port_type_tuple_list = obj.get_mimvp()
        
        from pprint import pprint
        pprint(ip_port_type_tuple_list)
              
    View Code

    4.改进方向

    记录每个分割数字的x轴实际长度,这样的话考虑到不对图片的上下留白做处理,每个实际数字的h固定,而w不定,因此建立特征向量的时候改为先遍历h,再遍历w。

    考虑到在比较余弦相似性的时候由于叉乘需要两个向量具有相同的维度,这里需要每次取最小维度再比较。

    如此,在建立特征向量集的时候不需要提前指定每张分割数字为固定宽度。

  • 相关阅读:
    HTML_严格模式与混杂模式
    不要和一种编程语言厮守终生:为工作正确选择(转)
    iOS开发编码建议与编程经验(转)
    UTF-8 和 GBK 的 NSString 相互转化的方法
    UICollectionView 总结
    UIViewController的生命周期及iOS程序执行顺序
    objective-c 中随机数的用法
    clipsToBounds 与 masksToBounds 的区别与联系
    网络请求 代码 系统自带类源码
    iOS CGRectGetMaxX/Y 使用
  • 原文地址:https://www.cnblogs.com/my8100/p/cosine_similarity.html
Copyright © 2011-2022 走看看