  • 58.com (58同城): understanding its font-based anti-scraping, and how it differs from Maoyan

    import requests
    import re
    import base64
    import io
    from lxml import etree
    from fontTools.ttLib import TTFont
    
    url = 'https://gz.58.com/zufang/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d100000-0000-31f5-5967-5384271a3920&ClickID=2'
    headers = {
        'User-Agent':'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
    }
    response = requests.get(url=url,headers=headers)
    # Extract the base64-encoded font embedded in the page
    base64_str = re.search(r"base64,(.*?)'\)", response.text).group(1)
    b = base64.b64decode(base64_str)
    font = TTFont(io.BytesIO(b))
    bestcmap = font['cmap'].getBestCmap()
    newmap = dict()
    for key in bestcmap.keys():
        # The glyph name contains a number; the real digit is that number minus 1
        value = int(re.search(r'(\d+)', bestcmap[key]).group(1)) - 1
        key = hex(key)
        newmap[key] = value
        
        
    # Replace the custom-font entities in the page with the digits they stand for
    response_ = response.text
    for key,value in newmap.items():
        key_ = key.replace('0x','&#x') + ';'
        if key_ in response_:
            response_ = response_.replace(key_,str(value))
    
    
    rec = etree.HTML(response_)
    lis = rec.xpath('//ul[@class="house-list"]/li')
    for li in lis:
        # xpath() returns a list; skip listings that have no price element
        money = li.xpath('.//div[@class="money"]/b/text()')
        if money:
            print(money[0])

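    Unlike Maoyan, which hides the encoding in the font's glyf table, 58.com puts it in the cmap table: each obfuscated code point maps to a glyph name, and the number in that name, minus 1, is the real digit.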
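
    The cmap-based decoding can be tried offline with a minimal sketch like the one below. The names build_digit_map, decode_entities, sample_cmap and sample_html are hypothetical, and the sample code points and glyph names are made up for illustration. They only assume the same shape as the dictionary returned by getBestCmap() in the script above (code point mapped to a glyph name containing a number).

```python
import re


def build_digit_map(best_cmap):
    """Map '&#x....;' entities to digits.

    Assumes each glyph name contains a number equal to the real digit + 1,
    as in the script above; names without a number are skipped.
    """
    mapping = {}
    for code_point, glyph_name in best_cmap.items():
        match = re.search(r'(\d+)', glyph_name)
        if match is None:
            continue
        entity = '&#x{:x};'.format(code_point)
        mapping[entity] = str(int(match.group(1)) - 1)
    return mapping


def decode_entities(html, mapping):
    """Replace every obfuscated entity in html with its digit."""
    for entity, digit in mapping.items():
        html = html.replace(entity, digit)
    return html


# Made-up sample data standing in for font['cmap'].getBestCmap()
sample_cmap = {0x9476: 'glyph00003', 0x958f: 'glyph00001', 0x993c: 'glyph00006'}
sample_html = '<b>&#x9476;&#x958f;&#x958f;&#x993c;</b>'

decoded = decode_entities(sample_html, build_digit_map(sample_cmap))
print(decoded)  # → <b>2005</b>
```

    Splitting the mapping step from the replacement step lets the decoding be checked without requesting the page; with the real font, the result of getBestCmap() would be passed in place of sample_cmap.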

    https://www.cnblogs.com/eastonliu/p/9925652.html

  • Original article: https://www.cnblogs.com/zengxm/p/11107972.html