  • Using builtwith with Python 3 (a detailed walkthrough)

    1. First, install builtwith with pip install builtwith:

    C:\Users\Administrator>pip install builtwith
    Collecting builtwith  
      Downloading builtwith-1.3.2.tar.gz  
    Installing collected packages: builtwith  
      Running setup.py install for builtwith ... done  
    Successfully installed builtwith-1.3.2  

    2. Create a new project in PyCharm and enter the following test code:

    import builtwith  
    tech_used = builtwith.parse('http://www.baidu.com')  
    print(tech_used)  

    Running it produces the following error:

    C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
    Traceback (most recent call last):
      File "F:/python/first/FirstPy", line 1, in <module>
        import builtwith
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 43
        except Exception, e:
                        ^
    SyntaxError: invalid syntax

    Process finished with exit code 1

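    The cause is that builtwith was written for Python 2.x, so a few places need to be changed. Double-click the failing file in the PyCharm error output and edit it. There are three main kinds of change:
    1. The Python 2 form "except Exception, e" is no longer supported and must be changed to "except Exception as e".
    2. The expression after a Python 2 print statement must be wrapped in parentheses in Python 3.
    3. builtwith uses the Python 2 urllib2 package, which does not exist in Python 3, so the urllib2-related code must be rewritten.
    Changes 1 and 2 are straightforward; the rest of this section deals with change 3.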
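
    For changes 1 and 2, the Python 3 spelling looks like the minimal sketch below. The function safe_divide is a made-up example, not code from builtwith; it only illustrates the two syntax changes.

```python
def safe_divide(a, b):
    """Divide a by b, reporting any error instead of raising it."""
    try:
        return a / b
    except Exception as e:   # Python 2 spelling: except Exception, e:
        print("Error:", e)   # Python 2 spelling: print "Error:", e
        return None


print(safe_divide(6, 3))
print(safe_divide(1, 0))
```
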
    First, replace import urllib2 with the following:

     
    import urllib.request  
    import urllib.error  

    Then replace the urllib2 calls as follows:

    request = urllib.request.Request(url, None, {'User-Agent': user_agent})  
    response = urllib.request.urlopen(request)  
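
    Put together, the fetch step might look like the sketch below. The helper names build_request and fetch, the default user-agent string, and the timeout value are assumptions for illustration; builtwith's own code is organised differently.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="builtwith-example"):
    """Create a GET request that carries a custom User-Agent header."""
    return urllib.request.Request(url, None, {"User-Agent": user_agent})


def fetch(url, user_agent="builtwith-example", timeout=10):
    """Return the raw response body as bytes, or None if the request fails."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:  # also covers HTTPError, a subclass
        print("Error:", e)
        return None
```

    Note that fetch returns bytes, not str, which matters for the regular-expression matching that builtwith does on the page.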

    Running the project again produces the following error:

    C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
    Traceback (most recent call last):
      File "F:/python/first/FirstPy", line 3, in <module>
        builtwith.parse('http://www.baidu.com')
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 62, in builtwith
        if contains(html, snippet):
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 105, in contains
        return re.compile(regex.split('\;')[0], flags=re.IGNORECASE).search(v)
    TypeError: cannot use a string pattern on a bytes-like object

    Process finished with exit code 1

    This is because the data returned by urllib in Python 3 is bytes rather than str, so it has to be decoded. Change the following code:

    if html is None:  
        html = response.read()  

    to:

    if html is None:
        html = response.read()
        html = html.decode('utf-8')
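
    The TypeError and its fix can be reproduced without builtwith. The sketch below uses a made-up HTML snippet and a made-up pattern; it shows only that a str pattern cannot search bytes, and that decoding first resolves it.

```python
import re

# A str pattern, like the one compiled in the traceback.
pattern = re.compile("jquery", flags=re.IGNORECASE)

# response.read() returns bytes in Python 3; this sample stands in for a page.
raw = b"<script src='/js/jQuery.min.js'></script>"

try:
    pattern.search(raw)        # str pattern against bytes raises TypeError
    bytes_search_failed = False
except TypeError as e:
    print("Error:", e)
    bytes_search_failed = True

html = raw.decode("utf-8")     # bytes -> str
match = pattern.search(html)   # both sides are now str
print(match.group(0))
```
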

    Running again gives the final result:

    C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
    {'javascript-frameworks': ['jQuery']}

    Process finished with exit code 0

    However, if the site is changed to 'http://www.163.com', the run fails again:

    C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
    Error: 'utf-8' codec can't decode byte 0xcd in position 500: invalid continuation byte
    Traceback (most recent call last):
      File "F:/python/first/FirstPy", line 2, in <module>
        tech_used = builtwith.parse('http://www.163.com')
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 63, in builtwith
        if contains(html, snippet):
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 106, in contains
        return re.compile(regex.split('\;')[0], flags=re.IGNORECASE).search(v)
    TypeError: cannot use a string pattern on a bytes-like object

    Process finished with exit code 1

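    The first line of the output shows the UTF-8 decode failing, so this still looks like an encoding problem. With the encoding set to 'GBK', the run succeeds: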

    C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
    {'web-servers': ['Nginx']}

    Process finished with exit code 0
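
    The same decode failure can be reproduced in isolation. The sketch below assumes the page is GBK-encoded and uses the sample string "网易" as stand-in content; it is not taken from the actual page.

```python
# Encode a sample string as GBK; its first byte is 0xcd, as in the error message.
raw = "网易".encode("gbk")

try:
    raw.decode("utf-8")        # GBK bytes are not valid UTF-8 here
    utf8_ok = True
except UnicodeDecodeError as e:
    print("Error:", e)
    utf8_ok = False

text = raw.decode("gbk")       # decoding with the right codec succeeds
print(text)
```
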

    So do different sites need different decodings? The following describes one way to detect a site's encoding.
    We need to install a package called chardet:

    C:\Users\Administrator>pip install chardet
    Collecting chardet
      Downloading chardet-2.3.0-py2.py3-none-any.whl (180kB)
        100% |████████████████████████████████| 184kB 616kB/s
    Installing collected packages: chardet
    Successfully installed chardet-2.3.0

    Passing the bytes to chardet's detect function returns a dict with two values, the detected encoding and a confidence score:

    {'encoding': 'utf-8', 'confidence': 0.99}  

    Modify the corresponding code in builtwith as follows:

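    # chardet.detect returns a dict such as {'encoding': 'utf-8', 'confidence': 0.99}
    encode_type = chardet.detect(html)
    if encode_type['encoding'] == 'utf-8':
        html = html.decode('utf-8')
    else:
        # anything not detected as utf-8 is assumed to be gbk
        html = html.decode('gbk')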

    Remember to import chardet!
    With chardet detecting the character encoding, the code now handles sites in either encoding (UTF-8 or GBK).
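
    The utf-8/gbk branch above covers the two sites tested here. A slightly more general sketch is to try whatever encoding chardet reports and then fall back to a short list of candidates. The helper name decode_html, its detect parameter, and the fallback list are assumptions for illustration, not part of builtwith or chardet; chardet.detect is the only library call used.

```python
def decode_html(raw, detect=None, fallbacks=("utf-8", "gbk")):
    """Decode raw page bytes, trying the detected encoding first.

    detect is any callable that takes bytes and returns a dict with an
    'encoding' key; it defaults to chardet.detect.
    """
    if detect is None:
        import chardet  # third-party: pip install chardet
        detect = chardet.detect

    guess = detect(raw).get("encoding")   # may be None if detection fails
    candidates = ([guess] if guess else []) + list(fallbacks)

    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue                      # wrong or unknown codec, try the next

    # Last resort: substitute undecodable bytes instead of raising.
    return raw.decode("utf-8", errors="replace")
```

    Inside builtwith, the decode step would then become html = decode_html(html). Passing a stub as detect makes the helper usable without chardet installed, for example in tests.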

     http://blog.csdn.net/fengzhizi76506/article/details/61617067
  • Original article: https://www.cnblogs.com/softidea/p/6926193.html