  • Pitfalls When Getting Started with Python Web Scraping

    1. Environment

    - Python
      The Python that comes preinstalled with macOS

    $ python -V  
    Python 2.7.10
    $ where python
    /usr/bin/python
    $ ls /System/Library/Frameworks/Python.framework/Versions
    2.3     2.5     2.6     2.7     Current
    $ ls /Library/Frameworks/Python.framework/Versions    # directory for user-installed Python versions

    - IDE
      PyCharm
    - Helper tools
      Install pip

    sudo easy_install pip

    - Python libraries

    sudo pip install requests         # installs requests 2.13.0 by default
    sudo pip install BeautifulSoup    # installs BeautifulSoup 3.2.1 by default
    sudo pip install lxml             # installs lxml 3.7.3 by default
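
    After installing, it can help to confirm which libraries the interpreter actually sees and in which versions. The following is a minimal sketch, not part of the original setup: report_versions is a hypothetical helper name, and it assumes each library exposes a __version__ attribute (it falls back to "unknown" when one is missing).

```python
import importlib


def report_versions(module_names):
    """Map each module name to its __version__ string, or None if the import fails."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None  # not installed for this interpreter
        else:
            # Not every module defines __version__, so fall back to a placeholder.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    # "BeautifulSoup" is the v3 module name; "bs4" is the v4 package name.
    found = report_versions(["requests", "BeautifulSoup", "bs4", "lxml"])
    for name, version in sorted(found.items()):
        print(name, version)
```

    A result of None for bs4 alongside a version string for BeautifulSoup would indicate that only the version 3 library is installed.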

    2. Problems

    - Problem 1

    Code:
    soup = BeautifulSoup(html, 'lxml')
    Error:
    Traceback (most recent call last):
    File "/Users/cuizhenyu/Documents/Codes/Python/DownloadMeitu/LibBeautifulSoupTest.py", line 15, in <module>
    soup = BeautifulSoup(html) # soup = BeautifulSoup(html, 'lxml') raises an error
    TypeError: 'module' object is not callable
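    Solution:
    The error means the name BeautifulSoup refers to the module rather than the class (which is what a plain "import BeautifulSoup" produces), so calling it fails. Import the class from the module:
    from BeautifulSoup import BeautifulSoup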
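
    The same mistake can be reproduced with any module that shares its name with a class it contains. The sketch below uses the standard library's datetime as a stand-in, because BeautifulSoup 3 only runs on Python 2; the alias datetime_class is a name chosen here for illustration.

```python
import datetime  # binds the *module* named datetime, not the class inside it

try:
    datetime(2017, 4, 4)  # calling the module itself, like BeautifulSoup(html) above
except TypeError as exc:
    message = str(exc)
    print(message)

# Binding the class explicitly makes the call work.
from datetime import datetime as datetime_class

stamp = datetime_class(2017, 4, 4)
print(stamp.year)
```

    The pattern "from module import Class" is exactly what the fix for BeautifulSoup applies.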

    - Problem 2

    Code:
    soup = BeautifulSoup(html, 'lxml')
    Error:
    Traceback (most recent call last):
    File "/Users/cuizhenyu/Documents/Codes/Python/DownloadMeitu/LibBeautifulSoupTest.py", line 15, in <module>
    soup = BeautifulSoup(html, 'lxml') # soup = BeautifulSoup(html, 'lxml') raises an error
    File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1522, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
    File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1147, in __init__
    self._feed(isHTML=isHTML)
    File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1189, in _feed
    SGMLParser.feed(self, markup)
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 104, in feed
    self.goahead(0)
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 138, in goahead
    k = self.parse_starttag(i)
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 296, in parse_starttag
    self.finish_starttag(tag, attrs)
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 338, in finish_starttag
    self.unknown_starttag(tag, attrs)
    File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1338, in unknown_starttag
    self.endData()
    File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1251, in endData
    (not self.parseOnlyThese.text or
    AttributeError: 'str' object has no attribute 'text'
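    Solution:
    The installed BeautifulSoup is version 3, which does not support lxml or other pluggable parsers; version 4 is required. In version 3 the second positional argument is parseOnlyThese, so the string 'lxml' ends up there and triggers the AttributeError at the end of the traceback.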
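
    A minimal sketch of the version 4 usage, assuming it has been installed with "pip install beautifulsoup4" (the package is imported as bs4). The sample HTML string and the helper name make_soup are illustrative only; the fallback to the built-in html.parser is an addition for machines where lxml is missing.

```python
# Assumes BeautifulSoup 4 is installed: pip install beautifulsoup4
from bs4 import BeautifulSoup, FeatureNotFound

# Illustrative markup, not taken from the original script.
html = "<html><head><title>Demo</title></head><body><p class='x'>hello</p></body></html>"


def make_soup(markup):
    """Parse with lxml when it is available, otherwise with the built-in html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html)
title_text = soup.title.string
paragraph_text = soup.find("p").get_text()
print(title_text, paragraph_text)
```

    In version 4 the second argument names the parser, so BeautifulSoup(html, 'lxml') behaves as the original code intended.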

     

  • Original article: https://www.cnblogs.com/mulisheng/p/6665350.html