  • Python Web Scraping (16): IndexError: list index out of range

    While parsing a website with lxml and XPath, I ran into this error: IndexError: list index out of range

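    The cause is that some elements in the site's HTML are missing the attribute (or child node) being queried, so xpath() returns an empty list and indexing it with [0] fails. Wrapping the extraction in try...except and skipping those elements is enough to get past it.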
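
To make the cause concrete, here is a minimal sketch of how the empty list arises. The sample markup and its values are made up for illustration and do not come from the site discussed here; the second <li> simply has no data-score attribute.

```python
from lxml import etree

# Hypothetical sample markup: the second <li> has no data-score attribute.
text = """
<ul class="lists">
  <li data-title="Movie A" data-score="8.1"></li>
  <li data-title="Movie B"></li>
</ul>
"""

html = etree.HTML(text)
lis = html.xpath("//ul[@class='lists']/li")

# An attribute query returns a list; the list is empty when nothing matches.
scores = [li.xpath("@data-score") for li in lis]
print(scores)

# Indexing the empty list is what raises IndexError.
try:
    scores[1][0]
    raised = False
except IndexError:
    raised = True
print(raised)
```

Note that an attribute that is present but empty (data-score="") would still yield a one-element list containing an empty string, so the error points to a missing attribute or node rather than an empty value.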

    For example:

    from lxml import etree

    # text holds the page's HTML source, fetched beforehand
    html = etree.HTML(text)
    ul = html.xpath("//ul[@class='lists']")[0]
    # Note: "//li" matches every <li> in the document, not only those under ul
    lis = ul.xpath("//li")
    for li in lis:
        # Each [0] raises IndexError if the attribute or node is missing
        title = li.xpath("@data-title")[0]
        score = li.xpath("@data-score")[0]
        duration = li.xpath("@data-duration")[0]
        region = li.xpath("@data-region")[0]
        director = li.xpath("@data-director")[0]
        actors = li.xpath("@data-actors")[0]
        thumbnail = li.xpath(".//img/@src")[0]
        movie = {
            'title': title,
            'score': score,
            'duration': duration,
            'region': region,
            'director': director,
            'actors': actors,
            'thumbnail': thumbnail
        }

        print(movie)

    Running this code raises: IndexError: list index out of range

    The revised code:

    from lxml import etree

    # text holds the page's HTML source, fetched beforehand
    html = etree.HTML(text)
    ul = html.xpath("//ul[@class='lists']")[0]
    # ".//li" keeps the search inside ul; a bare "//li" would match every <li> in the document
    lis = ul.xpath(".//li")

    for li in lis:
        try:
            title = li.xpath("@data-title")[0]
            score = li.xpath("@data-score")[0]
            duration = li.xpath("@data-duration")[0]
            region = li.xpath("@data-region")[0]
            director = li.xpath("@data-director")[0]
            actors = li.xpath("@data-actors")[0]
            thumbnail = li.xpath(".//img/@src")[0]
            movie = {
                'title': title,
                'score': score,
                'duration': duration,
                'region': region,
                'director': director,
                'actors': actors,
                'thumbnail': thumbnail
            }

            print(movie)
        except IndexError:
            # Skip any <li> that is missing one of the expected attributes
            pass
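
The try...except approach drops an entire entry as soon as one field is missing. If partial records are acceptable, one possible alternative is to fall back to a default value per field. The sketch below assumes a helper named first(), which is hypothetical and not part of lxml, and uses made-up sample markup.

```python
from lxml import etree


def first(node, path, default=None):
    """Return the first XPath result for path on node, or default if there is none."""
    results = node.xpath(path)
    return results[0] if results else default


# Hypothetical sample markup: the second <li> lacks data-score and has no <img>.
text = """
<ul class="lists">
  <li data-title="Movie A" data-score="8.1"><img src="a.jpg"/></li>
  <li data-title="Movie B"></li>
</ul>
"""

html = etree.HTML(text)
ul = html.xpath("//ul[@class='lists']")[0]

movies = []
for li in ul.xpath("./li"):
    # Missing fields become None instead of aborting the whole record.
    movies.append({
        "title": first(li, "@data-title"),
        "score": first(li, "@data-score"),
        "thumbnail": first(li, ".//img/@src"),
    })

for movie in movies:
    print(movie)
```

Whether to skip incomplete entries or keep them with placeholder values depends on what the collected data is used for; the original fix above skips them.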

  • Original article: https://www.cnblogs.com/zhaoxinhui/p/12392438.html