  • books (books.toscrape.com): a beginner's hands-on XPath practice

    I have recently been reading up on practical Scrapy crawling and ran into quite a few problems with the introductory examples, especially when using CSS and XPath selectors inside Scrapy: the real-world extractions would not come out right and I only managed the basic functionality (a minimal selector sketch follows).
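
    For reference, the same link extraction inside Scrapy goes through its selector API (response.xpath / response.css). This is only a minimal sketch of a books.toscrape.com spider with illustrative names, not the code from the tutorial I was following:

    import scrapy

    class BooksSketchSpider(scrapy.Spider):
        name = 'books_sketch'  # illustrative spider name
        start_urls = ['http://books.toscrape.com/']

        def parse(self, response):
            # same XPath as in the requests version further down
            for href in response.xpath('//article[@class="product_pod"]/h3/a/@href').getall():
                yield response.follow(href, callback=self.parse_book)
            # the equivalent CSS selector would be: response.css('li.next a::attr(href)').get()
            next_href = response.xpath('//li[@class="next"]/a/@href').get()
            if next_href:
                yield response.follow(next_href, callback=self.parse)

        def parse_book(self, response):
            yield {
                'name': response.xpath('//div[@class="col-sm-6 product_main"]/h1/text()').get(),
                'price': response.css('p.price_color::text').get(),
            }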

    So I dropped the Scrapy framework and redid the project with just the basics (requests plus lxml), only to find that it runs far slower than Scrapy! (A small concurrency sketch follows.)
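
    Most of that gap is likely down to concurrency: Scrapy downloads pages concurrently, while this version fetches them one at a time. If one wants to stay with requests, a shared Session plus a thread pool narrows the gap considerably. A rough sketch, under the assumption that the 50 list pages live at catalogue/page-1.html through page-50.html (adjust the URL list if the first page has to be fetched from the site root) and with no politeness delays:

    import requests
    from concurrent.futures import ThreadPoolExecutor

    session = requests.Session()  # reuse one connection pool instead of reconnecting for every page

    def fetch(url):
        return session.get(url).text

    # assumed URL pattern for the 50 list pages
    urls = ['http://books.toscrape.com/catalogue/page-%d.html' % n for n in range(1, 51)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        pages = list(pool.map(fetch, urls))  # download the list pages concurrently, results kept in order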

    Below is the code. It also contains quite a bit of redundancy (some book links end up being requested more than once), which adds to the running time and makes it even slower.

    import requests
    from lxml import etree
    
    class Books(object):
        def index(self, response):
            html = etree.HTML(response.text)  # parse the response into an element tree
            # collect the link of every book on the home page
            index_xpath = html.xpath('//article[@class="product_pod"]/h3/a/@href')
            # get the link of the next page (page 2)
            next = html.xpath('//div/ul[@class="pager"]/li[@class="next"]/a/@href')
            next_url = requests.get("http://books.toscrape.com/" + next[0])
            html = etree.HTML(next_url.text)
            index_xpath.extend(html.xpath('//article[@class="product_pod"]/h3/a/@href'))
            for i in index_xpath:
                self.books(i)
    
            # the remaining 48 of the 50 pages (the home page and page 2 were handled above);
            # from page 2 onwards, the next-page href is already relative to /catalogue/
            for i in range(48):
                # get the link of the next page
                next = html.xpath('//div/ul[@class="pager"]/li[@class="next"]/a/@href')
                next_url = requests.get("http://books.toscrape.com/catalogue/" + next[0])
                html = etree.HTML(next_url.text)
                index_xpath.extend(html.xpath('//article[@class="product_pod"]/h3/a/@href'))
            print(index_xpath)
            for i in index_xpath:
                # book hrefs on the catalogue pages do not include catalogue/, so it is prepended here;
                # note this loop also revisits the links already handled above (part of the redundancy mentioned earlier)
                self.books('catalogue/' + i)
    
        def books(self, index_xpath):
            response = requests.get('http://books.toscrape.com/' + index_xpath)
            html = etree.HTML(response.text)
            name = html.xpath('//div[@class="col-sm-6 product_main"]/h1/text()')
            price = html.xpath('//div[@class="col-sm-6 product_main"]/p[@class="price_color"]/text()')
            # the star rating is hard to get this way! (see the sketch after the code)
            upc = html.xpath('//table[@class="table table-striped"]/tr[1]/td/text()')  # first row of the product information table: UPC
            # I could not pull out just the stock number with XPath alone! (see the sketch after the code)
            # stock = html.xpath('//table[@class="table table-striped"]/tbody/tr[last()-1]/td/text()')
            num = html.xpath('//table[@class="table table-striped"]/tr[last()]/td/text()')  # last row: number of reviews
            for i, j, k, l in zip(name, price, upc, num):
                params = str((i, j, k, l))
                with open('books.csv', 'a', encoding='utf-8') as f:
                    f.write(params + '\n')  # write one record per line
    
    if __name__ == '__main__':
        response = requests.get('http://books.toscrape.com/')
        Books().index(response)
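
    The two fields flagged as hard in the comments, the star rating and the in-stock count, can be read from an attribute and a piece of text with XPath plus a small regular expression. This is a sketch under the assumption that the product page markup is as I remember it (a <p class="star-rating Three"> element and an availability string such as "In stock (22 available)"); the example URL is just one book picked from the catalogue:

    import re
    import requests
    from lxml import etree

    response = requests.get('http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')
    html = etree.HTML(response.text)

    # star rating: the class attribute holds the rating as a word, e.g. "star-rating Three"
    rating_class = html.xpath('//p[contains(@class, "star-rating")]/@class')[0]
    rating = rating_class.replace('star-rating', '').strip()  # -> "Three"

    # stock: pull the number out of the availability text, e.g. "In stock (22 available)"
    availability = ''.join(html.xpath('//p[@class="instock availability"]//text()')).strip()
    match = re.search(r'\d+', availability)
    stock = int(match.group()) if match else 0

    print(rating, stock)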
    

      

  • Original post: https://www.cnblogs.com/fodalaoyao/p/10434672.html