
Scraping 校花网 (xiaohuar.com) with Scrapy/Selenium

The site: http://www.xiaohuar.com/

First gallery page: http://www.xiaohuar.com/list-1-0.html

Second page: http://www.xiaohuar.com/list-1-1.html

And so on.
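The page index is simply the last number in the URL, so listing URLs can be generated directly; a quick sketch (the count of 5 pages is only for illustration):

    # Build the first few listing URLs from the observed pattern
    urls = ['http://www.xiaohuar.com/list-1-%d.html' % i for i in range(5)]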

Steps:

1. Create the project (run in a terminal, in the appropriate directory):

    source activate spider
    scrapy startproject xiaohua
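startproject generates the standard Scrapy skeleton, roughly:

    xiaohua/
        scrapy.cfg          # deploy configuration
        xiaohua/            # the project package
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py

The spider file itself can be written by hand under spiders/, or generated with scrapy genspider xiaohua xiaohuar.com.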

2. Define the item (in items.py):

    import scrapy

    class XiaohuaItem(scrapy.Item):
        title = scrapy.Field()
        href = scrapy.Field()
        imgsrc = scrapy.Field()
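Scrapy items behave like dicts, which is what the pipeline in step 4 relies on; a quick illustration (the values here are made up):

    item = XiaohuaItem()
    item['title'] = 'example title'                    # hypothetical value
    item['imgsrc'] = 'http://www.xiaohuar.com/x.jpg'   # hypothetical value
    print(dict(item))  # {'title': 'example title', 'imgsrc': 'http://www.xiaohuar.com/x.jpg'}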

3. Write the spider. Note: base the XPath for the bookmark blocks on the raw page source; the div attribute values there differ from what F12 shows.

    import scrapy
    import re
    from urllib.parse import urljoin

    from xiaohua.items import XiaohuaItem

    class XiaohuaSpider(scrapy.Spider):
        name = 'xiaohua'
        allowed_domains = ['xiaohuar.com']
        start_urls = ['http://www.xiaohuar.com/list-1-0.html']

        def parse(self, response):
            # Note: base the XPath on the raw page source; its div attribute
            # values differ from what F12 shows
            bookmarks = response.xpath('//div[@class="item masonry_brick"]')
            print('bookmarks length:', len(bookmarks))
            for bm in bookmarks:
                item = XiaohuaItem()
                title = bm.xpath('.//div[@class="title"]/span/a/text()').extract()[0]
                href = bm.xpath('.//div[@class="title"]/span/a/@href').extract()[0]
                imgsrc = bm.xpath('.//div[@class="img"]/a/img/@src').extract()[0]
                item['title'] = title
                item['href'] = href
                item['imgsrc'] = urljoin(response.url, imgsrc)
                '''
                Multi-page crawling, following the video tutorial; it works.
                This block must stay inside the for loop, otherwise the spider
                would keep requesting new pages; put another way, once
                bookmarks is empty no further page is fetched.
                '''
                # Extract the page number from the current URL
                curpage = re.search(r'(\d+)-(\d+)', response.url).group(2)  # group(2) is the second parenthesized group
                # Compute the next page number
                pagenum = int(curpage) + 1
                # Build the next-page URL
                url = re.sub(r'1-(\d+)', '1-' + str(pagenum), response.url)
                # Hand the URL back via yield; note how callback is written
                yield scrapy.Request(url, callback=self.parse)
                yield item
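Because the same next-page request is yielded once per bookmark, Scrapy's default duplicate filter is what keeps that page from being fetched repeatedly. For reference, a minimal variant (not the code used above) yields it once per page and stops explicitly on an empty page:

    def parse(self, response):
        bookmarks = response.xpath('//div[@class="item masonry_brick"]')
        if not bookmarks:
            return                   # empty page: stop paginating
        for bm in bookmarks:
            ...                      # build and yield items as above
        curpage = int(re.search(r'(\d+)-(\d+)', response.url).group(2))
        url = re.sub(r'1-\d+', '1-%d' % (curpage + 1), response.url)
        yield scrapy.Request(url, callback=self.parse)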

     

4. Write the pipeline; it produces the file xiaohua.json.

    import json

    class XiaohuaPipeline(object):
        '''
        Open the json file once when the pipeline is initialized and close it
        when the spider closes, so the file is opened only once for the whole
        crawl.
        '''
        def __init__(self):
            self.file = open('xiaohua.json', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            # item converts to a dict directly
            content = json.dumps(dict(item), ensure_ascii=False) + '\n'
            self.file.write(content)
            return item

        def close_spider(self, spider):
            self.file.close()

        '''
        # The variant below also works, but it reopens the json file for
        # every item
        def process_item(self, item, spider):
            with open('xiaohua.json', 'a', encoding='utf-8') as f:
                json.dump(dict(item), f, ensure_ascii=False)
            return item
        '''
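Each item then lands in xiaohua.json as one JSON object per line; an illustrative line (all values made up):

    {"title": "某校花", "href": "http://www.xiaohuar.com/p-1-1.html", "imgsrc": "http://www.xiaohuar.com/d/file/x.jpg"}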

5. Enable the pipeline (in settings.py):

    ITEM_PIPELINES = {
        'xiaohua.pipelines.XiaohuaPipeline': 300,
    }

6. Middleware, which would use Selenium (in the end Selenium was not used).

7. Enable the middleware (in settings.py, done much like the pipeline setting above; not used in the end). A sketch of what both pieces could look like follows.
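Though Selenium was dropped here, a downloader middleware that renders pages with it would look roughly like the sketch below; SeleniumMiddleware is a hypothetical name, and it assumes the selenium package plus a chromedriver on PATH:

    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class SeleniumMiddleware(object):          # hypothetical, not part of this project
        def __init__(self):
            self.driver = webdriver.Chrome()   # assumes chromedriver is installed

        def process_request(self, request, spider):
            # Render the page in a real browser, then hand the final HTML back
            # to Scrapy; returning a Response here skips the normal download.
            self.driver.get(request.url)
            return HtmlResponse(url=request.url,
                                body=self.driver.page_source,
                                encoding='utf-8',
                                request=request)

It would be enabled in settings.py much like the pipeline:

    DOWNLOADER_MIDDLEWARES = {
        'xiaohua.middlewares.SeleniumMiddleware': 543,
    }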

8. In a terminal (under the spider/exec/xiaohua directory), run the crawl:

    scrapy crawl xiaohua
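As an aside, Scrapy's built-in feed export can also dump the items without a custom pipeline, one JSON object per line:

    scrapy crawl xiaohua -o xiaohua.jl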

Original post: https://www.cnblogs.com/djlbolgs/p/12506507.html