  • Scrapy framework

The Scrapy framework:

Spiders send requests ==> Engine ==> Scheduler ==> Downloader (fetches the response) ==> Spiders ==> process the data into Items ==> Item Pipeline.

Create the project (scrapy startproject xxx): create a new crawler project.

Define the targets (edit items.py): declare the fields you want to scrape (see the items.py sketch after this list).

Write the spider (spiders/xxspider.py): write the spider and start crawling.

Store the content (pipelines.py): design pipelines to store what was scraped.
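A minimal items.py sketch matching the fields used later in this post (nickname, imagelink and imagePath appear in the image pipeline below; the class name TeacherItem is an assumption for illustration):

    import scrapy

    class TeacherItem(scrapy.Item):
        # Fields consumed by the pipelines shown below
        nickname = scrapy.Field()
        imagelink = scrapy.Field()
        imagePath = scrapy.Field()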

Running the crawler project:

From the command line: scrapy crawl myspider

From PyCharm (a small launcher script in the project root):

    from scrapy import cmdline
    cmdline.execute('scrapy crawl myspider'.split())
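An alternative launcher sketch using Scrapy's CrawlerProcess API, which runs the spider in-process instead of shelling out; 'myspider' is assumed to be the value of the spider's name attribute:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Load the project's settings.py and run the spider by name
    process = CrawlerProcess(get_project_settings())
    process.crawl('myspider')
    process.start()  # blocks until the crawl finishes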

Pipelines:

First, in settings.py:

    ITEM_PIPELINES = {
        # 'mySpider.pipelines.mySpiderPipelines': 100,
        'mySpider.pipelines.MyspiderPipeline': 300,
    }

The number (0-1000) sets the running order: pipelines with lower values run first.

Then, in pipelines.py:

    import json

    class MyspiderPipeline(object):
        def __init__(self):
            self.filename = open('teacher.json', 'w', encoding='utf8')

        # Process each item: serialize it as one JSON line
        def process_item(self, item, spider):
            jsontxt = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.filename.write(jsontxt)
            return item

        # Called once when the spider finishes
        def close_spider(self, spider):
            self.filename.close()
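Note that process_item should return the item (or raise DropItem) so that any later pipelines in ITEM_PIPELINES still receive it. A slightly more idiomatic variant opens the file in an open_spider method rather than __init__, so the file handle's lifetime matches the spider's.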

Requesting the next page (in myspider.py, written outside the for loop):

    # Re-send the request to the scheduler's queue; the downloader will fetch it
    yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
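A minimal spider sketch showing where that yield sits; the listing URL, offset step, stop condition, and XPath selectors are assumptions for illustration, and TeacherItem is the sketch from items.py above:

    import scrapy
    from mySpider.items import TeacherItem

    class MySpider(scrapy.Spider):
        name = 'myspider'
        url = 'http://example.com/teachers?start='  # hypothetical listing URL
        offset = 0
        start_urls = [url + str(offset)]

        def parse(self, response):
            for node in response.xpath('//div[@class="teacher"]'):  # hypothetical selector
                item = TeacherItem()
                item['nickname'] = node.xpath('./h3/text()').get()
                item['imagelink'] = node.xpath('./img/@src').get()
                yield item

            # Outside the for loop: queue the next page until a stop condition
            if self.offset < 100:
                self.offset += 20
                yield scrapy.Request(self.url + str(self.offset), callback=self.parse)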

Setting default request headers (in settings.py):

    DEFAULT_REQUEST_HEADERS = {
        'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        # 'Accept-Language': 'en',
    }
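Scrapy merges these defaults into every outgoing request; headers passed explicitly to an individual scrapy.Request take precedence over them.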

Setting the download delay (in settings.py; the default project template ships this line commented out):

    DOWNLOAD_DELAY = 3
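DOWNLOAD_DELAY is in seconds and applies per site; by default Scrapy also randomizes the actual wait to between 0.5x and 1.5x of this value (RANDOMIZE_DOWNLOAD_DELAY, True by default).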


Pipeline for images:

    import os
    import scrapy
    from scrapy.utils.project import get_project_settings
    from scrapy.pipelines.images import ImagesPipeline

    class MyImagesPipeline(ImagesPipeline):
        # Read the storage directory configured in settings.py
        IMAGES_STORE = get_project_settings().get('IMAGES_STORE')

        def get_media_requests(self, item, info):
            # Ask the pipeline to download the image URL carried by the item
            image_url = item['imagelink']
            yield scrapy.Request(image_url)

        def item_completed(self, results, item, info):
            # Collect the paths of successfully downloaded images
            image_path = [x['path'] for ok, x in results if ok]
            # Rename the downloaded file after the item's nickname
            os.rename(self.IMAGES_STORE + '/' + image_path[0],
                      self.IMAGES_STORE + '/' + item['nickname'] + '.jpg')
            item['imagePath'] = self.IMAGES_STORE + '/' + item['nickname'] + '.jpg'
            return item
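For this pipeline to run, settings.py must define the storage directory and register the class (the images pipeline also requires Pillow to be installed); a minimal sketch, assuming the project is named mySpider:

    # settings.py
    IMAGES_STORE = './images'   # hypothetical storage directory
    ITEM_PIPELINES = {
        'mySpider.pipelines.MyImagesPipeline': 300,
    }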

Points to note:

    url = re.sub(r'\d+', str(page), response.url)

re.sub(s1, s2, s3) replaces every match of the pattern s1 in s3 with s2 (the pattern needs the raw-string backslash, r'\d+', to match digits).

content = json.dumps(dict(item), ensure_ascii=False) keeps Chinese characters as-is in the JSON output instead of escaping them to \uXXXX sequences.
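A quick sketch of both calls on hypothetical data:

    import json
    import re

    # Swap the digits in a URL for the next page number
    print(re.sub(r'\d+', str(2), 'http://example.com/page/1'))
    # -> http://example.com/page/2

    # With ensure_ascii=False the Chinese text survives verbatim
    print(json.dumps({'nickname': '老师'}, ensure_ascii=False))
    # -> {"nickname": "老师"}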

  • Original post: https://www.cnblogs.com/xuezhihao/p/11636153.html