  • Scrapy hands-on tutorial: crawling a proxy-list site

    An introduction to Scrapy.

    This post covers the following points:

    1. How to crawl a website with the Scrapy framework;

    2. How to parse page source with XPath;

    3. How to write a spider file that parses the page source;

    4. Basic Scrapy commands: creating a project and generating a spider file from a template;

    5. How to deal with anti-crawling measures via settings.py, e.g. setting the User-Agent.

    Target: crawl the proxy-list site www.xicidaili.com.

    1. Create a project with scrapy startproject <project-name>:

    scrapy startproject xicidailiSpider

    How should the project be named? A good convention is the target domain plus "Spider". For example, to crawl www.zhihu.com, the project could be named zhihuSpider.

    This creates the project's directory structure:
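
    The original screenshot is missing here; for reference, this is the standard layout that scrapy startproject generates:

    xicidailiSpider/
        scrapy.cfg                # deploy configuration
        xicidailiSpider/
            __init__.py
            items.py              # item definitions
            middlewares.py        # spider and downloader middlewares
            pipelines.py          # item pipelines
            settings.py           # project settings
            spiders/              # spider files go here
                __init__.py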

     

    2. Inside the project, the spiders directory holds the spider files; middlewares.py contains the middlewares (both downloader middlewares and spider middlewares); pipelines.py contains the item pipelines, which post-process the data the spider extracts; settings.py holds project settings, such as whether to obey the site's robots.txt and which User-Agent to send. A minimal pipeline sketch follows below.
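
    For illustration, a pipeline that drops rows with an empty IP might look like this (a sketch; this code is not in the original post, though the class name matches the one scaffolded for this project):

    # pipelines.py
    from scrapy.exceptions import DropItem

    class XicidailispiderPipeline:
        # Scrapy calls process_item once for every item the spider yields
        def process_item(self, item, spider):
            if not item.get("ip"):
                raise DropItem("missing ip field")
            return item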

    3. Generate a spider from Scrapy's built-in template:

    scrapy genspider <spider-name> <domain-to-crawl>

    For example, since we want to crawl www.xicidaili.com, we can run:

    scrapy genspider xicidaili xicidaili.com

    After the command completes, the final directory looks like this:
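
    The directory screenshot is missing here; genspider creates xicidaili.py under spiders/. With Scrapy versions of that era, the generated skeleton looks roughly like this:

    # -*- coding: utf-8 -*-
    import scrapy


    class XicidailiSpider(scrapy.Spider):
        name = 'xicidaili'
        allowed_domains = ['xicidaili.com']
        start_urls = ['http://xicidaili.com/']

        def parse(self, response):
            pass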

    After creating the project, analyze the pages you want to extract data from.

    Three parsing approaches are commonly used (a short comparison follows the list):

    1. Regular expressions

    2. XPath: response.xpath("expression")

    3. CSS: response.css("expression")
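
    As a rough comparison, here is how each approach might pull the IP column out of the table (a sketch; the selectors are illustrative, not taken from the original post):

    import re

    def parse(self, response):
        # 1. Regular expression over the raw HTML: fast but brittle
        ips_re = re.findall(r"\d{1,3}(?:\.\d{1,3}){3}", response.text)
        # 2. XPath: navigate the document tree by structure
        ips_xpath = response.xpath("//tr/td[2]/text()").getall()
        # 3. CSS: select by tag/class/position, as in a stylesheet
        ips_css = response.css("tr td:nth-child(2)::text").getall()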

    W3School has a tutorial on XPath syntax: http://www.w3school.com.cn/xpath/xpath_syntax.asp

    Installing the XPath Helper browser extension makes it easy to verify that an XPath expression is correct.

    XPath syntax takes practice; it is genuinely hard to retain from reading alone. A few common patterns are listed below.
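
    A few patterns that come up constantly (illustrative examples):

    response.xpath("//table")             # every <table> anywhere in the document
    response.xpath("//tr/td[2]/text()")   # text of the 2nd <td> in each row (XPath is 1-indexed)
    response.xpath("//a/@href")           # href attribute of every link
    response.xpath(".//td")               # <td> descendants of the current node only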

    The finished spider, xicidaili.py:
    

      

    # -*- coding: utf-8 -*-
    import scrapy
    
    
    # A spider subclassing scrapy.Spider
    class XicidailiSpider(scrapy.Spider):
        name = 'xicidaili'
        allowed_domains = ['xicidaili.com']
        start_urls = ['https://www.xicidaili.com/nn/',
                      'https://www.xicidaili.com/nt/',
                      'https://www.xicidaili.com/wn/',
                      'https://www.xicidaili.com/wt/']
    
        # Parse the response: extract the data fields, follow-up URLs, etc.
        def parse(self, response):
            selectors = response.xpath('//tr')
            for selector in selectors:
                ip = selector.xpath("./td[2]/text()").get()
                port = selector.xpath("./td[3]/text()").get()       # "./" means relative to the current node
                country = selector.xpath("./td[4]/a/text()").get()  # get() is equivalent to extract_first(); getall() returns all matches
                # print(ip, port, country)
                item = {
                    "ip": ip,
                    "port": port,
                    "country": country
                }
                yield item
            """
            # 翻页操作
            # 获取下一页的标签
            next_page = response.xpath("//a[@class='next_page']/@href").get()
            # 判断next_page是否有值,也就是是否到了最后一页
            if next_page:
                # 拼接网页url---response.urljoin
                next_url = response.urljoin(next_page)
                # 判断最后一页是否
                yield  scrapy.Request(next_url,callback=self.parse)   # 回调函数不要加括号
        """
    
    # -*- coding: utf-8 -*-
    # settings.py
    # Scrapy settings for xicidailiSpider project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'xicidailiSpider'
    
    SPIDER_MODULES = ['xicidailiSpider.spiders']
    NEWSPIDER_MODULE = 'xicidailiSpider.spiders'
    
    # Character encoding for exported feed files
    FEED_EXPORT_ENCODING = "utf-8"
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'xicidailiSpider (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    # Whether to obey the robots.txt protocol; here we do not
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
        # Adjacent string literals concatenate, keeping the long UA readable
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'xicidailiSpider.middlewares.XicidailispiderSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'xicidailiSpider.middlewares.XicidailispiderDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    #ITEM_PIPELINES = {
    #    'xicidailiSpider.pipelines.XicidailispiderPipeline': 300,
    #}
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    

    Run the spider:

    scrapy crawl xicidaili

    The argument is the spider's name attribute (not the project name), and it must be unique.

    To export the results to a file:

    scrapy crawl xicidaili --output ip.json   (or ip.csv)
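
    As an aside, the same crawl can be launched from a plain Python script using Scrapy's CrawlerProcess API (a sketch; run it from the project root so settings.py is found):

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Load the project's settings.py, then run the spider by name
    process = CrawlerProcess(get_project_settings())
    process.crawl("xicidaili")
    process.start()   # blocks until the crawl finishes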

      
