
Learning Spiders: Understanding the Scrapy Workflow

Scrapy

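First, create a project

On Windows:

scrapy startproject myproject  # myproject is your project name

cd myproject  # your project name

scrapy genspider myspider <domain>  # myspider is your spider name, followed by the domain to crawl

Start the spider

scrapy crawl <spider_name>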

Configuration

Configure the following in settings.py

ROBOTSTXT_OBEY = False

CONCURRENT_REQUESTS = 32

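# Maximum number of concurrent requests performed by the Scrapy downloader

# Default: 16

# Concurrency is the number of requests Scrapy processes at the same time. The default global limit is 16 and can be raised; how far depends on how much CPU the crawler uses. Test before changing it; CPU usage of around 80-90% is a reasonable target.

DOWNLOAD_DELAY = 3  # A download delay makes the crawler less likely to be detected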

COOKIES_ENABLED = False  # Disable cookies; some sites use cookies to identify crawlers

# Default headers sent with Scrapy HTTP requests

DEFAULT_REQUEST_HEADERS = {

'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

'Accept-Language': 'en',

}

# Pipelines (Per is the project package name used in this example)

ITEM_PIPELINES = {

'Per.pipelines.PerPipeline': 300,

}

# Logging

LOG_FILE = './TEST.log'

# Encoding

FEED_EXPORT_ENCODING = 'utf-8'
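
The interaction between CONCURRENT_REQUESTS and DOWNLOAD_DELAY above is easy to misjudge: when a delay is set, Scrapy spaces out consecutive requests to the same site, so a higher concurrency limit does not speed up a single-domain crawl. The following is a rough sketch of that arithmetic, not Scrapy code. The function name is hypothetical, and it assumes every request goes to one domain, ignores response time, and ignores the randomization Scrapy applies to the delay by default.

```python
def estimate_min_crawl_seconds(num_pages, download_delay):
    """Rough lower bound on the time needed to request num_pages pages
    from a single domain when requests are spaced download_delay apart."""
    if num_pages <= 0:
        return 0.0
    # The first request starts immediately; each later one waits for the delay.
    return (num_pages - 1) * float(download_delay)


# With DOWNLOAD_DELAY = 3, 100 pages from one site need at least this long.
print(estimate_min_crawl_seconds(100, 3))
```

If that estimate is too slow for your crawl, lowering DOWNLOAD_DELAY has far more effect than raising CONCURRENT_REQUESTS.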

Write the spider in your myspider.py file

import scrapy
from ..items import PerItem


class LishiSpider(scrapy.Spider):
    name = 'myspider'  # spider name, used with `scrapy crawl`

    # allowed_domains = ['www.pearvideo.com']  # domain names only, not full URLs
    start_urls = ['http://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=2&start=1']  # URL to start crawling from

    def parse(self, response):
        # Titles
        titles = response.xpath('/html/body/li[@class="categoryem"]/div[@class="vervideo-bd"]/a//div[@class="vervideo-title"]/text()').extract()
        # Links
        t_urls = response.xpath('/html/body/li[@class="categoryem"]/div[@class="vervideo-bd"]/a/@href').extract()
        # Durations
        durations = response.xpath('/html/body/li[@class="categoryem"]/div[@class="vervideo-bd"]/a//div[@class="cm-duration"]/text()').extract()

        # Store the scraped values in the PerItem defined in items.py
        for title, t_url, duration in zip(titles, t_urls, durations):
            item = PerItem()
            item['title'] = title
            item['t_url'] = 'http://www.pearvideo.com/' + t_url
            item['data'] = duration

            # Yield the item so it reaches the pipelines listed in ITEM_PIPELINES
            yield item
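
To see what the three XPath expressions in parse are selecting, it can help to run the same structure against a small fragment outside Scrapy. The sketch below is an illustration only: it swaps in the standard library's xml.etree.ElementTree for Scrapy's selectors (ElementTree supports only a subset of XPath and needs well-formed markup), and the SAMPLE fragment is a hypothetical stand-in inferred from the XPath expressions, not the real page.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed stand-in for the page fragment; real markup may differ.
SAMPLE = """
<body>
  <li class="categoryem">
    <div class="vervideo-bd">
      <a href="video_1">
        <div class="cm-duration">01:23</div>
        <div class="vervideo-title">First clip</div>
      </a>
    </div>
  </li>
  <li class="categoryem">
    <div class="vervideo-bd">
      <a href="video_2">
        <div class="cm-duration">04:56</div>
        <div class="vervideo-title">Second clip</div>
      </a>
    </div>
  </li>
</body>
"""

BASE_URL = "http://www.pearvideo.com/"


def extract_items(markup, base_url=BASE_URL):
    """Return one dict per <a> entry, using the same keys as PerItem."""
    root = ET.fromstring(markup.strip())
    items = []
    # Walk entry by entry so title, link and duration always stay aligned.
    for link in root.findall("./li[@class='categoryem']/div[@class='vervideo-bd']/a"):
        title = link.find(".//div[@class='vervideo-title']")
        duration = link.find(".//div[@class='cm-duration']")
        if title is None or duration is None:
            continue  # skip incomplete entries
        items.append({
            "title": (title.text or "").strip(),
            "t_url": urljoin(base_url, link.get("href", "")),
            "data": (duration.text or "").strip(),
        })
    return items


for item in extract_items(SAMPLE):
    print(item)
```

Extracting all three fields from each entry in turn, rather than from three separate page-wide lists, keeps them matched even when one entry is missing a field.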

Note: the fields you scrape must match the fields defined in items.py

import scrapy


class PerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    t_url = scrapy.Field()
    data = scrapy.Field()
    shi = scrapy.Field()
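
The ITEM_PIPELINES setting refers to Per.pipelines.PerPipeline, but the article does not show that class. The following is one possible minimal sketch for pipelines.py, not the author's implementation: it assumes you want each item appended to a JSON Lines file, and the file name items.jl and the path parameter are hypothetical. It uses the open_spider, close_spider and process_item hooks that Scrapy calls on pipeline classes.

```python
import json


class PerPipeline:
    """Hypothetical pipeline that writes each item as one JSON line."""

    def __init__(self, path="items.jl"):
        # Scrapy creates the pipeline without arguments, so the default is used.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.file:
            self.file.close()

    def process_item(self, item, spider):
        # dict() works for both scrapy.Item objects and plain dicts.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        return item  # hand the item on to any later pipeline
```

Returning the item from process_item matters: if several pipelines are configured, each one receives whatever the previous one returned.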

Finally, start the spider

scrapy crawl myspider


Original article: https://www.cnblogs.com/wudameng/p/11083372.html