  • Scrapy in depth

    
    

    ========================================================================================================================

    1. Base class: scrapy.Spider
    name: the spider's name
    allowed_domains: the domains the spider is allowed to crawl
    start_urls: the initial URLs
    custom_settings: per-spider settings that override the global settings
    crawler: the crawler this spider is bound to
    settings: the settings instance
    logger: the logger instance

    Methods:
    from_crawler(crawler, *args, **kwargs): class method used to create spiders
    start_requests(): generates the initial requests
    make_requests_from_url(url): builds a Request from a URL
    parse(response): parses the page content
    log()
    closed()
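    A minimal sketch that ties these attributes and methods together (the site, selectors and
    field names below are made up for illustration):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/quotes']
        custom_settings = {'DOWNLOAD_DELAY': 1}   # overrides the global setting for this spider only

        def parse(self, response):
            # parse() is the default callback for the responses to start_urls
            for quote in response.css('div.quote'):
                yield {'text': quote.css('span.text::text').extract_first()}
            # follow pagination by yielding further Requests
            next_page = response.css('a.next::attr(href)').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)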

    ========================================================================================================================

    2. Subclass: CrawlSpider
    1) The most commonly used spider, for crawling ordinary web sites
    2) It adds two members:
    rules: a set of crawling rules -- how links are followed and which parse callback handles each link
    parse_start_url(response): parses the responses for the initial URLs
    Example:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']

        rules = (
            # follow category links, but skip subsection links (no callback, so the links are only followed)
            Rule(LinkExtractor(allow=('category.php',), deny=('subsection.php',))),
            # parse item pages with parse_item
            Rule(LinkExtractor(allow=('item.php',)), callback='parse_item'),
        )

        def parse_item(self, response):
            self.logger.info('Hi')
            item = scrapy.Item()
            item['id'] = response.xpath('//td[@id="item_id"]/text()').extract_first()
            ...
            return item


    ========================================================================================================================
    Selector

    from scrapy.selector import Selector
    from scrapy.http import HtmlResponse

    body = '<html><body><span>good</span></body></html>'
    Selector(text=body).xpath('//span/text()').extract()   # => [u'good']

    response = HtmlResponse(url='http://example.com', body=body)
    Selector(response=response).xpath('//span/text()').extract()


    response.xpath('//title/text()')
    response.css('title::text')

    response.css('img').xpath('@src').extract_first()
    response.css('img').xpath('@src').extract_first(default='not found')


    Commonly used extraction methods on a Selector: xpath / css / re / extract
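    Since re is listed above but not demonstrated, here is a small sketch (the HTML snippet and
    the pattern are made up):

    from scrapy.selector import Selector

    body = '<html><body><span>Price: 42 USD</span></body></html>'
    # re() runs a regular expression over the selected text and returns a list of strings
    Selector(text=body).xpath('//span/text()').re(r'Price: (\d+)')        # => [u'42']
    # re_first() returns only the first match (or None)
    Selector(text=body).xpath('//span/text()').re_first(r'Price: (\d+)')  # => u'42'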

    ========================================================================================================================
    item
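    The original notes leave this section empty; declaring an Item is just a matter of listing its
    fields with scrapy.Field() (the field names below are hypothetical):

    import scrapy

    class ProductItem(scrapy.Item):
        # each field is declared as a scrapy.Field(); arbitrary metadata can be passed to Field()
        id = scrapy.Field()
        name = scrapy.Field()
        price = scrapy.Field()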
    ========================================================================================================================
    item pipeline
    1. Cleaning the data
    2. Validating the data (checking that the required fields are present)
    3. Dropping duplicates
    4. Storing the data
    Example:

    from scrapy.exceptions import DropItem


    class PricePipeline(object):
        vat_factor = 1.15

        def process_item(self, item, spider):
            if item['price']:
                item['price'] = item['price'] * self.vat_factor
                return item
            else:
                raise DropItem("Missing price %s" % item)

    class MongoPipeline(object):
        collection_name = 'scrapy_items'

        def __init__(self, mongo_url, mongo_db):
            self.mongo_url = mongo_url
            self.mongo_db = mongo_db
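    The MongoPipeline above only shows the constructor; the rest of the pipeline would look
    roughly like the sketch below, assuming pymongo and settings named MONGO_URL / MONGO_DATABASE:

    import pymongo

    class MongoPipeline(object):
        collection_name = 'scrapy_items'

        def __init__(self, mongo_url, mongo_db):
            self.mongo_url = mongo_url
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            # read the connection parameters from settings.py (setting names assumed)
            return cls(
                mongo_url=crawler.settings.get('MONGO_URL'),
                mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_url)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            self.db[self.collection_name].insert_one(dict(item))
            return item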

    class DuplicatePipeline(object):
        def __init__(self):
            self.ids_seen = set()

        def process_item(self, item, spider):
            if item['id'] in self.ids_seen:
                raise DropItem("Duplicate item found: %s" % item)
            else:
                self.ids_seen.add(item['id'])
                return item

    Enabling pipelines (in settings.py):

    ITEM_PIPELINES = {
        'myproject.pipelines.PricePipeline': 300,
    }

    ========================================================================================================================
    requests

    class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])   # errback is the function called when an error occurs

    Example 1:

    def parse_page(self, response):
        return scrapy.Request('http://www.example.com', callback=self.parse_page2)

    def parse_page2(self, response):
        self.logger.info('visited %s', response.url)

    Example 2 (passing data between callbacks via meta):

    def parse_page1(self, response):
        item = MyItem()
        item['main_url'] = response.url
        request = scrapy.Request('http://www.example.com/some_page.html',
                                 callback=self.parse_page2)
        request.meta['item'] = item
        return request

    def parse_page2(self, response):
        item = response.meta['item']
        item['other_url'] = response.url
        return item
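    The errback parameter from the Request signature above is not demonstrated; a hedged sketch
    of attaching one (the URL and the handling logic are illustrative):

    from scrapy.spidermiddlewares.httperror import HttpError

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/broken-link',
                             callback=self.parse_page,
                             errback=self.handle_error)

    def handle_error(self, failure):
        # failure is a twisted Failure wrapping the original exception
        if failure.check(HttpError):
            response = failure.value.response
            self.logger.error('HttpError %s on %s', response.status, response.url)
        else:
            self.logger.error('Request failed: %s', repr(failure))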

    Subclass: FormRequest
    class scrapy.http.FormRequest(url[, formdata, ...])

    Example:
    return [FormRequest(url='http://www.example.com/post/action',
                        formdata={'name': 'John Doe', 'age': '27'},
                        callback=self.after_post)]

    Example 2 (logging in via a form):

    class LoginSpider(scrapy.Spider):
        name = 'example.com'
        start_urls = ['http://www.example.com/users/login.php']

        def parse(self, response):
            return scrapy.FormRequest.from_response(
                response,
                formdata={'username': 'John', 'password': 'secret'},
                callback=self.after_login
            )

        def after_login(self, response):
            if 'authentication failed' in response.body:
                self.logger.error('Login failed')
                return


    ========================================================================================================================
    response
    class scrapy.http.Response()

    response.xpath('//p')
    response.css('p')

    Subclass: HtmlResponse
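    The Response attributes are not listed above; the ones most commonly touched in a callback
    look roughly like this:

    def parse(self, response):
        self.logger.info('url: %s', response.url)           # final URL of the response
        self.logger.info('status: %d', response.status)     # HTTP status code
        self.logger.info('headers: %s', response.headers)   # response headers
        body = response.body                                 # raw body as bytes
        item = response.meta.get('item')                     # data passed along from the Request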

    ========================================================================================================================
    import logging
    logging.warning('This is a warning')

    Usage inside Scrapy:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['']

        def parse(self, response):
            self.logger.info('parse function called on %s', response.url)

    LOG_FILE
    LOG_ENABLED
    LOG_ENCODING
    LOG_LEVEL
    LOG_FORMAT
    LOG_DATEFORMAT
    LOG_STDOUT
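    A hedged example of how these settings might be combined in settings.py (values are illustrative):

    # settings.py
    LOG_ENABLED = True
    LOG_FILE = 'scrapy.log'            # write the log to a file instead of standard error
    LOG_ENCODING = 'utf-8'
    LOG_LEVEL = 'INFO'                 # CRITICAL, ERROR, WARNING, INFO or DEBUG
    LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
    LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
    LOG_STDOUT = False                 # if True, everything printed to stdout is redirected to the log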

    ========================================================================================================================
    Stats Collections
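    The original notes leave this section empty; a minimal sketch of how the stats collector is
    typically used from inside a spider (the custom stat names are made up):

    import scrapy

    class StatsExampleSpider(scrapy.Spider):
        name = 'stats_example'
        start_urls = ['http://www.example.com']

        def parse(self, response):
            # the stats collector is exposed as crawler.stats
            self.crawler.stats.inc_value('custom/pages_seen')
            self.crawler.stats.set_value('custom/last_url', response.url)
            self.crawler.stats.max_value('custom/max_body_size', len(response.body))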


    ========================================================================================================================
    ================================================= A deeper look at the Scrapy framework =====================================================
    ========================================================================================================================
    scrapy engine:
    responsible for the data flow between the components and for triggering events when certain actions occur
    scheduler:
    accepts requests from the engine and enqueues them so they can be scheduled later
    downloader:
    fetches web pages and hands them to the engine, which passes the results on to the spiders
    spiders:
    parse responses and produce items and follow-up URLs
    item pipeline:
    processes items: cleaning, validation and persistence
    downloader middlewares:
    hooks sitting between the engine and the downloader that process requests on their way to the downloader and responses on their way back to the engine

    1. Downloader middlewares (can be overridden)

    class scrapy.downloadermiddlewares.DownloaderMiddleware
    process_request(request, spider):

    process_response(request, response, spider):

    process_exception(request, exception, spider):   # called when an exception is raised while downloading


    Built-in:
    class scrapy.downloadermiddlewares.cookies.CookiesMiddleware
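    A hedged sketch of a custom downloader middleware that stamps every outgoing request with a
    header (the header, value and middleware name are made up); it would be enabled through
    DOWNLOADER_MIDDLEWARES in settings.py:

    class CustomHeaderMiddleware(object):
        # hypothetical middleware: adds a header to every request on its way to the downloader

        def process_request(self, request, spider):
            request.headers['X-Custom-Header'] = 'my-value'
            return None   # returning None lets the request continue through the chain

        def process_response(self, request, response, spider):
            spider.logger.debug('Got %s for %s', response.status, response.url)
            return response   # must return a Response, a Request, or raise IgnoreRequest

    # settings.py
    # DOWNLOADER_MIDDLEWARES = {
    #     'myproject.middlewares.CustomHeaderMiddleware': 543,
    # }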

    2. Spider middlewares (can be overridden)

    class scrapy.spidermiddlewares.SpiderMiddleware()
    process_spider_input(response, spider)

    process_spider_output(response, result, spider)

    process_spider_exception(response, exception, spider)

    process_start_requests(start_requests, spider)

    Built-in:
    DepthMiddleware
    HttpErrorMiddleware
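    For symmetry, a hedged sketch of a spider middleware that filters the spider's output, dropping
    dict results that lack an 'id' field (the names are made up); it would be enabled through
    SPIDER_MIDDLEWARES in settings.py:

    class RequireIdMiddleware(object):
        # hypothetical spider middleware: post-processes whatever the spider callbacks yield

        def process_spider_output(self, response, result, spider):
            for element in result:
                # Requests and Items pass through; plain dicts without an 'id' are dropped
                if isinstance(element, dict) and not element.get('id'):
                    spider.logger.warning('Dropping item without id from %s', response.url)
                    continue
                yield element

    # settings.py
    # SPIDER_MIDDLEWARES = {
    #     'myproject.middlewares.RequireIdMiddleware': 543,
    # }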

    ================================================= cookies ========================================================
    Cookies identify a user by recording information on the client side.
    A cookie is essentially a piece of text. When a client sends a request and the server wants to remember the user's state, the server attaches a cookie to the response. The client stores the cookie, and whenever the browser requests that site again it sends the URL together with the cookie; the server reads the cookie to recognize the user's state, and it can also modify the cookie's contents as needed.

    Sessions identify a user by recording information on the server side.
    Session data is stored on the server.


    FormRequest

    COOKIES_ENABLED # Default: True
    If disabled, no cookies will be sent to web servers.

    Example (sending cookies with the initial request):

    class StackoverflowSpider(scrapy.Spider):
        name = ''
        start_urls = ['', ]

        def start_requests(self):
            url = ''
            cookies = {
                'dz_username': 'wst_today',
                'dz_uid': '2u3873',
                'buc_key': 'jdofqejj',
                'buc_token': 'a17384kdjfqi',
            }
            return [
                scrapy.Request(url, cookies=cookies),
            ]

        def parse(self, response):
            ele = response.xpath('//table[@class="hello"]/text()')
            if ele:
                print('success')