  • crawlspider

    CrawlSpider extracts URLs via rules

    Create a CrawlSpider spider:

    scrapy genspider -t crawl baidu baidu.com
    rules = (
            Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
        )            
    allow            regex match against the URL
    restrict_css     CSS match; extracts URLs automatically
    restrict_xpaths  XPath match; extracts URLs automatically
    # restrict_xpaths=("//div[@class='a']/li") extracts the URLs of all a tags under those li elements and requests them
    Rule(LinkExtractor(restrict_xpaths=("//div[@class='a']/li")), callback='parse_item', follow=True),

    The generated spider:
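    As a minimal sketch of what the `allow` filter does, the pattern is applied as a regular expression against each candidate URL; the URLs below are made-up examples, and plain `re` stands in for LinkExtractor's internal matching:

    ```python
    import re

    # Hypothetical candidate URLs found on a page.
    candidates = [
        "http://example.com/Items/1.html",
        "http://example.com/about.html",
    ]

    # Same pattern as allow=r'Items/' in the Rule above.
    allow = re.compile(r"Items/")

    # Only URLs whose text matches the pattern are kept and requested.
    matched = [url for url in candidates if allow.search(url)]
    print(matched)  # only the Items/ URL survives
    ```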

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    
    class CfSpider(CrawlSpider):
        name = 'cf'
        allowed_domains = ['circ.gov.cn']
        start_urls = ['http://circ.gov.cn/']
        # Extraction rules; follow=True keeps extracting links from each response (needed e.g. to reach the next page)
        rules = (
            Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
            item = {}
            #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
            #item['name'] = response.xpath('//div[@id="name"]').get()
            #item['description'] = response.xpath('//div[@id="description"]').get()
            return item
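    Conceptually, a Rule with follow=True means: extract matching links from every response, run the callback on each, and keep extracting from the pages those links lead to. A rough pure-Python sketch (not Scrapy itself; the page graph is hypothetical):

    ```python
    import re

    # Hypothetical site: each URL maps to the links found on that page.
    pages = {
        "http://example.com/": ["http://example.com/Items/1", "http://example.com/other"],
        "http://example.com/Items/1": ["http://example.com/Items/2"],
        "http://example.com/Items/2": [],
    }
    allow = re.compile(r"Items/")  # the Rule's allow filter

    def crawl(start):
        seen, queue, items = set(), [start], []
        while queue:
            url = queue.pop(0)
            if url in seen or url not in pages:
                continue
            seen.add(url)
            for link in pages[url]:
                if allow.search(link):   # only links matching allow are requested
                    items.append(link)   # callback='parse_item' would run here
                    queue.append(link)   # follow=True: keep extracting from this page too
        return items

    print(crawl("http://example.com/"))
    ```

    With follow=False, links found on the matched pages themselves would not be queued, so the crawl would stop after one hop.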

     Crawling Tencent job postings

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    
    class HrSpider(CrawlSpider):
        name = 'hr'
        allowed_domains = ['tencent.com']
        start_urls = ['https://hr.tencent.com/index.php']
        rules = (
            # https://hr.tencent.com/social.php
            Rule(LinkExtractor(allow=r'https://hr.tencent.com/position.php'),
                 callback='parse_item', follow=True),
            # next page
            Rule(LinkExtractor(allow=r'https://hr\.tencent\.com/position\.php\?keywords=&tid=0&start=\d{1}0#a'),
                 follow=True),
        )

        def parse_item(self, response):
            item = {}
            tr_list = response.xpath("//table[@class='tablelist']/tr")[1:-1]
            for tr in tr_list:
                item['title'] = tr.xpath("./td[1]/a/text()").extract_first()
                item['position'] = tr.xpath("./td[2]/text()").extract_first()
                item['pub_date'] = tr.xpath("./td[5]/text()").extract_first()
                yield item
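    As a quick sanity check with plain `re`, the next-page pattern (with its regex metacharacters escaped, and `\d{1}0` meaning one digit followed by a literal 0) accepts start=10, start=20, and so on, but not start=0. The URLs below are illustrative:

    ```python
    import re

    # The next-page allow pattern, with '.', '?' escaped as regex literals.
    pattern = re.compile(
        r"https://hr\.tencent\.com/position\.php\?keywords=&tid=0&start=\d{1}0#a"
    )

    urls = [
        "https://hr.tencent.com/position.php?keywords=&tid=0&start=10#a",  # page 2
        "https://hr.tencent.com/position.php?keywords=&tid=0&start=0#a",   # page 1
    ]
    print([bool(pattern.search(u)) for u in urls])  # [True, False]
    ```

    Note that the first page (start=0) is not matched by this pattern; it is reached via start_urls instead.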
  • Original post: https://www.cnblogs.com/tangpg/p/10778715.html