  • Scrapy crawler framework (Part 1)

    Create the project

    scrapy startproject <project_name>
    

    Create the spider file

    First cd into the project folder, then run:

    scrapy genspider <spider_name> <site_domain>
    

    Edit the configuration file settings.py

    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en',
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    }
    
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
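
    Scrapy also has a dedicated USER_AGENT setting, so the same configuration can be written with the user agent kept out of the header dict. The fragment below is only a sketch of that variant, assuming it goes into the project's settings.py; the values are the ones used above.

```python
# settings.py (fragment): variant of the configuration above.

# USER_AGENT is Scrapy's own setting for the default User-Agent header.
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
)

# The remaining default headers stay in the dict.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# False means Scrapy does not check robots.txt before requesting a page.
ROBOTSTXT_OBEY = False
```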
    

    First example

    Scraping Qiushibaike (糗事百科)

    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class QsbkSpider(scrapy.Spider):
        name = 'qsbk'
        allowed_domains = ['www.yicommunity.com']
        start_urls = ['http://www.yicommunity.com/']
    
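        def parse(self, response):
            print("=" * 80)
            # Select every direct child <div> of the container with class "col1".
            contents = response.xpath('//div[@class="col1"]/div')
            print(contents)
            print("=" * 80)
            for content in contents:
                # .get() returns the first matching text node, or None if nothing matches.
                author = content.xpath("./div[@class='author']/text()").get()
                word = content.xpath("./div[@class='content']/text()").get()
                print(author, word)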
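
    To see how the two-level XPath lookup in parse() works without running a crawl, the same path logic can be tried on a small fragment. The sketch below is not Scrapy: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, and the sample markup and the extract_posts name are made up for illustration. Real pages are rarely well-formed XML, so this only demonstrates the selection logic.

```python
import xml.etree.ElementTree as ET

# Made-up fragment that imitates the structure the spider expects.
SAMPLE = """
<html><body>
  <div class="col1">
    <div>
      <div class="author">Alice</div>
      <div class="content">First joke</div>
    </div>
    <div>
      <div class="author">Bob</div>
      <div class="content">Second joke</div>
    </div>
  </div>
</body></html>
"""


def extract_posts(markup):
    """Return (author, content) pairs, one per child <div> of div.col1."""
    root = ET.fromstring(markup.strip())
    posts = []
    # Counterpart of //div[@class="col1"]/div in the spider.
    for post in root.findall(".//div[@class='col1']/div"):
        # Relative lookups, like ./div[@class='author']/text() in the spider.
        author = post.findtext("./div[@class='author']")
        word = post.findtext("./div[@class='content']")
        posts.append((author, word))
    return posts


for author, word in extract_posts(SAMPLE):
    print(author, word)
```

    As with .get() in the spider, a post that has no matching child gives None for that field instead of raising an error.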
    
    

    Run it from the command line

    scrapy crawl qsbk
    

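    Running in PyCharm

    Create a start.py file in the same directory as pyvenv.cfg. Scrapy locates the project through scrapy.cfg, so the script has to run from the directory that contains scrapy.cfg (or a subdirectory of it).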

    from scrapy import cmdline
    
    cmdline.execute("scrapy crawl qsbk".split())
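
    cmdline.execute takes the command as a list of arguments, which is why the string above is split first. A slightly more flexible start.py might build that list with a small helper. This is only a sketch: build_crawl_argv and main are hypothetical names, and --nolog is Scrapy's option for turning off log output so that only the spider's print output remains.

```python
def build_crawl_argv(spider_name, *extra_args):
    """Return the argument list for `scrapy crawl <spider_name>` (hypothetical helper)."""
    return ["scrapy", "crawl", spider_name, *extra_args]


def main():
    # Imported here so the helper above can be used without Scrapy installed.
    from scrapy import cmdline

    # Same effect as "scrapy crawl qsbk --nolog" on the command line.
    cmdline.execute(build_crawl_argv("qsbk", "--nolog"))
```

    Calling main() at the bottom of start.py starts the crawl; build_crawl_argv("qsbk") alone produces the same list as "scrapy crawl qsbk".split().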
    
  • Original article: https://www.cnblogs.com/senup/p/12319005.html