  • Scrapy in practice: crawling the novel 《宦海沉浮》

    Target site: http://www.shushu8.com/huanhaichenfu

    Step 1: Create the project

    KeysdeMacBook:Desktop keys$ scrapy startproject MyCrawl
    New Scrapy project 'MyCrawl', using template directory '/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/templates/project', created in:
        /Users/keys/Desktop/MyCrawl
    You can start your first spider with:
        cd MyCrawl
        scrapy genspider example example.com
    

    Step 2: Generate the spider

    KeysdeMacBook:Desktop keys$ cd MyCrawl/
    KeysdeMacBook:MyCrawl keys$ scrapy genspider FirstSpider www.shushu8.com/huanhaichenfu
    

    Step 3: Define the fields in items.py

    import scrapy
    
    
    class MycrawlItem(scrapy.Item):
        url = scrapy.Field()
        title = scrapy.Field()
        text = scrapy.Field()
    
     
    Step 4: Write the spider
    # -*- coding: utf-8 -*-
    import scrapy
    from MyCrawl.items import MycrawlItem
    
    
    class FirstspiderSpider(scrapy.Spider):
        name = 'FirstSpider'
        allowed_domains = ['www.shushu8.com']  # domains only; a URL path here is invalid and can break offsite filtering
        start_urls = ['http://www.shushu8.com/huanhaichenfu/'+str(i+1) for i in range(502)]
    
        def parse(self, response):
            url = response.url
            # Chapter title from the page header
            title = response.xpath('//*[@id="main"]/div[2]/div/div[1]/h1/text()').extract_first('')
            # All text nodes of the chapter body
            text = response.css('#content::text').extract()
    
            myitem = MycrawlItem()
            myitem['url'] = url
            myitem['title'] = title
            myitem['text'] = ','.join(text)  # join the body fragments into one comma-separated string
    
            yield myitem
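
The start_urls comprehension above expands to one URL per chapter page, from 1 through 502. The same expansion can be checked standalone:

```python
# Sketch of the start_urls expansion used by the spider above.
base = 'http://www.shushu8.com/huanhaichenfu/'
start_urls = [base + str(i + 1) for i in range(502)]

print(start_urls[0])    # http://www.shushu8.com/huanhaichenfu/1
print(start_urls[-1])   # http://www.shushu8.com/huanhaichenfu/502
print(len(start_urls))  # 502
```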
    

    Step 5: Write pipelines.py

    # -*- coding: utf-8 -*-
    import pymysql
    
    class MysqlPipeline(object):
        # Write items to MySQL synchronously
        def __init__(self):
            self.conn = pymysql.connect(
                '127.0.0.1',
                'root',
                'rootkeys',
                'Article',
                charset="utf8",
                use_unicode=True)
            self.cursor = self.conn.cursor()
    
        def process_item(self, item, spider):
            insert_sql = """
                insert into huanhaichenfu(url, title, text)
                VALUES (%s, %s, %s)
            """
            # Pass the values separately so the driver escapes them
            self.cursor.execute(
                insert_sql,
                (item["url"],
                 item["title"],
                 item["text"]))
            self.conn.commit()
            return item  # hand the item on to any later pipelines

        def close_spider(self, spider):
            self.conn.close()
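
The pipeline assumes a database named `Article` containing a `huanhaichenfu` table; it never creates them. A schema matching the insert statement might look like the following (the column types are an assumption, since the original does not show them):

```sql
-- Hypothetical schema matching the insert in MysqlPipeline
CREATE DATABASE IF NOT EXISTS Article DEFAULT CHARACTER SET utf8;

CREATE TABLE IF NOT EXISTS Article.huanhaichenfu (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    url   VARCHAR(255) NOT NULL,
    title VARCHAR(255),
    text  LONGTEXT
) DEFAULT CHARACTER SET utf8;
```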
    

    Step 6: Configure settings.py

    # -*- coding: utf-8 -*-
    
    BOT_NAME = 'MyCrawl'
    SPIDER_MODULES = ['MyCrawl.spiders']
    NEWSPIDER_MODULE = 'MyCrawl.spiders'
    
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    ROBOTSTXT_OBEY = False
    ITEM_PIPELINES = {
       'MyCrawl.pipelines.MysqlPipeline': 1,
    }
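
With ROBOTSTXT_OBEY disabled and 502 URLs queued at once, it may be worth throttling requests so the site is not hammered. These are standard Scrapy settings; the values below are suggestions, not part of the original:

```python
# Optional throttling (settings.py) -- values are a suggestion
DOWNLOAD_DELAY = 0.5                # seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap parallel requests per domain
```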
    

    Step 7: Run the spider

    import os
    import sys
    from scrapy.cmdline import execute
    
    
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    
    run_spider = 'FirstSpider'
    
    if __name__ == '__main__':
        print('Running Spider of ' + run_spider)
        execute(['scrapy', 'crawl', run_spider])
    

      

  • Original article: https://www.cnblogs.com/Keys819/p/10391650.html