  • Scrapy in Action: Crawling the Novel 《宦海沉浮》

    Target site: http://www.shushu8.com/huanhaichenfu/

    Step 1: Create the Project

    KeysdeMacBook:Desktop keys$ scrapy startproject MyCrawl
    New Scrapy project 'MyCrawl', using template directory '/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/templates/project', created in:
        /Users/keys/Desktop/MyCrawl
    You can start your first spider with:
        cd MyCrawl
        scrapy genspider example example.com
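
    startproject lays down Scrapy's standard template, so the new project looks like this:

    MyCrawl/
        scrapy.cfg            # deploy/config entry point
        MyCrawl/
            __init__.py
            items.py          # item definitions (step 3)
            middlewares.py
            pipelines.py      # item pipelines (step 5)
            settings.py       # project settings (step 6)
            spiders/
                __init__.py   # generated spiders live here (step 2)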
    

    Step 2: Generate the Spider

    KeysdeMacBook:Desktop keys$ cd MyCrawl/
    KeysdeMacBook:MyCrawl keys$ scrapy genspider FirstSpider www.shushu8.com
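
    genspider expects a bare domain (a URL path here would leak into allowed_domains) and writes MyCrawl/spiders/FirstSpider.py from the basic template. The skeleton is roughly the following; exact output varies slightly across Scrapy versions:

    # -*- coding: utf-8 -*-
    import scrapy


    class FirstspiderSpider(scrapy.Spider):
        name = 'FirstSpider'
        allowed_domains = ['www.shushu8.com']
        start_urls = ['http://www.shushu8.com/']

        def parse(self, response):
            pass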
    

    Step 3: Configure items.py

    import scrapy
    
    
    class MycrawlItem(scrapy.Item):
        url = scrapy.Field()
        title = scrapy.Field()
        text = scrapy.Field()
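
    scrapy.Field() carries no type information; it only registers the key, so a MycrawlItem behaves like a dict restricted to url, title and text. The spider fills these in step 4.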
    
     
    Step 4: Write the Spider
    # -*- coding: utf-8 -*-
    import scrapy
    from MyCrawl.items import MycrawlItem
    
    
    class FirstspiderSpider(scrapy.Spider):
        name = 'FirstSpider'
        allowed_domains = ['www.shushu8.com']  # domains only; a URL path here triggers an offsite-filter warning
        # The novel has 502 chapter pages: /huanhaichenfu/1 .. /huanhaichenfu/502
        start_urls = ['http://www.shushu8.com/huanhaichenfu/' + str(i + 1) for i in range(502)]
    
        def parse(self, response):
            url = response.url
            title = response.xpath('//*[@id="main"]/div[2]/div/div[1]/h1/text()').extract_first('')
            text = response.css('#content::text').extract()
    
            myitem = MycrawlItem()
            myitem['url'] = url
            myitem['title'] = title
            myitem['text'] = ','.join(text)  # join extracted text fragments into one comma-separated string
    
            yield myitem
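
    Hard-coding the 502 chapter URLs works because the chapter count is known up front. A more general sketch would discover the chapter links from the novel's index page instead; the '#list a' selector below is an assumption about the index markup and needs checking against the real page:

    # Hypothetical variant: crawl the table of contents and follow each chapter link.
    import scrapy
    from MyCrawl.items import MycrawlItem


    class TocSpider(scrapy.Spider):
        name = 'TocSpider'
        allowed_domains = ['www.shushu8.com']
        start_urls = ['http://www.shushu8.com/huanhaichenfu/']

        def parse(self, response):
            # '#list a' is an assumed selector for the chapter list
            for href in response.css('#list a::attr(href)').extract():
                yield response.follow(href, callback=self.parse_chapter)

        def parse_chapter(self, response):
            myitem = MycrawlItem()
            myitem['url'] = response.url
            myitem['title'] = response.xpath('//h1/text()').extract_first('')
            myitem['text'] = ','.join(response.css('#content::text').extract())
            yield myitem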
    

    Step 5: Configure pipelines.py

    # -*- coding: utf-8 -*-
    import pymysql
    
    class MysqlPipeline(object):
        # Write items to MySQL synchronously
        def __init__(self):
            # Keyword arguments keep this compatible with pymysql 1.0+,
            # which dropped positional connect() parameters.
            self.conn = pymysql.connect(
                host='127.0.0.1',
                user='root',
                password='rootkeys',
                database='Article',
                charset='utf8',
                use_unicode=True)
            self.cursor = self.conn.cursor()
    
        def process_item(self, item, spider):
            insert_sql = """
                insert into huanhaichenfu(url, title, text)
                VALUES (%s, %s, %s)
            """
            # Parameterized %s placeholders let the driver escape the values
            self.cursor.execute(
                insert_sql,
                (item["url"],
                 item["title"],
                 item["text"]))
            self.conn.commit()
            return item  # pass the item on to any later pipeline
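
    The pipeline assumes a huanhaichenfu table already exists in the Article database. A one-off setup sketch with pymysql; the column types are assumptions (LONGTEXT because a full chapter can exceed TEXT's 64 KB limit):

    import pymysql

    # Run once before the first crawl to create the target table.
    conn = pymysql.connect(host='127.0.0.1', user='root',
                           password='rootkeys', database='Article',
                           charset='utf8')
    with conn.cursor() as cursor:
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS huanhaichenfu (
                id INT AUTO_INCREMENT PRIMARY KEY,
                url VARCHAR(255) NOT NULL,
                title VARCHAR(255),
                text LONGTEXT
            ) DEFAULT CHARSET=utf8
        """)
    conn.commit()
    conn.close()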
    

    Step 6: Configure settings.py

    # -*- coding: utf-8 -*-
    
    BOT_NAME = 'MyCrawl'
    SPIDER_MODULES = ['MyCrawl.spiders']
    NEWSPIDER_MODULE = 'MyCrawl.spiders'
    
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    ROBOTSTXT_OBEY = False
    ITEM_PIPELINES = {
       'MyCrawl.pipelines.MysqlPipeline': 1,
    }
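
    With robots.txt checks disabled and 502 requests queued at once, it is polite to throttle the crawl. These are standard Scrapy settings; the values are only illustrative:

    DOWNLOAD_DELAY = 0.5                 # pause between requests to the site
    CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap parallel requests per domain
    RETRY_TIMES = 2                      # retry transient failures twice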
    

    Step 7: Run the Spider

    import os
    import sys
    from scrapy.cmdline import execute
    
    
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))  # make the project importable when run directly
    
    run_spider = 'FirstSpider'
    
    if __name__ == '__main__':
        print('Running Spider of ' + run_spider)
        execute(['scrapy', 'crawl', run_spider])
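
    Save the script as run.py next to scrapy.cfg and launch it from the project root with python run.py; scrapy.cmdline.execute locates the project by finding scrapy.cfg from the current working directory.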
    

      

  • Original post: https://www.cnblogs.com/Keys819/p/10391650.html