  • A first try with Scrapy

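    Scrapy now supports Python 3.

    In PyCharm, it can be installed from the menu File > Default Settings > Project Interpreter by searching for the package and installing it;

    It can also be installed with pip: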

    $ pip install scrapy==1.1.0rc1

    In Scrapy, each Item object represents one page of a website. An item can declare different fields (url, content, header, image).

    First, create a Scrapy project in the current directory:

    $ scrapy startproject wikiSpider

    This creates a wikiSpider project folder containing items.py, settings.py, a spiders folder, and so on;

    In the spiders folder, create articleSpider.py:

    from scrapy import Spider
    from wikiSpider.items import Article


    class ArticleSpider(Spider):
        # Name used on the command line: scrapy crawl article
        name = 'article'
        # Requests to other domains are filtered out
        allowed_domains = ['en.wikipedia.org']
        start_urls = [
            'http://en.wikipedia.org/wiki/Main_Page',
            'http://en.wikipedia.org/wiki/Python_%28programming_language%29',
        ]

        def parse(self, response):
            # Called once per downloaded page; store the <h1> text in an Article
            item = Article()
            title = response.xpath('//h1/text()')[0].extract()
            print('title is :' + title)
            item['title'] = title
            return item

    Change items.py to:

    from scrapy import Item, Field


    class Article(Item):
        # One field per piece of data scraped from a page
        title = Field()

    Also set the log level in settings.py so the output is easier to read:

    LOG_LEVEL = 'ERROR'
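
    Scrapy's logging is built on Python's standard logging module, and LOG_LEVEL sets the minimum severity that gets reported. The sketch below shows the same filtering with a plain logger; the logger name and messages are invented for the example and are not Scrapy output.

```python
import io
import logging

# Stand-in logger that writes to an in-memory buffer instead of the console.
buffer = io.StringIO()
logger = logging.getLogger('loglevel_sketch')
logger.propagate = False
logger.addHandler(logging.StreamHandler(buffer))

# Equivalent in spirit to LOG_LEVEL = 'ERROR': drop anything below ERROR.
logger.setLevel(logging.ERROR)
logger.debug('a debug message')
logger.info('an info message')
logger.error('an error message')

output = buffer.getvalue()
print(output)
```

    Only the error message reaches the buffer. Text written with print() does not go through logging, so the 'title is :' lines from the spider are shown regardless of the log level.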

    Then run the following from the wikiSpider project root directory:

    $ scrapy crawl article

    The following debug output should appear:

    title is :Main Page
    title is :Python (programming language)
  • Original article: https://www.cnblogs.com/vivivi/p/5917577.html