  • [scrapy] Using Goose for article body extraction in Scrapy

    import scrapy
    from goose import Goose  # provided by the python-goose package
    
    class Article(scrapy.Item):
        # Item holding the extracted title and main body text
        title = scrapy.Field()
        text = scrapy.Field()
    
    class MyGooseSpider(scrapy.Spider):
        name = 'goose'
        start_urls = [
            'http://blog.scrapinghub.com/2014/06/18/extracting-schema-org-microdata-using-scrapy-selectors-and-xpath/',
            'http://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/',
        ]
    
        def parse(self, response):
            # Hand Goose the HTML that Scrapy already downloaded (raw_html),
            # so Goose does not fetch the page a second time.
            article = Goose().extract(raw_html=response.body)
            yield Article(title=article.title, text=article.cleaned_text)
    

    Reposted from: http://stackoverflow.com/questions/26940002/can-i-use-scrapy-with-goose
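
The spider above passes `response.body` straight to Goose. Under Python 3 that value is `bytes`, so it may be safer to decode it first. The helper below is a minimal sketch of that step and is not part of the original answer: the name `extract_article` and its `encoding` parameter are hypothetical, and the extractor is assumed to expose `extract(raw_html=...)` and return an object with `title` and `cleaned_text`, as Goose does in the spider.

```python
def extract_article(extractor, body, encoding="utf-8"):
    """Run a Goose-like extractor over a response body and return a plain dict.

    `extractor` is any object with an extract(raw_html=...) method, for
    example a Goose instance. `body` may be bytes or str.
    """
    # Decode bytes so the extractor always receives text; undecodable
    # bytes are replaced instead of raising an exception.
    if isinstance(body, bytes):
        html = body.decode(encoding, errors="replace")
    else:
        html = body
    article = extractor.extract(raw_html=html)
    return {"title": article.title, "text": article.cleaned_text}


# Possible use inside the spider's parse(), assuming the goose3 package
# (a Python 3 port of Goose) is installed:
#     from goose3 import Goose
#     yield Article(**extract_article(Goose(), response.body, response.encoding))
```

Taking the extractor as an argument lets one Goose instance be reused across responses, and lets the helper be exercised without network access.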

  • Original URL: https://www.cnblogs.com/bushe/p/4757981.html