[scrapy] Using Goose for article body extraction in Scrapy

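import scrapy
# python-goose (Python 2). On Python 3 the goose3 fork is the usual
# substitute: from goose3 import Goose
from goose import Goose


class Article(scrapy.Item):
    # Fields filled in from Goose's extraction result
    title = scrapy.Field()
    text = scrapy.Field()


class MyGooseSpider(scrapy.Spider):
    name = 'goose'
    start_urls = [
        'http://blog.scrapinghub.com/2014/06/18/extracting-schema-org-microdata-using-scrapy-selectors-and-xpath/',
        'http://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/',
    ]

    def parse(self, response):
        # Pass the HTML Scrapy already downloaded to Goose via raw_html,
        # so Goose does not fetch the page a second time.
        article = Goose().extract(raw_html=response.body)
        yield Article(title=article.title, text=article.cleaned_text)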
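
The extraction step in parse can also be pulled out into a small helper, which makes it possible to check the title/text mapping without a network connection or Goose installed. The following is only a minimal sketch: `article_to_item` and `StubExtractor` are hypothetical names that do not appear in the original answer, and the stub merely imitates the `extract(raw_html=...)` call and the `title` / `cleaned_text` attributes used above.

```python
from types import SimpleNamespace


def article_to_item(extractor, html):
    """Run a Goose-style extractor on raw HTML and return title/text as a dict."""
    article = extractor.extract(raw_html=html)
    return {
        # Guard against None and trim surrounding whitespace
        "title": (article.title or "").strip(),
        "text": (article.cleaned_text or "").strip(),
    }


class StubExtractor:
    """Hypothetical stand-in for Goose(), used only to exercise the helper."""

    def extract(self, raw_html):
        return SimpleNamespace(title=" Example title ", cleaned_text="Body text.\n")


item = article_to_item(StubExtractor(), "<html></html>")
print(item)
```

In the spider, one possible use would be to create a single Goose() instance on the spider and yield Article(**article_to_item(that_instance, response.body)) from parse, rather than constructing a new Goose object for every response.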

Reposted from: http://stackoverflow.com/questions/26940002/can-i-use-scrapy-with-goose

Original article: https://www.cnblogs.com/bushe/p/4757981.html