  • Extracting data with goose

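    1. Introduction

    Python-goose is a Python rewrite of Goose, an article extraction tool originally written in Java. Given any news article or article-style web page, it aims to extract not only the main body of the article but also all of its metadata and images. It supports Chinese web pages.
    Python-goose can extract the following:

    • Main body text of the article
    • Main image of the article
    • Any YouTube/Vimeo videos embedded in the article
    • Meta description
    • Meta tags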

    2. Installation

    virtualenv --no-site-packages goose
    cd goose
    # on Windows
    Scripts\activate
    # on Linux, use: source bin/activate
    git clone https://github.com/grangier/python-goose.git
    cd python-goose
    pip install -r requirements.txt
    python setup.py install

    3. Usage

    >>> from goose import Goose
    >>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
    >>> g = Goose()
    >>> article = g.extract(url=url)
    >>> article.title
    u'Occupy London loses eviction fight'
    >>> article.meta_description
    "Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
    >>> article.cleaned_text[:150]
    (CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
    >>> article.top_image.src
    http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg
    

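    For Chinese articles, pass the Chinese stopwords class in the configuration (a browser user agent can be set in the same dictionary):

    from goose.text import StopWordsChinese
    g = Goose({'browser_user_agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
               'stopwords_class': StopWordsChinese})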
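
    When the same script handles both Chinese and non-Chinese pages, the configuration dictionary can be assembled by a small helper. This is only a sketch under stated assumptions: `build_goose_config` and `extract_chinese` are hypothetical names, and the only configuration keys used are the two that appear above. The goose imports are kept inside `extract_chinese` so the helper itself can be used without goose installed.

```python
def build_goose_config(user_agent=None, stopwords_class=None):
    """Return a config dict containing only the options that were given."""
    config = {}
    if user_agent is not None:
        config['browser_user_agent'] = user_agent
    if stopwords_class is not None:
        config['stopwords_class'] = stopwords_class
    return config


def extract_chinese(url, user_agent=None):
    """Extract a Chinese article (hypothetical wrapper, requires goose)."""
    # Imported lazily so build_goose_config stays usable without goose.
    from goose import Goose
    from goose.text import StopWordsChinese
    g = Goose(build_goose_config(user_agent, StopWordsChinese))
    return g.extract(url=url)
```

    Calling `build_goose_config()` with no arguments yields an empty dictionary, so default settings are left untouched unless an option is supplied explicitly.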

    References:

    https://pypi.python.org/pypi/goose-extractor/

  • Original article: https://www.cnblogs.com/hupeng1234/p/6685395.html