  • Extracting and denoising web page body text (readability/Document)

    Install: pip install readability-lxml

    Usage:

    # encoding: utf-8
    import re

    import html2text
    import requests
    from readability import Document


    res = requests.get('http://finance.sina.com.cn/roll/2019-02-12/doc-ihrfqzka5034116.shtml')

    # Parse the page once and reuse the Document for both the title and the body
    doc = Document(res.content)

    # Get the news title
    readable_title = doc.short_title()
    # Get the main content and clean it up
    readable_article = doc.summary()
    # Strip <div> wrappers
    text_p = re.sub(r'</?div.*?>', '', readable_article)
    # Strip hyperlink tags but keep the link text
    text_p = re.sub(r'((</p>)?<a href=.*?>|</a>(<p>)?)', '', text_p)
    # Drop <select> elements together with their contents
    text_p = re.sub(r'<select>.*?</select>', '', text_p)
    print(text_p)
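
    The three substitutions above can be tried offline on a small HTML string. The sketch below is only an illustration: the helper name clean_summary and the sample markup are assumptions, not part of the original post, and only the standard re module is used.

```python
import re


def clean_summary(html):
    """Apply the same three regex clean-up steps used in the post.

    clean_summary is a hypothetical helper name chosen for this sketch.
    """
    # Remove opening and closing <div> tags (attributes included)
    text = re.sub(r'</?div.*?>', '', html)
    # Remove <a href=...> and </a> tags, keeping the link text
    text = re.sub(r'((</p>)?<a href=.*?>|</a>(<p>)?)', '', text)
    # Remove <select> elements and everything inside them
    text = re.sub(r'<select>.*?</select>', '', text)
    return text


# Made-up sample markup, standing in for the output of Document.summary()
sample = (
    '<html><body><div class="article">'
    '<p>First paragraph.</p>'
    '<p>See <a href="http://example.com">this link</a> for more.</p>'
    '<select><option>1</option></select>'
    '</div></body></html>'
)

cleaned = clean_summary(sample)
print(cleaned)
```

    Note that these patterns do not use re.DOTALL, so a <select> element that spans several lines would not be matched. For such pages the flag would probably need to be added.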
    Using html2text:

    Install: pip install html2text

    Usage:

    import html2text
    import requests


    def test_func2(html):
        """Convert the HTML of a fetched page to plain text."""
        h = html2text.HTML2Text()
        h.ignore_links = True  # True drops hyperlinks, False keeps them
        print(h.handle(html))


    res = requests.get('http://finance.sina.com.cn/roll/2019-02-12/doc-ihrfqzka5034116.shtml')

    test_func2(res.content.decode('utf-8'))
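
    The call above assumes the page is encoded as UTF-8, and decode('utf-8') raises UnicodeDecodeError on pages served in another encoding such as GBK. A minimal sketch of a more forgiving decode step follows. The helper name decode_html and the list of fallback encodings are assumptions made for illustration, not something the original post uses.

```python
def decode_html(raw, encodings=("utf-8", "gb18030")):
    """Decode raw bytes, trying each candidate encoding in order.

    decode_html and the default encoding list are hypothetical choices
    for this sketch; adjust the list to the sites being scraped.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding, replacing bad bytes
    return raw.decode(encodings[0], errors="replace")


# GBK-encoded bytes are not valid UTF-8, so the fallback is exercised here
gbk_bytes = "<p>中文新闻</p>".encode("gbk")
decoded = decode_html(gbk_bytes)
print(decoded)
```

    With such a helper, the last line of the snippet above could be written as test_func2(decode_html(res.content)).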


  • Original article: https://www.cnblogs.com/fanjp666888/p/10441835.html