  • NLP (11): Extracting Text Summaries

    Original link: http://www.one2know.cn/nlp11/

    • The summarize function from the gensim.summarization library
      gensim.summarization.summarize(text, ratio=0.2, word_count=None, split=False)
      Parameters:
      text : str
      Given text.
      ratio : float, optional
      Number between 0 and 1 that determines the proportion of the number of
      sentences of the original text to be chosen for the summary.
      word_count : int or None, optional
      Determines how many words the output will contain.
      If both parameters are provided, the ratio will be ignored.
      split : bool, optional
      If True, a list of sentences will be returned. Otherwise a single
      joined string will be returned.
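
To make the three optional parameters concrete, here is a minimal pure-Python sketch of a text-ranking summarizer that accepts the same `ratio`, `word_count` and `split` arguments. It is not gensim's implementation: `summarize_sketch`, `split_sentences`, `similarity` and `rank_sentences` are hypothetical names, the sentence splitter is deliberately naive, and the word-overlap similarity and newline joining are assumptions of this sketch.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def similarity(words_a, words_b):
    """Word overlap normalised by sentence lengths (assumed for this sketch)."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    overlap = len(set(words_a) & set(words_b))
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration over the similarity graph."""
    tokens = [re.findall(r'\w+', s.lower()) for s in sentences]
    n = len(sentences)
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize_sketch(text, ratio=0.2, word_count=None, split=False):
    sentences = split_sentences(text)
    if not sentences:
        return [] if split else ''
    scores = rank_sentences(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    if word_count is None:
        # ratio: keep this fraction of the sentences (at least one)
        keep = ranked[:max(1, round(len(sentences) * ratio))]
    else:
        # word_count takes precedence over ratio when both are given
        keep, total = [], 0
        for i in ranked:
            n_words = len(sentences[i].split())
            if keep and total + n_words > word_count:
                break
            keep.append(i)
            total += n_words
    chosen = [sentences[i] for i in sorted(keep)]  # restore document order
    return chosen if split else '\n'.join(chosen)


# Made-up sample text, used only to exercise the sketch
sample = (
    "Consistent hashing spreads keys across servers. "
    "Servers join and leave the cluster all the time. "
    "Consistent hashing moves only a few keys when servers change. "
    "The weather was pleasant yesterday. "
    "Keys map to servers on a hash ring."
)
print(summarize_sketch(sample, ratio=0.4, split=True))
```

With `split=True` the sketch returns a list of sentences in their original order; with the default `split=False` it returns one string. Passing `word_count` makes the `ratio` value irrelevant, mirroring the parameter description above.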
    • Code
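    from gensim.summarization import summarize # text-ranking (TextRank-style) summarization; needs gensim 3.x, the module was removed in gensim 4.0
    from bs4 import BeautifulSoup # BeautifulSoup, used to parse HTML documents
    import requests # library for downloading HTTP resources
    urls = { # dictionary mapping paper title to URL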
        'Deconstructing Voice-over-IP':
        'http://scigen.csail.mit.edu/scicache/269/scimakelatex.25977.A.+G.+Hassan.html',
        'Exploration of the Location-Identity Split':
        'http://scigen.csail.mit.edu/scicache/270/scimakelatex.26087.Ali+Veli.Veli+Ali.Vel+Al.html',
    }
    # Abstracts (the real ones, for comparison):
    # 1.The implications of ambimorphic archetypes have been far-reaching and pervasive. After years of natural research into consistent hashing, we argue the simulation of public-private key pairs, which embodies the confirmed principles of theory. Such a hypothesis might seem perverse but is derived from known results. Our focus in this paper is not on whether the well-known knowledge-based algorithm for the emulation of checksums by Herbert Simon runs in Θ( n ) time, but rather on exploring a semantic tool for harnessing telephony (Swale).
    # 2.Superblocks must work. Given the current status of homogeneous configurations, security experts particularly desire the simulation of 802.11b. we consider how the Internet can be applied to the refinement of Scheme.
    for title, url in urls.items():
        r = requests.get(url, timeout=30) # a timeout keeps a dead server from hanging the script
        soup = BeautifulSoup(r.text,'html.parser')
        data = soup.get_text() # page text with the HTML tags stripped
        pos1 = data.find('1 Introduction') + len('1 Introduction')
        pos2 = data.find('Related Work')
        text = data[pos1:pos2].strip() # the introduction, i.e. the text between pos1 and pos2
        print('PAPER URL: {}'.format(url))
        print('TITLE: {}'.format(title))
        print('GENERATED SUMMARY: {}'.format(summarize(text)))
        print()
    

    Output:

    PAPER URL: http://scigen.csail.mit.edu/scicache/269/scimakelatex.25977.A.+G.+Hassan.html
    TITLE: Deconstructing Voice-over-IP
    GENERATED SUMMARY: ......
    
    PAPER URL: http://scigen.csail.mit.edu/scicache/270/scimakelatex.26087.Ali+Veli.Veli+Ali.Vel+Al.html
    TITLE: Exploration of the Location-Identity Split
    GENERATED SUMMARY: ......
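
The slicing step in the code relies on `str.find`, which returns -1 when a marker is missing. In that case the script still produces a slice, just the wrong one, and the second search can also match a 'Related Work' that appears before the introduction. A minimal sketch of a more defensive version follows; `extract_between` is a hypothetical helper, not part of the original script, and the sample page text is made up.

```python
def extract_between(data, start_marker, end_marker):
    """Return the text between two markers, or None if either marker is missing."""
    start = data.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = data.find(end_marker, start)  # only search after the start marker
    if end == -1:
        return None
    return data[start:end].strip()


# Made-up page text standing in for soup.get_text()
page = "Abstract\n1 Introduction\nSuperblocks must work.\nRelated Work\nPrior studies"
print(extract_between(page, '1 Introduction', 'Related Work'))
print(extract_between(page, '1 Introduction', 'No Such Heading'))
```

In the loop, a `None` result could be used to skip a paper whose page layout differs, instead of summarizing an arbitrary slice of the page.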
    
  • Original address: https://www.cnblogs.com/peng8098/p/nlp_11.html