zoukankan      html  css  js  c++  java
  • 自然语言处理----处理原始文本

    本文主要介绍编程访问网络文本的几种方式。

    1. 访问网络资源

    >>> from urllib import urlopen
    >>> url='http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.astype.html'
    >>> raw=urlopen(url).read()
    >>> type(raw)
    <type 'str'>
    >>> len(raw)
    16429
    >>> raw[:75]
    '
    
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://'
    View Code

    如果Python无法正确自动检测出Internet代理,可以使用下面方法手动指定。

    >>> proxies={'http': 'http://www.someproxy.com:3128'}
    >>> raw=urlopen(url, proxies=proxies).read()

    2. 访问博客

    在Universal Feed Parser的第三方python库的帮助下,可以访问博客的内容。

    >>> import feedparser
    >>> llog=feedparser.parse('http://weibo.com/ttarticle/p/show?id=2309404116343489194022')
    >>> llog.keys()
    ['feed', 'status', 'version', 'encoding', 'bozo', 'headers', 'href', 'namespaces', 'entries', 'bozo_exception']
    >>> type(llog['feed'])
    <class 'feedparser.FeedParserDict'>
    >>> llog['feed'].keys()
    ['meta', 'summary']
    >>> llog['feed']['meta']
    {'content': u'text/html; charset=gb2312', 'http-equiv': u'Content-type'}
    >>> llog['feed']['summary']
    u'<span id="message"></span>
    
    &amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;'
    View Code

    3. 处理html

     一般有三种方式:正则匹配, nltk.clean_html(), BeautifulSoup. 正则表达式比较繁琐,而nltk.clean_html()现在已经不支持了,比较简单常用的是用BeautifulSoup包。

    from bs4 import BeautifulSoup
    
    html_doc=''' 
        <html><head><title>The Document's story</title></head>
        <html><head><title>The Dormouse's story</title></head>
        <body>
        <p class="title"><b>The Dormouse's story</b></p>
    
        <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.</p>
    
        <p class="story">...</p>
        </body></html>
    '''
    soup = BeautifulSoup(html_doc, 'html.parser')
    content=soup.get_text()
    print content

    运行结果如下:

    runfile('D:/my project/e_book/XXMLV-2/4.Python_代码/test.py', wdir='D:/my project/e_book/XXMLV-2/4.Python_代码')
    
    The Document's story
    The Dormouse's story
    
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were
        Elsie,
        Lacie and
        Tillie;
            and they lived at the bottom of a well.
    ...
  • 相关阅读:
    广陵基地输电线路实训场
    广陵基地配电网综合实训室
    广陵基地电缆实训室
    Windows Phone 9再见了!
    Windows Phone 8初学者开发—第23部分:测试并向应用商店提交
    Windows Phone 8初学者开发—第22部分:用演示图板创建卷盘的动画
    JDBC数据类型
    Java-BlockingQueue的使用
    苹果下如果安装nginx,给nginx安装markdown第三方插件
    苹果电脑包管理
  • 原文地址:https://www.cnblogs.com/no-tears-girl/p/6964600.html
Copyright © 2011-2022 走看看