zoukankan html css js c++ java

自然语言处理----处理原始文本

本文主要介绍编程访问网络文本的几种方式。

1. 访问网络资源

>>> from urllib import urlopen
>>> url='http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.astype.html'
>>> raw=urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
16429
>>> raw[:75]
'

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://'

View Code

如果Python无法正确自动检测出Internet代理，可以使用下面方法手动指定。

>>> proxies={'http': 'http://www.someproxy.com:3128'}
>>> raw=urlopen(url, proxies=proxies).read（）

2. 访问博客

在Universal Feed Parser的第三方python库的帮助下，可以访问博客的内容。

>>> import feedparser
>>> llog=feedparser.parse('http://weibo.com/ttarticle/p/show?id=2309404116343489194022')
>>> llog.keys()
['feed', 'status', 'version', 'encoding', 'bozo', 'headers', 'href', 'namespaces', 'entries', 'bozo_exception']
>>> type(llog['feed'])
<class 'feedparser.FeedParserDict'>
>>> llog['feed'].keys()
['meta', 'summary']
>>> llog['feed']['meta']
{'content': u'text/html; charset=gb2312', 'http-equiv': u'Content-type'}
>>> llog['feed']['summary']
u'<span id="message"></span>

&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;'

View Code

3. 处理html

一般有三种方式：正则匹配， nltk.clean_html(), BeautifulSoup. 正则表达式比较繁琐，而nltk.clean_html（）现在已经不支持了，比较简单常用的是用BeautifulSoup包。

from bs4 import BeautifulSoup

html_doc=''' 
    <html><head><title>The Document's story</title></head>
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    </body></html>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
content=soup.get_text()
print content

运行结果如下：

runfile('D:/my project/e_book/XXMLV-2/4.Python_代码/test.py', wdir='D:/my project/e_book/XXMLV-2/4.Python_代码')

The Document's story
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
    Elsie,
    Lacie and
    Tillie;
        and they lived at the bottom of a well.
...

查看全文

相关阅读:
广陵基地输电线路实训场
 广陵基地配电网综合实训室
 广陵基地电缆实训室
 Windows Phone 9再见了!
Windows Phone 8初学者开发—第23部分：测试并向应用商店提交
 Windows Phone 8初学者开发—第22部分：用演示图板创建卷盘的动画
 JDBC数据类型
 Java-BlockingQueue的使用
 苹果下如果安装nginx,给nginx安装markdown第三方插件
 苹果电脑包管理

原文地址：https://www.cnblogs.com/no-tears-girl/p/6964600.html