Reading the official BeautifulSoup documentation: printing the HTML tree

prettify() returns the HTML as a nicely formatted Unicode string, with each tag and each string on its own line:

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'

print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a href="http://example.com/">
#    I linked to
#    <i>
#     example.com
#    </i>
#   </a>
#  </body>
# </html>
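
As a hedged illustration of the same idea under Python 3, here is a minimal sketch. Passing "html.parser" explicitly is my assumption (the original passes no parser); with that parser no <html>/<head>/<body> wrapper is added, so the output is shorter than the one shown above. The variable names pretty and pretty_i are only illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# Assumption: the standard-library parser, named explicitly.
soup = BeautifulSoup(markup, "html.parser")

# prettify() on the whole document returns an indented, multi-line str.
pretty = soup.prettify()
print(pretty)

# prettify() can also be called on a single tag inside the tree.
pretty_i = soup.i.prettify()
print(pretty_i)
```

prettify() is meant for reading the tree, not for reformatting a document: it inserts whitespace, so the result is not byte-for-byte equivalent to the input.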

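If you only want a string representing the HTML and do not care about its formatting, you can call str() or unicode() on the object. The examples here are Python 2, where str() returns a byte string encoded in UTF-8 and unicode() returns a Unicode string; in Python 3, str() already returns Unicode text and unicode() does not exist. You can also call encode() to get a bytestring or decode() to get Unicode.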

str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

unicode(soup.a)
# u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
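
For readers on Python 3, the following is a rough sketch of the equivalent calls. The "html.parser" argument and the names text, raw and tag_text are my own assumptions for illustration and are not from the original post.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# Assumption: the standard-library parser, named explicitly.
soup = BeautifulSoup(markup, "html.parser")

# In Python 3, str() gives Unicode text with no extra formatting.
text = str(soup)

# encode() gives a bytestring; UTF-8 is requested explicitly here.
raw = soup.encode("utf-8")

# str() also works on a single tag inside the tree.
tag_text = str(soup.i)

print(type(text).__name__, type(raw).__name__)
print(tag_text)
```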

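I did not feel like reading through the remaining details, so I will only mention one more thing, get_text(). It returns all the text inside the tag it is called on, as a single Unicode string: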

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)

soup.get_text()
# u'\nI linked to example.com\n'
soup.i.get_text()
# u'example.com'

You can also pass it a string argument, which is used as the separator between the pieces of text:

soup.get_text("|")
# u'\nI linked to |example.com|\n'

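You can also set the strip argument to remove leading and trailing whitespace from each piece of text (note: from each piece, not from the string as a whole):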

soup.get_text("|", strip=True)
# u'I linked to|example.com'
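
The three get_text() variants can be compared side by side. This is a small Python 3 sketch; the "html.parser" argument and the names plain, joined and stripped are my assumptions, not part of the original post.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'

# Assumption: the standard-library parser, named explicitly.
soup = BeautifulSoup(markup, "html.parser")

# No arguments: every piece of text, concatenated as-is.
plain = soup.get_text()

# A separator is placed between the pieces; their whitespace is kept.
joined = soup.get_text("|")

# strip=True strips each piece first and drops pieces that end up empty.
stripped = soup.get_text("|", strip=True)

print(repr(plain))     # '\nI linked to example.com\n'
print(repr(joined))    # '\nI linked to |example.com|\n'
print(repr(stripped))  # 'I linked to|example.com'
```

The trailing "\n" before </a> is a piece of text in its own right, which is why the unstripped version ends with "|\n" while the stripped version has no trailing separator.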

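Of course, in this case you can also use the .stripped_strings generator mentioned earlier (note that it is an attribute, not a method; see the previous article if you do not remember it) and process the text yourself: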

[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']
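
To make the relationship between the two approaches concrete, here is a hedged Python 3 sketch. The "html.parser" argument and the names pieces and upper_pieces are my own assumptions for illustration.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'

# Assumption: the standard-library parser, named explicitly.
soup = BeautifulSoup(markup, "html.parser")

# .stripped_strings is a generator attribute, so it is not called with ().
pieces = list(soup.stripped_strings)
print(pieces)  # ['I linked to', 'example.com']

# Joining the pieces by hand matches get_text() with strip=True.
assert "|".join(pieces) == soup.get_text("|", strip=True)

# Working with the list lets you process each piece before joining.
upper_pieces = [piece.upper() for piece in pieces]
print(" / ".join(upper_pieces))  # I LINKED TO / EXAMPLE.COM
```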

At this point I have read about 70% of the documentation. I think this is enough for my current needs, so I am not going to read any further.

Original article: https://www.cnblogs.com/nzhl/p/5593424.html