
Crawler Diary: Using the Various Attributes of Beautiful Soup

Beautiful Soup (BeautifulSoup) examples

The library has to be installed first. Open cmd and run pip install bs4 to download it.

It parses the messy HTML or XML you have scraped and tidies it up for you automatically. The full usage is in the code further down.

BeautifulSoup takes two arguments: the first is the scraped HTML content, and the second is the name of the parser used to parse it.
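As a minimal sketch of those two arguments, the snippet below parses a short inline string instead of a downloaded page, so it runs without a network connection. The HTML string is made up for illustration and is not part of the demo page.

```python
from bs4 import BeautifulSoup

# Inline HTML (an assumption for illustration) so no request is needed.
html = ('<html><head><title>Demo</title></head>'
        '<body><p class="title"><b>Hello</b></p></body></html>')

# First argument: the markup to parse. Second argument: the parser name.
soup = BeautifulSoup(html, "html.parser")

print(soup.title)       # <title>Demo</title>
print(soup.p["class"])  # ['title']
```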

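Other parsers are available as well, for example lxml and html5lib, each of which has to be installed separately.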
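Since the other parsers are optional installs, one possible pattern is to ask for a preferred parser and fall back to the built-in one. This is only a sketch: make_soup is a hypothetical helper name, and choosing lxml as the preferred parser is an assumption, not something the original post prescribes.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper written for this sketch.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>parsed</p>")
print(soup.p.string)  # parsed
```

The fallback keeps the script usable on a machine where only bs4 itself was installed.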

(The screenshots come from the video in the upper right; if this infringes any rights, please let me know and I will fix it.)

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>

The raw scraped data looks this ugly, enough to make your scalp tingle. After running it through bs4, it becomes soup:

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

Much nicer. This is the full content of soup in the scraping code further down.
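The indented layout above matches what the prettify() method of a soup object returns. A minimal sketch on a shorter inline string (made up for illustration) shows the same one-node-per-line layout:

```python
from bs4 import BeautifulSoup

# Short inline markup (an assumption for illustration) instead of the demo page.
soup = BeautifulSoup("<html><head><title>Demo</title></head></html>", "html.parser")

# prettify() returns a string with one tag or text node per line,
# indented according to nesting depth.
pretty = soup.prettify()
print(pretty)
```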

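import requests
from bs4 import BeautifulSoup
try:
    # A timeout keeps a stalled connection from hanging the script;
    # raise_for_status() turns HTTP error codes into exceptions.
    r = requests.get('https://python123.io/ws/demo.html', timeout=10)
    r.raise_for_status()
    demo = r.text
    soup = BeautifulSoup(demo, 'html.parser')

    # Returns the <title> tag, i.e. the text shown at the top left of the browser  <title>This is a python demo page</title>
    print(soup.title)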


    # Returns the tag named a; by default only the first <a> tag is returned  <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
    print(soup.a)
    print(soup.a.string)


    # Returns the name of the <a> tag  a
    print(soup.a.name)

    # Returns the name of the <a> tag's parent tag  p
    print(soup.a.parent.name)

    # Returns the name of the <a> tag's grandparent tag  body
    print(soup.a.parent.parent.name)

    tag=soup.a

    # Returns all attributes of the <a> tag  {'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
    print(tag.attrs)

    # Returns the value of the <a> tag's class attribute  ['py1']
    print(tag.attrs['class'])

    # Returns the value of the <a> tag's href attribute  http://www.icourse163.org/course/BIT-268001
    print(tag.attrs['href'])

    # Returns the type of attrs  <class 'dict'>
    print(type(tag.attrs))


    # Returns the data type of soup.a  <class 'bs4.element.Tag'>
    print(type(tag))

    # .string outputs the text content directly (it also passes through a single nested tag)  Basic Python
    print(tag.string)



    # soup.p is the first <p>; .string passes through the nested <b> tag and returns its text
    text = soup.p.string

    # The demo python introduces several python courses.
    print(text)

    # <class 'bs4.element.NavigableString'>
    print(type(text))


    newsoup=BeautifulSoup("<b><!--This is a comment --></b><p>This is not a comment</p>","html.parser")

    # This is a comment
    print(newsoup.b.string)

    # <class 'bs4.element.Comment'> (the comment type)
    print(type(newsoup.b.string))

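except requests.RequestException:
    # Only network and HTTP errors are treated as a failed scrape.
    print('Scraping failed')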
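The scraping code notes that soup.a only returns the first a tag, and it reads attributes through tag.attrs. A small sketch of two related calls, find_all and Tag.get, on an inline fragment shaped like the demo page (the shortened paragraph text is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Inline fragment modelled on the demo page, so no request is needed.
html = (
    '<p class="course">Courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# soup.a is only the first match; find_all returns every <a> tag.
links = soup.find_all("a")
for link in links:
    print(link.get("id"), link.get("href"), link.string)

# .get returns None for a missing attribute, where attrs['title'] would raise KeyError.
print(soup.a.get("title"))  # None
```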

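I wrote all the comments in the scraping code myself, so read them carefully. If it still does not make sense, go through it again from the top; there is no shortcut. The attributes above are not used very often, so a general understanding is enough.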
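Two details about .string are easy to miss when reading printed output, so here is a hedged sketch of both. The markup is a slight variation of the comment example (the extra i tag is an assumption added for illustration).

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    "<b><!--This is a comment --></b><p>This is <i>not</i> a comment</p>",
    "html.parser",
)

# Printing drops the <!-- --> markers, so the type check is what
# distinguishes a comment from ordinary text.
b_string = soup.b.string
print(isinstance(b_string, Comment))  # True

# This <p> has three children (text, <i>, text), so .string is None;
# get_text() joins all the text instead.
print(soup.p.string)      # None
print(soup.p.get_text())  # This is not a comment
```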


Original post: https://www.cnblogs.com/chanyuli/p/11395582.html