  • Beautifulsoup

    Beautiful Soup: parsing HTML pages and extracting information from their markup

    Fetching the page source

    import requests
    from bs4 import BeautifulSoup

    kv = {'user-agent': 'Mozilla/5.0'}  # send a browser-like User-Agent header
    url = "https://python123.io/ws/demo.html"
    r = requests.get(url, headers=kv)
    print(r.status_code)
    demo = r.text
    print(demo)  # raw HTML as returned by the server
    soup = BeautifulSoup(demo, "html.parser")  # parse the HTML into a tag tree
    print(soup.prettify())

    200
    <html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
    </body></html>

    <html>
    <head>
    <title>
    This is a python demo page
    </title>
    </head>
    <body>
    <p class="title">
    <b>
    The demo python introduces several python courses.
    </b>
    </p>
    <p class="course">
    Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
    </a>
    and
    <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
    </a>
    .
    </p>
    </body>
    </html>
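
    The request above assumes the server answers quickly and with status 200. A slightly more defensive sketch is shown below. The helper name fetch_html and the 10-second timeout are illustrative assumptions, not part of the original post.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    headers = {"user-agent": "Mozilla/5.0"}
    try:
        r = requests.get(url, headers=headers, timeout=timeout)
        r.raise_for_status()              # turn 4xx/5xx responses into exceptions
        r.encoding = r.apparent_encoding  # guess the encoding from the response body
        return r.text
    except requests.RequestException:
        return None


# A URL without a scheme is rejected before any network traffic happens,
# so the helper returns None here.
print(fetch_html("not-a-url"))
```

    Returning None keeps the calling code simple; a real crawler would probably log the exception instead of discarding it.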

    Using BeautifulSoup

     

     

     Parsers supported by the BeautifulSoup library
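
    The second argument to BeautifulSoup names the parser. "html.parser" ships with Python, while "lxml" and "html5lib" are separate packages that may not be installed. The sketch below tries them in order and falls back to the built-in one; the function name parse_with_fallback and the preference order are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_fallback(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser that is installed (hypothetical helper)."""
    for name in parsers:
        try:
            return name, BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")


parser_name, soup = parse_with_fallback("<p class='title'><b>Hello</b></p>")
print(parser_name, soup.b.string)
```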

      Basic elements of the BeautifulSoup class

     

    https://python123.io/ws/demo.html

    import requests
    from bs4 import BeautifulSoup
    kv = {'user-agent':'Mozilla/5.0'}
    url = "https://python123.io/ws/demo.html"
    r = requests.get(url,headers = kv)
    print(r.status_code)
    demo = r.text
    soup = BeautifulSoup(demo,"html.parser")
    print(soup.title)
    tag = soup.a  # soup.a returns only the first <a> tag in the document
    print(tag)

    200
    <title>This is a python demo page</title>
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

    print(soup.a.name)
    print(soup.a.parent.name)
    print(soup.a.parent.parent.name)

    a
    p
    body

    print(tag.attrs)
    print(tag.attrs['href'])
    print(type(tag.attrs))  # attrs is a plain dict
    print(type(tag))

    {'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
    http://www.icourse163.org/course/BIT-268001
    <class 'dict'>
    <class 'bs4.element.Tag'>
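
    Because attrs is a dict, a missing key raises KeyError. The following sketch shows the shorthand lookups that a Tag also supports; it parses a single <a> element copied from the demo page so that it runs without network access.

```python
from bs4 import BeautifulSoup

html = ('<a href="http://www.icourse163.org/course/BIT-268001" '
        'class="py1" id="link1">Basic Python</a>')
tag = BeautifulSoup(html, "html.parser").a

print(tag["href"])           # shorthand for tag.attrs["href"]
print(tag.get("title"))      # missing attribute: .get() returns None, no KeyError
print(tag["class"])          # class can hold several values, so bs4 stores a list
print(tag.get("id", "n/a"))  # .get() accepts a default, like dict.get()
```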



    tag = soup.a
    print(tag)
    print(tag.string)
    tag1 = soup.p
    print(tag1)
    print(tag1.string)
    tag2 = soup.b
    print(tag2)
    print(tag2.string)

    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
    Basic Python
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    The demo python introduces several python courses.
    <b>The demo python introduces several python courses.</b>
    The demo python introduces several python courses.

    print(type(tag2.string))

    <class 'bs4.element.NavigableString'>
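
    .string only works when a tag has a single string below it (directly, or through a single nested child as with the title paragraph above). For a tag with several children it returns None. A minimal sketch, using a shortened stand-in for the course paragraph rather than the real page:

```python
from bs4 import BeautifulSoup

html = ('<p class="course">Learn <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
p = BeautifulSoup(html, "html.parser").p

print(p.string)                  # None: <p> has several children
print(p.get_text())              # joins every string found below <p>
print(list(p.stripped_strings))  # the same strings, one by one, whitespace trimmed
```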

    soup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
    print(soup.b.string)
    print(type(soup.b.string))
    print(soup.p.string)
    print(type(soup.p.string))

    This is a comment
    <class 'bs4.element.Comment'>
    This is not a comment
    <class 'bs4.element.NavigableString'>

    Traversing HTML content with the bs4 library

    <html>
    <head>
    <title>
    This is a python demo page
    </title>
    </head>
    <body>
    <p class="title">
    <b>
    The demo python introduces several python courses.
    </b>
    </p>
    <p class="course">
    Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
    </a>
    and
    <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
    </a>
    .
    </p>
    </body>
    </html>

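     The tag tree of the document above is as follows:

    html
    ├── head
    │   └── title
    └── body
        ├── p (class="title")
        │   └── b
        └── p (class="course")
            ├── a (id="link1")
            └── a (id="link2")

    Three ways to traverse the tree

     Downward traversal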

    soup = BeautifulSoup(demo,"html.parser")
    print(soup.head)
    print(soup.head.contents)
    print(soup.body.contents)  # .contents returns a list of the direct children
    print(len(soup.body.contents))
    print(soup.body.contents[1])

    <head><title>This is a python demo page</title></head>
    [<title>This is a python demo page</title>]
    [' ', <p class="title"><b>The demo python introduces several python courses.</b></p>, ' ', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, ' ']
    5
    <p class="title"><b>The demo python introduces several python courses.</b></p>

    for child in soup.body.children:
        print(child)  # iterate over the direct children only


    <p class="title"><b>The demo python introduces several python courses.</b></p>


    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

    for child in soup.body.descendants:
        print(child)  # iterate over every descendant, depth first


    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <b>The demo python introduces several python courses.</b>
    The demo python introduces several python courses.


    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
    Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
    Basic Python
    and
    <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
    Advanced Python
    .
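
    The blank lines in the output above are whitespace strings between tags, which count as nodes too. When only tags are wanted, an isinstance check filters the strings out. The sketch below uses a shortened copy of the demo body, not the real page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    "<body>\n"
    '<p class="title"><b>Title</b></p>\n'
    '<p class="course">Intro <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>\n'
    "</body>"
)
soup = BeautifulSoup(html, "html.parser")

# Keep only Tag nodes; NavigableString nodes (text, whitespace) are skipped.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(len(soup.body.contents))  # counts strings as well as tags
print(child_tags)
print(descendant_tags)
```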


    Upward traversal
    
    
    
    for parent in soup.a.parents:  # walk up through every ancestor of the first <a> tag
        if parent is None:  # defensive check; .parents stops before it would yield None
            print(parent)
        else:
            print(parent.name)

    p
    body
    html
    [document]

    soup = BeautifulSoup(demo,"html.parser")
    tag = soup.a
    print(tag)
    print(tag.parent)

    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

    Note: the parent of soup.html is the BeautifulSoup object itself (soup, whose name is [document]), and soup.parent is None.
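
    The top of the tree can be checked directly. This sketch builds a tiny document inline (not the demo page) and inspects the root and its parent.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>link</a></p></body></html>", "html.parser")

print(soup.html.parent is soup)           # the <html> tag hangs off the soup object
print(soup.name)                          # the soup object reports the name [document]
print(soup.parent)                        # nothing sits above the soup object
print([p.name for p in soup.a.parents])   # ancestors of <a>, nearest first
```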

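    Sibling traversal

    Sibling traversal only happens among nodes that share the same parent.

    The NavigableString objects between tags are nodes of the tag tree as well, so a node's children or siblings may be NavigableString objects rather than tags.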

    soup = BeautifulSoup(demo,"html.parser")
    tag = soup.a
    print(tag.next_sibling)
    print(tag.next_sibling.next_sibling)
    print(tag.previous_sibling)
    print(tag.previous_sibling.previous_sibling)

    print(tag.parent)

    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

    and
    <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
    .

    Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
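
    Since text between tags counts as a sibling, .next_sibling often returns a string instead of the next tag. The sketch below, on a shortened stand-in for the course paragraph, contrasts raw sibling access with find_next_sibling, which skips to the next matching tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<p>Intro <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
tag = BeautifulSoup(html, "html.parser").a

print(str(tag.next_sibling))             # the text between the two links
print(tag.find_next_sibling("a")["id"])  # jumps straight to the next <a>

# List everything after the first link: tag names for tags, text for strings.
later = [s.name if isinstance(s, Tag) else str(s) for s in tag.next_siblings]
print(later)

# Walking past the first sibling on the left runs off the end and gives None.
print(tag.previous_sibling.previous_sibling)
```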

    Formatted HTML output with the bs4 library
    soup = BeautifulSoup(demo, "html.parser")
    print(soup.prettify())  # prettify() puts each tag and string on its own line for readable output

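    bs4 converts every document it parses to Unicode and encodes output as UTF-8 by default, so non-ASCII text such as Chinese needs no special handling.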
    soup = BeautifulSoup("<p>你好</p>","html.parser")
    print(soup.p.string)
    print(soup.p.prettify())
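
    The round trip between bytes and text can be made explicit. In this sketch the input encoding is passed with from_encoding so that nothing depends on automatic detection; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

data = "<p>你好</p>".encode("utf-8")  # bytes, as they would arrive over the network

soup = BeautifulSoup(data, "html.parser", from_encoding="utf-8")
text = soup.p.string          # Unicode text inside the tag
raw = soup.p.encode("utf-8")  # the tag serialized back to UTF-8 bytes

print(text)
print(raw == data)
```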


  • Original article: https://www.cnblogs.com/tingtin/p/12907452.html