  • HTML content search methods in the bs4 library

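    I. Information extraction example

    Extract all URL links from an HTML page

    Approach: 1) find all <a> tags

       2) parse each <a> tag and extract the link stored in its href attribute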

    >>> import requests
    >>> r = requests.get("https://python123.io/ws/demo.html")
    >>> demo = r.text
    >>> demo
    '<html><head><title>This is a python demo page</title></head> <body> <p class="title"><b>The demo python introduces several python courses.</b></p> <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p> </body></html>'
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(demo, 'html.parser')

    >>> print(soup.prettify())
    <html>
     <head>
      <title>
       This is a python demo page
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The demo python introduces several python courses.
       </b>
      </p>
      <p class="course">
       Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
       <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
        Basic Python
       </a>
       and
       <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
        Advanced Python
       </a>
       .
      </p>
     </body>
    </html>

    >>> for link in soup.find_all('a'):
    ...     print(link.get("href"))
    ...
    http://www.icourse163.org/course/BIT-268001
    http://www.icourse163.org/course/BIT-1001870001
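
    The two steps above can also be wrapped in a small script. This is a minimal sketch, not the original author's code: the helper name extract_links and the constant DEMO_HTML are made up for illustration, and the page is embedded as a string (copied from the demo output above) so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Copy of the demo page shown above, embedded so no HTTP request is needed.
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head> <body> '
    '<p class="title"><b>The demo python introduces several python courses.</b></p> '
    '<p class="course">Python is a wonderful general-purpose programming language. '
    'You can learn Python from novice to professional by tracking the following courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    '</p> </body></html>'
)


def extract_links(html):
    """Hypothetical helper: return the href of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # Step 1: find all <a> tags; href=True skips anchors without an href.
    # Step 2: read the href attribute of each tag.
    return [a["href"] for a in soup.find_all("a", href=True)]


links = extract_links(DEMO_HTML)
for url in links:
    print(url)
```

    Filtering with href=True avoids None entries in the result when a page contains <a> tags that have no href attribute.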

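    II. HTML content search methods in the bs4 library

    <>.find_all(name, attrs, recursive, string, **kwargs) searches the content held in a soup variable (or in any tag)

    It returns a list-type result that stores the matches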
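
    To make the return type concrete, here is a small sketch on a shortened, made-up fragment of the demo page. It also uses the limit argument, which find_all() accepts in addition to the parameters listed above; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the demo page (assumption: only the two links matter here).
html = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
soup = BeautifulSoup(html, "html.parser")

anchors = soup.find_all("a")
is_list = isinstance(anchors, list)    # the result object is a list subclass
count = len(anchors)                   # so len(), indexing and slicing all work
first_id = anchors[0]["id"]

limited = soup.find_all("a", limit=1)  # stop after the first match
missing = soup.find_all("table")       # no match gives an empty list, not None

print(is_list, count, first_id, len(limited), missing)
```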

    1. name: the search string for tag names

    >>> soup.find_all('a')
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
    >>> soup.find_all(['a','b'])
    [<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
    >>> for tag in soup.find_all(True):  # passing True as the tag name matches every tag in the current soup
    ...     print(tag.name)
    ...
    html
    head
    title
    body
    p
    b
    p
    a
    a
    >>> import re
    >>> for tag in soup.find_all(re.compile('b')):  # the regex is applied with search(), so every tag whose name contains b matches (use '^b' for names that start with b)
    ...     print(tag.name)
    ...
    body
    b
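
    In the demo page every tag name that contains b also starts with b, so the difference between the two patterns is invisible there. The sketch below uses the letter t instead, on a shortened, made-up version of the page, to show how an unanchored and an anchored pattern differ.

```python
import re

from bs4 import BeautifulSoup

# Shortened stand-in for the demo page; tag names are html, head, title, body, p, b.
html = (
    "<html><head><title>This is a python demo page</title></head>"
    "<body><p><b>bold text</b></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Unanchored pattern: matches any tag name that contains the letter t.
names_containing_t = [tag.name for tag in soup.find_all(re.compile("t"))]

# Anchored pattern: matches only tag names that start with t.
names_starting_with_t = [tag.name for tag in soup.find_all(re.compile("^t"))]

print(names_containing_t)
print(names_starting_with_t)
```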

    2. attrs: the search string for tag attribute values; the attribute to search can be named explicitly

    >>> soup.find_all('p','course')
    [<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]

    >>> soup.find_all(id='link1')
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
    >>> soup.find_all(id='link')
    []
    >>> import re
    >>> soup.find_all(id=re.compile('link'))
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
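
    The same attribute searches can be written in a few equivalent ways. The sketch below, on a shortened copy of the demo page, shows the positional class shortcut used above next to the attrs dictionary and the class_ keyword; the variable names are illustrative only.

```python
import re

from bs4 import BeautifulSoup

# Shortened stand-in for the demo page (assumption: hrefs are omitted for brevity).
html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Courses: '
    '<a class="py1" id="link1">Basic Python</a> and '
    '<a class="py2" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# A string in the second position is treated as a CSS class.
course_paragraphs = soup.find_all("p", "course")

# The same kind of filter spelled out with an attrs dictionary.
py1_links = soup.find_all("a", attrs={"class": "py1"})

# class is a Python keyword, so the keyword-argument form is class_.
py2_links = soup.find_all(class_="py2")

# Attribute values must match exactly unless a regex is used.
exact_miss = soup.find_all(id="link")
regex_hits = soup.find_all(id=re.compile("^link"))

print(len(course_paragraphs), py1_links[0]["id"], py2_links[0]["id"])
print(exact_miss, [tag["id"] for tag in regex_hits])
```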

    3. recursive: whether to search all descendants; defaults to True

    >>> soup.find_all('a')
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
    >>> soup.find_all('a',recursive=False)
    []

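    This shows that, starting from the soup root node, there is no a tag among its direct children; the a tags sit further down, among the descendants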
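
    A short sketch of how recursive=False behaves at different levels of the tree, using a shortened, made-up version of the demo page. Whether a tag is found depends on which node the search starts from.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the demo page, written without whitespace between tags.
html = (
    "<html><head><title>demo</title></head><body>"
    '<p class="title"><b>Intro</b></p>'
    '<p class="course"><a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# The only direct child of the soup root is <html>.
top_level = [tag.name for tag in soup.find_all(True, recursive=False)]

# <a> is not a direct child of the root, nor of <body>.
root_links = soup.find_all("a", recursive=False)
body_links = soup.body.find_all("a", recursive=False)

# <p> tags are direct children of <body>; <a> tags are direct children of the course <p>.
body_paragraphs = soup.body.find_all("p", recursive=False)
course_links = soup.find("p", "course").find_all("a", recursive=False)

print(top_level, root_links, body_links)
print(len(body_paragraphs), [a["id"] for a in course_links])
```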

    4. string: the search string for the text region between <>...</>

    >>> soup
    <html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
    </body></html>
    >>> soup.find_all(string = "Basic Python")
    ['Basic Python']
    >>> import re
    >>> soup.find_all(string=re.compile("python"))
    ['This is a python demo page', 'The demo python introduces several python courses.']
    >>>
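
    Note that the regex search above is case-sensitive, which is why 'Basic Python' and 'Advanced Python' are missing from its result. The sketch below, on a shortened copy of the demo page, contrasts the exact, case-sensitive and case-insensitive forms, and shows that combining string with a tag name returns tags instead of strings.

```python
import re

from bs4 import BeautifulSoup

# Shortened stand-in for the demo page (assumption: hrefs and classes omitted).
html = (
    "<html><head><title>This is a python demo page</title></head><body>"
    "<p><b>The demo python introduces several python courses.</b></p>"
    "<p>Python is a wonderful general-purpose programming language. "
    '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node.
exact = soup.find_all(string="Basic Python")

# A regex only needs to match somewhere in the text node; it is case-sensitive by default.
lowercase_only = soup.find_all(string=re.compile("python"))
any_case = soup.find_all(string=re.compile("python", re.IGNORECASE))

# With a tag name, find_all returns the tags whose text matches.
link_tags = soup.find_all("a", string="Basic Python")

print(exact)
print(len(lowercase_only), len(any_case))
print([tag["id"] for tag in link_tags])
```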

    <tag>(..) is equivalent to <tag>.find_all(..)

    soup(..) is equivalent to soup.find_all(..)
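
    A quick check of this shorthand on a made-up fragment: calling a soup or tag object directly gives the same matches as calling find_all() on it.

```python
from bs4 import BeautifulSoup

# Made-up fragment based on the demo page's two links.
html = (
    "<html><body><p>"
    '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for soup.find_all(...).
ids_via_call = [a["id"] for a in soup("a")]
ids_via_find_all = [a["id"] for a in soup.find_all("a")]

# The same shorthand works on any tag, with the same arguments.
ids_via_tag_call = [a["id"] for a in soup.body("a", id="link2")]

print(ids_via_call, ids_via_find_all, ids_via_tag_call)
```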

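    Seven extension methods (they accept the same name/attrs/string filters as find_all())

    <>.find(): searches like find_all() but returns only the first match, or None if nothing matches

    <>.find_parents(): searches the ancestors and returns a list

    <>.find_parent(): returns the first matching ancestor

    <>.find_next_siblings(): searches the following siblings and returns a list

    <>.find_next_sibling(): returns the first matching following sibling

    <>.find_previous_siblings(): searches the preceding siblings and returns a list

    <>.find_previous_sibling(): returns the first matching preceding sibling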
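
    The seven methods can be tried together on a shortened, made-up version of the demo page. This is only a sketch; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the demo page (assumption: hrefs omitted for brevity).
html = (
    "<html><body>"
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")     # first match only
no_table = soup.find("table")   # None when nothing matches

# Searching upwards through the ancestors (nearest ancestor first).
parent_p = first_link.find_parent("p")
ancestor_names = [t.name for t in first_link.find_parents(["p", "body"])]

# Searching forwards among the siblings of the first link.
next_link = first_link.find_next_sibling("a")
later_links = first_link.find_next_siblings("a")

# Searching backwards among the siblings of the second paragraph.
course_p = soup.find("p", "course")
previous_p = course_p.find_previous_sibling("p")
earlier_ps = course_p.find_previous_siblings("p")

print(first_link["id"], no_table, parent_p["class"], ancestor_names)
print(next_link["id"], len(later_links), previous_p["class"], len(earlier_ps))
```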

  • Original article: https://www.cnblogs.com/suitcases/p/11232139.html