  • Web Scraping: Using the BeautifulSoup Library

    Extracting information from an HTML page with BeautifulSoup

    #!/usr/bin/python3
    import requests
    from bs4 import BeautifulSoup
    
    url='http://python123.io/ws/demo.html'
    r=requests.get(url)
    if r.status_code==200:
        print('Request succeeded')
    
    demo=r.text
    soup=BeautifulSoup(demo,'html.parser')
    print(soup.prettify())
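
    The script above prints a message on success but still goes on to parse the response when the request fails. A minimal sketch of a stricter variant follows. The helper name fetch_soup, the 10-second timeout, and the sample HTML string are assumptions made for illustration and are not part of the original post. The parsing step is exercised on a local string so it can be tried without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it; raise if the request fails.

    Hypothetical helper, not from the original post.
    """
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()               # raises requests.HTTPError on 4xx/5xx
    r.encoding = r.apparent_encoding   # guess the encoding from the body
    return BeautifulSoup(r.text, 'html.parser')


# Offline check of the parsing step, using a made-up HTML string.
sample_html = ('<html><head><title>Sample page</title></head>'
               '<body><p>Hello</p></body></html>')
sample_soup = BeautifulSoup(sample_html, 'html.parser')
print(sample_soup.prettify())
```

    With network access, fetch_soup('http://python123.io/ws/demo.html') would be expected to return the same kind of object as soup in the script above.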

    Basic attributes of the BeautifulSoup class

    #!/usr/bin/python3
    import requests
    from bs4 import BeautifulSoup
    
    url='http://python123.io/ws/demo.html'
    r=requests.get(url)
    if r.status_code==200:
        print('Request succeeded')
    
    demo=r.text
    soup=BeautifulSoup(demo,'html.parser')
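    tag_title=soup.title          # the first <title> tag
    print(tag_title)
    tag_a_attrs=soup.a.attrs      # attributes of the first <a> tag, as a dict
    print(tag_a_attrs)
    print(soup.p.string)          # text of the first <p> tag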
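
    The attributes used above (name, attrs, string, parent) can be tried on a small local document. This is only a sketch: the HTML string below is made up for illustration and is not the content of demo.html.

```python
from bs4 import BeautifulSoup

# Made-up sample document (an assumption, not the real demo page).
html = ('<html><head><title>Sample page</title></head><body>'
        '<p class="intro"><b>Bold text</b></p>'
        '<a href="http://example.com/a" id="link1">First link</a>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

tag = soup.a                # first <a> tag in the document
print(tag.name)             # the tag's name
print(tag.attrs)            # all attributes as a dict
print(tag['href'])          # a single attribute, looked up like a dict key
print(tag.string)           # the text inside the tag
print(tag.parent.name)      # name of the enclosing tag

# .string also works when a tag's only child is another tag
# that itself contains a single string.
print(soup.p.string)
```

    Note that .string returns None when a tag has more than one child, so it is safest on tags known to contain a single piece of text.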

    Downward traversal of the tag tree

    #!/usr/bin/python3
    import requests
    from bs4 import BeautifulSoup
    
    url='http://python123.io/ws/demo.html'
    r=requests.get(url)
    if r.status_code==200:
        print('Request succeeded')
    
    demo=r.text
    soup=BeautifulSoup(demo,'html.parser')
    
    print(soup.prettify())
    print(' separator '.center(80,'-'))
    # iterate over the direct children of <body>
    
    for child in soup.body.children:
        print(child)
    # iterate over all descendants of <body>
    for descendant in soup.body.descendants:
        print(descendant)
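
    Both children and descendants yield text nodes as well as tags, so the loops above also print bare strings. A small sketch of the difference between the two, keeping only tags, is shown below. The HTML string is a made-up example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up sample document for illustration.
html = '<body><p>One <b>bold</b></p><div><a href="#">link</a></div></body>'
soup = BeautifulSoup(html, 'html.parser')

# .children yields only the direct children of <body>.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]

# .descendants walks the whole subtree in document order.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
print(len(soup.body.contents))   # .contents is the list form of .children
```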

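    Upward traversal of the tag tree

    Access the parent tag of the title tag

    print(soup.title.parent)
    # next sibling node of the a tag (this may be a text node rather than a tag)
    print(soup.a.next_sibling)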

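    Iterate over all following and preceding sibling nodes of the a tag

    # iterate over the sibling nodes that follow the a tag
    for sibling in soup.a.next_siblings:
        print(sibling)
    # iterate over the sibling nodes that precede the a tag
    for sibling in soup.a.previous_siblings:
        print(sibling)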
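
    Sibling nodes are not always tags: the text between two tags is a sibling too. The sketch below, using a made-up HTML string, shows this and one way to keep only tag siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up sample: two links separated by plain text.
html = ('<p>See <a id="first" href="#1">one</a> and '
        '<a id="second" href="#2">two</a>.</p>')
soup = BeautifulSoup(html, 'html.parser')
first = soup.a

# The immediate next sibling is the text between the two links.
print(repr(first.next_sibling))

# Keep only the siblings that are tags.
next_tag_ids = [s['id'] for s in first.next_siblings if isinstance(s, Tag)]
print(next_tag_ids)

# Everything before the first link, nearest node first.
previous_nodes = list(first.previous_siblings)
print(previous_nodes)

# find_next_sibling skips text nodes and matches by tag name.
print(first.find_next_sibling('a')['id'])
```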

    The parent of the soup object itself is None, so check for None when traversing upward.
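
    A minimal sketch of an upward traversal with that check, using a made-up HTML string. The explicit None test is defensive: in bs4 the parents generator stops once it reaches the top-level object, whose name is reported as '[document]'.

```python
from bs4 import BeautifulSoup

# Made-up sample document for illustration.
html = ('<html><head><title>Sample</title></head>'
        '<body><p><a href="#">x</a></p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

# Walk from the first <a> tag up to the root, collecting tag names.
parent_names = []
for parent in soup.a.parents:
    if parent is None:        # defensive check, as described above
        break
    parent_names.append(parent.name)

print(parent_names)
print(soup.parent)            # the soup object has no parent
```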

  • Original article: https://www.cnblogs.com/liberate20/p/10778322.html