  • Learning the BeautifulSoup scraping library (Part 3)

    Traversing the document tree:

      1. Finding child nodes

      .contents  

      A tag's .contents attribute returns the tag's direct children as a list.

      print soup.body.contents

      print type(soup.body.contents)

      Output:

    [u' ', <p class="title" name="dromouse"><b>The Dormouse's story</b></p>, u' ', <p class="story">Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>, u' ', <p class="story">...</p>, u' ']

    <type 'list'>
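As a minimal sketch of the same idea, the snippet below uses a small made-up HTML string (not the document from this article), Python 3 print syntax, and the built-in "html.parser" backend, all of which are assumptions. It shows that .contents is a plain list, so it supports len() and indexing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, chosen only for illustration.
html = "<body><p>one</p><p>two <b>bold</b></p></body>"
soup = BeautifulSoup(html, "html.parser")

# .contents holds the direct children only, as an ordinary list.
children = soup.body.contents
print(type(children).__name__)
print(len(children))

# Each child is itself a node, so .contents can be applied again one level down.
first_paragraph = children[0]
second_paragraph_children = children[1].contents
print(first_paragraph)
print(second_paragraph_children)
```

Because the result is a list, it can be sliced or indexed directly, which is convenient when the position of a child is known in advance.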

     

    .children

    It does not return a list, but we can iterate over it to get all the child nodes.

    Printing it shows that it returns a list iterator object:

    print soup.body.children  

    <listiterator object at 0x0294DE90>

     

    How do we get at the contents? Just loop over it:

    for child in soup.body.children:
      print child

    Output:

    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>


    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>


    <p class="story">...</p>


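The blank lines in output like this come from whitespace-only text nodes that sit between the tags. A small sketch, assuming a made-up HTML string, Python 3, and the "html.parser" backend, shows one way to keep only the tag children while iterating over .children:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup with newlines between the tags.
html = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# .children is an iterator; list() consumes it so the items can be counted.
all_children = list(soup.body.children)

# Keep only real tags and drop the text nodes between them.
tag_children = [child for child in soup.body.children if isinstance(child, Tag)]

print(len(all_children))
print([tag.name for tag in tag_children])
```

Here all_children has five items (three newline strings and two p tags), while tag_children contains only the two p tags.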

     

    2. All descendant nodes

    .descendants

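    The .contents and .children attributes contain only a tag's direct children. The .descendants attribute iterates recursively over all of a tag's descendants. As with .children, we have to loop over it to get the contents.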

    for child in soup.descendants:
      print child

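    Running this prints every node in the document: first the outermost html tag, then the head tag and everything nested inside it, peeled off one layer at a time, and so on.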
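To make the difference between direct children and descendants concrete, here is a short sketch. The HTML string is made up for illustration, and Python 3 with the "html.parser" backend is assumed.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup with two levels of nesting.
html = "<html><head><title>T</title></head><body><p>a<b>b</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Direct children of <html>: only <head> and <body>.
direct = [child.name for child in soup.html.children]

# Descendants are visited depth-first in document order; text nodes are
# included too, so they are filtered out here to list tag names only.
nested = [node.name for node in soup.html.descendants if isinstance(node, Tag)]

print(direct)
print(nested)
```

The descendant walk finishes the head subtree (head, title) before it moves on to body, p, and b.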

     

    3. Node content

    .string

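    If a tag contains no other tags, .string returns the text inside it. If a tag's only child is another tag, .string also returns the innermost content.

    If a tag contains more than one child node, it cannot tell which child .string should refer to, so .string is None.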

    print soup.head.string
    print soup.title.string
    print soup.body.string

    #The Dormouse's story
    #The Dormouse's story
    #None
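The three cases can be checked side by side. This is a sketch on a made-up HTML string (the id values are hypothetical), assuming Python 3 and the "html.parser" backend. It also shows get_text(), which is one common way to read text when .string is None.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one paragraph per case.
html = (
    "<div>"
    "<p id='a'>plain</p>"          # text only
    "<p id='b'><b>nested</b></p>"  # exactly one child tag
    "<p id='c'>x<b>y</b></p>"      # two children: a string and a tag
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

plain = soup.find(id="a").string    # the tag's own text
nested = soup.find(id="b").string   # passes through the single child tag
mixed = soup.find(id="c").string    # ambiguous, so None

# get_text() joins all strings under the tag instead of picking one.
mixed_text = soup.find(id="c").get_text()

print(plain, nested, mixed, mixed_text)
```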

     

    4. Multiple strings

    .strings

    .strings yields every string in the tree; it has to be looped over.

    for string in soup.strings:
      print repr(string)

     

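    .stripped_strings

    The strings in the output may contain a lot of spaces or blank lines. .stripped_strings strips the surrounding whitespace from each string and skips strings that are whitespace only.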

    for string in soup.stripped_strings:
      print repr(string)

    Output:

    u"The Dormouse's story"
    u"The Dormouse's story"
    u'Once upon a time there were three little sisters; and their names were'
    u','
    u'Lacie'
    u'and'
    u'Tillie'
    u'; and they lived at the bottom of a well.'
    u'...'
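A compact comparison of the two attributes, sketched on a made-up HTML string with Python 3 and the "html.parser" backend (both assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with indentation and padded text.
html = "<div>\n  <p> Hello </p>\n  <p>world</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

# .strings yields every text node exactly as it appears in the markup.
raw = list(soup.strings)

# .stripped_strings trims each string and drops the whitespace-only ones.
clean = list(soup.stripped_strings)

print(raw)
print(clean)
```

In this example raw has five entries, three of which are pure whitespace, while clean contains only 'Hello' and 'world'.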

     

    5. Parent node

     .parent 

    print soup.p.parent.name

    print soup.head.title.string.parent.name

    #body

    #title
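The same lookups can be tried on a smaller tree. The sketch below assumes a made-up HTML string, Python 3, and the "html.parser" backend. It also uses .parents, which walks all the way up to the BeautifulSoup object itself.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = "<html><body><p><b>text</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

b_tag = soup.b

# A tag's parent is the tag that directly encloses it.
tag_parent = b_tag.parent.name

# A string's parent is the tag that contains the string.
string_parent = b_tag.string.parent.name

# .parents iterates over every ancestor, ending at the document object,
# whose name is "[document]".
ancestors = [ancestor.name for ancestor in b_tag.parents]

print(tag_parent, string_parent, ancestors)
```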

     

    6. Sibling nodes, next and previous elements, and so on are omitted here.

     

  • Original article: https://www.cnblogs.com/yu2000/p/6847039.html