zoukankan      html  css  js  c++  java
  • Web Scraping with Python第二章

    1.BeautifulSoup对象类型

    • BeautifulSoup对象,例如bsObj.div.h1
    • tag对象,例如使用find或findAll函数返回的对象
    • NavigableString对象,即指HTML中的文本节点
    • comment对象,指HTML中的注释,如<!--like this one-->

    2. findAll()与find()函数

    用法:
    findAll(tag, attributes, recursive, text, limit, keywords)
    find(tag, attributes, recursive, text, keywords)

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
    bsObj = BeautifulSoup(html.read(),"html.parser")
    
    nameList = bsObj.findAll("span",{"class":"green"})
    for name in nameList:
        print(name.get_text())
    
    Anna
    Pavlovna Scherer
    Empress Marya
    Fedorovna
    Prince Vasili Kuragin
    Anna Pavlovna
    St. Petersburg
    the prince
    Anna Pavlovna
    Anna Pavlovna
    the prince
    the prince
    the prince
    Prince Vasili
    Anna Pavlovna
    Anna Pavlovna
    the prince
    Wintzingerode
    King of Prussia
    le Vicomte de Mortemart
    Montmorencys
    Rohans
    Abbe Morio
    the Emperor
    the prince
    Prince Vasili
    Dowager Empress Marya Fedorovna
    the baron
    Anna Pavlovna
    the Empress
    the Empress
    Anna Pavlovna's
    Her Majesty
    Baron
    Funke
    The prince
    Anna
    Pavlovna
    the Empress
    The prince
    Anatole
    the prince
    The prince
    Anna
    Pavlovna
    Anna Pavlovna
    

    3.子节点、后代节点、兄弟节点、父节点

    .children:获取该节点的所有字节点
    .descendants:获取该节点的所有后代节点,包括字节点
    .next_siblings:获取兄弟节点(除了自己,而且是后面的兄弟节点)
    .previous_siblings:获取兄弟节点(除了自己,而且是后面的兄弟节点)
    .next_sibling:获取后一个兄弟节点
    .previous_sibling:获取前一个兄弟节点
    .parents:获取所有父节点
    .parent:获取第一级父节点

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html.read(),"lxml")
    
    #for child in bsObj.find("table",{"id":"giftList"}).children:
    #for child in bsObj.find("table",{"id":"giftList"}).descendants:
    for child in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
        print(child)
        
    print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
    
    <tr class="gift" id="gift1"><td>
    Vegetable Basket
    </td><td>
    This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
    <span class="excitingNote">Now with super-colorful bell peppers!</span>
    </td><td>
    $15.00
    </td><td>
    <img src="../img/gifts/img1.jpg"/>
    </td></tr>
    
    
    <tr class="gift" id="gift2"><td>
    Russian Nesting Dolls
    </td><td>
    Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
    </td><td>
    $10,000.52
    </td><td>
    <img src="../img/gifts/img2.jpg"/>
    </td></tr>
    
    
    <tr class="gift" id="gift3"><td>
    Fish Painting
    </td><td>
    If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
    </td><td>
    $10,005.00
    </td><td>
    <img src="../img/gifts/img3.jpg"/>
    </td></tr>
    
    
    <tr class="gift" id="gift4"><td>
    Dead Parrot
    </td><td>
    This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
    </td><td>
    $0.50
    </td><td>
    <img src="../img/gifts/img4.jpg"/>
    </td></tr>
    
    
    <tr class="gift" id="gift5"><td>
    Mystery Box
    </td><td>
    If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
    </td><td>
    $1.50
    </td><td>
    <img src="../img/gifts/img6.jpg"/>
    </td></tr>
    
    
    
    $15.00
    

    4.正则

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re
    
    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html,"html.parser")
    
    for img in bsObj.findAll("img",{"src":re.compile("../img/gifts/imgd+.jpg")}):
        print(img)
    
    <img src="../img/gifts/img1.jpg"/>
    <img src="../img/gifts/img2.jpg"/>
    <img src="../img/gifts/img3.jpg"/>
    <img src="../img/gifts/img4.jpg"/>
    <img src="../img/gifts/img6.jpg"/>
    

    5.匿名函数

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re
    
    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html,"html.parser")
    
    #找出只有2个属性值的标签
    for lst in bsObj.findAll(lambda tag: len(tag.attrs) == 2):
        print(lst)
    
    <img src="../img/gifts/logo.jpg" style="float:left;"/>
    <tr class="gift" id="gift1"><td>
    Vegetable Basket
    </td><td>
    This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
    <span class="excitingNote">Now with super-colorful bell peppers!</span>
    </td><td>
    $15.00
    </td><td>
    <img src="../img/gifts/img1.jpg"/>
    </td></tr>
    <tr class="gift" id="gift2"><td>
    Russian Nesting Dolls
    </td><td>
    Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
    </td><td>
    $10,000.52
    </td><td>
    <img src="../img/gifts/img2.jpg"/>
    </td></tr>
    <tr class="gift" id="gift3"><td>
    Fish Painting
    </td><td>
    If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
    </td><td>
    $10,005.00
    </td><td>
    <img src="../img/gifts/img3.jpg"/>
    </td></tr>
    <tr class="gift" id="gift4"><td>
    Dead Parrot
    </td><td>
    This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
    </td><td>
    $0.50
    </td><td>
    <img src="../img/gifts/img4.jpg"/>
    </td></tr>
    <tr class="gift" id="gift5"><td>
    Mystery Box
    </td><td>
    If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
    </td><td>
    $1.50
    </td><td>
    <img src="../img/gifts/img6.jpg"/>
    </td></tr>
  • 相关阅读:
    git创建版本库
    DataSet的加密解密
    在InstallShield中加密字符串,在C#中解密
    asp.net后台长时间操作时,向前台输出“请等待"信息的方法
    DataSet的加密解密(续)
    XXTEA加密算法的InstallShield 脚本实现
    c#如何监视文件或者文件夹的变化
    wpf制作毛玻璃效果按钮的代码
    WPF中用于Path的Geometry MiniLanguage
    如何在非英文环境中正确显示数字
  • 原文地址:https://www.cnblogs.com/dxs959229640/p/8672842.html
Copyright © 2011-2022 走看看