  • Web Scraping with Python, Chapter 2

    1. BeautifulSoup object types

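    • BeautifulSoup object: the parsed document as a whole, e.g. bsObj
    • Tag object: returned by find or findAll, or reached by drilling down, e.g. bsObj.div.h1
    • NavigableString object: a text node inside the HTML
    • Comment object: an HTML comment, e.g. <!--like this one-->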
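
The four object types can be checked directly. The following is a minimal sketch, assuming a small inline HTML string (SAMPLE_HTML is made up for illustration, it is not one of the book's pages) so that it runs without network access:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical inline HTML, used only so the sketch runs offline.
SAMPLE_HTML = "<div><h1>Title</h1><!--like this one--></div>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

h1 = soup.div.h1                 # drilling down by tag name returns a Tag
text_node = h1.string            # the text inside <h1> is a NavigableString
comment = soup.div.contents[1]   # the node after <h1> inside <div> is the comment

for node in (soup, h1, text_node, comment):
    print(type(node).__name__)
```

This should print BeautifulSoup, Tag, NavigableString and Comment, one per line.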

    2. The findAll() and find() functions

    Usage:
    findAll(tag, attributes, recursive, text, limit, keywords)
    find(tag, attributes, recursive, text, keywords)

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
    bsObj = BeautifulSoup(html.read(),"html.parser")
    
    nameList = bsObj.findAll("span",{"class":"green"})
    for name in nameList:
        print(name.get_text())
    
    Anna
    Pavlovna Scherer
    Empress Marya
    Fedorovna
    Prince Vasili Kuragin
    Anna Pavlovna
    St. Petersburg
    the prince
    Anna Pavlovna
    Anna Pavlovna
    the prince
    the prince
    the prince
    Prince Vasili
    Anna Pavlovna
    Anna Pavlovna
    the prince
    Wintzingerode
    King of Prussia
    le Vicomte de Mortemart
    Montmorencys
    Rohans
    Abbe Morio
    the Emperor
    the prince
    Prince Vasili
    Dowager Empress Marya Fedorovna
    the baron
    Anna Pavlovna
    the Empress
    the Empress
    Anna Pavlovna's
    Her Majesty
    Baron
    Funke
    The prince
    Anna
    Pavlovna
    the Empress
    The prince
    Anatole
    the prince
    The prince
    Anna
    Pavlovna
    Anna Pavlovna
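
The other arguments in the signatures above can be tried on a small inline snippet. This is a sketch under assumptions: SAMPLE_HTML is made up for illustration, find_all is the current bs4 spelling of findAll, and string= is the newer name of the text argument in recent bs4 releases.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, loosely modelled on the page scraped above.
SAMPLE_HTML = """
<p>
  <span class="green">Anna Pavlovna</span> spoke to
  <span class="green">the prince</span> and
  <span class="red">Well, Prince</span>
  <span class="green">the prince</span>
</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# tag + attributes: every <span> whose class is "green"
green = soup.find_all("span", {"class": "green"})

# limit: stop after the first two matches
first_two = soup.find_all("span", {"class": "green"}, limit=2)

# keywords: class_ is used because "class" is a reserved word in Python
red = soup.find_all("span", class_="red")

# text/string: match text nodes by their exact content instead of by tag
prince_text = soup.find_all(string="the prince")

# find() returns only the first match, or None if nothing matches
first = soup.find("span", {"class": "green"})
missing = soup.find("span", {"class": "blue"})

print([tag.get_text() for tag in green])
print(first.get_text(), missing)
```

Checking the result of find() against None before calling get_text() avoids an AttributeError when a page does not contain the expected tag.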
    

    3. Children, descendants, siblings, and parents

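    .children: all direct children of the node
    .descendants: all descendants of the node, including its direct children
    .next_siblings: the siblings that come after the node (the node itself is excluded)
    .previous_siblings: the siblings that come before the node (the node itself is excluded)
    .next_sibling: the sibling immediately after the node
    .previous_sibling: the sibling immediately before the node
    .parents: all ancestors of the node
    .parent: the direct parent of the node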

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html.read(),"lxml")
    
    #for child in bsObj.find("table",{"id":"giftList"}).children:
    #for child in bsObj.find("table",{"id":"giftList"}).descendants:
    for child in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
        print(child)
        
    print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
    
    <tr class="gift" id="gift1"><td>
    Vegetable Basket
    </td><td>
    This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
    <span class="excitingNote">Now with super-colorful bell peppers!</span>
    </td><td>
    $15.00
    </td><td>
    <img src="../img/gifts/img1.jpg"/>
    </td></tr>
    
    
    <tr class="gift" id="gift2"><td>
    Russian Nesting Dolls
    </td><td>
    Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
    </td><td>
    $10,000.52
    </td><td>
    <img src="../img/gifts/img2.jpg"/>
    </td></tr>
    
    
    <tr class="gift" id="gift3"><td>
    Fish Painting
    </td><td>
    If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
    </td><td>
    $10,005.00
    </td><td>
    <img src="../img/gifts/img3.jpg"/>
    </td></tr>
    
    
    <tr class="gift" id="gift4"><td>
    Dead Parrot
    </td><td>
    This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
    </td><td>
    $0.50
    </td><td>
    <img src="../img/gifts/img4.jpg"/>
    </td></tr>
    
    
    <tr class="gift" id="gift5"><td>
    Mystery Box
    </td><td>
    If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
    </td><td>
    $1.50
    </td><td>
    <img src="../img/gifts/img6.jpg"/>
    </td></tr>
    
    
    
    $15.00
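
The navigation attributes listed in this section can also be compared side by side without fetching the page. A minimal sketch, assuming a cut-down, whitespace-free version of the gift table (SAMPLE_HTML is made up for illustration, so the counts below apply only to it):

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the gift table, written without whitespace
# between tags so that no extra text nodes show up as children or siblings.
SAMPLE_HTML = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th></tr>"
    '<tr id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>'
    '<tr id="gift2"><td>Dead Parrot</td><td>$0.50</td></tr>'
    "</table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})
gift1 = soup.find("tr", {"id": "gift1"})
gift2 = soup.find("tr", {"id": "gift2"})

children = list(table.children)        # the three <tr> rows only
descendants = list(table.descendants)  # rows, cells, and their text nodes

after_header = list(table.tr.next_siblings)    # rows after the header row
before_gift2 = list(gift2.previous_siblings)   # nearest sibling comes first

price_cell = gift1.find_all("td")[1]
name_text = price_cell.previous_sibling.get_text()  # cell to the left of the price

print(len(children), len(descendants))
print([row.get("id") for row in after_header])
print(name_text, price_cell.parent.get("id"))
print([parent.name for parent in price_cell.parents])
```

On a real page the markup usually contains newlines between tags, so .children and the sibling attributes also yield whitespace-only text nodes, which is why the output above contains blank lines between rows.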
    

    4. Regular expressions

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re
    
    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html,"html.parser")
    
    for img in bsObj.findAll("img",{"src":re.compile(r"\.\./img/gifts/img\d+\.jpg")}):
        print(img)
    
    <img src="../img/gifts/img1.jpg"/>
    <img src="../img/gifts/img2.jpg"/>
    <img src="../img/gifts/img3.jpg"/>
    <img src="../img/gifts/img4.jpg"/>
    <img src="../img/gifts/img6.jpg"/>
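
When a compiled regular expression is passed as an attribute value, bs4 keeps the tags whose attribute matches the pattern. The sketch below assumes a made-up SAMPLE_HTML and writes the pattern as a raw string with the literal dots escaped and \d+ standing for one or more digits:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical inline HTML: two product images plus two that should not match.
SAMPLE_HTML = """
<img src="../img/gifts/logo.jpg">
<img src="../img/gifts/img1.jpg">
<img src="../img/gifts/img12.jpg">
<img src="../img/gifts/imgX.jpg">
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Raw string so the backslashes reach the regex engine unchanged.
# \.  matches a literal dot, \d+ matches one or more digits.
product_image = re.compile(r"\.\./img/gifts/img\d+\.jpg")

matches = [img["src"] for img in soup.find_all("img", {"src": product_image})]
print(matches)
```

Only the img1.jpg and img12.jpg paths are kept; logo.jpg and imgX.jpg have no digits after "img" and are filtered out.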
    

    5. Lambda expressions

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re
    
    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html,"html.parser")
    
    # Find the tags that have exactly two attributes
    for lst in bsObj.findAll(lambda tag: len(tag.attrs) == 2):
        print(lst)
    
    <img src="../img/gifts/logo.jpg" style="float:left;"/>
    <tr class="gift" id="gift1"><td>
    Vegetable Basket
    </td><td>
    This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
    <span class="excitingNote">Now with super-colorful bell peppers!</span>
    </td><td>
    $15.00
    </td><td>
    <img src="../img/gifts/img1.jpg"/>
    </td></tr>
    <tr class="gift" id="gift2"><td>
    Russian Nesting Dolls
    </td><td>
    Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
    </td><td>
    $10,000.52
    </td><td>
    <img src="../img/gifts/img2.jpg"/>
    </td></tr>
    <tr class="gift" id="gift3"><td>
    Fish Painting
    </td><td>
    If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
    </td><td>
    $10,005.00
    </td><td>
    <img src="../img/gifts/img3.jpg"/>
    </td></tr>
    <tr class="gift" id="gift4"><td>
    Dead Parrot
    </td><td>
    This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
    </td><td>
    $0.50
    </td><td>
    <img src="../img/gifts/img4.jpg"/>
    </td></tr>
    <tr class="gift" id="gift5"><td>
    Mystery Box
    </td><td>
    If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
    </td><td>
    $1.50
    </td><td>
    <img src="../img/gifts/img6.jpg"/>
    </td></tr>
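
    Any function that takes a tag and returns True or False can be passed as a filter. A minimal sketch, assuming a made-up SAMPLE_HTML rather than the real page, with find_all as the current bs4 spelling of findAll:

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML with tags carrying zero, one, or two attributes.
SAMPLE_HTML = (
    '<img src="logo.jpg" style="float:left;">'
    "<table>"
    '<tr class="gift" id="gift1"><td>Vegetable Basket</td></tr>'
    '<tr class="gift"><td>Dead Parrot</td></tr>'
    "</table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The lambda is called once per tag; tags for which it returns True are kept.
two_attrs = soup.find_all(lambda tag: len(tag.attrs) == 2)

# Any other predicate works the same way, e.g. "has an id attribute".
with_id = soup.find_all(lambda tag: tag.has_attr("id"))

print([tag.name for tag in two_attrs])
print([tag["id"] for tag in with_id])
```

    The second <tr> has only a class attribute, so it is excluded by both filters.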
  • Original post: https://www.cnblogs.com/dxs959229640/p/8672842.html