  • Using PyQuery

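    PyQuery is a powerful library for parsing web pages. If you have front-end development experience you have probably used jQuery, which makes PyQuery a natural choice: it is a Python library modeled closely on jQuery, and its syntax is almost identical, so there are few unfamiliar methods to memorize.
    Official documentation: http://pyquery.readthedocs.io/en/latest/
    jQuery reference: http://jquery.cuishifeng.cn/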


    1. Initializing from a string

    from pyquery import PyQuery as pq

    html = '''<div>
    <ul>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul></div>'''

    doc = pq(html)
    print(doc)
    print(type(doc))
    print(doc('li'))

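    2. Loading an HTML file

    Mind the file path: a relative path such as index.html is resolved against the current working directory.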

    from pyquery import PyQuery as pq
    doc = pq(filename='index.html')
    print(doc)
    print(doc('head'))

    3. Loading a website by URL

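    from pyquery import PyQuery as pq

    # A string that starts with http:// or https:// is fetched over the network
    doc = pq('https://www.baidu.com')
    # Equivalent form with an explicit keyword argument:
    # doc1 = pq(url='https://www.baidu.com')
    print(doc)
    print(doc('head'))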
      

    4. Searching with CSS selectors

    from pyquery import PyQuery as pq

    html = '''<div>
    <ul id = 'haha'>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul></div>'''

    doc = pq(html)
    print(doc)
    # The span inside an a tag, inside an element with class item-0, inside the element with id haha (levels are separated by spaces)
    print(doc('#haha .item-0 a span'))

    5. Starting from an element you have already found, you can look up its children or its parent without searching from the top again.

    from pyquery import PyQuery as pq

    html = '''<div class='content'>
    <ul id = 'haha'>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul></div>'''

    doc = pq(html)
    item = doc('div ul')
    print(item)
    # From the element we already found, move to its parent or to its direct children
    print(item.parent())
    print(item.children())

    ----------------------------------------------------------------

    from pyquery import PyQuery as pq

    html = '''<div class='content'>
    <ul id = 'haha'>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul></div>'''

    doc = pq(html)
    item = doc('div ul')
    print(item)
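    # children() returns only the direct children of ul, i.e. the li tags; '[class]' keeps those that have a class attribute.
    # Replacing class with href would match nothing: children() looks at direct children only, not at all descendants.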
    print(item.children('[class]'))

    6. Getting attribute values

    from pyquery import PyQuery as pq

    html = '''<div class='content'>
    <ul id = 'haha'>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul></div>'''

    doc = pq(html)
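    # Note: class="item-0 active" is a single class attribute holding two class names. In a selector a space
    # means "descendant", so '.item-0 active a' would look for an a tag under an <active> tag under .item-0. Join the class names with a dot instead.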
    item = doc(".item-0.active a")
    print(type(item))
    print(item)
    # Two ways to read an attribute value
    print(item.attr.href)
    print(item.attr('href'))

    7. Getting an element's text content

    from pyquery import PyQuery as pq

    html = '''<div class='content'>
    <ul id = 'haha'>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul></div>'''

    doc = pq(html)
    a = doc("a").text()
    print(a)

    8. DOM manipulation

      1. Adding and removing class names

      from pyquery import PyQuery as pq

      html = '''<div class='content'>
      <ul id = 'haha'>
      <li class="item-0">first item</li>
      <li class="item-1"><a href="link2.html">second item</a></li>
      <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
      <li class="item-1 active"><a href="link4.html">fourth item</a></li>
      <li class="item-0"><a href="link5.html">fifth item</a></li>
      </ul></div>'''

      doc = pq(html)
      li = doc('.item-0.active')
      print(li)
      # Remove the class active
      print(li.removeClass('active'))
      # Add the class haha
      print(li.addClass('haha'))

      2. attr and css

      Note: the following calls overwrite the value if it already exists and add it if it does not.

      from pyquery import PyQuery as pq

      html = '''<div class='content'>
      <ul id = 'haha'>
      <li class="item-0">first item</li>
      <li class="item-1"><a href="link2.html">second item</a></li>
      <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
      <li class="item-1 active"><a href="link4.html">fourth item</a></li>
      <li class="item-0"><a href="link5.html">fifth item</a></li>
      </ul></div>'''

      doc = pq(html)
      li = doc('.item-0.active')
      print(li)
      print(li.attr('id','id_test'))
      print(li.css('font-size','20px'))

      3. Removing elements. Scraped content often contains tags you do not want; methods like the following remove them.

      from pyquery import PyQuery as pq

      html = '''<div class='content'>
      <ul id = 'haha'>
      <li class="item-0">first item</li>
      <li class="item-1"><a href="link2.html">second item</a></li>
      <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
      <li class="item-1 active"><a href="link4.html">fourth item</a></li>
      <li class="item-0"><a href="link5.html">fifth item</a></li>
      </ul></div>'''

      doc = pq(html)
      data = doc('.content')
      print(data.text())
      # Remove all a tags
      data.find('a').remove()
      # Print again
      print(data.text())

  • Original article: https://www.cnblogs.com/niansi/p/8547581.html