  • Python web scraping: basic usage of PyQuery

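    PyQuery is a powerful and flexible library for parsing web pages. If you have front-end experience and have used jQuery, PyQuery is an excellent choice: it is a Python library modeled closely on jQuery, and its syntax is almost identical, so there are few unfamiliar method names to memorize.
    Official documentation: http://pyquery.readthedocs.io/en/latest/
    jQuery reference: http://jquery.cuishifeng.cn/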

    1. Initializing from a string

    from pyquery import PyQuery as pq
    
    html = '''<div>
        <ul>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    print(doc)
    print(type(doc))
    print(doc('li'))
    
    <div>
        <ul>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>
    <class 'pyquery.pyquery.PyQuery'>
    <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
    Output

    2. Opening an HTML file

      Note: make sure the path to the file is correct.

    from pyquery import PyQuery as pq
    doc = pq(filename='index.html')
    print(doc)
    print(doc('head'))
    
        <title>Title</title>
    </head>
    <body>
        <div>
        <ul>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    </body>
    </html>
    <head>
        <meta charset="UTF-8"/>
        <title>Title</title>
    </head>
    Output

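    3. Opening a website

    from pyquery import PyQuery as pq

    # A string that starts with http:// or https:// is treated as a URL and fetched
    doc = pq('https://www.baidu.com')
    # Equivalent, with the keyword spelled out:
    # doc1 = pq(url='https://www.baidu.com')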
    print(doc)
    print(doc('head'))
    

      

    4. Finding elements with CSS selectors

    from pyquery import PyQuery as pq
    
    html = '''<div>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    print(doc)
    # The span inside an a tag inside an element with class item-0 inside the element with id haha (levels are separated by spaces)
    print(doc('#haha .item-0 a span'))
    <div>
        <ul id="haha">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>
    <span class="bold">third item</span>
    Output

    5. Starting from a tag that has already been found, you can look up its child or parent tags without searching from the top again.

    from pyquery import PyQuery as pq
    
    html = '''<div class='content'>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    item = doc('div ul')
    print(item)
    # Starting from the tag we already found, look up its parent and its children
    print(item.parent())
    print(item.children())
    
    <ul id="haha">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
    <div class="content">
        <ul id="haha">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>
    <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
    Output
    from pyquery import PyQuery as pq
    
    html = '''<div class='content'>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    item = doc('div ul')
    print(item)
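    # children() looks only at the direct children of ul, that is, the li tags. The filter below keeps the children that have a class attribute. Replacing class with href finds nothing, because children() covers direct children only, not all descendants.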
    print(item.children('[class]'))
    

    6. Getting attribute values

    from pyquery import PyQuery as pq
    
    html = '''<div class='content'>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
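    # class="item-0 active" is a single class attribute holding two classes. In a selector a space means "descendant",
    # so '.item-0 active a' would look for an a tag inside an <active> tag inside .item-0. Join the two classes with a dot instead.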
    item = doc(".item-0.active a")
    print(type(item))
    print(item)
    # Two ways to get an attribute value
    print(item.attr.href)
    print(item.attr('href'))
    
    <class 'pyquery.pyquery.PyQuery'>
    <a href="link3.html"><span class="bold">third item</span></a>
    link3.html
    link3.html
    Output

    7. Getting the content of tags

    from pyquery import PyQuery as pq
    
    html = '''<div class='content'>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    a = doc("a").text()
    print(a)
    
    # Interestingly, text() collects the text of every matched tag and joins it into one string, like a sentence
    second item third item fourth item fifth item
    Output

    8. DOM operations

    1. Adding and removing classes

    from pyquery import PyQuery as pq
    
    html = '''<div class='content'>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    # Remove the class active
    print(li.removeClass('active'))
    # Add the class haha
    print(li.addClass('haha'))
    print(li.addClass('haha'))
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             
    <li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
             
    <li class="item-0 haha"><a href="link3.html"><span class="bold">third item</span></a></li>
    Output

    2. attr and css

      Note: the operations below update the value if it already exists and add it if it does not.

    from pyquery import PyQuery as pq
    
    html = '''<div class='content'>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    print(li.attr('id','id_test'))
    print(li.css('font-size','20px'))
    
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             
    <li class="item-0 active" id="id_test"><a href="link3.html"><span class="bold">third item</span></a></li>
             
    <li class="item-0 active" id="id_test" style="font-size: 20px"><a href="link3.html"><span class="bold">third item</span></a></li>
    Output

    3. Removing a tag. Scraped markup or content often includes tags you do not want; they can be removed with an approach like the one below.

    from pyquery import PyQuery as pq
    
    html = '''<div class='content'>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    data = doc('.content')
    print(data.text())
    # Remove all a tags
    data.find('a').remove()
    # Print again
    print(data.text())
    first item second item third item fourth item fifth item
    first item
    Output
  • Original article: https://www.cnblogs.com/lei0213/p/7676254.html