  • Detailed usage of pyquery

    Python web scraping: basic usage of PyQuery

     

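    PyQuery is a powerful and flexible HTML parsing library. If you have front-end experience and have used jQuery, PyQuery is a natural choice: it is a Python library modelled closely on jQuery, and its selector and method syntax is nearly identical, so there are few new method names to memorize.
    Official documentation: http://pyquery.readthedocs.io/en/latest/
    jQuery reference (in Chinese): http://jquery.cuishifeng.cn/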

    1. Initializing from a string

    from pyquery import PyQuery as pq
    
    html = '''<div>
        <ul>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    print(doc)
    print(type(doc))
    print(doc('li'))
    
    Output:
    <div>
        <ul>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>
    <class 'pyquery.pyquery.PyQuery'>
    <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>

    2. Loading an HTML file

      Mind the file path: a relative filename is resolved against the current working directory.

    from pyquery import PyQuery as pq
    doc = pq(filename='index.html')
    print(doc)
    print(doc('head'))
    
    Output:
        <title>Title</title>
    </head>
    <body>
        <div>
        <ul>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    </body>
    </html>
    <head>
        <meta charset="UTF-8"/>
        <title>Title</title>
    </head>

    3. Loading a page from a URL

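    from pyquery import PyQuery as pq
    
    # A string that starts with http:// or https:// is fetched as a URL
    doc = pq('https://www.baidu.com')
    # Equivalent, with the URL passed explicitly:
    # doc1 = pq(url='https://www.baidu.com')
    print(doc)
    print(doc('head'))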
    

      

    4. Finding elements with CSS selectors

    from pyquery import PyQuery as pq
    
    html = '''<div>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    print(doc)
    # Select the span inside an a inside an element with class item-0, all below the element with id haha (a space separates each level of nesting)
    print(doc('#haha .item-0 a span'))
    Output:
    <div>
        <ul id="haha">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>
    <span class="bold">third item</span>

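    5. Starting from an element you have already found, you can look up its children or its parent without searching from the top of the document again.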

    from pyquery import PyQuery as pq
    
    html = '''<div class='content'>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    item = doc('div ul')
    print(item)
    # Starting from the element we already found, look up its parent and its children
    print(item.parent())
    print(item.children())
    
    Output:
    <ul id="haha">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
    <div class="content">
        <ul id="haha">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>
    <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
    
    children() also accepts a selector that filters the direct children:
    
    from pyquery import PyQuery as pq
    
    html = '''<div class='content'>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    item = doc('div ul')
    print(item)
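    # children() looks only at the direct children of ul (the li elements); '[class]' keeps those that have a class attribute.
    # Replacing class with href would match nothing, because the a elements are grandchildren, not direct children.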
    print(item.children('[class]'))
    

    6. Reading attribute values

    from pyquery import PyQuery as pq
    
    html = '''<div class='content'>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
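    # class="item-0 active" puts two classes on one element. In a selector a space means "descendant",
    # so ".item-0 active a" would look for an a inside an <active> tag inside .item-0.
    # To match an element that has both classes, join them with a dot instead of a space.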
    item = doc(".item-0.active a")
    print(type(item))
    print(item)
    # Two ways to read an attribute value
    print(item.attr.href)
    print(item.attr('href'))
    
    Output:
    <class 'pyquery.pyquery.PyQuery'>
    <a href="link3.html"><span class="bold">third item</span></a>
    link3.html
    link3.html

    7. Reading the content of an element

    from pyquery import PyQuery as pq
    
    html = '''<div class='content'>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    a = doc("a").text()
    print(a)
    
    Output (text() collects the text of every matched element and joins it with spaces, so it reads like one sentence):
    second item third item fourth item fifth item

    8. DOM manipulation

    1. Adding and removing classes

    from pyquery import PyQuery as pq
    
    html = '''<div class='content'>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    # Remove the class active
    print(li.removeClass('active'))
    # Add the class haha
    print(li.addClass('haha'))

    Output:
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             
    <li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
             
    <li class="item-0 haha"><a href="link3.html"><span class="bold">third item</span></a></li>

    2. attr and css

      Note: these calls update the attribute or style if it already exists and add it otherwise.

    from pyquery import PyQuery as pq
    
    html = '''<div class='content'>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    print(li.attr('id','id_test'))
    print(li.css('font-size','20px'))
    
    Output:
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             
    <li class="item-0 active" id="id_test"><a href="link3.html"><span class="bold">third item</span></a></li>
             
    <li class="item-0 active" id="id_test" style="font-size: 20px"><a href="link3.html"><span class="bold">third item</span></a></li>

    3. Removing elements. When scraping, the markup or content you extract often contains tags you do not want; a call like the one below removes them.

    from pyquery import PyQuery as pq
    
    html = '''<div class='content'>
        <ul id = 'haha'>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul></div>'''
    
    doc = pq(html)
    data = doc('.content')
    print(data.text())
    # Remove every a element
    data.find('a').remove()
    # Print the text again
    print(data.text())

    Output:
    first item second item third item fourth item fifth item
    first item
  • Original article: https://www.cnblogs.com/dalaoban/p/10099401.html