  • The PyQuery Library in Detail

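    PyQuery is a powerful and flexible web page parsing library. If you find regular expressions too tedious to write and BeautifulSoup's syntax too hard to remember, but you already know jQuery syntax, PyQuery is an excellent choice.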

    Installation: pip3 install pyquery

    Initialization

    Initializing from a string

    from pyquery import PyQuery  as pq
    
    html = '''
    <div>
        <url>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link3.html'><span class='bold'>third item</span></a></li>
        </url>
    </div>
    '''
    doc = pq(html)
    print(doc('li'))
    # Selection works the same way as CSS selectors: prefix a class with a dot, prefix an id with #, and write a tag name with no prefix
    Output:
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link3.html"><span class="bold">third item</span></a></li>

    Initializing from a URL

    from pyquery import PyQuery  as pq
    
    doc = pq(url='http://www.baidu.com')
    print(doc('head'))
    Output:
    <head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下,你就知道</title></head> 
    

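    Here a URL is passed in. PyQuery requests that URL automatically, hands the returned source code to pq, and builds a PyQuery object from it.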

    Initializing from a file

    from pyquery import PyQuery  as pq
    
    doc = pq(filename='1.html')
    print(doc('url'))
    Output:
    
    <url>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link3.html"><span class="bold">third item</span></a></li>
         </url>
    ------------------------
    Contents of 1.html:
    <div>
         <url>
             <li class='item-0'>first item</li>
             <li class='item-1'><a href='link3.html'><span class='bold'>third item</span></a></li>
         </url>
    </div>
    

    Basic CSS selectors:

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    print(doc('#container .list li'))
    
    Output:
    <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
    

    In CSS selectors, an id is prefixed with #, a class is prefixed with a dot, and a tag name takes no prefix.

    Finding elements

    Finding child elements

    The find method searches for elements contained inside an element

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    items = doc('.list')
    print(type(items))
    print(items)
    lis = items.find('li')
    print(type(lis))
    print(lis)
    
    Output:
    <class 'pyquery.pyquery.PyQuery'>
    <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        
    </ul>
    <class 'pyquery.pyquery.PyQuery'>
    <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li> 

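    The children method finds direct child elements only, whereas find matches any element nested inside, at any depth. find is the more commonly used of the two.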

    Finding parent elements

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    items = doc('.list')
    print(items.parent())
    Output:
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        
    </ul></div>
    

    There is also a parents method, which finds ancestor nodes: not only the parent, but also the parent's parent and so on.

    As when finding elements, you can pass an argument (similar to a CSS selector) to these methods to filter the result further, for example:

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    items = doc('.list')
    print(items.parent('#container'))
    # Keep only the parent elements whose id is container
    Output:
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        
    </ul></div>

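    Sibling elements

    The siblings method

    Note: in a selector such as doc('.list .item-0.active'), a space means descending level by level (a descendant match), while no space means the conditions apply to the same element, that is, an element that has both the item-0 class and the active class.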

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    items = doc('.list .item-0.active')
    print(items)
    Output:
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    

    Calling items.siblings() outputs its sibling elements:

    <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0">first item</li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
    

     

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    items = doc('.list .item-0.active')
    print(items.siblings())
    print(items.siblings('.active'))
    # A selector can be passed to narrow the siblings down further
    Output:
    <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0">first item</li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
    
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>

    Iteration

    The items() method returns a generator, which you then traverse with a for loop

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    lis = doc('li').items()
    for li in lis:
        print(li)
    
    Output:
    <li class="item-0">first item</li>
            
    <li class="item-1"><a href="link2.html">second item</a></li>
            
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            
    <li class="item-0"><a href="link5.html">fifth item</a></li>  

    Getting information

    Getting attributes

    For example, to read an attribute of an element named item:

    item.attr('attribute_name'), or:

    item.attr.attribute_name

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    li = doc('.item-0.active a')
    print(li.attr.href)
    print(li.attr('href'))
    
    Output:
    link3.html
    link3.html  

    Getting text

    The text() method

    Getting HTML

    The html() method, for example:

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    print(li.html())
    Output:
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            
    <a href="link3.html"><span class="bold">third item</span></a>
    # Printing li gives the tag itself together with its contents;
    # calling html() gives only the HTML code inside the tag

    DOM manipulation

    That is, operating on nodes

    addClass, removeClass: adding and removing classes

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    print(li.removeClass('active'))
    print(li.addClass('active'))
    Output:
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            
    <li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
            
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>  

    attr, css: modifying attributes and styles

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    print(li.attr('name','link'))
    print(li.css('font-size','14px'))
    Output:
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            
    <li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
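    # There was no name attribute before, so one is added; if a name attribute had already existed, its value would be overwritten
    <li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>
    # After calling css, a style attribute appears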

    remove  

    from pyquery import PyQuery as pq
    
    html = '''
    <div class='wrap'>
        hello world
        <p>this is a paragraph</p>
    </div>
    '''
    
    doc = pq(html)
    wrap = doc('.wrap')
    print(wrap.text())
    print(wrap.find('p'))
    wrap.find('p').remove()
    print(wrap.text())
    Output:
    hello world
    this is a paragraph
    <p>this is a paragraph</p>
    
    hello world
    

    Other DOM methods

    http://pyquery.readthedocs.io/en/latest/api.html 

    Pseudo-class selectors

    Click here for more selectors

    PyQuery official documentation

  • Original article: https://www.cnblogs.com/ronghe/p/9190630.html