  • PyQuery Library Explained

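    PyQuery is a powerful and flexible web page parsing library. If you find regular expressions too tedious to write and BeautifulSoup's syntax too hard to remember, but you already know jQuery syntax, PyQuery is an excellent choice.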

    Installation: pip3 install pyquery

    Initialization

    Initializing from a string

    from pyquery import PyQuery as pq
    
    html = '''
    <div>
        <ul>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link3.html'><span class='bold'>third item</span></a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    # Selection works like CSS selectors: prefix a class with '.', an id with '#', and use the bare name for a tag.
    print(doc('li'))
    Output:
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link3.html"><span class="bold">third item</span></a></li>

    Initializing from a URL

    from pyquery import PyQuery as pq
    
    doc = pq(url='http://www.baidu.com')
    print(doc('head'))
    Output:
    <head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下,你就知道</title></head> 
    

    Here a URL is passed in: PyQuery requests that URL itself, hands the returned source to pq, and produces a PyQuery object.

    Initializing from a file

    from pyquery import PyQuery as pq
    
    doc = pq(filename='1.html')
    print(doc('ul'))
    Output:
    
    <ul>
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link3.html"><span class="bold">third item</span></a></li>
         </ul>
    ------------------------
    Contents of 1.html:
    <div>
         <ul>
             <li class='item-0'>first item</li>
             <li class='item-1'><a href='link3.html'><span class='bold'>third item</span></a></li>
         </ul>
    </div>
    

    Basic CSS selectors:

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    print(doc('#container .list li'))
    
    Output:
    <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
    

    In a CSS selector, an id is prefixed with #, a class is prefixed with a dot, and a tag name takes no prefix.

    Finding elements

    Finding child elements

    The find method: finds elements nested anywhere inside the selected element

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    items = doc('.list')
    print(type(items))
    print(items)
    lis = items.find('li')
    print(type(lis))
    print(lis)
    
    Output:
    <class 'pyquery.pyquery.PyQuery'>
    <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        
    </ul>
    <class 'pyquery.pyquery.PyQuery'>
    <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li> 

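    The children method finds only direct children, whereas find matches any element nested inside at any depth. find is the more commonly used of the two.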

    Finding parent elements

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    items = doc('.list')
    print(items.parent())
    Output:
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        
    </ul></div>
    

    There is also a parents method, which finds ancestor nodes: not only the parent, but the parent's parent and so on.

    Just as when finding elements, you can pass an argument (a CSS-style selector) to these methods to filter the result further, for example:

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    items = doc('.list')
    print(items.parent('#container'))
    # Keep only the parent elements whose id is container
    Output:
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul></div>

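    Sibling elements

    The siblings method

    Note: in a selector such as doc('.list .item-0.active'), a space means descending level by level, while no space means the conditions apply to the same element, i.e. the element has both the item-0 class and the active class.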

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    items = doc('.list .item-0.active')
    print(items)
    Output:
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    

    Calling items.siblings() then returns its sibling elements:

    <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0">first item</li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
    

     

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    items = doc('.list .item-0.active')
    print(items.siblings())
    print(items.siblings('.active'))
    # The lookup can be narrowed further by passing a selector
    Output:
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0">first item</li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>

    Iteration

    The items() method: it returns a generator, which you then traverse with a for loop

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    lis = doc('li').items()
    for li in lis:
        print(li)
    
    Output:
    <li class="item-0">first item</li>
            
    <li class="item-1"><a href="link2.html">second item</a></li>
            
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            
    <li class="item-0"><a href="link5.html">fifth item</a></li>  

    Extracting information

    Getting attributes

    For example, to read an attribute of an element item:

    item.attr('attribute_name'), or:

    item.attr.attribute_name

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    a = doc('.item-0.active a')
    print(a.attr.href)
    print(a.attr('href'))
    
    Output:
    link3.html
    link3.html  

    Getting text

    The text() method

    Getting HTML

    The html() method, for example:

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    print(li.html())
    Output:
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            
    <a href="link3.html"><span class="bold">third item</span></a>
    # Printing li gives the tag itself together with its contents,
    # while html() returns only the HTML inside the tag

    DOM manipulation

    That is, operations on nodes

    addClass, removeClass: adding and removing classes

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    print(li.removeClass('active'))
    print(li.addClass('active'))
    Output:
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            
    <li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
            
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>  

    attr, css: setting attributes and styles

    from pyquery import PyQuery as pq
    
    html = '''
    <div id='container'>
        <ul class='list'>
            <li class='item-0'>first item</li>
            <li class='item-1'><a href='link2.html'>second item</a></li>
            <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
            <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
            <li class='item-0'><a href='link5.html'>fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    print(li.attr('name','link'))
    print(li.css('font-size','14px'))
    Output:
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            
    <li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
    # There was no name attribute before, so one is added; if a name attribute had already existed, its value would be overwritten
    <li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>
    # After calling css, a style attribute appears

    remove  

    from pyquery import PyQuery as pq
    
    html = '''
    <div class='wrap'>
        hello world
        <p>this is a paragraph</p>
    </div>
    '''
    
    doc = pq(html)
    wrap = doc('.wrap')
    print(wrap.text())
    print(wrap.find('p'))
    wrap.find('p').remove()
    print(wrap.text())
    Output:
    hello world
    this is a paragraph
    <p>this is a paragraph</p>
    
    hello world
    

    Other DOM methods

    http://pyquery.readthedocs.io/en/latest/api.html 

    Pseudo-class selectors

    For more selectors, see a CSS selector reference.

    Official pyquery documentation

  • Original article: https://www.cnblogs.com/ronghe/p/9190630.html