  • Python Notes 2: Scraping HTML Content with lxml.etree

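    Preface

    This post continues the lxml.etree series: it requests an endpoint over the network and parses the wanted text out of the HTML that the endpoint returns.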

    Environment:

    Python 3.7

    lxml

    requests

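    Locating the target

    The goal is to scrape the personal profile information shown in the sidebar of my blog home page, https://www.cnblogs.com/canglongdao/.

    Open Fiddler to capture traffic, then refresh the blog home page. The captured requests include the endpoint that returns this information:

    https://www.cnblogs.com/canglongdao/ajax/news.aspx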

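    # coding:utf-8
    import requests
    import urllib3
    from lxml import etree

    # verify=False below skips TLS certificate verification, so silence the
    # InsecureRequestWarning that urllib3 would otherwise print.
    urllib3.disable_warnings()

    url = "https://www.cnblogs.com/canglongdao/ajax/news.aspx"
    r = requests.get(url, verify=False)
    # print(r.text)
    a = etree.HTML(r.content.decode("utf-8"))
    b = a.xpath("//*[@id='profile_block']")
    # Print the extracted node as pretty-printed HTML
    # (stored under a new name so the response object r is not overwritten)
    html = etree.tostring(b[0], encoding="utf-8", pretty_print=True)
    print(html.decode("utf-8"))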
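
    To try the same lookup without network access, the parser can be fed an inline string. The sketch below is only an illustration: the SAMPLE markup is made up and the real response may be structured differently. It also checks that the XPath matched something before indexing, which the script above skips.

```python
from lxml import etree

# Hypothetical stand-in for the endpoint's response; the real markup may differ.
SAMPLE = """
<div id="sidebar">
  <div id="profile_block">
    Nickname:
    <a href="/u/demo/">demo_user</a>
    <br/>
  </div>
</div>
"""

root = etree.HTML(SAMPLE)
# For a node-set expression xpath() returns a list, empty when nothing matches.
matches = root.xpath("//*[@id='profile_block']")

if matches:
    block = matches[0]
    # encoding="unicode" makes tostring() return str instead of bytes
    print(etree.tostring(block, encoding="unicode", pretty_print=True))
else:
    print("profile_block not found")
```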
    

    Running it prints the HTML of the profile_block div.

    Extracting the content

    # coding:utf-8
    import requests
    import urllib3
    from lxml import etree

    urllib3.disable_warnings()  # silence the warning triggered by verify=False
    url = "https://www.cnblogs.com/canglongdao/ajax/news.aspx"
    r = requests.get(url, verify=False)
    # print(r.text)
    a = etree.HTML(r.content.decode("utf-8"))
    b = a.xpath("//*[@id='profile_block']")
    t0 = b[0].xpath("text()")  # text nodes that are direct children of the div
    print(t0)
    t00 = t0[::2]  # every other text node: the labels, without the whitespace-only gaps
    print(t00)
    t1 = b[0].xpath('a')  # the <a> children, which hold the values
    print(t1)

    # Print the result
    for i, j in zip(t00, t1):
        ii = i.replace('\n', '').replace(' ', '')  # strip all newlines and spaces from i
        jj = j.text.replace('\n', '').replace(' ', '')
        print(ii, jj)
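
    The t0[::2] slice relies on labels and whitespace-only text nodes alternating exactly. A variant that filters by content instead is sketched below. It is an alternative, not the approach used above, and it runs on made-up markup with English labels, so the names and values are assumptions.

```python
from lxml import etree

# Hypothetical markup that mimics the label / <a> value layout of the sidebar.
SAMPLE = """<div id="profile_block">
    Nickname:
    <a href="/u/demo/">
        demo_user
    </a>
    <br/>
    Followers:
    <a href="/u/demo/followers/">
        8
    </a>
    <br/>
</div>"""

block = etree.HTML(SAMPLE).xpath("//*[@id='profile_block']")[0]

# Keep only the direct text nodes that contain something besides whitespace.
labels = [t.strip() for t in block.xpath("text()") if t.strip()]
# strip() trims surrounding whitespace but keeps spaces inside a value;
# "or ''" guards against an <a> element with no text at all.
values = [(a.text or "").strip() for a in block.xpath("a")]

profile = dict(zip(labels, values))
print(profile)
```

    Unlike replace(' ', ''), strip() leaves inner spaces alone, which matters if a value contains a space.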
    

    Output

    ['\n        昵称:\n        ', '\n        ', '\n        园龄:\n        ', '\n        ', '\n        粉丝:\n        ', '\n        ', '\n        关注:\n        ', '\n        ', '\n        ', '\n    ']
    ['\n        昵称:\n        ', '\n        园龄:\n        ', '\n        粉丝:\n        ', '\n        关注:\n        ', '\n        ']
    [<Element a at 0x163596090c8>, <Element a at 0x16359609048>, <Element a at 0x1635961ba08>, <Element a at 0x1635961b248>]
    昵称: 星空6
    园龄: 1年7个月
    粉丝: 8
    关注: 3  
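
    In the output the sliced label list has five entries while only four a elements were found, yet the loop prints four clean lines. That is because zip() stops at the shorter input. A tiny illustration with made-up values:

```python
# Made-up data: three "labels" (the last is whitespace only) and two values.
labels = ["Nickname:", "Followers:", "\n    "]
links = ["demo_user", "8"]

# zip() pairs items positionally and stops when the shorter list runs out,
# so the trailing whitespace entry is silently dropped.
pairs = list(zip(labels, links))
print(pairs)
```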

    Summary

    1. Get the current node's tag name: .tag

    print(b[0].tag)
    div
    

    2. Get the current node's text: .text (only the text that comes before the first child element)

     

    print(b[0].text)
    昵称:
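
    The reason .text returns only the first label is that lxml stores text following a child element on that child's .tail, not on the parent. A minimal sketch on a made-up fragment:

```python
from lxml import etree

# Made-up fragment: text, a child element, then more text.
div = etree.fromstring("<div>label one<a>value</a>label two</div>")
link = div[0]  # the <a> child

print(repr(div.text))   # text before the first child
print(repr(link.text))  # text inside <a>
print(repr(link.tail))  # text after </a>, still inside <div>
```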
    

    3. Get all attributes of the current node as a dict: .attrib

    print(b[0].attrib)
    {'id': 'profile_block'}
    

    4. Get a single attribute of the current node: .get()

    print(b[0].get("id"))
    profile_block
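
    As a small illustration of points 3 and 4, the sketch below uses a made-up element (the class attribute is an assumption, not part of the real page). It shows that .attrib can be copied into a plain dict and that .get() returns None, or a supplied default, for a missing attribute.

```python
from lxml import etree

# Made-up element; only the id matches the real page.
node = etree.fromstring('<div id="profile_block" class="side"/>')

attrs = dict(node.attrib)            # copy the attributes into a plain dict
missing = node.get("title")          # attribute not present
fallback = node.get("title", "n/a")  # second argument is the default

print(attrs, missing, fallback)
```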
    

    5. Iterate over the node itself and all of its descendants: .iter()

    for i in b[0].iter():
        print(i.text)
    
    昵称:
            
    
                星空6
            
    None
    
                1年7个月
            
    None
    
                8
            
    None
    
                3
            
    None
    getFollowStatus('fe2d40f4-c531-49cf-1c8d-08d666411c36');
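
    The None lines in this output come from elements that have no text of their own. .iter() also accepts a tag name to restrict the walk. A sketch on made-up markup (the br elements are an assumption about what sits between the fields):

```python
from lxml import etree

# Made-up fragment shaped like the sidebar: label, <a> value, <br/>.
div = etree.fromstring(
    "<div>Nickname:<a>demo_user</a><br/>Followers:<a>8</a><br/></div>"
)

all_tags = [el.tag for el in div.iter()]      # starts with the div itself
link_texts = [a.text for a in div.iter("a")]  # only the <a> elements
br_texts = [el.text for el in div.iter("br")] # empty elements have text None

print(all_tags, link_texts, br_texts)
```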
    

    6. Get the text nodes that are direct children of the current node: text()

    print(b[0].xpath('text()'))
    
    ['\n        昵称:\n        ', '\n        ', '\n        园龄:\n        ', '\n        ', '\n        粉丝:\n        ', '\n        ', '\n        关注:\n        ', '\n        ', '\n        ', '\n    ']
    

    7. Get all text in this node and its descendants: .//text()

    print(b[0].xpath('.//text()'))
    
    ['\n        昵称:\n        ', '\n            星空6\n        ', '\n        ', '\n        园龄:\n        ', '\n            1年7个月\n        ', '\n        ', '\n        粉丝:\n        ', '\n            8\n        ', '\n        ', '\n        关注:\n        ', '\n            3\n        ', '\n        ', '\n        ', "getFollowStatus('fe2d40f4-c531-49cf-1c8d-08d666411c36');", '\n    ']
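
    Because .//text() descends into every element, it also picks up the inline script (the getFollowStatus call). One possible way to compare text() with .//text() and to drop script content and whitespace is sketched below on made-up markup; the track() call is a placeholder.

```python
from lxml import etree

# Made-up fragment with a label, a value and an inline script.
div = etree.fromstring(
    "<div> Nickname: <a> demo_user </a><script>track();</script></div>"
)

direct = div.xpath("text()")    # direct children only
deep = div.xpath(".//text()")   # any depth, script included
# Skip text nodes whose parent element is <script>.
no_script = div.xpath(".//text()[not(parent::script)]")
# Trim whitespace and drop entries that were whitespace only.
cleaned = [t.strip() for t in no_script if t.strip()]

print(direct, deep, cleaned, sep="\n")
```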
    

    8. Get the parent node: .getparent()

     

    print(b[0].getparent().tag)
    div
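
    Besides getparent(), lxml elements also offer getprevious() and getnext() for moving between siblings. A short sketch on made-up markup (the outer id, h3 and p are assumptions chosen for the example):

```python
from lxml import etree

# Made-up tree: the target div sits between two siblings.
root = etree.fromstring(
    "<div id='outer'><h3>title</h3><div id='profile_block'/><p>after</p></div>"
)
block = root.xpath("//*[@id='profile_block']")[0]

parent = block.getparent()         # the enclosing element
prev_tag = block.getprevious().tag # sibling just before
next_tag = block.getnext().tag     # sibling just after
root_parent = root.getparent()     # the root element has no parent

print(parent.get("id"), prev_tag, next_tag, root_parent)
```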
    

      

    # coding:utf-8
    import requests
    import urllib3
    from lxml import etree

    urllib3.disable_warnings()  # silence the warning triggered by verify=False
    url = "https://www.cnblogs.com/canglongdao/ajax/news.aspx"
    r = requests.get(url, verify=False)
    # print(r.text)
    a = etree.HTML(r.content.decode("utf-8"))
    b = a.xpath("//*[@id='profile_block']")
    print(b[0].tag)          # div
    print(b[0].text)         # 昵称: (with surrounding whitespace)
    print(b[0].attrib)       # {'id': 'profile_block'}
    print(b[0].get("id"))    # profile_block
    for i in b[0].iter():    # the node itself, then every descendant
        print(i.text)
    print(b[0].xpath('text()'))     # direct text nodes only
    print(b[0].xpath('.//text()'))  # text nodes at any depth
    print(b[0].getparent().tag)     # div
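
    The scripts in this post index b[0] directly and set no timeout, so a changed page or a slow server ends in an IndexError or a hang. One possible defensive wrapper is sketched below. The function names fetch_html and find_by_id are hypothetical, and unlike the scripts above it leaves certificate verification on, which may need adjusting when traffic passes through an intercepting proxy.

```python
import requests
from lxml import etree


def fetch_html(url, timeout=10):
    """Download a page and return its body decoded as UTF-8."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return resp.content.decode("utf-8")


def find_by_id(html_text, element_id):
    """Return the first element whose id matches, or None if there is none."""
    root = etree.HTML(html_text)
    # $eid is an XPath variable; lxml fills it from the keyword argument.
    matches = root.xpath("//*[@id=$eid]", eid=element_id)
    return matches[0] if matches else None


# Offline check on a made-up snippet (not the real response).
demo = "<div><div id='profile_block'>hello</div></div>"
found = find_by_id(demo, "profile_block")
absent = find_by_id(demo, "no_such_id")
print(found.text, absent)
```

    Against the live endpoint the two helpers would be combined as find_by_id(fetch_html(url), "profile_block"), with a None check before reading from the result.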
    

      

     

     

     

     

  • Original post: https://www.cnblogs.com/canglongdao/p/13447629.html