  • Scraping data with an etree object (the xpath function)

    Environment setup
    pip install lxml

    Parsing workflow:

    • Fetch the page's HTML source
    • Instantiate an etree object and load the page source into it
    • Call the object's xpath method to locate the target tags
    • Note: the xpath function must be combined with an XPath expression to locate tags and capture their content
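
    The steps above can be sketched on an inline HTML snippet (the markup below is made up for illustration; real pages would be fetched with requests as in the examples that follow):

```python
from lxml import etree

# Hypothetical page source standing in for a downloaded page.
html = '''
<ul class="house-list">
  <li><a href="/a">First</a></li>
  <li><a href="/b">Second</a></li>
</ul>
'''

tree = etree.HTML(html)  # instantiate the etree object from the source

# xpath() takes an XPath expression and returns a list of matches:
titles = tree.xpath('//ul[@class="house-list"]/li/a/text()')
hrefs = tree.xpath('//ul[@class="house-list"]/li/a/@href')
print(titles)  # ['First', 'Second']
print(hrefs)   # ['/a', '/b']
```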

    Examples

    1. Scraping second-hand housing data from 58.com

    Code:

        import requests
        from lxml import etree

        url = 'https://bj.58.com/shahe/ershoufang/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d30000c-0047-e4e6-f587-683307ca570e&ClickID=1'
        headers = {
            'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
        }
        page_text = requests.get(url=url,headers=headers).text

        tree = etree.HTML(page_text)  # build an etree instance from the page source
        li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')
        fp = open('58.csv','w',encoding='utf-8')
        for li in li_list:
            title = li.xpath('./div[2]/h2/a/text()')[0]
            price = li.xpath('./div[3]//text()')  # the price is split across several text nodes
            price = ''.join(price)
            fp.write(title+":"+price+'\n')
        fp.close()
        print('over')
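
    Two details in that loop are worth noting: a leading ./ makes the expression relative to the current li element, and //text() returns every descendant text node as a list, which is why the price has to be joined. A minimal sketch (the markup is invented for illustration):

```python
from lxml import etree

# A hypothetical listing fragment whose price spans two text nodes.
li = etree.HTML('<li><div class="price"><b>180</b><span>万</span></div></li>').xpath('//li')[0]

# './' anchors the search at this li; '//text()' collects all descendant text.
parts = li.xpath('./div[@class="price"]//text()')
print(parts)           # ['180', '万']
print(''.join(parts))  # 180万
```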

    2. Scraping high-resolution images

    Here we use urllib to save each image to disk quickly
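
    urllib.request.urlretrieve fetches a URL and writes it straight to a local file, replacing the usual requests.get(...).content plus open/write pair. A sketch using a local file:// URL so it runs without network access (the file names are placeholders):

```python
import os
import tempfile
import urllib.request

# Create a small source file standing in for a remote image.
src = os.path.join(tempfile.gettempdir(), 'fake_image.jpg')
with open(src, 'wb') as f:
    f.write(b'fake image bytes')

# urlretrieve downloads the URL and saves it under `filename` in one call.
dst = os.path.join(tempfile.gettempdir(), 'saved.jpg')
urllib.request.urlretrieve(url='file://' + src, filename=dst)

with open(dst, 'rb') as f:
    print(f.read())  # b'fake image bytes'
```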

        import requests
        from lxml import etree
        import os
        import urllib.request

        url = 'http://pic.netbian.com/4kmeinv/'
        headers = {
            'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
        }
        response = requests.get(url=url,headers=headers)
        #response.encoding = 'utf-8'
        if not os.path.exists('./imgs'):
            os.mkdir('./imgs')
        page_text = response.text

        tree = etree.HTML(page_text)
        li_list = tree.xpath('//div[@class="slist"]/ul/li')
        for li in li_list:
            img_name = li.xpath('./a/b/text()')[0]
            # fix garbled Chinese: the page is GBK but was decoded as ISO-8859-1
            img_name = img_name.encode('iso-8859-1').decode('gbk')
            img_url = 'http://pic.netbian.com'+li.xpath('./a/img/@src')[0]
            img_path = './imgs/'+img_name+'.jpg'
            urllib.request.urlretrieve(url=img_url,filename=img_path)
            print(img_path,'downloaded')
        print('over!!!')
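
    The encode('iso-8859-1').decode('gbk') line deserves a word: the site serves GBK, but requests guesses ISO-8859-1, a codec in which every byte decodes to exactly one character. The wrong decode therefore loses nothing and can be undone, as this round-trip sketch shows:

```python
# What the server sends: Chinese text encoded as GBK bytes.
original = '美女壁纸'
raw = original.encode('gbk')

# What requests hands back when it wrongly assumes ISO-8859-1.
garbled = raw.decode('iso-8859-1')

# Undo the wrong decode, then apply the right one.
fixed = garbled.encode('iso-8859-1').decode('gbk')
print(fixed)  # 美女壁纸
```

    Setting response.encoding = 'gbk' before reading response.text fixes the whole page at the source instead.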

    3. Downloading image data from jandan.net

    Here we meet a common anti-scraping technique: the data is obfuscated

    Open the raw response returned by the request to find the encoded image addresses

    The image URLs visible in the browser's Elements panel only exist after the page has finished loading its scripts, so they cannot be taken from the raw response directly

    Code:
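
    The "encryption" here is just Base64: each span holds the Base64 form of a protocol-relative image URL. A sketch with a made-up address (the real hashes come from the page):

```python
import base64

# Simulate what the page serves: the Base64 of a protocol-relative URL
# (this address is invented for illustration).
img_hash = base64.b64encode(b'//wx1.example.org/large/abc123.jpg').decode()

# Decoding recovers the address; prepending 'http:' makes it fetchable.
img_url = 'http:' + base64.b64decode(img_hash).decode()
print(img_url)  # http://wx1.example.org/large/abc123.jpg
```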

        import requests
        from lxml import etree
        import base64
        import urllib.request

        headers = {
            'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
        }
        url = 'http://jandan.net/ooxx'
        page_text = requests.get(url=url,headers=headers).text

        tree = etree.HTML(page_text)
        img_hash_list = tree.xpath('//span[@class="img-hash"]/text()')
        for img_hash in img_hash_list:
            img_url = 'http:'+base64.b64decode(img_hash).decode()  # decode the hash to recover the real image address
            img_name = img_url.split('/')[-1]
            urllib.request.urlretrieve(url=img_url,filename=img_name)

    4. Downloading résumé templates

    When firing off many requests in a row, the IP can get banned for excessive traffic; use proxy IPs, or close the connection as soon as each request finishes
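
    Besides the 'Connection': 'close' header used below, requests can route each call through a different proxy via its proxies parameter. A sketch with placeholder addresses (these are not working proxies; substitute real ones):

```python
import random

# Placeholder proxy pool — fill in live proxy addresses before use.
proxy_pool = [
    {'http': 'http://10.0.0.1:8080'},
    {'http': 'http://10.0.0.2:3128'},
]

# Rotate: pick a different proxy for each request.
proxy = random.choice(proxy_pool)
# response = requests.get(url, headers=headers, proxies=proxy)
print(proxy)
```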

        import requests
        import random
        from lxml import etree
        headers = {
            'Connection':'close', # drop the connection as soon as the request finishes (frees pool resources promptly)
            'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
        }
        url = 'http://sc.chinaz.com/jianli/free_%d.html'
        for page in range(1,4):
            if page == 1:
                new_url = 'http://sc.chinaz.com/jianli/free.html'  # page 1 has no number in its URL
            else:
                new_url = url % page

            response = requests.get(url=new_url,headers=headers)
            response.encoding = 'utf-8'
            page_text = response.text

            tree = etree.HTML(page_text)
            div_list = tree.xpath('//div[@id="container"]/div')
            for div in div_list:
                detail_url = div.xpath('./a/@href')[0]
                name = div.xpath('./a/img/@alt')[0]

                detail_page = requests.get(url=detail_url,headers=headers).text
                tree = etree.HTML(detail_page)
                download_list = tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')
                download_url = random.choice(download_list)  # pick one of the mirror links at random
                data = requests.get(url=download_url,headers=headers).content
                fileName = name+'.rar'
                with open(fileName,'wb') as fp:
                    fp.write(data)
                    print(fileName,'downloaded')
  • Original post: https://www.cnblogs.com/liaopeng123/p/10446657.html