Web Scraping 2: Data Parsing -- Images, bs4, XPath, a Fix for Garbled Text, and the XPath "|" Operator

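### Review

- What requests does: it simulates a browser sending requests.
- urllib: the standard-library module that was commonly used before requests.
- Coding workflow with the requests module:
  - Specify the URL
  - Send the request:
    - get(url, params, headers)
    - post(url, data, headers)
  - Get the response data
  - Persist the data
- Dynamic parameters:
  - Sometimes the request parameters need to change. Put the parameters of the GET or POST request into a dictionary (key-value pairs == request parameters), then pass that dictionary to the params argument of get or to the data argument of post.
- UA detection (an anti-scraping mechanism):
  - What the UA is: the identity string of the client that carries the request. The server checks the UA of each request to decide who is sending it.
  - Counter-strategy: UA spoofing. Use a packet-capture tool to copy the UA value of a real browser, put it in a dictionary, and pass that dictionary to the headers argument.
- Dynamically loaded data
  - Data that is fetched through a separate request
- How do we scrape specific data from an unfamiliar site?
  - First determine whether the target data is dynamically loaded on that site
    - Yes: use the global search of the packet-capture tool to find the packet that carries the dynamically loaded data, then extract the request URL and request parameters from that packet.
    - No: use the URL in the browser's address bar directly as the URL of the requests call.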
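The dynamic-parameter and UA-spoofing steps above can be sketched offline. This is a minimal illustration rather than what requests does internally: it uses only the standard library to show roughly what a parameter dictionary becomes on the URL. The endpoint https://example.com/search and the query keys are made-up placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters, for illustration only
base_url = 'https://example.com/search'
params = {'query': 'python', 'page': 2}

# A User-Agent value copied from a real browser (the UA-spoofing idea)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}

# The dictionary is encoded into the query string of the URL
full_url = base_url + '?' + urlencode(params)

# Build the request object without sending it
req = Request(full_url, headers=headers)

print(req.full_url)
print(req.get_header('User-agent'))
```

With requests, the corresponding call would be requests.get(base_url, params=params, headers=headers), which encodes the dictionary and also sends the request.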

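### Today's topics

- Data parsing
  - What data parsing is for:
    - It lets us build a focused crawler
  - Ways to implement data parsing:
    - Regular expressions
    - bs4
    - xpath
    - pyquery
  - The general principle of data parsing
    - Question 1: where is the data that a focused crawler wants stored?
      - It is stored inside the relevant tags and in the attributes of those tags
    - 1. Locate the tag
    - 2. Extract its text or its attribute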

How do we scrape an image?

1. Using requests:

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
    }

    # Scrape a single image
    url = 'https://pic.qiushibaike.com/system/pictures/12223/122231866/medium/IZ3H2HQN8W52V135.jpg'
    img_data = requests.get(url, headers=headers).content  # bytes data
    with open('./img.jpg', 'wb') as fp:
        fp.write(img_data)
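2. Using urllib (not recommended: urlretrieve takes no headers argument, so it sends urllib's default User-Agent unless a custom opener is installed):

    # Drawback: urlretrieve cannot be given a User-Agent directly
    from urllib import request

    url = 'https://pic.qiushibaike.com/system/pictures/12223/122231866/medium/IZ3H2HQN8W52V135.jpg'
    request.urlretrieve(url, filename='./qiutu.jpg')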
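urlretrieve has no headers parameter, but urllib can still send a custom User-Agent. The following is a minimal sketch, assuming you are willing to install a global opener. build_opener and install_opener are standard urllib.request functions; the User-Agent string and the example.com URL are placeholder values. No request is actually sent.

```python
from urllib import request

# Example User-Agent value, shortened for readability
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# Build an opener whose default headers carry our User-Agent
opener = request.build_opener()
opener.addheaders = [('User-Agent', ua)]

# After this call, urlopen and urlretrieve both go through the opener
request.install_opener(opener)

# Alternatively, attach the header to a single request object
req = request.Request('https://example.com/img.jpg', headers={'User-Agent': ua})
print(req.get_header('User-agent'))
```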
Scraping all images on pages 1-3 of qiushibaike:

    import os
    import re

    import requests
    from urllib import request

    # headers is the User-Agent dictionary defined earlier
    # 1. Use a general crawler to fetch the page source of the first 3 pages
    dirName = './imgLibs'
    if not os.path.exists(dirName):
        os.mkdir(dirName)
    # Generic URL template (never changes)
    url = 'https://www.qiushibaike.com/pic/page/%d/'
    for page in range(1, 4):
        new_url = url % page
        page_text = requests.get(new_url, headers=headers).text  # page source of this page
        # Focused crawling on top of the general crawler:
        # parse the image addresses out of each page's source
        ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
        img_src_list = re.findall(ex, page_text, re.S)
        for src in img_src_list:
            src = 'https:' + src
            img_name = src.split('/')[-1]
            img_path = dirName + '/' + img_name  # ./imgLibs/xxxx.jpg
            request.urlretrieve(src, filename=img_path)
            print(img_name, 'downloaded!')
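To see what the regular expression in the loop above captures, here is a small offline sketch. The HTML fragment is invented for illustration and only imitates the structure the pattern expects; the real page markup may differ.

```python
import re

# Invented sample markup that mimics two image blocks
page_text = '''
<div class="thumb">
<a href="/article/1"><img src="//pic.example.com/a.jpg" alt="first"></a>
</div>
<div class="thumb">
<a href="/article/2"><img src="//pic.example.com/b.jpg" alt="second"></a>
</div>
'''

# .*? is non-greedy; re.S lets . match newlines so the pattern can span lines
ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
img_src_list = re.findall(ex, page_text, re.S)

# The captured src values are protocol-relative, so prepend the scheme
img_urls = ['https:' + src for src in img_src_list]
img_names = [u.split('/')[-1] for u in img_urls]

print(img_urls)
print(img_names)
```

Without re.S the pattern would find nothing here, because each div block spans several lines and . does not match a newline by default.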
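- bs4 parsing
  - How bs4 parsing works:
    - Instantiate a BeautifulSoup object and load the page source to be parsed into it
    - Call the object's methods and attributes to locate tags and extract data
  - Installing the environment:
    - pip install bs4
    - pip install lxml
  - Instantiating BeautifulSoup:
    - BeautifulSoup(fp,'lxml'): loads the data of a locally stored HTML document into the BeautifulSoup object
    - BeautifulSoup(page_text,'lxml'): loads page source fetched from the internet into the BeautifulSoup object
  - Locating tags:
    - soup.tagName: locates the first tagName tag, e.g. soup.div
    - Attribute lookup: soup.find('tagName',attrName='value') returns the first match, e.g. soup.find('div',class_='c1')
    - Attribute lookup: soup.find_all('tagName',attrName='value') returns a list, e.g. soup.find_all('div',id='d1')
    - Selector lookup: soup.select('selector') returns a list, e.g. soup.select('#feng')
    - Hierarchy selectors: > means exactly one level, a space means any number of levels, e.g. soup.select('.tang > ul > li')
  - Extracting text
    - .string: the tag's own direct text (None if the tag has more than one child)
    - .text: all text inside the tag, including the text of its descendants
  - Extracting attributes
    - tagName['attrName']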
Example calls:

    from bs4 import BeautifulSoup

    fp = open('./test.html', 'r', encoding='utf-8')
    soup = BeautifulSoup(fp, 'lxml')  # local file handle

    soup.div
    soup.find('div', class_='song')
    soup.find('a', id="feng")
    soup.find_all('div', class_="song")
    soup.select('#feng')
    soup.select('.tang > ul > li')  # > means one level (direct children)
    soup.select('.tang li')         # a space means any number of levels (any descendant)

    a_tag = soup.select('#feng')[0]
    a_tag.text

    div = soup.div
    div.string  # direct text only

    div = soup.find('div', class_="song")
    div.text    # all text, including descendants

    a_tag = soup.select('#feng')[0]
    a_tag['href']
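Since test.html is not shown, here is a self-contained sketch of the same calls on an invented HTML string. The markup, class names and link are assumptions made up for this example, and it uses Python's built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Invented markup; the original test.html is not available
html = '''
<div class="song"><p>first</p><p>second</p></div>
<div class="tang">
  <ul>
    <li><a id="feng" href="http://example.com/feng">Feng</a></li>
    <li><b>nested</b></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

first_div = soup.div                      # first div in the document
song = soup.find('div', class_='song')    # first match by attribute
lis = soup.select('.tang > ul > li')      # chain of direct children
a_tag = soup.select('#feng')[0]           # select always returns a list

print(a_tag.string)    # the a tag has a single text child
print(song.string)     # None, because the div has two children
print(song.text)       # text of all descendants joined together
print(a_tag['href'])   # attribute value
```

The class_ keyword has a trailing underscore because class is a reserved word in Python.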
OK, let's download a novel: the full text of Romance of the Three Kingdoms (chapter titles + chapter contents).

    # Assumes requests, BeautifulSoup and headers from the earlier examples
    main_url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
    page_text = requests.get(main_url, headers=headers).text
    # Parse out the chapter titles and the URLs of the chapter detail pages
    soup = BeautifulSoup(page_text, 'lxml')
    a_list = soup.select('.book-mulu > ul > li > a')  # a list of a tags
    with open('sanguo.txt', 'w', encoding='utf-8') as fp:
        for a in a_list:
            title = a.string
            detail_url = 'http://www.shicimingju.com' + a['href']
            detail_page_text = requests.get(detail_url, headers=headers).text
            # Parse the chapter content out of the detail page
            detail_soup = BeautifulSoup(detail_page_text, 'lxml')
            content = detail_soup.find('div', class_='chapter_content').text
            fp.write(title + ':' + content + '\n')
            print(title, 'downloaded!')
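- xpath expressions: for node-selecting expressions like the ones here, the xpath method returns a list
  - A leading /: the expression must locate tags level by level, starting from the root tag
  - A leading //: the expression can locate tags from any position
  - A non-leading /: one level
  - A non-leading //: spans multiple levels
  - Attribute lookup: //tagName[@attrName="value"], e.g. //div[@class='c1']
  - Index lookup: //tagName[index], indexes start at 1, e.g. //li[1]
- Extracting text:
  - /text(): direct text
  - //text(): all text, including descendants
- Extracting attributes:
  - /@attrName, e.g. /@href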
  
Example calls:

    from lxml import etree

    # etree.parse uses a strict XML parser by default;
    # pass an HTML parser so ordinary HTML files load without errors
    tree = etree.parse('./test.html', etree.HTMLParser())
    tree.xpath('/html/head/title')
    tree.xpath('//title')
    tree.xpath('/html/body//p')
    tree.xpath('//p')
    tree.xpath('//div[@class="song"]')
    tree.xpath('//li[7]')
    tree.xpath('//a[@id="feng"]/text()')[0]
    tree.xpath('//div[@class="song"]//text()')
    tree.xpath('//a[@id="feng"]/@href')
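A self-contained sketch of the same kinds of expressions, run on an invented HTML string instead of test.html. The markup is an assumption for illustration; etree.HTML is the lxml function for parsing an HTML string.

```python
from lxml import etree

# Invented markup; the original test.html is not available
html = '''
<html><head><title>demo</title></head>
<body>
<div class="song"><p>one</p><p>two <b>bold</b></p></div>
<ul><li>a</li><li>b</li><li><a id="feng" href="http://example.com/feng">Feng</a></li></ul>
</body></html>
'''
tree = etree.HTML(html)

title_abs = tree.xpath('/html/head/title/text()')            # from the root, level by level
title_any = tree.xpath('//title/text()')                     # from any position
second_li = tree.xpath('//li[2]/text()')                     # indexes start at 1
direct = tree.xpath('//div[@class="song"]/p[2]/text()')      # direct text only
all_text = tree.xpath('//div[@class="song"]/p[2]//text()')   # text of all descendants
href = tree.xpath('//a[@id="feng"]/@href')                   # attribute value

print(title_abs, title_any, second_li)
print(direct, all_text, href)
```

Every call returns a list, which is why the earlier examples index with [0] when a single value is wanted.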
  
Scraping the joke text and author names from qiushibaike:

    from lxml import etree

    # Assumes requests and headers from the earlier examples
    url = 'https://www.qiushibaike.com/text/'
    page_text = requests.get(url, headers=headers).text
    # Parse the content
    tree = etree.HTML(page_text)
    div_list = tree.xpath('//div[@id="content-left"]/div')
    for div in div_list:
        # A leading ./ makes the expression relative to this div (local parsing)
        author = div.xpath('./div[1]/a[2]/h2/text()')[0]
        content = div.xpath('./a[1]/div/span//text()')
        content = ''.join(content)
        print(author, content)
  
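When scraped text comes out garbled, the following may help:

    img_name = img_name.encode('iso-8859-1').decode('gbk')

Encode with iso-8859-1 first, then decode with gbk. This works when the page is really GBK-encoded but its text was decoded as ISO-8859-1: encoding with iso-8859-1 recovers the original bytes, which can then be decoded correctly.

Handling garbled Chinese text on http://pic.netbian.com/4kmeinv/:

    import os

    import requests
    from lxml import etree

    # headers is the User-Agent dictionary defined earlier
    dirName = './meinvLibs'
    if not os.path.exists(dirName):
        os.mkdir(dirName)
    url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
    for page in range(1, 11):
        # The first page has no index suffix
        if page == 1:
            new_url = 'http://pic.netbian.com/4kmeinv/'
        else:
            new_url = url % page
        page_text = requests.get(new_url, headers=headers).text
        tree = etree.HTML(page_text)
        a_list = tree.xpath('//div[@class="slist"]/ul/li/a')
        for a in a_list:
            img_src = 'http://pic.netbian.com' + a.xpath('./img/@src')[0]
            img_name = a.xpath('./b/text()')[0]
            # Repair the garbled title
            img_name = img_name.encode('iso-8859-1').decode('gbk')
            img_data = requests.get(img_src, headers=headers).content
            imgPath = dirName + '/' + img_name + '.jpg'
            with open(imgPath, 'wb') as fp:
                fp.write(img_data)
            print(img_name, 'downloaded!')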
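Why the encode/decode round trip works can be shown without any network access. This sketch simulates the problem under the assumption that the page bytes are GBK but were decoded as ISO-8859-1; the sample string is made up.

```python
# A made-up image title as the site would have written it
original = '美丽风景'

# The server sends GBK bytes
raw_bytes = original.encode('gbk')

# Decoding those bytes with the wrong codec yields garbled text
garbled = raw_bytes.decode('iso-8859-1')

# ISO-8859-1 maps every byte to exactly one character, so encoding
# the garbled text with it restores the original bytes unchanged
recovered = garbled.encode('iso-8859-1').decode('gbk')

print(repr(garbled))
print(recovered)
```

With requests, another option is to set response.encoding = 'gbk' before reading response.text, which decodes the whole page with the right codec in one step.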
  
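Sometimes pages are unpredictable: the tag at the same position may be written in two or more different ways, which makes it hard to scrape with a single fixed pattern. The following approach handles that case by fetching all city names from https://www.aqistudy.cn/historydata/:

    # Assumes requests, etree and headers from the earlier examples
    page_text = requests.get('https://www.aqistudy.cn/historydata/', headers=headers).text
    tree = etree.HTML(page_text)
    # hot_cities = tree.xpath('//div[@class="bottom"]/ul/li/a/text()')
    # all_cities = tree.xpath('//div[@class="bottom"]/ul/div[2]/li/a/text()')
    # Joining both expressions with | makes the xpath more general
    cities = tree.xpath('//div[@class="bottom"]/ul/div[2]/li/a/text() | //div[@class="bottom"]/ul/li/a/text()')
    cities  # all city names

The | joins the two expressions as a union: the result contains every node matched by either expression. If only one of the two layouts is present on a page, the other expression simply contributes nothing, so the same query works for both layouts. If both match, both sets of results are returned.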
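A small sketch of what | returns, using invented markup that mixes both layouts (li directly under ul, and li wrapped in an extra div). The markup and city names are assumptions for illustration only, and the strict XML parser etree.fromstring is used here so the structure is kept exactly as written.

```python
from lxml import etree

# Invented, well-formed markup: two cities sit directly under ul,
# two more are wrapped in an extra div
markup = '''
<div class="bottom">
  <ul>
    <li><a>Beijing</a></li>
    <li><a>Shanghai</a></li>
    <div><li><a>Anshan</a></li><li><a>Beijing</a></li></div>
  </ul>
</div>
'''
tree = etree.fromstring(markup)

hot = tree.xpath('//div[@class="bottom"]/ul/li/a/text()')
wrapped = tree.xpath('//div[@class="bottom"]/ul/div/li/a/text()')

# The union contains every node matched by either expression
both = tree.xpath(
    '//div[@class="bottom"]/ul/div/li/a/text() | //div[@class="bottom"]/ul/li/a/text()'
)

print(hot)
print(wrapped)
print(both)
```

Both Beijing entries appear in the union because they are two different nodes; | removes duplicate nodes, not duplicate text values.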

Original post: https://www.cnblogs.com/zhuangdd/p/13694000.html
