
Python Web Scraping: Three Ways to Parse Data

1. Regex parsing

Regex exercises

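import re

# Sample multi-line text (the poem 静夜思, "Quiet Night Thoughts").
string1 = """<div>静夜思
窗前明月光
疑是地上霜
举头望明月
低头思故乡
</div>"""
print(re.findall('<div>(.*)</div>', string1, re.S))
# Without re.S, "." does not match "\n", so matching happens within one line at a time
# and a match can never span lines.
# With re.S, "." also matches "\n", so the string is treated as a whole and the newline
# counts as an ordinary character during matching.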

# Match the lines that start with "i"
string = '''fall in love with you
i love you very much
i love she
i love her'''
print(re.findall('^i.*', string, re.M))
# re.M treats the string as multiple lines, so ^ matches at the start of each line and $ at the end of each line

# Extract "python"
key = "javapythonc++php"
print(re.findall('python', key))

# Extract "hello world"
key = "<html><h1>hello world</h1></html>"
print(re.findall('<h1>(.*)</h1>', key)[0])

# Extract 170
string = '我喜欢身高为170的女孩'
print(re.findall(r'\d+', string))

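# Extract "hit.". Greedy mode matches as much data as possible, so the non-greedy .*? is used to stop at the first "."
key = 'bobo@hit.edu.com'  # we want to match "hit."
print(re.findall(r'h.*?\.', key))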
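
The exercises above rely on three behaviours: re.S, re.M, and greedy versus non-greedy quantifiers. The following is a minimal sketch that puts each one side by side with its default counterpart. The sample strings are made up for illustration and are not from the original exercises.

```python
import re

text = "<div>line one\nline two\n</div>"

# Without re.S, "." stops at "\n", so the pattern cannot span lines.
no_dotall = re.findall(r"<div>(.*)</div>", text)
# With re.S (DOTALL), "." also matches "\n".
dotall = re.findall(r"<div>(.*)</div>", text, re.S)

lines = "fall in love\ni love you\ni love her"
# Without re.M, "^" anchors only at the start of the whole string.
no_multiline = re.findall(r"^i.*", lines)
# With re.M (MULTILINE), "^" anchors at the start of every line.
multiline = re.findall(r"^i.*", lines, re.M)

email = "bobo@hit.edu.com"
# Greedy ".*" runs to the last "." it can reach.
greedy = re.findall(r"h.*\.", email)
# Non-greedy ".*?" stops at the first "." it can reach.
lazy = re.findall(r"h.*?\.", email)

print(no_dotall, dotall)
print(no_multiline, multiline)
print(greedy, lazy)
```

Note the raw strings (r"..."): they keep backslash sequences such as \d and \. intact, which is easy to lose when a pattern is written as an ordinary string.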

Case study: downloading images from Qiushibaike

import requests
import re
import os

url = 'https://www.qiushibaike.com/pic/'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
# Create a folder to store the images
dir_name = 'qiutu'
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
response = requests.get(url=url, headers=header)
# Get the page content as a string
page_text = response.text
# print(page_text)
# Parse the data with a regex (the value stored in the src attribute of each img tag)
src_list = re.findall(r'<div class="thumb">.*?<img src="(.*?)".*?>.*?</div>', page_text, re.S)
# Build the full image URLs
for src in src_list:
    # The src value has no scheme, so prepend it to get the complete image URL
    src = 'https:' + src
    # Download the image (send a request)
    image_data = requests.get(url=src, headers=header).content

    # Use the last path segment of the URL as the file name
    fileName = src.split('/')[-1]
    filePath = dir_name + '/' + fileName

    with open(filePath, 'wb') as fp:
        fp.write(image_data)
        print('One image downloaded successfully')
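
To see what the regex step captures without sending any requests, here is a small offline sketch. The HTML fragment, the pic.example.com host, and the helper names extract_image_urls and local_path are all hypothetical; the markup of the real page may differ.

```python
import os
import re

# Hypothetical page fragment; the real page markup may differ.
sample_html = '''
<div class="thumb">
<a href="/article/1" target="_blank">
<img src="//pic.example.com/system/pictures/1/medium/app1.jpg" alt="demo">
</a>
</div>
<div class="thumb">
<a href="/article/2" target="_blank">
<img src="//pic.example.com/system/pictures/2/medium/app2.jpg" alt="demo">
</a>
</div>
'''

# Same pattern as in the case study; re.S lets ".*?" cross line breaks.
IMG_PATTERN = re.compile(
    r'<div class="thumb">.*?<img src="(.*?)".*?>.*?</div>', re.S
)


def extract_image_urls(page_text, scheme="https:"):
    """Return the captured src values with the scheme prepended."""
    return [scheme + src for src in IMG_PATTERN.findall(page_text)]


def local_path(url, dir_name="qiutu"):
    """Build the save path from the last segment of the URL."""
    return os.path.join(dir_name, url.split("/")[-1])


urls = extract_image_urls(sample_html)
paths = [local_path(u) for u in urls]
print(urls)
print(paths)
```

Checking the pattern against a saved fragment like this makes it easier to notice when a site changes its markup and the capture group silently returns an empty list.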

2. xpath

2.1 Format

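from lxml import etree
    Two ways to use it: turn the HTML document into an object, then call the object's methods to find the desired nodes
    (1) Local file
        tree = etree.parse(filename)
    (2) Network content
        tree = etree.HTML(page_text_string)
    ret = tree.xpath(path_expression)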
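
As a rough illustration of the second form, the sketch below parses an in-memory string with etree.HTML and runs a few path expressions. The HTML snippet, class names, and example.com URLs are invented for the demonstration.

```python
from lxml import etree

# Made-up document; in a scraper this would be response.text.
html = """
<html><body>
  <div class="song">
    <p>Li Bai</p>
    <a href="http://example.com/1" title="first">link one</a>
  </div>
  <div class="tang">
    <ul>
      <li><a href="http://example.com/2">poem A</a></li>
      <li><a href="http://example.com/3">poem B</a></li>
    </ul>
  </div>
</body></html>
"""

tree = etree.HTML(html)

# "//" searches at any depth; [@class="..."] filters by attribute value.
hrefs = tree.xpath('//div[@class="tang"]//a/@href')
# text() returns the text nodes directly under the selected elements.
texts = tree.xpath('//div[@class="tang"]//li/a/text()')
# xpath() always returns a list, so index it to get a single value.
first_p = tree.xpath('//div[@class="song"]/p/text()')[0]

print(hrefs)
print(texts)
print(first_p)
```

For a local HTML file that is not well-formed XML, passing a parser explicitly, as in etree.parse(filename, etree.HTMLParser()), is usually more forgiving than the default XML parser.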

3. bs4 parsing

Environment setup

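- windows (configures pip to use the Aliyun PyPI mirror)
    (1) Open File Explorer
    (2) Type %appdata% into the address bar
    (3) Create a new folder named pip there
    (4) Inside the pip folder, create a file named pip.ini with the following contents
        [global]
        timeout = 6000
        index-url = https://mirrors.aliyun.com/pypi/simple/
        trusted-host = mirrors.aliyun.com
- linux
    (1) cd ~
    (2) mkdir ~/.pip
    (3) vi ~/.pip/pip.conf
    (4) Edit the file; the contents are exactly the same as on Windows
- Install: pip install bs4
    To use the 'lxml' parser, bs4 needs a third-party library, so install it as well:
    pip install lxml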

Usage

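- from bs4 import BeautifulSoup
- How it works: convert an HTML document into a BeautifulSoup object, then use the object's methods or attributes to find the content you want
  (1) From a local file:
      - soup = BeautifulSoup(open('local_file'), 'lxml')
  (2) From a network response:
      - soup = BeautifulSoup('str or bytes content', 'lxml')
  (3) Printing the soup object displays the contents of the HTML document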
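
The workflow above can be sketched as follows, assuming bs4 and lxml are both installed. The HTML string, its class names, and the example.com URLs are made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up document; in a scraper this would be response.text or response.content.
html = """
<html><head><title>Demo</title></head><body>
  <div class="song">
    <a href="http://example.com/1" title="first">link one</a>
  </div>
  <div class="tang">
    <ul>
      <li><a href="http://example.com/2">poem A</a></li>
      <li><a href="http://example.com/3">poem B</a></li>
    </ul>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "lxml")

# Attribute access returns the first tag with that name.
title = soup.title.string
first_href = soup.a["href"]

# find_all returns every matching tag.
all_hrefs = [a["href"] for a in soup.find_all("a")]

# select takes a CSS selector; get_text returns the text inside a tag.
tang_texts = [a.get_text() for a in soup.select("div.tang li > a")]

print(title, first_href)
print(all_hrefs)
print(tang_texts)
```

When reading a local file, wrapping the call in a with block, for example with open(path, encoding='utf-8') as fp: soup = BeautifulSoup(fp, 'lxml'), ensures the file handle is closed afterwards.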


Original article: https://www.cnblogs.com/quqinchao/p/9794597.html