
    Crawler Basics: the urllib Library in Detail (Part 2)

    1 The urllib module

    The urllib package is Python's standard toolkit for working with URLs:
    urllib.request opens and reads URLs
    urllib.error contains the exceptions raised by urllib.request
    urllib.parse parses URLs
    urllib.robotparser parses robots.txt files (for web spiders)
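    As a quick orientation, here is a minimal sketch that touches each of the four submodules (the target site is only an example):

    import urllib.request, urllib.error, urllib.parse, urllib.robotparser

    # urllib.parse: split a URL into its components
    parts = urllib.parse.urlparse('http://www.example.com/path?q=python')
    print(parts.netloc, parts.query)   # www.example.com q=python

    # urllib.robotparser: check whether a path may be crawled
    rp = urllib.robotparser.RobotFileParser('http://www.example.com/robots.txt')
    rp.read()
    print(rp.can_fetch('*', 'http://www.example.com/path'))

    # urllib.request + urllib.error: open a URL and catch request errors
    try:
        response = urllib.request.urlopen('http://www.example.com/')
        print(response.getcode())
    except urllib.error.URLError as e:
        print(e)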
    

    2 Three ways to read a web page with urllib

    Opening the URL directly with urlopen

    Methods provided by the object urlopen returns:
    read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data
    info():   returns an HTTPMessage object holding the headers sent back by the remote server
    getcode(): returns the HTTP status code, e.g. 200 for a completed request, 404 for page not found
    geturl(): returns the URL that was actually requested
    
    url = "http://www.baidu.com"
    response = urllib.request.urlopen(url)
    print(response)
    
    import urllib.request
    response = urllib.request.urlopen('http://python.org/')
    html = response.read()
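
    The helper methods listed above can be called on the same response object; a short illustration:

    import urllib.request

    response = urllib.request.urlopen('http://python.org/')
    print(response.getcode())   # 200 when the request succeeds
    print(response.geturl())    # the URL actually fetched (after any redirects)
    print(response.info())      # the response headers as an HTTPMessage object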
    

    Using a User-Agent (UA)

    These header fields are used to dress up the request:
    -User-Agent: carries information such as the browser name and version, the operating system name and version, and the default language
    -Referer: can be used to prevent hotlinking; some sites check the Referer to verify that an image was requested from a page on http://***.com
    -Connection: describes the connection state and tracks the state of the session
    
    header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"}
    request = urllib.request.Request(url, headers=header)  # build a Request object so the visit masquerades as a browser
    response = urllib.request.urlopen(request)
    
    # Common request headers
    - Accept:text/html,image/*   (tells the server the browser can accept HTML text and images)
    - Accept-Charset:ISO-8859-1   [accepted character encoding: iso-8859-1]
    - Accept-Encoding:gzip,compress   [the browser can accept data compressed with gzip or compress]
    - Accept-Language:zh-cn     [languages the browser supports]
    - Host:localhost:8080       [the host the browser wants to reach]
    - Referer:http://localhost:8080/test/abc.html  [tells the server where the request came from; often used to block downloads and hotlinking]
    - User-Agent:Mozilla/4.0(Com...)               [tells the server which browser engine is in use]
    - Cookie:     [session data]
    - Connection:close/Keep-Alive   [whether to keep the connection open after the data is sent]
    - Date:        [the time the browser sent the request]
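
    Any of these headers can be attached in the same way as the User-Agent above; a small sketch (the header values here are only illustrative):

    import urllib.request

    url = "http://www.baidu.com"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",
        "Accept": "text/html,image/*",
        "Accept-Language": "zh-cn",
        "Referer": "http://www.baidu.com/",
        "Connection": "keep-alive",
    }
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    print(response.getcode())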
    

    Customizing header information

    Specific headers can be added to an HTTP Request to build a complete HTTP request message.

    Call Request.add_header() to add or modify a particular header, and Request.get_header() to inspect a header that has already been set.

    request.add_header("Connection", "keep-alive") # keep the connection alive
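    # Request.get_header() (mentioned above) reads a header back; a quick check:
    print(request.get_header("Connection"))   # keep-alive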
    
    print(request.get_full_url())   # the URL being requested
    print(request.host)             # the server's host name (get_host() was removed in Python 3.4; use the attribute)
    print(request.get_method())     # GET or POST
    print(request.type)             # http / https / ftp (get_type() was removed in Python 3.4; use the attribute)
    
    response = urllib.request.urlopen(request)
    print(response.code)     # status code: 200, 404, 500 ...
    print(response.info())   # detailed response headers
    
    data = response.read().decode("gb2312")
    print(response.code)    # 响应状态码
    return data
    

    As we all know, HTTP transmits parameters as "key=value" pairs, and multiple parameters are joined with the "&" character, e.g. "?name1=value1&name2=value2". On the server side the string is split on "&" to recover each parameter and then on "=" to separate the name from the value.
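
    urllib.parse does this encoding and decoding for us; a small illustration:

    from urllib import parse

    params = {'name1': 'value1', 'name2': 'value2'}
    query = parse.urlencode(params)   # 'name1=value1&name2=value2'
    print(query)
    print(parse.parse_qs(query))      # {'name1': ['value1'], 'name2': ['value2']}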

    3 GET and POST requests with urllib

    1 GET: append the query string to the URL as url + "?" + data and use the result as the new url

    Four steps:

    1 URL-encode the dict into key=value form and build the url and headers  2 build the request (url, headers)  3 open the request to get the response  4 read the response

    # Simulating a Baidu search
    
    import urllib
    from urllib import request
    import urllib.parse
    def baiduApi(kw):
        header = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
        }
        # https://www.baidu.com/s?wd=%E5%8D%83%E9%94%8B  <- Baidu URL-encodes the query
        url = 'http://www.baidu.com/s?ie=utf-8&' + kw   # append the already-encoded query string to the url
    
        req = urllib.request.Request(url, headers=header)
        response = urllib.request.urlopen(req)
        print(response.getcode())
        return response.read().decode('utf-8')
    
    if __name__ == '__main__':
        kw = input('Enter the search term: ')
        # Baidu URL-encodes the query string
        kw = {'wd': kw}
        wd = urllib.parse.urlencode(kw)  # so encode the dict, giving e.g. wd=%E5%8D%83%E9%94%8B
        response = baiduApi(wd)
        print(response)
        
    # Simulating a Zhaopin (zhaopin.com) job search
    
    import urllib
    from urllib import request
    import urllib.parse
    import re
    
    def getJobInfo(kw):
        header = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
        }
        url = "https://sou.zhaopin.com/jobs/searchresult.ashx?" + kw   #get拼接
    
        req = urllib.request.Request(url,headers=header)
        response = urllib.request.urlopen(req)   # response object
        html = response.read().decode('utf-8')
    
        jobNumre = r'<em>(\d+)</em>'  # regex matching the number of positions
        jobNum = re.findall(jobNumre, html)  # list of matches for the position count
        return jobNum[0]
    
    if __name__ == '__main__':
        jobList = ['java','python','go','php']
        jobNumDict = {}
        for job in jobList:
            kw = {'jl':'杭州','kw':job}
            kw = urllib.parse.urlencode(kw)  # encoded, e.g. jl=%E6%9D%AD%E5%B7%9E&kw=java
            number = getJobInfo(kw)
            jobNumDict[job] = number
        print(jobNumDict)
    

    2 POST: put the form fields (e.g. username and password) in a dict, urlencode the dict and encode it to bytes as data, then pass the data and the url to Request

    Five steps:

    1 build headers and url  2 urlencode the dict and encode it to bytes (data)  3 build the request (url, headers, data)  4 open the request to get the response  5 read the response

    # Scraping the hot comments of a NetEase Cloud Music song
    
    import urllib.request
    import urllib.parse
    import json
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
    }
    
    url = "https://music.163.com/weapi/v1/resource/comments/R_SO_4_547976490?csrf_token="  #url请求
    
    # form-data payload captured from the browser's network panel
    data = {
    "params": "3u5ErBfSCxBGdgjpJpTQyZVZgmPAv+aisCYZJ9pxk26DoOaS5on9xBjsE65yaS57u9XyxvCJIa78DXJathMsyiClN4LXqhonGNQrAtI2ajxsdW8FosN4kv8psGrRyCBsWrxSJQyfy5pfoeZwxLjB7jHtQkt9hglgZaAfj7ieRWq/XvX3DZtSgLcLrvH/SZOM",
    "encSecKey": "872312d7d8b04d2d5dab69d29c9bde5438337f0b3982887e3557468fe7b397de59e85ab349c07f32ef5902c40d57d023a454c3e1ed66205051264a723f20e61105752f16948e0369da48008acfd3617699f36192a75c3b26b0f9450b5663a69a7d003ffc4996e3551b74e22168b0c4edce08f9757dfbd83179148aed2a344826"
    }
    
    data = urllib.parse.urlencode(data).encode('utf-8')  # urlencode the dict and convert it to bytes
    req = urllib.request.Request(url,headers=headers,data=data)
    response = urllib.request.urlopen(req)
    hotComment = json.loads(response.read().decode('utf-8'))
    hotCommentList = hotComment['hotComments']
    
    for comment in hotCommentList:
        userId = comment['user']['userId']
        nickname = comment['user']['nickname']
        content = comment['content']
        print((userId,nickname,content))   
    

    3 POST request with the data saved to a database

    # Scraping Alibaba job postings
    
    import urllib
    from urllib import request
    import json
    import urllib.parse
    import pymysql
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
    }
    
    url = 'https://job.alibaba.com/zhaopin/socialPositionList/doList.json'  # URL the position list is requested from
    for i in range(0,20):
        data = {   # form-data fields observed in the browser
            'pageSize': 10,
            't': 0.9980968383643134,
            'pageIndex': i
        }
        data = urllib.parse.urlencode(data).encode('utf-8')  # URL-encode, then convert to bytes
        req = urllib.request.Request(url, headers=headers, data=data)  # Request expects data as bytes
        response = urllib.request.urlopen(req)
        jsonData = json.loads(response.read().decode('utf-8'))
        jobList = jsonData['returnValue']['datas']
    
        for job in jobList:
            degree = job['degree']  # education requirement
            departmentName = job["departmentName"]  # department
            description = job['description']  # job description
            firstCategory = job['firstCategory']  # category
            workExperience = job['workExperience']  # experience requirement
            with open('ali.txt','a+',encoding='utf-8',errors='ignore') as f:  # append the record to a text file
                f.write(str((degree,departmentName,description,firstCategory,workExperience)) + '\n')
                f.flush()
    # create the database connection (host and password are placeholders)
    conn = pymysql.connect(host='your-db-host', user='root', password="your-password", database='spidder', port=3306, charset='utf8')
    # create a cursor
    cursor = conn.cursor()
    with open('ali.txt','r',encoding='utf-8',errors='ignore') as f:
        while True:
            jobTextInfo = f.readline()
            if not jobTextInfo:
                break
            jobTextInfo = eval(jobTextInfo)
            # build the insert statement
            sql = 'insert into ali(degree,departmentName,description,firstCategory,workExperience) VALUES (%r,%r,%r,%r,%r)' %(jobTextInfo[0], jobTextInfo[1], jobTextInfo[2], jobTextInfo[3], jobTextInfo[4])
            cursor.execute(sql)  # execute the statement
            conn.commit()   # commit
    # close the connection
    cursor.close()
    conn.close()
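
    Building the SQL with %r string formatting works for a demo, but pymysql can also pass the values as query parameters, which handles quoting and escaping for us; a sketch of that variant (same table and tuple as above):

    sql = ('insert into ali(degree,departmentName,description,firstCategory,workExperience) '
           'VALUES (%s,%s,%s,%s,%s)')
    cursor.execute(sql, jobTextInfo)  # jobTextInfo is the tuple read back from ali.txt
    conn.commit()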
    

    4 Parsing JSON responses

    When the response body is JSON, Chinese characters can come back as unreadable \u escape sequences after decoding; load the string into a JSON object and dump it again with ensure_ascii=False to display them properly

    import urllib.request
    import urllib.parse
    import json
    
    headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }
    
    post_url = "http://fanyi.baidu.com/sug"
    
    data = {
        "kw":"baby"
    }
    data = urllib.parse.urlencode(data).encode('utf-8')
    req = urllib.request.Request(url=post_url,headers=headers,data=data)
    response = urllib.request.urlopen(req)
    content = response.read().decode('utf-8')  # a str, but the nested Chinese is still escaped
    
    # parse the string into a JSON object
    obj = json.loads(content)
    # dump the object again without escaping non-ASCII characters
    string = json.dumps(obj,ensure_ascii=False)
    print(string)
    
    
    # output
    {"errno": 0, "data": [{"k": "baby", "v": "n. 婴儿; 婴孩; 幼崽; 宝贝儿; vt. 把…当作婴孩看待,娇养; 纵容; adj. 孩子的;"}, {"k": "babysitter", "v": "n. 临时照顾幼儿者;"}, {"k": "babysit", "v": "vi. 临时受雇代外出的父母照料小孩;"}, {"k": "babysitting", "v": "n. 托婴服务; v. 临时受雇代外出的父母照料小孩( babysit的现在分词 );"}, {"k": "baby sitter", "v": "n. <美>代人临时照顾婴孩者;"}]}
    

    5 Handling HTTPS requests: SSL certificate verification

    Sites served over https are everywhere now. Like a web browser, urllib can verify the SSL certificate for an HTTPS request; if the site's certificate was issued by a trusted CA, the request goes through normally, e.g. https://www.baidu.com/.

    If the SSL certificate fails verification, or the operating system does not trust the server's certificate, the request is refused. A well-known example is the 12306 site, https://www.12306.cn/mormhweb/, where browsers used to warn that the certificate was untrusted (the 12306 certificate was reportedly self-signed rather than CA-issued).

    import urllib
    from urllib import request 
    # 1. import Python's SSL module
    import ssl
    
    # 2. create a context that skips verification of the SSL certificate
    context = ssl._create_unverified_context()
    
    url = "https://www.12306.cn/mormhweb/"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
    req = urllib.request.Request(url, headers=headers)  # named req so it does not shadow the imported request module

    # 3. pass the context argument to urlopen()
    response = urllib.request.urlopen(req, context=context)
    print(response.read().decode())
    

    6 Encapsulation

    The common operations above can be wrapped in a few simple helper functions

    import urllib.request
    import urllib.parse
    
    def create_request(category,page):
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        }
        base_url = "http://tieba.baidu.com/f?ie=utf-8&"
        data = {
            'kw':category,
            'pn':(page-1)*50
        }
        data = urllib.parse.urlencode(data)
        url = base_url + data
        req = urllib.request.Request(url,headers=headers)
        return req
    
    def get_content(request):
        response = urllib.request.urlopen(request)
        content = response.read().decode('utf-8')
        return content
    
    def save_path(content,path,page):
        filename = path + 'tieba' + str(page) + '.html'
        with open(filename,'w',encoding='utf-8') as fp:
            fp.write(content)
    
    def main():
    
        category = input('Enter the category to search for: ')
        start_page = int(input('Enter the start page: '))
        end_page = int(input('Enter the end page: '))
        print('Downloading...')
        for page in range(start_page,end_page+1):
    
            request = create_request(category,page)
            content = get_content(request)
            save_path(content, './', page)   # save_path prefixes 'tieba' itself, so pass only the directory
        print('Download finished')
    
    if __name__ == '__main__':
        main()
    