zoukankan      html  css  js  c++  java
  • Python爬虫学习笔记(四)

    Request:

    Test1(基本属性:POST):

    代码1:

    import requests
    
    # 发送POST请求
    data = {
    
    }
    response = requests.post(url, data=data)
    POST请求

    Test2(auth认证):

    代码2:

    import requests
    
    # 发送POST请求
    data = {
    
    }
    response = requests.post(url, data=data)
    
    #内网 => 需要认证
    auth = (user, pwd)
    response = requests.get(url, auth=auth)
    auth认证

    Test3(添加代理IP):

    代码3:

    import requests
    
    # 1.请求url
    url = 'http://www.baidu.com'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',
    
    }
    free_proxy = {
        'http': '58.249.55.222:9797'
    }
    response = requests.get(url=url, headers=headers, proxies=free_proxy)
    print(response.status_code)
    添加代理IP

    Test4(SSL认证):

    代码4:

    import requests
    
    url = 'https://www.12306.cn/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',
    
    }
    response = requests.get(url=url, headers=headers)
    data = response.content.decode()
    with open('03-ssl.html', 'w', encoding='utf-8')as f:
        f.write(data)
    错误示例

    错误原因:

    https是有第三方CA证书认证的
    但12306虽然是https,但是他不是CA证书,他是自己颁布的证书
    解决方法:
    告诉web的服务器忽略证书访问
    错误原因及解决方法

    正确操作:

    import requests
    
    url = 'https://www.12306.cn/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',
    
    }
    response = requests.get(url=url, headers=headers, verify=False)
    data = response.content.decode()
    with open('03-ssl.html', 'w', encoding='utf-8')as f:
        f.write(data)
    忽略证书进行SSL认证

    返回:

    Test5(Cookies):

    以抓取https://www.yaozh.com/为例

    代码5:

    import requests
    
    # 请求数据url
    member_url = 'https://www.yaozh.com/mamber'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',
    
    }
    # cookies字符串
    cookies = 'acw_tc=2f624a2116146683053267637e08ee667240e1e39bfc4dbbbd0b8a8f33eaa4; PHPSESSID=em18pu35k89loei7glff6tthu5; _ga=GA1.2.1727298837.1614668307; _gid=GA1.2.781607072.1614668307; _gat=1; Hm_lvt_65968db3ac154c3089d7f9a4cbb98c94=1614668307; Hm_lpvt_65968db3ac154c3089d7f9a4cbb98c94=1614668313; yaozh_logintime=1614668314; yaozh_user=1038868%09s1mpL3; yaozh_userId=1038868; yaozh_jobstatus=kptta67UcJieW6zKnFSe2JyYnoaSZ5htnZqdg26qb21rg66flM6bh5%2BscZdyVNaWz9Gwl4Ny2G%2BenofNlKqpl6XKppZVnKmflWlxg2lolJeaA7663ea03b06b64c50D898E7C9Aa493SkZSblWmHcNiemZtVq56lloN0pG2SaZ%2BGam2SbGiZnZWSmZyYaodw4g%3D%3Dc0addf8502ad2cb2f15b3ee2b0e1386b; db_w_auth=854448%09s1mpL3; UtzD_f52b_saltkey=oZ3HIEZl; UtzD_f52b_lastvisit=1614664715; UtzD_f52b_lastact=1614668315%09uc.php%09; UtzD_f52b_auth=25786v5G66Jeq3x5p14pj246attR7pdjk5X409nOsIqKeaEypUvH%2BA1LTN4dNrChFQS1hmPQNfLGoyIxAD8PRqgk4bA; yaozh_uidhas=1; yaozh_mylogin=1614668317; acw_tc=2f624a2116146683053267637e08ee667240e1e39bfc4dbbbd0b8a8f33eaa4'
    # 但需要的是cookies的字典类型
    cookies_dict = {
    
    }
    cookies_list = cookies.split('; ')
    for cookie in cookies_list:
        cookies_dict[cookie.split('=')[0]] = cookie.split('=')[1]
    
    response = requests.get(url=member_url, headers=headers, cookies=cookies_dict)
    data = response.content.decode()
    with open('05-cookie.html', 'w', encoding='utf-8')as f:
        f.write(data)
    Cookies - 代码传参登陆

    返回:

    由返回页面可知,登录成功。

    Test6(cookies - session):

    以抓取https://www.yaozh.com/为例

    代码6:

    Cookies - 代码带着cookie去请求数据

    返回:

    由返回页面可知,登录成功。

    数据解析:

    ·正则表达式:

    Test1(正则表达式 - 贪婪模式):

    代码1:

    import re
    
    # 贪婪模式:从开头匹配到结尾
    # 非贪婪模式:使用?
    one = 'mdfjiefhuehfgieufn213431241234n'
    pattern = re.compile('m(.*)n')
    result = pattern.findall(one)
    print(result)
    贪婪模式

    返回1:

    ['dfjiefhuehfgieufn213431241234']
    View Code

    Test2(正则表达式 - 费贪婪模式):

    代码2:

    import re
    
    # 贪婪模式:从开头匹配到结尾
    # 非贪婪模式:使用?
    one = 'mdfjiefhuehfgieufn213431241234n'
    pattern = re.compile('m(.*?)n')
    result = pattern.findall(one)
    print(result)
    View Code

    返回2:

    ['dfjiefhuehfgieuf']
    View Code

    Test3(正则表达式 - 匹配换行符):

    代码3:

    import re
    
    # .除了换行符之外的匹配
    one = """
        maiwgdyuwagdwyadg
        1234567612121134n
    """
    pattern = re.compile('m(.*)n')
    result = pattern.findall(one)
    print(result)
    不匹配换行符

    返回3:

    []

    代码4:

    import re
    
    # .除了换行符之外的匹配
    one = """
        maiwgdyuwagdwyadg
        1234567612121134n
    """
    pattern = re.compile('m(.*)n', re.S)
    result = pattern.findall(one)
    print(result)
    匹配换行符

    返回4:

    代码5:

    import re
    
    # .除了换行符之外的匹配
    one = """
        maiwgdyuwagdwyadg
        1234567612121134N
    """
    pattern = re.compile('m(.*)n', re.S | re.I)
    result = pattern.findall(one)
    print(result)
    忽略大小写

    返回5:

    Test4(正则表达式 - 纯数字的正则):

    代码6:

    import re
    
    # 纯数字的正则 d 0 - 9之间的一个数
    pattern = re.compile('^d+$')
    one = '1234'
    
    # 匹配判断的方法
    result = pattern.match(one)
    
    print(result.group())
    View Code

    返回:

    1234

    Test5(正则表达式 - 范围运算):

    代码7:

    import re
    
    # 范围运算 [123]  [1-9]
    one = '798345'
    
    pattern = re.compile('[1-9]')
    result = pattern.findall(one)
    print(result)
    View Code

    返回:

    ['7', '9', '8', '3', '4', '5']

    注:

    • match:从头匹配,匹配一次
    • search:从任意位置,匹配一次
    • findall:查找符合正则的内容
    • sub:替换字符串
    • split:拆分
  • 相关阅读:
    遗传算法优化BP神经网络——非线性函数拟合
    BP神经网络的数据分类——语音特征信号分类
    HTML5新特性
    HTML5——FileReader详解
    javaScript构造函数显式提供返回值
    react截图组件react-cropper的使用方法
    Ant Design of React的安装和使用方法
    flex布局
    CSS常用的元素居中方法
    javaScript异步编程
  • 原文地址:https://www.cnblogs.com/3cH0-Nu1L/p/14469756.html
Copyright © 2011-2022 走看看