  • 5. Advanced Usage of the Python Requests Library [1]

    Overview:

    • Handling cookie-based anti-scraping measures
    • Proxies

    1. Handling cookie-based anti-scraping measures

    Case 1:

    Scrape the news feed data from the Xueqiu (雪球) website.

    url: https://xueqiu.com/

    import requests
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61'
    }
    url = 'https://xueqiu.com/statuses/hot/listV2.json?since_id=-1&max_id=66208&size=15'
    json_data = requests.get(url=url,headers=headers).json()
    print(json_data)
    
    Result:
    {'error_description': '遇到错误,请刷新页面或者重新登录帐号后再试',
     'error_uri': '/statuses/hot/listV2.json',
     'error_data': None,
     'error_code': '400016'}
    

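    The code above does not return the data we want. Why?

    • The request sent with the requests module does not imitate a browser closely enough.
      • The gap is mainly in the request headers: a browser sends a Cookie header, and this request does not.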
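
    One way to see the gap is to look at the headers requests would actually send. The sketch below builds the request from case 1 without sending it, so it needs no network access. It relies only on `requests.Request(...).prepare()`; the variable names are illustrative.

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61'
}
url = 'https://xueqiu.com/statuses/hot/listV2.json?since_id=-1&max_id=66208&size=15'

# Build the request without sending it, then inspect its headers.
prepared = requests.Request('GET', url, headers=headers).prepare()
sent_header_names = sorted(prepared.headers.keys())
print(sent_header_names)

# A browser that has already visited the site would also send a Cookie header;
# this request carries none.
has_cookie = 'Cookie' in prepared.headers
print(has_cookie)
```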
    Solution 1:

    Handle the cookie manually: copy the cookie value captured from the browser into the request headers.

    import requests
    headers = {
        'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
        'Cookie':'aliyungf_tc=AQAAANW0eGAa9gsAWFJ5eyKzKfjHGKly; acw_tc=2760824315924751287984449e7480f71707bc6a5df30bb7aa7b4a2af287e9; xq_a_token=ea139be840cf88ff8c30e6943cf26aba8ad77358; xqat=ea139be840cf88ff8c30e6943cf26aba8ad77358; xq_r_token=863970f9d67d944596be27965d13c6929b5264fe; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTU5NDAwMjgwOCwiY3RtIjoxNTkyNDc1MDcxODU5LCJjaWQiOiJkOWQwbjRBWnVwIn0.lQ6Kp8fZUrBSjbQEUpv0PmLn2hZ3-ixDvYgNPr8kRMNLt5CBxMwwAY9FrMxg9gt6UTA4OJQ1Gyx7oePO1xJJsifvAha_o92wdXP55KBKoy8YP1y2rgh48yj8q61yyY8LpRTHP5RKOZQITh0umvflW4zpv05nPr7C8fHTME6Y80KspMLzOPw2xl7WFsTGrkaLH8yw6ltKvnupK7pQb1Uw3xfzM1TzgCoxWatfjUHjMZguAkrUnPKauEJBekeeh3eVaqjmZ7NzRWtLAww8egiBqMmjv5uGMBJAuuEBFcMiFZDIbGdsrJPQMGJdHRAmgQgcVSGamW8QWkzpyd8Tkgqbwg; u=161592475128805; device_id=24700f9f1986800ab4fcc880530dd0ed; Hm_lvt_1db88642e346389874251b5a1eded6e3=1592475130,1592475300; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1592475421'
    }
    url = 'https://xueqiu.com/statuses/hot/listV2.json?since_id=-1&max_id=66208&size=15'
    json_data = requests.get(url=url,headers=headers).json()
    print(json_data)
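
    As an alternative to pasting the whole cookie string into headers, the copied string can be split into a dict and passed through the `cookies` parameter of `requests.get`. A minimal sketch: `parse_cookie_header` is a hypothetical helper name, and the sample string is made up for illustration.

```python
def parse_cookie_header(raw):
    """Split a raw 'name=value; name2=value2' Cookie string into a dict."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair or '=' not in pair:
            continue  # skip empty fragments such as a trailing ';'
        # Split on the first '=' only, because values may contain '=' themselves.
        name, value = pair.split('=', 1)
        cookies[name.strip()] = value.strip()
    return cookies


raw_cookie = 'device_id=abc123; u=42; token=a=b; '
cookie_dict = parse_cookie_header(raw_cookie)
print(cookie_dict)
```

    The resulting dict could then be passed as `requests.get(url, headers=headers, cookies=cookie_dict)`.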
    
    Solution 2:

    Handle the cookie automatically with a session

    import requests
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    }
    
    sess = requests.Session()  # create a Session object
    # The first request through the session exists only to capture cookies
    main_url = 'https://xueqiu.com/'
    sess.get(url=main_url,headers=headers)  # any cookies the server sets are stored on the session
    
    url = 'https://xueqiu.com/statuses/hot/listV2.json?since_id=-1&max_id=65993&size=15'
    # This request automatically carries the cookies captured above
    json_data = sess.get(url=url,headers=headers).json()
    print(json_data)
    
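    There are two ways to handle cookies in a scraper:
    • Manual handling
      • Copy the cookie from a packet-capture tool into headers.
    • Automatic handling
      • Use a session object. It sends get and post requests the same way the requests module does; the only difference is that any cookie set by a response is saved on the session automatically and sent with later requests.
      • A scraper has to use the session object at least twice: once to capture the cookie, and again to send the request that needs it.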
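
    The "stored once, sent later" behaviour can be observed offline. In the sketch below a cookie is placed on the session by hand as a stand-in for what a real first request would capture (the cookie name and value are made up), and `Session.prepare_request` shows the header the second request would carry.

```python
import requests

sess = requests.Session()

# Stand-in for the first request: a real sess.get() on the home page would
# fill the cookie jar from the response. The name and value here are invented.
sess.cookies.set('demo_token', 'abc123', domain='xueqiu.com', path='/')

# Second step: prepare (but do not send) a request through the same session.
req = requests.Request('GET', 'https://xueqiu.com/statuses/hot/listV2.json')
prepared = sess.prepare_request(req)

# The session has attached the stored cookie automatically.
cookie_header = prepared.headers.get('Cookie')
print(cookie_header)
```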

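    2. Proxies

    • What is a proxy?

      • A proxy server.
    • What a proxy does

      • It forwards requests and responses.
    • How proxies relate to scraping

      • If a client sends a large number of requests to a server in a short time, the server may detect the abnormal traffic and ban the client's IP, after which that client can no longer reach the server. Sending requests through a proxy makes them appear to come from the proxy's IP instead.
    • Basic proxy concepts

      • Anonymity level
        • Transparent: the target server knows you are using a proxy and knows your real IP.
        • Anonymous: the server knows you are using a proxy but does not know your real IP.
        • High anonymity: the server knows neither that you are using a proxy nor your real IP.
      • Proxy type:
        • http: the proxy only forwards requests made over HTTP.
        • https: the proxy only forwards requests made over HTTPS.
    • How do you get a proxy?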

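    Using 智连HTTP

    1. Register an account and add your machine's current IP address to the whitelist.

    2. Paid plans are available if you want them; here I use the free tier, which provides some free proxy IPs every day.

    3. Open the generated proxy-IP link in a browser; it returns one proxy IP.

    4. First, check the local IP without a proxy.

    Manual lookup:

    Lookup in code: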

    from lxml import etree
    import requests
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
        'Cookie':'BIDUPSID=E7A59AC0805CB60E79CF4C235E336D22; PSTM=1591782404; BAIDUID=C4749F7C59E75CCF5509B5A8E38221E9:FG=1; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDUSS=GVTdU1jWlZ1ZzRxOVhmT29rdmVNWnFrUTE2bW41VkxQUEp-Z0FDaGJjVjhkUkpmSVFBQUFBJCQAAAAAAAAAAAEAAADaHpEpzOG~qcrH0afPsLXEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHzo6l586Opee; BDUSS_BFESS=GVTdU1jWlZ1ZzRxOVhmT29rdmVNWnFrUTE2bW41VkxQUEp-Z0FDaGJjVjhkUkpmSVFBQUFBJCQAAAAAAAAAAAEAAADaHpEpzOG~qcrH0afPsLXEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHzo6l586Opee; BDSFRCVID=YuCOJexroG3_K8nugpyUb4TrBgKKe4oTDYLEptww7unQ1ttVJeC6EG0Pts1-dEu-EHtdogKK0gOTH6KF_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF=tbIJVI-yfIvbfP0kM4r5hnLfbeT22-ustHcC2hcH0KLKMhbzDb8-5t-UBPO30j5baKkt2lbJJMb1MRjvyn6cjT8Nb4TXKMrGt6Pfap5TtUJaSDnTDMRhqtK7jq3yKMnitIj9-pnG2hQrh459XP68bTkA5bjZKxtq3mkjbPbDfn028DKuDT0ajjcbeausaI6B2Cvt3Rrj5njEDbIk-PnVept9yPnZKxtqtDjq0qTnbt5JftT_-6OxKxDQXHrAKR5nWncKWbO1bnnNbb55QMcvLRtg5lr405OTaaIO0KJc0RoNs5CwhPJvyT8DXnO7L4nlXbrtXp7_2J0WStbKy4oTjxL1Db3JKjvMtIFtVD8MtKDhMK8Gen6sMtu_MhOJbI6QM67O3tI8Kbu3JhrGXU6qLT5Xh-jD5j-qbCvtXl5tJUbKEx52M4oN5l0njxQybt7N0GuO_xt2tqvIo4LC3xonDh8e3H7MJUntKeCDQ-bO5hvvhb6O3M7lMUKmDloOW-TB5bbPLUQF5l8-sq0x0bOte-bQbG_EJ6nK24oa3RTeb6rjDnCrWJoTXUI82h5y05JdWbbXLR3MMx5DMMb4L4JvyPQ3hnORXx74-TnfLK3aafDaf-bKy4oTjxL1Db3Jb5_L5gTtsl5dbnboepvoD-Jc3MvByPjdJJQOBKQB0KnGbUQkeq8CQft20b0EeMtjW6LEK5r2SC05JCnP; H_PS_PSSID=32100_1432_31326_21114_31254_32046_31708_30824_32110_26350_22157; delPer=0; PSINO=3; ZD_ENTRY=baidu'
    }
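    # Local IP as seen without a proxy
    url = 'https://www.baidu.com/s?ie=UTF-8&wd=ip'
    # url = 'https://www.sogou.com/web?query=ip'
    page_text = requests.get(url=url,headers=headers).text
    tree = etree.HTML(page_text)
    # Do not put tbody in the XPath expression: browsers show it in the DOM, but it is often absent from the raw HTML, so the match would fail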
    ip_data = tree.xpath('//*[@id="1"]/div[1]/div[1]/div[2]/table//tr/td/span/text()')
    # ip_data = tree.xpath('//*[@id="ipsearchresult"]/strong/text()')
    print(ip_data)
    >>>
    
    ['本机IP:\xa0123.121.82.88']
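
    The XPath result is a list of text nodes, and the address is embedded in a label with a non-breaking space (`\xa0`). A small sketch of pulling out just the IP with a regular expression; `extract_ipv4` is a hypothetical helper, and the pattern only checks the dotted-number shape, not that each part is at most 255.

```python
import re


def extract_ipv4(texts):
    """Return the first IPv4-looking string found in a list of text nodes, or None."""
    pattern = re.compile(r'(?:\d{1,3}\.){3}\d{1,3}')
    for text in texts:
        match = pattern.search(text)
        if match:
            return match.group(0)
    return None


ip_data = ['本机IP:\xa0123.121.82.88']  # the output shown above
local_ip = extract_ipv4(ip_data)
print(local_ip)
```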
    

    5. Use the proxy IP

    from lxml import etree
    import requests
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    }
    
    url = 'https://www.sogou.com/web?query=ip'
    # proxies parameter: route the request through a proxy
    page_text = requests.get(url=url,headers=headers,proxies={'https':'42.203.39.97:12154'}).text
    with open('./ip.html','w',encoding='utf-8') as fp:
        fp.write(page_text)
    

    6. Open the generated html file to check the IP it reports
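
    In step 5, requests chooses the entry in `proxies` whose key matches the scheme of the target URL, so a dict with only an `'https'` key has no effect on an `http://` URL. The sketch below builds a dict that covers both schemes. It assumes the proxy can forward both kinds of traffic and is itself reached over plain HTTP, which may not hold for a given provider; `build_proxies` is a hypothetical helper.

```python
def build_proxies(host_port, scheme='http'):
    """Map both URL schemes to one proxy, writing the proxy URL with an explicit scheme."""
    proxy_url = f'{scheme}://{host_port}'
    return {'http': proxy_url, 'https': proxy_url}


# The proxy address from step 5.
proxies = build_proxies('42.203.39.97:12154')
print(proxies)
```

    It could be used as `requests.get(url, headers=headers, proxies=proxies, timeout=5)`; free proxies are often slow or dead, so a `timeout` keeps the call from hanging.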

    3. Building a proxy pool

    Case 2:

    Scraping https://www.xicidaili.com/ with high-frequency requests got the local IP banned:

    from lxml import etree
    import requests
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    }
    
    # Send requests at high frequency so that the server bans the local IP
    url = 'https://www.xicidaili.com/nn/%d'
    ips = []
    for page in range(1,30):
        new_url = url % page
        page_text = requests.get(new_url,headers=headers).text
        tree = etree.HTML(page_text)
        # Skip the first tr, which is the table header row
        tr_list = tree.xpath('//*[@id="ip_list"]//tr')[1:]
        for tr in tr_list:
            ip = tr.xpath('./td[2]/text()')[0]
            ips.append(ip)
    print(len(ips))  # 2900
    # After running the program several times, the local IP is banned and https://www.xicidaili.com/ can no longer be reached
    
    

    Solution:

    Use a proxy pool to get around the per-IP request limit.

    Fetch 3 new proxy IPs.

    Build the proxy pool: a list holding several different proxies.

    from lxml import etree
    import requests
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    }
    
    proxy_list = []  # proxy pool
    proxy_url = 'http://ip.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=3&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=15'
    page_text = requests.get(url=proxy_url,headers=headers).text
    tree = etree.HTML(page_text)
    ips_list = tree.xpath('//body//text()')
    for ip in ips_list:
        dic = {'https':ip}
        proxy_list.append(dic)
    print(proxy_list)
    >>>
    [{'https': '122.143.86.183:28803'}, {'https': '182.202.223.253:26008'}, {'https': '119.114.239.41:50519'}]
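
    `//body//text()` returns every text node in the body, so depending on how the provider formats its page the list may also contain whitespace or an error message. A hedged sketch of filtering the nodes before they go into the pool; `build_proxy_pool` and the sample nodes are illustrative, not the provider's real output.

```python
import re

# Accept only strings shaped like "a.b.c.d:port".
PROXY_PATTERN = re.compile(r'^\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}$')


def build_proxy_pool(text_nodes):
    """Turn raw text nodes into a list of proxies dicts, dropping anything malformed."""
    pool = []
    for node in text_nodes:
        candidate = node.strip()
        if PROXY_PATTERN.match(candidate):
            pool.append({'https': candidate})
    return pool


sample_nodes = ['122.143.86.183:28803', '\r\n', ' 182.202.223.253:26008 ', 'error: quota exceeded']
proxy_list = build_proxy_pool(sample_nodes)
print(proxy_list)
```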
    

    Complete code

    import random
    from lxml import etree
    import requests
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    }
    
    proxy_list = []  # proxy pool
    proxy_url = 'http://ip.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=3&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=15'
    page_text = requests.get(url=proxy_url,headers=headers).text
    tree = etree.HTML(page_text)
    ips_list = tree.xpath('//body//text()')
    for ip in ips_list:
        dic = {'https':ip}
        proxy_list.append(dic)
    
    url = 'https://www.xicidaili.com/nn/%d'
    ips = []
    for page in range(1,5):
        new_url = url % page
        page_text = requests.get(new_url,headers=headers,proxies=random.choice(proxy_list)).text
        tree = etree.HTML(page_text)
        tr_list = tree.xpath('//*[@id="ip_list"]//tr')[1:]
        for tr in tr_list:
            ip = tr.xpath('./td[2]/text()')[0]
            ips.append(ip)
    print(len(ips))
    >>>400
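
    `random.choice` picks one proxy per page, and if that proxy happens to be dead the request fails. One possible refinement is to try several proxies from the pool before giving up. The sketch below takes the fetch function as a parameter so it runs offline; `fetch_with_pool`, `fake_fetch` and the `10.0.0.x` addresses are all made up for illustration.

```python
import random


def fetch_with_pool(fetch, url, proxy_pool, max_attempts=3):
    """Try up to max_attempts different proxies and return the first successful result."""
    candidates = list(proxy_pool)
    random.shuffle(candidates)  # spread the load across the pool
    last_error = None
    for proxies in candidates[:max_attempts]:
        try:
            return fetch(url, proxies)
        except OSError as exc:  # requests' exceptions derive from OSError
            last_error = exc
    raise RuntimeError(f'no working proxy for {url}') from last_error


# Stand-in for a real request so the sketch runs offline: one proxy "fails".
def fake_fetch(url, proxies):
    if proxies['https'] == '10.0.0.1:8080':
        raise ConnectionError('proxy unreachable')
    return f'fetched {url} via {proxies["https"]}'


pool = [{'https': '10.0.0.1:8080'}, {'https': '10.0.0.2:8080'}]
result = fetch_with_pool(fake_fetch, 'https://www.xicidaili.com/nn/1', pool, max_attempts=2)
print(result)
```

    With requests, the fetch argument could be something like `lambda url, proxies: requests.get(url, headers=headers, proxies=proxies, timeout=5).text`.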
    
    

    On Windows, a proxy can also be configured through the browser's settings.

  • Original article: https://www.cnblogs.com/remixnameless/p/13160412.html