  • Basic operations of the requests module

    1. GET requests with the requests module

    Goal: scrape the Sogou homepage and save the page data

    import requests
    
    url = "https://www.sogou.com/"
    response = requests.get(url=url)
    # get the page data as a string
    page = response.text
    with open("./sogou.html", "w", encoding="utf-8") as fp:
        fp.write(page)
    

    Some other useful attributes of the response object:

    # get the page data in binary/bytes form
    print(response.content)
    # get the response status code
    print(response.status_code)
    # get the response headers
    print(response.headers)
    # get the URL that was requested
    print(response.url)
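
    If the saved HTML comes out garbled, the encoding requests guessed from the response headers may be wrong. A minimal sketch that falls back to the encoding detected from the page body itself (both attributes are part of the requests Response API):

    # encoding is guessed from the headers; apparent_encoding is detected
    # from the body and is often more reliable for Chinese pages
    response.encoding = response.apparent_encoding
    print(response.text[:200])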

    For a GET request that carries parameters, you can either build the query string into the URL yourself or pass a dictionary via the params argument, as in the following code:

    import requests
    
    url = "https://www.sogou.com/web"
    
    # pack the parameters into a dictionary
    params = {
        'query': "周杰伦",
        'ie': "utf-8",
    }
    
    response = requests.get(url=url, params=params)
    
    print(response.status_code)
    print(response.content)
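
    For comparison, a minimal sketch of the other option mentioned above, building the query string into the URL by hand (same parameters as the dictionary version):

    import requests
    from urllib.parse import urlencode

    # build the query string manually instead of passing a params dict
    query = urlencode({"query": "周杰伦", "ie": "utf-8"})
    response = requests.get("https://www.sogou.com/web?" + query)
    print(response.url)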

    The headers parameter is passed to requests.get in the same way.
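
    For example, a minimal sketch that sends a custom User-Agent header along with the same request (the header value is just a typical browser string):

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    }
    response = requests.get(url=url, params=params, headers=headers)
    print(response.status_code)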

    2. POST requests

    Log in to Douban Movies and fetch the data returned after a successful login (Douban's login URL has since changed, so this is only an example).

    import requests
    
    # this URL is no longer valid; it is kept only as an example
    url = "https://accounts.douban.com/login"
    
    # form parameters for the POST request
    data = {
        "source": "movie",
        "redir": "https://movie.douban.com/",
        "form_email": "1111",  # your account email and password go here
        "form_password": "11111",
        "login": "登录",
    }
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    }
    # send the POST request
    response = requests.post(url=url, data=data, headers=headers)
    
    
    print(response.status_code)
    print(response.text)
    with open("./douban.html", "w", encoding="utf-8") as fp:
        fp.write(response.text)
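
    A side note on the payload format: data= sends the dictionary form-encoded, while json= sends a JSON body. A minimal sketch against the httpbin.org test service just to show the difference (this endpoint is only a test target, not part of the Douban example):

    import requests

    # form-encoded body (Content-Type: application/x-www-form-urlencoded)
    r1 = requests.post("https://httpbin.org/post", data={"key": "value"})
    # JSON body (Content-Type: application/json)
    r2 = requests.post("https://httpbin.org/post", json={"key": "value"})
    print(r1.json()["form"], r2.json()["json"])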

    3. Ajax GET requests

    Goal: fetch the details of romance films from the Douban movie ranking chart

    import requests
    
    url = "https://movie.douban.com/j/chart/top_list?"
    
    # query parameters copied from the Ajax request seen in the browser's network panel
    params = {
        "type": "5",
        "interval_id": "100:90",
        "action": "",
        "start": "120",
        "limit": "20",
    }
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    }
    
    response = requests.get(url=url, params=params, headers=headers)
    
    print(response.text)
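
    This endpoint returns JSON rather than an HTML page, so it is usually more convenient to parse it. A minimal sketch, assuming the response body is a JSON list of movie objects (the "title" and "score" field names are assumptions about the endpoint's output):

    movies = response.json()
    for movie in movies:
        # field names are assumptions; inspect the raw JSON to confirm them
        print(movie.get("title"), movie.get("score"))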

    4. Ajax POST requests

    Goal: scrape KFC restaurant location data for a given city

    import requests
    
    url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
    
    # form parameters that belong in the POST body
    data = {
        "cname": "",
        "pid": "",
        "keyword": "北京",
        "pageIndex": "1",
        "pageSize": "10",
    }
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    }
    
    response = requests.post(url=url, data=data, headers=headers)
    
    print(response.text)
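
    The store list also comes back as JSON; a minimal sketch for parsing and pretty-printing it (no field names are assumed here, since the API's format may change):

    import json

    # parse the JSON body and print it with readable indentation
    result = response.json()
    print(json.dumps(result, ensure_ascii=False, indent=2))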
    

    5. A combined example

    Goal: scrape multiple result pages from Sogou's Zhihu search for a given keyword

    import requests
    import os
    
    # create a folder for the output files
    if not os.path.exists("./pages"):
        os.mkdir("./pages")
    
    url = "https://zhihu.sogou.com/zhihu?"
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    }
    
    # the keyword to search for
    word = input("please enter your word:")
    # the range of page numbers to fetch
    start_num = int(input("enter start page number:"))
    end_num = int(input("enter end page number:"))
    
    for page in range(start_num, end_num+1):
        param = {
            "query": word,
            "page": page,
            "ie": "utf-8",
        }
        response = requests.get(url=url, params=param, headers=headers)
        filename = word + str(page) + ".html"
        # persist the page data to a file
        with open("pages/%s" % filename, "w", encoding="utf-8") as fp:
            fp.write(response.text)
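
    In practice it is worth adding a short delay between pages and skipping failed responses; a minimal variation of the loop above (the one-second delay is an arbitrary choice):

    import time

    for page in range(start_num, end_num + 1):
        param = {"query": word, "page": page, "ie": "utf-8"}
        response = requests.get(url=url, params=param, headers=headers)
        if response.status_code != 200:
            # skip pages that did not come back successfully
            continue
        with open("pages/%s%s.html" % (word, page), "w", encoding="utf-8") as fp:
            fp.write(response.text)
        time.sleep(1)  # be polite to the server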
    

    6. Cookie handling

    Flow: 1. log in and obtain the cookie; 2. when requesting the personal home page, carry that cookie along with the request.

    Note: send the requests through a session object, which stores the cookies automatically and attaches them to later requests.

    import requests
    
    # a Session object keeps cookies across requests automatically
    session = requests.Session()
    
    # send the login request
    login_url = "https://accounts.douban.com/passport/login"
    
    data = {
        "source": 'None',
        "redir": "https://movie.douban.com/people/123/",
        "form_email": "123",  # your account email and password go here
        "form_password": "123",
        "login": "登录",
    }
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    }
    
    session_response = session.post(url=login_url, data=data, headers=headers)
    
    url = 'https://movie.douban.com/people/123/'
    response = session.get(url=url, headers=headers)
    page = response.text
    with open("./doubanlogin.html", "w", encoding="utf-8") as fp:
        fp.write(page)

    Note: Douban has since changed its login API, so the parameters above no longer work; the point is to understand the flow. You can also construct the cookies yourself to simulate a logged-in state, although that is rather tedious; a sketch follows.
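
    A minimal sketch of that cookie-based approach, assuming you have copied the cookie values from a logged-in browser session (the cookie name and value below are placeholders):

    import requests

    # placeholder cookie; copy the real names/values from the browser's developer tools
    cookies = {"session_cookie_name": "session_cookie_value"}
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    }
    response = requests.get("https://movie.douban.com/people/123/",
                            headers=headers, cookies=cookies)
    print(response.status_code)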

    7. Using proxies

    import requests
    
    proxies = {
        "http": "http://10.10.1.10:3128",
        "https": "http://10.10.1.10:1080",
    }
    
    url = "https://www.taobao.com"
    
    requests.get(url=url, proxies=proxies)

    This proxy address is of course not a working one; replace it with a valid proxy of your own. requests also supports SOCKS proxies, which require the additional socks dependency.
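
    A minimal sketch of a SOCKS proxy configuration, assuming the optional dependency has been installed with pip install requests[socks] (the address and port are placeholders):

    import requests

    # placeholder SOCKS5 proxy; replace with a working one
    proxies = {
        "http": "socks5://127.0.0.1:1080",
        "https": "socks5://127.0.0.1:1080",
    }
    response = requests.get("https://www.taobao.com", proxies=proxies)
    print(response.status_code)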

    The requests module offers far more than what is covered here; look into the documentation for the rest.
