  • Python Web Scraping 2

    import requests
    
    response=requests.get("https://www.baidu.com")
    #print(response)
    #print(type(response))
    print(response.text)
    print(response.encoding)
    print(response.content.decode("utf-8"))
     
    r.text returns the page source as a string
    r.content returns the raw bytes of the response; .decode(encoding) decodes those bytes with the given encoding
    r.encoding is the encoding requests detected; if the detection is wrong, the text comes out garbled
    r.status_code returns the HTTP status code
    print(response.status_code)  # prints 200 here
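    The difference between r.text and r.content.decode() matters when requests guesses the encoding wrong. A minimal sketch, building a Response by hand (and touching the private _content attribute) purely to illustrate, since normally requests fills these fields in from the network:

```python
import requests

# Hand-built Response to show encoding detection going wrong.
# (Illustration only: _content is a private attribute.)
resp = requests.models.Response()
resp.status_code = 200
resp._content = "百度一下".encode("utf-8")   # raw bytes of the body
resp.encoding = "ISO-8859-1"                 # requests' guess when no charset header is sent

print(resp.text)                     # garbled: bytes decoded with the wrong encoding
print(resp.content.decode("utf-8"))  # correct: decode the raw bytes explicitly
print(resp.status_code)              # 200
```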
    requests.get() parameters
    • url: the address to request
    • params: query-string parameters appended to the URL
    • headers: request headers to send with the request
    response=requests.get("http://www.antvv.com/?cate=4")
    print(response.text)
    
    
    a={}  # an empty params dict adds nothing to the URL
    response=requests.get("http://www.antvv.com",params=a)
    URLs for testing HTTP requests:
    http://httpbin.org/get echoes the request's details back; http://httpbin.org/post does the same for POST
    response=requests.get("http://httpbin.org/get")
    print(response.text)

    The output is:

    D:\ProgramData\Anaconda3\python.exe "E:/WXA/PyCharm study/爬虫介绍和基础库/demo1_requests请求.py"
    {
      "args": {}, 
      "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Host": "httpbin.org", 
        "User-Agent": "python-requests/2.19.1", 
        "X-Amzn-Trace-Id": "Root=1-5e9fd4f8-4e3d91cc100f2c6674d3c0b2"
      }, 
      "origin": "124.64.16.230", 
      "url": "http://httpbin.org/get"
    }
    
    
    Process finished with exit code 0
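    To see exactly what params does, you can prepare a request without sending it; the dict is URL-encoded into the query string. A sketch, reusing the cate=4 parameter from the antvv.com example above:

```python
import requests

# Prepare (but don't send) a GET request to inspect the final URL.
req = requests.Request("GET", "http://www.antvv.com/",
                       params={"cate": 4}).prepare()
print(req.url)  # http://www.antvv.com/?cate=4
```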

    The User-Agent above shows that the request came from a script (python-requests), so we can define our own headers.

    Custom User-Agent:

    User-Agent: tells the server which browser the client is using. If you don't set it, it defaults to python-requests; override it with the headers parameter.

    Referer: the address of the previous page, i.e. the page you navigated from. Some sites block requests whose Referer is missing or wrong.

    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400",
        "Referer":"http://httpbin.org"
    }
    response=requests.get("http://httpbin.org/get",headers=headers)
    print(response.text)

    The output is:

    D:\ProgramData\Anaconda3\python.exe "E:/WXA/PyCharm study/爬虫介绍和基础库/demo1_requests请求.py"
    {
      "args": {}, 
      "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Host": "httpbin.org", 
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400", 
        "X-Amzn-Trace-Id": "Root=1-5e9fd5d3-e0944316a8c4783b8e08fd2e"
      }, 
      "origin": "124.64.16.230", 
      "url": "http://httpbin.org/get"
    }

    Now the User-Agent no longer gives away that the request came from a script.

    • stream: streaming transfers
    # download an image, reading the whole body into memory
    url="https://ss3.bdstatic.com/70cFv8Sh_Q1YnxGkpoWK1HF6hhy/it/u=1208538952,1443328523&fm=26&gp=0.jpg"
    r=requests.get(url,headers=headers)
    print(r.content)
    with open("1.jpg",'wb') as file:
        file.write(r.content)
    # download the image again, this time as a stream
    url="https://ss3.bdstatic.com/70cFv8Sh_Q1YnxGkpoWK1HF6hhy/it/u=1208538952,1443328523&fm=26&gp=0.jpg"
    r=requests.get(url,headers=headers,stream=True)
    # print(r.content)
    with open("1.jpg",'wb') as file:
        for j in r.iter_content(102400):  # read in 100 KB chunks
            file.write(j)
            print(j)
    • timeout: maximum time to wait for a response; an exception is raised if it is exceeded
    url="https://www.zhihu.com"
    try:
        r=requests.get(url,timeout=2)
        print(r.text)
    except requests.exceptions.Timeout:  # catch the specific exception, not a bare BaseException
        print("request timed out")
    • proxies
    # route requests through a proxy
    url="http://httpbin.org/get"
    proxies={
        "http":"182.35.84.181:9999",
        "https":"",
    }
    r=requests.get(url,proxies=proxies)
    print(r.text)
    • SSL
      verify=False skips certificate verification; set it if you run into an SSLError. (12306 no longer requires this.)
     import requests
     response=requests.get('http://www.12306.cn',verify=False)
     print(response.status_code)
     print(response.content.decode('utf-8'))
    • JSON response bodies
    url="http://httpbin.org/get"
    r=requests.get(url)
    resp_str=r.text
    print(resp_str)
    print(type(resp_str))

    Output:

    D:\ProgramData\Anaconda3\python.exe "E:/WXA/PyCharm study/爬虫介绍和基础库/demo1_requests请求.py"
    {
      "args": {}, 
      "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Host": "httpbin.org", 
        "User-Agent": "python-requests/2.19.1", 
        "X-Amzn-Trace-Id": "Root=1-5ea13127-3bf7c712fb636862fd58c91c"
      }, 
      "origin": "117.136.0.252", 
      "url": "http://httpbin.org/get"
    }
    
    <class 'str'>
    • json.loads()
      json.loads() parses a JSON string into a Python dict or list
    url="http://httpbin.org/get"
    r=requests.get(url)
    resp_str=r.text
    
    import json
    resp_dict=json.loads(resp_str)  # parse the JSON string into a Python dict
    print(resp_dict)
    print(type(resp_dict))
    print(resp_dict['url'])

    Output:

    D:\ProgramData\Anaconda3\python.exe "E:/WXA/PyCharm study/爬虫介绍和基础库/demo1_requests请求.py"
    {'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.19.1', 'X-Amzn-Trace-Id': 'Root=1-5ea134f7-dc8428a433fcf066a2fde876'}, 'origin': '117.136.0.252', 'url': 'http://httpbin.org/get'}
    <class 'dict'>
    http://httpbin.org/get
    
    Process finished with exit code 0
    • json.dumps()
    print(json.dumps({"name":"tom",'age':18,'sex':"male"}))  # json.dumps(obj) serializes a Python dict or list into a JSON string
    print(type(json.dumps({"name":"tom",'age':18,'sex':"male"})))
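    json.dumps() and json.loads() are inverses, which a quick round trip confirms:

```python
import json

person = {"name": "tom", "age": 18, "sex": "male"}
s = json.dumps(person)   # dict -> JSON string
back = json.loads(s)     # JSON string -> dict
print(s)                 # {"name": "tom", "age": 18, "sex": "male"}
print(back == person)    # True
```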
    • r.json() parses the JSON response body; this parser ships with the requests module
    url="http://httpbin.org/get"
    r=requests.get(url)
    resp_str=r.json()
    print(resp_str)

    Output:

    D:\ProgramData\Anaconda3\python.exe "E:/WXA/PyCharm study/爬虫介绍和基础库/demo1_requests请求.py"
    {'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.19.1', 'X-Amzn-Trace-Id': 'Root=1-5ea13733-900d16cad13db5581550d818'}, 'origin': '117.136.0.252', 'url': 'http://httpbin.org/get'}
    
    Process finished with exit code 0
    • POST requests: send form data, upload images, etc.
    # POST request
    url="http://httpbin.org/post"
    data={
        "uname":"admin",
        "upwd":"123456" # field names from the form at http://www.antvv.com/login/login.html (view page source)
    }
    r=requests.post(url,data=data)
    print(r.text)
    • files: upload a file to the server
    # POST request
    url="http://httpbin.org/post"
    data={
        "uname":"admin",
        "upwd":"123456" # field names from the form at http://www.antvv.com/login/login.html (view page source)
    }
    # files: attach a file to the request
    files={
        "img1":open("./1.jpg",'rb')
    }
    r=requests.post(url,data=data,files=files)
    print(r.text)
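    You can prepare the POST without sending it to see how data and files are combined: when files is present, requests switches the body to multipart/form-data. A sketch; the short byte string stands in for real image data:

```python
import requests

# (filename, content, content-type) tuple instead of an open file handle
files = {"img1": ("1.jpg", b"\xff\xd8\xff\xe0fake-jpeg-bytes", "image/jpeg")}
req = requests.Request("POST", "http://httpbin.org/post",
                       data={"uname": "admin", "upwd": "123456"},
                       files=files).prepare()
print(req.headers["Content-Type"])     # multipart/form-data; boundary=...
print(len(req.body), "bytes in the encoded body")
```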
     
  • Original post: https://www.cnblogs.com/smile502/p/12704920.html