  • The website is API (1)

    Requests: automatically fetch HTML pages and submit network requests

    robots: the Robots Exclusion Standard for web crawlers

    Beautiful Soup: parse HTML pages

    Hands-on practice

    Re: regular expressions for extracting key information from pages

    The Scrapy* framework

    Week 1: Rules

    Unit 1: Getting started with the Requests library

    1. Installation

    Run the command prompt as administrator

    Enter pip install requests

    Verify:

    >>> import requests
    >>> r = requests.get("http://www.baidu.com")
    >>> r.status_code
    200

    requests.request(): construct a request; the base method that underpins all the others

    requests.get(): the main method for fetching an HTML page; maps to HTTP GET

    requests.get(url, params=None, **kwargs)

    url: the URL of the page to fetch

    params: extra parameters appended to the URL; dict or byte-stream format, optional

    **kwargs: 12 optional parameters controlling access
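One way to see how `params` is merged into the URL, without sending anything over the network, is to inspect a prepared request; `httpbin.org` below is only a placeholder host:

```python
import requests

# Build, but do not send, a GET request to see how `params`
# is encoded into the query string.
req = requests.Request('GET', 'http://httpbin.org/get',
                       params={'key1': 'value1', 'key2': 'value2'})
prepared = req.prepare()
print(prepared.url)  # http://httpbin.org/get?key1=value1&key2=value2
```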

    Attributes of the Response object

    r.status_code: the HTTP status of the request; 200 means the connection succeeded, other values (e.g. 404) mean it failed

    r.text: the HTTP response body as a string, i.e. the page content at the URL

    r.encoding: the response encoding guessed from the HTTP headers

    r.apparent_encoding: the response encoding inferred from the content itself

    r.content: the HTTP response body as bytes
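The relationship between these attributes can be shown on a hand-built Response object (setting the private `_content` attribute is a testing trick, not the official API):

```python
import requests

# Construct a Response by hand purely for illustration.
r = requests.models.Response()
r.status_code = 200
r._content = '<html>hello</html>'.encode('utf-8')
r.encoding = 'utf-8'

print(r.status_code)  # 200
print(r.content)      # b'<html>hello</html>' -- raw bytes
print(r.text)         # '<html>hello</html>' -- bytes decoded with r.encoding
```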

    A general-purpose code framework:

    >>> import requests
    >>> def getHTMLText(url):
            try:
                r = requests.get(url, timeout=30)
                r.raise_for_status()  # raise HTTPError if the status is not 200
                r.encoding = r.apparent_encoding
                return r.text
            except:
                return "An exception occurred"

    >>> if __name__ == "__main__":
            url = "www.baidu.com"
            print(getHTMLText(url))

    Output ("www.baidu.com" is missing the "http://" scheme, so requests raises an exception, which the framework catches):

    An exception occurred
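The key line in the framework is `r.raise_for_status()`. Its behaviour can be demonstrated offline on a hand-built Response (again using private attributes purely for illustration):

```python
import requests

# raise_for_status() raises requests.HTTPError for 4xx/5xx status
# codes and returns None otherwise.
bad = requests.models.Response()
bad.status_code = 404
try:
    bad.raise_for_status()
    outcome = 'ok'
except requests.HTTPError:
    outcome = 'HTTPError raised'
print(outcome)  # HTTPError raised
```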


    requests.head(): fetch a page's HTTP headers; maps to HTTP HEAD

    requests.post(): submit a POST request to an HTML page; maps to HTTP POST

    requests.put(): submit a PUT request; maps to HTTP PUT

    requests.patch(): submit a partial-modification request; maps to HTTP PATCH

    requests.delete(): submit a deletion request; maps to HTTP DELETE

    requests.request(method, url, **kwargs)

    method: the request method, one of the seven: GET/HEAD/POST/PUT/PATCH/DELETE/OPTIONS

    r = requests.request('GET',url,**kwargs)

    r = requests.request('HEAD',url,**kwargs)

    r = requests.request('POST',url,**kwargs)

    r = requests.request('PUT',url,**kwargs)

    r = requests.request('PATCH',url,**kwargs)

    r = requests.request('DELETE',url,**kwargs)

    r = requests.request('OPTIONS',url,**kwargs)
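Upper-case method names are the convention; requests normalizes the method to upper case when the request is prepared, which can be seen on a prepared (unsent) request. The URL below is a made-up placeholder:

```python
import requests

# The method name is normalized to upper case during preparation,
# so 'delete' and 'DELETE' behave the same.
req = requests.Request('delete', 'http://example.com/item/1').prepare()
print(req.method)  # DELETE
```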

    **kwargs: optional parameters controlling access

    params: dict or byte sequence, appended to the URL as query parameters

    data: dict, byte sequence, or file object, sent as the body of the Request

    json: data in JSON format, sent as the body of the Request

    headers: dict of custom HTTP headers
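The difference between `data` and `json` shows up in the request body and the Content-Type header requests sets automatically; this can be inspected on prepared (unsent) requests, with `httpbin.org` as a placeholder host:

```python
import requests

# `data` is form-encoded; `json` is serialized as JSON.
form = requests.Request('POST', 'http://httpbin.org/post',
                        data={'k': 'v'}).prepare()
print(form.body)                      # k=v
print(form.headers['Content-Type'])   # application/x-www-form-urlencoded

js = requests.Request('POST', 'http://httpbin.org/post',
                      json={'k': 'v'}).prepare()
print(js.headers['Content-Type'])     # application/json
```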

    An example robots exclusion file: https://www.baidu.com/robots.txt
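Such robots.txt rules can be checked programmatically with the standard library's urllib.robotparser; in this sketch the rules are parsed from a made-up literal string instead of being fetched from a site:

```python
from urllib.robotparser import RobotFileParser

# Parse a (made-up) robots.txt body directly instead of fetching it.
rules = """\
User-agent: *
Disallow: /private/
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())
print(rp.can_fetch('*', 'https://example.com/private/page'))  # False
print(rp.can_fetch('*', 'https://example.com/public/page'))   # True
```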

    Crawling examples with the Requests library

    >>> import requests
    >>> url = "https://item.jd.com/2967929.html"
    >>> try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[:1000])
    except:
        print("Fetch failed")
    
        
    <!DOCTYPE HTML>
    <html lang="zh-CN">
    <head>
        <!-- shouji -->
        <meta http-equiv="Content-Type" content="text/html; charset=gbk" />
        <title>【华为荣耀8】荣耀8 4GB+64GB 全网通4G手机 魅海蓝【行情 报价 价格 评测】-京东</title>
        <meta name="keywords" content="HUAWEI荣耀8,华为荣耀8,华为荣耀8报价,HUAWEI荣耀8报价"/>
        <meta name="description" content="【华为荣耀8】京东JD.COM提供华为荣耀8正品行货,并包括HUAWEI荣耀8网购指南,以及华为荣耀8图片、荣耀8参数、荣耀8评论、荣耀8心得、荣耀8技巧等信息,网购华为荣耀8上京东,放心又轻松" />
        <meta name="format-detection" content="telephone=no">
        <meta http-equiv="mobile-agent" content="format=xhtml; url=//item.m.jd.com/product/2967929.html">
        <meta http-equiv="mobile-agent" content="format=html5; url=//item.m.jd.com/product/2967929.html">
        <meta http-equiv="X-UA-Compatible" content="IE=Edge">
        <link rel="canonical" href="//item.jd.com/2967929.html"/>
            <link rel="dns-prefetch" href="//misc.360buyimg.com"/>
        <link rel="dns-prefetch" href="//static.360buyimg.com"/>
        <link rel="dns-prefetch" href="//img10.360buyimg.com"/>
        <link rel="dns
    >>> import requests
    >>> url = "https://www.amazon.cn/gp/product/B01MBL5Z3Y"
    >>> try:
        kv = {'user-agent':'Mozilla/5.0'}
        r = requests.get(url,headers = kv)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[1000:2000])
    except:
        print("Fail")
    
        
           ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],
            ue_sn = "opfcaptcha.amazon.cn",
            ue_id = 'HB12BAYVB85FMA4VRS38';
    }
    </script>
    </head>
    <body>
    
    <!--
            To discuss automated access to Amazon data please contact api-services-support@amazon.com.
            For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com.cn/index.html/ref=rm_c_sv, or our Product Advertising API at https://associates.amazon.cn/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
    -->
    
    <!--
    Correios.DoNotSend
    -->
    
    <div class="a-container a-padding-double-large" style="min-350px;padding:44px 0 !important">
    
        <div class="a-row a-spacing-double-large" style=" 350px; margin: 0 auto">
    
            <div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div>
    
            <div class="a-box a-alert a-alert-info a-spacing-base">
                <div class="a-box-inner">

     Baidu/360 search keyword submission

    import requests
    keyword = 'Python'
    try:
        kv = {'q':keyword}
        r = requests.get("http://www.so.com/s",params = kv)
        print(r.request.url)
        r.raise_for_status()
        print(len(r.text))
    except:
        print("Fetch failed")

    Image download

    import requests
    import os
    url = "http://wx1.sinaimg.cn/mw600/0076BSS5ly1g6hmmj82tpj30u018wdos.jpg"
    root = "E://pics//"
    path = root + url.split('/')[-1]
    try:
        if not os.path.exists(root):
            os.mkdir(root)
        if not os.path.exists(path):
            r = requests.get(url)
            with open(path,'wb') as f:
                f.write(r.content)
            print("File saved")
        else:
            print("File already exists")
    except:
        print("Fetch failed")

    IP address lookup

    import requests
    url = "http://m.ip138.com/ip.asp?ip="
    try:
        r = requests.get(url+'202.204.80.112')
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[-300:])
    except:
        print("Fetch failed")
  • Original post: https://www.cnblogs.com/kmxojer/p/11260085.html