  • Web Scraping in Practice 2: Amazon

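    import requests

    # Request the product page with the default requests headers
    r = requests.get('https://www.amazon.cn/dp/B01MYH8A99')
    print(r.status_code)
    # Decode the body with the encoding detected from its content
    r.encoding = r.apparent_encoding
    print(r.text)
    # Show the headers that were actually sent
    print(r.request.headers)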

    503

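    Partial excerpt of the response body. The page text reads "Please enter the characters you see below" and "Sorry, we just want to confirm that the current visitor is not an automated program. For best results, make sure cookies are enabled in your browser."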

    <div class="a-box-inner">
    <i class="a-icon a-icon-alert"></i>
    <h4>请输入您在下方看到的字符</h4>
    <p class="a-last">抱歉,我们只是想确认一下当前访问者并非自动程序。为了达到最佳效果,请确保您浏览器上的 Cookie 已启用。</p>
    </div>
    </div>

    {'User-Agent': 'python-requests/2.18.4', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

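    The User-Agent 'python-requests/2.18.4' marks this as a crawler request, as discussed in Practice 1, and the server rejected it with status 503. As in Practice 1, we now change the request header to imitate a browser.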
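The article shows only the output of the modified request, so here is a minimal sketch of how the header change could look. The helper names prepared_headers and fetch_with_browser_ua are hypothetical, not from the original. The first helper only prepares the request, so the headers can be inspected without contacting the server.

```python
import requests

URL = 'https://www.amazon.cn/dp/B01MYH8A99'  # product URL from the first example
BROWSER_HEADERS = {'user-agent': 'Mozilla/5.0'}  # same header value as in the article


def prepared_headers(url, headers=None):
    """Return the headers requests would send for a GET, without sending it.

    Hypothetical helper: it merges the given headers with the session defaults,
    which is what requests.get() does internally before sending.
    """
    session = requests.Session()
    prepared = session.prepare_request(requests.Request('GET', url, headers=headers))
    return prepared.headers  # case-insensitive mapping


def fetch_with_browser_ua(url, timeout=10):
    """Hypothetical helper: send the real request with the browser-like header."""
    r = requests.get(url, headers=BROWSER_HEADERS, timeout=timeout)
    return r.status_code, r.request.headers


default_headers = prepared_headers(URL)
browser_headers = prepared_headers(URL, BROWSER_HEADERS)

# The default User-Agent announces the library; the custom one replaces it.
print(default_headers['User-Agent'])
print(browser_headers['User-Agent'])
```

Calling fetch_with_browser_ua(URL) performs the actual request; whether the server answers 200 depends on its current anti-bot rules.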

    200
    {'user-agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
    <!DOCTYPE html>
    <!--[if lt IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->
    <!--[if IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->
    <!--[if IE 8]> <html lang="zh-CN" class="a-no-js a-lt-ie9"> <![endif]-->
    <!--[if gt IE 8]><!-->
    <html class="a-no-js" lang="zh-CN"><!--<![endif]--><head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <title dir="ltr">Amazon CAPTCHA</title>
    <meta name="viewport" content="width=device-width">
    <link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css">
    <script>

    if (true === true) {
    var ue_t0 = (+ new Date()),
    ue_csm = window,
    ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
    ue_furl = "fls-cn.amazon.cn",
    ue_mid = "AAHKV2X7AFYLW",
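Note that the output above has status 200, yet its title is "Amazon CAPTCHA", so a 200 alone does not prove that the product page came back. A minimal sketch of a content check follows. The names page_title, looks_like_captcha, and CAPTCHA_MARKERS are hypothetical, and the markers are simply the two strings visible in the outputs of this article; it is an assumption that they are enough to recognise the check page.

```python
import re

# Marker strings taken from the two outputs shown in this article (assumption:
# the check page always contains at least one of them).
CAPTCHA_MARKERS = ('Amazon CAPTCHA', '请输入您在下方看到的字符')


def page_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def looks_like_captcha(html):
    """Return True if the page contains one of the known CAPTCHA markers."""
    return any(marker in html for marker in CAPTCHA_MARKERS)


# A tiny sample built from the <title> line in the output above.
sample = '<html><head><title dir="ltr">Amazon CAPTCHA</title></head></html>'
print(page_title(sample))
print(looks_like_captcha(sample))
```

In a scraper, such a check could run on r.text right after r.raise_for_status(), before any parsing.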

    Code framework

    import requests

    def getHtmlText(url):
        try:
            # Send a browser-like User-Agent instead of the default python-requests one
            kv = {'user-agent': 'Mozilla/5.0'}
            r = requests.get(url, headers=kv)
            # Raise requests.HTTPError for 4xx/5xx status codes
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            # Return only a 1000-character slice to keep the output short
            return r.text[1000:2000]
        except requests.RequestException:
            return 'An exception occurred'

    if __name__ == '__main__':
        url = 'https://www.amazon.cn/gp/product/B01MTMZYBE/ref=s9_acss_bw_cg_Kindle_11a1_w?pf_rd_m=A1U5RCOVU0NYF2&pf_rd_s=merchandised-search-1&pf_rd_r=T8Y1JWWVNAA1AM9KE1SZ&pf_rd_t=101&pf_rd_p=ac9fd05e-c480-475b-a825-83c445252a6d&pf_rd_i=1991234071'
        print(getHtmlText(url))
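The framework relies on r.raise_for_status() to turn a response like the 503 from the first example into an exception. The sketch below shows that behaviour without touching the network by building Response objects by hand. The helper describe_failure is hypothetical, and hand-built responses are only a stand-in for real ones.

```python
import requests


def describe_failure(response):
    """Return a short message for a failed response, or None if it succeeded."""
    try:
        # Raises requests.HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.HTTPError as exc:
        return 'HTTP error: {}'.format(exc.response.status_code)
    return None


# Hand-built responses, so the example runs offline (illustrative only).
blocked = requests.Response()
blocked.status_code = 503  # the status returned for the default User-Agent

ok = requests.Response()
ok.status_code = 200

print(describe_failure(blocked))
print(describe_failure(ok))
```

Returning the status code instead of one generic message makes it easier to tell a rejected crawler request apart from a network problem.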
  • Original article: https://www.cnblogs.com/tingtin/p/12904620.html