  • Web Crawler, Day 02: Fetching and Analyzing Pages

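    ###Fetching pages###
    1. urllib3
        A powerful, easy-to-use HTTP client that fills gaps left by the Python standard library.
        Install: pip install urllib3
        Usage:
    import urllib3
    # PoolManager takes care of connection pooling and reuse
    http = urllib3.PoolManager()
    response = http.request('GET', 'http://news.qq.com')
    print(response.headers)
    # response.data is bytes; decode it with the page's charset (GBK here)
    result = response.data.decode('gbk')
    print(result)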
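The hard-coded 'gbk' above only works while the page is served in that encoding. A minimal sketch of reading the charset from the Content-Type response header instead; `decode_body` is a hypothetical helper name, and the utf-8 fallback is an assumption rather than something the original specifies.

```python
from email.message import Message


def decode_body(data: bytes, content_type: str, fallback: str = "utf-8") -> str:
    """Decode a response body using the charset named in Content-Type."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter, or None
    charset = msg.get_content_charset() or fallback
    try:
        return data.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: fall back and replace undecodable bytes
        return data.decode(fallback, errors="replace")


print(decode_body("新闻".encode("gbk"), "text/html; charset=GBK"))
```

With the response object from the snippet above, this would be called as decode_body(response.data, response.headers.get('Content-Type', '')).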
     
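    Sending HTTPS requests
    Install the dependency: pip install certifi
    import certifi
    import urllib3
    # Verify server certificates against certifi's CA bundle
    http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
    # Use an https:// URL so that the certificate check actually takes place
    resp = http.request('GET', 'https://news.baidu.com/')
    print(resp.data.decode('utf-8'))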
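For context, cert_reqs='CERT_REQUIRED' means the server must present a certificate that chains to a trusted CA. Python's own ssl module applies the same policy in its default client context. The sketch below only inspects such a context with the standard library and makes no network request; it illustrates the setting and is not part of urllib3's API.

```python
import ssl

# Default client-side context: loads the system's trusted CA certificates
ctx = ssl.create_default_context()

# The server certificate is mandatory and must validate
print(ctx.verify_mode == ssl.CERT_REQUIRED)

# The certificate must also match the requested host name
print(ctx.check_hostname)
```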
     
    ####Passing query parameters
    import urllib3
    from urllib.parse import urlencode
    http = urllib3.PoolManager()
    args = {'wd' : '人民币'}
    # Wrong: formatting the dict directly embeds its repr in the URL
    # url = 'http://www.baidu.com/s?%s' % (args)
    # Right: urlencode percent-encodes the parameters
    url = 'http://www.baidu.com/s?%s' % (urlencode(args))
    print(url)
    # resp = http.request('GET' , url)
    # print(resp.data.decode('utf-8'))
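To make the effect of urlencode concrete, here is a small standard-library sketch of what it produces for a non-ASCII value and how the value can be recovered. `build_url` is a hypothetical helper introduced only for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(base: str, params: dict) -> str:
    """Append params to base as a percent-encoded query string."""
    return base + "?" + urlencode(params)


args = {"wd": "人民币"}
url = build_url("http://www.baidu.com/s", args)
print(url)  # http://www.baidu.com/s?wd=%E4%BA%BA%E6%B0%91%E5%B8%81

# Parsing the query string recovers the original value
print(parse_qs(urlsplit(url).query))  # {'wd': ['人民币']}
```

Each character of 人民币 becomes three UTF-8 bytes, and each byte is written as a %XX escape.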
     
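    # Request headers copied from a browser session
    headers = {
        'Accept' : 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
        # 'br' is left out: urllib3 can only decode it when a brotli package is installed
        'Accept-Encoding' : 'gzip, deflate',
        'Accept-Language' : 'zh-CN,zh;q=0.9',
        'Connection' : 'keep-alive',
        'Host' : 'www.baidu.com',
        # Header values must be latin-1 encodable, so percent-encode the query
        'Referer' : 'https://www.baidu.com/s?' + urlencode(args),
        'User-Agent' : "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"
    }
    # For GET requests, urllib3 appends fields to the URL as the query string
    resp = http.request('GET', 'http://www.baidu.com/s', fields=args, headers=headers)
    print(resp.data.decode('utf-8'))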
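A note on non-ASCII header values: Python's http.client, which urllib3 builds on, encodes header values as latin-1, so a value containing characters such as 人民币 cannot be sent as-is. A minimal sketch of percent-encoding a URL before using it as a Referer; `ascii_header_url` is a hypothetical helper, and the set of characters kept unescaped is an assumption that suits simple query URLs.

```python
from urllib.parse import quote


def ascii_header_url(url: str) -> str:
    """Percent-encode non-ASCII characters, leaving URL delimiters intact."""
    # '%' is kept safe so an already-encoded URL is not encoded twice
    return quote(url, safe=":/?&=%")


referer = ascii_header_url("https://www.baidu.com/s?wd=人民币")
print(referer)

# The result is plain ASCII, so encoding it as latin-1 succeeds
referer.encode("latin-1")
```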
     
     
     
     
  • Original article: https://www.cnblogs.com/Albert-w/p/9013194.html