  • Web Scraping - Day 02 - Fetching and Analysis

    ###Page Fetching###
    1. urllib3
        A powerful, easy-to-use HTTP client that fills gaps in the Python standard library
        Install: pip install urllib3
        Usage:
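    import urllib3
    # PoolManager handles connection pooling and reuse
    http = urllib3.PoolManager()
    response = http.request('GET', 'http://news.qq.com')
    print(response.headers)
    # response.data is bytes; this page is decoded here as GBK
    result = response.data.decode('gbk')
    print(result)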
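
The `'gbk'` codec above is hard-coded for that one site. A minimal sketch, assuming the server names its encoding in the `charset` parameter of the `Content-Type` header, of picking the codec from the header instead. `charset_from_content_type` and `decode_body` are hypothetical helper names, not part of urllib3, and the sample bytes stand in for a real response body:

```python
from email.message import Message


def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset named in a Content-Type header value, or `default`."""
    if not content_type:
        return default
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() lowercases the name and returns None if absent
    return msg.get_content_charset() or default


def decode_body(body, content_type, default='utf-8'):
    """Decode raw response bytes; undecodable bytes are replaced, not raised."""
    codec = charset_from_content_type(content_type, default)
    return body.decode(codec, errors='replace')


# Assumed usage with urllib3 (needs network access, so it is left as a comment):
#   response = http.request('GET', 'http://news.qq.com')
#   text = decode_body(response.data, response.headers.get('Content-Type'), default='gbk')
sample = '腾讯新闻'.encode('gbk')  # stand-in for response.data
print(decode_body(sample, 'text/html; charset=GBK'))  # 腾讯新闻
```

If the header carries no charset, the `default` argument decides the codec, so a site-specific fallback such as `'gbk'` can still be passed in.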
     
    Sending requests over HTTPS
    Install the dependency: pip install certifi
    import certifi
    import urllib3
    # Verify the server certificate against certifi's CA bundle
    http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
    resp = http.request('GET', 'https://news.baidu.com/')
    print(resp.data.decode('utf-8'))
     
    ####Passing parameters
    import urllib3
    from urllib.parse import urlencode
    http = urllib3.PoolManager()
    args = {'wd' : '人民币'}
    # url = 'http://www.baidu.com/s?%s' % (args)
    url = 'http://www.baidu.com/s?%s' % (urlencode(args))
    print(url)
    # resp = http.request('GET' , url)
    # print(resp.data.decode('utf-8'))
     
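    # Browser-like request headers
    headers = {
        'Accept' : 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
        'Accept-Encoding' : 'gzip, deflate',  # 'br' omitted: decoding Brotli needs an extra package
        'Accept-Language' : 'zh-CN,zh;q=0.9',
        'Connection' : 'keep-alive',
        'Host' : 'www.baidu.com',
        # header values must be Latin-1 encodable, so the keyword is percent-encoded
        'Referer' : 'https://www.baidu.com/s?' + urlencode(args),
        'User-Agent' : "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"
    }
    # fields= lets urllib3 encode the GET parameters into the query string
    resp = http.request('GET', 'http://www.baidu.com/s', fields=args, headers=headers)
    print(resp.data.decode('utf-8'))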
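
A search keyword such as 人民币 cannot go into a header value like `Referer` as raw text. A minimal sketch of why, assuming the HTTP layer encodes header values as Latin-1 the way Python's `http.client` does; the variable names are illustrative only:

```python
from urllib.parse import quote

keyword = '人民币'
raw_referer = 'https://www.baidu.com/s?wd=' + keyword

# Assumption: header values are encoded as Latin-1 before being sent,
# as http.client does; Chinese characters are outside that range.
try:
    raw_referer.encode('latin-1')
    raw_ok = True
except UnicodeEncodeError:
    raw_ok = False

# Percent-encoding leaves only ASCII characters, which always encode cleanly.
safe_referer = 'https://www.baidu.com/s?wd=' + quote(keyword)
safe_referer.encode('latin-1')

print(raw_ok)        # False
print(safe_referer)  # https://www.baidu.com/s?wd=%E4%BA%BA%E6%B0%91%E5%B8%81
```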
     
     
     
     
  • Original source: https://www.cnblogs.com/Albert-w/p/9013194.html