  • Getting Started with Web Crawlers

    In Python 3, Python 2's urllib2 was reorganized into urllib.request. A fetched page comes back as bytes, so it has to be decoded explicitly with decode().

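    The Python 2 URL-encoding helpers now live in urllib.parse: urllib.parse.quote() percent-encodes a single string (for example, Chinese characters) and urllib.parse.unquote() decodes it again. Both return str, not bytes.

    To turn a dict of parameters into a ready-to-use query string (also a str), use urllib.parse.urlencode().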
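
    As a rough illustration of the three helpers above, the sketch below encodes and decodes a sample keyword. The keyword and the "wd" parameter name are only illustrative choices.

```python
from urllib.parse import quote, unquote, urlencode

keyword = "爬虫"  # sample text, chosen only for illustration

encoded = quote(keyword)            # percent-encode one string (UTF-8 by default)
decoded = unquote(encoded)          # undo the percent-encoding
query = urlencode({"wd": keyword})  # build a "key=value" query string from a dict

print(type(encoded).__name__, encoded)
print(decoded == keyword)
print(type(query).__name__, query)
```

    quote() works on one value at a time, while urlencode() also joins several key/value pairs with "&", which is usually what a query string needs.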

    To get past server-side checks, a crawler usually changes some of the request headers so that it looks like a browser.

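    Setting User-Agent to a browser value is especially important here. Avoid advertising gzip support (an Accept-Encoding: gzip header), because urllib.request does not decompress the response body automatically.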
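
    If a server does send a gzip-compressed body, it has to be decompressed by hand before decoding. A minimal sketch, assuming the caller passes in the Content-Encoding header value; decode_body is a hypothetical helper name, and the compressed bytes are simulated locally instead of fetched:

```python
import gzip
from typing import Optional


def decode_body(raw: bytes, content_encoding: Optional[str] = None,
                charset: str = "utf-8") -> str:
    """Return the body as text, undoing gzip compression when it is declared."""
    if (content_encoding or "").lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Simulate a gzip-compressed response body without touching the network.
compressed = gzip.compress("<html>你好</html>".encode("utf-8"))
print(decode_body(compressed, "gzip"))
```

    With a real response, the second argument would typically come from response.headers.get("Content-Encoding").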

    print(response.getcode())  # HTTP status code
    print(response.geturl())   # URL actually retrieved (after any redirects)
    print(response.info())     # response headers sent by the server
    import urllib.request

    url = "http://www.hao123.com/"
    ua_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}

    # Build a Request object for the target URL, with the custom headers
    request = urllib.request.Request(url=url, headers=ua_headers)
    # Send the request and get the response object
    response = urllib.request.urlopen(request)

    # read() returns bytes
    html = response.read()

    print(html.decode("utf-8"))

    print(response.getcode(), response.geturl())
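
    Hard-coding "utf-8" fails on pages served in another encoding. One possible refinement is to read the charset from the Content-Type response header and fall back to UTF-8. In the sketch below, decode_html is a hypothetical helper, and fake_headers stands in for response.headers (which is an email.message.Message subclass) so that the example runs offline:

```python
from email import message_from_string


def decode_html(raw: bytes, headers, fallback: str = "utf-8") -> str:
    """Decode a body using the charset declared in Content-Type, if any."""
    charset = headers.get_content_charset() or fallback
    # errors="replace" keeps going when a few bytes do not match the charset
    return raw.decode(charset, errors="replace")


# Stand-in for response.headers, built locally for the demonstration.
fake_headers = message_from_string("Content-Type: text/html; charset=gbk\n\n")
print(decode_html("中文".encode("gbk"), fake_headers))
```

    With a live response this would be called as decode_html(response.read(), response.headers).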

     ******************************************************************************************

    About the User-Agent string:

    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"

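    Mozilla/5.0: generic compatibility token sent by almost every browser

    Windows NT 10.0; Win64; x64: operating system and platform

    AppleWebKit/537.36 (KHTML, like Gecko): rendering engine

    Chrome/72.0.3626.109 Safari/537.36: the actual browser and its version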
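
    The breakdown above can be reproduced with a small regular expression. This is only a rough sketch for strings shaped like the example; the pattern and the split_user_agent name are assumptions made for illustration, not a general User-Agent parser:

```python
import re

UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36")

# name/version, optionally followed by a parenthesised comment
TOKEN = re.compile(r"([A-Za-z]+)/([\d.]+)(?: \(([^)]*)\))?")


def split_user_agent(ua: str):
    """Return a list of (name, version, comment) tuples; comment may be empty."""
    return TOKEN.findall(ua)


for name, version, comment in split_user_agent(UA):
    print(name, version, comment)
```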

    Commonly used User-Agent strings come from browsers such as Firefox, Opera, and Chrome:

    # Header values only: the "User-Agent" name is supplied when the header is set
    ua_list = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
        "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
    ]

    Set the request header before contacting the server:

    import random
    import urllib.request

    url = "http://www.baidu.com/"
    # Header values only: the "User-Agent" name is supplied in add_header() below
    ua_list = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
        "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
    ]

    # Pick a random User-Agent from the list
    user_agent = random.choice(ua_list)
    # Build the request
    request = urllib.request.Request(url)
    # Set (or add) an HTTP request header
    request.add_header("User-Agent", user_agent)

    # Request stores header names capitalized, so look it up as "User-agent"
    head = request.get_header("User-agent")
    print(head)
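
    The header lookup is case-sensitive in a slightly surprising way: Request capitalizes header names when storing them, so the key becomes "User-agent". The sketch below wraps the random choice in a hypothetical build_request helper (UA_POOL is a shortened, illustrative list) and shows both lookups. No network access is needed because the request is never sent:

```python
import random
import urllib.request

# Shortened pool for illustration; values only, without a "User-Agent: " prefix
UA_POOL = [
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
]


def build_request(url: str, ua_pool=UA_POOL) -> urllib.request.Request:
    """Create a Request carrying a randomly chosen User-Agent header."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", random.choice(ua_pool))
    return request


request = build_request("http://www.baidu.com/")
print(request.get_header("User-agent"))  # stored under the capitalized key
print(request.get_header("User-Agent"))  # None: this exact key is not stored
```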

    A simple keyword search request:

    import urllib.request
    import urllib.parse
    import random

    # Target address
    url = "http://www.baidu.com/s"
    # Browser User-Agent values used to disguise the client (values only, no "User-Agent: " prefix)
    ua_list = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
        "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
    ]
    # Pick one at random for this request
    user_agent = random.choice(ua_list)
    # Read the search keyword
    select = input("Enter the keyword to search for: ")

    # URL-encode the query parameters
    wd = {"wd": select}
    wd = urllib.parse.urlencode(wd)
    # Assemble the full URL
    url = url + "?" + wd

    # Create the request object
    request = urllib.request.Request(url)
    # Set the User-Agent request header
    request.add_header("User-Agent", user_agent)
    # Contact the target server
    response = urllib.request.urlopen(request)
    # Read the body and decode it as UTF-8
    html = response.read().decode("utf-8")

    # print(html)
    print(">>>>>>" + url)
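
    The URL-building step can be pulled into a small function and checked without sending anything. In this sketch build_search_url is a hypothetical helper, and the extra "pn" parameter is only an assumed example of a second query field; parse_qs is used to confirm that the keyword survives the round trip:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base_url: str, keyword: str, **extra) -> str:
    """Append an encoded query string (wd=<keyword> plus extras) to base_url."""
    params = {"wd": keyword}
    params.update(extra)
    return base_url + "?" + urlencode(params)


url = build_search_url("http://www.baidu.com/s", "python 爬虫", pn=10)
print(url)
# Decode the query string again to inspect what the server would receive
print(parse_qs(urlsplit(url).query))
```

    Note that urlencode() writes a space as "+", which parse_qs() turns back into a space.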
  • Original article: https://www.cnblogs.com/wen-kang/p/10417621.html