  • User-Agent (UA) spoofing in the Scrapy framework

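    For example, searching for "ip" on Baidu shows your own machine's IP address. Rotating the User-Agent makes each request look as if it comes from a different browser, and routing requests through a proxy makes them appear to come from a different machine's IP.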
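    The User-Agent is an ordinary HTTP request header, so changing it alters the browser identity the server sees, not the source IP address. A minimal standard-library sketch of that idea follows; `USER_AGENTS` and `build_request` are illustrative names that do not appear in the original post, and no network request is sent.

```python
import random
import urllib.request

# Two sample strings; any list of browser User-Agent values would do.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


def build_request(url):
    """Build (but do not send) a request carrying a randomly chosen User-Agent."""
    ua = random.choice(USER_AGENTS)
    return urllib.request.Request(url, headers={"User-Agent": ua})


req = build_request("https://www.baidu.com/s?wd=ip")
# urllib stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
```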

    Spider code:

    import scrapy


    class UatestSpider(scrapy.Spider):
        name = 'UATest'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.baidu.com/s?wd=ip']

        def parse(self, response):
            # Save the result page so the reported IP can be inspected afterwards.
            with open('./ip.html', 'w', encoding='utf-8') as fp:
                fp.write(response.text)
            print('over!!!')

    Middleware code (middlewares.py):

    import random

    # The old scrapy.contrib.downloadermiddleware path no longer exists in current Scrapy releases.
    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

    # Pool of User-Agent strings; one is chosen at random for each request.
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]


    class UAPool(UserAgentMiddleware):
        def process_request(self, request, spider):
            # Overwrite the User-Agent header with a random entry from the pool.
            ua = random.choice(user_agent_list)
            request.headers['User-Agent'] = ua
            print(request.headers['User-Agent'])


    # Sample proxy addresses (host:port), grouped by the scheme of the requests they serve.
    proxy_http = ['125.27.10.150:56292', '114.34.168.157:46160']
    proxy_https = ['1.20.101.81:35454', '113.78.254.156:9000']


    class UapoolDownloaderMiddleware(object):
        # `request` is the intercepted request object.
        # `spider` is the spider instance.
        def process_request(self, request, spider):
            # Choose a proxy whose scheme matches the scheme of the request URL.
            if request.url.split(':')[0] == 'https':
                request.meta['proxy'] = 'https://' + random.choice(proxy_https)
            else:
                request.meta['proxy'] = 'http://' + random.choice(proxy_http)
            print(request.meta['proxy'])
            return None
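    The proxy selection in the middleware can be tried outside Scrapy. The sketch below is one possible variation, not the original author's code: it parses the scheme with `urllib.parse.urlsplit` instead of splitting on ':', and `PROXIES` and `pick_proxy` are hypothetical names. The addresses are the samples from the post and are unlikely to still be live.

```python
import random
from urllib.parse import urlsplit

# Sample addresses copied from the middleware; treat them as placeholders.
PROXIES = {
    "http": ["125.27.10.150:56292", "114.34.168.157:46160"],
    "https": ["1.20.101.81:35454", "113.78.254.156:9000"],
}


def pick_proxy(url):
    """Return a random proxy URL from the pool that matches the URL's scheme."""
    scheme = urlsplit(url).scheme.lower()
    # Anything that is not https falls back to the http pool, as in the middleware.
    key = "https" if scheme == "https" else "http"
    return f"{key}://{random.choice(PROXIES[key])}"


print(pick_proxy("https://www.baidu.com/s?wd=ip"))
```

    Inside a downloader middleware, the returned value would be assigned to `request.meta['proxy']`.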

    Note: in settings.py, uncomment the DOWNLOADER_MIDDLEWARES setting and register the middleware classes you wrote.
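    A settings.py fragment for this might look like the sketch below. The package name `UAPool` is an assumption inferred from the generated class name `UapoolDownloaderMiddleware`; replace it with your own project's package. The order values 542 and 543 are illustrative and only need to sit below 750, where Scrapy's built-in HttpProxyMiddleware runs. Disabling the built-in UserAgentMiddleware is optional, since `UAPool` overwrites the header anyway.

```python
# settings.py (fragment). "UAPool" as the project package name is an assumption.
DOWNLOADER_MIDDLEWARES = {
    # Optional: switch off Scrapy's built-in User-Agent middleware.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Custom middlewares; lower numbers process requests earlier.
    "UAPool.middlewares.UAPool": 542,
    "UAPool.middlewares.UapoolDownloaderMiddleware": 543,
}
```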

  • Original article: https://www.cnblogs.com/duanhaoxin/p/10138809.html