  • Using an IP Pool or a User-Agent Pool in Scrapy (Python 3)

    1. Create the Scrapy project

        scrapy startproject <project_name>


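    2. Enter the project directory and generate a spider file from a spider template

        scrapy genspider -l    # list the available templates
        scrapy genspider -t <template> <spider_name> <allowed_domain>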

    3. Define the data you want to scrape (items.py)


    4. Write the spider file


    5. Set up an IP pool or a user-agent pool
    (1) Setting up an IP pool
    Step 1: Add the proxy servers' addresses to settings.py, for example:

        # IP pool: one proxy server per entry, written as "host:port"
        IPPOOL = [
            {"ipaddr": "221.230.72.165:80"},
            {"ipaddr": "175.154.50.162:8118"},
            {"ipaddr": "111.155.116.212:8123"},
        ]
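
    Because each entry is just a "host:port" string, a quick check before crawling can catch typos in the pool. The following is a minimal sketch, assuming every entry uses the "ipaddr" key shown above; the helper name `to_proxy_url` is hypothetical and not part of Scrapy.

```python
def to_proxy_url(entry, scheme="http"):
    """Turn {"ipaddr": "host:port"} into "http://host:port", validating the port."""
    host, sep, port = entry["ipaddr"].rpartition(":")
    if not sep or not host or not port.isdigit() or not 0 < int(port) < 65536:
        raise ValueError("expected 'host:port', got %r" % entry["ipaddr"])
    return "%s://%s:%s" % (scheme, host, port)


IPPOOL = [
    {"ipaddr": "221.230.72.165:80"},
    {"ipaddr": "175.154.50.162:8118"},
    {"ipaddr": "111.155.116.212:8123"},
]

# Fails early with ValueError if any entry is malformed.
proxy_urls = [to_proxy_url(entry) for entry in IPPOOL]
```

    The resulting strings have the same form as the value the middleware in Step 2 writes into request.meta["proxy"].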

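    Step 2: Create the downloader middleware file middlewares.py (in the same directory as settings.py), for example:

    # To create the file from a cmd prompt, assuming the project is named modetest:

    E:\workspace\PyCharm\codeSpace\modetest\modetest>echo # > middlewares.py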

        # -*- coding: utf-8 -*-
        import random

        # Built-in proxy middleware (the old scrapy.contrib path is deprecated since Scrapy 1.0)
        from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware

        # The IPPOOL list defined in settings.py
        from .settings import IPPOOL


        class IPPOOlS(HttpProxyMiddleware):
            def __init__(self, ip=''):
                self.ip = ip

            def process_request(self, request, spider):
                # Pick a random proxy from the pool for this request
                thisip = random.choice(IPPOOL)
                print("Current proxy IP: " + thisip["ipaddr"])
                request.meta["proxy"] = "http://" + thisip["ipaddr"]
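
    Importing settings.py directly ties the middleware to that one module. As an alternative sketch, the pool can be read through the crawler's settings in the `from_crawler` hook, which Scrapy calls when it builds a middleware. The class name `RandomProxyMiddleware` is hypothetical; the IPPOOL setting name follows Step 1. It is written as a plain class, so it can be exercised without a running crawl.

```python
import random


class RandomProxyMiddleware:
    """Downloader middleware that assigns a random proxy to each request."""

    def __init__(self, pool):
        if not pool:
            raise ValueError("IPPOOL is empty")
        self.pool = list(pool)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy passes in the running crawler; Settings.getlist
        # returns the IPPOOL list defined in settings.py.
        return cls(crawler.settings.getlist("IPPOOL"))

    def process_request(self, request, spider):
        entry = random.choice(self.pool)
        request.meta["proxy"] = "http://" + entry["ipaddr"]
        # Returning None tells Scrapy to keep processing the request.
        return None
```

    A class like this would be registered in DOWNLOADER_MIDDLEWARES under its dotted path, in the same way as IPPOOlS in Step 3.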

    Step 3: Enable the downloader middleware in settings.py

        # Downloader middleware configuration
        DOWNLOADER_MIDDLEWARES = {
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 123,
            'modetest.middlewares.IPPOOlS': 125,
        }

    (2) Setting up a user-agent pool
    Step 1: Add the user-agent pool to settings.py (a few browser 'User-Agent' strings), for example:

        # User-agent pool
        UPPOOL = [
            "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393",
        ]

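    Step 2: Create the downloader middleware file uamid.py (in the same directory as settings.py), for example:

    # To create the file from a cmd prompt, assuming the project is named modetest:

    E:\workspace\PyCharm\codeSpace\modetest\modetest>echo # > uamid.py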

        # -*- coding: utf-8 -*-
        import random

        # Built-in user-agent middleware (the old scrapy.contrib path is deprecated since Scrapy 1.0)
        from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

        # The UPPOOL list defined in settings.py
        from .settings import UPPOOL


        class Uamid(UserAgentMiddleware):
            # The parameter must be named user_agent, otherwise errors are likely
            def __init__(self, user_agent=''):
                self.user_agent = user_agent

            def process_request(self, request, spider):
                # Pick a random user agent from the pool for this request
                thisua = random.choice(UPPOOL)
                print("Current User-Agent: " + thisua)
                request.headers.setdefault('User-Agent', thisua)
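
    One detail worth noting: `setdefault` only writes the header when the request does not already carry one, so a User-Agent set explicitly on a request is left untouched. The sketch below illustrates this with a plain dict standing in for Scrapy's headers object (an assumption made so the snippet runs without Scrapy); `pick_user_agent` and its `force` flag are hypothetical.

```python
import random

UPPOOL = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
]


def pick_user_agent(headers, pool, force=False):
    """Set a random User-Agent; with force=True, overwrite an existing one."""
    ua = random.choice(pool)
    if force:
        headers["User-Agent"] = ua
    else:
        headers.setdefault("User-Agent", ua)
    return headers["User-Agent"]


empty = {}
preset = {"User-Agent": "my-crawler/1.0"}
pick_user_agent(empty, UPPOOL)   # header was missing, so one is taken from the pool
pick_user_agent(preset, UPPOOL)  # header already present, so it is kept as-is
```

    If every request should be rotated regardless of what the spider set, direct assignment (the `force=True` branch here) is the variant to use.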

    Step 3: Enable the downloader middleware in settings.py

        # Downloader middleware configuration
        DOWNLOADER_MIDDLEWARES = {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 2,
            'modetest.uamid.Uamid': 1,
        }
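
    The numbers in DOWNLOADER_MIDDLEWARES are orders: Scrapy calls each middleware's process_request in increasing order, and a value of None disables a middleware. That is why Uamid (1) sees the request before the built-in UserAgentMiddleware (2). A small sketch of that ordering rule follows; `process_request_order` is a hypothetical helper, not a Scrapy function, and it ignores the built-in defaults that Scrapy merges in.

```python
def process_request_order(middlewares):
    """Return the enabled middleware paths in the order process_request is called."""
    enabled = {path: order for path, order in middlewares.items() if order is not None}
    return sorted(enabled, key=enabled.get)


DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 2,
    "modetest.uamid.Uamid": 1,
}

order = process_request_order(DOWNLOADER_MIDDLEWARES)
```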

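    In short, this kind of configuration is often unavoidable, so everything can be set up together in settings.py as shown below; paste it at the end of the settings.py file.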

        #========================================

        # IP pool and user-agent pool settings

        # Disable cookies
        COOKIES_ENABLED = False

        # IP pool
        IPPOOL = [
            {"ipaddr": "221.230.72.165:80"},
            {"ipaddr": "175.154.50.162:8118"},
            {"ipaddr": "111.155.116.212:8123"},
        ]

        # User-agent pool
        UPPOOL = [
            "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393",
        ]

        # Downloader middleware configuration
        DOWNLOADER_MIDDLEWARES = {
            #'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 123,
            #'modetest.middlewares.IPPOOlS': 125,
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 2,
            'modetest.uamid.Uamid': 1,
        }

        #============================================
  • Original article: https://www.cnblogs.com/xiaomingzaixian/p/7121280.html