  • Five ways to set up an IP proxy

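    When we build a crawler to collect data, the fetching is automated, so it is intensive and fast and can put heavy load on the target site's server. If the same IP address requests the same page over and over, it is likely to get banned. This post covers several ways to route requests through a proxy to avoid that. Even so, a crawler should still include suitable delays to reduce its impact on the target site.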
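    The delay mentioned above could look something like the sketch below. It is only an illustration, not code from the original article: `polite_delay`, `crawl`, and the `fetch` callback are hypothetical names, and the 1 to 3 second range is an arbitrary assumption.

```python
import random
import time


def polite_delay(min_seconds=1.0, max_seconds=3.0):
    """Return a random delay so requests do not arrive at a fixed rhythm."""
    return random.uniform(min_seconds, max_seconds)


def crawl(urls, fetch, min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    `fetch` is any callable that downloads one page (hypothetical here).
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(polite_delay(min_seconds, max_seconds))
        results.append(fetch(url))
    return results
```

    Passing `sleep` as a parameter keeps the pause easy to replace in tests; in a real crawler the default `time.sleep` is used.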

    1. Setting a proxy with requests:

    import requests

    proxies = { "http": "http://192.10.1.10:8080", "https": "http://193.121.1.10:9080", }

    requests.get("http://targetwebsite.com", proxies=proxies)
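    Since the goal is to avoid hitting a site from a single IP, one possible extension is to pick a different proxy for each request. The sketch below assumes a small pool of proxies; `PROXY_POOL`, `proxies_for`, and `get_with_random_proxy` are hypothetical names, and the addresses are the placeholders from the example above.

```python
import random

# Placeholder proxy addresses, reused from the example above.
PROXY_POOL = [
    "http://192.10.1.10:8080",
    "http://193.121.1.10:9080",
]


def proxies_for(proxy_url):
    """Build the dict that requests expects for its `proxies` argument."""
    return {"http": proxy_url, "https": proxy_url}


def get_with_random_proxy(url, pool=PROXY_POOL, timeout=10):
    """Fetch `url` through a proxy picked at random from `pool`."""
    # Imported here so proxies_for() stays usable without requests installed.
    import requests

    proxy_url = random.choice(pool)
    return requests.get(url, proxies=proxies_for(proxy_url), timeout=timeout)
```

    A call such as `get_with_random_proxy("http://targetwebsite.com")` then goes out through one of the pooled proxies; the `timeout` keeps a dead proxy from blocking the crawler indefinitely.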

    2. Setting a proxy with Selenium + Chrome:

    from selenium import webdriver

    PROXY = "192.206.133.227:8080"

    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--proxy-server={0}'.format(PROXY))

    # Selenium 4 takes `options=`; older releases used `chrome_options=`.
    browser = webdriver.Chrome(options=chrome_options)
    browser.get('http://www.targetwebsize.com')  # the URL must include the scheme
    print(browser.page_source)
    browser.close()

    3. Setting a proxy with Selenium + PhantomJS:

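    # Note: PhantomJS support was removed in Selenium 4, so this needs Selenium 3.x.
    from selenium import webdriver
    from selenium.webdriver.common.proxy import ProxyType

    browser = webdriver.PhantomJS()

    # Use DesiredCapabilities to apply the proxy settings, then open a new session.
    proxy = webdriver.Proxy()
    proxy.proxy_type = ProxyType.MANUAL
    proxy.http_proxy = '192.25.171.51:8080'
    # Add the proxy settings to webdriver.DesiredCapabilities.PHANTOMJS
    proxy.add_to_capabilities(webdriver.DesiredCapabilities.PHANTOMJS)
    browser.start_session(webdriver.DesiredCapabilities.PHANTOMJS)
    browser.get('http://www.targetwebsize.com')
    print(browser.page_source)

    # To go back to a direct connection (no proxy), just set proxy_type again
    proxy.proxy_type = ProxyType.DIRECT
    proxy.add_to_capabilities(webdriver.DesiredCapabilities.PHANTOMJS)
    browser.start_session(webdriver.DesiredCapabilities.PHANTOMJS)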

    4. Setting a proxy in the Scrapy crawling framework:

    Add the proxy IPs to settings.py:

    PROXIES = [
        'http://173.207.95.27:8080',
        'http://111.8.100.99:8080',
        'http://126.75.99.113:8080',
        'http://68.146.165.226:3128',
    ]

    Then add the following code to middlewares.py:

    import random


    class ProxyMiddleware(object):
        '''Set a proxy on each outgoing request.'''

        def __init__(self, ip):
            self.ip = ip

        @classmethod
        def from_crawler(cls, crawler):
            # Read the PROXIES list defined in settings.py
            return cls(ip=crawler.settings.get('PROXIES'))

        def process_request(self, request, spider):
            # Pick one proxy at random for this request
            ip = random.choice(self.ip)
            request.meta['proxy'] = ip

    Finally, add the custom class to the downloader middleware settings, as follows:

    DOWNLOADER_MIDDLEWARES = { 'myproject.middlewares.ProxyMiddleware': 543,}
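    Random choice can pick the same proxy several times in a row. As a possible variation (not part of the original article), the sketch below hands out the proxies in turn instead; `RoundRobinProxyMiddleware` is a hypothetical name, and it assumes the same `PROXIES` setting as above.

```python
from itertools import cycle


class RoundRobinProxyMiddleware:
    """Assign proxies from the PROXIES setting in turn instead of at random."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("the PROXIES setting is empty")
        # cycle() repeats the list endlessly: first, second, ..., first again.
        self._proxies = cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(proxies=crawler.settings.get('PROXIES'))

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads request.meta['proxy'].
        request.meta['proxy'] = next(self._proxies)
```

    To try it, register `'myproject.middlewares.RoundRobinProxyMiddleware'` in `DOWNLOADER_MIDDLEWARES` in place of `ProxyMiddleware`.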

    5. Setting a proxy with asynchronous Python (aiohttp):

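    import asyncio

    import aiohttp
    import async_timeout

    proxy = "http://192.121.1.10:9080"


    async def fetch_status():
        async with aiohttp.ClientSession() as session:
            async with session.get("http://python.org", proxy=proxy) as resp:
                print(resp.status)


    # HTTPS, option 1: use a SOCKS connector (SocksConnector is presumably
    # the one from the third-party aiohttp_socks package):
    # connector = SocksConnector.from_url('socks5://localhost:1080', rdns=True)
    # async with aiohttp.ClientSession(connector=connector) as sess:
    # HTTPS, option 2: pass the proxy on each request. aiohttp does not read a
    # requests-style `session.proxies` dict, and its `proxy=` argument does not
    # accept socks5h:// URLs, so an HTTP proxy address is used here.
    async def fetch_text(url):
        headers = {'content-type': 'image/gif',
                   'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
                   }
        cookies = {'cookies_are': 'working'}
        proxy = "http://127.0.0.1:1080"
        async with aiohttp.ClientSession() as session:
            async with async_timeout.timeout(10):  # cap the request at 10 seconds
                # async with sess.get(url, proxy="http://54.222.232.0:3128") as res:
                # ssl=False skips certificate checks (replaces the deprecated verify_ssl=False)
                async with session.get(url, headers=headers, cookies=cookies, proxy=proxy, ssl=False) as res:
                    text = await res.text()
                    print(text)


    asyncio.run(fetch_status())
    asyncio.run(fetch_text("https://targetwebsite.com"))  # placeholder URL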
  • Original article: https://www.cnblogs.com/du-jun/p/10710833.html