  • Simple crawler: scraping free proxy IPs

    Environment: Python 3.6

    Main modules: requests, PyQuery

    The code is fairly simple, so I won't explain it in detail.

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    import requests
    from pyquery import PyQuery as pq
    
    
    class GetProxy(object):
        def __init__(self):
            # Site that lists free proxy IPs
            self.url = 'http://www.xicidaili.com/nn/'
            self.header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
            self.file = r'F:pythoncode2get_proxyproxies.txt'
            # Page used to check whether a proxy IP actually works
            self.check_url = 'https://www.python.org/'
            self.title = 'Welcome to Python.org'
    
    
        def get_page(self):
            response = requests.get(self.url, headers=self.header)
            # print(response.status_code)
            return response.text
    
        def page_parse(self, response):
            stores = []
            result = pq(response)('#ip_list')
            for p in result('tr').items():
                if p('tr > td').attr('class') == 'country':
                    ip = p('td:eq(1)').text()
                    port = p('td:eq(2)').text()
                    protocol = p('td:eq(5)').text().lower()
                    # if protocol == 'socks4/5':
                    #     protocol = 'socks5'
                    proxy = '{}://{}:{}'.format(protocol, ip, port)
                    stores.append(proxy)
            return stores
    
        def start(self):
            response = self.get_page()
            proxies = self.page_parse(response)
            print(len(proxies))
            file = open(self.file, 'w')
            i = 0
            for proxy in proxies:
                try:
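                    # check_url is HTTPS, so the proxy must also be mapped to 'https';
                    # with only an 'http' entry the request would bypass the proxy
                    check = requests.get(self.check_url, headers=self.header, proxies={'http': proxy, 'https': proxy}, timeout=5)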
                    check_char = pq(check.text)('head > title').text()
                    if check_char == self.title:
                        print('%s is useful'%proxy)
                        file.write(proxy + '\n')
                        i += 1
                except Exception as e:
                    continue
            file.close()
            print('Get %s proxies'%i)
    
    
    if __name__ == '__main__':
        get = GetProxy()
        get.start()
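
Once the script has written its results, the saved file can be read back and used for later requests. The following is only a minimal sketch, assuming the file holds one proxy URL per line as written by `start()`. The helper names `load_proxies` and `as_requests_proxies` are hypothetical and not part of the original code. requests chooses a proxy by the scheme of the target URL, so the dictionary maps both `http` and `https` to the same proxy.

```python
from pathlib import Path


def load_proxies(path):
    """Return the non-empty lines of the proxy file, one proxy URL per line."""
    text = Path(path).read_text(encoding='utf-8')
    return [line.strip() for line in text.splitlines() if line.strip()]


def as_requests_proxies(proxy):
    """Build a requests-style proxies dict covering both URL schemes."""
    return {'http': proxy, 'https': proxy}
```

A later request could then be sent with something like `requests.get(url, proxies=as_requests_proxies(proxy), timeout=5)`, where `proxy` is one entry from `load_proxies(...)`.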
  • Original article: https://www.cnblogs.com/thunderLL/p/6569067.html