  • Crawler modules: dealing with blocking I/O

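    1. The asyncio module

     asyncio: detects blocking I/O for us and switches to another task while one is waiting (network I/O only).

     @asyncio.coroutine: decorator that marks a generator function as a coroutine (deprecated since Python 3.8 and removed in 3.11, where async def takes its place)

     tasks: the list of tasks (coroutine objects) to run

     get_event_loop: gets the event loop that will run the tasks

     run_until_complete: submits the work to the loop and blocks until it has finished

     asyncio.gather(*tasks): bundles the tasks so that they run concurrently

     close: closes the event loop

     open_connection: opens a TCP connection and returns a (reader, writer) pair

     yield from: waits for a result; if the operation blocks, the loop switches to another task (await plays this role in async def coroutines)

     sleep: simulates blocking network I/O

     write: puts the data into the send buffer

     send.drain: sends the buffered data, waiting while the send buffer is too full

     read: receives data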
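
    The terms above carry over to the async def / await syntax that replaced @asyncio.coroutine and yield from. A minimal sketch, assuming Python 3.7 or newer; the `finished` list and the shortened delays are additions made only for illustration:

```python
import asyncio

finished = []  # illustration only: records the order in which the tasks complete


async def task(task_id, seconds):
    print('%s is running' % task_id)
    # await hands control back to the event loop, which runs the other tasks
    await asyncio.sleep(seconds)
    print('%s is done' % task_id)
    finished.append(task_id)


async def main():
    # gather schedules the three coroutines concurrently and waits for all of them
    await asyncio.gather(task(1, 0.3), task(2, 0.2), task(3, 0.1))


# asyncio.run creates the event loop, runs main() to completion and closes the loop
asyncio.run(main())
```

    Because the three sleeps overlap, the whole run takes about as long as the longest delay rather than the sum of all three, and the task with the shortest delay finishes first.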

    # import asyncio
    #
    # @asyncio.coroutine
    # def task(task_id,seconds):
    #     print('%s is running' %task_id)
    #     yield from asyncio.sleep(seconds)
    #     print('%s is done' %task_id)
    #
    #
    # tasks=[
    #     task(1,3),
    #     task(2,2),
    #     task(3,1)
    # ]
    #
    # loop=asyncio.get_event_loop()
    # loop.run_until_complete(asyncio.gather(*tasks))
    # loop.close()
    
    
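    #1. TCP: open the connection (blocking I/O)
    #2. HTTP: build the request from the URL, request method, request headers and request body
    #3. Send the request (I/O)
    #4. Receive the response (I/O)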
    import asyncio
    
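    # Generator-based coroutine: needs Python 3.10 or older
    # (@asyncio.coroutine was removed in 3.11; use async def / await there).
    @asyncio.coroutine
    def get_page(host,port=80,url='/'): #http://  www.baidu.com:80  /
        print('GET:%s' %host)
        recv,send=yield from asyncio.open_connection(host=host,port=port)
    
        # HTTP lines end with CRLF. 'Connection: close' asks the server to close
        # the socket after responding, so that recv.read() reaches EOF and returns.
        http_pk='GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' %(url,host)
        send.write(http_pk.encode('utf-8'))
    
        yield from send.drain()
    
        text=yield from recv.read()
        send.close()
    
        print('host:%s size:%s' %(host,len(text)))
    
        # parse the response here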
    
    
    
    #http://www.cnblogs.com/linhaifeng/articles/7806303.html
    #https://wiki.python.org/moin/BeginnersGuide
    #https://www.baidu.com/
    
    tasks=[
        get_page('www.cnblogs.com',url='/linhaifeng/articles/7806303.html'),
        get_page('wiki.python.org',url='/moin/BeginnersGuide'),
        get_page('www.baidu.com',),
    ]
    
    loop=asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*tasks))
    loop.close()
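
    The same four steps (connect, build the request, send, receive) can be tried without touching the public internet. A hedged sketch, assuming Python 3.7 or newer: `serve_once` and `BODY` are hypothetical stand-ins for a real web server, added only so the example is self-contained, and get_page is rewritten with async def / await and returns the raw bytes instead of printing them:

```python
import asyncio

BODY = b'hello from the local server'  # made-up payload for the demo


async def serve_once(reader, writer):
    # Stand-in server: read the request head, send a fixed response, close.
    await reader.readuntil(b'\r\n\r\n')
    head = 'HTTP/1.1 200 OK\r\nContent-Length: %d\r\nConnection: close\r\n\r\n' % len(BODY)
    writer.write(head.encode('ascii') + BODY)
    await writer.drain()
    writer.close()


async def get_page(host, port=80, url='/'):
    # 1. TCP: open the connection
    reader, writer = await asyncio.open_connection(host=host, port=port)
    # 2. HTTP: build the request
    request = 'GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' % (url, host)
    # 3. Send the request
    writer.write(request.encode('utf-8'))
    await writer.drain()
    # 4. Receive the response: read() without a size returns everything up to EOF
    raw = await reader.read()
    writer.close()
    await writer.wait_closed()
    return raw


async def main():
    # Port 0 lets the operating system pick a free port
    server = await asyncio.start_server(serve_once, '127.0.0.1', 0)
    port = server.sockets[0].getsockname()[1]
    try:
        return await asyncio.gather(
            get_page('127.0.0.1', port, '/a'),
            get_page('127.0.0.1', port, '/b'),
        )
    finally:
        server.close()


pages = asyncio.run(main())
for raw in pages:
    print('size:%s' % len(raw))
```

    Pointing get_page at a real host and port 80 instead of the local server gives the behaviour of the original example.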
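
    The parsing step left as a placeholder at the end of get_page can start by splitting the raw bytes into a status code, headers and a body. A minimal sketch using only the standard library; `parse_response` and the sample bytes are hypothetical, and chunked transfer encoding and compressed bodies are not handled:

```python
def parse_response(raw):
    """Split a raw HTTP/1.x response into (status_code, headers, body)."""
    # The head and the body are separated by the first blank line (CRLF CRLF)
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    # The status line looks like: HTTP/1.1 200 OK
    status_code = int(lines[0].split(' ', 2)[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(':')
        # Header names are case-insensitive, so store them in lower case
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


sample = b'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: 5\r\n\r\nhello'
status_code, headers, body = parse_response(sample)
print(status_code, headers['content-type'], body)
```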

    2. The aiohttp module

     aiohttp.request: sends a request

    import asyncio
    import aiohttp #pip3 install aiohttp
    
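    # aiohttp 3.x only supports async def / await (no @asyncio.coroutine / yield from)
    async def get_page(url): # a full URL, e.g. https://www.baidu.com/
        print('GET:%s' %url)
        # aiohttp.request is an async context manager: the connection is released on exit
        async with aiohttp.request('GET',url) as response:
            data=await response.read()
    
        print('url:%s size:%s' %(url,len(data)))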
    
    
    #http://www.cnblogs.com/linhaifeng/articles/7806303.html
    #https://wiki.python.org/moin/BeginnersGuide
    #https://www.baidu.com/
    
    tasks=[
        get_page('http://www.cnblogs.com/linhaifeng/articles/7806303.html'),
        get_page('https://wiki.python.org/moin/BeginnersGuide'),
        get_page('https://www.baidu.com/',),
    ]
    
    loop=asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*tasks))
    loop.close()

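    3. The twisted module

     twisted: an asynchronous I/O framework

     getPage: sends a request and returns a Deferred (deprecated since Twisted 16.7.0)

     internet.reactor: the event loop

     addCallback: attaches a callback that receives the result

     defer.DeferredList: groups several Deferreds into one that fires once all of them have fired

     reactor.run: starts the loop that runs the tasks

     addBoth: attaches a function that runs on success or on failure; on the DeferredList it runs after every task has finished and receives a list of (success, result) pairs, where each result is what that task's callback returned

     reactor.stop: stops the loop and ends the program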

    # Installation problems on Windows
    # Problem 1: error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
    #   Download a prebuilt wheel from https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted and install it:
    #   pip3 install C:\Users\Administrator\Downloads\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
    #   pip3 install twisted
    #
    # Problem 2: ModuleNotFoundError: No module named 'win32api'
    #   Install pywin32 from https://sourceforge.net/projects/pywin32/files/pywin32/
    #
    # Problem 3: openssl
    #   pip3 install pyopenssl
    
    # Basic twisted usage (getPage is deprecated since Twisted 16.7.0)
    from twisted.internet import defer,reactor
    from twisted.web.client import getPage
    
    def all_done(arg):
        # print(arg)
        reactor.stop()
    
    def callback(res):
        print(res)
        return 1
    
    defer_list=[]
    urls=[
        'http://www.baidu.com',
        'http://www.bing.com',
        'https://www.python.org',
    ]
    for url in urls:
        obj=getPage(url.encode('utf-8'))
        obj.addCallback(callback)
        defer_list.append(obj)
    
    defer.DeferredList(defer_list).addBoth(all_done)
    
    reactor.run()
    
    
    
    
    # Detailed getPage usage (a separate script: the reactor cannot be restarted after reactor.stop())
    from twisted.internet import reactor
    from twisted.web.client import getPage
    import urllib.parse
    
    
    def one_done(arg):
        print(arg)
        reactor.stop()
    
    post_data = urllib.parse.urlencode({'check_data': 'adf'})
    post_data = bytes(post_data, encoding='utf8')
    headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
    response = getPage(bytes('http://dig.chouti.com/login', encoding='utf8'),
                       method=bytes('POST', encoding='utf8'),
                       postdata=post_data,
                       cookies={},
                       headers=headers)
    response.addBoth(one_done)
    
    reactor.run()

    4. The tornado module

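    from tornado.httpclient import AsyncHTTPClient
    from tornado.httpclient import HTTPRequest
    from tornado import ioloop
    
    pending = 0  # number of requests still in flight
    
    
    def handle_response(future):
        """
        Handle one finished request. A counter tracks the outstanding requests;
        when it reaches zero, ioloop.IOLoop.current().stop() ends the I/O loop.
        :param future: the Future returned by fetch()
        :return: 
        """
        global pending
        try:
            print(future.result().body)
        except Exception as error:
            print("Error:", error)
        pending -= 1
        if pending == 0:
            ioloop.IOLoop.current().stop()
    
    
    def func():
        global pending
        url_list = [
            'http://www.baidu.com',
            'http://www.bing.com',
        ]
        http_client = AsyncHTTPClient()
        for url in url_list:
            print(url)
            pending += 1
            # fetch() returns a Future (its callback argument was removed in Tornado 6.0)
            future = http_client.fetch(HTTPRequest(url))
            ioloop.IOLoop.current().add_future(future, handle_response)
    
    
    ioloop.IOLoop.current().add_callback(func)
    ioloop.IOLoop.current().start()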
  • Original article: https://www.cnblogs.com/fangjie0410/p/8277390.html