  • Learning asyncio

    Source: https://www.syncd.cn/article/asyncio_article_02


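    I. asyncio: First Steps

    After the overview of asyncio in the previous article you should have a basic picture of it, and if you were curious you have probably already tried writing some small programs with it. This article works through a simple crawler and improves it step by step, from simple to more complex, until it does what we want.

    https://github.com/HackerNews/API documents the Hacker News (HN) API. The crawler in this article calls that API and sends its requests with the aiohttp module. Do not use the requests module here: requests is blocking, and a blocking call inside a coroutine stalls the whole event loop. The aiohttp documentation is at https://aiohttp.readthedocs.io/en/stable/

    Here is the concrete implementation. The code counts all comments under one post; if no id is passed, it defaults to the post with id 8863.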
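
    Why a blocking library does not fit can be shown without any network access. The following is a minimal sketch, not part of the original crawler: time.sleep stands in for a blocking call such as requests.get, asyncio.sleep stands in for an awaitable aiohttp request, and the names blocking_job, cooperative_job, timed and compare are made up for this illustration.

```python
import asyncio
import time


async def blocking_job(delay):
    # time.sleep stands in for a blocking call such as requests.get:
    # it never yields control back to the event loop.
    time.sleep(delay)


async def cooperative_job(delay):
    # asyncio.sleep stands in for an awaitable call such as an aiohttp request.
    await asyncio.sleep(delay)


async def timed(job, count=3, delay=0.2):
    """Run `count` copies of `job` concurrently and return the elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(job(delay) for _ in range(count)))
    return time.perf_counter() - start


async def compare():
    blocking = await timed(blocking_job)
    cooperative = await timed(cooperative_job)
    return blocking, cooperative


if __name__ == "__main__":
    blocking, cooperative = asyncio.run(compare())
    print("blocking: {:.2f}s, cooperative: {:.2f}s".format(blocking, cooperative))
```

    The three blocking jobs run one after another, so their delays add up, while the three cooperative jobs overlap and finish in roughly the time of one.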

    import asyncio
    import argparse
    import logging
    from urllib.parse import urlparse, parse_qs
    from datetime import datetime

    import aiohttp
    import async_timeout

    LOGGER_FORMAT = '%(asctime)s %(message)s'
    URL_TEMPLATE = "https://hacker-news.firebaseio.com/v0/item/{}.json"
    FETCH_TIMEOUT = 10

    parser = argparse.ArgumentParser(
        description='Count all comments of the requested post')
    parser.add_argument('--id', type=int, default=8863,
                        help='id of the post, default is 8863')
    parser.add_argument('--url', type=str, help='URL of the HN post')
    parser.add_argument('--verbose', action='store_true', help='detailed output')

    logging.basicConfig(format=LOGGER_FORMAT, datefmt='[%H:%M:%S]')
    log = logging.getLogger()
    log.setLevel(logging.INFO)

    fetch_counter = 0


    async def fetch(session, url):
        """
        Request the url with aiohttp and return the decoded JSON
        """
        global fetch_counter
        # newer async_timeout releases expect "async with" (plain "with" is deprecated)
        async with async_timeout.timeout(FETCH_TIMEOUT):
            fetch_counter += 1
            # the API is not directly reachable from my network, so I go through my local proxy
            async with session.get(url, proxy='http://127.0.0.1:1081') as response:
                return await response.json()


    async def post_number_of_comments(loop, session, post_id):
        """
        Recursively count all comments of the requested post
        """
        url = URL_TEMPLATE.format(post_id)
        now = datetime.now()
        response = await fetch(session, url)
        log.debug('{:^6} > Fetching of {} took {} seconds'.format(
            post_id, url, (datetime.now() - now).total_seconds()))

        if 'kids' not in response:  # no comments
            return 0

        # number of direct replies to the current item
        number_of_comments = len(response['kids'])
        log.debug('{:^6} > Fetching {} child posts'.format(
            post_id, number_of_comments))

        tasks = [post_number_of_comments(
            loop, session, kid_id) for kid_id in response['kids']]
        # wait for all coroutines and collect their results
        results = await asyncio.gather(*tasks)
        number_of_comments += sum(results)
        log.debug('{:^6} > {} comments'.format(post_id, number_of_comments))
        return number_of_comments


    def id_from_HN_url(url):
        """
        Extract the post id from the url passed on the command line
        """
        parse_result = urlparse(url)
        try:
            return parse_qs(parse_result.query)['id'][0]
        except (KeyError, IndexError):
            return None


    async def main(loop, post_id):
        # created inside a coroutine, the session binds to the running loop
        async with aiohttp.ClientSession() as session:
            now = datetime.now()
            comments = await post_number_of_comments(loop, session, post_id)
            log.info(
                '> Calculating comments took {:.2f} seconds and {} fetches'.format(
                    (datetime.now() - now).total_seconds(), fetch_counter))
        return comments


    if __name__ == '__main__':
        args = parser.parse_args()
        if args.verbose:
            log.setLevel(logging.DEBUG)

        post_id = id_from_HN_url(args.url) if args.url else args.id

        # get_event_loop() is deprecated outside a running loop on recent Pythons
        loop = asyncio.new_event_loop()
        comments = loop.run_until_complete(main(loop, post_id))
        log.info("-- Post {} has {} comments".format(post_id, comments))
        loop.close()

    As a reminder, this URL is not directly reachable from my network, which is why I added my local proxy in order to crawl it. A normal run produces:

    [23:24:37] > Calculating comments took 2.98 seconds and 73 fetches
    [23:24:37] -- Post 8863 has 72 comments

    Without the proxy, the request cannot connect and ends in the following timeout error:

    Traceback (most recent call last):
      File "/Users/zhaofan/vs_python/python_asyncio/ex1.py", line 41, in fetch
        async with session.get(url) as response:
      File "/usr/local/lib/python3.7/site-packages/aiohttp/client.py", line 1005, in __aenter__
        self._resp = await self._coro
      File "/usr/local/lib/python3.7/site-packages/aiohttp/client.py", line 476, in _request
        timeout=real_timeout
      File "/usr/local/lib/python3.7/site-packages/aiohttp/connector.py", line 522, in connect
        proto = await self._create_connection(req, traces, timeout)
      File "/usr/local/lib/python3.7/site-packages/aiohttp/connector.py", line 854, in _create_connection
        req, traces, timeout)
      File "/usr/local/lib/python3.7/site-packages/aiohttp/connector.py", line 974, in _create_direct_connection
        req=req, client_error=client_error)
      File "/usr/local/lib/python3.7/site-packages/aiohttp/connector.py", line 924, in _wrap_create_connection
        await self._loop.create_connection(*args, **kwargs))
      File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py", line 946, in create_connection
        await self.sock_connect(sock, address)
      File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/selector_events.py", line 464, in sock_connect
        return await fut
    concurrent.futures._base.CancelledError

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/Users/zhaofan/vs_python/python_asyncio/ex1.py", line 115, in <module>
        comments = loop.run_until_complete(main(loop, post_id))
      File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
        return future.result()
      File "/Users/zhaofan/vs_python/python_asyncio/ex1.py", line 99, in main
        comments = await post_number_of_comments(loop, session, post_id)
      File "/Users/zhaofan/vs_python/python_asyncio/ex1.py", line 51, in post_number_of_comments
        response = await fetch(session, url)
      File "/Users/zhaofan/vs_python/python_asyncio/ex1.py", line 42, in fetch
        return await response.json()
      File "/usr/local/lib/python3.7/site-packages/async_timeout/__init__.py", line 45, in __exit__
        self._do_exit(exc_type)
      File "/usr/local/lib/python3.7/site-packages/async_timeout/__init__.py", line 92, in _do_exit
        raise asyncio.TimeoutError
    concurrent.futures._base.TimeoutError
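
    The traceback shows the usual shape of an asyncio timeout: the pending operation is cancelled first (CancelledError), and the timeout wrapper then raises TimeoutError. The sketch below reproduces that behaviour with the standard library's asyncio.wait_for instead of async_timeout, so it is a swapped-in technique rather than the crawler's own code. slow_fetch, fetch_or_none and DEMO_TIMEOUT are hypothetical names, and returning None on timeout is just one possible policy.

```python
import asyncio

DEMO_TIMEOUT = 0.1  # seconds; far shorter than the crawler's 10 s, for the demo


async def slow_fetch():
    # Hypothetical stand-in for a request that does not answer in time.
    await asyncio.sleep(1)
    return {"kids": []}


async def fetch_or_none():
    try:
        # wait_for cancels slow_fetch() once the timeout expires ...
        return await asyncio.wait_for(slow_fetch(), timeout=DEMO_TIMEOUT)
    except asyncio.TimeoutError:
        # ... and raises TimeoutError, which the caller can handle.
        return None


if __name__ == "__main__":
    print(asyncio.run(fetch_or_none()))
```

    Catching the exception at this level keeps one unreachable item from aborting the whole crawl; whether that is desirable depends on how complete the count needs to be.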

    The code above also uses results = await asyncio.gather(*tasks)
    to wait for all the coroutines to finish and collect their results. The official documentation for gather is at https://docs.python.org/3/library/asyncio-task.html#asyncio.gather

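    We also used recursion above. You may feel that this is fairly simple and that the code does not look very different from the blocking code we usually write. Hold on to that good feeling and read on.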
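
    The combination of recursion and gather can be tried without touching the network. This is a minimal sketch under stated assumptions: FAKE_ITEMS is a made-up in-memory stand-in for the HN item endpoint, and fetch_item and count_comments are illustrative names, not functions from the crawler.

```python
import asyncio

# Hypothetical in-memory stand-in for the HN item endpoint.
FAKE_ITEMS = {
    1: {"kids": [2, 3]},
    2: {"kids": [4]},
    3: {},
    4: {},
}


async def fetch_item(item_id):
    await asyncio.sleep(0.01)  # simulate network latency
    return FAKE_ITEMS[item_id]


async def count_comments(item_id):
    item = await fetch_item(item_id)
    kids = item.get("kids", [])
    # One coroutine per child; gather runs them concurrently and
    # returns their results in the same order as its arguments.
    child_counts = await asyncio.gather(*(count_comments(k) for k in kids))
    return len(kids) + sum(child_counts)


if __name__ == "__main__":
    print(asyncio.run(count_comments(1)))
```

    Item 1 has two direct replies and one nested reply, so the count for it is 3. Each level of the tree is fetched concurrently, which is why the real crawler needs only a few seconds for 73 requests.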

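    II. asyncio: Going a Step Further

    What we want now is for the crawler to send us an email notification whenever a post's comment count exceeds a threshold. This is a genuinely useful feature. Take my personal blog: suppose you want to keep up with my articles but don't want to visit the site often, and only care about the articles everyone is following or the ones with many comments. So let's change the code: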

    import asyncio
    import argparse
    import logging
    import random
    from urllib.parse import urlparse, parse_qs
    from datetime import datetime

    import aiohttp
    import async_timeout

    LOGGER_FORMAT = '%(asctime)s %(message)s'
    URL_TEMPLATE = "https://hacker-news.firebaseio.com/v0/item/{}.json"
    FETCH_TIMEOUT = 10
    # comment threshold above which we send an email
    MIN_COMMENTS = 2

    parser = argparse.ArgumentParser(
        description='Count all comments of the requested post')
    parser.add_argument('--id', type=int, default=8863,
                        help='id of the post, default is 8863')
    parser.add_argument('--url', type=str, help='URL of the HN post')
    parser.add_argument('--verbose', action='store_true', help='detailed output')

    logging.basicConfig(format=LOGGER_FORMAT, datefmt='[%H:%M:%S]')
    log = logging.getLogger()
    log.setLevel(logging.INFO)

    fetch_counter = 0


    async def fetch(session, url):
        """
        Request the url with aiohttp and return the decoded JSON
        """
        global fetch_counter
        # newer async_timeout releases expect "async with" (plain "with" is deprecated)
        async with async_timeout.timeout(FETCH_TIMEOUT):
            fetch_counter += 1
            # the API is not directly reachable from my network, so I go through my local proxy
            async with session.get(url, proxy='http://127.0.0.1:1081') as response:
                return await response.json()


    async def post_number_of_comments(loop, session, post_id):
        """
        Recursively count all comments of the requested post
        """
        url = URL_TEMPLATE.format(post_id)
        now = datetime.now()
        response = await fetch(session, url)
        log.debug('{:^6} > Fetching of {} took {} seconds'.format(
            post_id, url, (datetime.now() - now).total_seconds()))

        if 'kids' not in response:  # no comments
            return 0

        # number of direct replies to the current item
        number_of_comments = len(response['kids'])
        log.debug('{:^6} > Fetching {} child posts'.format(
            post_id, number_of_comments))

        tasks = [post_number_of_comments(
            loop, session, kid_id) for kid_id in response['kids']]
        # wait for all tasks and collect their results
        results = await asyncio.gather(*tasks)
        number_of_comments += sum(results)
        log.debug('{:^6} > {} comments'.format(post_id, number_of_comments))

        if number_of_comments > MIN_COMMENTS:
            await email_post(response)
        return number_of_comments


    async def email_post(post):
        """
        Simulate sending an email; no email is actually sent
        """
        await asyncio.sleep(random.random() * 3)
        log.info("email send success")


    def id_from_HN_url(url):
        """
        Extract the post id from the url passed on the command line
        """
        parse_result = urlparse(url)
        try:
            return parse_qs(parse_result.query)['id'][0]
        except (KeyError, IndexError):
            return None


    async def main(loop, post_id):
        # created inside a coroutine, the session binds to the running loop
        async with aiohttp.ClientSession() as session:
            now = datetime.now()
            comments = await post_number_of_comments(loop, session, post_id)
            log.info(
                '> Calculating comments took {:.2f} seconds and {} fetches'.format(
                    (datetime.now() - now).total_seconds(), fetch_counter))
        return comments


    if __name__ == '__main__':
        args = parser.parse_args()
        if args.verbose:
            log.setLevel(logging.DEBUG)

        post_id = id_from_HN_url(args.url) if args.url else args.id

        # get_event_loop() is deprecated outside a running loop on recent Pythons
        loop = asyncio.new_event_loop()
        comments = loop.run_until_complete(main(loop, post_id))
        log.info("-- Post {} has {} comments".format(post_id, comments))
        loop.close()

    The output is:

    [23:24:17] email send success
    [23:24:18] email send success
    [23:24:18] email send success
    [23:24:19] email send success
    [23:24:19] email send success
    [23:24:20] email send success
    [23:24:21] email send success
    [23:24:21] email send success
    [23:24:24] email send success
    [23:24:24] > Calculating comments took 10.09 seconds and 73 fetches
    [23:24:24] -- Post 8863 has 72 comments

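    You will notice that this run takes longer than before. The email is sent with await email_post(response), so the program waits at that point until the task has finished. What we really care about is the main task, collecting all the comment counts; sending the notification email is a secondary task. So how can we let the main task carry on without waiting for the secondary one? The asyncio API offers ensure_future. Note that this was the function to use before Python 3.7, while from 3.7 on create_task is the recommended one; see https://docs.python.org/3/library/asyncio-task.html#asyncio.create_task
    The documentation states: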

    asyncio.create_task(coro)
    Wrap the coro coroutine into a Task and schedule its execution. Return the Task object.
    The task is executed in the loop returned by get_running_loop(), RuntimeError is raised if there is no running loop in current thread.
    This function has been added in Python 3.7. Prior to Python 3.7, the low-level asyncio.ensure_future() function can be used instead:

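    This function schedules a coroutine to run, wraps it in a Task object, and returns that Task. So we change the code again,
    replacing the line await email_post(response) with: asyncio.ensure_future(email_post(response))

    But when we run the modified crawler, something unfortunate happens: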
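
    The difference between awaiting a coroutine and merely scheduling it can be measured in isolation. The following is a minimal sketch assuming Python 3.7 or later (it uses asyncio.create_task); send_email, main_task_awaiting and main_task_scheduling are hypothetical names, and send_email only sleeps, like the article's simulated email_post.

```python
import asyncio
import time


async def send_email():
    # Stand-in for the article's simulated email_post.
    await asyncio.sleep(0.2)


async def main_task_awaiting():
    start = time.perf_counter()
    await send_email()  # the main task is held up here
    return time.perf_counter() - start


async def main_task_scheduling():
    start = time.perf_counter()
    task = asyncio.create_task(send_email())  # scheduled, not awaited
    elapsed = time.perf_counter() - start
    await task  # awaited here only so that this demo exits cleanly
    return elapsed


if __name__ == "__main__":
    print("awaiting:   {:.2f}s".format(asyncio.run(main_task_awaiting())))
    print("scheduling: {:.2f}s".format(asyncio.run(main_task_scheduling())))
```

    Scheduling returns almost immediately because the task only starts running once the current coroutine yields to the event loop. The final await task in the sketch matters: without something waiting for the task, the program can end while it is still pending.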

    [23:40:06] email send success
    [23:40:06] > Calculating comments took 3.30 seconds and 73 fetches
    [23:40:06] -- Post 8863 has 72 comments
    [23:40:06] Task was destroyed but it is pending!
    task: <Task pending coro=<email_post() done, defined at ex1.py:76> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x1087dde58>()]>>
    [23:40:06] Task was destroyed but it is pending!
    task: <Task pending coro=<email_post() done, defined at ex1.py:76> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x108a9e4f8>()]>>
    [23:40:06] Task was destroyed but it is pending!
    task: <Task pending coro=<email_post() done, defined at ex1.py:76> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x108a9e9a8>()]>>
    [23:40:06] Task was destroyed but it is pending!
    task: <Task pending coro=<email_post() done, defined at ex1.py:76> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x108a9e918>()]>>
    [23:40:06] Task was destroyed but it is pending!
    task: <Task pending coro=<email_post() done, defined at ex1.py:76> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x108a9ee88>()]>>
    [23:40:06] Task was destroyed but it is pending!
    task: <Task pending coro=<email_post() done, defined at ex1.py:76> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x108a9ef48>()]>>
    [23:40:06] Task was destroyed but it is pending!
    task: <Task pending coro=<email_post() done, defined at ex1.py:76> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x108a9efd8>()]>>
    [23:40:06] Task was destroyed but it is pending!
    task: <Task pending coro=<email_post() done, defined at ex1.py:76> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x1087dde28>()]>>

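    Don't panic when you see this error. Many people run into it when they first learn or start using asyncio, and the previous asyncio article explained the cause. Here, the loop is closed as soon as the main coroutine (and with it post_number_of_comments) returns, so the scheduled email_post tasks get no time to finish. How do we fix it? We let the pending tasks finish before closing the loop, so we change the code again: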
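
    Before the full listing, here is the idea in isolation, as a minimal sketch rather than the crawler itself. It assumes Python 3.7 or later for asyncio.all_tasks(loop); background, main and run_and_drain are made-up names for this illustration.

```python
import asyncio


async def background(n, finished):
    await asyncio.sleep(0.05)
    finished.append(n)


async def main(finished):
    for n in range(3):
        # Scheduled but never awaited, like the email tasks in the crawler.
        asyncio.ensure_future(background(n, finished))
    return "main done"


def run_and_drain():
    finished = []
    loop = asyncio.new_event_loop()
    try:
        result = loop.run_until_complete(main(finished))
        # Tasks that main() left behind are still pending at this point.
        pending = asyncio.all_tasks(loop)
        if pending:
            loop.run_until_complete(asyncio.gather(*pending))
    finally:
        loop.close()
    return result, sorted(finished)


if __name__ == "__main__":
    print(run_and_drain())
```

    Removing the second run_until_complete call brings back the "Task was destroyed but it is pending!" situation, because the loop is closed while the three tasks are still sleeping. Applied to the crawler, the same change looks like this: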

    import asyncio
    import argparse
    import logging
    import random
    from urllib.parse import urlparse, parse_qs
    from datetime import datetime

    import aiohttp
    import async_timeout

    LOGGER_FORMAT = '%(asctime)s %(message)s'
    URL_TEMPLATE = "https://hacker-news.firebaseio.com/v0/item/{}.json"
    FETCH_TIMEOUT = 10
    # comment threshold above which we send an email
    MIN_COMMENTS = 2

    parser = argparse.ArgumentParser(
        description='Count all comments of the requested post')
    parser.add_argument('--id', type=int, default=8863,
                        help='id of the post, default is 8863')
    parser.add_argument('--url', type=str, help='URL of the HN post')
    parser.add_argument('--verbose', action='store_true', help='detailed output')

    logging.basicConfig(format=LOGGER_FORMAT, datefmt='[%H:%M:%S]')
    log = logging.getLogger()
    log.setLevel(logging.INFO)

    fetch_counter = 0


    async def fetch(session, url):
        """
        Request the url with aiohttp and return the decoded JSON
        """
        global fetch_counter
        # newer async_timeout releases expect "async with" (plain "with" is deprecated)
        async with async_timeout.timeout(FETCH_TIMEOUT):
            fetch_counter += 1
            # the API is not directly reachable from my network, so I go through my local proxy
            async with session.get(url, proxy='http://127.0.0.1:1081') as response:
                return await response.json()


    async def post_number_of_comments(loop, session, post_id):
        """
        Recursively count all comments of the requested post
        """
        url = URL_TEMPLATE.format(post_id)
        now = datetime.now()
        response = await fetch(session, url)
        log.debug('{:^6} > Fetching of {} took {} seconds'.format(
            post_id, url, (datetime.now() - now).total_seconds()))

        if 'kids' not in response:  # no comments
            return 0

        # number of direct replies to the current item
        number_of_comments = len(response['kids'])
        log.debug('{:^6} > Fetching {} child posts'.format(
            post_id, number_of_comments))

        tasks = [post_number_of_comments(
            loop, session, kid_id) for kid_id in response['kids']]
        # wait for all tasks and collect their results
        results = await asyncio.gather(*tasks)
        number_of_comments += sum(results)
        log.debug('{:^6} > {} comments'.format(post_id, number_of_comments))

        if number_of_comments > MIN_COMMENTS:
            # await email_post(response)
            asyncio.ensure_future(email_post(response))
        return number_of_comments


    async def email_post(post):
        """
        Simulate sending an email; no email is actually sent
        """
        await asyncio.sleep(random.random() * 3)
        log.info("email send success")


    def id_from_HN_url(url):
        """
        Extract the post id from the url passed on the command line
        """
        parse_result = urlparse(url)
        try:
            return parse_qs(parse_result.query)['id'][0]
        except (KeyError, IndexError):
            return None


    async def main(loop, post_id):
        # created inside a coroutine, the session binds to the running loop
        async with aiohttp.ClientSession() as session:
            now = datetime.now()
            comments = await post_number_of_comments(loop, session, post_id)
            log.info(
                '> Calculating comments took {:.2f} seconds and {} fetches'.format(
                    (datetime.now() - now).total_seconds(), fetch_counter))
        return comments


    if __name__ == '__main__':
        args = parser.parse_args()
        if args.verbose:
            log.setLevel(logging.DEBUG)

        post_id = id_from_HN_url(args.url) if args.url else args.id

        # get_event_loop() is deprecated outside a running loop on recent Pythons
        loop = asyncio.new_event_loop()
        comments = loop.run_until_complete(main(loop, post_id))
        log.info("-- Post {} has {} comments".format(post_id, comments))

        # asyncio.Task.all_tasks() was removed in Python 3.9; use asyncio.all_tasks(loop)
        pending_tasks = [
            task for task in asyncio.all_tasks(loop) if not task.done()
        ]
        if pending_tasks:
            loop.run_until_complete(asyncio.gather(*pending_tasks))
        loop.close()

    After running it, the output is:

    [23:47:24] email send success
    [23:47:25] email send success
    [23:47:25] > Calculating comments took 3.29 seconds and 73 fetches
    [23:47:25] -- Post 8863 has 72 comments
    [23:47:25] email send success
    [23:47:25] email send success
    [23:47:25] email send success
    [23:47:26] email send success
    [23:47:26] email send success
    [23:47:27] email send success
    [23:47:27] email send success

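    Everything seems to be back to normal. Here we used asyncio's all_tasks function. It has been available as asyncio.all_tasks(loop) since Python 3.7; the older spelling
    asyncio.Task.all_tasks()
    was deprecated in 3.7 and removed in Python 3.9.

    This function is very useful: it returns the tasks of our loop, and we use task.done() to decide whether a task has finished so that the unfinished ones can run to completion. The drawback is that it hands us every task, including ones we don't care about, when we only want to control our own. So we can collect the secondary email tasks in a dedicated list, which makes them easy to handle later. The code becomes: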
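
    Why a dedicated list can be preferable to all_tasks shows up as soon as an unrelated long-lived task exists. The sketch below is illustrative only: heartbeat is a hypothetical background task that never finishes on its own, and email_post, main and run are simplified stand-ins rather than the crawler's functions.

```python
import asyncio


async def email_post(post_id, sent):
    await asyncio.sleep(0.05)
    sent.append(post_id)


async def heartbeat():
    # An unrelated task that never finishes on its own.
    while True:
        await asyncio.sleep(0.01)


async def main(registry, sent):
    other = asyncio.ensure_future(heartbeat())
    for post_id in (1, 2, 3):
        registry.append(asyncio.ensure_future(email_post(post_id, sent)))
    return other


def run():
    registry, sent = [], []  # registry holds only the tasks we want to wait for
    loop = asyncio.new_event_loop()
    try:
        other = loop.run_until_complete(main(registry, sent))
        # Gathering asyncio.all_tasks(loop) here would also wait on
        # heartbeat(), which never completes; the registry avoids that.
        loop.run_until_complete(asyncio.gather(*registry))
        other.cancel()
        # Let the cancellation be processed before the loop is closed.
        loop.run_until_complete(asyncio.gather(other, return_exceptions=True))
    finally:
        loop.close()
    return sorted(sent)


if __name__ == "__main__":
    print(run())
```

    The registry makes the shutdown policy explicit: email tasks are waited for, everything else is cancelled. In the crawler, the same idea looks like this: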

    import asyncio
    import argparse
    import logging
    import random
    from urllib.parse import urlparse, parse_qs
    from datetime import datetime

    import aiohttp
    import async_timeout

    LOGGER_FORMAT = '%(asctime)s %(message)s'
    URL_TEMPLATE = "https://hacker-news.firebaseio.com/v0/item/{}.json"
    FETCH_TIMEOUT = 10
    # comment threshold above which we send an email
    MIN_COMMENTS = 2

    parser = argparse.ArgumentParser(
        description='Count all comments of the requested post')
    parser.add_argument('--id', type=int, default=8863,
                        help='id of the post, default is 8863')
    parser.add_argument('--url', type=str, help='URL of the HN post')
    parser.add_argument('--verbose', action='store_true', help='detailed output')

    logging.basicConfig(format=LOGGER_FORMAT, datefmt='[%H:%M:%S]')
    log = logging.getLogger()
    log.setLevel(logging.INFO)

    fetch_counter = 0
    task_registry = []  # holds the secondary tasks that send emails


    async def fetch(session, url):
        """
        Request the url with aiohttp and return the decoded JSON
        """
        global fetch_counter
        # newer async_timeout releases expect "async with" (plain "with" is deprecated)
        async with async_timeout.timeout(FETCH_TIMEOUT):
            fetch_counter += 1
            # the API is not directly reachable from my network, so I go through my local proxy
            async with session.get(url, proxy='http://127.0.0.1:1081') as response:
                return await response.json()


    async def post_number_of_comments(loop, session, post_id):
        """
        Recursively count all comments of the requested post
        """
        url = URL_TEMPLATE.format(post_id)
        now = datetime.now()
        response = await fetch(session, url)
        log.debug('{:^6} > Fetching of {} took {} seconds'.format(
            post_id, url, (datetime.now() - now).total_seconds()))

        if 'kids' not in response:  # no comments
            return 0

        # number of direct replies to the current item
        number_of_comments = len(response['kids'])
        log.debug('{:^6} > Fetching {} child posts'.format(
            post_id, number_of_comments))

        tasks = [post_number_of_comments(
            loop, session, kid_id) for kid_id in response['kids']]
        # wait for all tasks and collect their results
        results = await asyncio.gather(*tasks)
        number_of_comments += sum(results)
        log.debug('{:^6} > {} comments'.format(post_id, number_of_comments))

        if number_of_comments > MIN_COMMENTS:
            # await email_post(response)
            task_registry.append(asyncio.ensure_future(email_post(response)))
        return number_of_comments


    async def email_post(post):
        """
        Simulate sending an email; no email is actually sent
        """
        await asyncio.sleep(random.random() * 3)
        log.info("email send success")


    def id_from_HN_url(url):
        """
        Extract the post id from the url passed on the command line
        """
        parse_result = urlparse(url)
        try:
            return parse_qs(parse_result.query)['id'][0]
        except (KeyError, IndexError):
            return None


    async def main(loop, post_id):
        # created inside a coroutine, the session binds to the running loop
        async with aiohttp.ClientSession() as session:
            now = datetime.now()
            comments = await post_number_of_comments(loop, session, post_id)
            log.info(
                '> Calculating comments took {:.2f} seconds and {} fetches'.format(
                    (datetime.now() - now).total_seconds(), fetch_counter))
        return comments


    if __name__ == '__main__':
        args = parser.parse_args()
        if args.verbose:
            log.setLevel(logging.DEBUG)

        post_id = id_from_HN_url(args.url) if args.url else args.id

        # get_event_loop() is deprecated outside a running loop on recent Pythons
        loop = asyncio.new_event_loop()
        comments = loop.run_until_complete(main(loop, post_id))
        log.info("-- Post {} has {} comments".format(post_id, comments))

        pending_tasks = [
            task for task in task_registry if not task.done()
        ]
        if pending_tasks:
            loop.run_until_complete(asyncio.gather(*pending_tasks))
        loop.close()

    The output is:

    [23:54:10] > Calculating comments took 8.33 seconds and 73 fetches
    [23:54:10] -- Post 8863 has 72 comments
    [23:54:11] email send success
    [23:54:11] email send success
    [23:54:11] email send success
    [23:54:12] email send success
    [23:54:12] email send success
    [23:54:12] email send success
    [23:54:12] email send success
    [23:54:13] email send success
    [23:54:13] email send success

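    Having come this far, you may find that Python's asyncio is not that hard after all, and even rather pleasant to use. On to the last part.

    III. asyncio: The Final Showdown

    By improving the code step by step we have become more familiar with asyncio, but the example is still rather simple: so far we have only counted the comments under a single URL. How would we change the code to get the comments for several URLs? The HN API has an endpoint that returns up to 500 top stories, and we will count the comments for only the first few of them. The top stories may change every day, or even several times within a day, so our task should fetch the data, wait a while, and fetch it again; that way we always have fresh data. Let's improve the code again: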

    import asyncio
    import argparse
    import logging
    from datetime import datetime

    import aiohttp
    import async_timeout

    LOGGER_FORMAT = '%(asctime)s %(message)s'
    URL_TEMPLATE = "https://hacker-news.firebaseio.com/v0/item/{}.json"
    TOP_STORIES_URL = "https://hacker-news.firebaseio.com/v0/topstories.json"
    FETCH_TIMEOUT = 10

    parser = argparse.ArgumentParser(
        description='Count the comments of Hacker News top stories')
    parser.add_argument(
        '--period', type=int, default=5, help='seconds between two runs')
    parser.add_argument(
        '--limit', type=int, default=5,
        help='how many of the top 500 stories to process, default is the first 5')
    parser.add_argument('--verbose', action='store_true', help='more detailed output')

    logging.basicConfig(format=LOGGER_FORMAT, datefmt='[%H:%M:%S]')
    log = logging.getLogger()
    log.setLevel(logging.INFO)

    fetch_counter = 0


    async def fetch(session, url):
        """
        Request the url and return the decoded JSON
        """
        global fetch_counter
        # newer async_timeout releases expect "async with" (plain "with" is deprecated)
        async with async_timeout.timeout(FETCH_TIMEOUT):
            fetch_counter += 1
            async with session.get(url, proxy="http://127.0.0.1:1080") as response:
                return await response.json()


    async def post_number_of_comments(loop, session, post_id):
        """
        Fetch the data of the current post and recursively count all its comments
        """
        url = URL_TEMPLATE.format(post_id)
        now = datetime.now()
        response = await fetch(session, url)
        log.debug('{:^6} > Fetching of {} took {} seconds'.format(
            post_id, url, (datetime.now() - now).total_seconds()))

        if 'kids' not in response:  # no comments
            return 0

        # number of direct replies to the current post
        number_of_comments = len(response['kids'])
        log.debug('{:^6} > Fetching {} child posts'.format(
            post_id, number_of_comments))

        tasks = [post_number_of_comments(
            loop, session, kid_id) for kid_id in response['kids']]
        # recursively count the replies to every comment
        results = await asyncio.gather(*tasks)
        # total number of comments of the current post
        number_of_comments += sum(results)
        log.debug('{:^6} > {} comments'.format(post_id, number_of_comments))
        return number_of_comments


    async def get_comments_of_top_stories(loop, session, limit, iteration):
        """
        Fetch the top stories and count the comments of the first `limit` of them
        """
        response = await fetch(session, TOP_STORIES_URL)
        tasks = [post_number_of_comments(
            loop, session, post_id) for post_id in response[:limit]]
        results = await asyncio.gather(*tasks)
        for post_id, num_comments in zip(response[:limit], results):
            log.info("Post {} has {} comments ({})".format(
                post_id, num_comments, iteration))


    async def poll_top_stories_for_comments(loop, session, period, limit):
        """
        Periodically fetch the top stories and count their comments
        """
        global fetch_counter
        iteration = 1
        while True:
            now = datetime.now()
            log.info("Calculating comments for top {} stories. ({})".format(
                limit, iteration))
            await get_comments_of_top_stories(loop, session, limit, iteration)
            log.info(
                '> Calculating comments took {:.2f} seconds and {} fetches'.format(
                    (datetime.now() - now).total_seconds(), fetch_counter))
            log.info("Waiting for {} seconds...".format(period))
            iteration += 1
            fetch_counter = 0
            # interval between two runs
            await asyncio.sleep(period)


    async def main(loop, period, limit):
        # created inside a coroutine, the session binds to the running loop
        async with aiohttp.ClientSession() as session:
            comments = await poll_top_stories_for_comments(loop, session, period, limit)
        return comments


    if __name__ == '__main__':
        args = parser.parse_args()
        if args.verbose:
            log.setLevel(logging.DEBUG)

        # get_event_loop() is deprecated outside a running loop on recent Pythons
        loop = asyncio.new_event_loop()
        loop.run_until_complete(main(loop, args.period, args.limit))
        loop.close()

    The output looks like this:

    [16:24:28] Calculating comments for top 5 stories. (1)
    [16:24:41] Post 19334909 has 156 comments (1)
    [16:24:41] Post 19333600 has 147 comments (1)
    [16:24:41] Post 19335363 has 9 comments (1)
    [16:24:41] Post 19330812 has 341 comments (1)
    [16:24:41] Post 19333479 has 81 comments (1)
    [16:24:41] > Calculating comments took 12.17 seconds and 740 fetches
    [16:24:41] Waiting for 5 seconds...
    [16:24:46] Calculating comments for top 5 stories. (2)
    [16:24:50] Post 19334909 has 156 comments (2)
    [16:24:50] Post 19333600 has 147 comments (2)
    [16:24:50] Post 19335363 has 9 comments (2)
    [16:24:50] Post 19330812 has 341 comments (2)
    [16:24:50] Post 19333479 has 81 comments (2)
    [16:24:50] > Calculating comments took 4.75 seconds and 740 fetches
    [16:24:50] Waiting for 5 seconds...
    Traceback (most recent call last):

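    The output shows that the runs are not actually 5s apart. The loop sits at await get_comments_of_top_stories(loop, session, limit, iteration)
    and only reaches the 5s sleep and the next iteration once that call has finished, so the real interval is the crawl time plus 5s. Sometimes we don't want to wait and would rather carry straight on. We can use the same old trick, ensure_future, and change that line to: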

    asyncio.ensure_future(get_comments_of_top_stories(loop, session, limit, iteration))

    Running it again gives:

    [16:44:07] Calculating comments for top 5 stories. (1)
    [16:44:07] > Calculating comments took 0.00 seconds and 0 fetches
    [16:44:07] Waiting for 5 seconds...
    [16:44:12] Calculating comments for top 5 stories. (2)
    [16:44:12] > Calculating comments took 0.00 seconds and 49 fetches
    [16:44:12] Waiting for 5 seconds...
    [16:44:17] Calculating comments for top 5 stories. (3)
    [16:44:17] > Calculating comments took 0.00 seconds and 1044 fetches
    [16:44:17] Waiting for 5 seconds...
    [16:44:21] Post 19334909 has 159 comments (1)
    [16:44:21] Post 19333600 has 150 comments (1)
    [16:44:21] Post 19335363 has 13 comments (1)
    [16:44:21] Post 19330812 has 342 comments (1)
    [16:44:21] Post 19333479 has 81 comments (1)
    [16:44:22] Post 19334909 has 159 comments (3)
    [16:44:22] Post 19333600 has 150 comments (3)
    [16:44:22] Post 19335363 has 13 comments (3)
    [16:44:22] Post 19330812 has 342 comments (3)
    [16:44:22] Post 19333479 has 81 comments (3)
    [16:44:22] Calculating comments for top 5 stories. (4)
    [16:44:22] > Calculating comments took 0.00 seconds and 1158 fetches
    [16:44:22] Waiting for 5 seconds...
    [16:44:23] Post 19334909 has 159 comments (2)
    [16:44:23] Post 19333600 has 150 comments (2)
    [16:44:23] Post 19335363 has 13 comments (2)
    [16:44:23] Post 19330812 has 342 comments (2)
    [16:44:23] Post 19333479 has 81 comments (2)
    [16:44:26] Post 19334909 has 159 comments (4)
    [16:44:26] Post 19333600 has 150 comments (4)
    [16:44:26] Post 19335363 has 13 comments (4)
    [16:44:26] Post 19330812 has 343 comments (4)
    [16:44:26] Post 19333479 has 81 comments (4)
    [16:44:27] Calculating comments for top 5 stories. (5)
    [16:44:27] > Calculating comments took 0.00 seconds and 754 fetches
    [16:44:27] Waiting for 5 seconds...

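    Now the runs really are 5s apart, but a new problem appears: the first run reports 0 seconds and 0 fetches, and the fetch counts of the later runs are wrong as well. Both symptoms come from no longer waiting for get_comments_of_top_stories(loop, session, limit, iteration): the summary line is logged right after the task is scheduled, before it has done any work, and overlapping runs all increment and reset the same global fetch_counter.

    At this point, did your old friend the callback come to mind? Ha! The improved code: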
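
    The counting half of the problem can be reproduced and avoided without the network. This is only a sketch of one possible remedy, giving each run its own counter instead of a shared global; Counter, fake_fetch, crawl and overlapping_runs are hypothetical names, and fake_fetch merely sleeps.

```python
import asyncio


class Counter:
    """Hypothetical per-run counter that replaces a shared global."""

    def __init__(self):
        self.fetches = 0


async def fake_fetch(counter):
    counter.fetches += 1
    await asyncio.sleep(0.01)  # simulate network latency


async def crawl(n_fetches):
    counter = Counter()  # private to this run
    await asyncio.gather(*(fake_fetch(counter) for _ in range(n_fetches)))
    return counter.fetches


async def overlapping_runs():
    # Two runs in flight at the same time, as happens once the polling
    # loop stops awaiting each run before starting the next one.
    first = asyncio.ensure_future(crawl(5))
    second = asyncio.ensure_future(crawl(8))
    return await asyncio.gather(first, second)


if __name__ == "__main__":
    print(asyncio.run(overlapping_runs()))
```

    Because each run returns its own count, the totals stay correct even when runs overlap. Reporting them at the right moment is a separate question, since the polling loop no longer waits for a run to finish.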

    import argparse
    import asyncio
    import logging
    from datetime import datetime

    import aiohttp
    import async_timeout

    LOGGER_FORMAT = '%(asctime)s %(message)s'
    URL_TEMPLATE = "https://hacker-news.firebaseio.com/v0/item/{}.json"
    TOP_STORIES_URL = "https://hacker-news.firebaseio.com/v0/topstories.json"
    FETCH_TIMEOUT = 10

    parser = argparse.ArgumentParser(
        description='Count the comments on Hacker News stories')
    parser.add_argument(
        '--period', type=int, default=5, help='seconds between polling tasks')
    parser.add_argument(
        '--limit', type=int, default=5,
        help='how many of the top 500 stories to fetch (default: the first 5)')
    parser.add_argument('--verbose', action='store_true', help='more detailed output')

    logging.basicConfig(format=LOGGER_FORMAT, datefmt='[%H:%M:%S]')
    log = logging.getLogger()
    log.setLevel(logging.INFO)


    class URLFetcher():
        """Fetches URLs and counts how many requests were made."""

        def __init__(self):
            self.fetch_counter = 0

        async def fetch(self, session, url):
            # Use `async with`: newer async_timeout releases no longer
            # support the plain `with` form.
            async with async_timeout.timeout(FETCH_TIMEOUT):
                self.fetch_counter += 1
                # The proxy is the author's local one; adjust or drop it as needed.
                async with session.get(url, proxy="http://127.0.0.1:1080") as response:
                    return await response.json()


    async def post_number_of_comments(loop, session, fetcher, post_id):
        """
        Fetch the data for one item and recursively count all of its comments.
        """
        url = URL_TEMPLATE.format(post_id)
        response = await fetcher.fetch(session, url)

        # No comments.
        if response is None or 'kids' not in response:
            return 0

        # Number of direct comments on this item.
        number_of_comments = len(response['kids'])

        # Recurse into every comment to count its replies.
        tasks = [post_number_of_comments(
            loop, session, fetcher, kid_id) for kid_id in response['kids']]
        results = await asyncio.gather(*tasks)

        # Total number of comments below this item.
        number_of_comments += sum(results)
        log.debug('{:^6} > {} comments'.format(post_id, number_of_comments))
        return number_of_comments


    async def get_comments_of_top_stories(loop, session, limit, iteration):
        """
        Count the comments of the first `limit` stories in the top 500.
        """
        fetcher = URLFetcher()
        response = await fetcher.fetch(session, TOP_STORIES_URL)
        tasks = [post_number_of_comments(
            loop, session, fetcher, post_id) for post_id in response[:limit]]
        results = await asyncio.gather(*tasks)
        for post_id, num_comments in zip(response[:limit], results):
            log.info("Post {} has {} comments ({})".format(
                post_id, num_comments, iteration))
        # Return the number of requests this run needed.
        return fetcher.fetch_counter


    async def poll_top_stories_for_comments(loop, session, period, limit):
        """
        Periodically fetch the top stories and count their comments.
        """
        iteration = 1
        while True:
            log.info("Calculating comments for top {} stories. ({})".format(
                limit, iteration))
            # Schedule the crawl without awaiting it, so the loop keeps its period.
            future = asyncio.ensure_future(
                get_comments_of_top_stories(loop, session, limit, iteration))
            now = datetime.now()

            # Report the elapsed time and fetch count from a callback
            # once the crawl has finished.
            # Caveat: `now` is looked up when the callback runs, not when it
            # is defined. If a later iteration has already reassigned `now`,
            # the reported duration is measured from that later start.
            def callback(fut):
                fetch_count = fut.result()
                log.info(
                    '> Calculating comments took {:.2f} seconds and {} fetches'.format(
                        (datetime.now() - now).total_seconds(), fetch_count))

            future.add_done_callback(callback)
            log.info("Waiting for {} seconds...".format(period))
            iteration += 1
            await asyncio.sleep(period)


    async def main(loop, period, limit):
        async with aiohttp.ClientSession(loop=loop) as session:
            comments = await poll_top_stories_for_comments(loop, session, period, limit)
            return comments


    if __name__ == '__main__':
        args = parser.parse_args()
        if args.verbose:
            log.setLevel(logging.DEBUG)
        loop = asyncio.get_event_loop()
        loop.run_until_complete(main(loop, args.period, args.limit))
        loop.close()
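
    One caveat about the callback above: it reads `now` from the enclosing scope at the moment it runs. If a crawl takes longer than `period`, a later iteration has already reassigned `now`, and the reported duration is measured from that later start rather than from the start of the crawl that just finished. The following is a minimal sketch of one way around this, using simulated work instead of real HTTP requests. The names `fake_job`, `report` and `poll` are hypothetical and not part of the original program. It binds the start time to each callback with `functools.partial`:

```python
import asyncio
import time
from functools import partial


async def fake_job(duration):
    """Hypothetical stand-in for get_comments_of_top_stories."""
    await asyncio.sleep(duration)  # simulated crawl time
    return 42                      # made-up fetch count


def report(started, results, fut):
    """Done-callback; `started` was frozen when the task was scheduled."""
    results.append((time.monotonic() - started, fut.result()))


async def poll(period, duration, iterations):
    results = []
    tasks = []
    for _ in range(iterations):
        task = asyncio.ensure_future(fake_job(duration))
        # partial() stores this iteration's start time inside the callback,
        # so later iterations cannot overwrite it.
        task.add_done_callback(partial(report, time.monotonic(), results))
        tasks.append(task)
        await asyncio.sleep(period)
    await asyncio.gather(*tasks)
    return results


if __name__ == '__main__':
    # The jobs (0.12 s) outlast the period (0.05 s), so iterations overlap.
    for elapsed, count in asyncio.run(poll(0.05, 0.12, 3)):
        print('took {:.2f} seconds and {} fetches'.format(elapsed, count))
```

    The same idea should carry over to the crawler by handing `now` to the callback through `partial` or a default argument instead of reading it from the enclosing scope.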

    Running the code again now produces the following output:

    [17:00:17] Calculating comments for top 5 stories. (1)
    [17:00:17] Waiting for 5 seconds...
    [17:00:22] Calculating comments for top 5 stories. (2)
    [17:00:22] Waiting for 5 seconds...
    [17:00:27] Calculating comments for top 5 stories. (3)
    [17:00:27] Waiting for 5 seconds...
    [17:00:30] Post 19334909 has 163 comments (1)
    [17:00:30] Post 19333600 has 152 comments (1)
    [17:00:30] Post 19335363 has 14 comments (1)
    [17:00:30] Post 19330812 has 346 comments (1)
    [17:00:30] Post 19335853 has 1 comments (1)
    [17:00:30] > Calculating comments took 2.31 seconds and 682 fetches
    [17:00:32] Calculating comments for top 5 stories. (4)
    [17:00:32] Waiting for 5 seconds...
    [17:00:33] Post 19334909 has 163 comments (2)
    [17:00:33] Post 19333600 has 152 comments (2)
    [17:00:33] Post 19335363 has 14 comments (2)
    [17:00:33] Post 19330812 has 346 comments (2)
    [17:00:33] Post 19335853 has 1 comments (2)
    [17:00:33] > Calculating comments took 0.80 seconds and 682 fetches
    [17:00:34] Post 19334909 has 163 comments (3)
    [17:00:34] Post 19333600 has 152 comments (3)
    [17:00:34] Post 19335363 has 14 comments (3)
    [17:00:34] Post 19330812 has 346 comments (3)
    [17:00:34] Post 19335853 has 1 comments (3)
    [17:00:34] > Calculating comments took 1.24 seconds and 682 fetches
    [17:00:37] Calculating comments for top 5 stories. (5)
    [17:00:37] Waiting for 5 seconds...
    [17:00:42] Post 19334909 has 163 comments (5)
    [17:00:42] Post 19333600 has 152 comments (5)
    [17:00:42] Post 19335363 has 15 comments (5)
    [17:00:42] Post 19330812 has 346 comments (5)
    [17:00:42] Post 19335853 has 1 comments (5)
    [17:00:42] > Calculating comments took 4.55 seconds and 683 fetches
    [17:00:42] Calculating comments for top 5 stories. (6)
    [17:00:42] Waiting for 5 seconds...

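    At this point the code is in reasonable shape: a new crawl starts every 5 seconds, and each finished crawl reports its own comment totals and fetch count.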

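    Summary

    Personally, before writing this up there were many aspects of asyncio that I did not understand clearly. I was feeling my way forward and solving problems as they came up. In the process of organizing these notes, much of what had been fuzzy became a lot clearer to me.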

  • Original article: https://www.cnblogs.com/pythonClub/p/10498046.html