  • Using Tornado - Queue

    1. Features of Tornado's queue
    Compared with Python's standard-library queue, Tornado's Queue supports asynchronous use: a coroutine waiting on it is suspended while the IOLoop stays free, instead of the whole thread blocking.
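
    For example, here is a minimal sketch of that asynchronous behaviour (the coroutine name demo is illustrative, not part of Tornado): while get() waits for an item, only this coroutine is suspended and the IOLoop keeps running other work.

    from tornado import gen, ioloop, queues

    @gen.coroutine
    def demo():
        q = queues.Queue()
        # Schedule a put() for one second from now; until then, get()
        # suspends this coroutine without blocking the IOLoop.
        ioloop.IOLoop.current().call_later(1, lambda: q.put('hello'))
        item = yield q.get()
        print('got %s' % item)

    ioloop.IOLoop.current().run_sync(demo)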

    2. Common Queue methods
    Queue.get()
    Pauses until there is an item in the queue.

    Queue.put()
    On a queue with a maximum size, pauses until the queue has free space.

    Queue.task_done()
    For each item taken with get(), call task_done() afterwards to mark that task as finished.

    Queue.join()
    Waits until every task is finished, i.e. task_done() has been called for every item. (A sketch tying these four methods together follows this list.)
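
    A minimal producer/consumer sketch of the four methods, in the same coroutine style as the example below; the names main and worker are illustrative, not part of the Tornado API:

    from tornado import gen, ioloop, queues

    @gen.coroutine
    def main():
        q = queues.Queue(maxsize=2)    # put() pauses when the queue is full

        @gen.coroutine
        def worker():
            while True:
                item = yield q.get()   # pauses until an item is available
                try:
                    print('processing %s' % item)
                finally:
                    q.task_done()      # mark this item as finished

        worker()                       # run the consumer in the background
        for i in range(5):
            yield q.put(i)             # pauses whenever maxsize is reached
        yield q.join()                 # resumes once task_done() has been
                                       # called for every item

    ioloop.IOLoop.current().run_sync(main)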

    3. Example
    Starting from the address http://www.tornadoweb.org/en/stable/, find every link on the page that has this URL as a prefix, then visit and parse each of those pages in turn, until all such URLs have been discovered.

    #!/usr/bin/env python

    import time
    from datetime import timedelta

    try:
        from HTMLParser import HTMLParser
        from urlparse import urljoin, urldefrag
    except ImportError:
        from html.parser import HTMLParser
        from urllib.parse import urljoin, urldefrag

    from tornado import httpclient, gen, ioloop, queues

    base_url = 'http://www.tornadoweb.org/en/stable/'
    concurrency = 10


    @gen.coroutine
    def get_links_from_url(url):
        """Download the page at `url` and parse it for links.

        Returned links have had the fragment after `#` removed, and have been made
        absolute so, e.g. the URL 'gen.html#tornado.gen.coroutine' becomes
        'http://www.tornadoweb.org/en/stable/gen.html'.
        """
        try:
            response = yield httpclient.AsyncHTTPClient().fetch(url)
            print('fetched %s' % url)

            html = response.body if isinstance(response.body, str) \
                else response.body.decode()
            urls = [urljoin(url, remove_fragment(new_url))
                    for new_url in get_links(html)]
        except Exception as e:
            print('Exception: %s %s' % (e, url))
            raise gen.Return([])

        raise gen.Return(urls)


    def remove_fragment(url):
        pure_url, frag = urldefrag(url)
        return pure_url


    def get_links(html):
        class URLSeeker(HTMLParser):
            def __init__(self):
                HTMLParser.__init__(self)
                self.urls = []

            def handle_starttag(self, tag, attrs):
                href = dict(attrs).get('href')
                if href and tag == 'a':
                    self.urls.append(href)

        url_seeker = URLSeeker()
        url_seeker.feed(html)
        return url_seeker.urls


    @gen.coroutine
    def main():
        q = queues.Queue()
        start = time.time()
        fetching, fetched = set(), set()

        @gen.coroutine
        def fetch_url():
            current_url = yield q.get()
            try:
                if current_url in fetching:
                    return

                print('fetching %s' % current_url)
                fetching.add(current_url)
                urls = yield get_links_from_url(current_url)
                fetched.add(current_url)

                for new_url in urls:
                    # Only follow links beneath the base URL
                    if new_url.startswith(base_url):
                        yield q.put(new_url)

            finally:
                q.task_done()

        @gen.coroutine
        def worker():
            while True:
                yield fetch_url()

        q.put(base_url)

        # Start workers, then wait for the work queue to be empty.
        for _ in range(concurrency):
            worker()
        yield q.join(timeout=timedelta(seconds=300))
        assert fetching == fetched
        print('Done in %d seconds, fetched %s URLs.' % (
            time.time() - start, len(fetched)))


    if __name__ == '__main__':
        import logging
        logging.basicConfig()
        io_loop = ioloop.IOLoop.current()
        io_loop.run_sync(main)
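
    Two design choices in main() are worth noting: the workers are started without yield, so they run in the background while main() continues, and q.join(timeout=timedelta(seconds=300)) bounds the whole crawl at five minutes. The final assert fetching == fetched checks that every URL that started fetching was also fully processed.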