  • Python Web Programming [0] -> Web Clients [1] -> Web Page Parsing

     Web Page Parsing


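    1 Parsing with HTMLParser

    This section introduces a basic way to parse the HTML of a web page, using the html.parser module from Python's standard library. The main steps are:

    1. Create a new parser class that inherits from HTMLParser;
    2. Override handler methods such as handle_starttag to implement the desired behavior;
    3. Instantiate the new parser and feed the HTML text to the instance.

     Complete code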

    from html.parser import HTMLParser

    # An HTMLParser instance is fed HTML data and calls handler methods when start tags, end tags, text, comments, and other markup elements are encountered
    # Subclass HTMLParser and override its methods to implement the desired behavior

    class MyHTMLParser(HTMLParser):
        # attrs is the list of (name, value) attribute pairs set in the HTML start tag
        def handle_starttag(self, tag, attrs):
            print('Encountered a start tag:', tag)
            for attr in attrs:
                print('     attr:', attr)

        def handle_endtag(self, tag):
            print('Encountered an end tag :', tag)

        def handle_data(self, data):
            print('Encountered some data  :', data)

    parser = MyHTMLParser()
    parser.feed('<html><head><title>Test</title></head>'
                '<body><h1>Parse me!</h1></body></html>'
                '<img src="python-logo.png" alt="The Python logo">')

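    The code first imports the module and derives a new parser class, then overrides the handler methods. When a start tag is encountered, the tag is printed together with any attributes defined on it; end tags and text data are printed in the same way.

    Note: the attrs argument of handle_starttag() is a list of tuples built from the attributes of the start tag. In each tuple the first element is the attribute name and the second is the attribute value.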
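
    Because attrs is a list of (name, value) tuples, it can be passed straight to dict() when lookup by attribute name is more convenient. The following is a minimal sketch of that idea; the class name ImgParser and its images attribute are illustrative and do not appear in the original code.

```python
from html.parser import HTMLParser


class ImgParser(HTMLParser):
    """Collect the attributes of every <img> start tag as a dict (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            # attrs is a list of (name, value) tuples, so dict() accepts it directly
            self.images.append(dict(attrs))


parser = ImgParser()
parser.feed('<img src="python-logo.png" alt="The Python logo">')
print(parser.images[0]['src'])
```

    Note that dict() keeps only the last value if a tag repeats an attribute name, so the list form is the safer choice when duplicates matter.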

    Output

    Encountered a start tag: html  
    Encountered a start tag: head  
    Encountered a start tag: title  
    Encountered some data  : Test  
    Encountered an end tag : title  
    Encountered an end tag : head  
    Encountered a start tag: body  
    Encountered a start tag: h1  
    Encountered some data  : Parse me!  
    Encountered an end tag : h1  
    Encountered an end tag : body  
    Encountered an end tag : html  
    Encountered a start tag: img  
         attr: ('src', 'python-logo.png')  
         attr: ('alt', 'The Python logo')  

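    The output shows that the parser has parsed the HTML text and printed the attributes contained in the tags.

    2 Parsing with BeautifulSoup

    Next we look at BeautifulSoup, a third-party package for parsing HTML pages, and compare it with HTMLParser.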

    BeautifulSoup must be installed first. The code below also uses the html5lib tree builder, so install both:

    pip install beautifulsoup4 html5lib

    Complete code

    from html.parser import HTMLParser
    from io import StringIO
    from urllib import request

    from bs4 import BeautifulSoup, SoupStrainer
    from html5lib import parse, treebuilders


    URLs = ('http://python.org',
            'http://www.baidu.com')

    def output(x):
        # Remove duplicates with a set, sort, and print one link per line
        print('\n'.join(sorted(set(x))))

    def simple_beau_soup(url, f):
        'simple_beau_soup() - use BeautifulSoup to parse all tags to get anchors'
        # BeautifulSoup returns a BeautifulSoup instance
        # find_all function returns a bs4.element.ResultSet instance,
        # which contains bs4.element.Tag instances,
        # use tag['attr'] to get attribute of tag
        output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib').find_all('a'))

    def faster_beau_soup(url, f):
        'faster_beau_soup() - use BeautifulSoup to parse only anchor tags'
        # parse_only=SoupStrainer('a') asks for only anchor tags to be parsed
        # (the html5lib tree builder ignores it and warns)
        output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib', parse_only=SoupStrainer('a')).find_all('a'))

    def htmlparser(url, f):
        'htmlparser() - use HTMLParser to parse anchor tags'
        class AnchorParser(HTMLParser):
            def handle_starttag(self, tag, attrs):
                if tag != 'a':
                    return
                if not hasattr(self, 'data'):
                    self.data = []
                for attr in attrs:
                    if attr[0] == 'href':
                        self.data.append(attr[1])
        parser = AnchorParser()
        parser.feed(f.read())
        output(request.urljoin(url, x) for x in parser.data)
        print('DONE')

    def html5libparse(url, f):
        'html5libparse() - use html5lib to parse anchor tags'
        # Not implemented; the original attempt is left commented out
        #output(request.urljoin(url, x.attributes['href']) for x in parse(f) if isinstance(x, treebuilders.etree.Element) and x.name == 'a')

    def process(url, data):
        print('\n*** simple BeauSoupParser')
        simple_beau_soup(url, data)
        data.seek(0)
        print('\n*** faster BeauSoupParser')
        faster_beau_soup(url, data)
        data.seek(0)
        print('\n*** HTMLParser')
        htmlparser(url, data)
        data.seek(0)
        print('\n*** HTML5lib')
        html5libparse(url, data)
        data.seek(0)

    if __name__ == '__main__':
        for url in URLs:
            f = request.urlopen(url)
            data = StringIO(f.read().decode())
            f.close()
            process(url, data)

    Step-by-step explanation

    First import the required modules. StringIO provides an in-memory text buffer that behaves like a file:

    from html.parser import HTMLParser
    from io import StringIO
    from urllib import request

    from bs4 import BeautifulSoup, SoupStrainer
    from html5lib import parse, treebuilders


    URLs = ('http://python.org',
            'http://www.baidu.com')

    Next define an output function. It uses a set to remove duplicate entries, sorts the result, and joins the entries with newlines.

    def output(x):
        print('\n'.join(sorted(set(x))))
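
    To see what the set and the sort contribute, the function can be tried on a small hand-written list. This is only an illustration; the sample links are made up for the example.

```python
def output(x):
    # set() drops duplicates, sorted() gives a stable order, '\n'.join puts one link per line
    print('\n'.join(sorted(set(x))))


links = ['http://python.org/doc/',
         'http://python.org/about/',
         'http://python.org/doc/']   # deliberate duplicate
unique_sorted = sorted(set(links))
output(links)
```

    The duplicate entry is printed only once, and the links appear in lexicographic order regardless of the order in which they were found.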

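    Here we define a simple BeautifulSoup parsing function. It passes the HTML text and the features argument to the BeautifulSoup class (newer versions prompt for an explicit parser; 'html5lib' is used here) to create a BeautifulSoup instance. It then calls find_all() to get every 'a' anchor tag (a ResultSet of bs4.element.Tag objects), reads the href attribute from each Tag, and finally uses urljoin to build the full link for output.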

    def simple_beau_soup(url, f):
        'simple_beau_soup() - use BeautifulSoup to parse all tags to get anchors'
        # BeautifulSoup returns a BeautifulSoup instance
        # find_all function returns a bs4.element.ResultSet instance,
        # which contains bs4.element.Tag instances,
        # use tag['attr'] to get attribute of tag
        output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib').find_all('a'))
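
    The role of urljoin is easier to see in isolation. The sketch below resolves a few kinds of href value against a base URL. It imports urljoin from urllib.parse, where the function is documented, rather than reaching it through urllib.request as the listing does; the sample href values are chosen for illustration.

```python
from urllib.parse import urljoin

base = 'http://python.org'
examples = ['/about/',                  # path relative to the site root
            '#content',                 # fragment on the same page
            'https://docs.python.org',  # already absolute, returned unchanged
            'javascript:;']             # different scheme, returned unchanged
resolved = [urljoin(base, href) for href in examples]
for href, full in zip(examples, resolved):
    print(href, '->', full)
```

    This explains why entries such as javascript:; and http://python.org#content show up in the program output: urljoin only fills in the parts that the href leaves out.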

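    Next define another parsing function. It passes the parse_only argument to restrict parsing to anchor tags, which is intended to speed up parsing.

    Note: there is a problem here. The 'html5lib' tree builder does not support the parse_only argument, so the entire document is still parsed. This remains unresolved.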

    def faster_beau_soup(url, f):
        'faster_beau_soup() - use BeautifulSoup to parse only anchor tags'
        # parse_only=SoupStrainer('a') asks for only anchor tags to be parsed
        # (the html5lib tree builder ignores it and warns)
        output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib', parse_only=SoupStrainer('a')).find_all('a'))
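
    One possible way around the limitation, offered here as a sketch rather than as the author's solution, is to switch to a tree builder that honors parse_only, such as the standard library's 'html.parser'. The function name anchor_links and the sample HTML are hypothetical, and href=True is added so that anchors without an href are skipped instead of raising KeyError.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup, SoupStrainer

HTML = ('<html><body><p>intro <a href="/about/">About</a></p>'
        '<div><a href="https://docs.python.org">Docs</a></div>'
        '<a name="top">no href</a></body></html>')


def anchor_links(base_url, html):
    """Return sorted absolute links, building a tree from <a> tags only."""
    # 'html.parser' supports parse_only, unlike 'html5lib'
    soup = BeautifulSoup(html, features='html.parser',
                         parse_only=SoupStrainer('a'))
    # href=True matches only anchors that actually carry an href attribute
    return sorted({urljoin(base_url, a['href'])
                   for a in soup.find_all('a', href=True)})


links = anchor_links('http://python.org', HTML)
print('\n'.join(links))
```

    The trade-off is that 'html.parser' may handle badly malformed markup differently from html5lib, so the two builders are not guaranteed to return the same links on every page.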

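    Then define a function that parses with HTMLParser, as in the previous section. It first creates an anchor parser class. When a start tag is encountered, the handler checks whether it is an 'a' anchor. If so, it checks whether the instance already has a data attribute and initializes it to an empty list if not, then iterates over attrs to collect the href value. Finally the parser is instantiated and fed the data.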

    def htmlparser(url, f):
        'htmlparser() - use HTMLParser to parse anchor tags'
        class AnchorParser(HTMLParser):
            def handle_starttag(self, tag, attrs):
                if tag != 'a':
                    return
                if not hasattr(self, 'data'):
                    self.data = []
                for attr in attrs:
                    if attr[0] == 'href':
                        self.data.append(attr[1])
        parser = AnchorParser()
        parser.feed(f.read())
        output(request.urljoin(url, x) for x in parser.data)
        print('DONE')
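
    With the hasattr check, parser.data is never created when a page contains no anchors, so reading it afterwards would raise AttributeError. A variant that initializes the list in __init__ avoids that case. The sketch below is one way to write it; the helper name extract_links and the sample HTML are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.data = []   # always present, even if no <a> tag is ever seen

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == 'href' and value is not None:
                self.data.append(value)


def extract_links(base_url, html):
    """Return the sorted, de-duplicated absolute links found in html."""
    parser = AnchorParser()
    parser.feed(html)
    parser.close()
    return sorted({urljoin(base_url, href) for href in parser.data})


with_links = extract_links(
    'http://python.org',
    '<a href="/about/">About</a><a name="top">no href</a><p>text</p>')
without_links = extract_links('http://python.org', '<p>no anchors here</p>')
print(with_links, without_links)
```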

    Finally define a process function. After each use of the data passed in, seek(0) must be called to move the read position back to the start.

    def process(url, data):
        print('\n*** simple BeauSoupParser')
        simple_beau_soup(url, data)
        data.seek(0)
        print('\n*** faster BeauSoupParser')
        faster_beau_soup(url, data)
        data.seek(0)
        print('\n*** HTMLParser')
        htmlparser(url, data)
        data.seek(0)
        print('\n*** HTML5lib')
        html5libparse(url, data)
        data.seek(0)
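
    The need for seek(0) follows from how file-like objects keep a read position. A short illustration with StringIO (the sample string is made up):

```python
from io import StringIO

data = StringIO('<a href="/about/">About</a>')

first = data.read()    # reads everything and leaves the position at the end
second = data.read()   # nothing left to read, so this is an empty string
data.seek(0)           # move the position back to the start
third = data.read()    # the full text is available again

print(repr(second))
print(first == third)
```

    Without the rewind, every parser after the first would receive an empty document and report no links.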

    The final result is the set of all links found on each page.

    if __name__ == '__main__':
        for url in URLs:
            f = request.urlopen(url)
            data = StringIO(f.read().decode())
            f.close()
            process(url, data)

    Output of the run

    *** simple BeauSoupParser
    http://blog.python.org
    http://bottlepy.org
    http://brochure.getpython.info/
    http://buildbot.net/
    http://docs.python.org/3/tutorial/
    http://docs.python.org/3/tutorial/controlflow.html
    http://docs.python.org/3/tutorial/controlflow.html#defining-functions
    http://docs.python.org/3/tutorial/introduction.html#lists
    http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator
    http://feedproxy.google.com/~r/PythonInsider/~3/TmC0nYZBrz4/python-364rc1-and-370a3-now-available.html
    http://feedproxy.google.com/~r/PythonInsider/~3/rMFQQbvrekU/python-364-is-now-available.html
    http://feedproxy.google.com/~r/PythonInsider/~3/ubEu3XCqoFM/python-370a2-now-available-for-testing.html
    http://feedproxy.google.com/~r/PythonInsider/~3/xUpvN2wKt2s/python-364rc1-and-370a3-now-available.html
    http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html
    http://flask.pocoo.org/
    http://ipython.org
    http://jobs.python.org
    http://pandas.pydata.org/
    http://planetpython.org/
    http://plus.google.com/+Python
    http://pycon.blogspot.com/
    http://pyfound.blogspot.com/
    http://python.org
    http://python.org#content
    http://python.org#python-network
    http://python.org#site-map
    http://python.org#top
    http://python.org/
    http://python.org/about/
    http://python.org/about/apps
    http://python.org/about/apps/
    http://python.org/about/gettingstarted/
    http://python.org/about/help/
    http://python.org/about/legal/
    http://python.org/about/quotes/
    http://python.org/about/success/
    http://python.org/about/success/#arts
    http://python.org/about/success/#business
    http://python.org/about/success/#education
    http://python.org/about/success/#engineering
    http://python.org/about/success/#government
    http://python.org/about/success/#scientific
    http://python.org/about/success/#software-development
    http://python.org/accounts/login/
    http://python.org/accounts/signup/
    http://python.org/blogs/
    http://python.org/community/
    http://python.org/community/awards
    http://python.org/community/diversity/
    http://python.org/community/forums/
    http://python.org/community/irc/
    http://python.org/community/lists/
    http://python.org/community/logos/
    http://python.org/community/merchandise/
    http://python.org/community/sigs/
    http://python.org/community/workshops/
    http://python.org/dev/
    http://python.org/dev/core-mentorship/
    http://python.org/dev/peps/
    http://python.org/dev/peps/peps.rss
    http://python.org/doc/
    http://python.org/doc/av
    http://python.org/doc/essays/
    http://python.org/download/alternatives
    http://python.org/download/other/
    http://python.org/downloads/
    http://python.org/downloads/mac-osx/
    http://python.org/downloads/release/python-2714/
    http://python.org/downloads/release/python-364/
    http://python.org/downloads/source/
    http://python.org/downloads/windows/
    http://python.org/events/
    http://python.org/events/calendars/
    http://python.org/events/python-events
    http://python.org/events/python-events/543/
    http://python.org/events/python-events/611/
    http://python.org/events/python-events/past/
    http://python.org/events/python-user-group/
    http://python.org/events/python-user-group/605/
    http://python.org/events/python-user-group/619/
    http://python.org/events/python-user-group/620/
    http://python.org/events/python-user-group/past/
    http://python.org/jobs/
    http://python.org/privacy/
    http://python.org/psf-landing/
    http://python.org/psf/
    http://python.org/psf/donations/
    http://python.org/psf/sponsorship/sponsors/
    http://python.org/shell/
    http://python.org/success-stories/
    http://python.org/success-stories/industrial-light-magic-runs-python/
    http://python.org/users/membership/
    http://roundup.sourceforge.net/
    http://tornadoweb.org
    http://trac.edgewall.org/
    http://twitter.com/ThePSF
    http://wiki.python.org/moin/Languages
    http://wiki.python.org/moin/TkInter
    http://www.ansible.com
    http://www.djangoproject.com/
    http://www.facebook.com/pythonlang?fref=ts
    http://www.pylonsproject.org/
    http://www.riverbankcomputing.co.uk/software/pyqt/intro
    http://www.saltstack.com
    http://www.scipy.org
    http://www.web2py.com/
    http://www.wxpython.org/
    https://bugs.python.org/
    https://devguide.python.org/
    https://docs.python.org
    https://docs.python.org/3/license.html
    https://docs.python.org/faq/
    https://github.com/python/pythondotorg/issues
    https://kivy.org/
    https://mail.python.org/mailman/listinfo/python-dev
    https://pypi.python.org/
    https://status.python.org/
    https://wiki.gnome.org/Projects/PyGObject
    https://wiki.python.org/moin/
    https://wiki.python.org/moin/BeginnersGuide
    https://wiki.python.org/moin/Python2orPython3
    https://wiki.python.org/moin/PythonBooks
    https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
    https://wiki.qt.io/PySide
    https://www.openstack.org
    https://www.python.org/psf/codeofconduct/
    javascript:;
    
    *** faster BeauSoupParser
    
    Warning (from warnings module):
      File "C:\Python35\lib\site-packages\bs4\builder\_html5lib.py", line 63
        warnings.warn("You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.")
    UserWarning: You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.
    http://blog.python.org
    http://bottlepy.org
    http://brochure.getpython.info/
    http://buildbot.net/
    http://docs.python.org/3/tutorial/
    http://docs.python.org/3/tutorial/controlflow.html
    http://docs.python.org/3/tutorial/controlflow.html#defining-functions
    http://docs.python.org/3/tutorial/introduction.html#lists
    http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator
    http://feedproxy.google.com/~r/PythonInsider/~3/TmC0nYZBrz4/python-364rc1-and-370a3-now-available.html
    http://feedproxy.google.com/~r/PythonInsider/~3/rMFQQbvrekU/python-364-is-now-available.html
    http://feedproxy.google.com/~r/PythonInsider/~3/ubEu3XCqoFM/python-370a2-now-available-for-testing.html
    http://feedproxy.google.com/~r/PythonInsider/~3/xUpvN2wKt2s/python-364rc1-and-370a3-now-available.html
    http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html
    http://flask.pocoo.org/
    http://ipython.org
    http://jobs.python.org
    http://pandas.pydata.org/
    http://planetpython.org/
    http://plus.google.com/+Python
    http://pycon.blogspot.com/
    http://pyfound.blogspot.com/
    http://python.org
    http://python.org#content
    http://python.org#python-network
    http://python.org#site-map
    http://python.org#top
    http://python.org/
    http://python.org/about/
    http://python.org/about/apps
    http://python.org/about/apps/
    http://python.org/about/gettingstarted/
    http://python.org/about/help/
    http://python.org/about/legal/
    http://python.org/about/quotes/
    http://python.org/about/success/
    http://python.org/about/success/#arts
    http://python.org/about/success/#business
    http://python.org/about/success/#education
    http://python.org/about/success/#engineering
    http://python.org/about/success/#government
    http://python.org/about/success/#scientific
    http://python.org/about/success/#software-development
    http://python.org/accounts/login/
    http://python.org/accounts/signup/
    http://python.org/blogs/
    http://python.org/community/
    http://python.org/community/awards
    http://python.org/community/diversity/
    http://python.org/community/forums/
    http://python.org/community/irc/
    http://python.org/community/lists/
    http://python.org/community/logos/
    http://python.org/community/merchandise/
    http://python.org/community/sigs/
    http://python.org/community/workshops/
    http://python.org/dev/
    http://python.org/dev/core-mentorship/
    http://python.org/dev/peps/
    http://python.org/dev/peps/peps.rss
    http://python.org/doc/
    http://python.org/doc/av
    http://python.org/doc/essays/
    http://python.org/download/alternatives
    http://python.org/download/other/
    http://python.org/downloads/
    http://python.org/downloads/mac-osx/
    http://python.org/downloads/release/python-2714/
    http://python.org/downloads/release/python-364/
    http://python.org/downloads/source/
    http://python.org/downloads/windows/
    http://python.org/events/
    http://python.org/events/calendars/
    http://python.org/events/python-events
    http://python.org/events/python-events/543/
    http://python.org/events/python-events/611/
    http://python.org/events/python-events/past/
    http://python.org/events/python-user-group/
    http://python.org/events/python-user-group/605/
    http://python.org/events/python-user-group/619/
    http://python.org/events/python-user-group/620/
    http://python.org/events/python-user-group/past/
    http://python.org/jobs/
    http://python.org/privacy/
    http://python.org/psf-landing/
    http://python.org/psf/
    http://python.org/psf/donations/
    http://python.org/psf/sponsorship/sponsors/
    http://python.org/shell/
    http://python.org/success-stories/
    http://python.org/success-stories/industrial-light-magic-runs-python/
    http://python.org/users/membership/
    http://roundup.sourceforge.net/
    http://tornadoweb.org
    http://trac.edgewall.org/
    http://twitter.com/ThePSF
    http://wiki.python.org/moin/Languages
    http://wiki.python.org/moin/TkInter
    http://www.ansible.com
    http://www.djangoproject.com/
    http://www.facebook.com/pythonlang?fref=ts
    http://www.pylonsproject.org/
    http://www.riverbankcomputing.co.uk/software/pyqt/intro
    http://www.saltstack.com
    http://www.scipy.org
    http://www.web2py.com/
    http://www.wxpython.org/
    https://bugs.python.org/
    https://devguide.python.org/
    https://docs.python.org
    https://docs.python.org/3/license.html
    https://docs.python.org/faq/
    https://github.com/python/pythondotorg/issues
    https://kivy.org/
    https://mail.python.org/mailman/listinfo/python-dev
    https://pypi.python.org/
    https://status.python.org/
    https://wiki.gnome.org/Projects/PyGObject
    https://wiki.python.org/moin/
    https://wiki.python.org/moin/BeginnersGuide
    https://wiki.python.org/moin/Python2orPython3
    https://wiki.python.org/moin/PythonBooks
    https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
    https://wiki.qt.io/PySide
    https://www.openstack.org
    https://www.python.org/psf/codeofconduct/
    javascript:;
    
    *** HTMLParser
    http://blog.python.org
    http://bottlepy.org
    http://brochure.getpython.info/
    http://buildbot.net/
    http://docs.python.org/3/tutorial/
    http://docs.python.org/3/tutorial/controlflow.html
    http://docs.python.org/3/tutorial/controlflow.html#defining-functions
    http://docs.python.org/3/tutorial/introduction.html#lists
    http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator
    http://feedproxy.google.com/~r/PythonInsider/~3/TmC0nYZBrz4/python-364rc1-and-370a3-now-available.html
    http://feedproxy.google.com/~r/PythonInsider/~3/rMFQQbvrekU/python-364-is-now-available.html
    http://feedproxy.google.com/~r/PythonInsider/~3/ubEu3XCqoFM/python-370a2-now-available-for-testing.html
    http://feedproxy.google.com/~r/PythonInsider/~3/xUpvN2wKt2s/python-364rc1-and-370a3-now-available.html
    http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html
    http://flask.pocoo.org/
    http://ipython.org
    http://jobs.python.org
    http://pandas.pydata.org/
    http://planetpython.org/
    http://plus.google.com/+Python
    http://pycon.blogspot.com/
    http://pyfound.blogspot.com/
    http://python.org
    http://python.org#content
    http://python.org#python-network
    http://python.org#site-map
    http://python.org#top
    http://python.org/
    http://python.org/about/
    http://python.org/about/apps
    http://python.org/about/apps/
    http://python.org/about/gettingstarted/
    http://python.org/about/help/
    http://python.org/about/legal/
    http://python.org/about/quotes/
    http://python.org/about/success/
    http://python.org/about/success/#arts
    http://python.org/about/success/#business
    http://python.org/about/success/#education
    http://python.org/about/success/#engineering
    http://python.org/about/success/#government
    http://python.org/about/success/#scientific
    http://python.org/about/success/#software-development
    http://python.org/accounts/login/
    http://python.org/accounts/signup/
    http://python.org/blogs/
    http://python.org/community/
    http://python.org/community/awards
    http://python.org/community/diversity/
    http://python.org/community/forums/
    http://python.org/community/irc/
    http://python.org/community/lists/
    http://python.org/community/logos/
    http://python.org/community/merchandise/
    http://python.org/community/sigs/
    http://python.org/community/workshops/
    http://python.org/dev/
    http://python.org/dev/core-mentorship/
    http://python.org/dev/peps/
    http://python.org/dev/peps/peps.rss
    http://python.org/doc/
    http://python.org/doc/av
    http://python.org/doc/essays/
    http://python.org/download/alternatives
    http://python.org/download/other/
    http://python.org/downloads/
    http://python.org/downloads/mac-osx/
    http://python.org/downloads/release/python-2714/
    http://python.org/downloads/release/python-364/
    http://python.org/downloads/source/
    http://python.org/downloads/windows/
    http://python.org/events/
    http://python.org/events/calendars/
    http://python.org/events/python-events
    http://python.org/events/python-events/543/
    http://python.org/events/python-events/611/
    http://python.org/events/python-events/past/
    http://python.org/events/python-user-group/
    http://python.org/events/python-user-group/605/
    http://python.org/events/python-user-group/619/
    http://python.org/events/python-user-group/620/
    http://python.org/events/python-user-group/past/
    http://python.org/jobs/
    http://python.org/privacy/
    http://python.org/psf-landing/
    http://python.org/psf/
    http://python.org/psf/donations/
    http://python.org/psf/sponsorship/sponsors/
    http://python.org/shell/
    http://python.org/success-stories/
    http://python.org/success-stories/industrial-light-magic-runs-python/
    http://python.org/users/membership/
    http://roundup.sourceforge.net/
    http://tornadoweb.org
    http://trac.edgewall.org/
    http://twitter.com/ThePSF
    http://wiki.python.org/moin/Languages
    http://wiki.python.org/moin/TkInter
    http://www.ansible.com
    http://www.djangoproject.com/
    http://www.facebook.com/pythonlang?fref=ts
    http://www.pylonsproject.org/
    http://www.riverbankcomputing.co.uk/software/pyqt/intro
    http://www.saltstack.com
    http://www.scipy.org
    http://www.web2py.com/
    http://www.wxpython.org/
    https://bugs.python.org/
    https://devguide.python.org/
    https://docs.python.org
    https://docs.python.org/3/license.html
    https://docs.python.org/faq/
    https://github.com/python/pythondotorg/issues
    https://kivy.org/
    https://mail.python.org/mailman/listinfo/python-dev
    https://pypi.python.org/
    https://status.python.org/
    https://wiki.gnome.org/Projects/PyGObject
    https://wiki.python.org/moin/
    https://wiki.python.org/moin/BeginnersGuide
    https://wiki.python.org/moin/Python2orPython3
    https://wiki.python.org/moin/PythonBooks
    https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
    https://wiki.qt.io/PySide
    https://www.openstack.org
    https://www.python.org/psf/codeofconduct/
    javascript:;
    DONE
    
    *** HTML5lib
    
    *** simple BeauSoupParser
    http://e.baidu.com/?refer=888
    http://home.baidu.com
    http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=
    http://ir.baidu.com
    http://jianyi.baidu.com/
    http://map.baidu.com
    http://map.baidu.com/m?word=&fr=ps01000
    http://music.baidu.com/search?fr=ps&ie=utf-8&key=
    http://news.baidu.com
    http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
    http://tieba.baidu.com
    http://tieba.baidu.com/f?kw=&fr=wwwt
    http://v.baidu.com
    http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=
    http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8
    http://www.baidu.com/
    http://www.baidu.com/cache/sethelp/help.html
    http://www.baidu.com/duty/
    http://www.baidu.com/gaoji/preferences.html
    http://www.baidu.com/more/
    http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001
    http://www.hao123.com
    http://xueshu.baidu.com
    http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
    https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
    javascript:;
    
    *** faster BeauSoupParser
    http://e.baidu.com/?refer=888
    http://home.baidu.com
    http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=
    http://ir.baidu.com
    http://jianyi.baidu.com/
    http://map.baidu.com
    http://map.baidu.com/m?word=&fr=ps01000
    http://music.baidu.com/search?fr=ps&ie=utf-8&key=
    http://news.baidu.com
    http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
    http://tieba.baidu.com
    http://tieba.baidu.com/f?kw=&fr=wwwt
    http://v.baidu.com
    http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=
    http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8
    http://www.baidu.com/
    http://www.baidu.com/cache/sethelp/help.html
    http://www.baidu.com/duty/
    http://www.baidu.com/gaoji/preferences.html
    http://www.baidu.com/more/
    http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001
    http://www.hao123.com
    http://xueshu.baidu.com
    http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
    https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
    javascript:;
    
    *** HTMLParser
    http://e.baidu.com/?refer=888
    http://home.baidu.com
    http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=
    http://ir.baidu.com
    http://jianyi.baidu.com/
    http://map.baidu.com
    http://map.baidu.com/m?word=&fr=ps01000
    http://music.baidu.com/search?fr=ps&ie=utf-8&key=
    http://news.baidu.com
    http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
    http://tieba.baidu.com
    http://tieba.baidu.com/f?kw=&fr=wwwt
    http://v.baidu.com
    http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=
    http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8
    http://www.baidu.com/
    http://www.baidu.com/cache/sethelp/help.html
    http://www.baidu.com/duty/
    http://www.baidu.com/gaoji/preferences.html
    http://www.baidu.com/more/
    http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001
    http://www.hao123.com
    http://xueshu.baidu.com
    http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
    https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
    javascript:;
    DONE
    
    *** HTML5lib

    References


    Core Python Applications Programming, 3rd Edition (《Python 核心编程 第3版》)

  • Original article: https://www.cnblogs.com/stacklike/p/8244925.html