  • HTMLParser in Python

    As the name suggests, HTMLParser is used to parse HTML files. Python 2 has two classes called HTMLParser: one is defined in the htmllib module (htmllib.HTMLParser), and the other is defined in the HTMLParser module (HTMLParser.HTMLParser). Let's look at each in turn.

    htmllib.HTMLParser

    This module has been deprecated since Python 2.6, and htmllib was removed in Python 3. It is still worth knowing how it works. This parser is not directly concerned with I/O: it must be given its input as a string via a method, and it calls methods of a "formatter" object to produce output. As a result, instantiating it looks like this:

    >>> from cStringIO import StringIO
    >>> from formatter import DumbWriter, AbstractFormatter
    >>> from htmllib import HTMLParser
    >>> parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
    >>>

    This is cumbersome. All you want to do is parse an HTML file, yet you first have to learn about formatters, writers, and I/O streams.

    HTMLParser.HTMLParser

    In Python 3 this module is renamed html.parser. Like htmllib.HTMLParser, it parses HTML, but you do not need to import extra modules such as formatter or cStringIO. For more information, see:

    https://docs.python.org/2.7/library/htmlparser.html?highlight=htmlparser#HTMLParser
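
    Because of the rename, code that has to run on both Python 2 and Python 3 needs to try both module names. The following is a minimal sketch of one common way to do that; the sample markup fed to the parser is only an illustration.

```python
# Import HTMLParser under whichever module name this interpreter provides.
try:
    from html.parser import HTMLParser  # Python 3 name
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 name

# The base class can be instantiated directly; its handlers do nothing,
# so feeding it markup simply parses and discards the events.
parser = HTMLParser()
parser.feed("<p>hello</p>")
parser.close()
```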

    Here is a brief introduction to this module.

    The sample below shows the module in use. Note that it needs neither a formatter class nor a string I/O class.

    >>> from HTMLParser import HTMLParser
    >>> class MyHTMLParser(HTMLParser):
    ...     def handle_starttag(self, tag, attrs):
    ...         print "Encountered a start tag:", tag
    ...     def handle_endtag(self, tag):
    ...         print "Encountered an end tag :", tag
    ...     def handle_data(self, data):
    ...         print "Encountered some data  :", data
    ...
    >>> parser = MyHTMLParser()
    >>> parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')
    Encountered a start tag: html
    Encountered a start tag: head
    Encountered a start tag: title
    Encountered some data  : Test
    Encountered an end tag : title
    Encountered an end tag : head
    Encountered a start tag: body
    Encountered a start tag: h1
    Encountered some data  : Parse me!
    Encountered an end tag : h1
    Encountered an end tag : body
    Encountered an end tag : html
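
    The session above is Python 2. As a rough sketch of the same idea on Python 3, assuming the html.parser module name, the handlers below record each callback in a list instead of printing it. The class name EventRecorder and the events attribute are made up for this illustration.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record parser callbacks in order instead of printing them."""

    def __init__(self):
        super().__init__()
        self.events = []  # (kind, value) tuples, in document order

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


recorder = EventRecorder()
recorder.feed("<html><head><title>Test</title></head>"
              "<body><h1>Parse me!</h1></body></html>")
recorder.close()

for kind, value in recorder.events:
    print(kind, value)
```

    Keeping the events in a list makes the parser easier to reuse and to test than printing from inside the handlers.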
    

      

    Another difference: htmllib.HTMLParser had the following two methods.

    HTMLParser.anchor_bgn(href, name, type)
    This method is called at the start of an anchor region. The arguments correspond to the attributes of the <A> tag with the same names. The default implementation maintains a list of hyperlinks (defined by the HREF attribute for <A> tags) within the document. The list of hyperlinks is available as the data attribute anchorlist.
    
    HTMLParser.anchor_end()
    This method is called at the end of an anchor region. The default implementation adds a textual footnote marker using an index into the list of hyperlinks created by anchor_bgn().
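
    To make the description above concrete, here is a sketch that imitates the two hooks on top of Python 3's html.parser. It is not htmllib's own code: the class AnchorParser, the text attribute, and the "[n]" marker format are assumptions chosen for illustration, and only the behavior described above (collect hrefs, then emit a footnote index) is reproduced.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Imitate htmllib's anchor_bgn()/anchor_end() hooks on html.parser."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []     # every href seen so far
        self.text = []           # text output, including footnote markers
        self._in_anchor = False  # True while inside an <a> that has an href

    def anchor_bgn(self, href, name, type):
        # Called at the start of an anchor region; remember its hyperlink.
        self._in_anchor = bool(href)
        if href:
            self.anchorlist.append(href)

    def anchor_end(self):
        # Called at the end of an anchor region; the footnote marker is
        # the 1-based index of the link in anchorlist.
        if self._in_anchor:
            self.text.append("[%d]" % len(self.anchorlist))
            self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            a = dict(attrs)
            self.anchor_bgn(a.get("href", ""), a.get("name", ""),
                            a.get("type", ""))

    def handle_endtag(self, tag):
        if tag == "a":
            self.anchor_end()

    def handle_data(self, data):
        self.text.append(data)


p = AnchorParser()
p.feed('<p>See <a href="http://example.com/">this page</a> '
       'and <a href="/duty/">that one</a>.</p>')
p.close()
print(p.anchorlist)     # ['http://example.com/', '/duty/']
print("".join(p.text))  # See this page[1] and that one[2].
```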

    With these two methods, htmllib.HTMLParser can easily retrieve the URL links from an HTML file. For example:

    >>> from formatter import DumbWriter, AbstractFormatter
    >>> from cStringIO import StringIO
    >>> from htmllib import HTMLParser
    >>>
    >>> def parseAndGetLinks(path):
    ...     # The formatter output is discarded; only the collected links matter.
    ...     parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
    ...     parser.feed(open(path).read())
    ...     parser.close()
    ...     return parser.anchorlist
    ...
    >>> path = '/tmp/a.ttt'
    >>> parseAndGetLinks(path)
    ['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']

    HTMLParser.HTMLParser does not have these two methods. That does not matter, because we can write the equivalent ourselves.

     1 >>> from HTMLParser import HTMLParser
     2 >>> class myHtmlParser(HTMLParser):
     3 ...     def __init__(self):
     4 ...         HTMLParser.__init__(self)
     5 ...         self.anchorlist = []
     6 ...     def handle_starttag(self, tag, attrs):
     7 ...         if tag == 'a' or tag == 'A':
     8 ...             for t in attrs:
     9 ...                 if t[0] == 'href' or t[0] == 'HREF':
    10 ...                     self.anchorlist.append(t[1])
    11 ...
    12 >>> file='/tmp/a.ttt'
    13 >>> parser=myHtmlParser()
    14 >>> parser.feed(open(file).read())
    15 >>> parser.anchorlist
    16 ['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']
    17 >>>

    Let's look at the second example.

    Lines 3 to 5 override the __init__ method. The point of the override is to add a new attribute, anchorlist, to the instance.

    Lines 6 to 10 override the handle_starttag method. It first uses an if statement to check the tag. If the tag is 'a' or 'A', it loops over the tag's attributes, finds the href attribute, and appends its value to anchorlist. (HTMLParser converts tag and attribute names to lowercase before calling the handler, so the 'A' and 'HREF' comparisons are not actually needed.)

    That is all.
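
    For readers on Python 3, the walkthrough above can be sketched as follows, assuming the html.parser module name. The class name LinkCollector and the inline sample markup are hypothetical; the sample stands in for the /tmp/a.ttt file used earlier. Since the parser lowercases tag and attribute names, only 'a' and 'href' are checked.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, like anchorlist in htmllib."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive already lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for an attribute written without a value.
            if name == "href" and value is not None:
                self.anchorlist.append(value)


# Hypothetical sample input with mixed-case markup.
sample = ('<A HREF="http://www.baidu.com/more/">more</A> '
          '<a href="/duty/">duty</a>')
collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.anchorlist)  # ['http://www.baidu.com/more/', '/duty/']
```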

  • Original article: https://www.cnblogs.com/kramer/p/3765495.html