Parsing HTML / XHTML

The HTMLParser module
    The usual way to parse an HTML file with the HTMLParser module is to create a subclass
    of HTMLParser and implement the tag-handling logic in it by overriding the parent class's
    handle_starttag(), handle_data(), handle_endtag() and related methods.

    Example,
        Extract the content of the <head> tag from htmlsample.html,
            <-- htmlsample.html -->  -> file content,
                '
                <html>
                <head><title>404 Not Found</title></head>
                <body bgcolor="white">
                <center><h1>404 Not Found</h1></center>
                <hr><center>nginx/1.12.2</center>
                </body>
                </html>
                '
        from html.parser import HTMLParser

        class ParsingHeadT(HTMLParser):
            def __init__(self):
                super().__init__()
                self.headtag = ''
                self.parsesemaphore = False    # True while the parser is inside <head>

            def handle_starttag(self, tag, attrs):    # entering <head>: enable the flag
                if tag == 'head':
                    self.parsesemaphore = True

            def handle_data(self, data):              # keep the text seen inside <head>
                if self.parsesemaphore:
                    self.headtag = data

            def handle_endtag(self, tag):             # leaving <head>: disable the flag
                if tag == 'head':
                    self.parsesemaphore = False

            def getheadtag(self):
                return self.headtag

        if __name__ == "__main__":
            with open('htmlsample.html') as FH:
                pht = ParsingHeadT()
                pht.feed(FH.read())    # feed() calls the overridden handle_starttag,
                                       # handle_data and handle_endtag methods
                print("Head Tag : %s" % pht.getheadtag())

        Output,
           Head Tag : 404 Not Found

    The example above is a simple, complete HTML document. Real-world input brings cases
    that need extra handling, such as character references in the HTML, e.g. &copy; (the
    copyright sign) and &amp; (the ampersand, &).
        The older way to handle these was to override the parent class's handle_entityref(),
            HTMLParser.handle_entityref(name)
                This method is called to process a named character reference of the form
                &name; (e.g. &gt;), where name is a general entity reference (e.g. 'gt').
                This method is never called if convert_charrefs is True.

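As a rough sketch of that older approach, the parser below is created with convert_charrefs=False so that handle_entityref() is actually called. The class name EntityCollector, the helper resolve_entities() and the sample string are made up for illustration and are not part of the original post.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class EntityCollector(HTMLParser):
    """Hypothetical helper: rebuilds the text and resolves named references by hand."""

    def __init__(self):
        # With convert_charrefs=False the parser reports named references
        # through handle_entityref() instead of converting them itself.
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_entityref(self, name):
        # name arrives without '&' and ';', e.g. 'copy' or 'amp'.
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            # Unknown name: keep the reference as written instead of failing.
            self.pieces.append("&%s;" % name)
        else:
            self.pieces.append(chr(codepoint))


def resolve_entities(markup):
    """Return the text content of markup with named references resolved."""
    parser = EntityCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.pieces)


if __name__ == "__main__":
    print(resolve_entities("<p>&copy; 2017 Tom &amp; Jerry</p>"))
```

Looking names up with name2codepoint.get() rather than indexing is a design choice of this sketch: it keeps an unknown reference in the output instead of raising KeyError.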
    Numeric character references, written in decimal or hexadecimal, also need attention.
        HTMLParser.handle_charref(name)
            This method is called to process decimal and hexadecimal numeric character
            references of the form &#NNN; and &#xNNN;. For example, the decimal equivalent
            for &gt; is &#62;, whereas the hexadecimal is &#x3E;; in this case the method
            will receive '62' or 'x3E'. This method is never called if convert_charrefs is True.

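A minimal sketch of handling numeric references by hand, again assuming a parser created with convert_charrefs=False. CharrefCollector and resolve_charrefs() are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser


class CharrefCollector(HTMLParser):
    """Hypothetical helper: converts &#NNN; and &#xNNN; references itself."""

    def __init__(self):
        # handle_charref() is only called when convert_charrefs is False.
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_charref(self, name):
        # name is '62' for &#62; and 'x3E' for &#x3E;.
        if name[:1] in ("x", "X"):
            codepoint = int(name[1:], 16)   # hexadecimal form
        else:
            codepoint = int(name)           # decimal form
        self.pieces.append(chr(codepoint))


def resolve_charrefs(markup):
    """Return the text content of markup with numeric references resolved."""
    parser = CharrefCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.pieces)


if __name__ == "__main__":
    print(resolve_charrefs("1 &#62; 0 and 2 &#x3E; 1"))
```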
    Note,
        Fortunately, Python 3 handles these cases for us: since Python 3.5 convert_charrefs
        defaults to True, so character references in text are converted automatically.
        Reusing the example above, add some character references to the <head> tag of
        htmlsample.html and see what happens.
            <-- htmlsample.html -->
            <html>
            <head><title>&#62 &#x3E 404 &copy Not &gt Found & </title></head>
            <body bgcolor="white">
            <center><h1>404 Not Found</h1></center>
            <hr><center>nginx/1.12.2</center>
            </body>
            </html>

        Output,
                Head Tag : > > 404 © Not > Found &
                The result shows that Python 3 handles the character references in this example correctly.

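The same behaviour can be checked without a file. The sketch below feeds an in-memory string to a parser that keeps the default convert_charrefs=True, and uses html.unescape() for plain strings that contain no tags. TitleParser and extract_title() are hypothetical names, and the sample markup is an assumption rather than the post's htmlsample.html.

```python
import html
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper: collects the text inside <title>."""

    def __init__(self):
        super().__init__()          # convert_charrefs keeps its default, True
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Character references have already been converted when data arrives.
        if self.in_title:
            self.title += data


def extract_title(markup):
    parser = TitleParser()
    parser.feed(markup)
    parser.close()
    return parser.title


if __name__ == "__main__":
    print(extract_title("<head><title>&#62; &#x3E; &copy; &gt; &amp;</title></head>"))
    print(html.unescape("&copy; &gt;"))
```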
    However, HTML also allows 'unbalanced' tags such as <p> and <li>, whose end tags may be
    omitted. The example above cannot parse such tags correctly, so the code has to be
    extended to handle them.
        First extend htmlsample.html, using the <li> tag as an example,
        <-- htmlsample.html -->
        <html>
        <head><title>&#62 &#x3E 404 &copy Not &gt Found &</title>
        <body bgcolor="white">
        <center><h1>404 Not Found</h1></center>
        <hr><center>nginx/1.12.2</center>
        <ul>
            <li> First Reason
            <li> Second Reason
        </body>
        </html>

        A browser can still render htmlsample.html, even though its <head> and <ul> tags have no
        matching end tags and the <li> tags are unbalanced. Now add some logic to the previous
        example to deal with these problems.

        Example,
            from html.parser import HTMLParser

            class Parser(HTMLParser):
                def __init__(self):
                    super().__init__()
                    self.taglevels = []                 # stack of currently open tags
                    self.tags = ['head', 'ul', 'li']    # tags whose text is collected
                    self.parsesemaphore = False
                    self.data = ''

                def handle_starttag(self, tag, attrs):
                    # A repeated tag such as <li> ... <li> implicitly closes the previous one.
                    if self.taglevels and self.taglevels[-1] == tag:
                        self.handle_endtag(tag)
                    self.taglevels.append(tag)

                    if tag in self.tags:                # enable the flag
                        self.parsesemaphore = True

                def handle_data(self, data):            # collect text while the flag is set
                    if self.parsesemaphore:
                        self.data += data

                def handle_endtag(self, tag):
                    # Pop the tag if it is on top of the stack, so taglevels does not only grow.
                    if self.taglevels and self.taglevels[-1] == tag:
                        self.taglevels.pop()
                    self.parsesemaphore = False

                def gettag(self):
                    return self.data

            if __name__ == "__main__":
                with open('htmlsample.html') as FH:
                    pht = Parser()
                    pht.feed(FH.read())    # feed() calls the overridden handle_starttag,
                                           # handle_data and handle_endtag methods
                    print("Head Tag : %s" % pht.gettag())

            Output,
                 Head Tag : > > 404 © Not > Found &
                 First Reason
                 Second Reason

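The example above gathers all the text into one string. As a variation, the sketch below keeps each <li> as a separate item and treats a new <li>, a closing list or body tag, or the end of input as the implicit end of the open item. ListItemParser and list_items() are hypothetical names, and the set of closing tags is an assumption of this sketch, not a complete implementation of HTML's optional-end-tag rules.

```python
from html.parser import HTMLParser


class ListItemParser(HTMLParser):
    """Hypothetical helper: collects the text of each <li>, closed or not."""

    # End tags that also finish an open <li> (a simplifying assumption).
    CLOSING_TAGS = ("li", "ul", "ol", "body", "html")

    def __init__(self):
        super().__init__()
        self.items = []
        self.current = None     # text pieces of the open <li>, or None

    def _close_item(self):
        if self.current is not None:
            self.items.append("".join(self.current).strip())
            self.current = None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._close_item()  # a new <li> implicitly ends the previous one
            self.current = []

    def handle_endtag(self, tag):
        if tag in self.CLOSING_TAGS:
            self._close_item()

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)

    def close(self):
        super().close()
        self._close_item()      # end of input also ends an open <li>


def list_items(markup):
    parser = ListItemParser()
    parser.feed(markup)
    parser.close()
    return parser.items


if __name__ == "__main__":
    print(list_items("<ul>\n<li> First Reason\n<li> Second Reason\n</body>"))
```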
Reference,
    https://docs.python.org/3.6/library/html.parser.html?highlight=htmlparse#html.parser.HTMLParser.handle_entityref

Appendix,
    The example given by the Python documentation,
        from html.parser import HTMLParser
        from html.entities import name2codepoint

        class MyHTMLParser(HTMLParser):
            def handle_starttag(self, tag, attrs):
                print("Start tag:", tag)
                for attr in attrs:
                    print("     attr:", attr)

            def handle_endtag(self, tag):
                print("End tag  :", tag)

            def handle_data(self, data):
                print("Data     :", data)

            def handle_comment(self, data):
                print("Comment  :", data)

            def handle_entityref(self, name):
                c = chr(name2codepoint[name])
                print("Named ent:", c)

            def handle_charref(self, name):
                if name.startswith('x'):
                    c = chr(int(name[1:], 16))
                else:
                    c = chr(int(name))
                print("Num ent  :", c)

            def handle_decl(self, data):
                print("Decl     :", data)

        parser = MyHTMLParser()

    Output,
        Parsing a doctype:

        >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        ...             '"http://www.w3.org/TR/html4/strict.dtd">')
        Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

        Parsing an element with a few attributes and a title:

        >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
        Start tag: img
             attr: ('src', 'python-logo.png')
             attr: ('alt', 'The Python logo')

        >>> parser.feed('<h1>Python</h1>')
        Start tag: h1
        Data     : Python
        End tag  : h1

        The content of script and style elements is returned as is, without further parsing:

        >>> parser.feed('<style type="text/css">#python { color: green }</style>')
        Start tag: style
             attr: ('type', 'text/css')
        Data     : #python { color: green }
        End tag  : style

        >>> parser.feed('<script type="text/javascript">'
        ...             'alert("<strong>hello!</strong>");</script>')
        Start tag: script
             attr: ('type', 'text/javascript')
        Data     : alert("<strong>hello!</strong>");
        End tag  : script

        Parsing comments:

        >>> parser.feed('<!-- a comment -->'
        ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
        Comment  :  a comment
        Comment  : [if IE 9]>IE-specific content<![endif]

        Parsing named and numeric character references and converting them to the correct
        char (note: these 3 references are all equivalent to '>'). As quoted above,
        handle_entityref() and handle_charref() are never called if convert_charrefs is True,
        so this output requires a parser created with convert_charrefs=False:

        >>> parser.feed('&gt;&#62;&#x3E;')
        Named ent: >
        Num ent  : >
        Num ent  : >

        Feeding incomplete chunks to feed() works, but handle_data() might be called more
        than once (unless convert_charrefs is set to True):

        >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
        ...     parser.feed(chunk)
        Start tag: span
        Data     : buff
        Data     : ered
        Data     : text
        End tag  : span

        Parsing invalid HTML (e.g. unquoted attributes) also works:

        >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
        Start tag: p
        Start tag: a
             attr: ('class', 'link')
             attr: ('href', '#main')
        Data     : tag soup
        End tag  : p
        End tag  : a
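
Because handle_data() may be called several times for one run of text when input arrives in chunks, a handler that needs the whole text can accumulate the pieces and join them after close(). The sketch below shows one way to do that; TextJoiner and text_from_chunks() are hypothetical names, and the sketch makes no claim about how many times handle_data() is called on a given Python version.

```python
from html.parser import HTMLParser


class TextJoiner(HTMLParser):
    """Hypothetical helper: stores every text piece it is given."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        # May be called once or several times for the same run of text.
        self.pieces.append(data)


def text_from_chunks(chunks):
    """Feed the chunks one by one and return the complete text content."""
    parser = TextJoiner()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()              # process anything still buffered
    return "".join(parser.pieces)


if __name__ == "__main__":
    print(text_from_chunks(['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']))
```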
  • Original article: https://www.cnblogs.com/zzyzz/p/8037020.html