Parsing HTML and XHTML

The HTMLParser class (html.parser module)
The usual way to parse an HTML file with the html.parser module is to create a subclass of HTMLParser
and implement the tag handling in that subclass by overriding the parent's
handle_starttag(), handle_data(), handle_endtag() and related methods.

Example,
Parse the <head> tag in htmlsample.html.
<-- htmlsample.html --> file content,
'
<html>
<head><title>404 Not Found</title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.12.2</center>
</body>
</html>
'
    from html.parser import HTMLParser

    class ParsingHeadT(HTMLParser):
        def __init__(self):
            super().__init__()
            self.headtag = ''
            self.parsesemaphore = False    # True while inside <head>

        def handle_starttag(self, tag, attrs):    # enable semaphore
            if tag == 'head':
                self.parsesemaphore = True

        def handle_data(self, data):    # process the text as required
            if self.parsesemaphore:
                self.headtag = data

        def handle_endtag(self, tag):    # disable semaphore
            if tag == 'head':
                self.parsesemaphore = False

        def getheadtag(self):
            return self.headtag

    if __name__ == "__main__":
        with open('htmlsample.html') as fh:
            pht = ParsingHeadT()
            pht.feed(fh.read())    # feed() invokes the overridden methods
                                   # handle_starttag, handle_data and handle_endtag
            print("Head Tag : %s" % pht.getheadtag())

Output,
Head Tag : 404 Not Found

The example above uses a simple, complete HTML document. In production, however, there are practical
cases to consider and handle, such as special characters in HTML: &copy; (the copyright symbol),
&amp; (the ampersand, &) and so on.
Previously, handling these required overriding the parent's handle_entityref().
HTMLParser.handle_entityref(name)
    This method is called to process a named character reference of the form
    &name; (e.g. &gt;), where name is a general entity reference (e.g. 'gt').
    This method is never called if convert_charrefs is True.

Numeric character references, written in decimal or hexadecimal, are another case to watch for.
HTMLParser.handle_charref(name)
    This method is called to process decimal and hexadecimal numeric character
    references of the form &#NNN; and &#xNNN;.
    For example, the decimal equivalent for &gt; is &#62;, whereas the hexadecimal is &#x3E;;
    in this case the method will receive '62' or 'x3E'.
    This method is never called if convert_charrefs is True.

Note,
Fortunately, Python 3 already handles these cases well: since Python 3.5, convert_charrefs defaults
to True, so character references are converted before handle_data() is called. Using the same
example, let's add some special characters to the <head> tag of htmlsample.html and see.
<-- htmlsample.html -->
<html>
<head><title>&gt; &#62; 404 &copy; Not &#x3E; Found &amp; </title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.12.2</center>
</body>
</html>

Output of the example above,
Head Tag : > > 404 © Not > Found &
The result shows that in Python 3 the example handles special characters correctly.

However, HTML also contains 'unpaired' tags such as <p> and <li>, whose end tags are often omitted.
When we try to process such tags with the example above, they are not parsed correctly, so the code
has to be extended to handle them.
First extend htmlsample.html, using the <li> tag as the example,
<-- htmlsample.html -->
<html>
<head><title>&gt; &#62; 404 &copy; Not &#x3E; Found &amp;</title>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.12.2</center>
<ul>
<li> First Reason
<li> Second Reason
</body>
</html>

A browser can still render htmlsample.html, even though its <head> and <ul> tags have no matching
end tags and its <li> tags are unpaired. Now let's add some logic to the earlier example to handle
these problems.
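To make the problem concrete first, the sketch below records the tag events that HTMLParser reports for a list whose <li> tags are never closed. EventRecorder and record_events are hypothetical names introduced only for this illustration; they are not part of html.parser. It suggests that the parser reports tags as written and does not supply the missing end tags, which is why the subclass has to track them itself.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records every start/end tag event in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def record_events(html):
    """Feed an HTML string to a fresh recorder and return its event list."""
    recorder = EventRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.events


# Two <li> items without end tags, as in the extended htmlsample.html.
events = record_events("<ul><li>First<li>Second</ul>")
print(events)
# [('start', 'ul'), ('start', 'li'), ('start', 'li'), ('end', 'ul')]
```

No ('end', 'li') event appears, so any logic that waits for </li> before it stops collecting text never fires.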

Example,
    from html.parser import HTMLParser

    class Parser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.taglevels = []                 # start tags seen so far, most recent last
            self.tags = ['head', 'ul', 'li']    # tags whose text should be captured
            self.parsesemaphore = False         # True while text is being captured
            self.data = ''

        def handle_starttag(self, tag, attrs):
            # A repeated start tag (e.g. <li> directly after <li>) implicitly
            # closes the previous one.
            if self.taglevels and self.taglevels[-1] == tag:
                self.handle_endtag(tag)
            self.taglevels.append(tag)

            if tag in self.tags:                # enable semaphore
                self.parsesemaphore = True

        def handle_data(self, data):            # process the text as required
            if self.parsesemaphore:
                self.data += data

        def handle_endtag(self, tag):
            # Any end tag stops the capture.
            self.parsesemaphore = False

        def gettag(self):
            return self.data

    if __name__ == "__main__":
        with open('htmlsample.html') as fh:
            pht = Parser()
            pht.feed(fh.read())    # feed() invokes the overridden methods
                                   # handle_starttag, handle_data and handle_endtag
            print("Head Tag : %s" % pht.gettag())

Output,
Head Tag : > > 404 © Not > Found &
 First Reason
 Second Reason

Reference,
https://docs.python.org/3.6/library/html.parser.html?highlight=htmlparse#html.parser.HTMLParser.handle_entityref

Appendix,
The example given by the Python documentation,
    from html.parser import HTMLParser
    from html.entities import name2codepoint

    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print("Start tag:", tag)
            for attr in attrs:
                print("     attr:", attr)

        def handle_endtag(self, tag):
            print("End tag  :", tag)

        def handle_data(self, data):
            print("Data     :", data)

        def handle_comment(self, data):
            print("Comment  :", data)

        def handle_entityref(self, name):
            c = chr(name2codepoint[name])
            print("Named ent:", c)

        def handle_charref(self, name):
            if name.startswith('x'):
                c = chr(int(name[1:], 16))
            else:
                c = chr(int(name))
            print("Num ent  :", c)

        def handle_decl(self, data):
            print("Decl     :", data)

    parser = MyHTMLParser()

Output,
Parsing a doctype:

# >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
...             '"http://www.w3.org/TR/html4/strict.dtd">')
Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title:

# >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
Start tag: img
     attr: ('src', 'python-logo.png')
     attr: ('alt', 'The Python logo')

# >>> parser.feed('<h1>Python</h1>')
Start tag: h1
Data     : Python
End tag  : h1

The content of script and style elements is returned as is, without further parsing:

# >>> parser.feed('<style type="text/css">#python { color: green }</style>')
Start tag: style
     attr: ('type', 'text/css')
Data     : #python { color: green }
End tag  : style

# >>> parser.feed('<script type="text/javascript">'
...             'alert("<strong>hello!</strong>");</script>')
Start tag: script
     attr: ('type', 'text/javascript')
Data     : alert("<strong>hello!</strong>");
End tag  : script

Parsing comments:

# >>> parser.feed('<!-- a comment -->'
...             '<!--[if IE 9]>IE-specific content<![endif]-->')
Comment  : a comment
Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the correct
char (note: these 3 references are all equivalent to '>'). handle_entityref() and
handle_charref() are only called when the parser is created with convert_charrefs=False:

# >>> parser.feed('&gt;&#62;&#x3E;')
Named ent: >
Num ent  : >
Num ent  : >

Feeding incomplete chunks to feed() works, but handle_data() might be called more
than once (unless convert_charrefs is set to True):

# >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
...     parser.feed(chunk)
Start tag: span
Data     : buff
Data     : ered
Data     : text
End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works:

# >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
Start tag: p
Start tag: a
     attr: ('class', 'link')
     attr: ('href', '#main')
Data     : tag soup
End tag  : p
End tag  : a
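
As a small supplement to the note that handle_entityref() and handle_charref() are never called when convert_charrefs is True, the sketch below feeds the same character references to a parser in both modes and records which handler receives them. RefRecorder and record_calls are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class RefRecorder(HTMLParser):
    """Hypothetical helper: records which handler sees each piece of text."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.calls = []

    def handle_data(self, data):
        self.calls.append(("data", data))

    def handle_entityref(self, name):
        self.calls.append(("entityref", name))

    def handle_charref(self, name):
        self.calls.append(("charref", name))


def record_calls(html, convert_charrefs):
    """Parse html in the given mode and return the recorded handler calls."""
    recorder = RefRecorder(convert_charrefs)
    recorder.feed(html)
    recorder.close()
    return recorder.calls


sample = "<p>&gt;&#62;&#x3E;</p>"

# Default mode: references are converted and delivered through handle_data().
print(record_calls(sample, convert_charrefs=True))
# [('data', '>>>')]

# Legacy mode: each reference reaches its own handler, unconverted.
print(record_calls(sample, convert_charrefs=False))
# [('entityref', 'gt'), ('charref', '62'), ('charref', 'x3E')]
```

In the default mode a subclass only needs handle_data(); overriding the two reference handlers matters only if the parser is deliberately created with convert_charrefs=False.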