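html.parser uses regular expressions to parse HTML.
While using it I found that some web pages could not be parsed completely. Tracing the problem, the cause turned out to be markup like this in the page: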
<a href="www.baidu.com"style="hot">baidu</a>
The regex that html.parser uses to locate the end of a start tag is:
locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |"[^"]*"                   # LIT-enclosed value
          |[^'">\s]+                 # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)
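It assumes that attributes are always separated by whitespace: every attribute name must be preceded by at least one whitespace character (\s+). In the example above there is no space between the closing quote of href and style, so the match stops right after href="www.baidu.com" and the tag fails to parse.
So I modified the regex: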
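To see the failure concretely, here is a small standalone check. It is only a sketch: it recompiles the pattern quoted above under a local name (STRICT_START_TAG is my own name, not something html.parser exports) and reports where matching stops on the problem tag.

```python
import re

# Local copy of the original pattern. STRICT_START_TAG is a hypothetical
# name used only for this demonstration.
STRICT_START_TAG = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |"[^"]*"                   # LIT-enclosed value
          |[^'">\s]+                 # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

sample = '<a href="www.baidu.com"style="hot">'
m = STRICT_START_TAG.match(sample)

strict_matched = sample[:m.end()]    # text the pattern consumed
strict_next_char = sample[m.end()]   # first character it could not consume

print(repr(strict_matched))    # '<a href="www.baidu.com"'
print(repr(strict_next_char))  # 's'
```

The match ends at the s of style instead of at the closing >, which is why the tag is never recognized as complete.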
locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s*(?=>)                        # tag without attributes
    |\s+                             # whitespace before attribute name
    (?:\s*                           # optional whitespace between attributes
      (?:[a-zA-Z_][-.:a-zA-Z0-9_]*   # attribute name
        (?:\s*=\s*                   # value indicator
          (?:'[^']*'                 # LITA-enclosed value
            |"[^"]*"                 # LIT-enclosed value
            |[^'">\s]+               # bare value
           )
         )?
       )
     )*
    \s*                              # trailing whitespace
  )
""", re.VERBOSE)
With this change, all of the pages parsed successfully.
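The effect of the modified pattern can be checked in isolation, without touching html.parser itself. The sketch below compiles it under a local name (PATCHED_START_TAG and char_after_match are my own hypothetical names) and prints the character that follows the matched prefix for a few sample tags. A result of '>' means the whole start tag was consumed.

```python
import re

# Local copy of the modified pattern. PATCHED_START_TAG is a hypothetical
# name used only for this demonstration.
PATCHED_START_TAG = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s*(?=>)                        # tag without attributes
    |\s+                             # whitespace before attribute name
    (?:\s*                           # optional whitespace between attributes
      (?:[a-zA-Z_][-.:a-zA-Z0-9_]*   # attribute name
        (?:\s*=\s*                   # value indicator
          (?:'[^']*'                 # LITA-enclosed value
            |"[^"]*"                 # LIT-enclosed value
            |[^'">\s]+               # bare value
           )
         )?
       )
     )*
    \s*                              # trailing whitespace
  )
""", re.VERBOSE)


def char_after_match(tag):
    """Return the character right after the matched prefix, or None if no match."""
    m = PATCHED_START_TAG.match(tag)
    if m is None:
        return None
    return tag[m.end():m.end() + 1]


for tag in ('<a href="www.baidu.com"style="hot">',  # no space between attributes
            '<a href="x" style="y">',               # normal spacing
            '<br>'):                                # no attributes
    print(tag, '->', repr(char_after_match(tag)))
```

All three samples end at '>'. One thing to be aware of: as written, the pattern requires either whitespace or > directly after the tag name, so a compact self-closing tag such as <br/> does not match at all. Whether that matters depends on the pages being parsed.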