
Fixing a bug where html.parser cannot fully parse some web pages

html.parser uses regular expressions to parse HTML code.

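While using it, I found that some web pages could not be parsed completely. Tracing through the parser showed that the cause was markup like this in the page: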

<a href="www.baidu.com"style="hot">badidu</a>

html.parser locates tags with the following regular expression:

locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

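The pattern requires whitespace before every attribute name, so it assumes that attributes are always separated by whitespace. In the example above, style directly follows the closing quote of the href value, so the match stops there and parsing fails.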
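The failure can be reproduced outside the parser. The following is a minimal sketch that recompiles the pattern quoted above using only the `re` module, so it does not depend on which version of html.parser is installed. The names `STRICT_START_TAG`, `sample`, `matched`, and `next_char` are mine for illustration and are not part of html.parser.

```python
import re

# The pattern quoted above, recompiled standalone for illustration.
STRICT_START_TAG = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

sample = '<a href="www.baidu.com"style="hot">badidu</a>'
match = STRICT_START_TAG.match(sample)

matched = match.group()            # text consumed by the pattern
next_char = sample[match.end()]    # character right after the match

print(matched)    # <a href="www.baidu.com"
print(next_char)  # s
```

The match ends right after the first attribute value, so the next character is `s` instead of the `>` that would close the start tag.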

So I modified the regex as follows:

locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*              # tag name
  (?:\s*(?=>)                            # tag without attributes
    |\s+                                 # whitespace before attribute name
     (?:\s*                              # whitespace between attributes
        (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
          (?:\s*=\s*                     # value indicator
            (?:'[^']*'                   # LITA-enclosed value
              |\"[^\"]*\"                # LIT-enclosed value
              |[^'\">\s]+                # bare value
             )
           )?
        )
     )*
     \s*                                 # trailing whitespace
  )
""", re.VERBOSE)

With this change, all of the pages were parsed successfully.
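To see what the modified pattern accepts, here is a small sketch that compiles it standalone with `re` and tries it on a few start tags. The helper `start_tag_prefix` and the constant `FIXED_START_TAG` are hypothetical names used only for this illustration. The sketch checks the regex on its own, not the full parser.

```python
import re

# The modified pattern from this post, recompiled standalone for illustration.
FIXED_START_TAG = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*              # tag name
  (?:\s*(?=>)                            # tag without attributes
    |\s+                                 # whitespace before attribute name
     (?:\s*                              # whitespace between attributes
        (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
          (?:\s*=\s*                     # value indicator
            (?:'[^']*'                   # LITA-enclosed value
              |\"[^\"]*\"                # LIT-enclosed value
              |[^'\">\s]+                # bare value
             )
           )?
        )
     )*
     \s*                                 # trailing whitespace
  )
""", re.VERBOSE)


def start_tag_prefix(html):
    """Return the text the pattern matches at the start of html, or None."""
    m = FIXED_START_TAG.match(html)
    return m.group() if m else None


print(repr(start_tag_prefix('<a href="www.baidu.com"style="hot">badidu</a>')))
# '<a href="www.baidu.com"style="hot"'
print(repr(start_tag_prefix('<a href="www.baidu.com" style="hot">')))
# '<a href="www.baidu.com" style="hot"'
print(repr(start_tag_prefix('<br>')))    # '<br'
print(repr(start_tag_prefix('<br />')))  # '<br '
print(repr(start_tag_prefix('<br/>')))   # None
```

The last case is worth noting: a self-closing tag written without a space, such as `<br/>`, matches neither alternative, because the first requires `>` to come next and the second requires at least one whitespace character. The original pattern would have matched `<br` there, so pages containing such tags may need a further adjustment.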


Original article: https://www.cnblogs.com/sqxy110/p/4881761.html