1. Parsing a string with the lxml library
"""lxml库解析html代码""" from lxml import etree text = """ <body> <div class="header clear"> <div class="inner"> <h1 class="logo_area" title="腾讯互联网业务系统招聘"><a href="index.html">腾讯互联网业务系统招聘</a></h1> <div class="nav"> <ul> <li class="current"><a href="index.html">互联首页</a></li> <li><a href="jobs.html#0">互联岗位</a></li> <li><a href="internet_business.html">互联业务</a></li> <li><a href="internet_life.html">互联生活</a></li> </ul> </div> </div> </div> <div class="slide_wp"> <div class="slide"> <div class="img_area"> <img id="banner" src ="img/upload/banner1.jpg"> </div> <div class="trans_bar"></div> <div class="num_area"> <a id="banner0" class="current" onMouseOver="stopImg(0)" onMouseOut="starImg()"></a> <a id="banner1" onMouseOver="stopImg(1)" onMouseOut="starImg()"></a> <a id="banner2" onMouseOver="stopImg(2)" onMouseOut="starImg()"></a> <a id="banner3" onMouseOver="stopImg(3)" onMouseOut="starImg()"></a> </div> </div> </div> <div class="main"> <div class="search_area search_area_index"> <strong class="title">职位搜索快速通道</strong> > 城市:<select id="cityid"> <option value="1000">-选择工作地点-</option> <option value="2218">深圳</option> <option value="2156">北京</option> <option value="2175">上海</option> <option value="2268">成都</option> <option value="2000">其他</option> </select> 职位类别:<select id="typeid"> <option value="10">-选择职位类别-</option> <option value="87">技术类</option> <option value="82">产品/项目类</option> <option value="83">市场类</option> <option value="81">设计类</option> <option value="84">职能类</option> <option value="85">内容编辑类</option> <option value="86">客户服务类</option> </select> 职位名称:<input id="content" class="search_text" type="text" value="" name="content"/> <button onClick="Icity=document.getElementById('cityid').value;Itype=document.getElementById('typeid').value;Icon=document.getElementById('content').value; window.location.href='jobs.html#2';document.cookie='Icity='+Icity+'Itype='+Itype+'Icon='+Icon+',';"></button> </div> <div class="block_bottom"> <div class="block_intro"> <div class="com_block 
isd_intro"> <div class="title"><a class="more" href="internet_business.html"></a><strong>关于互联网业务系统</strong></div> <div class="content"> <div class="img_area"><img src="img/upload/isd.png" /></div> 互联网业务系统是腾讯负责互联网社区平台和增值服务运营的业务单元,所负责的业务一直以来都是中国互联网增值服务市场的领头羊,在用户数、收入等方面都在业内遥遥领先。 </div> </div> <div class="com_block"><div class="title"><a class="more" href="jobs.html#0"></a><strong>急聘岗位</strong></div> <div class="content" id="hotJob"> <ul> <li><i>·</i><a onclick="window.location.href='jobs_detail.html';document.cookie='itemId=9079.';">SJP-社交平台前台开发工程师(深圳)</a></li> <li><i>·</i><a onclick="window.location.href='jobs_detail.html';document.cookie='itemId=10110.';">HL6-移动应用测试组长(深圳)</a></li> <li><i>·</i><a onclick="window.location.href='jobs_detail.html';document.cookie='itemId=10116.';">HL7-ISD交互设计组长(深圳)</a></li> <li><i>·</i><a onclick="window.location.href='jobs_detail.html';document.cookie='itemId=8948.';">SZ-数媒音乐产品运营主管(深圳)</a></li> </ul> </div> </div> <div class="com_block"><div class="title"><strong>招聘动态</strong></div> <div class="content"> <ul> <li><i></i>腾讯互联网系统春季招聘火热进行中,非常欢迎对互联网感兴趣的优秀产品类、技术类、市场类人才加盟,携手互联,共筑精彩在线生活!</li> </ul> </div> </div> </div> </div> </div> <div class="footer"> <footer> <div class="inner"> <p><a href="http://www.tencent.com/">关于腾讯</a> | <a href="http://www.tencent.com/index_e.shtml">About Tencent</a> | <a href="http://www.qq.com/contract.shtml">服务条款</a> | <a href="http://www.tencentmind.com/">广告服务</a> | <a href="http://hr.tencent.com/">腾讯招聘</a> | <a href="http://service.qq.com/">客服中心</a> | <a href="http://www.qq.com/map/">网站导航</a></p> <p>Copyright © 1998 - 2012 Tencent. All Rights Reserved.</p> <p>腾讯公司版权所有 腾讯网网站导航.</p> </div> </footer> </div> <script> document.cookie="Icity=1000Itype=10Icon=,"; </script> </body> """ # 解释字符串 def parse_text(): htmlElement = etree.HTML(text) print(type(htmlElement)) # <class 'lxml.etree._Element'> print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))
parse_text()
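To see what etree.HTML does on a smaller input, here is a minimal sketch that feeds it an incomplete fragment. The fragment and the variable names are made up for illustration and are not part of the page above. The only assumption is that lxml is installed.

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body> wrapper and an unclosed <p>.
fragment = '<div class="nav"><p>互联首页'

# etree.HTML() uses the lenient HTML parser and returns the root element.
root = etree.HTML(fragment)

print(root.tag)                   # tag of the root element the parser created
print(root.find('body') is not None)  # whether a <body> was added around the fragment
print(root.xpath('//p/text()'))   # text recovered from the unclosed <p>
print(etree.tostring(root, encoding='utf-8').decode('utf-8'))
```

The serialized output should show the missing wrapper and closing tags filled in, which is why parse_text() above prints a complete document even though text starts at <body>.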
2. Parsing a file with the lxml library
"""Parse HTML code with the lxml library"""
from lxml import etree


# Read the HTML code from a file
def parse_file():
    # etree.parse() uses the XML parser by default, so '3_3.html' must be well-formed
    htmlElement = etree.parse('3_3.html')
    print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))


parse_file()
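One difference from the string case is the return type: etree.parse gives back a tree object, not an element. The sketch below shows this using a small temporary file, so it does not depend on 3_3.html. The sample markup and the temporary file are assumptions made for illustration.

```python
import os
import tempfile

from lxml import etree

# Hypothetical, well-formed sample page written to a temporary file.
sample = '<html><body><h1>jobs</h1></body></html>'
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write(sample)
    path = f.name

try:
    tree = etree.parse(path)   # default XML parser, returns an _ElementTree
    root = tree.getroot()      # the root _Element of that tree
    print(type(tree).__name__)
    print(root.tag)
    print(root.xpath('//h1/text()'))
finally:
    os.remove(path)            # clean up the temporary file
```

Both the tree and its root element can be passed to etree.tostring, which is why parse_file() can serialize the result of etree.parse directly.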
3. Switching the parser used to parse the content
"""Parse HTML code with the lxml library"""
from lxml import etree


def parse_lagou_file():
    # Specify the parser explicitly
    parser = etree.HTMLParser(encoding='utf-8')
    # Passing an HTML parser avoids the error
    # "lxml.etree.XMLSyntaxError: Opening and ending tag mismatch"
    # that the default XML parser raises on pages that are not well-formed
    htmlElement = etree.parse('3_3lagou.html', parser=parser)
    print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))


parse_lagou_file()
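The effect of switching parsers can be reproduced without the lagou file. The following sketch parses the same malformed markup twice, once with the default parser and once with HTMLParser. The markup and the helper names parse_strict and parse_lenient are hypothetical, and an in-memory BytesIO stands in for a file on disk.

```python
from io import BytesIO

from lxml import etree

# Hypothetical page whose <p> is never closed, so the tags do not match.
broken = '<html><body><div><p>职位</div></body></html>'.encode('utf-8')


def parse_strict(data):
    """Parse with the default XML parser; return the error text on failure."""
    try:
        etree.parse(BytesIO(data))
    except etree.XMLSyntaxError as exc:
        return str(exc)
    return None


def parse_lenient(data):
    """Parse with HTMLParser, which repairs the markup instead of raising."""
    parser = etree.HTMLParser(encoding='utf-8')
    return etree.parse(BytesIO(data), parser=parser)


error = parse_strict(broken)
tree = parse_lenient(broken)

print(error)                      # message from the XML parser
print(tree.xpath('//p/text()'))   # text recovered by the HTML parser
```

Real-world pages are rarely well-formed XML, so passing an HTMLParser to etree.parse is usually the safer default when reading downloaded HTML files.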