  • 11 Parsing HTML code with the lxml library

    1. Parsing a string with lxml

    """Parse HTML code with the lxml library."""
    
    
    from lxml import etree
    
    text = """
    <body>
    <div class="header clear">
      <div class="inner">
        <h1 class="logo_area" title="腾讯互联网业务系统招聘"><a href="index.html">腾讯互联网业务系统招聘</a></h1>
        <div class="nav">
              <ul>
                <li class="current"><a href="index.html">互联首页</a></li>
                <li><a href="jobs.html#0">互联岗位</a></li>
                <li><a href="internet_business.html">互联业务</a></li>
                <li><a href="internet_life.html">互联生活</a></li>
              </ul>
        </div>
      </div>
    </div>
        <div class="slide_wp">
            <div class="slide">
                <div class="img_area">
                    <img id="banner" src ="img/upload/banner1.jpg">
                </div>
                <div class="trans_bar"></div>
                <div class="num_area">
                    <a id="banner0" class="current" onMouseOver="stopImg(0)" onMouseOut="starImg()"></a>
                    <a id="banner1" onMouseOver="stopImg(1)" onMouseOut="starImg()"></a>
                    <a id="banner2" onMouseOver="stopImg(2)" onMouseOut="starImg()"></a>
                    <a id="banner3" onMouseOver="stopImg(3)" onMouseOut="starImg()"></a>
                </div>
            </div>
        </div>
    <div class="main">
        <div class="search_area search_area_index">
            <strong class="title">职位搜索快速通道</strong>&nbsp;&gt;&nbsp;
            城市:<select id="cityid">
                        <option value="1000">-选择工作地点-</option>
                        <option value="2218">深圳</option>
                        <option value="2156">北京</option>
                        <option value="2175">上海</option>
                        <option value="2268">成都</option>
                        <option value="2000">其他</option>
                     </select>&nbsp;
            职位类别:<select id="typeid">
                            <option value="10">-选择职位类别-</option>
                            <option value="87">技术类</option>
                            <option value="82">产品/项目类</option>
                            <option value="83">市场类</option>
                            <option value="81">设计类</option>
                            <option value="84">职能类</option>
                            <option value="85">内容编辑类</option>
                            <option value="86">客户服务类</option>
                        </select>&nbsp;
            职位名称:<input id="content" class="search_text" type="text" value="" name="content"/>&nbsp;&nbsp;
            <button onClick="Icity=document.getElementById('cityid').value;Itype=document.getElementById('typeid').value;Icon=document.getElementById('content').value;    window.location.href='jobs.html#2';document.cookie='Icity='+Icity+'Itype='+Itype+'Icon='+Icon+',';"></button>
        </div>
        <div class="block_bottom">
                <div class="block_intro">
                    <div class="com_block isd_intro">
                        <div class="title"><a class="more" href="internet_business.html"></a><strong>关于互联网业务系统</strong></div>
                        <div class="content">
                            <div class="img_area"><img src="img/upload/isd.png" /></div>
                            互联网业务系统是腾讯负责互联网社区平台和增值服务运营的业务单元,所负责的业务一直以来都是中国互联网增值服务市场的领头羊,在用户数、收入等方面都在业内遥遥领先。
                        </div>
                    </div>
                    <div class="com_block"><div class="title"><a class="more" href="jobs.html#0"></a><strong>急聘岗位</strong></div>
                        <div class="content" id="hotJob">
                            <ul>
                                <li><i>&middot;</i><a onclick="window.location.href='jobs_detail.html';document.cookie='itemId=9079.';">SJP-社交平台前台开发工程师(深圳)</a></li>
                                <li><i>&middot;</i><a onclick="window.location.href='jobs_detail.html';document.cookie='itemId=10110.';">HL6-移动应用测试组长(深圳)</a></li>
                                <li><i>&middot;</i><a onclick="window.location.href='jobs_detail.html';document.cookie='itemId=10116.';">HL7-ISD交互设计组长(深圳)</a></li>
                                <li><i>&middot;</i><a onclick="window.location.href='jobs_detail.html';document.cookie='itemId=8948.';">SZ-数媒音乐产品运营主管(深圳)</a></li>
                            </ul>
                        </div>
                    </div>
                    <div class="com_block"><div class="title"><strong>招聘动态</strong></div>
                        <div class="content">
                            <ul>
                                <li><i></i>腾讯互联网系统春季招聘火热进行中,非常欢迎对互联网感兴趣的优秀产品类、技术类、市场类人才加盟,携手互联,共筑精彩在线生活!</li>
                            </ul>
                        </div>
                    </div>
                </div>
        </div>
    </div>
    <div class="footer">
      <footer>
        <div class="inner">
            <p><a href="http://www.tencent.com/">关于腾讯</a>&nbsp;|&nbsp;<a href="http://www.tencent.com/index_e.shtml">About Tencent</a>&nbsp;|&nbsp;<a href="http://www.qq.com/contract.shtml">服务条款</a>&nbsp;|&nbsp;<a href="http://www.tencentmind.com/">广告服务</a>&nbsp;|&nbsp;<a href="http://hr.tencent.com/">腾讯招聘</a>&nbsp;|&nbsp;<a href="http://service.qq.com/">客服中心</a>&nbsp;|&nbsp;<a href="http://www.qq.com/map/">网站导航</a></p>
            <p>Copyright &copy; 1998 - 2012 Tencent. All Rights Reserved.</p>
            <p>腾讯公司版权所有 腾讯网网站导航.</p>
        </div>
      </footer>
    </div>
    <script>
        document.cookie="Icity=1000Itype=10Icon=,";
    </script>
    </body>
    """
    
    # Parse an HTML string
    def parse_text():

        # etree.HTML() parses the string with the HTML parser and returns the root element
        htmlElement = etree.HTML(text)
        print(type(htmlElement))       # <class 'lxml.etree._Element'>
        print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))

    parse_text()
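
    As a minimal sketch of what etree.HTML() does with incomplete markup: the sample string above starts directly at <body>, and the HTML parser supplies the missing wrapper. The fragment below is a made-up example, not taken from the page above; it assumes only that lxml is installed.

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body> wrapper and an unclosed <p>
fragment = '<div class="nav"><p>hello</div>'

root = etree.HTML(fragment)          # returns the root _Element
repaired = etree.tostring(root, encoding='utf-8').decode('utf-8')

print(root.tag)      # tag of the root element supplied by the parser
print(repaired)      # serialized document after the parser repaired it
```

    The root element is the <html> element added by the parser, so later lookups should start from there rather than from the first tag in the input string.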

    2. Parsing a file with lxml

    """Parse HTML code with the lxml library."""


    from lxml import etree

    # Read the HTML code from a file
    def parse_file():
        # etree.parse() uses the XML parser by default, so the file must be well-formed
        htmlElement = etree.parse('3_3.html')
        print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))


    parse_file()
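
    One detail worth noting: unlike etree.HTML(), etree.parse() returns an _ElementTree rather than an _Element, and the root element is obtained with getroot(). The sketch below illustrates this with a small well-formed document written to a temporary file; the markup and the temporary file are assumptions made for the example, standing in for '3_3.html'.

```python
import os
import tempfile
from lxml import etree

# Hypothetical well-formed document written to a temporary file
markup = '<html><body><p id="msg">hello</p></body></html>'
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write(markup)
    path = f.name

try:
    tree = etree.parse(path)         # an _ElementTree, not an _Element
    root = tree.getroot()            # the <html> root element
    print(type(tree).__name__, root.tag)
finally:
    os.remove(path)                  # clean up the temporary file
```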

    3. Parsing with a different parser

    """Parse HTML code with the lxml library."""


    from lxml import etree

    def parse_lagou_file():
        # Specify the parser explicitly
        parser = etree.HTMLParser(encoding='utf-8')
        # Passing an HTML parser avoids the error
        # lxml.etree.XMLSyntaxError: Opening and ending tag mismatch
        htmlElement = etree.parse('3_3lagou.html', parser=parser)
        print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))


    parse_lagou_file()
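
    The difference between the two parsers can be reproduced with a small sketch. The markup below is a made-up stand-in for '3_3lagou.html' (an assumption for illustration): it leaves a <p> tag unclosed, which the default XML parser rejects and the HTML parser tolerates.

```python
import os
import tempfile
from lxml import etree

# Hypothetical malformed HTML: the <p> tag is never closed
markup = '<html><body><div><p>hello</div></body></html>'
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write(markup)
    path = f.name

try:
    try:
        etree.parse(path)            # default XML parser
        xml_failed = False
    except etree.XMLSyntaxError as exc:
        xml_failed = True
        print('XML parser failed:', exc)

    # Same file, parsed with an explicit HTML parser
    parser = etree.HTMLParser(encoding='utf-8')
    tree = etree.parse(path, parser=parser)
    print(etree.tostring(tree, encoding='utf-8').decode('utf-8'))
finally:
    os.remove(path)                  # clean up the temporary file
```

    Pages saved from real sites are rarely well-formed XML, so passing an HTMLParser is usually the safer choice when reading HTML from a file.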
  • Original article: https://www.cnblogs.com/sruzzg/p/13072861.html