  • lxml.etree.HTML(text): parsing an HTML document

    0. References

    http://lxml.de/tutorial.html#the-xml-function

    There is also a corresponding function HTML() for HTML literals.

    >>> root = etree.HTML("<p>data</p>")
    >>> etree.tostring(root)
    b'<html><body><p>data</p></body></html>'

    1. Basic usage

    from lxml import etree
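    # Parses an HTML document from a string constant. Returns the root node.
    # r is assumed to be an HTTP response object (e.g. from requests) whose .text holds the page source.
    root = etree.HTML(r.text)  # <Element html at 0x7bb8208>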
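
    The snippet above depends on a response object r that is not shown. As a minimal, self-contained sketch, the same call can be tried on a string literal; html_text below is hypothetical sample markup standing in for r.text.

```python
from lxml import etree

# Hypothetical sample markup standing in for r.text.
html_text = (
    '<html><body>'
    '<button id="btn_requests">Requests Generator</button>'
    '</body></html>'
)

# etree.HTML parses the string and returns the root element.
root = etree.HTML(html_text)

# The root is the <html> element; its direct children can be iterated.
child_tags = [child.tag for child in root]
print(root.tag)
print(child_tags)
```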

    1.1 Getting text and attributes with xpath and cssselect

    In [83]: for item in root.xpath('//button')[:1]:
        ...:     print(item)
        ...:     print(item.text)                           # get the text
        ...:     print(item.xpath('./@id'))
        ...:
    <Element button at 0x84277c8>
    Requests Generator
    ['btn_requests']
    ###
    In [84]: for item in root.cssselect('button')[:1]:
        ...:     print(item)
        ...:     print(item.text)
        ...:     print(item.cssselect('::attr(id)'))        # pseudo-element syntax is not supported
        ...:
        ...:
    <Element button at 0x84277c8>
    Requests Generator
    ExpressionError: Pseudo-elements are not supported.
    ###
    In [92]: for item in root.cssselect('button')[:1]:
        ...:     print(item.get('id', ''))                  # get an attribute
    
    btn_requests
    ###
    In [93]: for item in root.cssselect('button')[:1]:
        ...:     print(item.xpath('./@id'))                 # xpath nested inside a cssselect result
        ...:
    ['btn_requests']
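
    The sessions above can be condensed into one runnable sketch. The markup is hypothetical sample data modeled on the output shown. Only xpath is used to locate the element, so the sketch does not require the separate cssselect package. Elements returned by cssselect are ordinary lxml elements, so .text and .get() should behave the same way on them.

```python
from lxml import etree

# Hypothetical sample markup modeled on the button in the sessions above.
root = etree.HTML(
    '<html><body>'
    '<button id="btn_requests">Requests Generator</button>'
    '</body></html>'
)

button = root.xpath('//button')[0]

text = button.text                    # text content of the element
ids_via_xpath = button.xpath('./@id') # attribute via XPath: always a list
id_via_get = button.get('id', '')     # attribute via get(): a single string
missing = button.get('class', '')     # absent attribute falls back to the default
```

    Note the difference in return types: xpath('./@id') yields a list, while get() yields one value or the supplied default.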

    1.2 Pretty-printing

    print(etree.tostring(root, pretty_print=True).decode('utf-8'))      # pretty-print
    # You can also serialise to a Unicode string without declaration by
    # passing the ``unicode`` function as encoding (or ``str`` in Py3),
    # or the name 'unicode'.  This changes the return value from a byte
    # string to an unencoded unicode string.
    print(etree.tostring(root, encoding=str, pretty_print=True))        # Python 3: returns text (str)
    print(etree.tostring(root, encoding=unicode, pretty_print=True))    # Python 2: returns unicode
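
    To make the bytes-versus-text distinction concrete, here is a small sketch that uses the name 'unicode' mentioned in the quoted comment, which avoids referring to the Python 2 unicode builtin. The input string is the one from the tutorial excerpt in section 0.

```python
from lxml import etree

root = etree.HTML('<p>data</p>')

# Default: tostring() returns a byte string.
as_bytes = etree.tostring(root)

# encoding='unicode' switches the return value to a text string.
as_text = etree.tostring(root, encoding='unicode')

# pretty_print=True adds line breaks and indentation.
pretty = etree.tostring(root, encoding='unicode', pretty_print=True)
print(pretty)
```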

    1.3 Automatic completion of missing tags

    In [109]: rt = etree.HTML('<html><p>123</p></html>')            # the missing <body> is added automatically
    In [110]: print(etree.tostring(rt, encoding=str, pretty_print=True))
    <html>
      <body>
        <p>123</p>
      </body>
    </html>
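
    One practical consequence of the completion shown above is that XPath expressions have to account for the inserted body element. A brief sketch using the same input:

```python
from lxml import etree

rt = etree.HTML('<html><p>123</p></html>')

# The parser inserted <body> between <html> and <p>.
serialized = etree.tostring(rt)

# An absolute path therefore has to go through /html/body.
paragraphs = rt.xpath('/html/body/p/text()')
print(serialized)
print(paragraphs)
```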

    1.4 fromstring does not accept malformed fragments and does not auto-complete

    In [115]: rt = etree.fromstring('<html><p>456</html>')           # fromstring rejects malformed fragments and does not auto-complete
    XMLSyntaxError: Opening and ending tag mismatch: p line 1 and html, line 1, column 20
    In [116]: rt = etree.fromstring('<html><p>456</p></html>')
    In [117]: print(etree.tostring(rt, encoding=str, pretty_print=True))
    <html>
      <p>456</p>
    </html>
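
    If input may or may not be well-formed, one possible approach is to try the strict parser first and fall back to etree.HTML when it raises XMLSyntaxError. parse_lenient below is a hypothetical helper written for illustration, not part of lxml. Note that the two paths return differently shaped trees.

```python
from lxml import etree


def parse_lenient(markup):
    """Hypothetical helper: strict XML parse first, HTML parser as fallback."""
    try:
        return etree.fromstring(markup)
    except etree.XMLSyntaxError:
        # The HTML parser tolerates the unclosed <p> and completes the tree.
        return etree.HTML(markup)


strict_root = parse_lenient('<html><p>456</p></html>')   # well-formed: no <body> added
lenient_root = parse_lenient('<html><p>456</html>')      # malformed: falls back to HTML

strict_children = [child.tag for child in strict_root]
lenient_text = lenient_root.xpath('//p/text()')
```

    Because the fallback tree gains a body element while the strict tree does not, relative queries such as //p are safer here than absolute paths.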


  • Original article: https://www.cnblogs.com/my8100/p/parse_html_with_lxml.html