  • html5lib-python doc

    http://html5lib.readthedocs.org/en/latest/

    Overview

    html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

    Usage

    Simple usage follows this pattern:

    import html5lib
    with open("mydocument.html", "rb") as f:
        document = html5lib.parse(f)
    

    or:

    import html5lib
    document = html5lib.parse("<p>Hello World!")
    

    By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x).
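
    To make the default output concrete, here is a minimal sketch of reading text out of the returned element tree. It relies on html5lib's default behaviour of placing HTML elements in the XHTML namespace. The helper names paragraph_texts and parse_paragraphs are hypothetical and not part of html5lib.

```python
import xml.etree.ElementTree as ET

# By default html5lib places HTML elements in the XHTML namespace, so
# ElementTree tag names look like "{http://www.w3.org/1999/xhtml}p".
XHTML_NS = "http://www.w3.org/1999/xhtml"


def paragraph_texts(document):
    """Return the text content of every <p> element in an etree document."""
    tag = ET.QName(XHTML_NS, "p").text  # "{namespace}p"
    return ["".join(p.itertext()) for p in document.iter(tag)]


def parse_paragraphs(html):
    """Parse HTML with html5lib, then collect its paragraph texts."""
    import html5lib  # imported lazily so paragraph_texts() works without it
    return paragraph_texts(html5lib.parse(html))
```

    With html5lib installed, parse_paragraphs("<p>Hello World!") is expected to return a one-item list holding "Hello World!".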

    Two other tree types are supported: xml.dom.minidom and lxml.etree. To use an alternative format, specify the name of a treebuilder:

    import html5lib
    with open("mydocument.html", "rb") as f:
        lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
    

    When using html5lib with urllib2 (Python 2), the charset from the HTTP headers should be passed into html5lib as follows:

    from contextlib import closing
    from urllib2 import urlopen
    import html5lib
    
    with closing(urlopen("http://example.com/")) as f:
        document = html5lib.parse(f, encoding=f.info().getparam("charset"))
    

    When using html5lib with urllib.request (Python 3), the charset from the HTTP headers should be passed into html5lib as follows:

    from urllib.request import urlopen
    import html5lib
    
    with urlopen("http://example.com/") as f:
        document = html5lib.parse(f, encoding=f.info().get_content_charset())
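
    On Python 3, f.info() returns an http.client.HTTPMessage, which is a subclass of email.message.Message, so the charset lookup can be tried out without any network access. The sketch below uses only the standard library. The helper name charset_from_content_type is hypothetical.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    headers = Message()
    headers["Content-Type"] = content_type
    # Same method the urllib.request example calls on f.info();
    # the result is lower-cased, and None when no charset is declared.
    return headers.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

    When the server declares no charset the result is None, and html5lib then has to determine the encoding on its own. Note also that html5lib 1.0 and later name this keyword transport_encoding rather than encoding, so the call above may need adjusting on newer releases.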
    

    To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:

    import html5lib
    with open("mydocument.html", "rb") as f:
        parser = html5lib.HTMLParser(strict=True)
        document = parser.parse(f)
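
    In strict mode the first parse error is raised as html5lib.html5parser.ParseError. The sketch below catches it and returns the message. The helper name first_parse_error is hypothetical, and the import is deferred so the definition loads even where html5lib is missing.

```python
def first_parse_error(html):
    """Return the first parse error message for html, or None if there is none."""
    import html5lib
    from html5lib.html5parser import ParseError

    parser = html5lib.HTMLParser(strict=True)
    try:
        parser.parse(html)
    except ParseError as exc:
        # strict=True stops at the first error instead of recovering.
        return str(exc)
    return None
```

    For example, first_parse_error("<p>Hello World!") reports an error because the input has no DOCTYPE, even though the non-strict parser accepts the same input without complaint.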
    

    When you’re instantiating parser objects explicitly, pass a treebuilder class as the tree keyword argument to use an alternative document format:

    import html5lib
    parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
    minidom_document = parser.parse("<p>Hello World!")
    

    More documentation is available at http://html5lib.readthedocs.org/.

    Installation

    html5lib works on CPython 2.6+, CPython 3.2+ and PyPy. To install it, use:

    $ pip install html5lib
    

    Optional Dependencies

    The following third-party libraries may be used for additional functionality:

    • datrie can be used to improve parsing performance (though in almost all cases the improvement is marginal);
    • lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy, where it is known to cause segfaults);
    • genshi has a treewalker (but not a builder);
    • charade can be used as a fallback when the character encoding cannot be determined; chardet, from which it was forked, can also be used on Python 2; and
    • ordereddict can be used under Python 2.6 (collections.OrderedDict is used instead on later versions) to serialize attributes in alphabetical order.
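
    To see which of these optional packages are importable in the current environment, a small standard-library check is enough. This is a sketch for Python 3 and is not part of html5lib. The name installed_optional_packages is hypothetical, and ordereddict is left out because it only matters on Python 2.6.

```python
from importlib.util import find_spec

# Import names of the optional packages listed above.
OPTIONAL_PACKAGES = ("datrie", "lxml", "genshi", "charade", "chardet")


def installed_optional_packages(names=OPTIONAL_PACKAGES):
    """Return the names from the given sequence that can be imported here."""
    # find_spec() returns None for a top-level module that is not installed.
    return [name for name in names if find_spec(name) is not None]


print(installed_optional_packages())
```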

    Bugs

    Please report any bugs on the issue tracker.

    Tests

    Unit tests require the nose library and can be run using the nosetests command in the root directory; ordereddict is required under Python 2.6. All should pass.

    Test data are contained in a separate html5lib-tests repository and included as a submodule, so for git checkouts the submodule must be initialized:

    $ git submodule init
    $ git submodule update

    If you have all compatible Python implementations available on your system, you can run tests on all of them using the tox utility, which can be found on PyPI.

    Questions?

    There’s a mailing list available for support on Google Groups, html5lib-discuss, though you may get a quicker response asking on IRC in #whatwg on irc.freenode.net.

  • Original article: https://www.cnblogs.com/blfshiye/p/4052830.html