  • PyKHTML, a Python interface to KHTML

    PyKHTML is...

    A Python module for writing website scrapers/spiders. Whereas traditional approaches have you write the HTML- and form-parsing code yourself, PyKHTML uses the excellent KHTML engine to do all the drudge work. It therefore handles webpages very well (even the severely crufty ones) and is pretty darn fast (the engine is implemented in C++). As a bonus the module handles JavaScript and cookies transparently. Hurrah!

    How?

    PyKHTML requires PyKDE 3 (and hence in turn PyQt 3 + KDE libs). If you would like to run PyKHTML on servers without an X display then Xvfb is required. Fortunately these requirements should come bundled with most modern Linux distributions, and support for Windows/Mac should appear in the next few months.
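
To make the Xvfb requirement concrete, here is a rough sketch of a launcher that starts a virtual X server and runs a scraper with DISPLAY pointed at it. It is not part of PyKHTML: the function names, the display number `:99`, the screen geometry and the script name `scraper.py` are all assumptions chosen for illustration, and only the standard `Xvfb` command-line form (`Xvfb :N -screen 0 WxHxD`) is relied on.

```python
import os
import subprocess
import time


def xvfb_command(display=":99", screen="1024x768x24"):
    """Build the argument list that starts a virtual X server."""
    return ["Xvfb", display, "-screen", "0", screen]


def headless_env(display=":99", base_env=None):
    """Copy an environment and point DISPLAY at the virtual server."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = display
    return env


def run_headless(script_args, display=":99"):
    """Start Xvfb, run the given command under it, then shut Xvfb down."""
    xvfb = subprocess.Popen(xvfb_command(display))
    try:
        # Crude: give the server a moment to come up. A sturdier launcher
        # would poll until the display actually accepts connections.
        time.sleep(1.0)
        return subprocess.call(script_args, env=headless_env(display))
    finally:
        xvfb.terminate()
        xvfb.wait()
```

On a server this might be called as `run_headless(["python", "scraper.py"])`, where `scraper.py` is a hypothetical script along the lines of the example further down. Nothing runs at import time, so the helpers can be inspected without Xvfb installed.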

    Show me some code

    Okay. Here is an example (one of many examples included in the bundle) that scrapes the title and navigation from this page, with excessive commenting to give you a feel of what programming with PyKHTML is like:

    import pykhtml
    
    PyKHTMLUrl = "http://paul.giannaros.org/pykhtml"
    
    def extractBitsFromPage(browser):
        # getElementsByTagName returns a generator, so we convert
        # to a list and access the first element
        title = list(browser.document.getElementsByTagName("title"))[0]
        print "Title:", title.text
        # Get the text of the navigation items
        navigation = []
        # First get the container of the list items...
        navigationElement = browser.document.getElementById("navigation")
        # ... and then loop over the li elements we find
        for listItem in navigationElement.getElementsByTagName("li"):
            # Inside the list item is an anchor
            anchor = listItem.children[0]
            # And the text inside the anchor is what we want
            navigation.append(anchor.text)
        print "Navigation:", " | ".join(navigation)
        # Stop here, we're done
        pykhtml.stopEventLoop()
    
    def main():
        browser = pykhtml.Browser()
        # the browser is passed as a parameter to extractBitsFromPage
        # when it is called (when the page has loaded)
        browser.load(PyKHTMLUrl, extractBitsFromPage)
        # kick things off
        pykhtml.startEventLoop()
    
    main()
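
The example above loads a single page and stops the event loop from inside its callback. The same callback style can, in principle, be chained to visit several pages one after another. The sketch below shows only that control flow and deliberately avoids importing pykhtml: `crawl`, `on_page` and `fake_load` are hypothetical names, and whether one `Browser` can be reused for successive `load` calls is an assumption that the page above does not confirm.

```python
def crawl(urls, load, on_page, stop):
    """Visit each URL in turn, then call stop() once none are left.

    load(url, callback)   -- starts a page load and later calls
                             callback(browser), like browser.load above
    on_page(url, browser) -- hypothetical handler that does the scraping
    stop()                -- ends the event loop
    """
    pending = list(urls)

    def load_next():
        if not pending:
            stop()
            return
        url = pending.pop(0)

        def page_loaded(browser):
            on_page(url, browser)
            load_next()

        load(url, page_loaded)

    load_next()


# Stand-in loader so the control flow can be exercised without KHTML:
# it "loads" instantly and hands the callback a dummy browser object.
def fake_load(url, callback):
    callback({"url": url})


visited = []
stopped = []
crawl(["http://example.com/a", "http://example.com/b"],
      fake_load,
      lambda url, browser: visited.append(url),
      lambda: stopped.append(True))
```

With the real module the wiring would presumably be `crawl(urls, browser.load, handler, pykhtml.stopEventLoop)` followed by `pykhtml.startEventLoop()`, with `handler` doing the kind of DOM walking shown in `extractBitsFromPage`.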
    

    Note of Thanks

    Gambit Research, a software company in West London, sponsor PyKHTML development.

  • Original article: https://www.cnblogs.com/lexus/p/2487218.html