  • Using Beautiful Soup

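    Beautiful Soup is a popular companion tool that helps Python programs make sense of the messy, "mostly HTML" content found on web sites. It is an HTML/XML parser written in Python that copes well with malformed markup and builds a parse tree from it.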
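As a small illustration of the "malformed markup in, parse tree out" idea, here is a minimal sketch. It assumes Python 3 and the newer `bs4` package (beautifulsoup4) rather than the BeautifulSoup 3 import used in this post's listing; the sample markup and the variable names are invented for the example.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

# Deliberately sloppy input: unquoted attribute value, <span> never closed,
# and no surrounding <html>/<body> elements.
sloppy = '<div class=director>Jane Doe<span class="movie_title">Moose Tracks</div>'

# "html.parser" is the parser that ships with the Python standard library
soup = BeautifulSoup(sloppy, "html.parser")

# Navigate the resulting parse tree as if the markup had been well formed
director = str(soup.find("div", {"class": "director"}).contents[0])
title = soup.find("span", {"class": "movie_title"}).get_text()

print(director, "|", title)
```

The parser closes the dangling `<span>` when it meets `</div>`, so both elements can be looked up by tag name and class in the usual way.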

    Using Beautiful Soup to produce tidy data from disorderly content

    				
    # Python 2 / BeautifulSoup 3 syntax, as in the original listing
    from glob import glob
    from BeautifulSoup import BeautifulSoup
    
    def process():
        # Header row of the CSV output
        print "!MOVIE,DIRECTOR,KEY_GRIP,THE_MOOSE"
        for fname in glob('result_*'):
            # Put that sloppy HTML into the soup
            soup = BeautifulSoup(open(fname))
    
            # Try to find the fields we want, but default to unknown values.
            # [1] selects the second matching element.
            try:
                movie = soup.findAll('span', {'class':'movie_title'})[1].contents[0]
            except IndexError:
                movie = "UNKNOWN"
    
            try:
                director = soup.findAll('div', {'class':'director'})[1].contents[0]
            except IndexError:
                director = "UNKNOWN"
    
            try:
                # Maybe multiple grips listed, key one should be in there
                grips = soup.findAll('p', {'id':'grip'})[0].contents[0]
                grips = " ".join(grips.split())   # Normalize extra spaces
            except IndexError:
                grips = "UNKNOWN"
    
            try:
                # Hide some stuff in the HTML <meta> tags
                moose = soup.findAll('meta', {'name':'shibboleth'})[0]['content']
            except IndexError:
                moose = "UNKNOWN"
    
            print '"%s","%s","%s","%s"' % (movie, director, grips, moose)
    
    if __name__ == '__main__':
        process()
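The listing's final print statement builds each CSV row by hand, so a field that itself contains a double quote would produce a malformed row. One possible alternative, sketched here under the assumption of Python 3, is to let the standard `csv` module do the quoting. `format_row` is a hypothetical helper name and the sample values are invented.

```python
import csv
import io

def format_row(fields):
    """Return one CSV line with every field quoted and embedded quotes doubled."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(fields)
    return buf.getvalue().rstrip("\n")

# Invented sample values; in the listing these would be movie, director, grips, moose
row = format_row(['The "Big" Moose', "Jane Doe", "UNKNOWN", "antlers, large"])
print(row)
```

With `csv.QUOTE_ALL` the output keeps the same all-fields-quoted shape as the hand-built format string, while quotes and commas inside a value no longer break the row.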

    For details, see: http://www.crummy.com/software/BeautifulSoup/documentation.zh.html

    A similar library is PyQuery; see http://packages.python.org/pyquery/

  • Original post: https://www.cnblogs.com/djcsch2001/p/2105645.html