zoukankan      html  css  js  c++  java
  • 课堂练习-爬网(2)-爬网代码

    爬网演练代码1:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://www.cnblogs.com/exesoft/p/13184331.html'
    r = requests.get(url, timeout=30)
    r.encoding = 'utf-8'
    soup = BeautifulSoup(r.text, "html.parser")  
    trs = soup.select('.exesoft-table tr')
    print(type(trs))
    print(trs)
    print("------------------------")
    for tr in trs:
        print(tr)
        print("---------------")
        tds=tr.find_all('td')
        print(tds)
        print("---------")
        for td in tds:
            print(td.string)
            print("----")

    爬网演练代码2:上述代码的改良版

    import requests
    from bs4 import BeautifulSoup
    allStudents = []
    
    def getHTMLText(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = 'utf-8'
            return r.text
        except:
            return ""
    
    def fillStudentsList(soup):
        data = soup.select('.exesoft-table tr')
        for tr in data:
            ltd = tr.find_all('td')
            if len(ltd)==0:
                continue
            singleStudent = []
            for td in ltd:
                singleStudent.append(td.string)
            allStudents.append(singleStudent)
    
    def printStudentsList():
        print(allStudents)
        print("{}  {}    {}".format("编号","姓名","分数"))
        for i in range(5):
            u=allStudents[i]
            print("{}   {}    {}".format(u[0],u[1],u[2]))
    def main():
        url = 'https://www.cnblogs.com/exesoft/p/13184331.html'
        html = getHTMLText(url)
        soup = BeautifulSoup(html, "html.parser")
        fillStudentsList(soup)
        printStudentsList()
    main()

     代码运行效果:

    小结:

    爬网可以看成一个"剥蒜"过程.最终的目标是要获取一个带有少量皮的蒜,还是一颗纯净的蒜,这个由业务需求决定。由于爬网得来的数据结构列表居多,所以可以通过循环结构层层剥离外面的包装(即:Html标签),得到相应的数据。

  • 相关阅读:
    哈尔特征Haar
    【python】发送post请求
    【svn】SSL error: A TLS warning alert has been received的解决方法
    【python】tarfile的路径问题
    【python】nuitka封装python
    【linux】scp命令
    【mongo】聚合相关资料
    【python】with的实现方法
    【mongo】mongoVUE使用
    【http】四种常见的 POST 提交数据方式
  • 原文地址:https://www.cnblogs.com/exesoft/p/13185884.html
Copyright © 2011-2022 走看看