  • Python scraping example 1: fixing problems in the Chinese university ranking scraper

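    This program is derived from Song Tian's example that scrapes the Chinese university rankings. Because the web page has since changed, the original program no longer runs correctly. This version makes a few modifications to the original so that it runs, and documents how the problems were handled.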

    Target URL: https://www.shanghairanking.cn/rankings/bcur/2020

    Modified source code:

    import requests
    import bs4
    from bs4 import BeautifulSoup


    def get_html_text(url):
        """Fetch the page and return its text, or "" if the request fails."""
        try:
            r = requests.get(url, timeout=40)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return ""


    def fill_univ_list(ulist, html):
        """Append [rank, name, province, score, level] for each table row."""
        soup = BeautifulSoup(html, "html.parser")
        tbody = soup.find('tbody')
        if tbody is None:  # empty or unexpected page: nothing to parse
            return
        for tr in tbody.children:
            if isinstance(tr, bs4.element.Tag):
                tds = tr('td')
                # .text rather than .string, because cells may contain nested tags
                ulist.append([tds[i].text.strip() for i in (0, 1, 2, 4, 5)])


    def print_univ_list(ulist, num):
        # chr(12288) is the full-width space, used as fill for the Chinese columns
        tplt = "{0:^10}\t{1:{5}^10}\t{2:{5}^10}\t{3:^10}\t{4:{5}^10}"
        print(tplt.format("排名", "学校", "省市", "得分", "教学层次", chr(12288)))
        for u in ulist[:num]:
            print(tplt.format(u[0], u[1], u[2], u[3], u[4], chr(12288)))


    def main():
        uinfo = []
        url = 'https://www.shanghairanking.cn/rankings/bcur/2020'
        html = get_html_text(url)
        fill_univ_list(uinfo, html)
        print_univ_list(uinfo, 20)


    main()
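The chr(12288) passed to format in the code above is the full-width (ideographic) space, U+3000. The following is a minimal sketch of why it is useful as a fill character: padding Chinese text with ordinary half-width spaces tends to misalign columns in a monospaced terminal, while full-width spaces are about as wide as the characters they pad. The helper name pad_cjk and the sample names are illustrative assumptions, not part of the original program.

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000, about as wide as one CJK character


def pad_cjk(text, width):
    """Center text in `width` characters, filling with full-width spaces."""
    # {1} supplies the fill character and {2} the field width
    return "{0:{1}^{2}}".format(text, FULLWIDTH_SPACE, width)


for name in ["清华大学", "中国科学技术大学"]:
    print("|" + pad_cjk(name, 10) + "|")
```

Columns holding digits or Latin text can keep the default half-width fill, which is why only some fields need the extra fill argument.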


    Output: [screenshot]

    Troubleshooting:

    Song Tian's original program:

    import requests
    from bs4 import BeautifulSoup
    import bs4

    def getHTMLText(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except:
            return ""

    def fillUnivList(ulist, html):
        soup = BeautifulSoup(html, "html.parser")
        for tr in soup.find('tbody').children:
            if isinstance(tr, bs4.element.Tag):
                tds = tr('td')
                ulist.append([tds[0].string, tds[1].string, tds[3].string])

    def printUnivList(ulist, num):
        tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
        print(tplt.format("排名","学校名称","总分",chr(12288)))
        for i in range(num):
            u=ulist[i]
            print(tplt.format(u[0],u[1],u[2],chr(12288)))

    def main():
        uinfo = []
        url = 'https://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
        html = getHTMLText(url)
        fillUnivList(uinfo, html)
        printUnivList(uinfo, 20) # 20 univs

    main()

    Running it unchanged fails with the following error: [screenshot]

    Fix: replace the page URL in the original code with https://www.shanghairanking.cn/rankings/bcur/2020 (the latest working address at the time of writing).
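One plausible way a dead URL turns into a crash, sketched under assumptions: getHTMLText swallows the request error and returns an empty string, BeautifulSoup then finds no tbody, and soup.find('tbody') returns None, so accessing .children on it fails. The helper find_rows below is hypothetical and only illustrates guarding against that case. It is not the fix used in this post.

```python
from bs4 import BeautifulSoup


def find_rows(html):
    """Return the <tr> tags inside the first <tbody>, or [] if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")
    if tbody is None:
        # Empty or unexpected page: report "no rows" instead of crashing
        return []
    return tbody.find_all("tr")


# An empty string stands in for the result of a failed download
print(len(find_rows("")))
print(len(find_rows("<table><tbody><tr><td>1</td></tr></tbody></table>")))
```

With a guard like this, a failed download shows up as an empty result that the caller can check, rather than as an exception raised far from the real cause.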

    After changing the URL, running it fails with another error: [screenshot]

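    Fix:

    I could not find a clean solution. Several attempts failed, so in the end I took the direct route and replaced this line in the code:

     ulist.append([tds[0].string, tds[1].string, tds[3].string])

    with:

     ulist.append([tds[0].text, tds[1].text, tds[2].text])

    after which the program produces results. Screenshot of the run: [screenshot]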
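A likely reason the switch from .string to .text helps, shown as a minimal sketch: in Beautiful Soup, .string is None when a tag contains more than one child node, while .text concatenates all nested text. The HTML fragment below is a made-up stand-in for a table row with nested tags. The real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical row: the second cell wraps its text in nested tags
html = (
    "<table><tbody><tr>"
    "<td> 1 </td>"
    "<td><div><a>清华大学</a><span>Tsinghua</span></div></td>"
    "</tr></tbody></table>"
)

tds = BeautifulSoup(html, "html.parser").find("tr")("td")

print(repr(tds[0].string))        # a cell with plain text has a .string
print(repr(tds[1].string))        # nested tags with several children: None
print(repr(tds[1].text.strip()))  # .text joins all nested text

# Formatting None with an alignment spec raises TypeError
try:
    "{:^10}".format(tds[1].string)
except TypeError as exc:
    print("TypeError:", exc)
```

Because .text includes surrounding whitespace from the markup, calling .strip() on it, as the modified program does, keeps the printed columns tidy.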

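    Note: this post focuses on solving the problems. What to scrape and how to format it is up to you. If anything here is wrong, corrections are welcome.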

     

     
  • Original post: https://www.cnblogs.com/2210633591zhang/p/13960748.html