  • University Rankings

    A Python crawler that extracts content from a page's HTML source

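    The requests library fetches the page source:

      requests.get(url) returns a Response object (called r here); r.text holds the page source as a decoded string

      r.raise_for_status() checks whether the request succeeded: it raises requests.HTTPError for any 4xx or 5xx status code, which the except clause then catches, and does nothing for a successful response such as 200

      r.encoding = r.apparent_encoding replaces the encoding taken from the HTTP headers with the one detected from the response body, so that r.text decodes correctly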
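
    The behaviour of raise_for_status() can be checked without touching the network. The following is a minimal sketch, assuming it is acceptable to build a bare requests.Response by hand purely for illustration; the helper name status_is_ok is hypothetical and not part of requests.

```python
import requests


def status_is_ok(status_code: int) -> bool:
    """Return True if raise_for_status() accepts this status code."""
    # Build a bare Response by hand so that no network access is needed;
    # a real program would get the object from requests.get(url) instead.
    response = requests.Response()
    response.status_code = status_code
    try:
        response.raise_for_status()
    except requests.HTTPError:
        # Raised for 4xx (client error) and 5xx (server error) codes.
        return False
    return True


if __name__ == "__main__":
    for code in (200, 204, 301, 404, 500):
        print(code, status_is_ok(code))
```

    Only the 404 and 500 cases are rejected here, which is why the program below can rely on a single except clause to handle failed requests.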

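    BeautifulSoup parses the HTML source:

      Song Tian (嵩天) of Beijing Institute of Technology explains it very well in his course on China University MOOC (中国大学MOOC):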

      http://www.icourse163.org/learn/BIT-1001870001?tid=1001962001#/learn/content?type=detail&id=1002702161&cid=1003064638

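    The program also relies heavily on formatted output with str.format(), and pads Chinese text with the full-width space chr(12288) so that the columns line up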
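
    To see what the full-width space changes, here is a small sketch that centers the same strings with the default ASCII space and with chr(12288). The helper center_cjk and the sample names are illustrative assumptions, not part of the original program.

```python
# Full-width (ideographic) space, U+3000; it is as wide as one CJK character.
FULLWIDTH_SPACE = chr(12288)


def center_cjk(text: str, width: int) -> str:
    """Center text in a field of `width` characters, padding with U+3000."""
    # The fill character is passed as a nested replacement field, the same
    # technique as "{1:{5}^10}" in the program's template.
    return "{0:{1}^{2}}".format(text, FULLWIDTH_SPACE, width)


if __name__ == "__main__":
    # Sample names chosen only to show different string lengths.
    for name in ["清华大学", "北京大学", "中国科学技术大学"]:
        print("[" + "{0:^10}".format(name) + "]")   # padded with ASCII spaces
        print("[" + center_cjk(name, 10) + "]")     # padded with full-width spaces
```

    Both variants produce strings of the same length in characters; the difference is visual. In a fixed-width font an ASCII space is half as wide as a Chinese character, so only the full-width padding keeps the right-hand bracket in the same column for names of different lengths.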

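    import requests
    import bs4
    from bs4 import BeautifulSoup

    def Gethtml(url):
      # Fetch the page and return its text, or "" if the request fails.
      try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
      except requests.RequestException:
        return ""

    def BaocunList(ulist, html):
      # Append the first five cells of every table row to ulist.
      soup = BeautifulSoup(html, "html.parser")
      tbody = soup.find('tbody')
      if tbody is None:  # empty page or no table: nothing to save
        return
      for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):  # skip children that are not tags, e.g. whitespace strings
          tds = tr('td')
          if len(tds) >= 5:
            ulist.append([td.string for td in tds[:5]])

    def PrintList(ulist, num):
      # chr(12288), the full-width space, is the fill character for the name column.
      tplt = "{0:^10} {1:{5}^10} {2:^10} {3:^10} {4:^10}"
      # Headers: rank, university name, province/city, total score, student quality (gaokao score)
      print(tplt.format("排名","学校名称","省市","总分","生源质量(高考成绩)",chr(12288)))
      for u in ulist[:num]:  # slicing avoids an IndexError when fewer than num rows were saved
        print(tplt.format(u[0],u[1],u[2],u[3],u[4],chr(12288)))

    def main():
      url = 'http://zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
      r = Gethtml(url)
      List = []
      BaocunList(List,r)
      PrintList(List,50)

    main()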
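
    The isinstance check in BaocunList is needed because .children yields the whitespace between tags as well as the tags themselves. The sketch below shows this on a small made-up table; the markup and the function name extract_rows are assumptions for illustration and do not reproduce the real ranking page.

```python
import bs4
from bs4 import BeautifulSoup

# Made-up sample markup for illustration; not the real ranking page.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>1</td><td>University A</td><td>City X</td></tr>
    <tr><td>2</td><td>University B</td><td>City Y</td></tr>
  </tbody>
</table>
"""


def extract_rows(html: str) -> list:
    """Return the cell texts of every <tr> directly under <tbody>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for child in soup.find("tbody").children:
        # Newlines and indentation between the <tr> tags show up as
        # NavigableString children, so keep only real tags.
        if isinstance(child, bs4.element.Tag):
            # child("td") is shorthand for child.find_all("td").
            rows.append([td.string for td in child("td")])
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```

    Without the type check, calling a whitespace string like a tag would raise a TypeError on the first child of the table body.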

  • Original article: https://www.cnblogs.com/tianxxl/p/7655558.html