  • A targeted crawler for Chinese university rankings

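    This post shows a crawler that scrapes the 2019 ranking of Chinese universities from the Zuihaodaxue ("Best Universities") website and stores the data in MySQL: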

    import bs4
    import pymysql
    import requests
    from bs4 import BeautifulSoup

    # Connect to the database and (re)create the ranking table.
    # PyMySQL 1.0+ only accepts keyword arguments in connect().
    db = pymysql.connect(host='localhost', user='root', password='password',
                         database='universityrankings', charset='utf8mb4')
    cursor = db.cursor()
    cursor.execute('drop table if exists unranking2019')
    sql = """
    create table unranking2019
    (
    paiming INTEGER,                 -- rank
    xuexiaomingchen VARCHAR(40),     -- university name
    shengshi VARCHAR(40),            -- province / city
    zongfen VARCHAR(40),             -- total score
    shengyuanzhiliang VARCHAR(40),   -- student quality
    peiyangjieguo VARCHAR(40),       -- training outcomes
    shehuishengyu VARCHAR(40),       -- social reputation
    keyanguimo VARCHAR(40),          -- research scale
    keyanzhiliang VARCHAR(40),       -- research quality
    dingjianchengguo VARCHAR(40),    -- top achievements
    dingjianrencai VARCHAR(40),      -- top talent
    kejifuwu VARCHAR(40),            -- technology services
    chengguozhuanhua VARCHAR(40),    -- achievement conversion
    xueshengguojihua VARCHAR(40),    -- student internationalization
    primary key(xuexiaomingchen)
    );
    """
    cursor.execute(sql)


    def getHTMLText(url):
        """Fetch a page and return its text, or "" if the request fails."""
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return ""


    def fillUnivlist(ulist, html):
        """Parse the ranking table into ulist and insert every row into MySQL."""
        soup = BeautifulSoup(html, "html.parser")
        tbody = soup.find('tbody')
        if tbody is None:  # empty page or unexpected layout: nothing to store
            return
        for tr in tbody.children:
            if isinstance(tr, bs4.element.Tag):
                tds = tr.find_all('td')
                if len(tds) < 14:  # skip rows without the 14 expected columns
                    continue
                # .string is None when a cell contains nested tags (stored as NULL)
                ulist.append([td.string for td in tds[:14]])
        sql = """
            INSERT INTO universityrankings.unranking2019 values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
            """
        cursor.executemany(sql, ulist)
        db.commit()


    def printUnivList(ulist, num):
        """Print a header line and at most num rows, tab-separated."""
        tplt = "\t".join(["{}"] * 14)
        print(tplt.format("Rank", "University", "Province/City", "Total score", "Student quality",
                          "Training outcomes", "Social reputation", "Research scale", "Research quality",
                          "Top achievements", "Top talent", "Technology services", "Achievement conversion",
                          "Student internationalization"))
        # Slicing avoids an IndexError when fewer than num rows were scraped
        for u in ulist[:num]:
            print(tplt.format(*u))


    def main():
        uinfo = []
        url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
        html = getHTMLText(url)
        fillUnivlist(uinfo, html)
        printUnivList(uinfo, 549)
        cursor.close()
        db.close()


    if __name__ == '__main__':
        main()
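
    The script above passes the scraped cells straight to the INSERT statement. As a rough sketch of how rows could be checked and stored in one batch, the example below cleans each row and uses executemany. It is not part of the original post: normalize_row and store_rows are hypothetical helpers, the sample rows are placeholder data, and an in-memory sqlite3 database with generic column names stands in for MySQL so the example runs without a server. sqlite3 uses ? as its placeholder where pymysql uses %s.

```python
import sqlite3

N_COLUMNS = 14  # the ranking table in the script has 14 columns


def normalize_row(cells):
    """Turn one scraped row into a tuple ready for a parameterized INSERT.

    Hypothetical helper: strips whitespace, maps missing or empty cells to
    None (stored as NULL), and converts the rank in column 0 to int.
    Returns None for rows that are too short or whose rank is not an integer.
    """
    if len(cells) < N_COLUMNS:
        return None
    cleaned = []
    for cell in cells[:N_COLUMNS]:
        text = str(cell).strip() if cell is not None else ""
        cleaned.append(text or None)
    try:
        cleaned[0] = int(cleaned[0])
    except (TypeError, ValueError):
        return None
    return tuple(cleaned)


def store_rows(conn, rows):
    """Insert all valid rows in one batch and return how many were stored."""
    valid = [r for r in map(normalize_row, rows) if r is not None]
    placeholders = ",".join(["?"] * N_COLUMNS)  # pymysql would use %s here
    conn.executemany(f"INSERT INTO unranking2019 VALUES ({placeholders})", valid)
    conn.commit()
    return len(valid)


# In-memory stand-in for the MySQL table (assumption: generic column names).
conn = sqlite3.connect(":memory:")
columns = ", ".join(f"c{i}" for i in range(N_COLUMNS))
conn.execute(f"CREATE TABLE unranking2019 ({columns})")

# Placeholder rows, not real ranking data.
sample = [
    ["1", " University A ", "Province X"] + ["90.0"] * 11,
    ["2", "University B", "Province Y", None] + ["80.0"] * 10,  # missing cell
    ["n/a", "broken row"],  # too short, skipped
]
stored = store_rows(conn, sample)
print(stored, "rows stored")
```

    Validating rows before the insert keeps one malformed table row from aborting the whole batch. With pymysql the same pattern would call cursor.executemany with %s placeholders.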
                  
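    Notice: Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be given in a prominent place on the page; otherwise the right to pursue legal liability is reserved.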
  • Original article: https://www.cnblogs.com/lsyb-python/p/11801576.html