  • Using a web crawler

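     First, I crawl the "Best Chinese Universities" ranking site. The code is below (because my student number is 06, the page I crawl is the 2017 ranking).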

    # -*- coding: utf-8 -*-
    """
    Created on Wed May 22 22:34:04 2019

    @author: m1353
    """

    import requests
    from bs4 import BeautifulSoup

    alluniv = []  # one list of cell strings per university row


    def getHTMLText(url):
        """Fetch url and return its text, or "" if the request fails."""
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = 'utf-8'
            return r.text
        except requests.RequestException:
            return ""


    def fillunivlist(soup):
        """Collect the <td> cells of every table row into alluniv."""
        data = soup.find_all('tr')
        for tr in data:
            ltd = tr.find_all('td')
            if len(ltd) == 0:  # header rows contain <th>, not <td>
                continue
            singleuniv = []
            for td in ltd:
                singleuniv.append(td.string)
            alluniv.append(singleuniv)


    def printunivlist(num):
        """Print the first num rows; chr(12288) is the full-width space used as fill."""
        # Column headers: rank, university name, province, total score, training scale.
        print("{1:^2}{2:{0}^10}{3:{0}^6}{4:{0}^4}{5:{0}^10}".format(chr(12288), "排名", "学校名字", "省份", "总分", "培养规模"))
        # Never index past the rows that were actually collected.
        for i in range(min(num, len(alluniv))):
            u = alluniv[i]
            # float() replaces eval(): same result for numeric text, without running arbitrary code.
            print("{1:^4}{2:{0}^10}{3:{0}^5}{4:{0}^8.1f}{5:{0}^10}".format(chr(12288), u[0], u[1], u[2], float(u[3]), u[6]))


    def main(num):
        url = "http://www.zuihaodaxue.cn/zuihaodaxuepaiming2017.html"
        html = getHTMLText(url)
        soup = BeautifulSoup(html, "html.parser")
        fillunivlist(soup)
        printunivlist(num)


    main(10)

    The argument to main is the number of ranking entries to print.
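    The format strings in printunivlist pass chr(12288), the full-width ideographic space U+3000, as the fill character so that columns of Chinese text line up. A minimal sketch of that one trick, in isolation: the helper name center_cjk is hypothetical and does not appear in the code above.

```python
# Hypothetical helper (not in the original code): centre a string in a
# fixed-width column, padding with the full-width space U+3000 so that
# the padding is as wide as a Chinese character.
FULLWIDTH_SPACE = chr(12288)


def center_cjk(text, width):
    # "{0:{1}^{2}}" means: format argument 0, fill with argument 1,
    # centre-align (^), to a total width of argument 2 characters.
    return "{0:{1}^{2}}".format(text, FULLWIDTH_SPACE, width)


cell = center_cjk("清华大学", 10)
print(repr(cell))
print(len(cell))  # 10: four characters of text plus six fill characters
```

    With an ordinary ASCII space as the fill, each padding character would be about half as wide as a Chinese character in most monospaced fonts, so the columns would drift.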

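    The result of running it is shown below.

    It did not succeed, but my code is not wrong... because I can crawl the 2016 and 2019 pages. Screenshots follow.

    The 2016 result: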

    The only change was one digit in the page URL.
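    Since only the year in the URL changes between runs, the address can be built from the year instead of being edited by hand. A small sketch, assuming every year's page follows the same pattern as the 2017 URL used above; the names URL_TEMPLATE and ranking_url are made up for this example.

```python
# Assumption: each year's ranking page has the same address as the 2017
# page from the code above, with only the year changed.
URL_TEMPLATE = "http://www.zuihaodaxue.cn/zuihaodaxuepaiming{year}.html"


def ranking_url(year):
    """Return the ranking page URL for the given year (hypothetical helper)."""
    return URL_TEMPLATE.format(year=year)


for year in (2016, 2017, 2019):
    print(ranking_url(year))
```

    main could then take the year as a second parameter and call ranking_url(year) instead of hard-coding the address.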

    Then the 2019 result, shown in the figure below.

    Well, both of them work, so I am innocent =3=
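    The post does not say why the 2017 page fails. One possible cause, offered only as a guess, is a score cell that is empty or not a plain number: td.string is None when a cell contains more than one child node, and converting None or non-numeric text to a number raises an exception. A defensive conversion might look like the sketch below; parse_score and the sample rows are hypothetical and are not taken from the real page.

```python
def parse_score(cell):
    """Return the cell text as a float, or None if it is missing or non-numeric.

    Hypothetical helper, not part of the original code.
    """
    if cell is None:  # e.g. td.string for a cell with nested tags
        return None
    try:
        return float(cell.strip())
    except ValueError:  # text that is not a number
        return None


# Made-up rows for illustration only: rank, name, province, score.
sample_rows = [
    ["1", "University A", "Province X", "94.0"],
    ["2", "University B", "Province Y", None],
    ["3", "University C", "Province Z", "n/a"],
]

for row in sample_rows:
    score = parse_score(row[3])
    print(row[1], "no score" if score is None else "{:.1f}".format(score))
```

    Printing the rows that come back as None would show quickly whether the score column is really where the 2017 run breaks.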

    Next, I crawl the Google site (my student number is 06).

    The code is below (not much different from the code above).

    # -*- coding: utf-8 -*-
    """
    Created on Wed May 22 22:34:04 2019

    @author: m1353
    """

    import requests
    from bs4 import BeautifulSoup

    alluniv = []  # filled by fillunivlist; not printed in this script


    def getHTMLText(url):
        """Fetch url and return its text, or "" if the request fails."""
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = 'utf-8'
            return r.text
        except requests.RequestException:
            return ""


    def xunhuang(url):
        # Request the page 20 times; the responses are discarded.
        for i in range(20):
            getHTMLText(url)


    def fillunivlist(soup):
        """Collect the <td> cells of every table row into alluniv."""
        data = soup.find_all('tr')
        for tr in data:
            ltd = tr.find_all('td')
            if len(ltd) == 0:
                continue
            singleuniv = []
            for td in ltd:
                singleuniv.append(td.string)
            alluniv.append(singleuniv)


    def printf():
        # Print blank lines to separate the sections of output.
        print("\n")
        print("\n")
        print("\n")


    def main():
        url = "http://www.google.com"
        html = getHTMLText(url)
        xunhuang(url)
        print(html)
        soup = BeautifulSoup(html, "html.parser")
        fillunivlist(soup)
        print(html)
        printf()
        print(soup.title)  # the <title> tag
        printf()
        print(soup.head)   # the <head> section
        printf()
        print(soup.body)   # the <body> section


    main()

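    The output is shown in the figure below.

    There is a great deal more in the middle...

    Skipping ahead to the last few lines:

    The part underlined in red is the output of the title tag.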

    This part is the head section (there is a lot of it =3=).

    And here is the body section (it really is long).

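    I believe that because the code earlier looked for td inside each tr, some of the "<tr>, <td>" tags were removed, and the output ended up looking like this.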

  • Original post: https://www.cnblogs.com/qq1079179226/p/10909199.html