  • Scraping news from the campus news homepage

1. Use the requests and BeautifulSoup libraries to scrape the title, link, and body of each news item on the campus news homepage.

import requests
from bs4 import BeautifulSoup

url = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
res = requests.get(url)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

for news in soup.select('li'):
    # Only <li> elements that contain a .news-list-title are real news items.
    if len(news.select('.news-list-title')) > 0:
        t = news.select('.news-list-title')[0].text               # title
        dt = news.select('.news-list-info')[0].contents[0].text   # publication date
        a = news.select('a')[0].attrs['href']                     # link
        print(dt, t, a)

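The selector logic above can be tried offline against a small hand-written HTML fragment; the markup below is a hypothetical imitation of the news-list structure the code assumes (the real page may differ), which makes it easy to see why the `len(...) > 0` guard is needed to skip navigation `<li>` items:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the assumed news-list structure.
html = '''
<ul>
  <li>
    <a href="http://news.gzcc.cn/html/2018/xiaoyuanxinwen_0404/9183.html">
      <div class="news-list-title">Example headline</div>
      <div class="news-list-info"><span>2018-04-04</span><span>School Office</span></div>
    </a>
  </li>
  <li><a href="#">a navigation item with no title class</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')
for news in soup.select('li'):
    if len(news.select('.news-list-title')) > 0:   # skip non-news <li> items
        t = news.select('.news-list-title')[0].text
        dt = news.select('.news-list-info')[0].contents[0].text
        a = news.select('a')[0].attrs['href']
        print(dt, t, a)
```

Only the first `<li>` is printed; the second one fails the guard because it has no `.news-list-title` descendant.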
2. Parse the metadata string on each article page to extract the publication time, author, source, photographer, and similar fields.

import requests
from bs4 import BeautifulSoup
from datetime import datetime

# Continues from the listing-page soup built in step 1.
for news in soup.select('li'):
    if len(news.select('.news-list-title')) > 0:
        title = news.select('.news-list-title')[0].text
        a = news.select('a')[0].attrs['href']

        resd = requests.get(a)                       # fetch the article detail page
        resd.encoding = 'utf-8'
        soupd = BeautifulSoup(resd.text, 'html.parser')
        d = soupd.select('#content')[0].text         # article body
        info = soupd.select('.show-info')[0].text    # metadata line (class 'show-info')
        print(info)
        dt = info.lstrip('发布时间:')[:19]            # publication time
        dt2 = datetime.strptime(dt, '%Y-%m-%d %H:%M:%S')
        print(dt2)
        i = info.find('来源:')
        if i >= 0:                                   # find() returns -1 when absent
            s = info[i:].split()[0].lstrip('来源:')   # source
            print(s)
        j = info.find('作者:')                        # use j, not a, which holds the link
        if j >= 0:
            l = info[j:].split()[0].lstrip('作者:')   # author
            print(l)
        y = info.find('摄影:')
        if y >= 0:
            u = info[y:].split()[0].lstrip('摄影:')   # photographer
            print(u)

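The string-slicing steps can be exercised without fetching any page by feeding them a hard-coded metadata line; the `info` string below is a hypothetical example in the format the code assumes. Note that `lstrip('作者:')` strips a *set of characters*, not a prefix, so it would also eat a name that begins with one of those characters; `replace('作者:', '')` is the safer choice when that matters:

```python
from datetime import datetime

# Hypothetical metadata string in the assumed format.
info = '发布时间:2018-04-01 11:57:23 作者:张三 审核:李四 来源:学校办公室 摄影:王五 点击:'

dt = datetime.strptime(info.lstrip('发布时间:')[:19], '%Y-%m-%d %H:%M:%S')
source = info[info.find('来源:'):].split()[0].lstrip('来源:')
author = info[info.find('作者:'):].split()[0].lstrip('作者:')
photo = info[info.find('摄影:'):].split()[0].lstrip('摄影:')
print(dt, source, author, photo)
```

`split()` cuts each field at the first whitespace, and `lstrip` then removes the field label, leaving just the value.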
3. Convert the publication time from str to the datetime type.

import requests
from bs4 import BeautifulSoup
from datetime import datetime

gzccurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
res = requests.get(gzccurl)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

for news in soup.select('li'):
    if len(news.select('.news-list-title')) > 0:
        title = news.select('.news-list-title')[0].text              # title
        url = news.select('a')[0]['href']                            # link
        time = news.select('.news-list-info')[0].contents[0].text
        dt = datetime.strptime(time, '%Y-%m-%d')                     # str -> datetime
        source = news.select('.news-list-info')[0].contents[1].text  # source
        print(dt, '\n', title, '\n', url, '\n', source, '\n')
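The conversion itself is independent of the scraping and can be checked with a literal date string (the value below is illustrative; real pages supply it via `.news-list-info`). `strptime` parses a string into a `datetime` according to a format spec, and `strftime` goes the other way:

```python
from datetime import datetime

# Illustrative date string in the listing-page format.
time_str = '2018-04-04'
dt = datetime.strptime(time_str, '%Y-%m-%d')   # str -> datetime
print(dt, type(dt))

# The reverse direction, datetime -> str, uses strftime:
print(dt.strftime('%Y/%m/%d'))
```

Once the value is a `datetime`, it supports comparison and arithmetic (e.g. sorting news items by date), which a plain string only supports by accident of the YYYY-MM-DD layout.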
  • Original article: https://www.cnblogs.com/liangyao111/p/8694496.html