  • Scraping the news on the campus news front page

    1. Use the requests and BeautifulSoup libraries to scrape the title, link, and body text of each news item on the campus news front page.

    import requests
    from bs4 import BeautifulSoup

    url = "http://news.gzcc.cn/html/xiaoyuanxinwen/"
    res = requests.get(url)
    res.encoding = "utf-8"
    soup = BeautifulSoup(res.text, "html.parser")

    for news in soup.select("li"):
        if len(news.select(".news-list-title")) > 0:  # skip <li> elements that are not news items
            time = news.select(".news-list-info")[0].contents[0].text
            title = news.select(".news-list-title")[0].text
            description = news.select(".news-list-description")[0].text
            a = news.select("a")[0].attrs["href"]  # link to the article page
            detail_res = requests.get(a)
            detail_res.encoding = "utf-8"
            detail_soup = BeautifulSoup(detail_res.text, "html.parser")
            content = detail_soup.select("#content")[0].text  # article body
            info = detail_soup.select(".show-info")[0].text
            # 发布时间 means "published at"; lstrip() strips a set of characters, not a prefix,
            # which is safe here only because the date starts with a digit
            date_time = info.lstrip('发布时间:')[:19]
            print(content)
            print(time, title, description, a)
            print(info)
            break  # only process the first news item
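The loop above passes each `href` straight to `requests.get`, which only works when the link is an absolute URL. Here is a minimal sketch, assuming the list page could also contain relative links: `urllib.parse.urljoin` from the standard library resolves either kind against the page URL. The `absolute_link` helper and the sample `href` values are hypothetical and are not taken from the site.

```python
from urllib.parse import urljoin

BASE_URL = "http://news.gzcc.cn/html/xiaoyuanxinwen/"


def absolute_link(href, base=BASE_URL):
    """Resolve an href found on the list page into an absolute URL."""
    return urljoin(base, href)


# Hypothetical href values, for illustration only
print(absolute_link("http://example.com/article.html"))  # already absolute: unchanged
print(absolute_link("/html/2018/article.html"))          # root-relative
print(absolute_link("2.html"))                           # relative to the list page
```

In the loop, this would mean calling `requests.get(absolute_link(a))` instead of `requests.get(a)`.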
    

      

    2. Parse the string to get each article's publication time, author, source, photographer, and other details.

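    # Labels: 发布时间 = published at, 作者 = author, 审核 = reviewer, 来源 = source, 点击 = views
    info = '发布时间:2018-04-01 11:57:00      作者:陈流芳  审核:权麟春  来源:马克思主义学院      点击:次'
    # lstrip() removes a set of leading characters, not a prefix; safe here because the date starts with a digit
    detail_time = info.lstrip('发布时间:')[:19]
    # Reviewer: cut from "审核", take the first whitespace-separated token, then drop the label
    sh = info[info.find("审核"):].split()[0].lstrip('审核:')
    print(detail_time, sh)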
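Because `str.lstrip` strips a set of characters rather than a prefix, it would also eat the start of any value that begins with one of the label's characters. As an alternative sketch, a regular expression can split the whole line into label/value pairs in one pass. The label list below is an assumption based on the fields named in this step (摄影, "photographer", does not appear in the sample string), and `parse_info` is a hypothetical helper name.

```python
import re

# Assumed set of labels; extend it if the page shows others
LABELS = ["发布时间", "作者", "审核", "来源", "摄影", "点击"]
_label = "|".join(LABELS)

# A label, a colon (ASCII or full-width), then the shortest value that
# runs up to the next label or to the end of the string
FIELD_RE = re.compile(rf"({_label})[::]\s*(.*?)(?=\s*(?:{_label})[::]|$)", re.S)


def parse_info(info):
    """Split a show-info line into a {label: value} dict."""
    return {label: value.strip() for label, value in FIELD_RE.findall(info)}


info = '发布时间:2018-04-01 11:57:00      作者:陈流芳 许健杰  审核:权麟春   来源:马克思主义学院    点击:次 '
fields = parse_info(info)
print(fields["发布时间"])      # publication time as a string
print(fields["作者"].split())  # one list entry per author name
print(fields["审核"])          # reviewer
```

Matching on known labels keeps the colons inside `11:57:00` from being mistaken for field separators.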
    

      

    3. Convert the publication time from str to datetime.

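    from datetime import datetime

    # Get the current time
    now_time = datetime.now()
    print(now_time.year)
    # Convert the string to a datetime (date_time comes from step 1)
    print(datetime.strptime(date_time, "%Y-%m-%d %H:%M:%S"))
    # Convert the datetime to a string; the backslashes are escaped so they print literally
    print(now_time.strftime('%Y\\%m\\%d'))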
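`datetime.strptime` raises `ValueError` when the text does not match the format, which can happen if the `[:19]` slice picks up something other than a timestamp. Here is a minimal sketch of a tolerant wrapper, assuming timestamps always use the `%Y-%m-%d %H:%M:%S` layout shown above; `parse_publish_time` is a hypothetical helper name.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_publish_time(text):
    """Return a datetime for text like '2018-04-01 11:57:00', or None if it does not match."""
    try:
        return datetime.strptime(text.strip(), TIME_FORMAT)
    except ValueError:
        return None


published = parse_publish_time("2018-04-01 11:57:00")
print(published.year, published.month, published.day)  # fields are now integers
print(published.strftime("%Y/%m/%d"))                  # back to a string, in another layout
print(parse_publish_time("点击:次"))                    # not a timestamp, so None
```

Returning `None` instead of raising lets a scraping loop skip a malformed article and carry on.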
    

      

    4. Post the complete code and a screenshot of its output with the assignment.

    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime
     
    url = "http://news.gzcc.cn/html/xiaoyuanxinwen/"
    res = requests.get(url)
    res.encoding = "utf-8"
    soup = BeautifulSoup(res.text, "html.parser")
     
    for news in soup.select("li"):
        if len(news.select(".news-list-title")) > 0:  # skip <li> elements that are not news items
            time = news.select(".news-list-info")[0].contents[0].text
            title = news.select(".news-list-title")[0].text
            description = news.select(".news-list-description")[0].text
            a = news.select("a")[0].attrs["href"]  # link to the article page
            detail_res = requests.get(a)
            detail_res.encoding = "utf-8"
            detail_soup = BeautifulSoup(detail_res.text, "html.parser")
            content = detail_soup.select("#content")[0].text  # article body
            info = detail_soup.select(".show-info")[0].text
            # lstrip() strips a set of characters, not a prefix; safe here because the date starts with a digit
            date_time = info.lstrip('发布时间:')[:19]

            print(content)
            print(time, title, description, a)
            print(info)
            break  # only process the first news item
     
    info = '发布时间:2018-04-01 11:57:00      作者:陈流芳  审核:权麟春  来源:马克思主义学院      点击:次'
    detail_time = info.lstrip('发布时间:')[:19]
    sh = info[info.find("审核"):].split()[0].lstrip('审核:')
    print(detail_time,sh)
     
     
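    # Pick out one author when several names are listed
    info1 = '发布时间:2018-04-01 11:57:00      作者:陈流芳 许健杰  审核:权麟春   来源:马克思主义学院    点击:次 '
    # Slice from "作者" (author) up to "审核" (reviewer), drop the label, split on whitespace; [1] is the second name
    info1 = info1[info1.find("作者"):info1.find('审核:')].lstrip('作者:').split()[1]
    print(info1)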
     
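    # Get the current time
    now_time = datetime.now()
    print(now_time.year)

    # Convert the string to a datetime
    print(datetime.strptime(date_time, "%Y-%m-%d %H:%M:%S"))

    # Convert the datetime to a string; the backslashes are escaped so they print literally
    print(now_time.strftime('%Y\\%m\\%d'))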
    

      

  • Original article: https://www.cnblogs.com/605-mk/p/8713351.html