  • Scraping news from the campus news front page

    1. Use the requests and BeautifulSoup libraries to scrape the title, link, and body text of the news items on the campus news front page.

    Title

    import requests
    from bs4 import BeautifulSoup

    url = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
    res = requests.get(url)
    res.encoding = 'utf-8'  # decode the response as UTF-8 so the Chinese text is readable
    soup = BeautifulSoup(res.text, 'html.parser')

    # Only the <li> elements that contain a .news-list-title are news items.
    for news in soup.select('li'):
        if len(news.select('.news-list-title')) > 0:
            t = news.select('.news-list-title')[0].text
            print(t)
            break  # stop after the first news item while testing
    

    Link

    import requests
    from bs4 import BeautifulSoup

    url = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
    res = requests.get(url)
    res.encoding = 'utf-8'  # decode the response as UTF-8 so the Chinese text is readable
    soup = BeautifulSoup(res.text, 'html.parser')

    for news in soup.select('li'):
        if len(news.select('.news-list-title')) > 0:
            t = news.select('.news-list-title')[0].text
            # The first <a> inside the news item holds the article URL.
            link = news.select('a')[0].attrs['href']
            print(link)
            break  # stop after the first news item while testing
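
The `href` read above is passed straight to `requests.get` in the next step, which only works if it is an absolute URL. As a small sketch, assuming the page might also contain relative links, `urllib.parse.urljoin` from the standard library resolves either form against the list-page URL. The helper name `absolute_link` and the example paths are hypothetical.

```python
from urllib.parse import urljoin

# The list-page URL used throughout this post.
base_url = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'

def absolute_link(href, base=base_url):
    """Resolve href against the list page; absolute URLs pass through unchanged."""
    return urljoin(base, href)

# Hypothetical hrefs, for illustration only.
print(absolute_link('/html/2018/example.html'))
print(absolute_link('index_2.html'))
print(absolute_link('http://news.gzcc.cn/html/2018/example.html'))
```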

    Body text

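    # Continues from the link step: `link` is the article URL found above.
    resd = requests.get(link)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    # The article body is in the element with id="content".
    d = soupd.select('#content')[0].text
    print(d)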
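
The three snippets in step 1 can be folded into two small functions that take HTML text as input, so the selectors can be tried without hitting the site. This is only a sketch: the function names `parse_news_list` and `parse_news_body` and the sample markup are hypothetical, and the real page structure may differ. Only the selectors `li`, `.news-list-title`, `a`, and `#content` come from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only.
SAMPLE_LIST_HTML = """
<ul>
  <li><a href="/about.html">About</a></li>
  <li>
    <a href="http://example.com/news/1.html">
      <div class="news-list-title">Sample headline</div>
    </a>
  </li>
</ul>
"""

def parse_news_list(html):
    """Return a list of {'title', 'link'} dicts, one per news item."""
    soup = BeautifulSoup(html, 'html.parser')
    items = []
    for li in soup.select('li'):
        titles = li.select('.news-list-title')
        if not titles:
            continue  # skip <li> elements that are not news items
        items.append({
            'title': titles[0].text.strip(),
            'link': li.select('a')[0].attrs['href'],
        })
    return items

def parse_news_body(html):
    """Return the text of the #content element, or '' if it is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    content = soup.select('#content')
    return content[0].text.strip() if content else ''

news_items = parse_news_list(SAMPLE_LIST_HTML)
print(news_items)
```

Against the live site, the same functions would be fed `res.text` and `resd.text` from the requests calls shown earlier.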

    2. Parse the info string to get each article's publication time, author, source, photographer, and so on.

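    Publication time

    # soupd is the parsed article page from step 1.
    info = soupd.select('.show-info')[0].text.strip()
    # str.lstrip() strips a set of characters, not a prefix, so slice past the label instead.
    label = '发布时间:'
    t1 = info[info.find(label) + len(label):][:19]  # 'YYYY-MM-DD HH:MM:SS' is 19 characters
    print(t1)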
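
Taking a fixed 19 characters after the label depends on the exact layout of the info text. A regular expression is a slightly more forgiving alternative. The sketch below assumes the timestamp always has the form `YYYY-MM-DD HH:MM:SS`; the function name `extract_publish_time` and the sample string are hypothetical, and the pattern accepts both an ASCII and a full-width colon because it is not certain which one the page uses.

```python
import re

# Hypothetical sample of the .show-info text; the real spacing may differ.
sample_info = '\n  发布时间:2018-04-01 11:57:00    作者:张三  来源:学院办    摄影:李四  '

def extract_publish_time(info):
    """Return the timestamp string after the 发布时间 label, or None if absent."""
    match = re.search(
        r'发布时间[::]\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})', info)
    return match.group(1) if match else None

t1 = extract_publish_time(sample_info)
print(t1)
```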

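    Author, source, photographer, etc.

    # Take the first whitespace-delimited token starting at the label, then drop the label itself.
    label = '来源:'
    s = info[info.find(label):].split()[0][len(label):]
    print(s)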
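
A note on trimming labels: `str.lstrip` treats its argument as a set of characters rather than a prefix, so it can eat the start of the value when the value begins with one of the label's characters. The sketch below shows that pitfall, then collects all labeled fields into a dict in one pass. The `LABELS` list, the function name `parse_info`, and the sample string are assumptions for illustration; the real info text may carry other labels.

```python
import re

# Pitfall: the leading 来 of the value is in the strip set, so it is removed too.
print('来源:来稿'.lstrip('来源:'))  # prints 稿, not 来稿

# Hypothetical sample of the .show-info text.
sample_info = '发布时间:2018-04-01 11:57:00    作者:张三  来源:来稿    摄影:李四'

# Labels assumed to appear in the info text (taken from the step description).
LABELS = ['发布时间', '作者', '来源', '摄影']

def parse_info(info):
    """Map each known label to the text between it and the next known label."""
    alternation = '|'.join(LABELS)
    pattern = rf'({alternation})[::]\s*(.*?)\s*(?=(?:{alternation})[::]|$)'
    return dict(re.findall(pattern, info, flags=re.S))

fields = parse_info(sample_info)
print(fields['来源'])
print(fields['发布时间'])
```

Anchoring each value on the next known label keeps the colons and the space inside the timestamp from being mistaken for field separators.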
    
    
    

    3. Convert the publication time from str to datetime.

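    from datetime import datetime

    # t1 is the 'YYYY-MM-DD HH:MM:SS' string extracted in step 2.
    dt = datetime.strptime(t1, '%Y-%m-%d %H:%M:%S')
    now = datetime.now()  # current local time, for comparison

    print(dt)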
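
Once the string is a `datetime`, it can be compared, subtracted, and reformatted. A short sketch of what the conversion enables, using a hypothetical timestamp in place of a scraped one:

```python
from datetime import datetime, timedelta

t1 = '2018-04-01 11:57:00'  # hypothetical value standing in for the scraped string

dt = datetime.strptime(t1, '%Y-%m-%d %H:%M:%S')

# Individual fields are available as attributes.
print(dt.year, dt.month, dt.day)

# strftime goes the other way: datetime back to str, in any layout.
print(dt.strftime('%Y/%m/%d'))

# Subtracting two datetimes gives a timedelta, e.g. the article's age.
age = datetime.now() - dt
print(age.days, 'days old')

# Adding a timedelta shifts the date.
print(dt + timedelta(days=7))
```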
    
  • Original article: https://www.cnblogs.com/0056a/p/8692279.html