  • Scraping a news list with the requests and BeautifulSoup4 libraries

    1. Use the requests and BeautifulSoup4 libraries to scrape the time, title, link, and source of each item in the campus news list.

    import requests
    from bs4 import BeautifulSoup
    
    gzccurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
    res = requests.get(gzccurl)
    res.encoding='utf-8'
    
    soup = BeautifulSoup(res.text,'html.parser')
    
    for news in soup.select('li'):
        if len(news.select('.news-list-title'))>0:
            title = news.select('.news-list-title')[0].text  # title
            url = news.select('a')[0]['href']  # link
            time = news.select('.news-list-info')[0].contents[0].text  # publication time
            source = news.select('.news-list-info')[0].contents[1].text  # source
            
            # detail page (disabled for now)
            #resd = requests.get(url)
            #resd.encoding='utf-8'
            #soupd = BeautifulSoup(resd.text,'html.parser')
            #detail = soupd.select('.show-content')
    
            print(time, '\n', title, '\n', url, '\n', source, '\n')
            

    Result:
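
The same selectors can be tried without touching the network. The following is a minimal sketch, assuming a list item wraps a `.news-list-title` element and a `.news-list-info` element whose first two `<span>` children hold the date and the source. `SAMPLE_HTML`, its URLs, and the helper name `parse_news_list` are hypothetical and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the assumed structure of the news list.
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/news/1.html">
    <div class="news-list-title">Sample headline</div>
    <div class="news-list-info"><span>2017-09-28</span><span>Sample Office</span></div>
  </a></li>
  <li><a href="http://example.com/about.html">About</a></li>
</ul>
"""


def parse_news_list(html):
    """Return one dict per <li> that contains a .news-list-title element."""
    soup = BeautifulSoup(html, 'html.parser')
    items = []
    for li in soup.select('li'):
        title_tag = li.select_one('.news-list-title')
        if title_tag is None:
            continue  # navigation or other non-news list item
        spans = li.select('.news-list-info span')
        items.append({
            'title': title_tag.get_text(strip=True),
            'url': li.select_one('a')['href'],
            'time': spans[0].get_text(strip=True),
            'source': spans[1].get_text(strip=True),
        })
    return items


for item in parse_news_list(SAMPLE_HTML):
    print(item['time'], item['title'], item['url'], item['source'], sep='\n')
```

Using `select_one` returns `None` when nothing matches, which avoids calling `select` twice and indexing with `[0]`.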

    2. Convert the time string into a datetime object.

    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime
    
    gzccurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
    res = requests.get(gzccurl)
    res.encoding='utf-8'
    soup = BeautifulSoup(res.text,'html.parser')
    
    #def getdetail(url):
        #resd = requests.get(url)
        #resd.encoding='utf-8'
        #soupd = BeautifulSoup(resd.text,'html.parser')
        #return (soupd.select('.show-content')[0].text)
    
    for news in soup.select('li'):
        if len(news.select('.news-list-title'))>0:
            title = news.select('.news-list-title')[0].text  # title
            url = news.select('a')[0]['href']  # link
            time = news.select('.news-list-info')[0].contents[0].text
            dt = datetime.strptime(time,'%Y-%m-%d')  # parse the date string into a datetime
            source = news.select('.news-list-info')[0].contents[1].text  # source
            #detail = getdetail(url)  # article body
    
            print(dt, '\n', title, '\n', url, '\n', source, '\n')
            
            

    Result:
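
`datetime.strptime` raises `ValueError` when the text does not match the format exactly. The following is a small sketch of a more forgiving conversion, assuming the scraped string may carry surrounding whitespace or occasionally hold something other than a date. The helper name `parse_news_date` is hypothetical.

```python
from datetime import datetime


def parse_news_date(text, fmt='%Y-%m-%d'):
    """Return a datetime parsed from `text`, or None if it does not match `fmt`."""
    try:
        return datetime.strptime(text.strip(), fmt)
    except ValueError:
        return None


dt = parse_news_date(' 2017-09-28 ')
print(dt)                       # 2017-09-28 00:00:00
print(dt.strftime('%Y/%m/%d'))  # 2017/09/28
print(parse_news_date('n/a'))   # None
```

Once the value is a `datetime`, items can be sorted or filtered by date instead of by string comparison.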

    3. Wrap the code that fetches the article body in a function.

    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime
    
    gzccurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
    res = requests.get(gzccurl)
    res.encoding='utf-8'
    soup = BeautifulSoup(res.text,'html.parser')
    
    def getdetail(url):
        resd = requests.get(url)
        resd.encoding='utf-8'
        soupd = BeautifulSoup(resd.text,'html.parser')  # parse the detail page, not the list page
        return (soupd.select('.show-content')[0].text)
    
    for news in soup.select('li'):
        if len(news.select('.news-list-title'))>0:
            title = news.select('.news-list-title')[0].text  # title
            url = news.select('a')[0]['href']  # link
            time = news.select('.news-list-info')[0].contents[0].text
            dt = datetime.strptime(time,'%Y-%m-%d')
            source = news.select('.news-list-info')[0].contents[1].text  # source
            detail = getdetail(url)  # article body
    
            print(dt, '\n', title, '\n', url, '\n', source, '\n')
            break
            

    Result:
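
One way to make `getdetail` easier to check is to split the parsing step from the download, so the parsing can be run on a plain HTML string. The following is a sketch under that assumption. `extract_content` and `DETAIL_HTML` are hypothetical names, and the sample markup only imitates a page with a `.show-content` element.

```python
from bs4 import BeautifulSoup


def extract_content(html, selector='.show-content'):
    """Return the text of the first element matching `selector`, or '' if none."""
    soup = BeautifulSoup(html, 'html.parser')
    node = soup.select_one(selector)
    if node is None:
        return ''  # page has no article body; avoid an IndexError
    return node.get_text('\n', strip=True)


# Hypothetical detail-page fragment used only for illustration.
DETAIL_HTML = (
    '<div class="show-content">'
    '<p>First paragraph.</p><p>Second paragraph.</p>'
    '</div>'
)

print(extract_content(DETAIL_HTML))
print(repr(extract_content('<p>no article body here</p>')))
```

With this split, `getdetail(url)` would only need to download the page and pass `resd.text` to `extract_content`.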

    4. Pick a topic that interests you and do something similar, in preparation for "scraping web data and performing text analysis".

    import requests
    from bs4 import BeautifulSoup
    jq='http://www.lbldy.com/tag/gqdy/'
    res = requests.get(jq)
    res.encoding='utf-8'
    soup = BeautifulSoup(res.text,'html.parser')
     
    for news in soup.select('li'):
        if len(news.select('a'))>0:
            title=news.select('a')[0].text
            url=news.select('a')[0]['href']
            #time=news.select('span')[0].contents[0].text
            #print(time,title,url)
            print(title,url)

    Result:
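
Links collected this way may be relative, repeated, or attached to empty anchor text. The following is a minimal sketch of a clean-up pass using the standard library's `urljoin`, assuming the scraped data is a list of `(title, href)` pairs. The helper `clean_links` and the sample pairs are hypothetical and are not taken from the site.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.lbldy.com/tag/gqdy/'


def clean_links(base, pairs):
    """Resolve hrefs against `base`, skip blank titles, and drop duplicate URLs."""
    seen = set()
    cleaned = []
    for title, href in pairs:
        title = title.strip()
        full_url = urljoin(base, href)
        if not title or full_url in seen:
            continue
        seen.add(full_url)
        cleaned.append((title, full_url))
    return cleaned


# Hypothetical (title, href) pairs standing in for scraped results.
sample_pairs = [
    ('Film A', '/movie/1.html'),
    ('  ', '/movie/2.html'),
    ('Film A again', 'http://www.lbldy.com/movie/1.html'),
    ('Next page', 'page/2/'),
]

for title, url in clean_links(BASE_URL, sample_pairs):
    print(title, url)
# Film A http://www.lbldy.com/movie/1.html
# Next page http://www.lbldy.com/tag/gqdy/page/2/
```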

  • Original article: https://www.cnblogs.com/husiqi/p/7606167.html