  • Web crawler (NetEase News)

    import os
    import re
    import requests
    
    if not os.path.exists('网易新闻'):
        os.mkdir('网易新闻')
    
    count = 0
    for i in ['nba','cba','china']:
        # Collect every article URL linked from the section page
        response = requests.get(f'https://sports.163.com/{i}/')
        data = response.text
        url_res = re.findall(r'href="(https://sports\.163\.com/.*?)"', data)
        url_res = set(url_res)
    
        # Process each article URL individually
    
        for url in url_res:
            url_response = requests.get(url)
            url_data = url_response.text
    
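            try:
                title = re.findall('<h1>(.*?)</h1>', url_data, re.S)[0]
                # Capture the article body up to and including the
                # editor credit line ("责任编辑" is the credit label on the page)
                news_res = re.findall(
                    '<div class="post_text" id="endText" style="border-top:1px solid #ddd;">(.*?责任编辑:.*?)</span>',
                    url_data, re.S)[0]
                news_res = re.sub('<.*?>', '', news_res)  # strip the remaining HTML tags
            except IndexError:
                # The page does not match the article layout, so skip it
                continue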
    
            title = re.sub(r'[!"#$%&()*+,\-./:;<=>?@\[\]^_‘{|}~…]|\s', '', title)  # strip punctuation and whitespace from the title
            title_path = os.path.join('网易新闻', f'{title}.txt')  # build the output path for this article
            # Write the article text; the with block flushes and closes the file
            with open(title_path, 'w', encoding='utf8') as f:
                f.write(news_res)
            count += 1
    
            print(f'Saved {count} articles, {title} done...')
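
The link extraction, title cleaning, and tag stripping steps in the script above can be tried offline, without requesting any page. The following is a minimal sketch, assuming hypothetical helper names (`extract_links`, `clean_title`, `strip_tags`) and made-up sample HTML; none of these appear in the original script, and the character set in `UNSAFE_CHARS` is only an approximation of the one used there.

```python
import re

# Hypothetical helpers for illustration; the names are not from the original script.
UNSAFE_CHARS = re.compile(r'[!"#$%&()*+,\-./:;<=>?@\[\]^_{|}~…]|\s')
TAG = re.compile(r'<.*?>', re.S)


def extract_links(html, prefix='https://sports.163.com/'):
    """Return the set of href targets that start with the given prefix."""
    pattern = 'href="(' + re.escape(prefix) + '.*?)"'
    return set(re.findall(pattern, html))


def clean_title(title):
    """Drop punctuation and whitespace so the title can serve as a file name."""
    return UNSAFE_CHARS.sub('', title)


def strip_tags(html):
    """Remove anything that looks like an HTML tag, keeping the text in between."""
    return TAG.sub('', html)


if __name__ == '__main__':
    # Made-up sample markup, only to exercise the helpers.
    page = ('<a href="https://sports.163.com/nba/x.html">x</a>'
            '<a href="https://example.com/">y</a>')
    print(extract_links(page))
    print(clean_title('Lakers win: 110-98!'))
    print(strip_tags('<p>Full <b>report</b> here.</p>'))
```

Keeping these steps in small functions makes it possible to check the regular expressions against saved HTML before running the crawler against the live site.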
    
  • Original post: https://www.cnblogs.com/yushan1/p/11232379.html