  • Scraping one user's Weibo data with Selenium, the procedural way

    Scrape a given user's Weibo posts, covering every time period of their feed.
    The overall approach:
    create the driver ----- get the page ----- find and extract the data ----- save to CSV ----- go to the next page ----- get the page (the loop starts over) ---- ... ---- stop when there is no "next page" link.
    The loop is a plain while True; no recursive self-call is used.

    嘟大海's Weibo: https://weibo.com/u/1623915527
    办公室小野's Weibo: https://weibo.com/bgsxy
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import csv
    import os
    import time
    
    # These are the only two settings: to scrape a different user, change the URL and the target CSV filename here
    weibo_url = 'https://weibo.com/bgsxy?profile_ftype=1&is_all=1#_0'
    csv_name = 'bgsxy_allweibo.csv'
    
    def start_chrome():
        print('Starting the browser')
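        # executable_path must point to a local chromedriver build that matches the installed Chrome version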
        driver = webdriver.Chrome(executable_path='C:/Users/lori/Desktop/python52project/chromedriver_win32/chromedriver.exe')
        return driver
    
    def get_web(url):      # load the page and scroll down to the bottom
        print('Opening the target page')
        driver.get(url)
        time.sleep(7)
        scroll_down()
        time.sleep(5)
    
    def scroll_down():   # scroll the page to the very bottom
        html_page = driver.find_element_by_tag_name('html')
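        # press END repeatedly so the lazily loaded feed items have time to render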
        for i in range(7):
            print(i)
            html_page.send_keys(Keys.END)
            time.sleep(1)
    
    def get_data():
        print('Locating and extracting the data')
        card_sel = 'div.WB_cardwrap.WB_feed_type'
        time_sel = 'a.S_txt2[node-type="feed_list_item_date"]'
        source_sel = 'a.S_txt2[suda-uatrack="key=profile_feed&value=pubfrom_guest"]'
        content_sel = 'div.WB_text.W_f14'
        interact_sel = 'span.line.S_line1>span>em:nth-child(2)'
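        # nth-child(2) picks the second <em> in each interaction item, which is where the count text lives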
    
        cards = driver.find_elements_by_css_selector(card_sel)
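        # each matched card wraps one post in the feed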
        info_list = []
    
        for card in cards:
            post_time = card.find_elements_by_css_selector(time_sel)[0].text  # a card may contain two time elements; the first is always the post's timestamp
            if card.find_elements_by_css_selector(source_sel):
                source = card.find_elements_by_css_selector(source_sel)[0].text
            else:
                source = ''
            content = card.find_elements_by_css_selector(content_sel)[0].text
            link = card.find_elements_by_css_selector(time_sel)[0].get_attribute('href')
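            # interaction counts: indexes 1, 2, 3 are repost, comment and like; index 0 is assumed to be the favorite button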
            trans = card.find_elements_by_css_selector(interact_sel)[1].text
            comment = card.find_elements_by_css_selector(interact_sel)[2].text
            like = card.find_elements_by_css_selector(interact_sel)[3].text
            info_list.append([post_time,source,content,link,trans,comment,like])
    
        return info_list
    
    def save_csv(info_list,csv_name):
        csv_path = './' + csv_name
        print('Writing the CSV file')
        if os.path.exists(csv_path):
            with open(csv_path,'a',newline='',encoding='utf-8-sig') as f: # newline='' avoids blank rows; utf-8-sig adds a BOM so Excel displays Chinese correctly
                writer = csv.writer(f)
                writer.writerows(info_list)
        else:
            with open(csv_path,'w',newline='',encoding='utf-8-sig') as f:
                writer = csv.writer(f)
                writer.writerow(['post time','source','content','link','reposts','comments','likes'])
                writer.writerows(info_list)
        time.sleep(5)
    
    def next_page_url():
        next_page_sel = 'a.page.next'
        next_page_ele = driver.find_elements_by_css_selector(next_page_sel)
        if next_page_ele:
            return next_page_ele[0].get_attribute('href')
        else:
            return None
    
    
    driver = start_chrome()
    input('Log in to weibo.com in the Chrome window, then press Enter')     # pause the program for a manual login
    
    while True:
        get_web(weibo_url)
        info_list = get_data()
        save_csv(info_list,csv_name)
        next_url = next_page_url()      # query the link once and reuse the result
        if next_url:
            weibo_url = next_url
        else:
            print('Finished scraping')
            break
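
    A note on Selenium versions: the script above uses the Selenium 3 API. Selenium 4 removed the find_element(s)_by_* helpers and the executable_path argument, so a minimal sketch of the equivalent calls (reusing the chromedriver path and selectors from above) looks like this:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service

    # Selenium 4 passes the driver path through a Service object
    driver = webdriver.Chrome(service=Service('C:/Users/lori/Desktop/python52project/chromedriver_win32/chromedriver.exe'))

    # find_elements_by_css_selector(sel) becomes find_elements(By.CSS_SELECTOR, sel)
    cards = driver.find_elements(By.CSS_SELECTOR, 'div.WB_cardwrap.WB_feed_type')
    html_page = driver.find_element(By.TAG_NAME, 'html')

    The rest of the script (the scrolling, CSV writing and pagination logic) stays the same.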
    

      
