  • Python crawler ----> Python projects on GitHub

      This post crawls some high-star Python projects on GitHub as a way to practice using BeautifulSoup and pymysql. I always thought the mountain was the water's story, the cloud was the wind's story, and you were my story; yet I never knew whether I was yours.

    A Python crawler for GitHub

    Crawler requirement: fetch high-quality Python-related projects from GitHub. What follows is a test run; it does not crawl much data.

    1. A crawler version implementing the basic functionality

    This example covers batch inserts with pymysql, parsing HTML with BeautifulSoup, and issuing GET requests with the requests library. For more on using pymysql, see the post: Python frameworks ----> using pymysql
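As a minimal, self-contained illustration of the BeautifulSoup calls used below, here is a made-up HTML fragment standing in for GitHub's 2017-era search-result markup (it is not the real page source), showing how `find`, `attrs`, and `get_text` extract the fields:

```python
from bs4 import BeautifulSoup

# A made-up fragment mimicking one search-result item.
html = '''
<div class="repo-list-item">
  <a class="v-align-middle" href="/vinta/awesome-python">awesome-python</a>
  <a class="muted-link">41.4k</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
item = soup.find('div', class_='repo-list-item')
href = item.find('a', attrs={'class': 'v-align-middle'})['href'].strip()
stars = item.find('a', attrs={'class': 'muted-link'}).get_text().strip()

# href is '/vinta/awesome-python', so split('/') yields
# ['', 'vinta', 'awesome-python'] -- index 1 is the owner, index 2 the repo name.
print(href.split('/')[1], href.split('/')[2], stars)
```

This also explains why the crawler below indexes `split('/')` at 1 and 2 rather than 0 and 1: the leading slash produces an empty first element.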

    import requests
    import pymysql.cursors
    from bs4 import BeautifulSoup
    
    def get_effect_data(data):
        results = list()
        soup = BeautifulSoup(data, 'html.parser')
        projects = soup.find_all('div', class_='repo-list-item')
        for project in projects:
            writer_project = project.find('a', attrs={'class': 'v-align-middle'})['href'].strip()
            project_language = project.find('div', attrs={'class': 'd-table-cell col-2 text-gray pt-2'}).get_text().strip()
            project_starts = project.find('a', attrs={'class': 'muted-link'}).get_text().strip()
            update_desc = project.find('p', attrs={'class': 'f6 text-gray mb-0 mt-2'}).get_text().strip()
    
            result = (writer_project.split('/')[1], writer_project.split('/')[2], project_language, project_starts, update_desc)
            results.append(result)
        return results
    
    
    def get_response_data(page):
        request_url = 'https://github.com/search'
        params = {'o': 'desc', 'q': 'python', 's': 'stars', 'type': 'Repositories', 'p': page}
        resp = requests.get(request_url, params=params)
        return resp.text
    
    
    def insert_datas(data):
        connection = pymysql.connect(host='localhost',
                                     user='root',
                                     password='root',
                                     db='test',
                                     charset='utf8mb4',
                                     cursorclass=pymysql.cursors.DictCursor)
        try:
            with connection.cursor() as cursor:
                sql = 'insert into project_info(project_writer, project_name, project_language, project_starts, update_desc) VALUES (%s, %s, %s, %s, %s)'
                # executemany issues one batched insert for all rows
                cursor.executemany(sql, data)
                connection.commit()
        finally:
            # close the connection on success as well as on error
            connection.close()
    
    
    if __name__ == '__main__':
        total_page = 2  # total number of pages to crawl
        datas = list()
        for page in range(total_page):
            res_data = get_response_data(page + 1)
            data = get_effect_data(res_data)
            datas += data
        insert_datas(datas)
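    The insert statement assumes a project_info table already exists in the test database. A sketch of a matching schema (the column types are my assumption, not from the original post; adjust to taste):

```sql
CREATE TABLE project_info (
    id INT AUTO_INCREMENT PRIMARY KEY,
    project_writer VARCHAR(100),
    project_name VARCHAR(200),
    project_language VARCHAR(50),
    project_starts VARCHAR(20),   -- kept as text because values look like '41.4k'
    update_desc VARCHAR(100)
);
```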

    After running it, you can see data like the following in the database:

    11 tensorflow tensorflow C++ 78.7k Updated Nov 22, 2017
    12 robbyrussell oh-my-zsh Shell 62.2k Updated Nov 21, 2017
    13 vinta awesome-python Python 41.4k Updated Nov 20, 2017
    14 jakubroztocil httpie Python 32.7k Updated Nov 18, 2017
    15 nvbn thefuck Python 32.2k Updated Nov 17, 2017
    16 pallets flask Python 31.1k Updated Nov 15, 2017
    17 django django Python 29.8k Updated Nov 22, 2017
    18 requests requests Python 28.7k Updated Nov 21, 2017
    19 blueimp jQuery-File-Upload JavaScript 27.9k Updated Nov 20, 2017
    20 ansible ansible Python 26.8k Updated Nov 22, 2017
    21 justjavac free-programming-books-zh_CN JavaScript 24.7k Updated Nov 16, 2017
    22 scrapy scrapy Python 24k Updated Nov 22, 2017
    23 scikit-learn scikit-learn Python 23.1k Updated Nov 22, 2017
    24 fchollet keras Python 22k Updated Nov 21, 2017
    25 donnemartin system-design-primer Python 21k Updated Nov 20, 2017
    26 certbot certbot Python 20.1k Updated Nov 20, 2017
    27 aymericdamien TensorFlow-Examples Jupyter Notebook 18.1k Updated Nov 8, 2017
    28 tornadoweb tornado Python 14.6k Updated Nov 17, 2017
    29 python cpython Python 14.4k Updated Nov 22, 2017
    30 reddit reddit Python 14.2k Updated Oct 17, 2017
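    The rows above come from GitHub's paginated search results. The exact URL the crawler requests for each page can be inspected offline with requests.Request, without touching the network:

```python
import requests

# Same parameters the crawler uses; p selects the results page.
params = {'o': 'desc', 'q': 'python', 's': 'stars', 'type': 'Repositories', 'p': 1}
prepared = requests.Request('GET', 'https://github.com/search', params=params).prepare()

# The prepared URL is exactly what requests.get() would fetch.
print(prepared.url)
```

This is a handy way to debug query strings before running the full crawl.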

  • Original post: https://www.cnblogs.com/huhx/p/usepythongithubspider.html