  • Scraping GitHub projects

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://github.com/login'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
        'Referer': 'https://github.com/',
        'Upgrade-Insecure-Requests': '1',  # the 1 here must be a string, not a number
        'Host': 'github.com',
        'Connection': 'keep-alive',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
    res1 = requests.get(url, headers=headers)
    # Sanity check: status code and reason of the login-page request
    print(res1.status_code)
    print(res1.reason)
    # Parse the login page to get the dynamic token
    soup = BeautifulSoup(res1.text, 'lxml')
    tag_input = soup.find(name='input', attrs={'name': 'authenticity_token'})
    authenticity_token = tag_input.get('value')
    data = {'commit': 'Sign in',  # requests form-encodes the space as '+' itself
            'utf8': '✓',
            'authenticity_token': authenticity_token,
            'login': '295345t54341@qq.com',
            'password': '234523456345'}
    
    cookies = res1.cookies.get_dict()
    # The login form posts to https://github.com/session, not https://github.com/login
    res2 = requests.post(url='https://github.com/session', headers=headers, cookies=cookies, data=data)
    print(authenticity_token)
    print(res2.status_code)
    print(res2.reason)
    cookies.update(res2.cookies.get_dict())
    res3 = requests.get(url='https://github.com/settings/repositories',
                        cookies=cookies,
                        headers=headers
                        )
    
    print(res3.url)
    print(res3.status_code)
    print(res3.reason)
    
    soup3 = BeautifulSoup(res3.text, 'lxml')
    project = soup3.find(name='div', attrs={'class': 'listgroup'})
    print(project)
    project_list = project.find_all(name='a', attrs={'class': 'mr-1'})
    for i in project_list:
        project_name = i.text
        project_ = i.get('href')
        project_href = 'https://github.com/' + project_.split('/', maxsplit=1)[1]
        print('Project name: %s , project link: %s' % (project_name, project_href), '\n')
    
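    # Notes on scraping GitHub:
    # 1. Every request made after logging in must carry the post-login cookies.
    # 2. The token has to be read from the login page. It is dynamic, so extract
    #    it with bs4 or a regular expression.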
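
The two notes above can be sketched as follows. This is a minimal illustration, not the original author's code: `extract_token`, `login`, `TOKEN_RE` and `SAMPLE_HTML` are hypothetical names, the sample markup is made up, and the regular expression assumes the `name` attribute comes before `value` inside the `<input>` tag. The form fields mirror the ones used in the script; the live site may require others.

```python
import re

# Matches the hidden <input> that carries the token.
# Assumption: name="..." appears before value="..." inside the tag.
TOKEN_RE = re.compile(
    r'<input[^>]*\bname="authenticity_token"[^>]*\bvalue="([^"]*)"')


def extract_token(html):
    """Return the authenticity_token value found in html, or None if absent."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


def login(session, username, password):
    """Log in through a session object so later requests reuse its cookies.

    `session` is expected to behave like requests.Session (get/post methods).
    """
    login_page = session.get('https://github.com/login')
    token = extract_token(login_page.text)
    if token is None:
        raise RuntimeError('authenticity_token not found on the login page')
    data = {'commit': 'Sign in',
            'utf8': '✓',
            'authenticity_token': token,
            'login': username,
            'password': password}
    return session.post('https://github.com/session', data=data)


# Made-up sample of the login form markup, for illustration only.
SAMPLE_HTML = ('<form action="/session" method="post">'
               '<input type="hidden" name="authenticity_token" value="abc123==" />'
               '</form>')

print(extract_token(SAMPLE_HTML))  # abc123==
```

With the real library, one would pass `requests.Session()` as `session` and then call `session.get('https://github.com/settings/repositories')` on the same object. A session keeps the cookies from every response, so the manual `cookies.update(...)` step is no longer needed.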
    
  • Original article: https://www.cnblogs.com/luobiao-114/p/9263876.html