
Scraping Web Crawler Learning Materials

If anything here is inappropriate, please contact me and I will remove it promptly.

For this scrape I used requests and XPath, since a job this small does not call for a heavyweight tool.

import requests
from lxml import etree

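Approach:
1. The goal is to download the crawler tutorials.
2. Analyze the page and its structure, then use XPath to extract the download URLs.
3. Download each one in a loop.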
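Step 2 can be tried offline before any request is sent. The following is a minimal sketch, not the page as GitHub actually serves it: the `SAMPLE_HTML` markup, the `ExampleRepoA`/`ExampleRepoB` names, and the `extract_repo_urls` helper are hypothetical stand-ins. Only the XPath expression is the one used in this post's code.

```python
from urllib.parse import urljoin

from lxml import etree

# Hypothetical, simplified markup standing in for the organization page.
SAMPLE_HTML = """
<html><body>
<div id="org-repositories">
  <ul>
    <li><div class="d-inline-block mb-1"><a href="/Python3WebSpider/ExampleRepoA">ExampleRepoA</a></div></li>
    <li><div class="d-inline-block mb-1"><a href="/Python3WebSpider/ExampleRepoB">ExampleRepoB</a></div></li>
  </ul>
</div>
</body></html>
"""


def extract_repo_urls(html, base_url="https://github.com"):
    """Return absolute repository URLs found in the given HTML string."""
    selector = etree.HTML(html)
    # Same XPath as the post: the @href of each repository link in the list.
    hrefs = selector.xpath(
        '//div[@id="org-repositories"]//ul/li'
        '/div[@class="d-inline-block mb-1"]//a/@href'
    )
    # The hrefs are relative, so resolve them against the site root.
    return [urljoin(base_url, href) for href in hrefs]


repo_urls = extract_repo_urls(SAMPLE_HTML)
for url in repo_urls:
    print(url)
```

Testing the expression against a saved copy of the page this way makes it easier to notice when the site's markup has changed and the XPath no longer matches anything.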

The code is as follows:

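class github():
    def __init__(self):
        # Despite its name, this holds the start URL: the organization page that lists the repositories
        self.allowed_domains = 'https://github.com/Python3WebSpider'
        self.headers = {
            'User-Agent':'***** replace with your own User-Agent'
        }
    def spider_pipline(self):
        # GitHub's markup changes over time, so the XPath expressions below may need updating
        response1 = requests.get(self.allowed_domains,headers = self.headers,timeout = 5)
        selector = etree.HTML(response1.text)
        # Relative links to each repository on the organization page
        main_hrefs = selector.xpath('//div[@id="org-repositories"]//ul/li/div[@class="d-inline-block mb-1"]//a/@href')
        for start_href in main_hrefs:
            repo_url = 'https://github.com'+ start_href
            response2 = requests.get(repo_url, headers=self.headers, timeout=5)
            selector2 = etree.HTML(response2.text)
            # Second link in the download box of the repository page: the archive download
            zip_hrefs = selector2.xpath('//main[@id="js-repo-pjax-container"]//div[@class="get-repo-modal-options"]/div[@class="mt-2"]/a[2]/@href')
            for item in zip_hrefs:
                item_new = 'https://github.com'+item
                # Send the same headers and timeout as the page requests
                r = requests.get(item_new, headers=self.headers, timeout=5)
                # Skip failed downloads instead of saving an error page as an archive
                if r.status_code != 200:
                    print('failed-', item_new)
                    continue
                # Drop the 18-character '/Python3WebSpider/' prefix, then flatten the path into a file name
                filename = item[18:].replace('/','-')
                with open(filename, "wb") as git_zip:
                    git_zip.write(r.content)
                print('done-', filename)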

if __name__ == '__main__':
    git = github()
    git.spider_pipline()
    print('download OK')
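The slice `item[18:]` in the code above works only because the prefix `/Python3WebSpider/` happens to be 18 characters long. As a sketch of one alternative, the file name can be derived from the path segments instead, so it does not depend on the length of the organization name. The `archive_filename` helper and the `ExampleRepo` name are hypothetical and are not part of the original code.

```python
from urllib.parse import urlparse


def archive_filename(href):
    """Turn '/Owner/Repo/archive/master.zip' into 'Repo-archive-master.zip'."""
    # urlparse lets the helper accept both relative hrefs and absolute URLs.
    path = urlparse(href).path
    parts = [part for part in path.split("/") if part]
    # parts[0] is the owner; the remaining segments identify the archive.
    return "-".join(parts[1:])


href = "/Python3WebSpider/ExampleRepo/archive/master.zip"
print(archive_filename(href))  # ExampleRepo-archive-master.zip
```

For hrefs under `/Python3WebSpider/` this yields the same name as `item[18:].replace('/','-')`.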

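Finally, I suggest giving the GitHub author a star. That author is someone I really admire, and his book is very good! I recommend buying the book to study from, since it helps you build a structured body of knowledge.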



Original article: https://www.cnblogs.com/chenruhai/p/12464230.html