  • Scraping free IT book PDFs with Python and Beautiful Soup

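    http://www.allitebooks.org/
    This is the most generous site I have come across: every book on it is free to download.
    With a free weekend on my hands, I tried scraping all the PDF books on the site.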

    Technologies used

    • Python 3.5
    • Beautiful Soup

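    The code

    This is the simplest possible crawler, with little thought given to error handling. If you try it out, please go easy so that this generous site does not get knocked offline.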

    from bs4 import BeautifulSoup
    import requests
    import json
    
    siteUrl = 'http://www.allitebooks.org/'
    
    
    def category():
        """Return the name and URL of every category in the site's sub-menu."""
        response = requests.get(siteUrl)
        soup = BeautifulSoup(response.text, "html.parser")
        categoryurl = []
        for a in soup.select('.sub-menu li a'):
            categoryurl.append({'name': a.get_text(), 'href': a.get("href")})
        return categoryurl
    
    def bookUrlList(url):
        """Walk every listing page of one category (a dict from category())."""
        response = requests.get(url['href'])
        soup = BeautifulSoup(response.text, "html.parser")
        # The "Last Page" pagination link carries the total page count as its text.
        a = soup.select(".pagination a[title='Last Page →']")
        # Assumption: a category without that link has a single listing page.
        nums = 1
        for e in a:
            nums = int(e.get_text())
        for i in range(1, nums + 1):
            bookList(url['href'] + "page/" + str(i))
    
    def bookList(url):
        """Visit every book linked from one listing page."""
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        article = soup.select(".main-content-inner article .entry-title a")
        for i in article:
            getBookDetail(i.get("href"))
    
    def getBookDetail(url):
        """Append one book's title, cover image and PDF link to booklist.txt."""
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        # Each [0] raises IndexError if the page lacks the expected element.
        title = soup.select(".single-title")[0].text
        imgurl = soup.select(".entry-body-thumbnail .attachment-post-thumbnail")[0].get("src")
        downLoadPdfUrl = soup.select(".download-links a")[0].get("href")
        with open('d:/booklist.txt', 'a+', encoding='utf-8') as f:
            f.write(title + " | ![](" + imgurl + ") | " + downLoadPdfUrl + "\n")
    
    
    if __name__ == '__main__':
        # Crawl every category in turn; avoid shadowing the built-in `list`.
        categories = category()
        for url in categories:
            bookUrlList(url)
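
The crawler above fires its requests back to back, with no pause and no retry, which is exactly what could hurt a small site. One way to go easier on it is to route every request through a small helper that pauses before each attempt and retries a few times on failure. The sketch below is only an illustration and not part of the original script: the name `fetch_with_retry`, its parameters and their default values are all hypothetical, and the actual download function is passed in rather than hard-coded.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), pausing before every attempt and retrying on errors.

    fetch:   any callable that takes a URL (hypothetical parameter)
    retries: maximum number of attempts (illustrative default)
    delay:   base pause in seconds, doubled after each failed attempt
    sleep:   injectable so the pause can be replaced in tests
    """
    last_error = None
    for attempt in range(retries):
        # Pause before each request so the site is never hit back to back.
        sleep(delay * (2 ** attempt))
        try:
            return fetch(url)
        except Exception as error:  # deliberately broad: this is only a sketch
            last_error = error
    raise RuntimeError(
        "giving up on %s after %d attempts" % (url, retries)
    ) from last_error
```

In the script, each `requests.get(url)` call could then become something like `fetch_with_retry(lambda u: requests.get(u, timeout=10), url)`. Note that `requests.get` does not raise on HTTP error statuses by itself, so the lambda would also need to call `raise_for_status()` on the response for those to trigger a retry.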
    
  • Original article: https://www.cnblogs.com/qingmiaokeji/p/10988906.html