  • Python: a novel downloader

    Time is tight, so I'm jotting this down now; optimization and polish will come later.

                First, a disclaimer: I wrote this purely for practice. I don't read novels anymore; my nearsightedness is so bad that I barely even use my phone.

                The goal of this downloader is to address a gap: back then, sites where you could download the latest novels were far too few, while sites for reading them online were plentiful, so I wrote this tool to scrape novels from the web. The code targets one specific site, but that site has been around for a long time and its catalog is very complete, so it should cover most needs. I won't name the site here (you'll see it in the code), since I'd rather avoid any legal trouble.

        Since this tool fetches web pages and reads their HTML, it needs an HTML-parsing library, and I discovered pyquery. I consider myself fairly good at jQuery (I've written my own jQuery plugins, can handle most browser-compatibility issues, have used essentially all of jQuery UI, customized many jQuery UI widgets, and even patched official bugs), so I certainly wasn't going to pass up a find like pyquery. To install Python packages I use easy_install.

        I originally used pip but found it less reliable than easy_install here: when installing pyquery, pip failed while resolving a dependency, whereas easy_install succeeded. For installing easy_install and pip, see: http://blog.csdn.net/qq413041153/article/details/8950247

        Once easy_install is set up, just run this in cmd:

    easy_install pyquery

            In my case, since pyquery was already installed, the command simply reported that pyquery 1.2.4 is already active in easy-install.pth.

        Download and code

        

                Here is the code, in full:

    # -*- coding:gbk -*-
    '''
    file desc:novel downloader
    author:kingviker
    email:kingviker@163.com, kingviker88@gmail.com
    date:2013-05-21
    depends:python 2.7.4,pyquery
    '''
    
    import os,codecs
    from pyquery import PyQuery as pq
    
    
    saveMode="singleFile" #singleFile or singleChapter
    
    #novel's main webpage.
    url = "http://www.dushuge.net/html/14/14712/"
    #where the novels will be saved
    baseSavePath="E:/enovel/"
    
    #use pyquery to grab the webpage's content
    html_pq = pq(url=url)
    
    #use jQuery-style selectors to get the novel's name.
    novelName = html_pq("div.book_news_style_text2 > h1").text()
    print novelName
    
    
    #create the novel's folder if it does not exist yet
    if not os.path.exists(baseSavePath+novelName):
        os.mkdir(baseSavePath+novelName)
    
    #used to hold piece (volume) titles and chapter lists
    pieceList=[]
    chapterList=[]
    
    
    #find the first piece of the novel.
    piece = pq(html_pq("div.book_article_texttable").find(".book_article_texttitle")[0])
    
    #get the current piece's text
    pieceList.append(piece.text())
    print "piece Text:", piece.text()
    
    #scan out the piece and chapter lists
    nextPiece=False
    while nextPiece==False:
        chapterDiv = piece.next()
        #print "chapter div length:", chapterDiv.length
        piece = chapterDiv
        if chapterDiv.length==0:
            pieceList.append(chapterList[:])
            del chapterList[:]
            nextPiece=True
        elif chapterDiv.attr("class")=="book_article_texttitle":
            pieceList.append(chapterList[:])
            del chapterList[:]
            pieceList.append(piece.text())
            
        else:
            chapterUrls = chapterDiv.find("a")
            for urlA in chapterUrls:
                urlList_temp = [pq(urlA).text(),pq(urlA).attr("href")]
                chapterList.append(urlList_temp)
    
    print "Download list collected, entries:", len(pieceList)
    
    
    #based on the piece list, grab each chapter page's content and save it.
    if saveMode == "singleFile":
        
        if os.path.exists(baseSavePath+novelName+".txt"):os.remove(baseSavePath+novelName+".txt")
    
        #use codecs to create the file; mode wb+ truncates any existing content.
        novelFile = codecs.open(baseSavePath+novelName+".txt","wb+","utf-8")
        #just using two for loops to analyze the piecelist.
        for pieceNum in range(0,len(pieceList),2):
            piece = pieceList[pieceNum]
            print "Downloading piece:", pieceList[pieceNum]
            chapterList = pieceList[pieceNum+1]
            for chapterNum in range(0,len(chapterList)):
                chapter = chapterList[chapterNum]
                print "Downloading chapter:", chapter[0], "url:", chapter[1]
                chapterPage = pq(url=url+chapter[1])
    
                chapterContent = piece+" "+chapter[0]+"\r\n"
                chapterContent += chapterPage("#booktext").html().replace("<br />","\r\n")
                print "Chapter length:", len(chapterContent)
                novelFile.write(chapterContent+"\r\n"+"\r\n")
                
        novelFile.close()
    else:
        # same as above, but each chapter is written to its own file
        for pieceNum in range(0,len(pieceList),2):
            piece = pieceList[pieceNum]
            print "Downloading piece:", pieceList[pieceNum]
            chapterList = pieceList[pieceNum+1]
            for chapterNum in range(0,len(chapterList)):
                chapter = chapterList[chapterNum]
                print "Downloading chapter:", chapter[0], "url:", chapter[1]
                novelFile = codecs.open(baseSavePath+novelName+"/"+piece+chapter[0]+".txt","wb","utf-8")
                chapterPage = pq(url=url+chapter[1])
    
                chapterContent = piece+" "+chapter[0]+"\r\n"
                chapterContent += chapterPage("#booktext").html().replace("<br />","\r\n")
                print "Chapter length:", len(chapterContent)
                novelFile.write(chapterContent+"\r\n"+"\r\n")
                novelFile.close()
    
    print "Download complete"

                To download a different novel, just change the novel's main-page URL in the code. The files are saved under E:/enovel/, and you can choose between saving each chapter separately or the whole novel as a single file.
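                Concretely, the settings mentioned above are these three assignments near the top of the listing (values copied from the code):

```python
saveMode = "singleFile"  # "singleFile" (one big .txt) or "singleChapter" (one file per chapter)
url = "http://www.dushuge.net/html/14/14712/"  # the novel's index page on the target site
baseSavePath = "E:/enovel/"  # output folder; must already exist on disk
```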

                I didn't wrap it up in functions, because I'm lazy.

                Questions, bug reports, and corrections are all welcome.

        

        Addendum:

                The code uses codecs; there is an article that can help you get familiar with codecs: link
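    In short, codecs.open returns a file object that encodes and decodes transparently, so you can write unicode text and get a UTF-8 file on disk, which is exactly how the script saves chapters. A minimal, self-contained sketch (the filename is an arbitrary throwaway):

```python
import codecs
import os

path = "codecs_demo.txt"  # arbitrary throwaway filename

# Write unicode text through a UTF-8 codec.
with codecs.open(path, "w", "utf-8") as f:
    f.write(u"chapter title\r\nchapter body\r\n")

# Read it back through the same codec; codecs.open does not
# translate newlines, so the \r\n sequences survive intact.
with codecs.open(path, "r", "utf-8") as f:
    text = f.read()

os.remove(path)  # clean up the throwaway file
print(repr(text))
```

This is also why the script writes "\r\n" explicitly: with a codecs file object, line endings are written exactly as given.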


  • Original post: https://www.cnblogs.com/xinyuyuanm/p/3091510.html