Time is tight, so I'm jotting this down first; I'll polish and improve it later.
First, a disclaimer: I wrote this purely for practice. I don't read novels anymore; my nearsightedness has gotten so bad that I don't even use my phone.
The goal of this novel downloader was to solve a real gap: back then, very few sites let you download the latest novels, while plenty let you read them online. So I wrote this tool to scrape novels from the web. The code targets one specific site, but that site has been around a long time and has a very complete catalog, so it should cover the vast majority of needs. I won't name the site here; you'll see it in the code. I'd rather avoid any legal trouble.
Since this is a tool that scrapes web pages and reads HTML, it needs an HTML-parsing framework. I discovered pyquery, and since I consider my jQuery fairly solid (I've written my own jQuery plugins, can handle most browser-compatibility issues, have used essentially all of jQuery UI, have customized many jQuery UI plugins, and can even patch official bugs), I certainly wasn't going to pass up a gem like pyquery. To install Python packages I use easy_install.
I originally used pip, but found it less reliable than easy_install: when installing pyquery, pip errored out while resolving the dependencies, whereas easy_install installed it successfully. For how to set up easy_install and pip, see: http://blog.csdn.net/qq413041153/article/details/8950247
Once easy_install is set up, just type this at the cmd prompt:
easy_install pyquery
In my case, since I had already installed it, easy_install simply reported that pyquery 1.2.4 was already active in easy-install.pth.
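Before the full program, here is a minimal sketch of what pyquery's jQuery-style syntax looks like; the HTML snippet and selectors below are made up for illustration, not taken from the real site:

# -*- coding:utf-8 -*-
# Minimal pyquery demo: jQuery-style selectors in Python.
# The HTML snippet here is a made-up example.
from pyquery import PyQuery as pq

doc = pq("<div class='book'><h1>Demo Novel</h1><a href='1.html'>Chapter 1</a></div>")
print doc("div.book > h1").text()   # -> Demo Novel
print doc("a").attr("href")         # -> 1.html

If you know jQuery, selectors, .text(), .attr(), .find(), and .next() all behave the way you'd expect, which is exactly why I picked this library.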
Now, straight to the code:
# -*- coding:gbk -*-
'''
file desc: novel downloader
author:    kingviker
email:     kingviker@163.com, kingviker88@gmail.com
date:      2013-05-21
depends:   python 2.7.4, pyquery
'''
import os, codecs
from pyquery import PyQuery as pq

saveMode = "singleFile"  # "singleFile" or "singleChapter"
# the novel's main page
url = "http://www.dushuge.net/html/14/14712/"
# where the downloaded novels will be saved
baseSavePath = "E:/enovel/"

# use pyquery to grab the novel's main page
html_pq = pq(url=url)
# use jQuery-style selectors to get the novel's name
novelName = html_pq("div.book_news_style_text2 > h1").text()
print novelName
# create the novel's folder if it does not exist yet
if not os.path.exists(baseSavePath + novelName):
    os.mkdir(baseSavePath + novelName)

# pieceList alternates piece (volume) titles and chapter lists
pieceList = []
chapterList = []

# find the first piece title of the novel
piece = pq(html_pq("div.book_article_texttable").find(".book_article_texttitle")[0])
# record the current piece's title
pieceList.append(piece.text())
print "piece text:", piece.text()

# walk the sibling divs to collect piece titles and chapter links
nextPiece = False
while not nextPiece:
    chapterDiv = piece.next()
    # print "chapter div length:", chapterDiv.length
    piece = chapterDiv
    if chapterDiv.length == 0:
        # no more siblings: store the last chapter list and stop
        pieceList.append(chapterList[:])
        del chapterList[:]
        nextPiece = True
    elif chapterDiv.attr("class") == "book_article_texttitle":
        # a new piece title: store the finished chapter list, start a new one
        pieceList.append(chapterList[:])
        del chapterList[:]
        pieceList.append(piece.text())
    else:
        # a chapter block: collect each link's name and href
        for urlA in chapterDiv.find("a"):
            chapterList.append([pq(urlA).text(), pq(urlA).attr("href")])

print "download list collected", len(pieceList)

# based on pieceList, fetch each chapter page and save its content
if saveMode == "singleFile":
    if os.path.exists(baseSavePath + novelName + ".txt"):
        os.remove(baseSavePath + novelName + ".txt")
    # codecs.open writes the file as utf-8; "wb+" creates or truncates it
    novelFile = codecs.open(baseSavePath + novelName + ".txt", "wb+", "utf-8")
    # two nested loops walk pieceList: even indexes are titles,
    # odd indexes are the matching chapter lists
    for pieceNum in range(0, len(pieceList), 2):
        piece = pieceList[pieceNum]
        print "downloading", pieceList[pieceNum]
        chapterList = pieceList[pieceNum + 1]
        for chapter in chapterList:
            print "downloading", chapter[0], "url:", chapter[1]
            chapterPage = pq(url=url + chapter[1])
            chapterContent = piece + " " + chapter[0] + "\r\n"
            chapterContent += chapterPage("#booktext").html().replace("<br />", "\r\n")
            print "content length:", len(chapterContent)
            novelFile.write(chapterContent + "\r\n" + "\r\n")
    novelFile.close()
else:
    # same as above, but each chapter goes into its own file
    for pieceNum in range(0, len(pieceList), 2):
        piece = pieceList[pieceNum]
        print "downloading", pieceList[pieceNum]
        chapterList = pieceList[pieceNum + 1]
        for chapter in chapterList:
            print "downloading", chapter[0], "url:", chapter[1]
            novelFile = codecs.open(baseSavePath + novelName + os.sep + piece + chapter[0] + ".txt", "wb", "utf-8")
            chapterPage = pq(url=url + chapter[1])
            chapterContent = piece + " " + chapter[0] + "\r\n"
            chapterContent += chapterPage("#booktext").html().replace("<br />", "\r\n")
            print "content length:", len(chapterContent)
            novelFile.write(chapterContent + "\r\n" + "\r\n")
            novelFile.close()

print "download finished"
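For clarity, the scan loop above builds pieceList as a flat list where even indexes hold piece titles and odd indexes hold the matching chapter lists, which is why the save loop steps through it two at a time. A made-up illustration of the shape (titles and hrefs are invented):

# Made-up illustration of the pieceList layout built by the scan loop:
pieceList = [
    u"Volume 1",                           # index 0: piece title
    [[u"Chapter 1", "1001.html"],          # index 1: its chapters
     [u"Chapter 2", "1002.html"]],
    u"Volume 2",                           # index 2: next piece title
    [[u"Chapter 3", "1003.html"]],         # index 3: its chapters
]
# range(0, len(pieceList), 2) visits the titles; pieceList[i+1] is the chapter list.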
To download a different novel, just change the novel's main-page URL in the code. The files are saved under E:/enovel/, and you can choose to save one file per chapter or everything in a single file.
I didn't wrap any of this into functions, because I'm lazy.
If you spot problems or mistakes, criticism and corrections are welcome.
Addendum:
The code uses codecs; there's an article that can help you get familiar with codecs: [link]
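For a quick feel of what codecs does in the downloader, here is a minimal sketch; the file name "demo.txt" is just an example:

# -*- coding:utf-8 -*-
# Minimal codecs demo: write and read back a utf-8 text file.
import codecs

f = codecs.open("demo.txt", "wb+", "utf-8")
f.write(u"chapter title\r\nchapter body\r\n")
f.close()

f = codecs.open("demo.txt", "rb", "utf-8")
print f.read()   # a unicode string, decoded from utf-8
f.close()

The point is that codecs.open handles the encoding for you: you write and read unicode strings, and the utf-8 encoding/decoding happens transparently, which is why the downloader can dump chapter text straight to disk.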
To wrap up, a programmer joke to share:
Grinding away on Tianya, how could you not catch a few bricks? Eat three hundred bricks a day, and you'd gladly stay a Tianya regular forever~