  • How to download web resources



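    Goal

    China Machine Press (机工社) recently announced that it is opening its Engineering Science and Technology Digital Library to everyone online, free of charge, to help people through a difficult time.

    Some of the books are served as web pages, one page at a time, which makes them hard to download in one go.

    Is there a way to download a whole book at once?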

    For example:

    [screenshot]

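    Investigation

    Test 1: Chrome extensions

    A web search turns up plenty of Chrome extensions for bulk downloading, but none of them can detect the links on these pages. That is because the pages contain no real links at all.

    For example, a chapter link on the page looks like this: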

    <a href="javascript:void(0);" onclick="probation.readBook(this);" id="678612" ref="/openresources/teach_ebook/uncompressed/13780/OEBPS/Text/chapter33.html#heading_id_3">3.1 协商原则</a>
    

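    When clicked, this link ultimately resolves to http://www.hzcourse.com/resource/readBook?path=/openresources/teach_ebook/uncompressed/13780/OEBPS/Text/chapter33.html

    The href is only javascript:void(0); the real target sits in the non-standard ref attribute and is assembled by the onclick handler, so it is no wonder the extensions do not recognize it.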
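
    The mapping from the ref attribute to the final URL can be sketched in Python. This is only a sketch under assumptions: RefCollector, chapter_url and chapter_urls are names made up for illustration, and dropping the #fragment is inferred from the single example above, not from the site's code.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag

# Endpoint taken from the resolved link shown above.
READ_BOOK = "http://www.hzcourse.com/resource/readBook?path="


class RefCollector(HTMLParser):
    """Collect the non-standard `ref` attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            ref = dict(attrs).get("ref")
            if ref:
                self.refs.append(ref)


def chapter_url(ref):
    """Drop the #fragment (an assumption) and prepend the readBook endpoint."""
    path, _fragment = urldefrag(ref)
    return READ_BOOK + path


def chapter_urls(html):
    """Return the readBook URL for every <a ref="..."> found in html."""
    collector = RefCollector()
    collector.feed(html)
    return [chapter_url(ref) for ref in collector.refs]
```

    Feeding the anchor tag above to chapter_urls should give back the resolved URL, which is exactly the information a generic link-grabbing extension has no way to reconstruct.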

    It looks like I will have to write a downloader myself.

    The simplest option is Python.

    Testing the link above:

    C:\Users\cutep>python -m wget http://www.hzcourse.com/resource/readBook?path=/openresources/teach_ebook/uncompressed/13780/OEBPS/Text/chapter33.html -o 33.html
    100% [................................................................................] 4000 / 4000
    Saved under 33.html
    

    Success!
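
    The same download can also be done in-process instead of through the wget module's command line. The sketch below uses only the standard library; fetch_to_file is a hypothetical helper name and the 30-second timeout is an arbitrary choice, not something the test above relies on.

```python
import shutil
import urllib.request


def fetch_to_file(url, filename, timeout=30):
    """Download url and write the response body to filename.

    Returns the number of bytes written.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(filename, "wb") as out:
            # Stream the body to disk without loading it all into memory.
            shutil.copyfileobj(response, out)
            return out.tell()
```

    Calling fetch_to_file with the readBook URL above and '33.html' would correspond to the command-line test. Shelling out to python -m wget is shorter to type, which is presumably why it was the first thing tried.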

    Test 2: the final Python script

    import os
    import requests

    def my_system(cmd):
        # Echo the shell command, then run it (Windows cmd syntax is used throughout).
        print(cmd)
        os.system(cmd)

    def download(url, file):
        # Fetch one URL to a local file through the wget module's command line.
        cmd = 'python -m wget %s -o %s' % (url, file)
        my_system(cmd)

    def download_chapter(click_url, file):
        # click_url is the chapter's "ref" path; the reader serves it through readBook.
        download('http://www.hzcourse.com/resource/readBook?path=%s' % click_url, file)

    def get_bookname(cont):
        # The title is the first <span> after <div class="book-name">.
        s = '<div class="book-name">'
        p1 = cont.find(s)
        p1 = p1 + len(s)
        p1 = cont.find('<span>', p1)
        p1 = p1 + len('<span>')
        p2 = cont.find('</span>', p1)
        name = cont[p1:p2]
        return name

    def get_value_token(cont):
        # ebookId and token are read from the value attributes of two input fields.
        s = '"ebookId" value="'
        p1 = cont.find(s)
        p1 = p1 + len(s)
        p2 = cont.find('"/>', p1)
        ebookId = cont[p1:p2]
        s2 = 'name="token" value="'
        p3 = cont.find(s2, p2)
        p3 = p3 + len(s2)
        p4 = cont.find('"/>', p3)
        token = cont[p3:p4]
        print('ebookId, token %s %s' % (ebookId, token))
        return [ebookId, token]

    def download_book(main_link):
        # Fetch the book's main page and pull out its ID, token, and title.
        my_system('del main*.html')
        download(main_link, 'main.html')
        with open('main.html', 'r', encoding='utf-8') as f:
            main_cont = f.read()
        [ebookId, token] = get_value_token(main_cont)
        bookname = get_bookname(main_cont)
        print(bookname)

        # Skip books that have already been downloaded.
        if os.path.isdir(bookname): return

        # Download the chapters into a fresh temporary directory.
        my_system('rd/s/q my_temp')
        my_system('md my_temp')
        os.chdir('my_temp')
        my_system('cd')

        # Example with fixed values: data={'ebookId':15917,'token':"e87436c8bc7849c397a1db2f27c0ba5d"}
        response = requests.post('http://www.hzcourse.com/web/refbook/queryAllChapterList', data={'ebookId':ebookId,'token':token})
        resp_json = response.json()
        for i in resp_json['data']['data']:
            ref_link = i['ref']
            # Use the last path component of the ref as the local file name.
            file = ref_link[ref_link.rfind('/')+1:]
            print(ref_link, file)
            download_chapter(ref_link, file)

        # Copy the finished chapters into a directory named after the book.
        os.chdir('..')
        my_system('cd')
        my_system('md "%s"' % bookname)
        my_system('xcopy /c/d/e/y my_temp "%s"' % bookname)

    # Every book URL carries the same token; only the numeric ID in the URL changes.
    for book_id in [6736, 6856, 7899, 7249, 7165, 7186, 7523, 6965, 6826, 6166, 6188, 6853, 4599, 6759]:
        download_book('http://www.hzcourse.com/web/refbook/probationAll/%d/e87436c8bc7849c397a1db2f27c0ba5d' % book_id)
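
    One weak spot in the script is that str.find returns -1 when a marker is missing, so a changed page layout silently yields garbage values. A possible alternative is sketched below with regular expressions that fail loudly. Note the assumptions: parse_main_page and the pattern names are hypothetical, the patterns are inferred only from the marker strings used in the script, and unlike the script the token is searched for in the whole page, not only after the ebookId field.

```python
import re

# Patterns inferred from the marker strings used in the script above.
EBOOK_ID_RE = re.compile(r'"ebookId"\s+value="([^"]*)"')
TOKEN_RE = re.compile(r'name="token"\s+value="([^"]*)"')
BOOK_NAME_RE = re.compile(r'<div class="book-name">.*?<span>(.*?)</span>', re.S)


def parse_main_page(html):
    """Return (ebook_id, token, book_name); raise ValueError if any is missing."""
    found = {}
    for label, pattern in (
        ("ebookId", EBOOK_ID_RE),
        ("token", TOKEN_RE),
        ("book name", BOOK_NAME_RE),
    ):
        match = pattern.search(html)
        if match is None:
            raise ValueError("could not find %s in the main page" % label)
        found[label] = match.group(1).strip()
    return found["ebookId"], found["token"], found["book name"]
```

    With this, a page that lacks the expected fields stops the run with a ValueError before any temporary directory is created or any chapter is requested.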
    

    Test result

    Saved under chapter51.xhtml
    /openresources/teach_ebook/uncompressed/16571/OEBPS/Text/chapter52.xhtml chapter52.xhtml
    python -m wget http://www.hzcourse.com/resource/readBook?path=/openresources/teach_ebook/uncompressed/16571/OEBPS/Text/chapter52.xhtml -o chapter52.xhtml
    100% [................................................................................] 1058 / 1058
    Saved under chapter52.xhtml
    /openresources/teach_ebook/uncompressed/16571/OEBPS/Text/chapter53.xhtml chapter53.xhtml
    python -m wget http://www.hzcourse.com/resource/readBook?path=/openresources/teach_ebook/uncompressed/16571/OEBPS/Text/chapter53.xhtml -o chapter53.xhtml
    100% [................................................................................] 4625 / 4625
    Saved under chapter53.xhtml
    /openresources/teach_ebook/uncompressed/16571/OEBPS/Text/chapter54.xhtml chapter54.xhtml
    python -m wget http://www.hzcourse.com/resource/readBook?path=/openresources/teach_ebook/uncompressed/16571/OEBPS/Text/chapter54.xhtml -o chapter54.xhtml
    100% [..................................................................................] 705 / 705
    Saved under chapter54.xhtml
    /openresources/teach_ebook/uncompressed/16571/OEBPS/Text/chapter55.xhtml chapter55.xhtml
    python -m wget http://www.hzcourse.com/resource/readBook?path=/openresources/teach_ebook/uncompressed/16571/OEBPS/Text/chapter55.xhtml -o chapter55.xhtml
    100% [................................................................................] 1814 / 1814
    Saved under chapter55.xhtml
    /openresources/teach_ebook/uncompressed/16571/OEBPS/Text/chapter56.xhtml chapter56.xhtml
    python -m wget http://www.hzcourse.com/resource/readBook?path=/openresources/teach_ebook/uncompressed/16571/OEBPS/Text/chapter56.xhtml -o chapter56.xhtml
    100% [..............................................................................] 10025 / 10025
    Saved under chapter56.xhtml
    /openresources/teach_ebook/uncompressed/16571/OEBPS/Text/chapter57.xhtml chapter57.xhtml
    python -m wget http://www.hzcourse.com/resource/readBook?path=/openresources/teach_ebook/uncompressed/16571/OEBPS/Text/chapter57.xhtml -o chapter57.xhtml
    


    Miscellaneous

    Which framework is the following markup written in?

    A: avalonjs

    <li ms-for="bookChapter in @bookChapters">
        <a href="javascript:void(0);" onclick="probation.readBook(this);" ms-attr="{id : bookChapter.id, ref : bookChapter.ref}">{{bookChapter.title}}</a>
    </li>
    

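    Where is bookChapter defined? It is the loop variable over bookChapters, which the page's script fills from the response of an AJAX request: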

    var probation = {
    	search:function(){
    		var key = $.trim($("#condition").val());
    		ebookRead.queryEbookChapterList(key);
    	},
    	queryEbookChapterList:function(key){
    		var ebookId = $.trim($("#ebookId").val());
    		var token = $.trim($("#token").val());
    		debugger;
    		jQuery.ajax({
    	    	type : "post" , 
    	    	url : "web/refbook/queryAllChapterList", 
    	    	dataType : "json" , 
    	    	data : {ebookId:ebookId,key:key,token:token},
    	    	success : function(obj) {
    	    		if(obj.data.code==1){
    	    			var bookChapters = obj.data.data;
    	    			if(bookChapters.length > 0){
    	    				bookChaptertCtrl.bookChapters = bookChapters;
    	    				$("#chapterCont").load();
    	    				$("#directories").find("li").first().children("a").click();
    	    			}
    	    		} else {
    	    			alert(obj.data.message);
    	    		}
    	    	}
    	    });
    	},
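
    The request and the success callback above translate fairly directly to Python. The following is a sketch, not the site's documented API: build_chapter_list_request and chapter_files are hypothetical names, the absolute URL is the one used in the download script, and the response shape (data.code, data.data, data.message) is taken only from the callback.

```python
import posixpath
import urllib.parse
import urllib.request

CHAPTER_LIST_URL = "http://www.hzcourse.com/web/refbook/queryAllChapterList"


def build_chapter_list_request(ebook_id, token, key=""):
    """Build (but do not send) the form-encoded POST that the page issues."""
    form = urllib.parse.urlencode({"ebookId": ebook_id, "key": key, "token": token})
    return urllib.request.Request(
        CHAPTER_LIST_URL, data=form.encode("ascii"), method="POST"
    )


def chapter_files(payload):
    """Map a decoded JSON response to a list of (ref, local_filename) pairs.

    Mirrors the success callback: code 1 means success, otherwise
    data.message carries the error text.
    """
    data = payload["data"]
    # The page compares with ==, so accept both 1 and "1".
    if str(data["code"]) != "1":
        raise RuntimeError(data.get("message", "chapter list request failed"))
    return [(ch["ref"], posixpath.basename(ch["ref"])) for ch in data["data"]]
```

    The request object could be passed to urllib.request.urlopen and the body decoded with json.load before calling chapter_files. Checking the code field first means a rejected token surfaces as an error message, not as a KeyError on the chapter list.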
    


    How do you find these URLs?

    With the ever-useful Chrome DevTools (F12).


  • Original article: https://www.cnblogs.com/cutepig/p/12250629.html