zoukankan      html  css  js  c++  java
  • Phantomjs 根据Casperjs源码拓展download方法

    最近项目在使用Phantomjs作自动化检测时,有一个需求,需要下载检测网站的所有资源,包括css、js和图片资源,方便人工分析时可以把整个page还原。可惜,Phantomjs并没有直接提供download()这样的方法。查找资料后发现Casperjs有一个download的方法,可以把任意url的内容下载为字符串。由于Casperjs是根据Phantomjs开发的,因此从Casperjs的源码上分析,可能会得到一点启发。

     

    目的:根据Casperjs源码,拓展Phantomjs,添加download方法

     

    1. 先测试Casperjs的download方法[1]

     1 var casper = require('casper').create({
     2     pageSettings : {
     3         webSecurityEnabled: false
     4     }
     5 });
     6 
     7 casper.start('http://www.baidu.com/', function() {
     8     this.download('http://www.w3school.com.cn/', 'w3school.html');
     9 });
    10 
    11 casper.run();

    保存为D:/script.js,在命令行执行(casperjs D:/script.js)。Casperjs需要Phantomjs,请确保已安装Phantomjs v1.x版本。

     

    2. 分析Casperjs源码

    download方法在casper模块里,打开源码包下modules/casper.js,先找到download这个方法体(#592行)

     1 /**
     2  * Downloads a resource and saves it on the filesystem.
     3  *
     4  * @param  String  url         The url of the resource to download
     5  * @param  String  targetPath  The destination file path
     6  * @param  String  method      The HTTP method to use (default: GET)
     7  * @param  String  data        Optional data to pass performing the request
     8  * @return Casper
     9  */
    10 Casper.prototype.download = function download(url, targetPath, method, data) {
    11     "use strict";
    12     this.checkStarted();    //在#426行,检查this是否已启动
    13     var cu = require('clientutils').create(utils.mergeObjects({}, this.options));
    14     try {
    15         fs.write(targetPath, cu.decode(this.base64encode(url, method, data)), 'wb');
    16         this.emit('downloaded.file', targetPath);
    17         this.log(f("Downloaded and saved resource in %s", targetPath));
    18     } catch (e) {
    19         this.log(f("Error while downloading %s to %s: %s", url, targetPath, e), "error");
    20     }
    21     return this;
    22 };

    上面源码中,cu为'clientutils'模块的实例,用于decode(),具体功能后面再讲述。第#16行中,emit()在events模块中(与this绑定的语句在源码#226行),功能为发送日志广播之类,与下面的this.log()一样,对download功能没大影响。因此核心语句在fs.write()中,url的内容在this.base64encode中获取。

    再找base64encode这个方法,在源码#255行,返回callUtils('getBase64', url, method, data)。callUtils在#283行。

     1 /**
     2  * Invokes a client side utils object method within the remote page, with arguments.
     3  *
     4  * @param  {String}   method  Method name
     5  * @return {...args}          Arguments
     6  * @return {Mixed}
     7  * @throws {CasperError}      If invokation failed.
     8  */
     9 Casper.prototype.callUtils = function callUtils(method) {
    10     "use strict";
    11     var args = [].slice.call(arguments, 1); //把除method外的其余参数存到args
    12     var result = this.evaluate(function(method, args) {
    13         return __utils__.__call(method, args);
    14     }, method, args);
    15     if (utils.isObject(result) && result.__isCallError) {
    16         throw new CasperError(f("callUtils(%s) with args %s thrown an error: %s",
    17                               method, args, result.message));
    18     }
    19     return result;
    20 };

    此时的method的值为“getBase64”,估计是一个方法名。这个方法核心语句在this.evaluate(),具体执行为this.evaluate(fn, "getBase64", [url, method, data])。evaluate()在#689行。

     1 /**
     2  * Evaluates an expression in the page context, a bit like what
     3  * WebPage#evaluate does, but the passed function can also accept
     4  * parameters if a context Object is also passed:
     5  *
     6  *     casper.evaluate(function(username, password) {
     7  *         document.querySelector('#username').value = username;
     8  *         document.querySelector('#password').value = password;
     9  *         document.querySelector('#submit').click();
    10  *     }, 'Bazoonga', 'baz00nga');
    11  *
    12  * @param  Function  fn       The function to be evaluated within current page DOM
    13  * @param  Object    context  Object containing the parameters to inject into the function
    14  * @return mixed
    15  * @see    WebPage#evaluate
    16  */
    17  //实际执行evaluate(fn, 'getBase64', [url, method, data])
    18  //即context='getBase64', arguments.length=3
    19 Casper.prototype.evaluate = function evaluate(fn, context) {
    20     "use strict";
    21     this.checkStarted();
    22     console.log("context:"+context);
    23     
    24     if (!utils.isFunction(fn) && !utils.isString(fn)) {
    25         throw new CasperError("evaluate() only accepts functions or strings");
    26     }
    27     
    28     this.injectClientUtils();       //注入clientutils.js,稍后再细看
    29     
    30     if (arguments.length === 1) {
    31         return utils.clone(this.page.evaluate(fn));
    32     } else if (arguments.length === 2) {
    33         // check for closure signature if it matches context
    34         if (utils.isObject(context) && eval(fn).length === Object.keys(context).length) {
    35             context = utils.objectValues(context);
    36         } else {
    37             context = [context];
    38         }
    39     } else {        //arguments.length==3,实际执行到这里
    40         // phantomjs-style signature
    41         context = [].slice.call(arguments).slice(1);
    42     }
    43     //此时context = ['getBase64', [url, method, data]]
    44     //[fn].concat(context) = [fn, 'getBase64', [url, method, data]]
    45     return utils.clone(this.page.evaluate.apply(this.page, [fn].concat(context)));
    46 };

    以上第#28行注入了clientutils.js,具体实现方法下面再分析。第#17和#18行说明调用本方法时的参数情况,根据参数个数,实际执行到#39行,详细说明在#43和#44行的注释。因此,#45行相当于执行this.page.evaluate(fn, 'getBase64', [url, method, data])。fn在callUtils中定义了,最终效果相当于:

    1 this.page.evaluate(function(method, args) {
    2     return __utils__.__call(method, args);
    3 }, 'getBase64', [url, method, data])

    其中,function中的method='getBaes64',args=[url, method, data]。所以最后,这句的意义等于在page中注入脚本执行__utils__.__call('getBase64', [url, method, data])。

    再回头看,__utils__对象在以上#28行this.injectClientUtils()中注入的,injectClientUtils在#1256行。

     1 /**
     2  * Injects Client-side utilities in current page context.
     3  *
     4  */
     5 Casper.prototype.injectClientUtils = function injectClientUtils() {
     6     "use strict";
     7     this.checkStarted();
     8     //保证不重复注入
     9     var clientUtilsInjected = this.page.evaluate(function() {
    10         return typeof __utils__ === "object";
    11     });
    12     if (true === clientUtilsInjected) {
    13         return;
    14     }
    15     var clientUtilsPath = require('fs').pathJoin(phantom.casperPath, 'modules', 'clientutils.js');
    16     if (true === this.page.injectJs(clientUtilsPath)) {
    17         this.log("Successfully injected Casper client-side utilities", "debug");
    18     } else {
    19         this.warn("Failed to inject Casper client-side utilities");
    20     }
    21     // ClientUtils and Casper shares the same options
    22     // These are not the lines I'm the most proud of in my life, but it works.
    23     /*global __options*/
    24     this.page.evaluate(function() {
    25         window.__utils__ = new window.ClientUtils(__options);
    26     }.toString().replace('__options', JSON.stringify(this.options)));
    27 };

    以上代码很好解释。先检查有没有__utils__对象,如果有说明已经注入clientutils了。若没有则注入clientutils.js,并新建ClientUtils对象,取名为__utils__。因此,下一步应该看clientutils.js。

    在clientutils.js中,找到__call方法,在#70行。

     1 /**
     2  * Calls a method part of the current prototype, with arguments.
     3  *
     4  * @param  {String} method Method name
     5  * @param  {Array}  args   arguments
     6  * @return {Mixed}
     7  */
     8 this.__call = function __call(method, args) {
     9     if (method === "__call") {
    10         return;
    11     }
    12     try {
    13         return this[method].apply(this, args);
    14     } catch (err) {
    15         err.__isCallError = true;
    16         return err;
    17     }
    18 };

    核心在#13行,很好理解,即执行method指定的方法,并返回结果。回顾上面,method为'getBase64',因此再找到getBase64方法,在#364行,其引用的getBinary()在下一个方法。getBinary()引用this.sendAJAX()。

    至此整个下载过程的原理已经很清楚了,就是在page中注入脚本,利用跨域同步AJAX取得指定url的内容,然后再返回给Casperjs。sendAJAX则新建XMLHttpRequest来发出请求,这里不详细讲解。

     

    3. 拓展download模块

    首先模仿clientutils封装client模块,保存为modules/client.js。

      1 /*
      2  * 用于phantomjs引用或注入page
      3  */
      4 (function(exports) {
      5     "use strict";
      6 
      7     exports.create = function create() {
      8         return new this.Client();
      9     }
     10 
     11     exports.Client = function Client() {
     12         var BASE64_ENCODE_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
     13         var BASE64_DECODE_CHARS = new Array(
     14             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
     15             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
     16             -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 62, -1, -1, -1, 63,
     17             52, 53, 54, 55, 56, 57, 58, 59, 60, 61, -1, -1, -1, -1, -1, -1,
     18             -1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,
     19             15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, -1, -1, -1, -1, -1,
     20             -1, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
     21             41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, -1, -1, -1, -1, -1
     22         );
     23 
     24         /**
     25          * Performs an AJAX request.
     26          *
     27          * @param   String   url      Url.
     28          * @param   String   method   HTTP method (default: GET).
     29          * @param   Object   data     Request parameters.
     30          * @param   Boolean  async    Asynchroneous request? (default: false)
     31          * @param   Object   settings Other settings when perform the ajax request
     32          * @return  String            Response text.
     33          */
     34         this.sendAJAX = function sendAJAX(url, method, data, async, settings) {
     35             var xhr = new XMLHttpRequest(),
     36                 dataString = "",
     37                 dataList = [];
     38             method = method && method.toUpperCase() || "GET";
     39             var contentType = settings && settings.contentType || "application/x-www-form-urlencoded";
     40             xhr.open(method, url, !!async);
     41             xhr.overrideMimeType("text/plain; charset=x-user-defined");
     42             if (method === "POST") {
     43                 if (typeof data === "object") {
     44                     for (var k in data) {
     45                         dataList.push(encodeURIComponent(k) + "=" + encodeURIComponent(data[k].toString()));
     46                     }
     47                     dataString = dataList.join('&');
     48                 } else if (typeof data === "string") {
     49                     dataString = data;
     50                 }
     51                 xhr.setRequestHeader("Content-Type", contentType);
     52             }
     53             xhr.send(method === "POST" ? dataString : null);
     54             return this.encode(xhr.responseText);
     55         };
     56 
     57         /**
     58          * Base64 encodes a string, even binary ones. Succeeds where
     59          * window.btoa() fails.
     60          *
     61          * @param  String  str  The string content to encode
     62          * @return string
     63          */
     64         this.encode = function encode(str) {
     65             /*jshint maxstatements:30 */
     66             var out = "", i = 0, len = str.length, c1, c2, c3;
     67             while (i < len) {
     68                 c1 = str.charCodeAt(i++) & 0xff;
     69                 if (i === len) {
     70                     out += BASE64_ENCODE_CHARS.charAt(c1 >> 2);
     71                     out += BASE64_ENCODE_CHARS.charAt((c1 & 0x3) << 4);
     72                     out += "==";
     73                     break;
     74                 }
     75                 c2 = str.charCodeAt(i++);
     76                 if (i === len) {
     77                     out += BASE64_ENCODE_CHARS.charAt(c1 >> 2);
     78                     out += BASE64_ENCODE_CHARS.charAt(((c1 & 0x3)<< 4) | ((c2 & 0xF0) >> 4));
     79                     out += BASE64_ENCODE_CHARS.charAt((c2 & 0xF) << 2);
     80                     out += "=";
     81                     break;
     82                 }
     83                 c3 = str.charCodeAt(i++);
     84                 out += BASE64_ENCODE_CHARS.charAt(c1 >> 2);
     85                 out += BASE64_ENCODE_CHARS.charAt(((c1 & 0x3) << 4) | ((c2 & 0xF0) >> 4));
     86                 out += BASE64_ENCODE_CHARS.charAt(((c2 & 0xF) << 2) | ((c3 & 0xC0) >> 6));
     87                 out += BASE64_ENCODE_CHARS.charAt(c3 & 0x3F);
     88             }
     89             return out;
     90         };
     91 
     92         /**
     93          * Decodes a base64 encoded string. Succeeds where window.atob() fails.
     94          *
     95          * @param  String  str  The base64 encoded contents
     96          * @return string
     97          */
     98         this.decode = function decode(str) {
     99             /*jshint maxstatements:30, maxcomplexity:30 */
    100             var c1, c2, c3, c4, i = 0, len = str.length, out = "";
    101             while (i < len) {
    102                 do {
    103                     c1 = BASE64_DECODE_CHARS[str.charCodeAt(i++) & 0xff];
    104                 } while (i < len && c1 === -1);
    105                 if (c1 === -1) {
    106                     break;
    107                 }
    108                 do {
    109                     c2 = BASE64_DECODE_CHARS[str.charCodeAt(i++) & 0xff];
    110                 } while (i < len && c2 === -1);
    111                 if (c2 === -1) {
    112                     break;
    113                 }
    114                 out += String.fromCharCode((c1 << 2) | ((c2 & 0x30) >> 4));
    115                 do {
    116                     c3 = str.charCodeAt(i++) & 0xff;
    117                     if (c3 === 61)
    118                     return out;
    119                     c3 = BASE64_DECODE_CHARS[c3];
    120                 } while (i < len && c3 === -1);
    121                 if (c3 === -1) {
    122                     break;
    123                 }
    124                 out += String.fromCharCode(((c2 & 0XF) << 4) | ((c3 & 0x3C) >> 2));
    125                 do {
    126                     c4 = str.charCodeAt(i++) & 0xff;
    127                     if (c4 === 61) {
    128                         return out;
    129                     }
    130                     c4 = BASE64_DECODE_CHARS[c4];
    131                 } while (i < len && c4 === -1);
    132                 if (c4 === -1) {
    133                     break;
    134                 }
    135                 out += String.fromCharCode(((c3 & 0x03) << 6) | c4);
    136             }
    137             return out;
    138         };
    139     };
    140 })(typeof exports === 'object' ? exports : window);

     封装download模块,保存为modules/download.js

     1 /*
     2  * 拓展模块,添加使用GET/POST下载资源的方法
     3  */
     4 exports.create = function create(page) {
     5     return new this.Casper(page);
     6 }
     7 
     8 exports.Casper = function Casper(page) {
     9     this.page = page;
    10     this.fs = require('fs');
    11     //client.js模块所在路径
    12     this.clientPath = this.fs.absolute(require('system').args[0]) + '/../modules/client.js';
    13     this.client = require(this.clientPath).create();
    14 
    15     this.get = function get(url, targetPath) {
    16         this.injectClientJs();  //注入client.js
    17         var content = this.page.evaluate(function(url) {
    18             return __utils__.sendAJAX(url);
    19         }, url);
    20         this.fs.write(targetPath, this.client.decode(content), 'wb');
    21     }
    22 
    23     this.post = function post(url, data, targetPath) {
    24         this.injectClientJs();  //注入client.js
    25         var content = this.page.evaluate(function(url, data) {
    26             return __utils__.sendAJAX(url, 'POST', data);
    27         }, url, data);
    28         this.fs.write(targetPath, this.client.decode(content), 'wb');
    29     }
    30 
    31     this.injectClientJs = function injectClientJs() {
    32         "use strict";
    33         //避免重复注入
    34         var isJsInjected = this.page.evaluate(function() {
    35             return typeof __utils__ === 'object';
    36         });
    37         if (true === isJsInjected) {
    38             return ;
    39         }
    40         if (true !== this.page.injectJs(this.clientPath)) {
    41             console.log('WARNING: Failed to inject client module!');
    42         }
    43         this.page.evaluate(function() {
    44             window.__utils__ = new window.Client(); //新建Client对象
    45         });
    46     };
    47 };

    写一份测试脚本保存为script.js。脚本路径与modules文件夹同级,假设分别为D:/script.js和D:/modules/。

     1 var fs = require('fs');
     2 //切换至当前脚本路径下,方便引入自定义模块
     3 var isChangeDirSuccees = fs.changeWorkingDirectory(fs.absolute(require('system').args[0]) + '/../');
     4 if (!isChangeDirSuccees) {
     5     console.log('ERROR: Failed to change working directory!');
     6     phantom.exit();
     7 }
     8 
     9 var page = require('webpage').create();
    10 page.open('http://www.w3school.com.cn/', function(status) {
    11     var download = require('./modules/download').create(page);
    12     download.get('http://www.w3school.com.cn/i/site_photoref.jpg', 'photo.jpg');
    13     console.log('LOG: Download Completed!');
    14     phantom.exit();
    15 });

    以上代码,先访问w3school主页,再下载site_photoref.jpg图片,保存在photo.jpg中。

    经过测试,download可下载所有类型的资源,包括压缩文件、APK。但是注意一点,由于同源策略,当执行跨域请求时(page.open和download的url不在同域下),要把web-security设为false[2],在命令行启动时输入:phantomjs --web-security=false script.js。

     

    参考资料及引用:

    [1] download方法例子:Casper官网. Casperjs Api.
    http://docs.casperjs.org/en/latest/modules/casper.html#download

    [2] web-security:Phantomjs官网. 命令行选项.
    http://phantomjs.org/api/command-line.html

  • 相关阅读:
    SpringMVC之使用ResponseEntity
    紧随时代的步伐--Java8特性之接口默认方法
    Executor多线程框架
    Jsoup入门
    Echart、Excel、highcharts、jfreechart对比
    JFreeChart入门
    Spring定时任务(@Scheduled)
    Java正则表达式入门基础篇
    Vue.js之入门
    springboot rabbitmq direct exchange和topic exchange 写法上关于路由键的区别
  • 原文地址:https://www.cnblogs.com/kavmors/p/4744445.html
Copyright © 2011-2022 走看看