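Crawling AJAX-loaded dynamic pages from Java on Windows needs a helper tool that can execute the page's JavaScript. The tool used in this post is PhantomJS, a headless browser (introductions to PhantomJS are easy to find online).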
First, download PhantomJS from https://phantomjs.org/download.html
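After downloading, unzip the archive. To make later steps easier, put it in a folder of its own; here it lives in a separate folder named phantomjs on the F: drive. Then open phantomjs\bin and run phantomjs.exe. If the PhantomJS console window opens, JavaScript code can be run normally. (If you use it often, add the bin directory to the PATH environment variable.)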
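If the bin directory has been added to PATH, the Java side can check for the executable before trying to launch it. The following is only a minimal sketch: the class name `PhantomLocator` and the method `findOnPath` are hypothetical helpers made up for this illustration, not part of PhantomJS or the JDK.

```java
import java.io.File;
import java.util.Optional;
import java.util.regex.Pattern;

public class PhantomLocator {

    // Search every directory of a PATH-style string for a file called `name`.
    // `pathValue` is expected to use the platform separator (";" on Windows).
    public static Optional<File> findOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, name);
            if (candidate.isFile()) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<File> exe = findOnPath("phantomjs.exe", System.getenv("PATH"));
        System.out.println(exe.map(File::getAbsolutePath)
                .orElse("phantomjs.exe was not found on PATH"));
    }
}
```

When the lookup comes back empty, falling back to an absolute path such as the F:/phantomjs/bin location used in this post is a reasonable choice.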
Next comes the actual crawling.
First, write a JavaScript file for PhantomJS to run (for example, parser.js):
var system = require('system');
var page = require('webpage').create();
// The URL to render is passed as the first command-line argument
var url = system.args[1];

// Give up on any single resource after 10 seconds
page.settings.resourceTimeout = 1000 * 10;
page.onResourceTimeout = function (e) {
    // Print whatever has been rendered so far, then exit with an error code
    console.log(page.content);
    phantom.exit(1);
};

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the page!');
        phantom.exit(1);
    } else {
        // page.content is the serialized DOM at the moment this callback fires
        console.log(page.content);
        phantom.exit();
    }
});
Then the Java code (my parser.js sits directly on the F: drive):
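import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Read a dynamic page: let PhantomJS render it and capture what it prints to stdout
public static String dynamicHtml(String url) {
    StringBuilder html = new StringBuilder();
    // Forward slashes avoid the invalid "\p" and "\b" escapes of an unescaped Windows path;
    // passing an array keeps the URL as a single argument
    String[] command = {"F:/phantomjs/bin/phantomjs.exe", "F:/parser.js", url};
    try {
        Process process = Runtime.getRuntime().exec(command);
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                // readLine() strips the line terminator, so put one back
                html.append(line).append('\n');
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return html.toString();
}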
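One weakness of reading the process output directly is that the Java call blocks for as long as PhantomJS keeps running. A possible way around that, sketched below with `ProcessBuilder`, is to redirect stdout to a temporary file and wait with a timeout. The class name `PhantomRunner`, both method names, and the 30-second figure in the usage note are assumptions made for this example only.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class PhantomRunner {

    // Build the argument list: executable, script, then the URL as one argument.
    public static List<String> buildCommand(String phantomExe, String script, String url) {
        return Arrays.asList(phantomExe, script, url);
    }

    // Run the command, wait at most timeoutSeconds, and return what it wrote to stdout.
    public static String runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        File out = File.createTempFile("phantom-out", ".html");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectOutput(out)                            // stdout goes to the temp file
                    .redirectError(ProcessBuilder.Redirect.INHERIT) // stderr goes to our console
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("Timed out after " + timeoutSeconds + " s: " + command);
            }
            return new String(Files.readAllBytes(out.toPath()), StandardCharsets.UTF_8);
        } finally {
            out.delete();
        }
    }
}
```

With the paths from this post, a call could look like `runWithTimeout(buildCommand("F:/phantomjs/bin/phantomjs.exe", "F:/parser.js", url), 30)`.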
Processing logic (parsing the result with Jsoup):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public static void ReadAjaxDynamicHtml(String htmlUrl) {
    String imageHtml = dynamicHtml(htmlUrl);
    Document imageDoc = Jsoup.parse(imageHtml);
    // To select only elements with a given class:
    // Elements childrenImg = imageDoc.select(".class");
    // System.err.println(childrenImg.html());
    // System.err.println(childrenImg.text());
    // To select only certain tags, for example img:
    // Elements childrenImg = imageDoc.select("img");
    System.err.println(imageDoc);
    /* further processing goes here */
    // ...
}
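The `src` values found on selected `img` elements are often relative to the page, so they usually need to be turned into absolute URLs before anything can be downloaded. Below is a minimal sketch using only `java.net.URI`; the class `UrlResolver`, the method `resolve`, and the example.com addresses are assumptions for illustration. Jsoup can also do this itself through `Element.absUrl("src")` when a base URI is passed to `Jsoup.parse(html, baseUri)`.

```java
import java.net.URI;

public class UrlResolver {

    // Resolve an img src (relative or absolute) against the URL of the page it came from.
    // URI.create throws IllegalArgumentException for illegal characters such as spaces.
    public static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/news/index.html";
        System.out.println(resolve(page, "img/a.png"));
        System.out.println(resolve(page, "/img/a.png"));
        System.out.println(resolve(page, "http://cdn.example.com/b.png"));
    }
}
```

The page URL should include a path (at least a trailing slash), because `URI.resolve` does not insert a slash when the base has an empty path.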
Example call from a main method:
public static void main(String[] args) {
    String htmlUrl = "http://www.baidu.com";
    ReadAjaxDynamicHtml(htmlUrl);
}
Running this prints the parsed HTML of the page to the console.
Jar dependency for reference:
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>
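That completes the test. Crawling pages often also involves saving text and images, so here is sample code for writing text to a file and for downloading an image to local disk: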
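import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

/**
 * Writes text to a new UTF-8 file on the F: drive.
 *
 * @param text     the text to write
 * @param fileName file name prefix; a timestamp and ".txt" are appended
 * @throws IOException if the file cannot be created or written
 */
public static void Writer(String text, String fileName) throws IOException {
    // Path of the generated file ("F:\" would not compile because \" escapes the quote)
    String path = "F:/" + fileName + System.currentTimeMillis() + ".txt";
    File file = new File(path);
    File parent = file.getParentFile();
    if (parent != null && !parent.exists()) {
        parent.mkdirs();
    }
    // try-with-resources flushes and closes the writer even if write() fails
    try (BufferedWriter bw = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
        bw.write(text);
    }
}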
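On Java 7 or later, the same text-saving step can be written more compactly with `java.nio.file.Files`. This is only a sketch of an alternative, not the code used in this post; the class `TextSaver` and its `save` method are hypothetical names, and the target directory is passed in as a parameter instead of being fixed to the F: drive.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TextSaver {

    // Write `text` as UTF-8 to <dir>/<baseName><timestamp>.txt and return the file's path.
    public static Path save(String text, Path dir, String baseName) throws IOException {
        Files.createDirectories(dir); // does nothing if the directory already exists
        Path target = dir.resolve(baseName + System.currentTimeMillis() + ".txt");
        return Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

Returning the path lets the caller log or reuse the timestamped file name.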
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

/**
 * Downloads one image to a local file.
 *
 * @param urlList address of the image
 * @param path    local path to save it to
 */
private static void downloadPicture(String urlList, String path) {
    File file = new File(path);
    File parent = file.getParentFile();
    if (parent != null && !parent.exists()) {
        parent.mkdirs();
    }
    try (InputStream in = new URL(urlList).openStream();
         FileOutputStream out = new FileOutputStream(file)) {
        byte[] buffer = new byte[1024];
        int length;
        // read() returns -1 at end of stream; copy each chunk straight to the file
        while ((length = in.read(buffer)) != -1) {
            out.write(buffer, 0, length);
        }
    } catch (IOException e) {
        // MalformedURLException is a subclass of IOException, so one catch covers both
        e.printStackTrace();
    }
}
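The `path` argument of `downloadPicture` has to come from somewhere, and one simple option is to reuse the last segment of the image URL as the local file name. The helper below is a sketch under that assumption; `ImageNames`, `fileNameFromUrl`, and the fallback name "unnamed" are invented for this example.

```java
import java.net.URI;

public class ImageNames {

    // Take the last path segment of the URL as the file name, ignoring any query string.
    public static String fileNameFromUrl(String imageUrl) {
        String path = URI.create(imageUrl).getPath();
        if (path == null) {
            return "unnamed";
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? "unnamed" : name;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/img/logo.png?v=2";
        // Builds a local target path such as F:/images/logo.png
        System.out.println("F:/images/" + fileNameFromUrl(url));
    }
}
```

Different images can share a file name, so appending a timestamp or counter (as the text-writing method does) avoids overwriting earlier downloads.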
Of course, the method parameters can be adapted to your own needs.