
Java Web Crawler: Scraping Dynamic AJAX-Loaded Pages with PhantomJS and Jsoup

Crawling AJAX-loaded dynamic pages from Java on Windows needs a helper tool. This article uses PhantomJS, a scriptable headless browser (introductions to it are easy to find online).

First, download PhantomJS from https://phantomjs.org/download.html

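After downloading, extract the archive. To make later steps easier, put it in its own folder; I use a phantomjs folder on the F: drive. Then open phantomjs\bin and run phantomjs.exe. The following window appears:

[Screenshot: PhantomJS running]

This means PhantomJS can run JavaScript code. (If you use it often, add its bin directory to the PATH environment variable.)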
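
Once the bin directory is on PATH, Java code can start PhantomJS by name instead of through a hard-coded F: path. The following is a minimal sketch of how to check that from Java. The class PhantomLocator and its method findOnPath are hypothetical helpers introduced for illustration; they are not part of the original article or of PhantomJS.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical helper: searches a PATH-style string for an executable. */
final class PhantomLocator {

    /**
     * Returns the first regular file named exeName found in the directories
     * listed in pathValue (split on ';' on Windows, ':' elsewhere).
     */
    static Optional<Path> findOnPath(String exeName, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, exeName);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip PATH entries that are not valid paths on this platform
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        boolean windows = System.getProperty("os.name").toLowerCase().contains("win");
        String exe = windows ? "phantomjs.exe" : "phantomjs";
        Optional<Path> found = findOnPath(exe, System.getenv("PATH"));
        System.out.println(found.map(p -> "Found: " + p).orElse(exe + " is not on PATH"));
    }
}
```

If the lookup comes back empty, the PATH change has probably not taken effect yet; a new console window or an IDE restart is usually needed before the Java process sees it.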

Next comes the actual crawling.

First, write a JavaScript file (for example, parser.js):

// parser.js: load the URL passed on the command line and print the rendered HTML
var system = require('system');
var page = require('webpage').create();
var url = system.args[1];

// Give up on any single resource that takes longer than 10 seconds
page.settings.resourceTimeout = 1000 * 10;
page.onResourceTimeout = function (e) {
    // Print whatever has been rendered so far, then exit with an error code
    console.log(page.content);
    phantom.exit(1);
};

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to load ' + url);
        phantom.exit(1);
    } else {
        // The page has loaded; print the DOM as it stands now
        console.log(page.content);
        phantom.exit();
    }
});

Next is the Java code (my parser.js sits in the root of the F: drive):

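import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Read the rendered HTML of a dynamic page by running PhantomJS as a child process
public static String dynamicHtml(String url) {
    StringBuilder html = new StringBuilder();
    // Backslashes must be escaped in Java string literals. The array form passes
    // the URL as a single argument even if it contains spaces.
    String[] command = {"F:\\phantomjs\\bin\\phantomjs.exe", "F:/parser.js", url};
    try {
        Process process = Runtime.getRuntime().exec(command);
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return html.toString();
}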
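
A page that never finishes loading would leave the call above waiting indefinitely. One possible variant, sketched here with ProcessBuilder, writes PhantomJS output to a temporary file and enforces an overall time limit. The class name PhantomRunner, its methods, and the timeout parameter are assumptions made for illustration; they do not come from the original article.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs PhantomJS with an overall time limit. */
final class PhantomRunner {

    /** Builds the argument list; each element reaches PhantomJS as one argument. */
    static List<String> buildCommand(String phantomPath, String scriptPath, String url) {
        return Arrays.asList(phantomPath, scriptPath, url);
    }

    /** Runs the command and returns its standard output decoded as UTF-8. */
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Redirecting stdout to a file means a hung process cannot block a read loop
        Path out = Files.createTempFile("phantom-", ".html");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.INHERIT) // keep errors out of the HTML
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                // Kill the process and wait for it to go away before giving up
                process.destroyForcibly().waitFor();
                throw new IOException("PhantomJS did not exit within " + timeoutSeconds + " s");
            }
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }
}
```

With the paths used in this article, a call might look like PhantomRunner.run(PhantomRunner.buildCommand("F:\\phantomjs\\bin\\phantomjs.exe", "F:/parser.js", url), 30), where 30 seconds is an arbitrary example value.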

Processing logic (parsing with Jsoup):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public static void ReadAjaxDynamicHtml(String htmlUrl) {
    String imageHtml = dynamicHtml(htmlUrl);
    Document imageDoc = Jsoup.parse(imageHtml);
    // To select a subset of elements by class:
    // Elements childrenImg = imageDoc.select(".class");
    // System.err.println(childrenImg.html());
    // System.err.println(childrenImg.text());
    // To select by tag instead, for example img:
    // Elements childrenImg = imageDoc.select("img");
    System.err.println(imageDoc);
    /* Further processing goes here */
    // ...
}
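
The src values of selected img elements are often relative, so they need to be resolved against the page URL before they can be downloaded. Jsoup can do this itself when the page URL is passed as the base URI to Jsoup.parse(html, baseUri) and the attribute is read with absUrl("src"). The sketch below shows the same idea using only java.net.URI; the class UrlResolver and its method toAbsolute are hypothetical names introduced for illustration.

```java
import java.net.URI;

/** Hypothetical helper: turns a possibly relative src value into an absolute URL. */
final class UrlResolver {

    /**
     * Resolves src against pageUrl. Throws IllegalArgumentException if either
     * string is not a syntactically valid URI (for example, contains raw spaces).
     */
    static String toAbsolute(String pageUrl, String src) {
        URI base = URI.create(pageUrl);
        // A base such as http://www.baidu.com has an empty path; give it "/"
        // so that relative references are resolved against the site root
        if (base.getPath() == null || base.getPath().isEmpty()) {
            base = base.resolve("/");
        }
        return base.resolve(src.trim()).toString();
    }

    public static void main(String[] args) {
        System.out.println(toAbsolute("http://example.com/gallery/index.html", "img/a.png"));
    }
}
```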

Example call from the main method:

public static void main(String[] args) {
    String htmlUrl = "http://www.baidu.com";
    ReadAjaxDynamicHtml(htmlUrl);
}

[Screenshot: part of the printed output]

Maven dependency for reference:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>

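That completes the test. Crawling pages often involves saving text and images, so here is sample code for writing text to a file and downloading an image to local disk: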

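import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

/**
 * Writes text to a new UTF-8 file on the F: drive.
 *
 * @param text     the text to write
 * @param fileName file name prefix; a timestamp and ".txt" are appended
 * @throws IOException if the file cannot be created or written
 */
public static void Writer(String text, String fileName) throws IOException {
    // Path of the generated file (the backslash must be escaped)
    String path = "F:\\" + fileName + System.currentTimeMillis() + ".txt";
    File file = new File(path);
    File parent = file.getParentFile();
    if (parent != null && !parent.exists()) {
        parent.mkdirs();
    }
    // FileOutputStream creates the file; try-with-resources flushes and closes it
    try (BufferedWriter bw = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
        bw.write(text);
    }
}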

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

/**
 * Downloads one image to a local file.
 *
 * @param urlList address of the image
 * @param path    local path to save it to
 */
private static void downloadPicture(String urlList, String path) {
    File file = new File(path);
    File parent = file.getParentFile();
    if (parent != null && !parent.exists()) {
        parent.mkdirs();
    }
    // try-with-resources closes both streams even if the copy fails
    try (InputStream in = new URL(urlList).openStream();
         OutputStream out = new FileOutputStream(file)) {
        byte[] buffer = new byte[1024];
        int length;
        // read() returns -1 at end of stream
        while ((length = in.read(buffer)) != -1) {
            out.write(buffer, 0, length);
        }
    } catch (IOException e) {
        // MalformedURLException is a subclass of IOException
        e.printStackTrace();
    }
}
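
downloadPicture expects the caller to supply the local path. One way to derive it is to reuse the last segment of the image URL as the file name. The following is a minimal sketch under that assumption; the class ImageNames, the method fileNameFromUrl, and the fallback parameter are hypothetical and not part of the original article.

```java
import java.net.URI;

/** Hypothetical helper: picks a local file name for a downloaded image. */
final class ImageNames {

    /**
     * Returns the last path segment of imageUrl, ignoring any query string or
     * fragment, or fallback when the URL has no usable segment.
     */
    static String fileNameFromUrl(String imageUrl, String fallback) {
        String path = URI.create(imageUrl).getPath(); // excludes ?query and #fragment
        if (path == null) {
            return fallback;
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? fallback : name;
    }

    public static void main(String[] args) {
        System.out.println(fileNameFromUrl("http://example.com/a/b/pic.png?x=1", "image.bin"));
    }
}
```

The result can then be appended to a target folder, for example "F:\\images\\" + ImageNames.fileNameFromUrl(url, "image.bin"), and passed as the path argument.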

The method parameters can of course be adapted to your own needs.


Original article: https://www.cnblogs.com/unidentified/p/12502741.html