  • Java web crawler: WebCollector 2.1.2 + Selenium 2.44 + PhantomJS 2.1.1


    1. Introduction

    Version matching: WebCollector 2.12 + Selenium 2.44.0 + PhantomJS 2.1.1

    Dynamic page crawling: WebCollector + Selenium + PhantomJS

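    Note: "dynamic page" here covers a few cases: 1) pages that require user interaction, such as the common login flow; 2) pages whose content is generated by JS / AJAX, e.g. the HTML contains <div id="test"></div> and JS turns it into <div id="test"><span>aaa</span></div>.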
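
    To make case 2 concrete: a plain HTTP fetch only sees the markup as served, so the placeholder is still empty, while a browser-driven fetch sees the DOM after the scripts have run. Below is a minimal sketch, using only the JDK, that checks which of the two a given HTML string looks like. `PlaceholderCheck` and `isEmptyDiv` are hypothetical names introduced for illustration, and the regex only handles the exact `<div id="...">` shape from the example, not arbitrary HTML.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of WebCollector or Selenium.
final class PlaceholderCheck {

    // Returns true when the HTML contains <div id="ID"> with nothing but
    // whitespace before its closing tag, i.e. the content has not been filled in.
    static boolean isEmptyDiv(String html, String id) {
        Pattern p = Pattern.compile(
                "<div\\s+id=\"" + Pattern.quote(id) + "\"\\s*>\\s*</div>");
        return p.matcher(html).find();
    }

    public static void main(String[] args) {
        String raw = "<div id=\"test\"></div>";                       // markup as served
        String rendered = "<div id=\"test\"><span>aaa</span></div>";  // DOM after JS ran
        System.out.println(isEmptyDiv(raw, "test"));       // true
        System.out.println(isEmptyDiv(rendered, "test"));  // false
    }
}
```

    If a check like this returns true for markup fetched without a browser, that is a hint the page probably needs to be rendered before it can be scraped.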

    WebCollector 2 does the crawling here and is convenient to use, but the key to handling dynamic pages is another API: Selenium 2 (which integrates HtmlUnit and PhantomJS).

    2. Example

    /** 
     * Project Name:padwebcollector 
     * File Name:DiscussService.java 
     * Package Name:com.pad.service 
     * Date: 2018-07-25 16:59:44
     * Copyright (c) 2018 All Rights Reserved. 
     * 
    */  
      
    package com.pad.service;  
    
    import java.util.ArrayList;
    import java.util.List;
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.phantomjs.PhantomJSDriver;
    import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
    import cn.edu.hfut.dmic.webcollector.model.Links;
    import cn.edu.hfut.dmic.webcollector.model.Page;
    import com.pad.entity.DiscussInfo;
    import com.pad.impl.DiscussInfoImpl;
    
    public class DiscussService extends DeepCrawler {
        
        public DiscussService(String crawlPath) {
            super(crawlPath);
            // crawlPath is the directory WebCollector uses to persist crawl data.
        }
        
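        @Override
        public Links visitAndGetNextLinks(Page page) {
            // Load the page in PhantomJS so JS-generated content is present in the DOM.
            WebDriver driver = getWebDriver(page);
            List<DiscussInfo> discusslist = new ArrayList<>();
            try {
                // Analysis is a project class that is not shown in this post.
                Analysis analysis = new Analysis();
                List<WebElement> list = driver.findElements(By.className("content"));
                int i = 1;
                String r_msg = "观望"; // default result ("wait and see")
                for (WebElement el : list) {
                    String text = el.getText();
                    // Blank entries keep the result of the previous non-blank entry.
                    if (!"".equals(text.trim())) {
                        r_msg = analysis.analysis(text);
                    }

                    DiscussInfo info = new DiscussInfo();
                    info.setLine_no(String.valueOf(i));
                    info.setResult_msg(r_msg);
                    info.setContent_msg(text);
                    discusslist.add(info);
                    System.out.println(i + " " + text);
                    i++;
                }
            } finally {
                // quit() closes every window and ends the driver session, even on failure;
                // a separate close() call beforehand is not needed.
                driver.quit();
            }

            DiscussInfoImpl impl = new DiscussInfoImpl();
            impl.saveData(discusslist);
            // No follow-up links are returned from this page.
            return null;
        }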
        
        public static WebDriver getWebDriver(Page page) {
            // Backslashes must be escaped in Java string literals (path elided in the original).
            System.setProperty("phantomjs.binary.path", "D:\\******\\phantomjs.exe");
            WebDriver driver = new PhantomJSDriver();
            driver.get(page.getUrl());
            return driver;
        }
    
        public static void main(String[] args) {
            DiscussService dis = new DiscussService("discuss");
            dis.addSeed("https://*******/index/0000012");
            try {
                dis.start(1);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
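
    As written, the loop in `visitAndGetNextLinks` gives a blank entry the result of the most recent non-blank one, starting from a default value. That rule can be pulled out into a pure function and tried without launching a browser. This is a minimal sketch under stated assumptions: `CarryForwardLabeler` and the toy classifier are hypothetical stand-ins, since the real `Analysis.analysis` is not shown in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper for illustration; mirrors the labeling rule of the loop above.
final class CarryForwardLabeler {

    // Labels each text in order. A null or blank text reuses the most recent
    // label, or defaultLabel if no non-blank text has been seen yet.
    static List<String> label(List<String> texts, String defaultLabel,
                              Function<String, String> classifier) {
        List<String> labels = new ArrayList<>();
        String last = defaultLabel;
        for (String text : texts) {
            if (text != null && !text.trim().isEmpty()) {
                last = classifier.apply(text);
            }
            labels.add(last);
        }
        return labels;
    }

    public static void main(String[] args) {
        // Toy classifier standing in for Analysis.analysis; purely illustrative.
        Function<String, String> toy = t -> t.contains("buy") ? "BUY" : "HOLD";
        List<String> texts = Arrays.asList("buy now", "   ", "wait");
        System.out.println(label(texts, "HOLD", toy)); // prints [BUY, BUY, HOLD]
    }
}
```

    Keeping the rule separate from the `WebElement` handling means it can be unit tested with plain strings, and the Selenium code only has to collect the text.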

    Note: WebCollector 2.12 and WebCollector 2.7 differ in the base class a crawler extends: DeepCrawler in 2.12, BreadthCrawler in 2.7.
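
    The `getWebDriver` method in the example hardcodes the PhantomJS location, which ties the crawler to one machine. One possible alternative, sketched here with only the JDK, is to look the path up at startup and fail early if the binary is missing. The property name `phantomjs.binary.path` comes from the example; the environment variable `PHANTOMJS_BINARY`, the class `PhantomJsPath`, and its methods are assumptions introduced for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Selenium or WebCollector.
final class PhantomJsPath {

    static final String PROPERTY = "phantomjs.binary.path"; // property used in the example
    static final String ENV_VAR = "PHANTOMJS_BINARY";       // assumed name, not a standard

    // Returns the first non-blank candidate: property value, then env value, then fallback.
    static String choose(String propertyValue, String envValue, String fallback) {
        if (propertyValue != null && !propertyValue.trim().isEmpty()) {
            return propertyValue;
        }
        if (envValue != null && !envValue.trim().isEmpty()) {
            return envValue;
        }
        return fallback;
    }

    // Resolves the path, checks that it points at a regular file, and sets the
    // system property so a driver created afterwards can pick it up.
    static String configure(String fallback) {
        String path = choose(System.getProperty(PROPERTY), System.getenv(ENV_VAR), fallback);
        if (path == null || !Files.isRegularFile(Paths.get(path))) {
            throw new IllegalStateException("PhantomJS binary not found: " + path);
        }
        System.setProperty(PROPERTY, path);
        return path;
    }

    public static void main(String[] args) {
        // Only demonstrates the precedence rule; does not touch the file system.
        System.out.println(choose(null, "", "phantomjs"));
    }
}
```

    With something like this, `getWebDriver` would call `PhantomJsPath.configure(...)` once before `new PhantomJSDriver()`, and the binary location could be supplied with `-Dphantomjs.binary.path=...` instead of being edited in source.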

  • Original article: https://www.cnblogs.com/lizm166/p/9376369.html