  • Parsing Web Page Information Online with HtmlUnit

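    Preface

    I recently ran into a problem at work: a scheduled backend task, written in Java, has to determine every day whether the date is a public holiday, a weekend day off, a working day, and so on.

    Working out China's public holidays purely by logic is essentially impossible, because the holiday schedule can differ from year to year and is set by administrative decision.

    So other means are needed. The more reliable options I could think of are: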

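    1. A web API: some data providers offer one, but they either charge a fee or limit the number of calls, among other problems. The results are not ideal and you have little control. I have not tried any of them myself. Examples: https://www.juhe.cn/docs/api/id/177/aid/601 or http://apistore.baidu.com/apiworks/servicedetail/1116.html
    2. Parsing a web page online to get the holiday schedule: this depends heavily on the page being parsed, so pick a reasonably reliable site.
    3. Entering the officially announced holiday schedule into the system every year: if the customer does not mind the extra work, this is fairly reliable.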

    This demo implements the second option.
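    For comparison, the third option (a hand-maintained table) could look roughly like the sketch below. The class and method names are hypothetical, and the dates are sample entries for illustration only; real entries should be taken from the official holiday notice for each year.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: a hand-maintained table of days off and make-up workdays.
class HolidayTable {
    private final Set<LocalDate> holidays = new HashSet<>();
    private final Set<LocalDate> makeupWorkdays = new HashSet<>();

    // Registers every day from `from` to `to` (inclusive) as a day off.
    void addHoliday(LocalDate from, LocalDate to) {
        for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1)) {
            holidays.add(d);
        }
    }

    // Registers a weekend day that is worked to compensate for a holiday.
    void addMakeupWorkday(LocalDate day) {
        makeupWorkdays.add(day);
    }

    // Make-up workdays win over everything, then holidays, then the weekday rule.
    boolean isWorkday(LocalDate day) {
        if (makeupWorkdays.contains(day)) {
            return true;
        }
        if (holidays.contains(day)) {
            return false;
        }
        DayOfWeek dow = day.getDayOfWeek();
        return dow != DayOfWeek.SATURDAY && dow != DayOfWeek.SUNDAY;
    }
}

class HolidayTableDemo {
    // Sample entries only; verify against the official notice before use.
    static HolidayTable sampleTable() {
        HolidayTable table = new HolidayTable();
        table.addHoliday(LocalDate.of(2016, 2, 7), LocalDate.of(2016, 2, 13));
        table.addMakeupWorkday(LocalDate.of(2016, 2, 6));
        table.addMakeupWorkday(LocalDate.of(2016, 2, 14));
        return table;
    }

    public static void main(String[] args) {
        HolidayTable table = sampleTable();
        System.out.println("2016-02-08 workday? " + table.isWorkday(LocalDate.of(2016, 2, 8)));
        System.out.println("2016-02-06 workday? " + table.isWorkday(LocalDate.of(2016, 2, 6)));
    }
}
```

    The trade-off is the one stated above: nothing depends on a third-party page, but someone has to enter the table every year.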

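    Using HtmlUnit to parse a web page online and get the holiday schedule

    I first used jsoup to parse the page, but the results were poor. jsoup parses static HTML and does not execute JavaScript, so it ran into all sorts of problems when the page was generated dynamically, and I switched to HtmlUnit. Overall, HtmlUnit is quite powerful: it can simulate a running browser and has been called an open-source Java implementation of a browser.

    First download the jar packages from the official site and read the documentation: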

    http://htmlunit.sourceforge.net/

    The page parsed here is the 360 perpetual calendar:

    http://hao.360.cn/rili/

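    The calendar page looks like this:

    The HTML being parsed has the following format:

    Implementation steps:

    1. Load the page.

    2. Poll until the page has finished loading (parts of the page may be generated dynamically by JavaScript).

    3. Parse the HTML according to the page format, extract the key information, and store it in a wrapper object.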
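    Step 2 is a plain polling loop. A minimal, library-independent sketch of it follows; the class name PageWait and its parameters are made up for illustration and are not part of HtmlUnit.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper illustrating the "poll until loaded" step.
class PageWait {
    // Checks `ready` up to maxAttempts times, sleeping sleepMillis after each
    // failed check, then checks once more. Returns whether the condition held.
    static boolean waitUntil(BooleanSupplier ready, int maxAttempts, long sleepMillis)
            throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (ready.getAsBoolean()) {
                return true;
            }
            Thread.sleep(sleepMillis);
        }
        return ready.getAsBoolean();
    }
}

class PageWaitDemo {
    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true after roughly 50 ms.
        boolean loaded = PageWait.waitUntil(
                () -> System.currentTimeMillis() - start >= 50, 60, 10);
        System.out.println("loaded: " + loaded);
    }
}
```

    In the article's code the condition is that the element with id M-dates has non-empty text, checked once a second for at most 60 seconds.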

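    Notes:

    1. The hard part is deciding whether a day is a day off and which holiday it belongs to. The original page does not label every day with its holiday, so this logic has to be implemented by hand; see the code for details.

    2. The static variable latestVocationName is a safeguard for the rare case where the holiday name cannot be read from the page for a day off, in which case the most recently seen name is used instead. (The method has to be called once a day for the variable to be of any use.)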
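    The idea behind note 1 can be shown without HtmlUnit: treat the calendar as a list of day cells, find the run of consecutive days off that contains the target date, and take the holiday name from whichever cell in that run carries a recognizable label. The sketch below is a simplified model under that assumption; DayCell, HolidayNamer and the sample cells are hypothetical and only mirror what the article's code reads from the page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one day cell: its date, its label text,
// and whether the page marks it as a day off.
class DayCell {
    final String date;
    final String label;
    final boolean vacation;

    DayCell(String date, String label, boolean vacation) {
        this.date = date;
        this.label = label;
        this.vacation = vacation;
    }
}

class HolidayNamer {
    // Returns the holiday name for `date`, or `fallback` when the run of
    // days off containing `date` has no recognizable label.
    static String nameFor(List<DayCell> cells, String date,
                          Map<String, String> labelToName, String fallback) {
        int i = 0;
        while (i < cells.size()) {
            if (!cells.get(i).vacation) {
                i++;
                continue;
            }
            boolean hit = false;
            String name = "";
            // Walk one run of consecutive days off.
            while (i < cells.size() && cells.get(i).vacation) {
                DayCell cell = cells.get(i);
                if (cell.date.equals(date)) {
                    hit = true;
                }
                String mapped = labelToName.get(cell.label);
                if (mapped != null) {
                    name = mapped;
                }
                i++;
            }
            if (hit) {
                return name.isEmpty() ? fallback : name;
            }
        }
        return fallback;
    }
}

class HolidayNamerDemo {
    // Label-to-holiday mapping; both labels belong to the same holiday.
    static Map<String, String> sampleLabels() {
        Map<String, String> labels = new HashMap<>();
        labels.put("除夕", "春节");
        labels.put("春节", "春节");
        return labels;
    }

    // Made-up cells for illustration; the last one is a day off with no named day in its run.
    static List<DayCell> sampleCells() {
        List<DayCell> cells = new ArrayList<>();
        cells.add(new DayCell("2016/02/07", "除夕", true));
        cells.add(new DayCell("2016/02/08", "春节", true));
        cells.add(new DayCell("2016/02/09", "初二", true));
        cells.add(new DayCell("2016/02/14", "初七", false));
        cells.add(new DayCell("2016/02/20", "十三", true));
        return cells;
    }

    public static void main(String[] args) {
        System.out.println(HolidayNamer.nameFor(sampleCells(), "2016/02/09", sampleLabels(), "unknown"));
        System.out.println(HolidayNamer.nameFor(sampleCells(), "2016/02/20", sampleLabels(), "unknown"));
    }
}
```

    The fallback argument plays the role of latestVocationName from note 2: it is what the caller gets when the run itself gives no name.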

    Implementation:

    Define a class for a Chinese calendar date:

    package com.pichen.tools.getDate;
    
    import java.util.Date;
    
    
    public class ChinaDate {
    
        /**
         * Gregorian (solar) date
         */
        private Date solarDate;
        
        /**
         * Lunar day label
         */
        private String lunar;
        
        /**
         * Gregorian day label
         */
        private String solar;
    
        
        /**
         * Whether the day is marked 休 (day off)
         */
        private boolean isVacation = false;
        /**
         * Holiday name when the day is marked 休; defaults to 非假期 (not a holiday)
         */
        private String vacationName = "非假期";
        /**
         * Whether the day is a working day (including days marked 班)
         */
        private boolean isWorkFlag = false;
        
        private boolean isSaturday = false;
        private boolean isSunday = false;
        /**
         * @return the solarDate
         */
        public Date getSolarDate() {
            return solarDate;
        }
        /**
         * @param solarDate the solarDate to set
         */
        public void setSolarDate(Date solarDate) {
            this.solarDate = solarDate;
        }
        /**
         * @return the lunar
         */
        public String getLunar() {
            return lunar;
        }
        /**
         * @param lunar the lunar to set
         */
        public void setLunar(String lunar) {
            this.lunar = lunar;
        }
        /**
         * @return the solar
         */
        public String getSolar() {
            return solar;
        }
        /**
         * @param solar the solar to set
         */
        public void setSolar(String solar) {
            this.solar = solar;
        }
    
        /**
         * @return the isVacation
         */
        public boolean isVacation() {
            return isVacation;
        }
        /**
         * @param isVacation the isVacation to set
         */
        public void setVacation(boolean isVacation) {
            this.isVacation = isVacation;
        }
        /**
         * @return the vacationName
         */
        public String getVacationName() {
            return vacationName;
        }
        /**
         * @param vacationName the vacationName to set
         */
        public void setVacationName(String vacationName) {
            this.vacationName = vacationName;
        }
        /**
         * @return the isWorkFlag
         */
        public boolean isWorkFlag() {
            return isWorkFlag;
        }
        /**
         * @param isWorkFlag the isWorkFlag to set
         */
        public void setWorkFlag(boolean isWorkFlag) {
            this.isWorkFlag = isWorkFlag;
        }
        /**
         * @return the isSaturday
         */
        public boolean isSaturday() {
            return isSaturday;
        }
        /**
         * @param isSaturday the isSaturday to set
         */
        public void setSaturday(boolean isSaturday) {
            this.isSaturday = isSaturday;
        }
        /**
         * @return the isSunday
         */
        public boolean isSunday() {
            return isSunday;
        }
        /**
         * @param isSunday the isSunday to set
         */
        public void setSunday(boolean isSunday) {
            this.isSunday = isSunday;
        }
        
    
    }

    Parse the page, run the demo, and print the details for the current month and for today:

    package com.pichen.tools.getDate;
    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.text.DateFormat;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;
    
    import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.DomNodeList;
    import com.gargoylesoftware.htmlunit.html.HtmlElement;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    
    
    public class Main {
        
    
        private static String latestVocationName="";
        
        public String getVocationName(DomNodeList<HtmlElement> htmlElements, String date) throws ParseException{
            String rst = "";
            
            boolean pastTimeFlag = false;
            DateFormat dateFormat = new SimpleDateFormat("yyyy/MM/dd");
            Date paramDate = dateFormat.parse(date);
            if(new Date().getTime() >= paramDate.getTime()){
                pastTimeFlag = true;
            }
            
            // Step 1: try to read the holiday name from the HTML page
            for(int i = 0; i < htmlElements.size(); i++){
                HtmlElement element = htmlElements.get(i);
                if(element.getAttribute("class").indexOf("vacation")!=-1){
                    
                    boolean hitFlag = false;
                    String vacationName = "";
                    // Walk the run of consecutive days off that starts at index i
                    for(; i < htmlElements.size(); i++){
                        HtmlElement elementTmp = htmlElements.get(i);
                        // Stop at the first day that is not a day off, before reading its label
                        if(elementTmp.getAttribute("class").indexOf("vacation")==-1){
                            break;
                        }
                        String liDate = elementTmp.getAttribute("date");
                        
                        List<HtmlElement> lunar = elementTmp.getElementsByAttribute("span", "class", "lunar");
                        String lunarText = lunar.isEmpty() ? "" : lunar.get(0).asText();
                        
                        // The page text is Chinese, so the labels are matched as they appear
                        if(lunarText.equals("元旦")){
                            vacationName = "元旦";
                        }else if(lunarText.equals("除夕")||lunarText.equals("春节")){
                            vacationName = "春节";
                        }else if(lunarText.equals("清明")){
                            vacationName = "清明";
                        }else if(lunarText.equals("国际劳动节")){
                            vacationName = "国际劳动节";
                        }else if(lunarText.equals("端午节")){
                            vacationName = "端午节";
                        }else if(lunarText.equals("中秋节")){
                            vacationName = "中秋节";
                        }else if(lunarText.equals("国庆节")){
                            vacationName = "国庆节";
                        }
                        
                        
                        if(liDate.equals(date)){
                            hitFlag = true;
                        }
                    }
                    
                    
                    // The run contains the requested date and a named day
                    if(hitFlag && !vacationName.equals("")){
                        rst = vacationName;
                        break;
                    }
                    
                    
                }else{
                    continue;
                }
            }
            
            
            
            // If step 1 fails (rare), fall back to the most recent holiday name
            if(rst.equals("")){
                System.out.println("warning: failed to get the vacation name from the html page.");
    
                // A simple rule-based guess could be added here
                
                // Use the most recent holiday name
                rst = Main.latestVocationName;
            }else if(pastTimeFlag == true){
                // Remember the name of the most recent visible holiday that is not later than the current time
                Main.latestVocationName = rst;
            }
            return rst;
        }
        
        
        public List<ChinaDate> getCurrentDateInfo(){
            WebClient webClient = null;
            List<ChinaDate> dateList = null;
            
            try{
                DateFormat dateFormat = new SimpleDateFormat("yyyy/MM/dd");
                dateList = new ArrayList<ChinaDate>();
    
                webClient = new WebClient();
                HtmlPage page = webClient.getPage("http://hao.360.cn/rili/");
                
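                // Wait at most 60 seconds for the script-generated calendar to appear
                for(int k = 0; k < 60; k++){
                    if(page.getElementById("M-dates") != null &&
                       !page.getElementById("M-dates").asText().equals("")) break;
                    Thread.sleep(1000);
                }
                
                // Alternative: sleep a fixed 8 seconds while the page loads; this was unstable and the page was sometimes not retrieved
                //Thread.sleep(8000);
    
                DomNodeList<HtmlElement> htmlElements = page.getElementById("M-dates").getElementsByTagName("li");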
                //System.out.println(htmlElements.size());
                
                
                for(HtmlElement element : htmlElements){
                    ChinaDate chinaDate = new ChinaDate();
                    
                    List<HtmlElement> lunar = element.getElementsByAttribute("span", "class", "lunar");
                    List<HtmlElement> solar = element.getElementsByAttribute("div", "class", "solar");
    
                    chinaDate.setLunar(lunar.get(0).asText());
                    chinaDate.setSolar(solar.get(0).asText());
                    chinaDate.setSolarDate(dateFormat.parse(element.getAttribute("date")));
                    
    
                    if(element.getAttribute("class").indexOf("vacation")!=-1){
                        chinaDate.setVacation(true);
                        chinaDate.setVacationName(this.getVocationName(htmlElements, element.getAttribute("date")));
                        
                        
    
                        
                    }
                    
                    if(element.getAttribute("class").indexOf("weekend")!=-1 && 
                       element.getAttribute("class").indexOf("last")==-1){
                        chinaDate.setSaturday(true);
                    }
                    if(element.getAttribute("class").indexOf("last weekend")!=-1){
                        chinaDate.setSunday(true);
                    }
                    if(element.getAttribute("class").indexOf("work")!=-1){
                        chinaDate.setWorkFlag(true);
                    }else if(chinaDate.isSaturday() == false &&
                             chinaDate.isSunday() == false && 
                             chinaDate.isVacation() == false ){
                        chinaDate.setWorkFlag(true);
                    }else{
                        chinaDate.setWorkFlag(false);
                    }
                    
                    dateList.add(chinaDate);
                }
                
                
            }catch(Exception e){
                e.printStackTrace();
                System.out.println("get date from http://hao.360.cn/rili/ error~");
            }finally{
                // webClient is still null if its construction failed
                if(webClient != null){
                    webClient.close();
                }
            }
            return dateList;
        }
        
        
        public ChinaDate getTodayInfo(){
            List<ChinaDate> dateList = this.getCurrentDateInfo();
            DateFormat dateFormat = new SimpleDateFormat("yyyy/MM/dd");
            for(ChinaDate date: dateList){
                if(dateFormat.format(date.getSolarDate()).equals(dateFormat.format(new Date()))){
                    return date;
                }
            }
            return new ChinaDate();
        }
        
    
        public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException, InterruptedException {
    
            List<ChinaDate> dateList = new Main().getCurrentDateInfo();
            ChinaDate today = new Main().getTodayInfo();
            DateFormat dateFormat = new SimpleDateFormat("yyyy/MM/dd");
            
            System.out.println("This month:");
            for(ChinaDate date: dateList){
                System.out.println(dateFormat.format(date.getSolarDate()) + " " + date.getVacationName());
            }
    
            System.out.println("------------------------------------------------------------------------");
            System.out.println("Today:");
            System.out.println("Date: " + today.getSolarDate());
            System.out.println("Lunar: "+today.getLunar());
            System.out.println("Solar: "+today.getSolar());
            System.out.println("Holiday name: "+today.getVacationName());
            System.out.println("Saturday: "+today.isSaturday());
            System.out.println("Sunday: "+today.isSunday());
            System.out.println("Day off: "+today.isVacation());
            System.out.println("Working day: "+today.isWorkFlag());
            
            System.out.println("Most recent past holiday: " + Main.latestVocationName);
        }
    
    }

    Running the program gives the correct result:

    Possible improvements

    When the page fails to load, retry several times.
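    One way to add such retries is a small wrapper around the loading call. This is only a sketch: the Retry class and its parameters are hypothetical, and the task stands in for whatever loads and parses the page.

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper; not part of HtmlUnit.
class Retry {
    // Runs `task` until it succeeds or maxAttempts is reached, pausing
    // pauseMillis between attempts. Rethrows the last failure.
    static <T> T withRetries(Callable<T> task, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }
}

class RetryDemo {
    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in task: fails twice, then succeeds.
        String result = Retry.withRetries(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("page not loaded yet");
            }
            return "loaded";
        }, 5, 100);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```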

    Consider parsing the calendars of several sites, and switch to another site when one of them throws an exception.
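    Switching between sites can be modelled as trying a list of sources in order. In the sketch below each Callable is assumed to wrap the parser for one site; the class name and the stand-in sources are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical fallback helper: each source would wrap the parser for one site.
class SourceFallback {
    // Returns the result of the first source that does not throw.
    // If every source fails, the last exception is rethrown.
    static <T> T firstSuccessful(List<Callable<T>> sources) throws Exception {
        Exception last = null;
        for (Callable<T> source : sources) {
            try {
                return source.call();
            } catch (Exception e) {
                last = e;
            }
        }
        if (last == null) {
            throw new IllegalArgumentException("no sources given");
        }
        throw last;
    }
}

class SourceFallbackDemo {
    public static void main(String[] args) throws Exception {
        List<Callable<String>> sources = new ArrayList<>();
        // Stand-ins for real parsers: the first site is "down", the second answers.
        sources.add(() -> { throw new java.io.IOException("primary site unavailable"); });
        sources.add(() -> "result from backup site");
        System.out.println(SourceFallback.firstSuccessful(sources));
    }
}
```

    Each site needs its own parsing code, since the HTML structure differs from site to site.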

    Consider adding e-mail or SMS notification so that the system administrator is informed of any exception in real time.
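    The notification hook can be kept separate from the delivery channel. The sketch below only shows where the hook would sit; the Consumer stands in for an e-mail or SMS sender, and no particular mail or SMS library is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical wrapper that reports failures of the daily job to a notifier.
class AlertingRunner {
    // Runs the task; on a runtime failure passes a message to `notifier`
    // and returns false instead of propagating the exception.
    static boolean runAndNotify(Runnable task, Consumer<String> notifier) {
        try {
            task.run();
            return true;
        } catch (RuntimeException e) {
            notifier.accept("Holiday check failed: " + e);
            return false;
        }
    }
}

class AlertingRunnerDemo {
    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        // Stand-in notifier that just collects messages.
        boolean ok = AlertingRunner.runAndNotify(
                () -> { throw new IllegalStateException("page not loaded"); },
                sent::add);
        System.out.println("succeeded: " + ok);
        System.out.println("notifications: " + sent);
    }
}
```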

  • Original article: https://www.cnblogs.com/chenpi/p/5161181.html