  • Parsing web page content online with HtmlUnit

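    Preface

    I recently ran into a problem at work: a back-end scheduled task needs to determine in Java, every day, whether that day is a statutory holiday, a weekend day off, a working day, and so on.

    Working out China's statutory holiday schedule purely by logic is essentially impossible, because the arrangement is set by the government and can differ from year to year.

    So other means are needed. The more reliable ones I could think of are:

    1. Web APIs: some data providers offer them, but they are either paid or rate-limited, among other issues, so they are far from ideal and offer little control. I have not tried them myself. Examples: https://www.juhe.cn/docs/api/id/177/aid/601 or http://apistore.baidu.com/apiworks/servicedetail/1116.html
    2. Parsing a web page online to obtain the holiday schedule: this depends heavily on the page being parsed, so pick a reasonably reliable site.
    3. Entering the officially announced holiday schedule into the system each year: if the customer does not mind the manual work, this is fairly reliable.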
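    For comparison, option 3 can be sketched with nothing but the standard library. The class name HolidayTable, its methods, and the sample entries below are assumptions made for illustration; they are not part of the demo that follows.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of option 3: a hand-maintained table of official days off. */
class HolidayTable {

    enum DayType { WORKDAY, WEEKEND, HOLIDAY }

    // Officially announced days off, mapped to the holiday they belong to.
    private final Map<LocalDate, String> holidays = new HashMap<>();
    // Weekend dates officially declared working days to make up for a holiday.
    private final Set<LocalDate> makeUpWorkdays = new HashSet<>();

    void addHoliday(LocalDate date, String name) {
        holidays.put(date, name);
    }

    void addMakeUpWorkday(LocalDate date) {
        makeUpWorkdays.add(date);
    }

    /** Table entries take precedence; otherwise the plain weekday/weekend rule applies. */
    DayType classify(LocalDate date) {
        if (holidays.containsKey(date)) {
            return DayType.HOLIDAY;
        }
        if (makeUpWorkdays.contains(date)) {
            return DayType.WORKDAY;
        }
        DayOfWeek dow = date.getDayOfWeek();
        boolean weekend = dow == DayOfWeek.SATURDAY || dow == DayOfWeek.SUNDAY;
        return weekend ? DayType.WEEKEND : DayType.WORKDAY;
    }

    public static void main(String[] args) {
        HolidayTable table = new HolidayTable();
        // Sample entries for illustration; real entries come from each year's official notice.
        table.addHoliday(LocalDate.of(2016, 2, 8), "春节");
        table.addMakeUpWorkday(LocalDate.of(2016, 2, 6)); // a Saturday
        System.out.println(table.classify(LocalDate.of(2016, 2, 8)));  // HOLIDAY
        System.out.println(table.classify(LocalDate.of(2016, 2, 6)));  // WORKDAY
        System.out.println(table.classify(LocalDate.of(2016, 2, 20))); // WEEKEND
    }
}
```

    The table only has to be updated once a year, which is the manual work this option trades for reliability.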

    This demo implements the second option.

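    Parsing a web page online with HtmlUnit to get the holiday schedule

    I started out parsing the page with jsoup, but the results were poor: jsoup does not execute JavaScript, so it ran into all sorts of problems when the page was generated dynamically. I therefore switched to HtmlUnit. Overall, HtmlUnit is quite powerful: it can simulate a running browser and is described as an open-source browser implementation in Java.

    First, download the jar files from the official site and read the documentation:

    http://htmlunit.sourceforge.net/

    The page parsed here is 360's perpetual calendar:

    http://hao.360.cn/rili/

    The calendar page looks like this:

    [Figure: screenshot of the calendar page]

    The HTML being parsed has the following format:

    [Figure: screenshot of the calendar grid's HTML. As the parsing code relies on, each day is an li element inside the element with id M-dates; it carries a date attribute (yyyy/MM/dd) and class names such as vacation, weekend, last, and work, and contains a div with class solar and a span with class lunar.]

    Steps:

    1. Load the page.

    2. Poll until the page has finished loading (some of its content may be generated dynamically by JavaScript).

    3. Parse the HTML according to the page's structure, and store the key information in a wrapper object.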
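    The polling in step 2 can be factored into a small helper. This is a minimal sketch using only the standard library; the class PollUntil and the method waitFor are hypothetical names, and the condition in main is merely a stand-in for "the calendar container has non-empty text".

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper: re-check a condition a bounded number of times. */
class PollUntil {

    /**
     * Checks the condition up to maxAttempts times, sleeping sleepMillis after
     * each failed check. Returns true as soon as the condition holds, and false
     * if the attempts run out or the thread is interrupted.
     */
    static boolean waitFor(BooleanSupplier condition, int maxAttempts, long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true roughly 30 ms after start.
        boolean ready = waitFor(() -> System.currentTimeMillis() - start >= 30, 60, 10);
        System.out.println("ready = " + ready);
    }
}
```

    Returning a boolean lets the caller tell "page ready" from "gave up", which a bare loop with a break does not make explicit.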

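    Notes:

    1. The hard part is deciding whether a day is a day off and which holiday it belongs to. The page does not label each day with its holiday type, so this logic has to be implemented by hand; see the code for details.

    2. The static variable latestVocationName exists to guard against the following situation (which is very unlikely; note that the method must be called once a day for the variable to take effect):

    [Figure: screenshot of the situation]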
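    The idea behind note 1 can be tried out on plain data, without HtmlUnit: find the run of consecutive days off that contains the target date, then look for a cell in that run whose label names a holiday. The sketch below is a simplified variant of that idea rather than the exact code of this demo; the class HolidayRunScan, the Day type, and the sample cells are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified version of the holiday-name lookup, working on plain data. */
class HolidayRunScan {

    /** One calendar cell: date string, whether it is marked as a day off, and its label text. */
    static final class Day {
        final String date;
        final boolean vacation;
        final String label;

        Day(String date, boolean vacation, String label) {
            this.date = date;
            this.vacation = vacation;
            this.label = label;
        }
    }

    // Cell labels that identify a holiday, mapped to the name to report.
    private static final Map<String, String> NAMES = new HashMap<>();
    static {
        NAMES.put("元旦", "元旦");
        NAMES.put("除夕", "春节");
        NAMES.put("春节", "春节");
        NAMES.put("清明", "清明");
        NAMES.put("国际劳动节", "国际劳动节");
        NAMES.put("端午节", "端午节");
        NAMES.put("中秋节", "中秋节");
        NAMES.put("国庆节", "国庆节");
    }

    /** Returns the holiday name for the run of days off containing date, or "" if unknown. */
    static String holidayNameFor(List<Day> days, String date) {
        int i = 0;
        while (i < days.size()) {
            if (!days.get(i).vacation) {
                i++;
                continue;
            }
            // days.get(i) starts a run of consecutive days off; scan to the end of the run.
            boolean containsTarget = false;
            String name = "";
            while (i < days.size() && days.get(i).vacation) {
                Day d = days.get(i);
                if (d.date.equals(date)) {
                    containsTarget = true;
                }
                String mapped = NAMES.get(d.label);
                if (mapped != null) {
                    name = mapped;
                }
                i++;
            }
            if (containsTarget) {
                return name; // may be "" when no cell in the run names a holiday
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // Sample cells for illustration only.
        List<Day> days = Arrays.asList(
                new Day("2016/02/07", true, "除夕"),
                new Day("2016/02/08", true, "春节"),
                new Day("2016/02/09", true, "初二"),
                new Day("2016/02/10", true, "初三"));
        System.out.println(holidayNameFor(days, "2016/02/10")); // 春节
    }
}
```

    An empty result is the case in which a fallback such as latestVocationName comes into play.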

    Implementation:

    Define a class representing a Chinese calendar date:

    package com.pichen.tools.getDate;
    
    import java.util.Date;
    
    
    public class ChinaDate {
    
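        /**
         * Gregorian (solar) date
         */
        private Date solarDate;
        
        /**
         * Day in the lunar calendar
         */
        private String lunar;
        
        /**
         * Day in the Gregorian calendar
         */
        private String solar;
    
        
        /**
         * Whether the day is marked "休" (day off)
         */
        private boolean isVacation = false;
        /**
         * Holiday name when the day is a day off; defaults to "非假期" (not a holiday)
         */
        private String VacationName = "非假期";
        /**
         * Whether the day is marked "班" (working day)
         */
        private boolean isWorkFlag = false;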
        
        private boolean isSaturday = false;
        private boolean isSunday = false;
        /**
         * @return the solarDate
         */
        public Date getSolarDate() {
            return solarDate;
        }
        /**
         * @param solarDate the solarDate to set
         */
        public void setSolarDate(Date solarDate) {
            this.solarDate = solarDate;
        }
        /**
         * @return the lunar
         */
        public String getLunar() {
            return lunar;
        }
        /**
         * @param lunar the lunar to set
         */
        public void setLunar(String lunar) {
            this.lunar = lunar;
        }
        /**
         * @return the solar
         */
        public String getSolar() {
            return solar;
        }
        /**
         * @param solar the solar to set
         */
        public void setSolar(String solar) {
            this.solar = solar;
        }
    
        /**
         * @return the isVacation
         */
        public boolean isVacation() {
            return isVacation;
        }
        /**
         * @param isVacation the isVacation to set
         */
        public void setVacation(boolean isVacation) {
            this.isVacation = isVacation;
        }
        /**
         * @return the vacationName
         */
        public String getVacationName() {
            return VacationName;
        }
        /**
         * @param vacationName the vacationName to set
         */
        public void setVacationName(String vacationName) {
            VacationName = vacationName;
        }
        /**
         * @return the isWorkFlag
         */
        public boolean isWorkFlag() {
            return isWorkFlag;
        }
        /**
         * @param isWorkFlag the isWorkFlag to set
         */
        public void setWorkFlag(boolean isWorkFlag) {
            this.isWorkFlag = isWorkFlag;
        }
        /**
         * @return the isSaturday
         */
        public boolean isSaturday() {
            return isSaturday;
        }
        /**
         * @param isSaturday the isSaturday to set
         */
        public void setSaturday(boolean isSaturday) {
            this.isSaturday = isSaturday;
        }
        /**
         * @return the isSunday
         */
        public boolean isSunday() {
            return isSunday;
        }
        /**
         * @param isSunday the isSunday to set
         */
        public void setSunday(boolean isSunday) {
            this.isSunday = isSunday;
        }
        
    
    }

    Parse the page and run the demo, printing the details for the current month and for today:

    package com.pichen.tools.getDate;
    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.text.DateFormat;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;
    
    import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.DomNodeList;
    import com.gargoylesoftware.htmlunit.html.HtmlElement;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    
    
    public class Main {
        
    
        private static String latestVocationName="";
        
        public String getVocationName(DomNodeList<HtmlElement> htmlElements, String date) throws ParseException{
            String rst = "";
            
            boolean pastTimeFlag = false;
            DateFormat dateFormat = new SimpleDateFormat("yyyy/MM/dd");
            Date paramDate = dateFormat.parse(date);
            if(new Date().getTime() >= paramDate.getTime()){
                pastTimeFlag = true;
            }
            
            // Step 1: try to determine the holiday name from the HTML page
            for(int i = 0; i < htmlElements.size(); i++){
                HtmlElement element = htmlElements.get(i);
                if(element.getAttribute("class").indexOf("vacation")!=-1){
                    
                    boolean hitFlag = false;
                    String voationName = "";
                    for(; i < htmlElements.size(); i++){
                        HtmlElement elementTmp = htmlElements.get(i);
                        String liDate = elementTmp.getAttribute("date");
                        
                        List<HtmlElement> lunar = elementTmp.getElementsByAttribute("span", "class", "lunar");
                        String lanarText = lunar.get(0).asText();
                        
                        if(lanarText.equals("元旦")){
                            voationName = "元旦";
                        }else if(lanarText.equals("除夕")||lanarText.equals("春节")){
                            voationName = "春节";
                        }else if(lanarText.equals("清明")){
                            voationName = "清明";
                        }else if(lanarText.equals("国际劳动节")){
                            voationName = "国际劳动节";
                        }else if(lanarText.equals("端午节")){
                            voationName = "端午节";
                        }else if(lanarText.equals("中秋节")){
                            voationName = "中秋节";
                        }else if(lanarText.equals("国庆节")){
                            voationName = "国庆节";
                        }
                        
                        
                        if(liDate.equals(date)){
                            hitFlag = true;
                        }
                        
                        if(elementTmp.getAttribute("class").indexOf("vacation")==-1){
                            break;
                        }
                    }
                    
                    
                    if(hitFlag == true && !voationName.equals("")){
                        rst = voationName;
                        break;
                    }
                    
                    
                }else{
                    continue;
                }
            }
            
            
            
            // If step 1 fails (rare), fall back to the most recent holiday name
            if(rst.equals("")){
                System.out.println("warning: failed to get the holiday name from the HTML page.");
    
                // Simple rules could also be applied here
                
                // Fall back to the most recent holiday name
                rst = Main.latestVocationName;
            }else if(pastTimeFlag == true){
                // Remember the most recent holiday name that is visible on the page and not in the future
                Main.latestVocationName = rst;
            }
            return rst;
        }
        
        
        public List<ChinaDate> getCurrentDateInfo(){
            WebClient webClient = null;
            List<ChinaDate> dateList = null;
            
            try{
                DateFormat dateFormat = new SimpleDateFormat("yyyy/MM/dd");
                dateList = new ArrayList<ChinaDate>();
    
                webClient = new WebClient();
                HtmlPage page = webClient.getPage("http://hao.360.cn/rili/");
                
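                // Wait up to 60 seconds
                for(int k = 0; k < 60; k++){
                    if(!page.getElementById("M-dates").asText().equals("")) break;
                    Thread.sleep(1000);
                }
                
                // Earlier approach: sleep a fixed 8 seconds for the page to finish loading; unreliable, since the page sometimes still could not be retrieved
                //Thread.sleep(8000);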
    
                DomNodeList<HtmlElement> htmlElements = page.getElementById("M-dates").getElementsByTagName("li");
                //System.out.println(htmlElements.size());
                
                
                for(HtmlElement element : htmlElements){
                    ChinaDate chinaDate = new ChinaDate();
                    
                    List<HtmlElement> lunar = element.getElementsByAttribute("span", "class", "lunar");
                    List<HtmlElement> solar = element.getElementsByAttribute("div", "class", "solar");
    
                    chinaDate.setLunar(lunar.get(0).asText());
                    chinaDate.setSolar(solar.get(0).asText());
                    chinaDate.setSolarDate(dateFormat.parse(element.getAttribute("date")));
                    
    
                    if(element.getAttribute("class").indexOf("vacation")!=-1){
                        chinaDate.setVacation(true);
                        chinaDate.setVacationName(this.getVocationName(htmlElements, element.getAttribute("date")));
                        
                        
    
                        
                    }
                    
                    if(element.getAttribute("class").indexOf("weekend")!=-1 && 
                       element.getAttribute("class").indexOf("last")==-1){
                        chinaDate.setSaturday(true);
                    }
                    if(element.getAttribute("class").indexOf("last weekend")!=-1){
                        chinaDate.setSunday(true);
                    }
                    if(element.getAttribute("class").indexOf("work")!=-1){
                        chinaDate.setWorkFlag(true);
                    }else if(chinaDate.isSaturday() == false &&
                             chinaDate.isSunday() == false && 
                             chinaDate.isVacation() == false ){
                        chinaDate.setWorkFlag(true);
                    }else{
                        chinaDate.setWorkFlag(false);
                    }
                    
                    dateList.add(chinaDate);
                }
                
                
            }catch(Exception e){
                e.printStackTrace();
                System.out.println("get date from http://hao.360.cn/rili/ error~");
            }finally{
                // webClient is still null if its constructor failed
                if(webClient != null){
                    webClient.close();
                }
            }
            return dateList;
        }
        
        
        public ChinaDate getTodayInfo(){
            List<ChinaDate> dateList = this.getCurrentDateInfo();
            DateFormat dateFormat = new SimpleDateFormat("yyyy/MM/dd");
            for(ChinaDate date: dateList){
                if(dateFormat.format(date.getSolarDate()).equals(dateFormat.format(new Date()))){
                    return date;
                }
            }
            return new ChinaDate();
        }
        
    
        public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException, InterruptedException {
    
            List<ChinaDate> dateList = new Main().getCurrentDateInfo();
            ChinaDate today = new Main().getTodayInfo();
            DateFormat dateFormat = new SimpleDateFormat("yyyy/MM/dd");
            
            System.out.println("Details for this month:");
            for(ChinaDate date: dateList){
                System.out.println(dateFormat.format(date.getSolarDate()) + " " + date.getVacationName());
            }
    
            System.out.println("------------------------------------------------------------------------");
            System.out.println("Details for today:");
            System.out.println("Date: " + today.getSolarDate());
            System.out.println("Lunar: "+today.getLunar());
            System.out.println("Gregorian: "+today.getSolar());
            System.out.println("Holiday name: "+today.getVacationName());
            System.out.println("Is Saturday: "+today.isSaturday());
            System.out.println("Is Sunday: "+today.isSunday());
            System.out.println("Is day off: "+today.isVacation());
            System.out.println("Is working day: "+today.isWorkFlag());
            
            System.out.println("Most recent past holiday: " + Main.latestVocationName);
        }
    
    }

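    Running the program produces the correct result:

    [Figure: screenshot of the program output]

    Possible improvements

    When the page fails to load, retry several times.

    Consider parsing calendars from several sites; when one of them throws an exception, switch to another.

    Consider adding email or SMS notifications, so that the system administrator is informed of any exception in real time.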
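    The first two suggestions, retrying and switching sites, could be combined along the following lines. This is a sketch under assumptions: RetryWithFallback and fetch are hypothetical names, and each Callable stands in for "load and parse one calendar site".

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical helper: retry each source a few times, then fall back to the next one. */
class RetryWithFallback {

    /**
     * Tries the sources in order, giving each up to attemptsPerSource attempts.
     * Returns the first successful result; throws IllegalStateException if all fail.
     */
    static <T> T fetch(List<Callable<T>> sources, int attemptsPerSource) {
        Exception last = null;
        for (Callable<T> source : sources) {
            for (int attempt = 1; attempt <= attemptsPerSource; attempt++) {
                try {
                    return source.call();
                } catch (Exception e) {
                    last = e; // remember the failure, then retry or move to the next source
                }
            }
        }
        throw new IllegalStateException("all sources failed", last);
    }

    public static void main(String[] args) {
        List<Callable<String>> sources = new ArrayList<>();
        // Stand-ins for real site parsers: the first always fails, the second succeeds.
        sources.add(() -> { throw new IOException("primary site unreachable"); });
        sources.add(() -> "calendar data from backup site");
        System.out.println(fetch(sources, 3)); // calendar data from backup site
    }
}
```

    The final IllegalStateException is a natural place to hook in the notification mentioned in the third suggestion.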

  • Original article: https://www.cnblogs.com/chenpi/p/5161181.html