  • Extracting the main text of a news article

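    I looked at quite a few other people's algorithms, but they seemed too complicated, so I wrote my own. It appears to work reasonably well, although a fair amount of noise still ends up in the output.

    I have not measured the success rate yet; I will verify it later.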

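        // Needs: import org.jsoup.nodes.Document;
        public static String extractContent(String url) {
            // JsoupUitl is the author's own helper class (not shown here);
            // it is expected to fetch the page and return a jsoup Document.
            Document document = JsoupUitl.readUrl(url);
            // Note: lower-casing the whole page also lower-cases the body text.
            String orderHtml = document.toString().toLowerCase();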
            orderHtml = orderHtml.replaceAll("(?is)<!DOCTYPE.*?>", "");
            orderHtml = orderHtml.replaceAll("(?is)<!--.*?-->", ""); // remove HTML comments
            orderHtml = orderHtml.replaceAll("(?is)<script.*?>.*?</script>", ""); // remove scripts
            orderHtml = orderHtml.replaceAll("(?is)<style.*?>.*?</style>", ""); // remove stylesheets
            orderHtml = orderHtml.replaceAll("(?is)<a.*?>.*?</a>", ""); // remove links together with their text
            orderHtml = orderHtml.replaceAll("&.{2,5};|&#.{2,5};", ""); // remove character entities such as &nbsp;
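            // Strip every remaining tag except td, tr, img, br and p.
            // ("\/" is not a legal escape in a Java string literal, so a plain "/" is used.)
            orderHtml = orderHtml.replaceAll("<(?!/?(td|tr|img|br|p)).*?>", "");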
            String[] eleList = orderHtml.split("\n");
            StringBuilder sb = new StringBuilder();
            for (String line : eleList) {
                // Keep only lines longer than 20 characters, skipping empty paragraphs.
                if (line.trim().length() > 20 && !line.contains("></p>")) {
                    sb.append(line);
                }
            }
    
            orderHtml = sb.toString();
            // System.out.println("=====================================");
            // System.out.println(Jsoup.parse(orderHtml));
            return orderHtml;
    
        }
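
    To try the filtering steps without fetching a page, the same regex chain can be run on an HTML string that is already in memory. The following is only a minimal sketch under that assumption: the class name ContentFilterSketch, the method filter and the sample HTML are made up for illustration and are not part of the original code, and jsoup is left out entirely.

```java
class ContentFilterSketch {

    // Hypothetical helper: the same steps as extractContent above,
    // applied to an in-memory HTML string instead of a fetched page.
    public static String filter(String html) {
        String s = html.toLowerCase();
        s = s.replaceAll("(?is)<!doctype.*?>", "");               // doctype
        s = s.replaceAll("(?is)<!--.*?-->", "");                  // HTML comments
        s = s.replaceAll("(?is)<script.*?>.*?</script>", "");     // scripts
        s = s.replaceAll("(?is)<style.*?>.*?</style>", "");       // stylesheets
        s = s.replaceAll("(?is)<a.*?>.*?</a>", "");               // links and their text
        s = s.replaceAll("&.{2,5};|&#.{2,5};", "");               // character entities
        s = s.replaceAll("<(?!/?(td|tr|img|br|p)).*?>", "");      // all other tags

        StringBuilder sb = new StringBuilder();
        for (String line : s.split("\n")) {
            // Keep only lines longer than 20 characters, skipping empty paragraphs.
            if (line.trim().length() > 20 && !line.contains("></p>")) {
                sb.append(line);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample page: one navigation link, one long and one short paragraph.
        String html = String.join("\n",
                "<!DOCTYPE html>",
                "<html><head><title>Demo</title>",
                "<style>p { color: red; }</style>",
                "<script>var x = 1;</script>",
                "</head><body>",
                "<!-- navigation -->",
                "<div><a href=\"/home\">Home</a></div>",
                "<p>This paragraph is long enough to be kept as body text.</p>",
                "<p>Short</p>",
                "</body></html>");
        System.out.println(filter(html));
    }
}
```

    On this sample only the long paragraph survives, and it comes back lower-cased because the whole page is lower-cased first. The link pattern <a.*?> also matches any other tag whose name starts with "a", which is probably one source of the remaining noise.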

    Tested on an example; the results looked fairly good.

  • Original article: https://www.cnblogs.com/tomcattd/p/3511461.html