  • Two ways to strip all HTML tags in Java: an open-source jar or your own regular expressions

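    There are essentially two ways to strip all tags from HTML in Java: use an open-source jar, or write your own regular expressions. A hand-written solution may not fully handle some custom tags. In enterprise applications the usual rule is to use an open-source library whenever one exists and to write your own only as a last resort.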

    1. The open-source option. The one I have found so far is the Jsoup package:

        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;

        public static String getTextFromTHML(String htmlStr) {
            // Parse the markup and let Jsoup extract the text content
            Document doc = Jsoup.parse(htmlStr);
            String text = doc.text();
            // Turn every whitespace-like character (including non-breaking
            // spaces) into a plain space
            StringBuilder builder = new StringBuilder(text);
            for (int index = 0; index < builder.length(); index++) {
                char tmp = builder.charAt(index);
                if (Character.isSpaceChar(tmp) || Character.isWhitespace(tmp)) {
                    builder.setCharAt(index, ' ');
                }
            }
            // Collapse runs of spaces and trim both ends
            return builder.toString().replaceAll(" +", " ").trim();
        }
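
    The loop above checks both `Character.isSpaceChar` and `Character.isWhitespace`, and it may not be obvious why one is not enough. The sketch below uses only the JDK (no Jsoup) to show the difference on a non-breaking space (U+00A0, the character that `&nbsp;` stands for). The class name `WhitespaceNormalizer` and the method `normalize` are illustrative names that do not appear in the original post.

```java
final class WhitespaceNormalizer {

    // Hypothetical helper: same idea as the loop above, applied to plain text.
    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean lastWasSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // isSpaceChar covers Unicode space separators such as U+00A0;
            // isWhitespace covers tab, newline and similar control characters.
            boolean space = Character.isSpaceChar(c) || Character.isWhitespace(c);
            if (space) {
                if (!lastWasSpace) {
                    out.append(' ');   // emit one space per run
                }
            } else {
                out.append(c);
            }
            lastWasSpace = space;
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String sample = "a\u00A0\u00A0b\t\nc";
        System.out.println(Character.isWhitespace('\u00A0')); // false
        System.out.println(Character.isSpaceChar('\u00A0'));  // true
        System.out.println(normalize(sample));                // a b c
        // The regex class \s does not match U+00A0 by default,
        // so the two non-breaking spaces survive this call.
        System.out.println(sample.replaceAll("\\s+", " "));
    }
}
```

    This variant collapses runs while scanning instead of calling `replaceAll(" +", " ")` afterwards; the result should be the same as with the original two-step approach.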

    2. Writing it yourself. A quick search on Baidu turns up plenty of versions; this one is simply borrowed from there:

        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public static String removeTag(String htmlStr) {
            // Note the doubled backslashes: these are Java string literals
            String regEx_script = "<script[^>]*?>[\\s\\S]*?</script>"; // script blocks
            String regEx_style = "<style[^>]*?>[\\s\\S]*?</style>"; // style blocks
            String regEx_html = "<[^>]+>"; // any remaining HTML tag
            String regEx_space = "\\s+|\t|\r|\n"; // whitespace, tabs, line breaks

            // Remove <script>...</script> together with its content
            Pattern p_script = Pattern.compile(regEx_script,
                    Pattern.CASE_INSENSITIVE);
            Matcher m_script = p_script.matcher(htmlStr);
            htmlStr = m_script.replaceAll("");

            // Remove <style>...</style> together with its content
            Pattern p_style = Pattern
                    .compile(regEx_style, Pattern.CASE_INSENSITIVE);
            Matcher m_style = p_style.matcher(htmlStr);
            htmlStr = m_style.replaceAll("");

            // Remove all other tags, keeping the text between them
            Pattern p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
            Matcher m_html = p_html.matcher(htmlStr);
            htmlStr = m_html.replaceAll("");

            // Replace each run of whitespace with a single space
            Pattern p_space = Pattern
                    .compile(regEx_space, Pattern.CASE_INSENSITIVE);
            Matcher m_space = p_space.matcher(htmlStr);
            htmlStr = m_space.replaceAll(" ");

            return htmlStr;
        }
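
    To make the limits of the regex approach concrete, here is a minimal, self-contained sketch of the same steps with precompiled patterns, followed by inputs it handles well and inputs it does not. The class name `RegexTagStripper` and the method `strip` are hypothetical names, and the final `trim()` is an addition that the borrowed code above does not have.

```java
import java.util.regex.Pattern;

final class RegexTagStripper {

    // Compiled once and reused; the patterns mirror the ones used above.
    private static final Pattern SCRIPT =
            Pattern.compile("<script[^>]*?>[\\s\\S]*?</script>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE =
            Pattern.compile("<style[^>]*?>[\\s\\S]*?</style>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern SPACE = Pattern.compile("\\s+");

    // Hypothetical helper: strip script/style blocks, then tags, then extra whitespace.
    static String strip(String html) {
        String s = SCRIPT.matcher(html).replaceAll("");
        s = STYLE.matcher(s).replaceAll("");
        s = TAG.matcher(s).replaceAll("");
        return SPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Works as expected: prints "Hello world"
        System.out.println(strip("<p>Hello <b>world</b></p><script>var x = 1;</script>"));
        // Entities are not decoded: prints "Tom &amp; Jerry"
        System.out.println(strip("<p>Tom &amp; Jerry</p>"));
        // A '>' inside an attribute value ends the match too early: prints b">link
        System.out.println(strip("<a title=\"a > b\">link</a>"));
    }
}
```

    The last two cases are the kind of input where an HTML parser such as Jsoup is usually the safer choice, since it decodes entities and understands quoted attribute values.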

     

  • Original article: https://www.cnblogs.com/wytings/p/4580065.html