Java: removing HTML tags or element attributes (regular expressions / jsoup)

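Business scenario:

Suppose a news article written in a rich-text editor needs its first 200 characters shown as a summary on a list page. The HTML tags have to be removed first so that only the actual text is truncated.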

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Removes HTML tags and returns the remaining text.
 */
public static String removeHtmlTag(String htmlStr) {
    // Regex for script blocks (a simpler alternative: <script[^>]*?>[\s\S]*?<\/script>)
    // Backslashes must be doubled inside a Java string literal.
    String regEx_script = "<[\\s]*?script[^>]*?>[\\s\\S]*?<[\\s]*?/[\\s]*?script[\\s]*?>";
    // Regex for style blocks (a simpler alternative: <style[^>]*?>[\s\S]*?<\/style>)
    String regEx_style = "<[\\s]*?style[^>]*?>[\\s\\S]*?<[\\s]*?/[\\s]*?style[\\s]*?>";
    // Regex for any remaining HTML tag
    String regEx_html = "<[^>]+>";
    // Regex for named character entities such as &nbsp;
    String regEx_special = "&[a-zA-Z]{1,10};";

    // 1. Remove script blocks, including their contents
    Pattern p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
    Matcher m_script = p_script.matcher(htmlStr);
    htmlStr = m_script.replaceAll("");
    // 2. Remove style blocks, including their contents
    Pattern p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
    Matcher m_style = p_style.matcher(htmlStr);
    htmlStr = m_style.replaceAll("");
    // 3. Remove all other HTML tags (the text between them is kept)
    Pattern p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
    Matcher m_html = p_html.matcher(htmlStr);
    htmlStr = m_html.replaceAll("");
    // 4. Remove named character entities (they are deleted, not decoded)
    Pattern p_special = Pattern.compile(regEx_special, Pattern.CASE_INSENSITIVE);
    Matcher m_special = p_special.matcher(htmlStr);
    htmlStr = m_special.replaceAll("");

    return htmlStr;
}
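The business scenario above (strip the markup, then keep the first 200 characters) could be sketched roughly as follows. The class name SummaryDemo and the helpers stripHtml and summarize are hypothetical names chosen for this illustration, not part of the original post; the patterns are simplified equivalents of the ones in removeHtmlTag. Truncating by code points rather than by char is an assumption made here so that a character outside the Basic Multilingual Plane is never cut in half.

```java
import java.util.regex.Pattern;

public class SummaryDemo {
    // Patterns are compiled once and reused; they mirror the four steps of removeHtmlTag.
    private static final Pattern SCRIPT = Pattern.compile(
            "<\\s*script[^>]*>[\\s\\S]*?<\\s*/\\s*script\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE = Pattern.compile(
            "<\\s*style[^>]*>[\\s\\S]*?<\\s*/\\s*style\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern ENTITY = Pattern.compile("&[a-zA-Z]{1,10};");

    // Hypothetical helper: removes script/style blocks, tags, and named entities.
    public static String stripHtml(String html) {
        String text = SCRIPT.matcher(html).replaceAll("");
        text = STYLE.matcher(text).replaceAll("");
        text = TAG.matcher(text).replaceAll("");
        return ENTITY.matcher(text).replaceAll("");
    }

    // Hypothetical helper: plain-text summary of at most maxChars code points.
    public static String summarize(String html, int maxChars) {
        String text = stripHtml(html).trim();
        if (text.codePointCount(0, text.length()) <= maxChars) {
            return text;
        }
        // Convert the code-point limit into a char index before cutting.
        int end = text.offsetByCodePoints(0, maxChars);
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        String html = "<style>p{color:red}</style>"
                + "<p>Hello, <b>world</b>!&nbsp;</p>"
                + "<script>alert(1)</script>";
        System.out.println(summarize(html, 200)); // prints "Hello, world!"
    }
}
```

Because entities are deleted rather than decoded, text such as "a&nbsp;b" becomes "ab"; if word boundaries matter for the summary, replacing entities with a space instead of an empty string may be preferable.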

Removing element attributes from HTML with regular expressions
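// Matches an opening tag; group 1 captures the tag name.
// Backslashes must be doubled inside a Java string literal.
private static final String regEx_tag = "<(\\w[^>|\\s]*)[\\s\\S]*?>";

public static String removeEleProp(String htmlStr) {
    Pattern p = Pattern.compile(regEx_tag, Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(htmlStr);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        String tagWithProp = m.group(0);
        String tag = m.group(1);
        // quoteReplacement keeps '$' and '\' in attribute values from being
        // interpreted as group references by appendReplacement.
        if ("img".equalsIgnoreCase(tag)) {
            // Keep the attributes of img tags; this could be refined to drop
            // unneeded attributes and keep only essential ones such as src.
            m.appendReplacement(sb, Matcher.quoteReplacement(tagWithProp));
        } else if ("a".equalsIgnoreCase(tag)) {
            // Keep the attributes of a tags; this could be refined to drop
            // unneeded attributes and keep only essential ones such as href.
            m.appendReplacement(sb, Matcher.quoteReplacement(tagWithProp));
        } else {
            // Any other tag is rewritten without its attributes.
            m.appendReplacement(sb, Matcher.quoteReplacement("<" + tag + ">"));
        }
    }
    m.appendTail(sb);
    return sb.toString();
}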
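The comments in removeEleProp mention that img and a tags could be processed further so that only essential attributes such as src and href survive. One possible way to do that is sketched below. The class name KeepEssentialAttrs and the helpers pickAttr and stripToEssentials are hypothetical, and the sketch assumes attribute values are quoted and contain no ">" character; unquoted values are simply dropped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeepEssentialAttrs {
    // Same opening-tag pattern as regEx_tag; group 1 is the tag name.
    private static final Pattern TAG = Pattern.compile("<(\\w[^>|\\s]*)[\\s\\S]*?>");

    // Hypothetical helper: returns ` name="value"` if the quoted attribute is
    // present in the tag text, otherwise an empty string.
    public static String pickAttr(String tagText, String name) {
        Pattern attr = Pattern.compile(
                "\\s" + Pattern.quote(name) + "\\s*=\\s*(\"[^\"]*\"|'[^']*')",
                Pattern.CASE_INSENSITIVE);
        Matcher m = attr.matcher(tagText);
        return m.find() ? " " + name + "=" + m.group(1) : "";
    }

    // Hypothetical helper: strips all attributes except src on img and href on a.
    public static String stripToEssentials(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String tag = m.group(1);
            String kept = "";
            if ("img".equalsIgnoreCase(tag)) {
                kept = pickAttr(m.group(0), "src");
            } else if ("a".equalsIgnoreCase(tag)) {
                kept = pickAttr(m.group(0), "href");
            }
            // quoteReplacement protects '$' and '\' inside attribute values.
            m.appendReplacement(sb, Matcher.quoteReplacement("<" + tag + kept + ">"));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p class=\"x\"><a href=\"/n/1\" target=\"_blank\" onclick=\"t()\">News</a> "
                + "<img src=\"a.png\" width=\"10\" alt=\"pic\"></p>";
        // prints: <p><a href="/n/1">News</a> <img src="a.png"></p>
        System.out.println(stripToEssentials(html));
    }
}
```

Closing tags such as </a> are left untouched because the pattern requires a word character directly after "<".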

For usage, see:
jsoup Cookbook (Chinese edition)
jsoup official site (English)
————————————————
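Copyright notice: this is an original article by CSDN blogger "fukaiit", published under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/fukaiit/article/details/84262471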

Source address: https://www.cnblogs.com/ww25/p/11397887.html
