  • strip invalid xml characters

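    Today a colleague ran into an XML document containing a special non-printable character, which made XML parsing fail. His IE7 reported a parse error, and so did my FF3, but my IE6 displayed the document normally and only showed a warning in the status bar.
    So I looked for related material online and found that the W3C specification does not allow these special characters.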

    For XML, we usually escape only the following characters:
    "<"      "&lt;" 
    ">"      "&gt;"
    "\""     "&quot;" 
    "\'"     "&apos;" 
    "&"      "&amp;"
    In fact, these characters are allowed in node text when they are wrapped in <![CDATA[]]>.

    Assuming your ASP is not trying to add any non-printable characters to the XML, it usually suffices to filter and replace characters as follows:
        For any text node child of an element:
          "<"  becomes  "&lt;"
          "&"  becomes  "&amp;"

        For any attribute value:
          "<"  becomes  "&lt;"
          "&"  becomes  "&amp;"
          '"'  becomes  '&quot;' (if you are using quote(") to delimit the attribute value)
          "'"  becomes  "&apos;" (if you are using apostrophe(') to delimit the attribute value)

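    The replacement rules above can be sketched in C# roughly as follows. This is only a minimal illustration: the class and method names (XmlEscapeSketch, EscapeText, EscapeAttribute) are hypothetical and do not come from the quoted text, and the input is assumed to contain no non-printable characters.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class XmlEscapeSketch
{
    // Escapes a string that will become the text content of an element.
    public static string EscapeText(string value)
    {
        // "&" is replaced first so the entities added afterwards are not escaped again.
        return value.Replace("&", "&amp;").Replace("<", "&lt;");
    }

    // Escapes an attribute value; delimiter is the quote character (' or ")
    // that will surround the value in the document.
    public static string EscapeAttribute(string value, char delimiter)
    {
        string escaped = EscapeText(value);
        if (delimiter == '"') return escaped.Replace("\"", "&quot;");
        if (delimiter == '\'') return escaped.Replace("'", "&apos;");
        throw new ArgumentException("delimiter must be ' or \"", "delimiter");
    }
}
```

    For example, EscapeAttribute("a<b \"c\"", '"') escapes the "<" and both quotes, while the same call with '\'' as the delimiter leaves the double quotes untouched.
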
    However, the W3C standard permits only the following characters in an XML document:
    http://www.w3.org/TR/2004/REC-xml-20040204/#charsets
    http://www.ivoa.net/forum/apps-samp/0808/0197.htm

    XML processors MUST accept any character in the range specified for Char.

    Character Range
    Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
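
    As a rough illustration, the Char production above translates directly into a range check on a Unicode code point. The class and method names here (XmlCharSketch, IsXml10Char) are made up for this sketch.

```csharp
// Hypothetical helper, for illustration only.
public static class XmlCharSketch
{
    // Returns true when codePoint falls inside the Char production quoted above:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static bool IsXml10Char(int codePoint)
    {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

    Note that the argument is a code point, not a UTF-16 char: a character above #xFFFF is stored in a .NET string as two surrogate code units, so it has to be combined (for example with char.ConvertToUtf32) before it is tested.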

    Characters such as the following (given in hexadecimal) are not allowed in XML; even placing them inside <![CDATA[]]> does not help.
    \x00-\x08
    \x0b-\x0c
    \x0e-\x1f

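    According to the Character Range definition, besides the three ranges above there are other characters that cannot be used in XML, such as #xD800-#xDFFF (the surrogate blocks). I do not know what these characters look like, and ordinary applications rarely produce them, so I do not exclude them for now; add the exclusion yourself if you need it.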

    Simple handling in C#:
    string content = "slei20sk<O?`";
    content = Regex.Replace(content, "[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", "*");
    Response.Write(content);
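
    If the surrogate range and #xFFFE/#xFFFF also need to be handled, one possible approach is to walk the string instead of using a regular expression. The sketch below is an assumption-laden illustration, not part of the original post: the names XmlSanitizerSketch and StripInvalidXml10Chars are hypothetical, and it removes offending characters rather than replacing them with "*".

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class XmlSanitizerSketch
{
    // Removes every UTF-16 code unit that cannot appear in an XML 1.0 document.
    public static string StripInvalidXml10Chars(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];

            // A high surrogate followed by a low surrogate encodes one code point
            // in #x10000-#x10FFFF, which the Char production allows: keep both units.
            if (char.IsHighSurrogate(c) && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                sb.Append(c).Append(input[i + 1]);
                i++;
                continue;
            }

            // Lone surrogates, control characters, #xFFFE and #xFFFF fall
            // outside these ranges and are dropped.
            bool allowed = c == '\t' || c == '\n' || c == '\r'
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
            if (allowed) sb.Append(c);
        }
        return sb.ToString();
    }
}
```

    Calling Response.Write(XmlSanitizerSketch.StripInvalidXml10Chars(content)) would then take the place of the Regex.Replace call above.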


    Sample code from the web (it distinguishes XML 1.0 from XML 1.1; note that the two versions differ):
    http://balajiramesh.wordpress.com/2008/05/30/strip-illegal-xml-characters-based-on-w3c-standard/
    W3C has defined a set of illegal characters for use in XML. You can find info about the same here:
    XML 1.0(http://www.w3.org/TR/2006/REC-xml-20060816/#charsets) | XML 1.1(http://www.w3.org/TR/xml11/#charsets)

    Here is a function to remove these characters from a specified XML file:

    using System;
    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;

    namespace XMLUtils
    {
        class Standards
        {
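            /// <summary>
            /// Removes hexadecimal code point tokens (text of the form "#x7F") that name
            /// characters the referenced specification sections discourage or restrict.
            /// Note: the pattern matches the literal text "#x..", not the raw characters,
            /// so stripping "#x7F" from "&#x7F;" leaves "&;" behind.
            /// Refer to http://www.w3.org/TR/xml11/#charsets for XML 1.1
            /// Refer to http://www.w3.org/TR/2006/REC-xml-20060816/#charsets for XML 1.0
            /// </summary>
            /// <param name="filePath">Full path to the File</param>
            /// <param name="XMLVersion">XML Specification to use. Can be 1.0 or 1.1</param>
            public static void StripIllegalXMLChars(string filePath, string XMLVersion)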
            {
                //Remove illegal character sequences
                string tmpContents = File.ReadAllText(filePath, Encoding.UTF8);

                string pattern = String.Empty;
                switch (XMLVersion)
                {
                    case "1.0":
                        // 8[0-46-9A-F] and 9[0-9A-F] are separate alternatives (#x80-#x84, #x86-#x9F).
                        pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]|9[0-9A-F])";
                        break;
                    case "1.1":
                        pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF])";
                        break;
                    default:
                        throw new Exception("Error: Invalid XML Version!");
                }

                Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
                if (regex.IsMatch(tmpContents))
                {
                    tmpContents = regex.Replace(tmpContents, String.Empty);
                    File.WriteAllText(filePath, tmpContents, Encoding.UTF8);
                }
                tmpContents = string.Empty;
            }
        }
    }

    A similar check from MSDN:
    http://msdn.microsoft.com/en-us/library/k1y7hyy9(vs.71).aspx
    internal void CheckUnicodeString(String value)
    {
        for (int i = 0; i < value.Length; ++i)
        {
            // #xFFFE and #xFFFF are not valid XML characters.
            if (value[i] > 0xFFFD)
            {
                throw new Exception("Invalid Unicode");
            }
            // Below #x20 only tab, line feed and carriage return are allowed.
            else if (value[i] < 0x20 && value[i] != '\t' && value[i] != '\n' && value[i] != '\r')
            {
                throw new Exception("Invalid Xml Characters");
            }
        }
    }

    Appendix: ASCII tables:
    http://www.asciitable.com
    http://code.cside.com/3rdpage/us/unicode/converter.html

  • Original article: https://www.cnblogs.com/net205/p/1414607.html