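Storing illegal characters in XML has always been a nuisance. "Illegal" here covers more than non-printable characters: it also includes the characters that XML reserves for markup, such as <, & and >.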
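A convenient way to handle them is to store the offending strings in hex or Base64-encoded form. The two functions below show how to use Base64 encoding (an encoding, not encryption) to deal with such characters so that the data stays intact and the output stays readable. After all, the generated XML is not only read by machines; much of it also has to be friendly to human readers. The idea: a string that contains illegal characters is Base64-encoded, and its XML tag gets a base64=True attribute; a string without illegal characters is written out as-is, with only <, & and > escaped, and its tag gets no base64 attribute.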
# -*- encoding: utf-8 -*-
"""
Created on 2011-11-08
@summary: helper functions may be used in xml process
@author: JerryKwan
"""
# NOTE: this is Python 2 code (byte strings, print statement).
try:
    import xml.sax.saxutils
except ImportError:
    raise ImportError("requires xml.sax.saxutils package, please check if xml.sax.saxutils is installed!")
import base64
import logging

logger = logging.getLogger(__name__)

__all__ = ["escape", "unescape"]


def escape(data):
    """
    @summary:
        Escape '&', '<', and '>' in a string of data.
        if the data is not ascii, then encode in base64
    @param data: the data to be processed
    @return {"base64": True | False,
             "data": data}
    """
    is_base64 = False
    escaped_data = ""
    try:
        # check if all of the data is in ascii code
        data.decode("ascii")
        is_base64 = False
        # pure ascii: only escape the xml markup characters
        escaped_data = xml.sax.saxutils.escape(data)
    except UnicodeDecodeError:
        logger.debug("%r is not ascii-encoded string, so i will encode it in base64", data)
        # base64 encode
        escaped_data = base64.b64encode(data)
        is_base64 = True
    return {"base64": is_base64,
            "data": escaped_data}


def unescape(data, is_base64=False):
    """
    @summary:
        Unescape '&', '<', and '>' in a string of data.
        if is_base64 is True, then base64 decode will be processed first
    @param data: the data to be processed
    @param is_base64: specify if the data is encoded by base64
    @return: unescaped data
    """
    unescaped_data = data
    if is_base64:
        try:
            unescaped_data = base64.b64decode(data)
        except Exception as ex:
            # on failure the input is returned unchanged
            logger.debug("some exception occurred when invoking b64decode")
            logger.error(ex)
            print ex
    else:
        # unescape it
        unescaped_data = xml.sax.saxutils.unescape(data)
    return unescaped_data


if __name__ == "__main__":

    def test(data):
        print "original data is: ", data
        t1 = escape(data)
        print "escaped result: ", t1
        print "unescaped result is: ", unescape(t1["data"], t1["base64"])
        print "#" * 50

    test("123456")
    test("测试")
    test("< & >")
    test("`!@#$%^&*:'\"-=")
    print "just a test"
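For readers on Python 3, the same idea can be sketched roughly as follows. This is an adaptation, not the original code. It assumes the input is a `str` and encodes it as UTF-8 before Base64. Beyond what the original does, it also sends ASCII control characters to Base64, since characters such as NUL cannot appear in XML 1.0 at all. The names `_SAFE_ASCII` and `to_element` are hypothetical helpers introduced only for illustration.

```python
import base64
import re
from xml.sax.saxutils import escape as xml_escape, unescape as xml_unescape

# Assumption: a string is "safe" if it consists only of tab, LF, and
# printable ASCII. CR is left out because XML parsers normalize line
# endings, so a literal CR would not survive a round trip.
_SAFE_ASCII = re.compile(r"[\t\n\x20-\x7e]*")


def escape(text):
    """Return {"base64": bool, "data": str} for a Python 3 str."""
    if _SAFE_ASCII.fullmatch(text):
        # Safe ASCII: only escape the markup characters &, < and >.
        return {"base64": False, "data": xml_escape(text)}
    # Anything else: UTF-8 encode, then Base64, and return ASCII text.
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return {"base64": True, "data": encoded}


def unescape(data, is_base64=False):
    """Inverse of escape()."""
    if is_base64:
        return base64.b64decode(data).decode("utf-8")
    return xml_unescape(data)


def to_element(tag, text):
    """Render <tag>...</tag>, adding base64="True" only when needed.

    Hypothetical helper: tag is assumed to be a valid XML name.
    """
    result = escape(text)
    attr = ' base64="True"' if result["base64"] else ""
    return "<{0}{1}>{2}</{0}>".format(tag, attr, result["data"])


if __name__ == "__main__":
    for sample in ["123456", "测试", "< & >", "`!@#$%^&*:'\"-="]:
        print(to_element("value", sample))
```

When the XML is later read with a real parser, the parser already undoes the `&lt;`-style escaping, so only elements carrying the base64 attribute need an extra decoding step.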
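Note: the approach above is deliberately simple. It only distinguishes ASCII from non-ASCII data: ASCII strings get <, & and > escaped, and anything non-ASCII is Base64-encoded wholesale. For better compatibility, you could use the chardet package to detect each string's encoding and convert everything to UTF-8 before storing it, which would make the approach much more widely applicable.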
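A minimal Python 3 sketch of that normalization step might look like the following. It is only one possible arrangement, under a few assumptions: chardet is a third-party package that may not be installed, so the import is optional; valid UTF-8 is accepted first and chardet is consulted only as a fallback; and the function name `to_utf8_text` is hypothetical. Detection on short inputs is a guess, so the result should not be treated as authoritative.

```python
try:
    import chardet  # third-party package, assumed to be installed separately
except ImportError:
    chardet = None


def to_utf8_text(raw):
    """Best-effort conversion of bytes of unknown encoding to str.

    Hypothetical helper: the returned str can be encoded as UTF-8
    when the XML is written out.
    """
    if isinstance(raw, str):
        return raw
    try:
        # Most common case: the bytes are already valid UTF-8 (or ASCII).
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Fall back to chardet's guess, if the package is available.
    guess = chardet.detect(raw)["encoding"] if chardet else None
    if guess:
        try:
            return raw.decode(guess)
        except (LookupError, UnicodeDecodeError):
            pass
    # Last resort: keep what can be decoded, replace the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

Strings normalized this way can then be written to the XML directly as UTF-8 text, leaving Base64 for data that is genuinely binary or contains characters XML does not allow.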