Chinese characters become &#xxxx in XML written by ElementTree under Python 2.6.6

The problem: under Python 2.6.6, ElementTree turns Chinese characters into &#xxxx; references when it writes XML. The following script reproduces it:

# coding: gbk
import xml.etree.ElementTree as ET

rootelem = ET.Element("SystemList")
organization = ET.SubElement(rootelem, "Organization")
organization.attrib["label"] = "<空>"
organization.attrib["id"] = "-1"
organization.text = "测试"
print ET.tostring(rootelem, encoding="gbk")

Output:
<?xml version='1.0' encoding='gbk'?>
<SystemList><Organization id="-1" label="&lt;&#191;&#213;&gt;">&#178;&#226;&#202;&#212;</Organization></SystemList>

Run inside Eclipse, however, the same script works fine and prints:

<?xml version='1.0' encoding='gbk'?>
<SystemList><Organization id="-1" label="&lt;空&gt;">测试</Organization></SystemList>

Analysis:

Reading the source of xml.etree.ElementTree shows that the "&#" sequences are produced by its _encode_entity() function:

def _encode_entity(text, pattern=_escape):
    # map reserved and non-ascii characters to numerical entities
    def escape_entities(m, map=_escape_map):
        out = []
        append = out.append
        for char in m.group():
            text = map.get(char)
            if text is None:
                text = "&#%d;" % ord(char)
            append(text)
        return string.join(out, "")
    try:
        return _encode(pattern.sub(escape_entities, text), "ascii")
    except TypeError:
        _raise_serialization_error(text)
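To make the effect of this function concrete, here is a minimal sketch that approximates it in Python 3. It is not the library's code: `encode_entity` is a hypothetical name, and the pattern and map are assumed to mirror the module-level `_escape` and `_escape_map`. The byte string is decoded as latin-1 only so that each byte becomes one character with the same ordinal, which is how Python 2 saw the characters of a str.

```python
import re

# Assumed to mirror ElementTree's module-level _escape and _escape_map.
_escape = re.compile('[&<>"\u0080-\uffff]+')
_escape_map = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;"}


def encode_entity(raw):
    """Approximate _encode_entity() applied to a Python 2 byte string."""
    # latin-1 maps every byte to the character with the same ordinal.
    text = raw.decode("latin-1")

    def escape_entities(m):
        out = []
        for char in m.group():
            replacement = _escape_map.get(char)
            if replacement is None:
                # Not a reserved character: emit a numeric reference.
                replacement = "&#%d;" % ord(char)
            out.append(replacement)
        return "".join(out)

    return _escape.sub(escape_entities, text)


label = encode_entity("<空>".encode("gbk"))
body = encode_entity("测试".encode("gbk"))
print(label)  # &lt;&#191;&#213;&gt;
print(body)   # &#178;&#226;&#202;&#212;
```

The numbers are the individual GBK bytes (0xBF 0xD5 for 空), not Unicode code points, so the resulting references do not even decode back to the original characters.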

The _escape_attrib() function calls _encode_entity():

def _escape_attrib(text, encoding=None, replace=string.replace):
    # escape attribute value
    try:
        if encoding:
            try:
                text = _encode(text, encoding)
            except UnicodeError:
                return _encode_entity(text)
        text = replace(text, "&", "&amp;")
        text = replace(text, "'", "&apos;") # FIXME: overkill
        text = replace(text, "\"", "&quot;")
        text = replace(text, "<", "&lt;")
        text = replace(text, ">", "&gt;")
        return text
    except (TypeError, AttributeError):
        _raise_serialization_error(text)

The _write() method of the ElementTree class in turn calls _escape_attrib(), which is how sequences such as &#191;&#213; end up in the output.
So why is _encode_entity() called at all?

The code of _escape_attrib() shows that it first tries _encode() and falls back to _encode_entity() only when that fails.

So let's look at _encode():

def _encode(s, encoding):
    try:
        return s.encode(encoding)
    except AttributeError:
        return s # 1.5.2: assume the string uses the right encoding

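Here s = "<空>" is of type str, and encoding = "gbk".
In this case s.encode(encoding) fails, because Python first tries to convert the str to unicode and only then calls encode() on the result. That implicit conversion uses the encoding returned by sys.getdefaultencoding(). At the Python command line this is ascii; in Eclipse it is the system default encoding (gbk on Chinese Windows).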

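At the Python command line, "<空>".encode("gbk") therefore raises a UnicodeDecodeError. That is a subclass of UnicodeError, so the except UnicodeError branch in _escape_attrib() catches it and falls back to _encode_entity().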
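The implicit decode step can be simulated in Python 3, where it has to be written out explicitly. This is only a sketch of the Python 2 behaviour described above; `py2_str_encode` and its `default_encoding` parameter are hypothetical stand-ins for str.encode() and sys.getdefaultencoding().

```python
# The byte string Python 2 would hold after reading "<空>" from a gbk source file.
raw = "<空>".encode("gbk")


def py2_str_encode(raw_bytes, target, default_encoding):
    """Approximate Python 2's str.encode(): implicit decode, then encode."""
    return raw_bytes.decode(default_encoding).encode(target)


# Default encoding ascii (Python command line): the implicit decode fails.
try:
    py2_str_encode(raw, "gbk", "ascii")
    ascii_result = "ok"
except UnicodeError as exc:
    ascii_result = type(exc).__name__

# Default encoding gbk (the Eclipse case): the round trip succeeds.
gbk_result = py2_str_encode(raw, "gbk", "gbk")

print(ascii_result)       # UnicodeDecodeError
print(gbk_result == raw)  # True
```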

Running in Eclipse, on the other hand, is equivalent to having already executed:
import sys
reload(sys)
sys.setdefaultencoding("gbk")

So converting "<空>" to unicode does not fail there.
With the cause clear, how do we fix it?

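There are two approaches:

1. Do what Eclipse does and add the three lines above by hand, changing the system default encoding from ascii to gbk. This may cause compatibility problems in third-party libraries, though, and logging may stop working properly when certain errors occur.

2. Modify ElementTree's _encode() function as follows:
def _encode(s, encoding):
    try:
        if isinstance(s, str):
            # a byte string is assumed to be in the target encoding already
            return s
        return s.encode(encoding)
    except AttributeError:
        return s # 1.5.2: assume the string uses the right encoding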

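You can define this function in your own .py file and then, after
import xml.etree.ElementTree as ET

run:

ET._encode = _encode

to override ElementTree's _encode() function.

This approach does not work, however, for the C implementation:

import xml.etree.cElementTree as ET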
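Why the override takes effect for the pure-Python module can be sketched with a throwaway module in Python 3. Everything here is hypothetical: `fake_et`, `serialize`, and `passthrough_encode` are illustrative stand-ins, not ElementTree's API. The point is that a Python function looks up `_encode` in its module's globals at call time, so rebinding the module attribute changes what it calls; code implemented in C does not go through that lookup.

```python
import types

# Hypothetical stand-in for a pure-Python module such as xml.etree.ElementTree.
fake_et = types.ModuleType("fake_et")
exec(
    "def _encode(s, encoding):\n"
    "    return s.encode(encoding)\n"
    "\n"
    "def serialize(text, encoding):\n"
    "    # _encode is resolved in the module globals on every call\n"
    "    return _encode(text, encoding)\n",
    fake_et.__dict__,
)


def passthrough_encode(s, encoding):
    # Analogue of the patched _encode(): byte strings pass through unchanged.
    if isinstance(s, bytes):
        return s
    return s.encode(encoding)


raw = "<空>".encode("gbk")

# Before patching: Python 3 bytes objects have no encode() method.
try:
    fake_et.serialize(raw, "gbk")
    before = "ok"
except AttributeError:
    before = "failed"

# After patching: serialize() picks up the replacement without being redefined.
fake_et._encode = passthrough_encode
after = fake_et.serialize(raw, "gbk")

print(before)        # failed
print(after == raw)  # True
```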

Original article: https://www.cnblogs.com/ddgg/p/3093285.html
