[python] python xml ElementTree


XML example
```
<?xml version="1.0" encoding="utf-8"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>
```
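XML is a structured data format. In ET, an ElementTree represents the whole XML document and treats it as a tree, while an Element represents a single node in that tree.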

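Parsing XML

Reading XML from disk

(cElementTree is the C implementation and is fast; ElementTree is the pure-Python implementation and is somewhat slower, so use the C version when it is available. Note that this only matters on Python 2: since Python 3.3 ElementTree uses the C accelerator automatically, and cElementTree was removed in Python 3.9, in which case the fallback import below is the one that runs.)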
```
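try:
    import xml.etree.cElementTree as ET   # C version (Python 2, removed in Python 3.9)
except ImportError:
    import xml.etree.ElementTree as ET    # standard module
tree = ET.parse(self.tempfilename)  # open the XML document (path held by the surrounding class)
root = tree.getroot()               # get the root node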
```
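The snippet above relies on self.tempfilename from the surrounding class, so it does not run on its own. As a self-contained sketch, the sample document can be written to a temporary file and parsed from there. The names SAMPLE_XML and path are illustrative and not part of the original code, and the document is shortened.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Shortened copy of the sample document (illustrative, not the full example).
SAMPLE_XML = """<?xml version="1.0" encoding="utf-8"?>
<data>
    <country name="Liechtenstein"><rank>1</rank></country>
    <country name="Singapore"><rank>4</rank></country>
</data>
"""

# Write the sample to a temporary file so there is something on disk to parse.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write(SAMPLE_XML)

tree = ET.parse(path)   # ElementTree: the whole document
root = tree.getroot()   # Element: the root node, <data>
os.remove(path)         # clean up the temporary file

print(root.tag)         # data
print(len(root))        # 2 direct children
```

ET.parse returns the ElementTree for the whole file, and getroot() hands back the Element from which all further navigation starts.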
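Element object attributes
tag: a string identifying what kind of data the element represents.

attrib: a dictionary holding the element's attributes.

text: a string holding the element's content, that is, the text inside the element before its first child.

tail: a string holding the text that follows the element's closing tag, up to the next tag.

A number of child elements.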
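
The attributes listed above can be inspected on a small fragment. This is only a sketch: the fragment is adapted from the sample document, and the text after the closing rank tag was added here purely so that tail has something to show.

```python
import xml.etree.ElementTree as ET

# Fragment adapted from the sample; " trailing text" is an addition for this demo.
root = ET.fromstring(
    '<country name="Singapore"><rank>4</rank> trailing text<year>2011</year></country>'
)

rank = root[0]            # child elements are accessed like list items

print(root.tag)           # country
print(root.attrib)        # {'name': 'Singapore'}
print(rank.text)          # 4
print(repr(rank.tail))    # ' trailing text'
print(len(root))          # 2 child elements
```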

Node operations

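Element.iter(tag=None): [returns an iterator, not a list] walks the element and all of its descendants in document order; if tag is given, only elements with that tag are yielded.
Element.findall(path): [returns a list] finds all elements under the current element that match the tag or path; a plain tag name matches direct children only.
Element.find(path): [returns an Element, or None if nothing matches] finds the first element under the current element that matches the tag or path; a plain tag name matches direct children only.
Element.text: the text value of the current element.
Element.get(key, default=None): returns the value of the attribute named key, or default if the element has no such attribute.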
```
>>> for child in root:
...     print(child.tag, child.attrib)
...
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}

>>> for neighbor in root.iter('neighbor'):
...     print(neighbor.attrib)
...
{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
```
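The session above covers iteration and iter(). The remaining operations, findall(), find(), text and get(), might be used together as in the sketch below. It assumes a shortened inline copy of the sample document, and the attribute name population is made up to show the default value.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the sample document (no XML declaration needed here).
root = ET.fromstring("""
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>
""")

for country in root.findall("country"):      # direct children of <data>
    name = country.get("name")               # attribute lookup
    rank = country.find("rank").text         # first matching child, then its text
    neighbors = [n.get("name") for n in country.findall("neighbor")]
    print(name, rank, neighbors)

direct = root.findall("neighbor")            # neighbor is not a direct child of <data>
nested = list(root.iter("neighbor"))         # iter() searches all descendants
missing = root.find("missing")               # no match
fallback = root[0].get("population", "unknown")  # hypothetical attribute, default used
print(len(direct), len(nested), missing, fallback)
```

The loop prints Liechtenstein 1 ['Austria', 'Switzerland'] and Singapore 4 ['Malaysia']. The last line prints 0 3 None unknown, which shows the difference between findall() with a plain tag name and iter().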
I did not need to write any XML this time, so I have not looked into write operations and will skip them for now.

XML encoding issues

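Python's encoding handling tripped me up this time.
I found the following passage online:
In Python, dom and ElementTree only support parsing UTF-8 files, so whatever method you use, make sure the file is UTF-8 before parsing. Text processing in Python generally works without trouble in UTF-8, while other encodings all cause problems to some degree, so avoid Chinese-specific encodings when generating files!
Use UTF-8, without a BOM if cross-platform use is involved. GBK is also possible (best avoided), but UTF-16 simply cannot be used.
(This claim is overstated: the underlying expat parser can read UTF-16 as long as the declaration matches the actual bytes. What it rejects is a declaration that contradicts the real encoding, which is what happened here, as described at the end of this section.)
And yet the XML files I had to batch-process turned out to be declared as UTF-16, much to my dismay!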
```
Loading the UTF-16 file directly with the method above raises an error:
ParseError: encoding specified in XML declaration is incorrect: line 1, column 30
```
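The workaround: decode the XML from UTF-16, change the encoding named in the first line to utf-8, and save the result as a new UTF-8 file for later parsing.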
```
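# file_name: path of the UTF-16 encoded .xml file to convert (defined elsewhere)
with open(file_name, 'r', encoding='utf-16') as f:
    text = f.read()   # decoded text (str)
# Make the declaration match the encoding the new file is written in.
# The replacement is done on str before encoding, so it also works on Python 3.
text = text.replace('<?xml version="1.0" encoding="utf-16"?>',
                    '<?xml version="1.0" encoding="utf-8"?>')
tempfilename = file_name.split('.xml')[0] + 'temp.xml'
with open(tempfilename, 'w', encoding='utf-8') as f:
    f.write(text)     # written out as UTF-8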
```
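Later I happened to check the files' actual encoding and found that the XML I had to process was in fact encoded as UTF-8; only the declaration in the header said utf-16. So there was no need to decode from UTF-16 as in the code above: opening the file normally and replacing the declaration is enough.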
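
That situation, UTF-8 bytes carrying a utf-16 declaration, can be reproduced and repaired in memory. The following is a minimal sketch rather than the original batch script: the variable names and the inline sample bytes are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Bytes that are really UTF-8 but claim to be UTF-16 (illustrative sample).
raw = (
    '<?xml version="1.0" encoding="utf-16"?>'
    '<data><country name="Österreich"/></data>'
).encode("utf-8")

try:
    ET.fromstring(raw)
    parsed_directly = True
except ET.ParseError:
    parsed_directly = False   # the declaration contradicts the actual bytes

# Rewrite only the declaration, working on bytes so nothing gets re-encoded.
fixed = raw.replace(b'encoding="utf-16"', b'encoding="utf-8"', 1)
root = ET.fromstring(fixed)

print(parsed_directly)        # False
print(root[0].get("name"))    # Österreich
```

Working on bytes keeps the content untouched and changes only the header, which matches the observation that the data itself was already UTF-8.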

In Vim on Linux, the encoding of the open file can be checked with the command :set fileencoding
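
A similar check can be approximated in Python before deciding how to open a file. The helper below, sniff_encoding, is a hypothetical name and only a rough heuristic: it looks for a byte order mark and otherwise tests whether the bytes decode as UTF-8. It does not distinguish UTF-32 from UTF-16 and cannot identify encodings such as GBK.

```python
import codecs

def sniff_encoding(raw):
    """Rough guess at the encoding of raw bytes (hypothetical helper)."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"                    # UTF-8 with a BOM
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"                       # BOM tells the byte order
    try:
        raw.decode("utf-8")
        return "utf-8"                        # decodes cleanly as UTF-8
    except UnicodeDecodeError:
        return "unknown"

utf16_guess = sniff_encoding("<data/>".encode("utf-16"))
utf8_guess = sniff_encoding("<data/>".encode("utf-8"))
print(utf16_guess, utf8_guess)
```

For a file on disk, the bytes would come from reading it in binary mode, for example open(path, 'rb').read(), where path is whatever file is being checked.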

Reference

http://blog.csdn.net/gingerredjade/article/details/21944675
http://www.cnblogs.com/CheeseZH/p/4026686.html
http://www.jb51.net/article/67120.htm
http://blog.csdn.net/whzhcahzxh/article/details/33735293
http://blog.csdn.net/jnbbwyth/article/details/6991425/
http://www.cnblogs.com/findumars/p/3620076.html
http://blog.csdn.net/guang11cheng/article/details/7491715
Original article: https://www.cnblogs.com/zhanxiage1994/p/7562379.html
