利用python进行数据分析-第六章笔记

zoukankan html css js c++ java

利用python进行数据分析-第六章笔记
Chapter 6 Data Loading, Storage, and File Formats

Reading and Writing Data in Text Format

最常用的是 read_csv 和 read_table，不过数模竞赛里很多都是用 excel 给数据，不知道今年是个啥情况。

下表是一些常用的数据读取方法：

其中根据试验数据的不同形式，可以选择不同的read_csv参数进行调整。这部分我觉得根据具体数据具体处理即可，用到再去查文档，现在不必要把所有的参数都记住。

需要注意的是，read_csv会把数据中的 NA 和 NULL 标记为空值；也可以通过 na_values参数设置空值：
```
result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])

# 可以用下面这种形式分别规定每一列中的空值
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
```
Reading Text Files in Pieces

这部分主要是用 chunksize 这个参数。
```
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
tot = pd.Series([])
chunker_list = []
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
    chunker_list.append(piece)

print(chunker_list)
```
观察输出，可以发现，chunker 相当于是一个 iterator，每个元素都是 chunksize 的大小。需要注意的是，chunker 迭代之后就不可以再用了，这种特性有点类似于之前提到过的 generator。

Writing Data to Text Format

主要是使用to_csv这个方法。

Working with Delimited Formats

In some cases, however, some manual processing may be necessary. It’s not uncommon to receive a file with one or more malformed lines that trip up read_table.

read_csv和read_table功能已经足够强大，但是为了防止数据中存在格式不正确的数据使得这两个方法失效，所以考虑 python 内置的 csv 读取方法：
```
import csv
f = open('examples/ex7.csv')
reader = csv.reader(f)
```
一个简单的读取数据方法：
```
with import csv
with open('examples/ex7.csv') as f:
    lines = list(csv.reader(f))
    headers, values = lines[0], lines[1:]
    data_dict = {h: v for h, v in zip(headers, zip(*values))}

print(data_dict)
```
为了个性化的定义数据的分隔符等，可以用csv.Dialect子类，通过一些属性来定义数据的组织形式：
```
class my_dialect(csv.Dialect):
    lineterminator = '
'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL
with open('examples/ex7.csv') as f:
    reader = csv.reader(f, dialect=my_dialect)
```
这里是一些 dialect 的属性。

JSON Data

可以使用 python 内置的 json 进行读取，也可以用pd.read_json，但是read_json默认情况下认为 json 文件中的每一行就是 DataFrame 的每一行。

XML and HTML: Web Scraping

主要使用read_html方法，顺便可以进行一些简单的数据分析。

Parsing XML with lxml.objectify

使用 lxml.objectify.parse 来对 XML 数据文件进行处理，调用 getroot() 方法得到处理后数据的 root，调用 XML 最外层的标签，会返回一个可迭代的对象。

具体遇到直接查文档吧。。。

Binary Data Formats

One of the easiest ways to store data (also known as serialization) efficiently in binary format is using Python’s built-in pickle serialization. pandas objects all have a to_pickle method that writes the data to disk in pickle format.

同时，作者指出应该慎用 pickle 这种方式，只推荐用 pickle 保存短期的数据：

pickle is only recommended as a short-term storage format. The problem is that it is hard to guarantee that the format will be stable over time; an object pickled today may not unpickle with a later version of a library.

Using HDF5 Format

HDF5( hierarchical data format )便于存储大型、复杂的数据集结构，pandas 提供了一个高级接口 HDFStore 来存储 DataFrame 和Series。
```
frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
print(store)
```
HDFStore 支持两种存储形式，一种是 fixed，一种是 table，详细的文档在这里，可以看到 table 的速度虽然慢，但是支持 searching and selecting subsets 操作。

HDF5 is not a database. It is best suited for write-once, read-many datasets. While data can be added to a file at any time, if multiple writers do so simultaneously, the file can become corrupted.

Read Microsoft Excel Files

pandas also supports reading tabular data stored in Excel 2003 (and higher) files using either the ExcelFile class or pandas.read_excel function.

To write pandas data to Excel format, you must first create an ExcelWriter, then write data to it using pandas objects’ to_excel method:
```
writer = pd.ExcelWriter('examples/ex2.xlsx')
frame.to_excel(writer, 'Sheet1')
# another method
frame.to_excel('examples/ex2.xlsx')
writer.save()
```
Interacting with Web APIs

这部分其实与 pandas 关系不大，不细读了。

Interacting with Databases

同上。

Conclusion

Getting access to data is frequently the first step in the data analysis process. We have looked at a number of useful tools in this chapter that should help you get started. In the upcoming chapters we will dig deeper into data wrangling, data visualization, time series analysis, and other topics.
查看全文

相关阅读:
不使用库函数，编写函数int strcmp(char *source, char *dest) 相等返回0，不等返回-1【转】
atol实现【转】
atol的实现【转】
关于内存中栈和堆的区别（非数据结构中的堆和栈，区别）【转】
ubuntu下安装android模拟器genymotion【转】
buntu下命令行安装jdk，android-studio，及genymotion虚拟机来进行android开发【转】
Ubuntu下安装Android studio【转】
C++模板（二）【转】
【转】iOS中设置导航栏标题的字体颜色和大小
 【转】Java 截取字符串

原文地址：https://www.cnblogs.com/LuoboLiam/p/13653781.html

利用python进行数据分析-第六章笔记

Chapter 6 Data Loading, Storage, and File Formats

Reading and Writing Data in Text Format

Reading Text Files in Pieces

Writing Data to Text Format

Working with Delimited Formats

JSON Data

XML and HTML: Web Scraping

Parsing XML with lxml.objectify

Binary Data Formats

Using HDF5 Format

Read Microsoft Excel Files

Interacting with Web APIs

Interacting with Databases

Conclusion