  • The pandas module

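    Official pandas documentation: https://pandas.pydata.org/pandas-docs/stable/?v=20190307135750

    pandas is built on NumPy and is designed for working with text and tabular data. It has two main data structures: Series, which resembles a one-dimensional NumPy array, and DataFrame, which resembles a two-dimensional table with labeled rows and columns.

    pandas is the core Python module for data analysis. It provides five main capabilities:

    1. File input and output: databases (SQL), HTML, JSON, pickle, CSV (txt, Excel), SAS, Stata, HDF, and more.
    2. Single-table operations: creating, deleting, updating, and querying data; slicing; higher-order functions; grouping and aggregation; and conversion to and from dict and list.
    3. Joining and merging multiple tables.
    4. Simple plotting.
    5. Simple statistical analysis.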

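    1. The Series data structure

    A Series is a one-dimensional, array-like object made up of a set of data and an associated set of data labels (the index).

    A Series behaves like a cross between a list (array) and a dictionary.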

    import numpy as np
    import pandas as pd

    df = pd.Series(0, index=['a', 'b', 'c', 'd'])
    print(df)

    a    0
    b    0
    c    0
    d    0
    dtype: int64

    print(df.values)

    [0 0 0 0]

    print(df.index)

    Index(['a', 'b', 'c', 'd'], dtype='object')

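    1.1 Series supports NumPy-style features (positional indexing)

    Description                      Method
    Create a Series from an ndarray  Series(arr)
    Arithmetic with a scalar         df*2
    Arithmetic between two Series    df1+df2
    Indexing                         df[0], df[[1,2,4]]
    Slicing                          df[0:2]
    Universal functions              np.abs(df)
    Boolean filtering                df[df>0]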
    arr = np.array([1, 2, 3, 4, np.nan])
    print(arr)

    [ 1.  2.  3.  4. nan]

    df = pd.Series(arr, index=['a', 'b', 'c', 'd', 'e'])
    print(df)

    a    1.0
    b    2.0
    c    3.0
    d    4.0
    e    NaN
    dtype: float64

    print(df**2)

    a     1.0
    b     4.0
    c     9.0
    d    16.0
    e     NaN
    dtype: float64
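    # Positional access. On a Series with string labels, recent pandas
    # versions deprecate plain df[0] and df[[0, 1, 2]] in favor of iloc.
    print(df.iloc[0])

    1.0

    print(df['a'])

    1.0

    print(df.iloc[[0, 1, 2]])

    a    1.0
    b    2.0
    c    3.0
    dtype: float64

    print(df[0:2])

    a    1.0
    b    2.0
    dtype: float64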
    np.sin(df)

    a    0.841471
    b    0.909297
    c    0.141120
    d   -0.756802
    e         NaN
    dtype: float64

    df[df > 1]

    b    2.0
    c    3.0
    d    4.0
    dtype: float64
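
    The table in this subsection lists arithmetic between two Series, which the examples do not demonstrate. The following is a minimal sketch, assuming two small made-up Series named s1 and s2 (they are not from the original article). It shows that pandas aligns the operands by label rather than by position, and that a label present in only one operand produces NaN.

```python
import numpy as np
import pandas as pd

# Two hypothetical Series with partly overlapping labels
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition aligns on labels, not on position
total = s1 + s2
print(total)

# Series.add with fill_value treats a missing label as 0 instead of NaN
filled = s1.add(s2, fill_value=0)
print(filled)
```

    Label alignment is why the result has four entries even though each input has three.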

    1.2 Series supports dictionary-style features (labels)

    Description                        Method
    Create a Series from a dictionary  Series(dic)
    The in operator                    'a' in sr
    Indexing by key                    sr['a'], sr[['a', 'b', 'd']]
    df = pd.Series({'a': 1, 'b': 2})
    print(df)

    a    1
    b    2
    dtype: int64

    print('a' in df)

    True

    print(df['a'])

    1
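
    Indexing with a list of keys and converting back to plain Python containers are mentioned in this article but not shown. A short sketch, assuming a made-up Series named sr:

```python
import pandas as pd

# Hypothetical Series built from a dictionary
sr = pd.Series({'a': 1, 'b': 2, 'd': 4})

# A list of labels returns a sub-Series
picked = sr[['a', 'd']]
print(picked)

# Like dict.get, Series.get returns a default for a missing label
missing = sr.get('c', 0)
print(missing)

# Conversion back to plain Python containers
print(sr.to_dict())
print(sr.tolist())
```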

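    1.3 Handling missing data in a Series

    Method     Description
    dropna()   Drop the entries whose value is NaN
    fillna()   Fill in missing values
    isnull()   Return a boolean array that is True where a value is missing
    notnull()  Return a boolean array that is False where a value is missing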
    df = pd.Series([1, 2, 3, 4, np.nan], index=['a', 'b', 'c', 'd', 'e'])
    print(df)

    a    1.0
    b    2.0
    c    3.0
    d    4.0
    e    NaN
    dtype: float64

    print(df.dropna())

    a    1.0
    b    2.0
    c    3.0
    d    4.0
    dtype: float64

    print(df.fillna(5))

    a    1.0
    b    2.0
    c    3.0
    d    4.0
    e    5.0
    dtype: float64

    print(df.isnull())

    a    False
    b    False
    c    False
    d    False
    e     True
    dtype: bool

    print(df.notnull())

    a     True
    b     True
    c     True
    d     True
    e    False
    dtype: bool

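    2. The DataFrame data structure

    A DataFrame is a tabular data structure that holds an ordered collection of columns.

    A DataFrame can be viewed as a dictionary of Series that all share one index.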
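
    The "dictionary of Series" view can be made concrete with a small sketch. The names height, weight, and people are made up for illustration. Building a DataFrame from a dict of Series aligns them on the union of their labels, and selecting a column returns a Series on that shared index.

```python
import pandas as pd

# Hypothetical columns, each a Series with its own labels
height = pd.Series([1.70, 1.82], index=['ann', 'bob'])
weight = pd.Series([60, 75, 68], index=['ann', 'bob', 'cat'])

# The DataFrame index is the union of the Series indexes;
# 'cat' has no height, so that cell becomes NaN
people = pd.DataFrame({'height': height, 'weight': weight})
print(people)

# Selecting a column gives back a Series on the shared index
print(type(people['weight']))
print(people.index.tolist())
```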

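    2.1 Generating an array of datetimes: date_range

    date_range parameters:

    Parameter  Description
    start      Start time
    end        End time
    periods    Number of periods to generate
    freq       Frequency; defaults to 'D'. Options include H(our), W(eek), B(usiness day), S(emi-)M(onth), T (minutes), S(econd), A (year), … Note that pandas 2.2 and later rename several of these aliases (for example 'h', 'min', 's', 'ME', 'YE').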
    dates = pd.date_range('20190101', periods=6, freq='M')
    print(dates)

    DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
                   '2019-05-31', '2019-06-30'],
                  dtype='datetime64[ns]', freq='M')
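
    The example above uses only periods and freq. As a hedged sketch of the other parameter combinations, the snippet below uses made-up dates and the 'D', 'W', and 'B' aliases.

```python
import pandas as pd

# start + end: one entry per calendar day, both endpoints included
days = pd.date_range(start='2019-01-01', end='2019-01-05')
print(days)

# start + periods + freq: four weekly dates ('W' anchors on Sundays)
weeks = pd.date_range(start='2019-01-01', periods=4, freq='W')
print(weeks)

# 'B' keeps business days only, so the weekend of 5-6 January is skipped
business = pd.date_range(start='2019-01-04', periods=3, freq='B')
print(business)
```

    Of start, end, periods, and freq, three are enough to determine the range.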
    np.random.seed(1)
    arr = 10 * np.random.randn(6, 4)
    print(arr)

    [[ 16.24345364  -6.11756414  -5.28171752 -10.72968622]
     [  8.65407629 -23.01538697  17.44811764  -7.61206901]
     [  3.19039096  -2.49370375  14.62107937 -20.60140709]
     [ -3.22417204  -3.84054355  11.33769442 -10.99891267]
     [ -1.72428208  -8.77858418   0.42213747   5.82815214]
     [-11.00619177  11.4472371    9.01590721   5.02494339]]

    df = pd.DataFrame(arr, index=dates, columns=['c1', 'c2', 'c3', 'c4'])
    df

                       c1         c2         c3         c4
    2019-01-31  16.243454  -6.117564  -5.281718 -10.729686
    2019-02-28   8.654076 -23.015387  17.448118  -7.612069
    2019-03-31   3.190391  -2.493704  14.621079 -20.601407
    2019-04-30  -3.224172  -3.840544  11.337694 -10.998913
    2019-05-31  -1.724282  -8.778584   0.422137   5.828152
    2019-06-30 -11.006192  11.447237   9.015907   5.024943

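    3. DataFrame attributes

    Attribute    Description
    dtypes       Data type of each column
    index        Row labels (the index)
    columns      Column labels
    values       The data itself, without the header or the index
    describe     Summary statistics for each column (count, mean, standard deviation, extremes, quartiles); numeric data only
    transpose    Transpose; also available as T
    sort_index   Sort by the row or column labels
    sort_values  Sort by the data values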
    # Check the data type of each column
    print(df.dtypes)

    c1    float64
    c2    float64
    c3    float64
    c4    float64
    dtype: object
    df

                       c1         c2         c3         c4
    2019-01-31  16.243454  -6.117564  -5.281718 -10.729686
    2019-02-28   8.654076 -23.015387  17.448118  -7.612069
    2019-03-31   3.190391  -2.493704  14.621079 -20.601407
    2019-04-30  -3.224172  -3.840544  11.337694 -10.998913
    2019-05-31  -1.724282  -8.778584   0.422137   5.828152
    2019-06-30 -11.006192  11.447237   9.015907   5.024943

    print(df.index)

    DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
                   '2019-05-31', '2019-06-30'],
                  dtype='datetime64[ns]', freq='M')

    print(df.columns)

    Index(['c1', 'c2', 'c3', 'c4'], dtype='object')

    print(df.values)

    [[ 16.24345364  -6.11756414  -5.28171752 -10.72968622]
     [  8.65407629 -23.01538697  17.44811764  -7.61206901]
     [  3.19039096  -2.49370375  14.62107937 -20.60140709]
     [ -3.22417204  -3.84054355  11.33769442 -10.99891267]
     [ -1.72428208  -8.77858418   0.42213747   5.82815214]
     [-11.00619177  11.4472371    9.01590721   5.02494339]]
    df.describe()

                  c1         c2         c3         c4
    count   6.000000   6.000000   6.000000   6.000000
    mean    2.022213  -5.466424   7.927203  -6.514830
    std     9.580084  11.107772   8.707171  10.227641
    min   -11.006192 -23.015387  -5.281718 -20.601407
    25%    -2.849200  -8.113329   2.570580 -10.931606
    50%     0.733054  -4.979054  10.176801  -9.170878
    75%     7.288155  -2.830414  13.800233   1.865690
    max    16.243454  11.447237  17.448118   5.828152

    df.T

        2019-01-31 00:00:00  2019-02-28 00:00:00  2019-03-31 00:00:00  2019-04-30 00:00:00  2019-05-31 00:00:00  2019-06-30 00:00:00
    c1            16.243454             8.654076             3.190391            -3.224172            -1.724282           -11.006192
    c2            -6.117564           -23.015387            -2.493704            -3.840544            -8.778584            11.447237
    c3            -5.281718            17.448118            14.621079            11.337694             0.422137             9.015907
    c4           -10.729686            -7.612069           -20.601407           -10.998913             5.828152             5.024943
    # Sort by row labels (the dates); ascending by default
    df.sort_index(axis=0)

                       c1         c2         c3         c4
    2019-01-31  16.243454  -6.117564  -5.281718 -10.729686
    2019-02-28   8.654076 -23.015387  17.448118  -7.612069
    2019-03-31   3.190391  -2.493704  14.621079 -20.601407
    2019-04-30  -3.224172  -3.840544  11.337694 -10.998913
    2019-05-31  -1.724282  -8.778584   0.422137   5.828152
    2019-06-30 -11.006192  11.447237   9.015907   5.024943

    # Sort by column labels [c1, c2, c3, c4]; ascending by default
    df.sort_index(axis=1)

                       c1         c2         c3         c4
    2019-01-31  16.243454  -6.117564  -5.281718 -10.729686
    2019-02-28   8.654076 -23.015387  17.448118  -7.612069
    2019-03-31   3.190391  -2.493704  14.621079 -20.601407
    2019-04-30  -3.224172  -3.840544  11.337694 -10.998913
    2019-05-31  -1.724282  -8.778584   0.422137   5.828152
    2019-06-30 -11.006192  11.447237   9.015907   5.024943

    # Sort by the values in column c2; ascending by default
    df.sort_values(by='c2')

                       c1         c2         c3         c4
    2019-02-28   8.654076 -23.015387  17.448118  -7.612069
    2019-05-31  -1.724282  -8.778584   0.422137   5.828152
    2019-01-31  16.243454  -6.117564  -5.281718 -10.729686
    2019-04-30  -3.224172  -3.840544  11.337694 -10.998913
    2019-03-31   3.190391  -2.493704  14.621079 -20.601407
    2019-06-30 -11.006192  11.447237   9.015907   5.024943
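
    Both sort_index and sort_values sort in ascending order unless told otherwise. The sketch below uses a small made-up frame (not the article's df) whose labels are deliberately out of order, and passes ascending=False for descending order.

```python
import pandas as pd

# Small hypothetical frame; row and column labels are out of order
frame = pd.DataFrame(
    {'c2': [3, 1, 2], 'c1': [30, 10, 20]},
    index=['y', 'z', 'x'],
)

# Row labels, largest first
by_rows_desc = frame.sort_index(axis=0, ascending=False)
# Column labels in ascending order (the default)
by_cols = frame.sort_index(axis=1)
# Values of c2, largest first
by_c2_desc = frame.sort_values(by='c2', ascending=False)

print(by_rows_desc, by_cols, by_c2_desc, sep='\n\n')
```

    All three calls return a new frame and leave the original untouched.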

    4. Selecting values from a DataFrame

    df

                       c1         c2         c3         c4
    2019-01-31  16.243454  -6.117564  -5.281718 -10.729686
    2019-02-28   8.654076 -23.015387  17.448118  -7.612069
    2019-03-31   3.190391  -2.493704  14.621079 -20.601407
    2019-04-30  -3.224172  -3.840544  11.337694 -10.998913
    2019-05-31  -1.724282  -8.778584   0.422137   5.828152
    2019-06-30 -11.006192  11.447237   9.015907   5.024943

    4.1 Selecting by column

    df['c2']

    2019-01-31    -6.117564
    2019-02-28   -23.015387
    2019-03-31    -2.493704
    2019-04-30    -3.840544
    2019-05-31    -8.778584
    2019-06-30    11.447237
    Freq: M, Name: c2, dtype: float64

    df[['c2', 'c3']]

                       c2         c3
    2019-01-31  -6.117564  -5.281718
    2019-02-28 -23.015387  17.448118
    2019-03-31  -2.493704  14.621079
    2019-04-30  -3.840544  11.337694
    2019-05-31  -8.778584   0.422137
    2019-06-30  11.447237   9.015907

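    4.2 loc (selecting by row label)

    # Select rows by label; a label slice includes both endpoints.
    # The index holds month-end dates, so the slice runs from 2019-01-31
    # to 2019-03-31 (a slice such as '2019-01-01':'2019-01-03' matches no rows).
    df.loc['2019-01-31':'2019-03-31']

                       c1         c2         c3         c4
    2019-01-31  16.243454  -6.117564  -5.281718 -10.729686
    2019-02-28   8.654076 -23.015387  17.448118  -7.612069
    2019-03-31   3.190391  -2.493704  14.621079 -20.601407

    # The same rows selected by position
    df[0:3]

                       c1         c2         c3         c4
    2019-01-31  16.243454  -6.117564  -5.281718 -10.729686
    2019-02-28   8.654076 -23.015387  17.448118  -7.612069
    2019-03-31   3.190391  -2.493704  14.621079 -20.601407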
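
    loc also accepts a column selector after the row selector. A minimal sketch, assuming a made-up frame with daily dates as row labels (not the article's month-end df):

```python
import pandas as pd

# Hypothetical frame with daily dates as row labels
dates = pd.date_range('2019-01-01', periods=5, freq='D')
frame = pd.DataFrame(
    {'c1': [1, 2, 3, 4, 5], 'c2': [10, 20, 30, 40, 50]},
    index=dates,
)

# Label slices with loc include both endpoints
first_three = frame.loc['2019-01-01':'2019-01-03']
print(first_three)

# A row label and a column label together pick a single cell
cell = frame.loc[pd.Timestamp('2019-01-02'), 'c2']
print(cell)

# A row slice combined with a list of columns
subset = frame.loc['2019-01-04':, ['c2']]
print(subset)
```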

    4.3 iloc (NumPy-style selection by position)

    df.values

    array([[ 16.24345364,  -6.11756414,  -5.28171752, -10.72968622],
           [  8.65407629, -23.01538697,  17.44811764,  -7.61206901],
           [  3.19039096,  -2.49370375,  14.62107937, -20.60140709],
           [ -3.22417204,  -3.84054355,  11.33769442, -10.99891267],
           [ -1.72428208,  -8.77858418,   0.42213747,   5.82815214],
           [-11.00619177,  11.4472371 ,   9.01590721,   5.02494339]])

    # Select by integer position: row 2, column 1
    print(df.iloc[2, 1])

    -2.493703754774101

    df.iloc[1:4, 1:4]

                       c2         c3         c4
    2019-02-28 -23.015387  17.448118  -7.612069
    2019-03-31  -2.493704  14.621079 -20.601407
    2019-04-30  -3.840544  11.337694 -10.998913

    4.4 Selecting with boolean conditions

    df[df['c1'] > 0]

                       c1         c2         c3         c4
    2019-01-31  16.243454  -6.117564  -5.281718 -10.729686
    2019-02-28   8.654076 -23.015387  17.448118  -7.612069
    2019-03-31   3.190391  -2.493704  14.621079 -20.601407

    df[(df['c1'] > 0) & (df['c2'] > -8)]

                       c1         c2         c3         c4
    2019-01-31  16.243454  -6.117564  -5.281718 -10.729686
    2019-03-31   3.190391  -2.493704  14.621079 -20.601407

    5. Replacing values in a DataFrame

    df

                       c1         c2         c3         c4
    2019-01-31  16.243454  -6.117564  -5.281718 -10.729686
    2019-02-28   8.654076 -23.015387  17.448118  -7.612069
    2019-03-31   3.190391  -2.493704  14.621079 -20.601407
    2019-04-30  -3.224172  -3.840544  11.337694 -10.998913
    2019-05-31  -1.724282  -8.778584   0.422137   5.828152
    2019-06-30 -11.006192  11.447237   9.015907   5.024943

    df.iloc[0:3, 0:2] = 0
    df

                       c1         c2         c3         c4
    2019-01-31   0.000000   0.000000  -5.281718 -10.729686
    2019-02-28   0.000000   0.000000  17.448118  -7.612069
    2019-03-31   0.000000   0.000000  14.621079 -20.601407
    2019-04-30  -3.224172  -3.840544  11.337694 -10.998913
    2019-05-31  -1.724282  -8.778584   0.422137   5.828152
    2019-06-30 -11.006192  11.447237   9.015907   5.024943
    df['c3'] > 10

    2019-01-31    False
    2019-02-28     True
    2019-03-31     True
    2019-04-30     True
    2019-05-31    False
    2019-06-30    False
    Freq: M, Name: c3, dtype: bool

    # Operate on whole rows: every row where c3 > 10 is set to 100
    df[df['c3'] > 10] = 100
    df

                        c1          c2          c3          c4
    2019-01-31    0.000000    0.000000   -5.281718  -10.729686
    2019-02-28  100.000000  100.000000  100.000000  100.000000
    2019-03-31  100.000000  100.000000  100.000000  100.000000
    2019-04-30  100.000000  100.000000  100.000000  100.000000
    2019-05-31   -1.724282   -8.778584    0.422137    5.828152
    2019-06-30  -11.006192   11.447237    9.015907    5.024943
    # Cast to integers, then set every row whose c3 equals 100 to 1000
    df = df.astype(np.int32)
    df[df['c3'].isin([100])] = 1000
    df

                  c1    c2    c3    c4
    2019-01-31     0     0    -5   -10
    2019-02-28  1000  1000  1000  1000
    2019-03-31  1000  1000  1000  1000
    2019-04-30  1000  1000  1000  1000
    2019-05-31    -1    -8     0     5
    2019-06-30   -11    11     9     5
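
    The assignments above overwrite every column of the matching rows. When only one column should change, a boolean mask can be combined with a column label in loc. The sketch below assumes a small made-up frame, and also shows where(), which returns a modified copy instead of changing the frame in place.

```python
import pandas as pd

# Hypothetical frame
frame = pd.DataFrame({'c1': [1, 2, 3, 4], 'c3': [5, 15, 25, 8]})

# loc with a boolean mask and a column label changes only c1
# in the rows where c3 > 10
frame.loc[frame['c3'] > 10, 'c1'] = 100
print(frame)

# where() keeps the values that satisfy the condition and replaces
# the rest with 20, returning a new frame
capped = frame.where(frame <= 20, 20)
print(capped)
```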

    6. Reading a CSV file

    import pandas as pd
    from io import StringIO
    test_data = '''
    5.1,,1.4,0.2
    4.9,3.0,1.4,0.2
    4.7,3.2,,0.2
    7.0,3.2,4.7,1.4
    6.4,3.2,4.5,1.5
    6.9,3.1,4.9,
    ,,,
    '''

    # StringIO lets read_csv treat the string as if it were a file
    test_data = StringIO(test_data)
    df = pd.read_csv(test_data, header=None)
    df.columns = ['c1', 'c2', 'c3', 'c4']
    df

        c1   c2   c3   c4
    0  5.1  NaN  1.4  0.2
    1  4.9  3.0  1.4  0.2
    2  4.7  3.2  NaN  0.2
    3  7.0  3.2  4.7  1.4
    4  6.4  3.2  4.5  1.5
    5  6.9  3.1  4.9  NaN
    6  NaN  NaN  NaN  NaN

    7. Handling missing data

    df.isnull()

          c1     c2     c3     c4
    0  False   True  False  False
    1  False  False  False  False
    2  False  False   True  False
    3  False  False  False  False
    4  False  False  False  False
    5  False  False  False   True
    6   True   True   True   True

    # Chaining sum() after isnull() counts the missing values in each feature (column)
    print(df.isnull().sum())

    c1    1
    c2    2
    c3    2
    c4    2
    dtype: int64
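    # axis=0 drops the rows that contain NaN
    df.dropna(axis=0)

        c1   c2   c3   c4
    1  4.9  3.0  1.4  0.2
    3  7.0  3.2  4.7  1.4
    4  6.4  3.2  4.5  1.5

    # axis=1 drops the columns that contain NaN; here every column
    # has one, so only the index 0 to 6 is left
    df.dropna(axis=1)

    Empty DataFrame
    Columns: []
    Index: [0, 1, 2, 3, 4, 5, 6]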
    # Drop the rows in which every value is NaN (pass axis=1 to do the same for columns)
    df.dropna(how='all')

        c1   c2   c3   c4
    0  5.1  NaN  1.4  0.2
    1  4.9  3.0  1.4  0.2
    2  4.7  3.2  NaN  0.2
    3  7.0  3.2  4.7  1.4
    4  6.4  3.2  4.5  1.5
    5  6.9  3.1  4.9  NaN

    # Keep only the rows that have at least 4 non-NaN values
    df.dropna(thresh=4)

        c1   c2   c3   c4
    1  4.9  3.0  1.4  0.2
    3  7.0  3.2  4.7  1.4
    4  6.4  3.2  4.5  1.5
    # Drop the rows that have NaN in column c2
    df.dropna(subset=['c2'])

        c1   c2   c3   c4
    1  4.9  3.0  1.4  0.2
    2  4.7  3.2  NaN  0.2
    3  7.0  3.2  4.7  1.4
    4  6.4  3.2  4.5  1.5
    5  6.9  3.1  4.9  NaN

    # Fill NaN values
    df.fillna(value=10)

         c1    c2    c3    c4
    0   5.1  10.0   1.4   0.2
    1   4.9   3.0   1.4   0.2
    2   4.7   3.2  10.0   0.2
    3   7.0   3.2   4.7   1.4
    4   6.4   3.2   4.5   1.5
    5   6.9   3.1   4.9  10.0
    6  10.0  10.0  10.0  10.0
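
    A single constant is rarely the best fill value. As a hedged sketch on made-up data, fillna also accepts a dict with one value per column, or a Series such as the column means.

```python
from io import StringIO

import pandas as pd

# Hypothetical data with two gaps
raw = StringIO("1.0,,3.0\n4.0,5.0,\n7.0,8.0,9.0\n")
frame = pd.read_csv(raw, header=None, names=['c1', 'c2', 'c3'])

# A dict gives each column its own fill value
by_column = frame.fillna({'c2': 0.0, 'c3': -1.0})

# Passing the column means fills each gap with that column's mean
by_mean = frame.fillna(frame.mean())

print(by_column, by_mean, sep='\n\n')
```

    mean() skips NaN by default, so each mean is computed from the values that are present.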

    8. Combining data

    df1 = pd.DataFrame(np.zeros((3, 4)))
    df1

         0    1    2    3
    0  0.0  0.0  0.0  0.0
    1  0.0  0.0  0.0  0.0
    2  0.0  0.0  0.0  0.0

    df2 = pd.DataFrame(np.ones((3, 4)))
    df2

         0    1    2    3
    0  1.0  1.0  1.0  1.0
    1  1.0  1.0  1.0  1.0
    2  1.0  1.0  1.0  1.0
    # axis=0 stacks the frames vertically (the rows of df2 follow the rows of df1)
    pd.concat((df1, df2), axis=0)

         0    1    2    3
    0  0.0  0.0  0.0  0.0
    1  0.0  0.0  0.0  0.0
    2  0.0  0.0  0.0  0.0
    0  1.0  1.0  1.0  1.0
    1  1.0  1.0  1.0  1.0
    2  1.0  1.0  1.0  1.0

    # axis=1 places the frames side by side (the columns of df2 follow the columns of df1)
    pd.concat((df1, df2), axis=1)

         0    1    2    3    0    1    2    3
    0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
    1  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
    2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
    # DataFrame.append could only add rows, and it was removed in pandas 2.0;
    # pd.concat gives the same result
    pd.concat([df1, df2])

         0    1    2    3
    0  0.0  0.0  0.0  0.0
    1  0.0  0.0  0.0  0.0
    2  0.0  0.0  0.0  0.0
    0  1.0  1.0  1.0  1.0
    1  1.0  1.0  1.0  1.0
    2  1.0  1.0  1.0  1.0
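
    In the stacked results above, the row labels 0, 1, 2 appear twice. A short sketch of two ways to handle that, using smaller made-up frames with named columns: ignore_index renumbers the rows, while keys records which frame each row came from.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with the same columns
df1 = pd.DataFrame(np.zeros((2, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
print(stacked)

# keys= labels each source frame, which gives a two-level row index
labelled = pd.concat([df1, df2], keys=['zeros', 'ones'])
print(labelled.loc['ones'])
```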

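    9. Importing and exporting data

    Use df = pd.read_excel(filename) to read a file and df.to_excel(filename) to save one.

    9.1 Reading a file to import data

    Main parameters of the read functions:

    Parameter    Description
    sep          Field separator; may be a regular expression such as '\s+'
    header=None  The file has no header row (no column names)
    names        Column names to use
    index_col    Column to use as the index
    skiprows     Rows to skip
    na_values    Strings to treat as missing values
    parse_dates  Whether to parse columns as dates; a boolean or a list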
    df = pd.read_excel(filename)
    df = pd.read_csv(filename)
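
    The parameters in the table can be combined in one call. The sketch below is only an illustration: the file contents, the column names day, x, and y, and the "?" missing-value marker are all made up, and StringIO stands in for a file on disk.

```python
from io import StringIO

import pandas as pd

# Hypothetical file contents: one junk line, whitespace-separated fields,
# no header row, and "?" standing for a missing value
text = """exported by some tool
2019-01-01   1.5   ?
2019-01-02   2.5   7
"""

frame = pd.read_csv(
    StringIO(text),
    sep=r'\s+',               # regular-expression separator
    header=None,              # the data has no header row
    names=['day', 'x', 'y'],  # so column names are supplied here
    skiprows=1,               # skip the junk first line
    na_values=['?'],          # treat "?" as missing
    parse_dates=['day'],      # parse this column as dates
    index_col='day',          # and use it as the row index
)
print(frame)
```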

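    9.2 Writing a file to export data

    Main parameters of the write functions:

    Parameter     Description
    sep           Field separator
    na_rep        String written for missing values; defaults to the empty string
    header=False  Do not write the column names
    index=False   Do not write the row index
    columns       Columns to write, given as a list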
    df.to_excel(filename)
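
    As a hedged illustration of the write parameters, the sketch below uses to_csv (sep applies to delimited text output) and writes a made-up frame to an in-memory buffer instead of a real file.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Hypothetical frame with one missing value
frame = pd.DataFrame({'c1': [1.0, np.nan], 'c2': [3.0, 4.0], 'c3': [5.0, 6.0]})

buffer = StringIO()  # stands in for a file on disk
frame.to_csv(
    buffer,
    sep=';',               # field separator
    na_rep='NA',           # text written for missing values
    index=False,           # leave out the row index
    columns=['c1', 'c3'],  # write only these columns
)
csv_text = buffer.getvalue()
print(csv_text)
```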

    10. Reading JSON with pandas

    # Triple quotes let the JSON text span several lines
    strtext = '''[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},
    {"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},
    {"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},
    {"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},
    {"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]'''

    # Recent pandas versions deprecate passing a literal JSON string,
    # so the text is wrapped in StringIO (imported earlier from io)
    df = pd.read_json(StringIO(strtext), orient='records')
    df

            code      code1  code2          issue           time ttery
    0  8,4,5,2,9  297734529    NaN  20130801-3391  1013395466000   min
    1  7,8,2,1,2  298058212    NaN  20130801-3390  1013395406000   min
    2  5,9,1,2,9  298329129    NaN  20130801-3389  1013395346000   min
    3  3,8,7,3,3  298588733    NaN  20130801-3388  1013395286000   min
    4  0,8,5,2,7  298818527    NaN  20130801-3387  1013395226000   min

    df.to_excel('pandas处理json.xlsx',
                index=False,
                columns=["ttery", "issue", "code", "code1", "code2", "time"])

    10.1 The five forms of the orient parameter

    orient states the expected layout of the JSON string. It accepts the following five values:

    1. 'split': dict like {index -> [index], columns -> [columns], data -> [values]}

    This layout consists of an index, the column names, and a data matrix. The keys must be named index, columns, and data.

    s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}'
    df = pd.read_json(StringIO(s), orient='split')
    df

       a  b
    1  1  3
    2  2  8
    3  3  9

    2. 'records': list like [{column -> value}, … , {column -> value}]

    This is a list whose members are dictionaries, as in the JSON sample processed earlier. The column names are the keys, the cell values are the values, and each dictionary becomes one row of the DataFrame.

    strtext = '''[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},
    {"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000}]'''

    df = pd.read_json(StringIO(strtext), orient='records')
    df

            code      code1  code2          issue           time ttery
    0  8,4,5,2,9  297734529    NaN  20130801-3391  1013395466000   min
    1  7,8,2,1,2  298058212    NaN  20130801-3390  1013395406000   min

    3. 'index': dict like {index -> {column -> value}}

    The index labels are the keys, and each value is a dictionary keyed by column name. For example:

    s = '{"0":{"a":1,"b":2},"1":{"a":9,"b":11}}'
    df = pd.read_json(StringIO(s), orient='index')
    df

       a   b
    0  1   2
    1  9  11

    4. 'columns': dict like {column -> {index -> value}}

    Here the column names are the keys, and each one maps to a dictionary that is keyed by index label and holds the cell values. For example:

    s = '{"a":{"0":1,"1":9},"b":{"0":2,"1":11}}'
    df = pd.read_json(StringIO(s), orient='columns')
    df

       a   b
    0  1   2
    1  9  11

    5. 'values': just the values array.

    The values layout is the most familiar one: a nested list whose members are themselves lists, two levels deep.

    s = '[["a",1],["b",2]]'
    df = pd.read_json(StringIO(s), orient='values')
    df

       0  1
    0  a  1
    1  b  2
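
    The five layouts are easiest to compare by writing one frame out in each of them. A minimal sketch, assuming a made-up two-row frame; to_json takes the same orient values as read_json.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row frame
frame = pd.DataFrame({'a': [1, 9], 'b': [2, 11]})

# The same frame serialised in each of the five layouts
for orient in ['split', 'records', 'index', 'columns', 'values']:
    print(orient, frame.to_json(orient=orient))

# Reading a string back with the matching orient restores the data
as_split = frame.to_json(orient='split')
restored = pd.read_json(StringIO(as_split), orient='split')
print(restored)
```

    Reading with an orient that does not match the layout of the text either fails or produces a differently shaped frame, so the two sides need to agree.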

    11. Running SQL queries with pandas

    import pandas as pd
    import pymysql


    def query(sql):
        # Connect to the MySQL database
        connection = pymysql.connect(
            host="localhost",
            port=3306,
            user="root",
            passwd="123",
            db="db1",
        )
        try:
            # Run the query and return the result as a DataFrame
            return pd.read_sql(sql, con=connection)
        except Exception as e:
            # Report the failure; the function then returns None
            print("SQL is not correct!", e)
        finally:
            # Always release the connection
            connection.close()


    sql = "select * from test1 limit 0, 10"  # SQL statement
    data = query(sql)
    if data is not None:
        print(data.columns.tolist())  # column names
        print(data)  # rows
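
    The example above needs a running MySQL server. For trying out read_sql without one, the sketch below swaps in Python's built-in sqlite3 module with an in-memory database; the table contents are made up, and this is a substitute for illustration, not the article's setup.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the MySQL server;
# the table name and rows are made up for illustration
connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE test1 (id INTEGER, name TEXT)')
connection.executemany(
    'INSERT INTO test1 VALUES (?, ?)',
    [(1, 'a'), (2, 'b'), (3, 'c')],
)
connection.commit()

try:
    # params= passes values separately from the SQL text
    data = pd.read_sql(
        'SELECT * FROM test1 WHERE id >= ? ORDER BY id LIMIT 10',
        con=connection,
        params=(2,),
    )
finally:
    connection.close()

print(data.columns.tolist())
print(data)
```

    Passing values through params keeps them out of the SQL string, which avoids quoting mistakes when the values come from user input.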
  • Original article: https://www.cnblogs.com/bubu99/p/13559742.html