  • Python Data Analysis: Learning pandas

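    1.1 Introduction to data structures

      Reference blog: http://www.cnblogs.com/nxld/p/6058591.html

      1. Introduction to pandas

          1. pandas has two core data structures: the Series and the DataFrame.
          2. A Series is similar to a one-dimensional NumPy array. It works with the functions and methods available for one-dimensional arrays, and in addition its values can be retrieved by index label and its index is aligned automatically in arithmetic.
          3. A DataFrame is similar to a two-dimensional NumPy array. It likewise works with NumPy array functions and methods, and offers further flexibility that is covered later.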
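To make the three points above concrete, here is a minimal sketch. The variable names and values are illustrative assumptions, not taken from the original post. It applies NumPy functions to a Series and a DataFrame and shows that the labels are preserved.

```python
import numpy as np
import pandas as pd

# Illustrative data; names and values are assumptions for this sketch.
s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])
roots = np.sqrt(s)             # a NumPy ufunc accepts a Series and keeps its index
by_label = roots['b']          # access by index label
by_position = roots.iloc[1]    # access by integer position

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
col_sums = np.sum(df, axis=0)  # column sums, labelled by column name
print(roots)
print(col_sums)
```

Label access and positional access return the same element here because 'b' is the second label.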

      2. Three ways to create a Series

        1. Creating a Series from a one-dimensional array

    import numpy as np, pandas as pd
    arr1 = np.arange(10)
    print(arr1, type(arr1))   # [0 1 2 3 4 5 6 7 8 9] <class 'numpy.ndarray'>
    
    s1 = pd.Series(arr1)
    print(s1, type(s1))
    # 0    0
    # 1    1
    # 2    2
    # 3    3
    # 4    4
    # 5    5
    # 6    6
    # 7    7
    # 8    8
    # 9    9
    # dtype: int64 <class 'pandas.core.series.Series'>
    Creating a Series from a one-dimensional array

        2. Creating a Series from a dictionary

    import numpy as np, pandas as pd
    dic1 = {'a':10,'b':20,'c':30,'d':40,'e':50}
    s2 = pd.Series(dic1)
    print(s2, type(s2))
    # a    10
    # b    20
    # c    30
    # d    40
    # e    50
    # dtype: int64 <class 'pandas.core.series.Series'>
    Creating a Series from a dictionary

        3. Creating a Series from a row or a column of a DataFrame
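The original post gives no example for this third way. A minimal sketch, using a small made-up DataFrame (the data and variable names are assumptions), could look like this:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

col = df['a']     # selecting one column returns a Series indexed by the row labels
row = df.loc[1]   # selecting one row returns a Series indexed by the column names

print(type(col), type(row))
print(col)
print(row)
```

Note that the column Series keeps the row index (0, 1, 2), while the row Series is indexed by the column names ('a', 'b').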

      3. Three ways to create a DataFrame

         1. Creating a DataFrame from a two-dimensional array

    import numpy as np, pandas as pd
    arr2 = np.arange(12).reshape(4, 3)
    print(arr2, type(arr2))
    # [[ 0  1  2]
    #  [ 3  4  5]
    #  [ 6  7  8]
    #  [ 9 10 11]]
    
    df1 = pd.DataFrame(arr2)
    print(df1, type(df1))
    #    0   1   2
    # 0  0   1   2
    # 1  3   4   5
    # 2  6   7   8
    # 3  9  10  11
    Creating a DataFrame from a two-dimensional array

        2.1 Creating a DataFrame from a dictionary

    import numpy as np, pandas as pd
    dic2 = {'a':[1,2,3,4],
            'b':[5,6,7,8],
            'c':[9,10,11,12],
            'd':[13,14,15,16]
            }
    df2 = pd.DataFrame(dic2)
    print(df2)
    
    #    a  b   c   d
    # 0  1  5   9  13
    # 1  2  6  10  14
    # 2  3  7  11  15
    # 3  4  8  12  16
    Method 1: building a DataFrame from a dictionary of lists
    import numpy as np, pandas as pd
    dic3 = {'one':{'a':1,'b':2,'c':3,'d':4},
            'two':{'a':5,'b':6,'c':7,'d':8},
            'three':{'a':9,'b':10,'c':11,'d':12}
            }
    df3 = pd.DataFrame(dic3)
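    print(df3, type(df3))
    
    # (recent pandas versions keep the dictionary's insertion order for the columns)
    #    one  two  three
    # a    1    5      9
    # b    2    6     10
    # c    3    7     11
    # d    4    8     12
    Method 2: building a DataFrame from a nested dictionary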
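In Method 2 the outer keys become columns and the inner keys become row labels. If the outer keys should be rows instead, `pd.DataFrame.from_dict` with `orient='index'` is one option. This is a sketch with a shortened version of `dic3`, not part of the original post:

```python
import pandas as pd

# Shortened, illustrative version of the nested dictionary.
dic3 = {'one': {'a': 1, 'b': 2}, 'two': {'a': 5, 'b': 6}}

by_col = pd.DataFrame(dic3)                            # outer keys -> columns
by_row = pd.DataFrame.from_dict(dic3, orient='index')  # outer keys -> rows

print(by_col)
print(by_row)
```

For this data the second frame holds the same values as the transpose of the first.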
    # -*- coding: utf-8 -*-
    import json
    import pandas as pd
    
    d = {
        "slagroupcount": [
            {
            "g_sla": 99.943755250038564,
            "weight": 20.0,
            "g_t_v": 19.988751050007714,
            "sla_nums": 14,
            "id": 1,
            "name": "大数据"
        },
            {
            "g_sla": 99.994763756058816,
            "weight": 20.0,
            "g_t_v": 19.998952751211764,
            "sla_nums": 6,
            "id": 2,
            "name": "基础架构"
        },
        ],
        "slacount": 99.611111411465515
    }
    
    result = {}
    gcounts = []
    subs = []
    
    for i in range(10):
        day_of_result = d
        gcounts.append(float(day_of_result['slacount']))  # "slacount": 99.611111411465515
        subs += day_of_result['slagroupcount']  # slagroupcount is a list of dicts
    result['slacount'] = sum(gcounts) / len(gcounts)
    print(subs)
    df = pd.DataFrame(subs)  # subs = [{...}, {...}, ...]
    # print(df)
    
    g = df.groupby('name').mean()  # group the rows by name and average each column
    print(g)
    '''  # Printed value of g (grouped by name, mean of each column)
               g_sla      g_t_v      id     sla_nums  weight
    name                                            
    基础架构  99.994764   19.998953   2         6     20.0
    大数据    99.943755   19.988751   1        14     20.0
    '''
    Example: grouping a DataFrame with df.groupby
    # -*- coding: utf-8 -*-
    import json
    import pandas as pd
    '''Step 1: the dict d stands for one record from the result field of the GroupCountResult table'''
    d = {
        "slagroupcount": [
            {
            "g_sla": 99.943755250038564,
            "weight": 20.0,
            "g_t_v": 19.988751050007714,
            "sla_nums": 14,
            "id": 1,
            "name": "大数据"
        },
            {
            "g_sla": 99.994763756058816,
            "weight": 20.0,
            "g_t_v": 19.998952751211764,
            "sla_nums": 6,
            "id": 2,
            "name": "基础架构"
        },
        ],
        "slacount": 99.611111411465515
    }
    
    
    '''Step 2: simulate computing the average SLA over the last 10 days; the for loop below fakes fetching 10 records from the GroupCountResult table, which are then averaged'''
    result = {}
    gcounts = []
    subs = []
    for i in range(10):
        day_of_result = d
        gcounts.append(float(day_of_result['slacount']))  # "slacount": 99.611111411465515
        subs += day_of_result['slagroupcount']  # slagroupcount is a list of dicts
    df = pd.DataFrame(subs)  # subs = [{...}, {...}, ...]
    g = df.groupby('name').mean()  # group the rows by name and average each column
    print(g)
    '''  # Printed value of g (grouped by name, mean of each column)
               g_sla      g_t_v      id     sla_nums  weight
    name                                            
    基础架构  99.994764   19.998953   2         6     20.0
    大数据    99.943755   19.988751   1        14     20.0
    '''
    
    
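    '''Step 3: loop over the pandas result and copy it into a dict'''
    result = {}
    result['slagroupcount'] = []
    for index, row in g.iterrows():
        # index is the group label (name); columns are read by key to avoid
        # clashing with Series attributes
        result['slagroupcount'].append({'name': index,
                                        'id': int(row['id']),
                                        'weight': row['weight'],
                                        'sla_nums': row['sla_nums'],
                                        'g_sla': row['g_sla'],
                                        'g_t_v': row['g_t_v']})
    print(result['slagroupcount'])
    '''  # This d is the dict built from the averages of the 10 records above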
    d = {
        "slagroupcount": [
            {
            "g_sla": 99.943755250038564,
            "weight": 20.0,
            "g_t_v": 19.988751050007714,
            "sla_nums": 14,
            "id": 1,
            "name": "大数据"
        },
            {
            "g_sla": 99.994763756058816,
            "weight": 20.0,
            "g_t_v": 19.998952751211764,
            "sla_nums": 6,
            "id": 2,
            "name": "基础架构"
        },
        ],
        "slacount": 99.611111411465515
    }
    '''
    Example 2: build a DataFrame from dicts, compute grouped means, then store the result in a new dict
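The explicit loop over `g.iterrows()` can also be replaced by `reset_index()` followed by `to_dict(orient='records')`. The sketch below is an alternative to the author's loop, not the author's method, and the sample records are invented for illustration.

```python
import pandas as pd

# Invented sample records; only the column names mirror the example above.
records = [
    {'name': 'A', 'id': 1, 'g_sla': 99.0},
    {'name': 'A', 'id': 1, 'g_sla': 97.0},
    {'name': 'B', 'id': 2, 'g_sla': 90.0},
]
df = pd.DataFrame(records)

g = df.groupby('name').mean()   # 'name' becomes the index of g
# Turn the index back into a column, then emit one dict per row.
rows = g.reset_index().to_dict(orient='records')
print(rows)
```

Because the values are means, integer columns such as `id` come back as floats; cast them with `int()` if integer output is required.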

        2.2 Grouped aggregation on a DataFrame

    # -*- coding: utf-8 -*-
    import json
    import pandas as pd
    
    li = [
    {'name': 'Hospital01', 'abbreviation': 'sdhospital','domain': '', 'service': 'mongodb', 'sla': 97.07472},
    {'name': 'Hospital01', 'abbreviation': 'sdhospital','domain': '', 'service': 'redmine', 'sla': 93.07472},
    {'name': 'Hospital01', 'abbreviation': 'sdhospital','domain': '', 'service': 'mongodb', 'sla': 95.07472},
    {'name': 'Hospital01', 'abbreviation': 'sdhospital','domain': '', 'service': 'redmine', 'sla': 98.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc','domain': '', 'service': 'redmine', 'sla': 87.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc','domain': '', 'service': 'mongodb', 'sla': 73.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc','domain': '', 'service': 'redmine', 'sla': 55.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc','domain': '', 'service': 'mongodb', 'sla': 78.07472},
    ]
    
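    # Step 1: convert the list of dicts to a DataFrame
    df = pd.DataFrame(li)
    
    # Step 2: group by name and average the numeric columns
    # (numeric_only=True skips the string columns, which pandas >= 2.0 would otherwise reject)
    g = df.groupby('name').mean(numeric_only=True)
    # print(g)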
    '''
                     sla
    name                
    Hospital01  95.82472
    Hospital02  73.32472
    '''
    
    # Step 3: convert the grouped result from step 2 to a dict
    print(g.to_dict())
    '''
    {
      "sla": {
        "Hospital01": 95.82472, 
        "Hospital02": 73.32472
      }
    }
    '''
    Example 1: grouped mean by a single key
    # -*- coding: utf-8 -*-
    import json
    import pandas as pd
    
    li = [
    {'name': 'Hospital01', 'abbreviation': 'sdhospital','domain': '', 'service': 'mongodb', 'sla': 97.07472},
    {'name': 'Hospital01', 'abbreviation': 'sdhospital','domain': '', 'service': 'redmine', 'sla': 93.07472},
    {'name': 'Hospital01', 'abbreviation': 'sdhospital','domain': '', 'service': 'mongodb', 'sla': 95.07472},
    {'name': 'Hospital01', 'abbreviation': 'sdhospital','domain': '', 'service': 'redmine', 'sla': 98.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc','domain': '', 'service': 'redmine', 'sla': 87.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc','domain': '', 'service': 'mongodb', 'sla': 73.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc','domain': '', 'service': 'redmine', 'sla': 55.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc','domain': '', 'service': 'mongodb', 'sla': 78.07472},
    ]
    
    # Step 1: convert the list of dicts to a DataFrame
    df = pd.DataFrame(li)
    
    # Step 2: group by service, name and abbreviation together
    # (numeric_only=True skips the string column 'domain')
    service_name_group = df.groupby(['service', 'name', 'abbreviation']).mean(numeric_only=True)
    # print(service_name_group)
    '''
    service name       abbreviation          
    mongodb Hospital01 sdhospital    96.07472
            Hospital02 sysucc        75.57472
    redmine Hospital01 sdhospital    95.57472
            Hospital02 sysucc        71.07472
    '''
    
    # Step 3: convert the grouped result to a dict
    # print(service_name_group.to_dict())
    '''
    {
        'sla': {
            ('redmine', 'Hospital01', 'sdhospital'): 95.57472,
            ('redmine', 'Hospital02', 'sysucc'): 71.07472,
            ('mongodb', 'Hospital02', 'sysucc'): 75.57472,
            ('mongodb', 'Hospital01', 'sdhospital'): 96.07472
        }
    }
    '''
    
    # Step 4: reshape that dict into the structure we want
    context = {}
    for k, v in service_name_group.to_dict()['sla'].items():
        context.setdefault(k[0], [])  # {'mongodb': [], 'redmine': []}
        context[k[0]].append({'name': k[1], 'sla': v, 'abbreviation': k[2]})
    ''' The (k, v) pairs seen by the for loop
    ('redmine', 'Hospital01', 'sdhospital') 95.57472
    ('redmine', 'Hospital02', 'sysucc') 71.07472
    ('mongodb', 'Hospital02', 'sysucc') 75.57472
    ('mongodb', 'Hospital01', 'sdhospital') 96.07472
    '''
    # print(context)
    # d below is the result we ultimately want
    d = {
      "mongodb": [
        {
          "abbreviation": "sysucc",
          "name": "Hospital02",
          "sla": 75.57472
        },
        {
          "abbreviation": "sdhospital",
          "name": "Hospital01",
          "sla": 96.07472
        }
      ],
      "redmine": [
        {
          "abbreviation": "sdhospital",
          "name": "Hospital01",
          "sla": 95.57472
        },
        {
          "abbreviation": "sysucc",
          "name": "Hospital02",
          "sla": 71.07472
        }
      ]
    }
    Example 2: grouped mean by several keys at once
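The reshaping in step 4 can also be done by grouping a second time instead of unpacking tuple keys. This is a sketch under assumed sample data (the rows below are invented), not the author's approach:

```python
import pandas as pd

# Invented rows; only the column names follow the example above.
rows = [
    {'service': 'mongodb', 'name': 'H1', 'abbreviation': 'h1', 'sla': 90.0},
    {'service': 'mongodb', 'name': 'H1', 'abbreviation': 'h1', 'sla': 100.0},
    {'service': 'redmine', 'name': 'H1', 'abbreviation': 'h1', 'sla': 80.0},
]
df = pd.DataFrame(rows)

# as_index=False keeps the grouping keys as ordinary columns.
means = df.groupby(['service', 'name', 'abbreviation'], as_index=False)['sla'].mean()

# One list of records per service.
context = {
    service: part[['name', 'abbreviation', 'sla']].to_dict(orient='records')
    for service, part in means.groupby('service')
}
print(context)
```

Keeping the keys as columns avoids indexing into tuples such as `k[0]`, `k[1]`, `k[2]`.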
    # -*- coding: utf-8 -*-
    import json
    import pandas as pd
    
    li = [
        {'name':'zhangsan','times':'first','math':88,'chinese':82},
        {'name':'zhangsan','times':'second','math':84,'chinese':83},
        {'name':'zhangsan','times':'third','math':85,'chinese':87},
        {'name': 'lisi', 'times': 'first', 'math': 88, 'chinese': 82},
        {'name': 'lisi', 'times': 'second', 'math': 84, 'chinese': 83},
        {'name': 'lisi', 'times': 'third', 'math': 85, 'chinese': 87},
    ]
    
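    # Step 1: convert the list of dicts to a DataFrame
    df = pd.DataFrame(li)  # li = [{...}, {...}, ...]
    
    # Step 2: group by name and average the numeric columns
    # (numeric_only=True skips the string column 'times')
    g = df.groupby('name').mean(numeric_only=True)
    # print(g)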
    '''
              chinese       math
    name                        
    lisi         84.0  85.666667
    zhangsan     84.0  85.666667
    '''
    
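    # Step 3: loop over the pandas result and copy it into a list of dicts
    result = []
    for index, row in g.iterrows():
        result.append({'name': index,               # index is the group label
                        'math': int(row['math']),   # int() truncates 85.67 to 85
                        'chinese': row['chinese'],
                        })
    # print(result)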
    ret_li = [
      {
        "chinese": 84,
        "name": "lisi",
        "math": 85
      },
      {
        "chinese": 84,
        "name": "zhangsan",
        "math": 85
      }
    ]
    Example 3: grouped mean of several columns (average Chinese and math scores of zhangsan and lisi over three exams)

        2.3 Filtering a DataFrame

    # -*- coding: utf-8 -*-
    import json
    import pandas as pd
    
    li = [
        {'name':'zhangsan','times':'first','math':88,'chinese':82},
        {'name':'zhangsan','times':'second','math':84,'chinese':83},
        {'name':'zhangsan','times':'third','math':85,'chinese':87},
        {'name': 'lisi', 'times': 'first', 'math': 88, 'chinese': 82},
        {'name': 'lisi', 'times': 'second', 'math': 84, 'chinese': 83},
        {'name': 'lisi', 'times': 'third', 'math': 85, 'chinese': 87},
    ]
    
    # Step 1: convert the list of dicts to a DataFrame
    df = pd.DataFrame(li)  # li = [{...}, {...}, ...]
    
    # Step 2: select zhangsan's result for the first exam
    result = df[(df['name'] == 'zhangsan') & (df['times']=='first')]
    # result = df[(df['name'] == 'zhangsan') | (df['times']=='first')]  # rows where name == 'zhangsan' or times == 'first'
    
    # Step 3: copy the filtered rows from step 2 into a list of dicts
    li = []
    for index, row in result.iterrows():
        li.append({
            'name': row['name'],
            'exam': row['times'],
            'math_score': int(row['math']),       # int() guarantees a plain Python int for json
            'chinese_score': int(row['chinese'])
        })
    print(json.dumps(li))
    
    '''
    [{
        "name": "zhangsan",
        "exam": "first",
        "math_score": 88,
        "chinese_score": 82
    }]
    '''
    Example 1: filtering a DataFrame by condition
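Boolean masks are not the only way to filter. As a sketch (the data is a shortened copy of the list above, and the variable names are assumptions), the same selection can be written with `DataFrame.query`, and membership tests with `Series.isin`:

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'zhangsan', 'times': 'first', 'math': 88, 'chinese': 82},
    {'name': 'zhangsan', 'times': 'second', 'math': 84, 'chinese': 83},
    {'name': 'lisi', 'times': 'first', 'math': 88, 'chinese': 82},
])

# Same condition as the boolean mask with &, written as an expression string.
both = df.query("name == 'zhangsan' and times == 'first'")

# isin tests membership in a list; combine it with other masks as usual.
high = df[df['name'].isin(['zhangsan', 'lisi']) & (df['math'] > 85)]

records = both.to_dict(orient='records')
print(records)
print(high)
```

`query` tends to read more easily for long conditions, while masks are easier to build programmatically.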

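    1.2 Data index

      1. Getting data by position or by index label

    import numpy as np, pandas as pd
    
    # 1. Create a Series from an array
    s4 = pd.Series(np.array([1,2,3,4]))
    print(s4)
    # 0    1
    # 1    2
    # 2    3
    # 3    4
    
    # 2. Assign custom index labels to the Series
    s4.index = ['a','b','c','d']
    print(s4)
    # a    1
    # b    2
    # c    3
    # d    4
    
    # 3. The value can be fetched by position (iloc) or by label
    print(s4.iloc[3], s4['d'])  # 4 4
    Getting data by position or by index label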

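      2. Automatic alignment

    #-*- coding:utf8 -*-
    import numpy as np, pandas as pd
    
    s5 = pd.Series(np.array([10,15,20,30]), index = ['a','b','c','d'])
    s6 = pd.Series(np.array([12,11,13,15]), index = ['a','c','g','b'])
    print(s5 + s6)
    # a    22.0
    # b    30.0
    # c    31.0
    # d     NaN
    # g     NaN
    
    # Note: 'd' in s5 and 'g' in s6 have no matching index label in the other Series, so the operation yields two missing values (NaN).
    # The arithmetic first aligns the two Series on their index labels, rather than simply combining values by position.
    # For DataFrames, alignment applies to the row index and also to the column index (variable names).
    Automatic alignment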
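Two follow-ups to the alignment example, as a hedged sketch: `Series.add` with `fill_value` treats a label missing on one side as 0 instead of producing NaN, and adding two DataFrames aligns both rows and columns. The DataFrames `df_a` and `df_b` are made up for illustration.

```python
import pandas as pd

s5 = pd.Series([10, 15, 20, 30], index=['a', 'b', 'c', 'd'])
s6 = pd.Series([12, 11, 13, 15], index=['a', 'c', 'g', 'b'])

# Labels present in only one Series are paired with fill_value.
total = s5.add(s6, fill_value=0)
print(total)

# Hypothetical frames that overlap only at row 'r2', column 'y'.
df_a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r1', 'r2'])
df_b = pd.DataFrame({'y': [10, 20], 'z': [30, 40]}, index=['r2', 'r3'])

# The result covers the union of row and column labels; cells without a
# counterpart in both frames are NaN.
df_sum = df_a + df_b
print(df_sum)
```

Only the cell at row 'r2', column 'y' exists in both frames, so it is the only non-missing value in `df_sum`.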

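    1.3 Statistical analysis

    #-*- coding:utf8 -*-
    import numpy as np, pandas as pd
    
    np.random.seed(1234)
    d1 = pd.Series(2*np.random.normal(size = 100)+3)   # a Series of 100 values drawn from a normal distribution with mean 3 and standard deviation 2
    
    d1.count() # number of non-null elements
    d1.min() # minimum
    d1.max() # maximum
    d1.idxmin() # index label of the minimum, similar to which.min in R
    d1.idxmax() # index label of the maximum, similar to which.max in R
    d1.quantile(0.1) # 10% quantile
    d1.sum() # sum
    d1.mean() # mean
    d1.median() # median
    d1.mode() # mode
    d1.var() # variance
    d1.std() # standard deviation
    (d1 - d1.mean()).abs().mean() # mean absolute deviation (Series.mad() was removed in pandas 2.0)
    d1.skew() # skewness
    d1.kurt() # kurtosis
    d1.describe() # several descriptive statistics at once
    Basic use of the statistical methods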
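As a small check of what these methods return, the sketch below uses a short hand-picked Series (an assumption for illustration, so the numbers are easy to verify by hand). It also computes the mean absolute deviation manually, which is useful on pandas 2.0 and later, where `Series.mad()` is no longer available.

```python
import pandas as pd

# Hand-picked values so each statistic can be checked by hand.
d = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])

summary = d.describe()   # Series with count, mean, std, min, 25%, 50%, 75%, max
# Mean absolute deviation: average distance of each value from the mean.
mean_abs_dev = (d - d.mean()).abs().mean()

print(summary)
print(mean_abs_dev)
```

The entries of `summary` are accessed by label, for example `summary['50%']` for the median.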
  • Original article: https://www.cnblogs.com/jiaxinzhu/p/12596099.html