  • Basic Data Statistics in Python (1): Convenient Data Acquisition, Data Preparation and Cleaning, Data Display

    1. Convenient data acquisition

      1.1 Local data: opening, reading, writing, and closing files (covered in a separate chapter)

      1.2 Network data:

        1.2.1 urllib, urllib2, httplib, httplib2 (in Python 3: urllib.request and http.client)

          Regular expressions (covered in a separate chapter)
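
As a minimal sketch of how these two pieces fit together in Python 3, the following downloads a page with urllib.request and pulls values out of it with a regular expression. The helper names (fetch_text, extract_prices), the sample HTML, and the `<td class="price">` markup are all hypothetical and only illustrate the pattern. The download helper is defined but not called, since it needs network access.

```python
import re
from urllib.request import urlopen  # Python 3 name; Python 2 code used urllib2.urlopen


def fetch_text(url, encoding='utf-8'):
    """Download a page and decode it to text (hypothetical helper, needs network access)."""
    with urlopen(url) as response:
        return response.read().decode(encoding)


def extract_prices(html):
    """Pull the numbers out of hypothetical <td class="price">...</td> cells."""
    return [float(m) for m in re.findall(r'<td class="price">([\d.]+)</td>', html)]


# Made-up HTML fragment standing in for a downloaded page.
sample_html = '<tr><td class="price">60.37</td><td class="price">61.84</td></tr>'
prices = extract_prices(sample_html)
print(prices)
```

For real pages, an HTML parser is usually more robust than a regular expression; the regex here only mirrors the tools named above.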

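        1.2.2 Fetching Yahoo Finance data through the matplotlib.finance module (removed in matplotlib 2.2; the session below was recorded with an older release)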

    In [7]: from matplotlib.finance import quotes_historical_yahoo_ochl
    
    In [8]: from datetime import date
    
    In [9]: from datetime import datetime
    
    In [10]: import pandas as pd
    
    In [11]: today = date.today()
    
    In [12]: start = (today.year-1, today.month, today.day)
    
    In [14]: quotes = quotes_historical_yahoo_ochl('AXP', start, today)  # fetch one year of AXP quotes
    
    In [15]: fields = ['date', 'open', 'close', 'high', 'low', 'volume']
    
    In [16]: list1 = []
    
    In [18]: for i in range(0,len(quotes)):
        ...:     x = date.fromordinal(int(quotes[i][0]))  # first column of each row: convert the ordinal to a date object
        ...:     y = x.strftime('%Y-%m-%d')  # format the date as a YYYY-MM-DD string
        ...:     list1.append(y)  # collect the formatted dates in a list
        ...:     
    
    In [19]: quotesdf = pd.DataFrame(quotes,index=list1,columns=fields)  # use the dates as the index and fields as the column names
    
    In [20]: quotesdf = quotesdf.drop(['date'],axis=1)  # drop the now-redundant date column
    
    In [21]: print(quotesdf)
                     open      close       high        low      volume
    2016-01-20  60.374146  61.835916  62.336256  60.128882   9043800.0
    2016-01-21  61.806486  61.453305  63.101479  61.325767   8992300.0
    2016-01-22  57.283819  54.016907  57.774347  53.114334  43783400.0
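
The same frame can be built without the explicit loop. The sketch below is not the original session: because the Yahoo download needs network access, it starts from three rows copied from this post's output (with the ordinal dates shown in section 2.1) and builds the index with a list comprehension.

```python
from datetime import date

import pandas as pd

# Three rows copied from the output in this post:
# (ordinal date, open, close, high, low, volume)
quotes = [
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
    (735985.0, 57.283819, 54.016907, 57.774347, 53.114334, 43783400.0),
]
fields = ['date', 'open', 'close', 'high', 'low', 'volume']

# Convert each ordinal to a 'YYYY-MM-DD' string in one pass.
index = [date.fromordinal(int(row[0])).strftime('%Y-%m-%d') for row in quotes]

# Use the date strings as the index, then drop the redundant ordinal column.
quotesdf = pd.DataFrame(quotes, index=index, columns=fields).drop(['date'], axis=1)
print(quotesdf)
```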

        1.2.3 Fetching corpora and other data through the Natural Language Toolkit (NLTK)

          1. Install nltk: pip install nltk

          2. Download a corpus:

    In [1]: import nltk
    
    In [2]: nltk.download()
    NLTK Downloader
    ---------------------------------------------------------------------------
        d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
    ---------------------------------------------------------------------------
    Downloader> d
    
    Download which package (l=list; x=cancel)?
      Identifier> gutenberg
        Downloading package gutenberg to /root/nltk_data...
          Package gutenberg is already up-to-date!

          3. Load the data:

    In [3]: from nltk.corpus import gutenberg
    
    In [4]: print(gutenberg.fileids())
    [u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']
    
    In [5]: texts = gutenberg.words('shakespeare-hamlet.txt')
    
    In [6]: texts
    Out[6]: [u'[', u'The', u'Tragedie', u'of', u'Hamlet', u'by', ...]
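
The value returned by gutenberg.words() can be iterated like a sequence of strings, so standard-library tools apply to it directly. As a small sketch that runs without the corpus installed, the example below counts words in the six tokens shown in Out[6]; word_counts is a hypothetical helper, not part of NLTK.

```python
from collections import Counter

# Sample tokens copied from Out[6] above. With NLTK and the corpus installed,
# gutenberg.words('shakespeare-hamlet.txt') could be passed instead.
tokens = ['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by']


def word_counts(tokens):
    """Count alphabetic tokens case-insensitively, skipping punctuation."""
    return Counter(t.lower() for t in tokens if t.isalpha())


counts = word_counts(tokens)
print(counts.most_common(3))
```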

    2. Data preparation and cleaning

      2.1 Adding [column] names to the quotes data

    In [79]: quotesdf = pd.DataFrame(quotes)
    
    In [80]: quotesdf
    Out[80]: 
                0          1          2          3          4           5
    0    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
    1    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
    2    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
    3    735988.0  53.428272  53.977664  54.713455  53.114334  18498300.0
    
    [253 rows x 6 columns]
    
    In [81]: fields = ['date','open','close','high','low','volume']
    
    In [82]: quotesdf = pd.DataFrame(quotes,columns=fields)  # set the column names
    
    In [83]: quotesdf
    Out[83]: 
             date       open      close       high        low      volume
    0    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
    1    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
    2    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
    3    735988.0  53.428272  53.977664  54.713455  53.114334  18498300.0
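
Column names can also be changed on an existing frame instead of rebuilding it. A minimal sketch, using two rows copied from the output above; the new name 'vol' is only an illustration:

```python
import pandas as pd

rows = [
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
]
df = pd.DataFrame(rows)  # columns default to 0..5

# Assigning to .columns relabels the existing frame in place.
df.columns = ['date', 'open', 'close', 'high', 'low', 'volume']

# rename() returns a new frame and leaves df untouched.
renamed = df.rename(columns={'volume': 'vol'})

print(list(df.columns))
print(list(renamed.columns))
```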

      2.2 Adding [index] labels to the quotes data

    In [84]: quotesdf
    Out[84]: 
             date       open      close       high        low      volume
    0    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
    1    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
    2    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
    
    [253 rows x 6 columns]
    
    In [85]: quotesdf = pd.DataFrame(quotes, index=range(1,len(quotes)+1),columns=fields)  # change the index labels from 0,1,2... to 1,2,3...
    
    In [86]: quotesdf
    Out[86]: 
             date       open      close       high        low      volume
    1    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
    2    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
    3    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
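
Once the labels start at 1, label-based and position-based access no longer agree, which is worth keeping in mind. A short sketch with three opening prices from the output above:

```python
import pandas as pd

df = pd.DataFrame({'open': [60.374146, 61.806486, 57.283819]})

# Relabel the existing frame: the index becomes 1, 2, 3.
df.index = range(1, len(df) + 1)

print(df.loc[1, 'open'])    # label-based lookup: the row labelled 1
print(df.iloc[0]['open'])   # position-based lookup: the same first row
```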

      2.3 Date conversion: Gregorian ordinal (day count) => ordinary date notation

    In [88]: from datetime import date
    
    In [89]: firstday = date.fromordinal(735190)
    
    In [93]: firstday
    Out[93]: datetime.date(2013, 11, 18)
    
    In [95]: firstday = firstday.strftime('%Y-%m-%d')
    
    In [96]: firstday
    Out[96]: '2013-11-18'
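
The ordinal is simply a day count in the proleptic Gregorian calendar, with day 1 being January 1 of year 1, so the conversion can be reversed with toordinal(). A small standard-library sketch:

```python
from datetime import date

d = date.fromordinal(735190)   # ordinal -> date object
print(d.isoformat())           # ISO 8601 string, same layout as '%Y-%m-%d'
print(d.toordinal())           # date -> ordinal, the inverse conversion
print(date.fromordinal(1))     # the first day of the calendar
```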

      2.4 Creating a time series:

    In [120]: import pandas as pd
    
    In [121]: dates = pd.date_range('20170101', periods=7)  # build a date sequence from a start date and a length
    
    In [122]: dates
    Out[122]: 
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04','2017-01-05', '2017-01-06', '2017-01-07'],dtype='datetime64[ns]', freq='D')
    
    In [123]: import numpy as np
    
    In [124]: dates = pd.DataFrame(np.random.randn(7,3), index=dates, columns=list('ABC'))  # dates as the index, A/B/C as column names, 7x3 random numbers as the data
    
    In [125]: dates
    Out[125]: 
                       A         B         C
    2017-01-01  0.705927  0.311453  1.455362
    2017-01-02 -0.331531 -0.358449  0.175375
    2017-01-03 -0.284583 -1.760700 -0.582880
    2017-01-04 -0.759392 -2.080658 -2.015328
    2017-01-05 -0.517370  0.906072 -0.106568
    2017-01-06 -0.252802 -2.135604 -0.692153
    2017-01-07 -0.275184  0.142973 -1.262126
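
One practical benefit of a DatetimeIndex is that rows can be selected by date strings. The sketch below rebuilds a frame like the one above (with a fixed random seed, which is an assumption added for repeatability) and slices three days out of it:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=7)
rng = np.random.RandomState(0)  # fixed seed so repeated runs give the same numbers
df = pd.DataFrame(rng.randn(7, 3), index=dates, columns=list('ABC'))

# Label slices on a DatetimeIndex include both endpoints.
subset = df.loc['2017-01-03':'2017-01-05']
print(subset)
```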

      2.5 Exercises

    In [101]: datetime.now()  # show the current date and time
    Out[101]: datetime.datetime(2017, 1, 20, 16, 11, 50, 43258)
    =========================================
    In [108]: datetime.now().month  # show the current month
    Out[108]: 1
    
    =========================================
    In [126]: import pandas as pd
    
    In [127]: dates = pd.date_range('2015-02-01',periods=10)
    
    In [128]: dates
    Out[128]: 
    DatetimeIndex(['2015-02-01', '2015-02-02', '2015-02-03', '2015-02-04','2015-02-05', '2015-02-06', '2015-02-07', '2015-02-08','2015-02-09', '2015-02-10'],dtype='datetime64[ns]', freq='D')
    
    In [133]: res = pd.DataFrame(range(1,11),index=dates,columns=['value'])
    
    In [134]: res
    Out[134]: 
                value
    2015-02-01      1
    2015-02-02      2
    2015-02-03      3
    2015-02-04      4
    2015-02-05      5
    2015-02-06      6
    2015-02-07      7
    2015-02-08      8
    2015-02-09      9
    2015-02-10     10

    3. Data display

      3.1 Display methods:

    In [180]: quotesdf2.index  # show the index (quotesdf2 appears to be the date-indexed frame built in 1.2.2)
    Out[180]: 
    Index([u'2016-01-20', u'2016-01-21', u'2016-01-22', u'2016-01-25',
           ...
           u'2017-01-11', u'2017-01-12', u'2017-01-13', u'2017-01-17',
           u'2017-01-18', u'2017-01-19'],
          dtype='object', length=253)
    
    In [181]: quotesdf2.columns  # show the column names
    Out[181]: Index([u'open', u'close', u'high', u'low', u'volume'], dtype='object')
    
    In [182]: quotesdf2.values  # show the underlying values as an array
    Out[182]: 
    array([[  6.03741455e+01,   6.18359160e+01,   6.23362562e+01,
              6.01288817e+01,   9.04380000e+06],
           ..., 
           [  7.76100010e+01,   7.66900020e+01,   7.77799990e+01,
              7.66100010e+01,   7.79110000e+06]])
    
    In [183]: quotesdf2.describe  # without parentheses this only shows the bound method; call quotesdf2.describe() for summary statistics
    Out[183]: 
    <bound method DataFrame.describe of                  open      close       high        low      volume
    2016-01-20  60.374146  61.835916  62.336256  60.128882   9043800.0
    2016-01-21  61.806486  61.453305  63.101479  61.325767   8992300.0
    2016-01-22  57.283819  54.016907  57.774347  53.114334  43783400.0
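
For comparison, here is what calling the method looks like. This sketch rebuilds a three-row stand-in for quotesdf2 from the rows shown above, so the statistics cover only those rows, not the full 253-row data set.

```python
import pandas as pd

# Three rows copied from the output above, standing in for the full frame.
quotesdf2 = pd.DataFrame(
    [
        (60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
        (61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
        (57.283819, 54.016907, 57.774347, 53.114334, 43783400.0),
    ],
    index=['2016-01-20', '2016-01-21', '2016-01-22'],
    columns=['open', 'close', 'high', 'low', 'volume'],
)

# The parentheses call the method, returning count, mean, std, min,
# quartiles, and max for each numeric column.
stats = quotesdf2.describe()
print(stats)
```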

      3.2 Index format: in Python 2 output, the u prefix marks a unicode string

      3.3 Displaying rows:

    In [193]: quotesdf.head(2)  # dedicated method: first two rows
    Out[193]: 
           date       open      close       high        low     volume
    1  735983.0  60.374146  61.835916  62.336256  60.128882  9043800.0
    2  735984.0  61.806486  61.453305  63.101479  61.325767  8992300.0
    
    In [194]: quotesdf.tail(2)  # dedicated method: last two rows
    Out[194]: 
             date       open      close       high        low     volume
    252  736347.0  77.110001  77.489998  77.610001  76.510002  5988400.0
    253  736348.0  77.610001  76.690002  77.779999  76.610001  7791100.0
    
    In [195]: quotesdf[:2]  # slicing: first two rows
    Out[195]: 
           date       open      close       high        low     volume
    1  735983.0  60.374146  61.835916  62.336256  60.128882  9043800.0
    2  735984.0  61.806486  61.453305  63.101479  61.325767  8992300.0
    
    In [197]: quotesdf[251:]  # slicing by position: last two of the 253 rows
    Out[197]: 
             date       open      close       high        low     volume
    252  736347.0  77.110001  77.489998  77.610001  76.510002  5988400.0
    253  736348.0  77.610001  76.690002  77.779999  76.610001  7791100.0
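
Hard-coding 251 only works for a frame with exactly 253 rows. A sketch of the length-independent form, using iloc to make the positional slicing explicit; the five closing prices are made-up sample values:

```python
import pandas as pd

# Made-up sample values; labels 1..5 mirror the 1-based index from section 2.2.
df = pd.DataFrame({'close': [61.8, 61.5, 54.0, 54.0, 55.1]}, index=range(1, 6))

first_two = df.iloc[:2]   # positions 0 and 1, same rows as df.head(2)
last_two = df.iloc[-2:]   # last two positions, whatever the length
print(first_two)
print(last_two)
```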

    4. Data selection

    5. Simple statistics and processing

    6. Grouping

    7. Merge

  • Original article: https://www.cnblogs.com/wnzhong/p/6323475.html