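  • Python for Data Analysis: pandas data loading, storage, and file formats

    Note: this series is my own set of notes from working through the book Python for Data Analysis (《利用Python进行数据分析》), written so that I can review the material later.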

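    1 pandas parsing functions for reading files

    read_csv: reads delimited data; the default delimiter is a comma

    read_table: reads delimited data; the default delimiter is a tab ("\t")

    read_fwf: reads fixed-width column data (no delimiters)

    read_clipboard: reads data from the clipboard; handy for turning a table copied from a web page into a DataFrame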
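
    To see how the default delimiters differ, here is a minimal sketch that parses small in-memory strings through io.StringIO instead of files. The sample data and variable names are made up for illustration.

```python
import io
import pandas as pd

# Made-up sample data in three layouts
csv_text = "a,b,c\n1,2,3\n4,5,6\n"
tsv_text = "a\tb\tc\n1\t2\t3\n4\t5\t6\n"
fwf_text = "a  b  c\n1  2  3\n4  5  6\n"

# read_csv splits on commas by default
df_csv = pd.read_csv(io.StringIO(csv_text))

# read_table splits on tabs by default
df_tsv = pd.read_table(io.StringIO(tsv_text))

# read_fwf infers column boundaries from whitespace-aligned columns
df_fwf = pd.read_fwf(io.StringIO(fwf_text))

print(df_csv)
print(df_tsv)
print(df_fwf)
```

    All three calls should produce a frame with columns a, b, c and two rows; only the way the text is split differs.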

    1.1 Reading Excel data

    import pandas as pd
    file = 'D:example.xls'
    # Bind the result to df; reusing the name pd would shadow the pandas module
    df = pd.read_excel(file)
    df


    1.1.1 No header row

    df = pd.read_excel(file, header=None)


    1.1.2 Setting the column names

    df = pd.read_excel(file, names=['Year', 'Name', 'Math', 'Chinese', 'English', 'Avg'])


    1.1.3 Specifying the index column

    # '姓名' ("name") is a column header in the example file
    df = pd.read_excel(file, index_col='姓名')
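
    read_csv accepts the same header, names, and index_col parameters as read_excel, so their effect can be tried without an Excel file. The sketch below uses made-up sample data held in a string; the column labels are placeholders.

```python
import io
import pandas as pd

# Made-up sample data standing in for the spreadsheet
text = "Year,Name,Math\n2018,Ann,90\n2018,Bob,85\n"

# header=None: the first line is treated as data; columns are numbered 0..n-1
no_header = pd.read_csv(io.StringIO(text), header=None)

# names=...: replace the column labels; header=0 marks the first line as a
# header row to discard instead of reading it as data
renamed = pd.read_csv(io.StringIO(text), header=0, names=["year", "student", "math"])

# index_col: use one column as the row index
indexed = pd.read_csv(io.StringIO(text), index_col="Name")

print(no_header)
print(renamed)
print(indexed)
```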


    2 Reading CSV data

    import pandas as pd
    # Raw string: in a normal string, the "\t" in "d:\test.csv" would be read as a tab
    df = pd.read_csv(r"d:\test.csv", engine='python')
    df


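    import pandas as pd
    # read_table splits on tabs by default, so a comma-separated file
    # loads as a single column unless sep=',' is passed
    df = pd.read_table(r"d:\test.csv", engine='python')
    df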


    import pandas as pd
    # read_fwf infers fixed-width column boundaries from the first rows of the file
    df = pd.read_fwf(r"d:\test.csv")
    df


    3 Writing data out to text format

    to_csv writes data in CSV format; the default delimiter is a comma.

    import pandas as pd
    df = pd.read_fwf(r"d:\test.csv")
    df.to_csv(r"d:\test1.csv", encoding='gbk')


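    4 Handling delimited formats manually

    For files with a single-character delimiter, the csv module can be used directly.

    import pandas as pd
    import csv
    file = r'D:\test.csv'
    df = pd.read_csv(file, engine='python')
    df.to_csv(r"d:\test1.csv", encoding='gbk', sep='/')
    # Read the file back with the same encoding and delimiter used for writing
    with open(r"d:\test1.csv", encoding='gbk', newline='') as f:
        reader = csv.reader(f, delimiter='/')
        for line in reader:
            print(line)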
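
    csv.reader works on any iterable of lines, so the same idea can be tried without touching the disk. A minimal sketch, assuming made-up "/"-separated sample text:

```python
import csv
import io

# Made-up sample: three fields per line, separated by "/"
text = "a/b/c\n1/2/3\n4/5/6\n"

# delimiter sets the single-character field separator
reader = csv.reader(io.StringIO(text), delimiter="/")
header, *rows = list(reader)

# Regroup the rows into a dict of columns keyed by header name
columns = {name: list(values) for name, values in zip(header, zip(*rows))}
print(columns)
```

    Note that csv.reader returns every field as a string; converting to numbers is left to the caller.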


    4.1 Filling in missing values

    import pandas as pd
    import csv
    file = r'D:\test.csv'
    df = pd.read_csv(file, engine='python')
    # na_rep sets the text written in place of missing values
    df.to_csv(r"d:\test1.csv", encoding='gbk', sep='/', na_rep='NULL')
    with open(r"d:\test1.csv", encoding='gbk', newline='') as f:
        reader = csv.reader(f, delimiter='/')
        for line in reader:
            print(line)
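
    The effect of na_rep is easier to see on a small frame. This sketch assumes a made-up two-column DataFrame and relies on to_csv returning the CSV text as a string when no path is given.

```python
import csv
import io

import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column
df = pd.DataFrame({"x": [1.0, np.nan], "y": ["a", None]})

# With no path argument, to_csv returns the CSV text instead of writing a file
text = df.to_csv(sep="/", na_rep="NULL", index=False)

# Parse the text back with the csv module, using the same delimiter
rows = list(csv.reader(io.StringIO(text), delimiter="/"))
print(rows)
```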


    4.2 JSON

    4.2.1 json.loads converts a JSON string into Python objects

    import pandas as pd
    import numpy as np
    import json
    obj = """{
      "sucess" : "1",
      "header" : {
        "version" : 0,
        "compress" : false,
        "times" : 0
      },
      "data" : {
        "name" : "BankForQuotaTerrace",
        "attributes" : {
          "queryfound" : "1",
          "numfound" : "1",
          "reffound" : "1"
        },
        "columnmeta" : {
          "a0" : "DATE",
          "a1" : "DOUBLE",
          "a2" : "DOUBLE",
          "a3" : "DOUBLE",
          "a4" : "DOUBLE",
          "a5" : "DOUBLE",
          "a6" : "DATE",
          "a7" : "DOUBLE",
          "a8" : "DOUBLE",
          "a9" : "DOUBLE",
          "b0" : "DOUBLE",
          "b1" : "DOUBLE",
          "b2" : "DOUBLE",
          "b3" : "DOUBLE",
          "b4" : "DOUBLE",
          "b5" : "DOUBLE"
        },
        "rows" : [ [ "2017-10-28", 109.8408691012081, 109.85566362201733, 0.014794520809225841, 1.0, null, "", 5.636678251676443, 5.580869556115291, 37.846934105222246, null, null, null, null, null, 0.061309012867495856 ] ]
      }
    }
    """
    result = json.loads(obj)
    result


    4.2.2 json.dumps converts a Python object back into a JSON string

    result = json.loads(obj)
    asjson=json.dumps(result)
    asjson
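
    A small round trip shows how the two functions mirror each other and how JSON values map to Python types. The input string below is a made-up fragment, not the full payload from above.

```python
import json

# Made-up JSON fragment with a boolean, a number, and a null
text = '{"compress": false, "times": 0, "rows": [1.5, null, ""]}'

obj = json.loads(text)   # JSON text -> Python objects
# false becomes False, null becomes None, objects become dicts
print(obj["compress"], obj["rows"])

back = json.dumps(obj)   # Python objects -> JSON text
print(back)

# Parsing the dumped text again yields an equal object
assert json.loads(back) == obj
```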


    4.2.3 Converting JSON data into a DataFrame

    import pandas as pd
    import numpy as np
    from pandas import DataFrame
    import json
    obj = """{
      "sucess" : "1",
      "header" : {
        "version" : 0,
        "compress" : false,
        "times" : 0
      },
      "data" : {
        "name" : "BankForQuotaTerrace",
        "attributes" : {
          "queryfound" : "1",
          "numfound" : "1",
          "reffound" : "1"
        },
        "columnmeta" : {
          "a0" : "DATE",
          "a1" : "DOUBLE",
          "a2" : "DOUBLE",
          "a3" : "DOUBLE",
          "a4" : "DOUBLE",
          "a5" : "DOUBLE",
          "a6" : "DATE",
          "a7" : "DOUBLE",
          "a8" : "DOUBLE",
          "a9" : "DOUBLE",
          "b0" : "DOUBLE",
          "b1" : "DOUBLE",
          "b2" : "DOUBLE",
          "b3" : "DOUBLE",
          "b4" : "DOUBLE",
          "b5" : "DOUBLE"
        },
        "rows" : [ [ "2017-10-28", 109.8408691012081, 109.85566362201733, 0.014794520809225841, 1.0, null, "", 5.636678251676443, 5.580869556115291, 37.846934105222246, null, null, null, null, null, 0.061309012867495856 ] ]
      }
    }
    """
    result = json.loads(obj)
    result
    jsondf = DataFrame(result['data'], columns=['name', 'attributes', 'columnmeta'], index=[1, 2, 3])
    jsondf


    Note: attributes and columnmeta are nested objects; I will come back to this problem later.
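
    One possible way to deal with the nesting, sketched under assumptions: treat the keys of columnmeta as column labels for rows, and flatten the nested dicts with pandas.json_normalize. The payload below is a shortened stand-in with the same shape as the example, with fewer columns and rounded values.

```python
import json
import pandas as pd

# Shortened stand-in payload in the same shape as the example above
obj = """{
  "data": {
    "name": "BankForQuotaTerrace",
    "attributes": {"queryfound": "1", "numfound": "1"},
    "columnmeta": {"a0": "DATE", "a1": "DOUBLE", "a2": "DOUBLE"},
    "rows": [["2017-10-28", 109.84, null]]
  }
}"""
data = json.loads(obj)["data"]

# One column per columnmeta key, one row per entry in "rows"
table = pd.DataFrame(data["rows"], columns=list(data["columnmeta"]))

# json_normalize flattens nested dicts into dotted column names
flat = pd.json_normalize({"name": data["name"], "attributes": data["attributes"]})

print(table)
print(flat.columns.tolist())
```

    Whether the columnmeta keys really line up with the positions in rows is an assumption about this particular payload.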

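    4.3 XML and HTML

    Scrape the table data from a 10jqka (同花顺) web page and convert it into a DataFrame.

    I did not handle pagination when scraping; try it yourself if you are interested. The goal here is just to turn the scraped data into a DataFrame.

    The code is as follows: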

    import requests
    from bs4 import BeautifulSoup
    from pandas import DataFrame

    url = 'http://data.10jqka.com.cn/market/longhu/'
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}
    response = requests.get(url=url, headers=headers)
    html = response.content
    soup = BeautifulSoup(html, 'lxml')
    # All <div class="yyb"> blocks on the page
    s = soup.find_all('div', 'yyb')

    # Get the column names for the DataFrame from the first <thead> found
    def getcol():
        col = []
        for i in s:
            lzs = i.find_all('thead')
            for k in lzs:
                lbs = k.find_all('th')
                for j in lbs:
                    col.append(j.text.strip('\n'))
                return col

    # Get the row values for the DataFrame from the first <tbody> found
    def getvalues():
        for j in s:
            v = j.find_all('tbody')
            for k in v:
                vv = k.find_all('tr')
                rows = []
                for l in vv:
                    tdlist = []
                    vvv = l.find_all('td')
                    for m in vvv:
                        tdlist.append(m.text)
                    rows.append(tdlist)
                return rows

    if __name__ == "__main__":
        cols = getcol()
        values = getvalues()
        data = DataFrame(values, columns=cols)
        print(data)


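    4.4 Binary data formats

    In the pandas version the book uses, objects are written with the save method and read back into Python with load. Current pandas releases use to_pickle and pandas.read_pickle for this instead.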
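
    A minimal sketch of the pickle round trip in a current pandas release, using to_pickle and pandas.read_pickle. The frame and the file name frame.pkl are made up, and the file is written to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Made-up frame to store
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")   # hypothetical file name
    df.to_pickle(path)                      # write the frame in binary pickle format
    restored = pd.read_pickle(path)         # read it back into Python

# True when values, dtypes, and labels all survived the round trip
print(restored.equals(df))
```

    Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.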

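    4.5 HDF5 format

    HDF is a hierarchical data format. An HDF5 file contains a file-system-like node structure, supports multiple datasets and metadata, and can be read and written efficiently in chunks. Python has two interfaces to the HDF5 library: PyTables and h5py.

    This is worth considering for very large datasets. I have no use for it right now, so I have not looked into it yet.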

    4.6 Using HTML and Web APIs

    import requests
    from pandas import DataFrame
    import json
    url = 'http://t.weather.sojson.com/api/weather/city/101030100'
    resp = requests.get(url)
    data = json.loads(resp.text)  # data is a dict here
    # Build a one-row DataFrame from the cityInfo object
    jsondf = DataFrame(data['cityInfo'], columns=['city', 'cityId', 'parent', 'updateTime'], index=[1])
    jsondf
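
    The conversion step can be tried offline with a stand-in for resp.text. In this sketch the field names follow the code above, while the values are placeholders and not real API output.

```python
import json
import pandas as pd

# Stand-in for resp.text: field names as in the code above, placeholder values
text = (
    '{"cityInfo": {"city": "CityName", "cityId": "101030100", '
    '"parent": "ParentName", "updateTime": "00:00"}}'
)
data = json.loads(text)

# Wrapping the dict in a list yields a one-row frame without passing index=
df = pd.DataFrame([data["cityInfo"]])
print(df)
```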


    4.7 Using databases

    4.7.1 sqlite3

    import sqlite3
    import pandas as pd
    con = sqlite3.connect('test.db')  # 'test.db' is a placeholder database path
    df = pd.read_sql('select * from test', con)  # con is a connection object
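
    For a version that runs without an existing database file, the sketch below builds a throwaway in-memory SQLite database and reads it back with pandas.read_sql_query. The table layout and rows are made up for illustration.

```python
import sqlite3

import pandas as pd

# ":memory:" creates a temporary database that lives only in RAM
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (a TEXT, b INTEGER)")
con.executemany("INSERT INTO test VALUES (?, ?)", [("x", 1), ("y", 2)])
con.commit()

# Run the query and load the result set into a DataFrame
df = pd.read_sql_query("SELECT * FROM test", con)
con.close()

print(df)
```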

    4.7.2 MongoDB

    Not installed yet, so this is on hold for now.

  • Original article: https://www.cnblogs.com/zhouwp/p/10139021.html