
[Translation] Reading large Excel files with Pandas

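Last week I took part in a Dataviz Battle on the dataisbeautiful subreddit, where we had to create a visualization from the TSA claims dataset. I like this kind of competition because most of the time you end up learning a lot of useful things along the way.
This time the data was very clean, but it was scattered across several PDF files and Excel files. While extracting the data from the PDFs I got to know a few tools and libraries, and in the end I used tabula-py, a Python wrapper for the Java library tabula. As for the Excel files, I found out that a one-liner, a simple pd.read_excel, wasn't enough.
The biggest Excel file was about 7MB and contained a single worksheet with roughly 100k rows. I thought Pandas could read the file in one go without any issue (I have 10GB of RAM on my computer), but apparently I was wrong.
The solution was to read the file in chunks. The pd.read_excel function doesn't have a cursor like pd.read_sql, so I had to implement this logic manually. Here is what I did: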

import os
import pandas as pd


HERE = os.path.abspath(os.path.dirname(__file__))
DATA_DIR = os.path.abspath(os.path.join(HERE, '..', 'data'))


def make_df_from_excel(file_name, nrows):
    """Read from an Excel file in chunks and make a single DataFrame.

    Parameters
    ----------
    file_name : str
    nrows : int
        Number of rows to read at a time. These Excel files are too big,
        so we can't read all rows in one go.
    """
    file_path = os.path.abspath(os.path.join(DATA_DIR, file_name))
    xl = pd.ExcelFile(file_path)

    # In this case, there was only a single Worksheet in the Workbook.
    sheetname = xl.sheet_names[0]

    # Read the header outside of the loop, so all chunk reads are
    # consistent across all loop iterations.
    # Note: the keyword is `sheet_name`; the older `sheetname` spelling is
    # no longer accepted by current pandas releases.
    df_header = pd.read_excel(file_path, sheet_name=sheetname, nrows=1)
    print(f"Excel file: {file_name} (worksheet: {sheetname})")

    chunks = []
    i_chunk = 0
    # The first row is the header. We have already read it, so we skip it.
    skiprows = 1
    while True:
        df_chunk = pd.read_excel(
            file_path, sheet_name=sheetname,
            nrows=nrows, skiprows=skiprows, header=None)
        skiprows += nrows
        # When there is no data, we know we can break out of the loop.
        if not df_chunk.shape[0]:
            break
        else:
            print(f"  - chunk {i_chunk} ({df_chunk.shape[0]} rows)")
            chunks.append(df_chunk)
        i_chunk += 1

    # The chunks already contain every data row (including the one that was
    # read together with the header), so df_header is only used for its
    # column names. Concatenating it as well would duplicate the first row.
    df = pd.concat(chunks, ignore_index=True)
    df.columns = df_header.columns
    return df


if __name__ == '__main__':
    df = make_df_from_excel('claims-2002-2006_0.xls', nrows=10000)
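
One possible variation on the function above, sketched under a few assumptions: each `pd.read_excel(file_path, ...)` call has to open the workbook again, so the version below opens it once with `pd.ExcelFile` and reads every chunk through `ExcelFile.parse`. The names `iter_chunks`, `read_rows` and `read_excel_in_chunks` are hypothetical helpers, not part of pandas or of the original article, and whether this is actually faster on the TSA files has not been measured.

```python
import pandas as pd


def iter_chunks(read_rows, nrows):
    """Yield DataFrames of at most `nrows` rows until a read comes back empty.

    `read_rows(skiprows, nrows)` is any callable that returns a DataFrame
    (hypothetical interface, chosen so the loop can be tested without a file).
    """
    skiprows = 1  # Row 0 is the header, so the data starts at row 1.
    while True:
        chunk = read_rows(skiprows, nrows)
        if chunk.empty:
            return
        yield chunk
        skiprows += nrows


def read_excel_in_chunks(file_path, nrows, sheet_name=0):
    """Open the workbook once, read it chunk by chunk, return one DataFrame."""
    with pd.ExcelFile(file_path) as xl:
        # Only the column names are needed from this first read.
        columns = xl.parse(sheet_name, nrows=1).columns

        def read_rows(skiprows, nrows):
            return xl.parse(
                sheet_name, header=None, skiprows=skiprows, nrows=nrows)

        chunks = list(iter_chunks(read_rows, nrows))

    if not chunks:
        # Empty worksheet: return just the header.
        return pd.DataFrame(columns=columns)
    df = pd.concat(chunks, ignore_index=True)
    df.columns = columns
    return df
```

Splitting the loop into a generator keeps the "stop when a chunk is empty" rule in one place; a call such as `read_excel_in_chunks(file_path, nrows=10000)` would then mirror the original `make_df_from_excel` call.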

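Another thing to keep in mind: when working with Excel files in Python, you might need different packages depending on whether you are reading or writing, and on whether the files are .xls or .xlsx.
This dataset contained both .xls and .xlsx files, so I had to use xlrd to read them. Please note that if your only concern is reading .xlsx files, then openpyxl is the way to go, even if xlrd could still be faster.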
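A minimal sketch of picking the reader package per file, assuming the `engine` parameter of `pd.read_excel` and that both xlrd and openpyxl are installed. The helpers `pick_engine` and `read_any_excel` and the `ENGINES` mapping are hypothetical, introduced only for illustration. Note that xlrd releases from 2.0 onward only read .xls files, which is why the mapping below sends .xlsx to openpyxl.

```python
import os

import pandas as pd

# Hypothetical mapping from file extension to a pandas Excel engine.
ENGINES = {".xls": "xlrd", ".xlsx": "openpyxl"}


def pick_engine(file_name):
    """Return the engine name to use for an Excel file, based on its extension."""
    ext = os.path.splitext(file_name)[1].lower()
    try:
        return ENGINES[ext]
    except KeyError:
        raise ValueError(f"Unsupported Excel extension: {ext!r}") from None


def read_any_excel(file_path, **kwargs):
    """Read an .xls or .xlsx file with the engine matching its extension."""
    return pd.read_excel(file_path, engine=pick_engine(file_path), **kwargs)
```

When `engine` is left out, pandas chooses one on its own; spelling it out mainly makes the dependency explicit.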
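I didn't write any Excel files this time, but if you need to, then you want xlsxwriter. I remember using it to create workbooks (i.e. Excel files) with many complex worksheets and cell comments. You can even use it to create worksheets with sparklines and VBA macros!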
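To give an idea of what that looks like, here is a small sketch that writes one worksheet with a cell comment and a sparkline, assuming the xlsxwriter package is installed. The function name `write_demo_workbook`, the sheet name and the numbers are made up for illustration and have nothing to do with the TSA data.

```python
import os
import tempfile

import xlsxwriter


def write_demo_workbook(file_path):
    """Write a tiny workbook with one commented cell and one sparkline."""
    workbook = xlsxwriter.Workbook(file_path)
    worksheet = workbook.add_worksheet("Claims")  # hypothetical sheet name

    # Header row followed by one row of made-up numbers.
    worksheet.write_row("A1", ["Q1", "Q2", "Q3", "Q4", "Trend"])
    worksheet.write_row("A2", [10, 40, 25, 60])

    # Attach a cell comment to the header of the sparkline column.
    worksheet.write_comment("E1", "Sparkline of the four quarters")

    # Draw a sparkline in E2 from the values in A2:D2.
    worksheet.add_sparkline("E2", {"range": "Claims!A2:D2"})

    # The file is only written to disk when the workbook is closed.
    workbook.close()
    return file_path


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
    print(write_demo_workbook(demo_path))
```

VBA macros are left out of this sketch, since they require a separate binary extracted from an existing macro-enabled workbook.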

Original article: https://www.giacomodebidda.com/reading-large-excel-files-with-pandas/

Source of this translation: https://www.cnblogs.com/everfight/p/pandas_read_large_number.html