zoukankan      html  css  js  c++  java
  • Pandas系列(十)-转换连接详解

    目录

    • 1. 拼接
    • 1.1 append
    • 1.2 concat
    • 2. 关联
    • 2.1 merge
    • 2.2 join

    数据准备

    # 导入相关库
    import numpy as np
    import pandas as pd
    """
    拼接
    有两个DataFrame,都存储了用户的一些信息,现在要拼接起来,组成一个DataFrame,如何实现呢?
    """
    data1 = {
        "name": ["Tom", "Bob"],
        "age": [18, 30],
        "city": ["Bei Jing ", "Shang Hai "]
    }
    df1 = pd.DataFrame(data=data1)
    df1
    Out[85]: 
      name  age        city
    0  Tom   18   Bei Jing 
    1  Bob   30  Shang Hai 
    data2 = {
        "name": ["Mary", "James"],
        "age": [35, 18],
        "city": ["Guang Zhou", "Shen Zhen"]
    }
    df2 = pd.DataFrame(data=data2)
    df2
    Out[86]: 
        name  age        city
    0   Mary   35  Guang Zhou
    1  James   18   Shen Zhen
    

     1. 拼接

      1.1 append

    def append(self, other, ignore_index=False,verify_integrity=False, sort=None):

      append 是最简单的拼接两个DataFrame的方法。

    df1.append(df2)
    Out[87]: 
        name  age        city
    0    Tom   18   Bei Jing 
    1    Bob   30  Shang Hai 
    0   Mary   35  Guang Zhou
    1  James   18   Shen Zhen

    可以看到,拼接后的索引默认还是原有的索引,如果想要重新生成索引的话,设置参数 ignore_index=True 即可。

    df1.append(df2, ignore_index=True)
    Out[88]: 
        name  age        city
    0    Tom   18   Bei Jing 
    1    Bob   30  Shang Hai 
    2   Mary   35  Guang Zhou
    3  James   18   Shen Zhen

      1.2 concat

      除了 append 这种方式之外,还有 concat 这种方式可以实现相同的功能。

    pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
               keys=None, levels=None, names=None, verify_integrity=False,
               sort=None, copy=True):

     例子

    objs=[df1, df2]
    pd.concat(objs, ignore_index=True)
    Out[89]: 
        name  age        city
    0    Tom   18   Bei Jing 
    1    Bob   30  Shang Hai 
    2   Mary   35  Guang Zhou
    3  James   18   Shen Zhen

    如果想要区分出不同的DataFrame的数据,可以通过设置参数 keys,当然得设置参数 ignore_index=False。

    pd.concat(objs, ignore_index=False, keys=["df1", "df2"])
    Out[90]: 
            name  age        city
    df1 0    Tom   18   Bei Jing 
        1    Bob   30  Shang Hai 
    df2 0   Mary   35  Guang Zhou
        1  James   18   Shen Zhen
    

    2. 关联

      有两个DataFrame,分别存储了用户的部分信息,现在需要将用户的这些信息关联起来,如何实现呢?

      2.1 merge

        def merge(self, right, how='inner', on=None, left_on=None, right_on=None,
                  left_index=False, right_index=False, sort=False,
                  suffixes=('_x', '_y'), copy=True, indicator=False,
                  validate=None):

      通过 pd.merge 可以关联两个DataFrame,这里我们设置参数 on="name",表示依据 name 来作为关联键。默认how='inner',我们可以设置成outer

    data1 = {
        "name": ["Tom", "Bob", "Mary", "James"],
        "age": [18, 30, 35, 18],
        "city": ["Bei Jing ", "Shang Hai ", "Guang Zhou", "Shen Zhen"]
    }
    df1 = pd.DataFrame(data=data1)
    df1
    data2 = {"name": ["Bob", "Mary", "James", "Andy"],
            "sex": ["male", "female", "male", np.nan],
             "income": [8000, 8000, 4000, 6000]
    }
    df2 = pd.DataFrame(data=data2)
    df2
    pd.merge(df1,df2,on="name")
    Out[91]: 
        name  age        city     sex  income
    0    Bob   30  Shang Hai     male    8000
    1   Mary   35  Guang Zhou  female    8000
    2  James   18   Shen Zhen    male    4000
    #关联后发现数据变少了,只有 3 行数据,这是因为默认关联的方式是 inner,如果不想丢失任何数据,可以设置参数 how="outer"。
    pd.merge(df1,df2,on="name",how="outer")
    Out[92]: 
        name   age        city     sex  income
    0    Tom  18.0   Bei Jing      NaN     NaN
    1    Bob  30.0  Shang Hai     male  8000.0
    2   Mary  35.0  Guang Zhou  female  8000.0
    3  James  18.0   Shen Zhen    male  4000.0
    4   Andy   NaN         NaN     NaN  6000.0

    如果我们想保留左边所有的数据,可以设置参数 how="left";反之,如果想保留右边的所有数据,可以设置参数 how="right"

    pd.merge(df1, df2, on="name", how="left")
    Out[93]: 
        name  age        city     sex  income
    0    Tom   18   Bei Jing      NaN     NaN
    1    Bob   30  Shang Hai     male  8000.0
    2   Mary   35  Guang Zhou  female  8000.0
    3  James   18   Shen Zhen    male  4000.0

    有时候,两个 DataFrame 中需要关联的键的名称不一样,可以通过 left_on 和 right_on 来分别设置。

    df1.rename(columns={"name": "name1"}, inplace=True)
    df1
    Out[94]: 
       name1  age        city
    0    Tom   18   Bei Jing 
    1    Bob   30  Shang Hai 
    2   Mary   35  Guang Zhou
    3  James   18   Shen Zhen
    df2.rename(columns={"name": "name2"}, inplace=True)
    df2
    Out[95]: 
       name2     sex  income
    0    Bob    male    8000
    1   Mary  female    8000
    2  James    male    4000
    3   Andy     NaN    6000
    pd.merge(df1, df2, left_on="name1", right_on="name2")
    Out[96]: 
       name1  age        city  name2     sex  income
    0    Bob   30  Shang Hai     Bob    male    8000
    1   Mary   35  Guang Zhou   Mary  female    8000
    2  James   18   Shen Zhen  James    male    4000 

      有时候,两个DataFrame中都包含相同名称的字段,如何处理呢?

      我们可以设置参数 suffixes,默认 suffixes=('_x', '_y') 表示将相同名称的左边的DataFrame的字段名加上后缀 _x,右边加上后缀 _y。

    df1["sex"] = "male"
    df1
    Out[97]: 
       name1  age        city   sex
    0    Tom   18   Bei Jing   male
    1    Bob   30  Shang Hai   male
    2   Mary   35  Guang Zhou  male
    3  James   18   Shen Zhen  male
    pd.merge(df1, df2, left_on="name1", right_on="name2")
    Out[98]: 
       name1  age        city sex_x  name2   sex_y  income
    0    Bob   30  Shang Hai   male    Bob    male    8000
    1   Mary   35  Guang Zhou  male   Mary  female    8000
    2  James   18   Shen Zhen  male  James    male    4000
    pd.merge(df1, df2, left_on="name1", right_on="name2", suffixes=("_left", "_right"))
    Out[99]: 
       name1  age        city sex_left  name2 sex_right  income
    0    Bob   30  Shang Hai      male    Bob      male    8000
    1   Mary   35  Guang Zhou     male   Mary    female    8000
    2  James   18   Shen Zhen     male  James      male    4000
    

      2.2 join

    def join(self, other, on=None, how='left', lsuffix='', rsuffix='',sort=False):

      除了 merge 这种方式外,还可以通过 join 这种方式实现关联。相比 merge,join 这种方式有以下几个不同:

      (1)默认参数on=None,表示关联时使用左边和右边的索引作为键,设置参数on可以指定的是关联时左边的所用到的键名

      (2)左边和右边字段名称重复时,通过设置参数 lsuffix 和 rsuffix 来解决。

    df1.join(df2.set_index("name2"), on="name1", lsuffix="_left")
    Out[100]: 
       name1  age        city sex_left     sex  income
    0    Tom   18   Bei Jing      male     NaN     NaN
    1    Bob   30  Shang Hai      male    male  8000.0
    2   Mary   35  Guang Zhou     male  female  8000.0
    3  James   18   Shen Zhen     male    male  4000.0

    数据合并综合代码

    from pandas import concat
    
    
    def data_concat(dir_name, to_file_path, drop_duplicates: bool = False):
        """
        数据纵向合并
        :param dir_name: 数据来源文件夹名称
        :param to_file_path: 合并数据保存文件夹
        :param drop_duplicates: 是否去重
        :return:
        """
        objs = (read_excel(f'{dir_name}/{file}') for file in os.listdir(dir_name))
        merge_data = concat(objs=objs, ignore_index=True)
        if drop_duplicates:
            merge_data.drop_duplicates(inplace=True)
        merge_data.to_excel(to_file_path, index=False)
    
    
    if __name__ == '__main__':
        data_concat(dir_name='data1', to_file_path='merge_data.xlsx', drop_duplicates=True)
    • 融汇贯通 
    # -*- coding: utf-8 -*-
    
    """
    Datetime: 2020/07/05
    Author: Zhang Yafei
    Description: 合并文件
    """
    from pandas import read_csv, read_excel, merge, concat, DataFrame
    
    
    def read_file(file_path, on):
        if file_path.endswith('.csv'):
            return read_csv(file_path)
        if file_path.endswith('.xls') or file_path.endswith('xlsx'):
            return read_excel(file_path)
    
    
    def df_to_file(df: DataFrame, file_path: str, index: bool = True, encoding: str = 'utf_8_sig'):
        if file_path.endswith('.csv'):
            df.to_csv(file_path, index=index, encoding=encoding)
        if file_path.endswith('.xls') or file_path.endswith('xlsx'):
            df.to_excel(file_path, index=index)
    
    
    def merge_two_data(file1: str, file2: str, on: str = None, left_on: str = None, right_on: str = None,
                       how: str = 'inner', to_file: str = None):
        """
        横向合并两个文件
        @param file1:
        @param file2:
        @param on:
        @param left_on:
        @param right_on:
        @param how:
        @param to_file:
        @return:
        """
        df1 = read_file(file1)
        df2 = read_file(file2)
        merge_df = merge(df1, df2, on=on, how=how, left_on=left_on, right_on=right_on)
        if to_file:
            if to_file.endswith('.csv'):
                merge_df.to_csv(to_file, encoding='utf_8_sig', index=False)
            elif to_file.endswith('xls') or to_file.endswith('xlsx'):
                merge_df.to_excel(to_file, index=False)
        else:
            return merge_df
    
    
    def append_two_file(file1: str, file2: str, to_file: str = None):
        """
        纵向合并两个文件
        @param file1:
        @param file2:
        @param to_file:
        @return:
        """
        df1 = read_file(file1)
        df2 = read_file(file2)
        df3 = df1.append(df2, ignore_index=True)
        if to_file:
            df_to_file(df3, to_file, index=False)
        else:
            return df3
    
    
    def join_two_file(file1: str, file2: str, on: str = None, how: str = 'left', to_file: str = None):
        """
        横向合并两个文件
        @param file1:
        @param file2:
        @param on:
        @param how:
        @param to_file:
        @return:
        """
        df1 = read_file(file1)
        df2 = read_file(file2)
        df3 = df1.join(df2, on=on, how=how)
        if to_file:
            df_to_file(df3, to_file, index=False)
        else:
            return df3
    
    
    def concat_more_data(axis: int = 0, to_file=None, encoding='utf_8_sig', *files):
        """
        多个文件合并
        @param axis: 0/index 1/column 若axis=1, 默认基于索引将多个文件合并
        @param to_file: 导出文件路径
        @param encoding: 导出文件编码
        @param files: 合并文件路径
        @return:
        """
        if len(files) > 1:
            objs = [read_file(file) for file in files]
            merge_data = concat(objs=objs, axis=axis)
            if to_file:
                df_to_file(merge_data, to_file, index=False, encoding=encoding)
            else:
                return merge_data
        else:
            raise Exception('合并的文件个数小于2,不能进行合并,请输入大于等于两个文件路径')
    

      

     

  • 相关阅读:
    【Android
    【Android
    【Android
    【Android
    压测工具Siege
    压测工具Webbench
    Varnish与Squid的对比
    大牛的法宝[转]
    告别平庸的10种简单方法
    一个优秀的研发团队应该具备什么特征[转]
  • 原文地址:https://www.cnblogs.com/zhangyafei/p/10517069.html
Copyright © 2011-2022 走看看