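In data analysis, you often need to create new columns based on some condition and then analyze them further. pandas offers several ways to do this:
- Direct assignment
- The df.apply method
- The df.assign method
- Selecting groups of rows by condition and assigning to each group separately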
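    import pandas as pd

    file_path = "../files/beijing_tianqi_2018.csv"
    df = pd.read_csv(file_path)
    print(df.head())

    # Use the date as the index so rows can be selected by date
    df.set_index('ymd', inplace=True)

    # Strip the ℃ suffix from the temperatures and convert to int32.
    # df[col] = ... replaces the whole column, so the int32 dtype is kept
    # (recent pandas versions write df.loc[:, col] = ... into the existing
    # object column and the conversion is lost).
    df['bWendu'] = df['bWendu'].str.replace('℃', '').astype('int32')
    df['yWendu'] = df['yWendu'].str.replace('℃', '').astype('int32')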
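The cleaning step can be tried without the CSV file. The following is a minimal sketch, assuming a few made-up rows in the same shape as the weather data (the sample values are hypothetical, not taken from the real file):

```python
import pandas as pd

# Hypothetical sample rows standing in for the CSV file
df = pd.DataFrame({
    "ymd": ["2018-01-01", "2018-01-02", "2018-01-03"],
    "bWendu": ["3℃", "2℃", "-2℃"],
    "yWendu": ["-6℃", "-5℃", "-10℃"],
})
df = df.set_index("ymd")

for col in ["bWendu", "yWendu"]:
    # Remove the literal ℃ suffix, then convert the remaining text to integers
    df[col] = df[col].str.replace("℃", "", regex=False).astype("int32")

print(df.dtypes)             # both temperature columns are now int32
print(df.loc["2018-01-02"])  # label-based lookup through the date index
```

Passing `regex=False` makes it explicit that `℃` is treated as a literal string rather than a pattern.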
Example: computing the temperature difference by direct assignment

    # Note: df['bWendu'] is a Series, so the subtraction also returns a Series
    df.loc[:, 'wencha'] = df['bWendu'] - df['yWendu']
Full code:

    import pandas as pd

    file_path = "../files/beijing_tianqi_2018.csv"
    df = pd.read_csv(file_path)

    # Strip the ℃ suffix and convert to int32 (modifies existing columns).
    # df[col] = ... replaces the column, so the int32 dtype is kept.
    df['bWendu'] = df['bWendu'].str.replace('℃', '').astype('int32')
    df['yWendu'] = df['yWendu'].str.replace('℃', '').astype('int32')
    print(df.head())
    print('*' * 50, ' ')

    # Compute the temperature difference (adds a new column)
    # Note: df['bWendu'] is a Series, so the subtraction also returns a Series
    df.loc[:, 'wencha'] = df['bWendu'] - df['yWendu']
    print(df.head())
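To see what the subtraction produces before it is stored, here is a small sketch on a hand-made frame (the three rows are hypothetical sample values and the variable `diff` is introduced only for illustration):

```python
import pandas as pd

# Hypothetical rows with already-cleaned integer temperatures
df = pd.DataFrame(
    {"bWendu": [3, 2, -2], "yWendu": [-6, -5, -10]},
    index=["2018-01-01", "2018-01-02", "2018-01-03"],
)

# Element-wise subtraction; the two Series are aligned on the index
diff = df["bWendu"] - df["yWendu"]
print(type(diff).__name__)  # Series

# Assigning the Series under a new name adds it as a column
df["wencha"] = diff
print(df)
```

Because the result is a Series with the same index as `df`, each value lands on the row it was computed from.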
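Example: adding a temperature-type column with df.apply
- If the daily high is above 33 degrees, the day counts as high temperature
- If the daily low is below -10 degrees, it counts as low temperature
- Otherwise it counts as normal temperature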
    import pandas as pd

    file_path = "../files/beijing_tianqi_2018.csv"
    df = pd.read_csv(file_path)

    # Strip the ℃ suffix and convert to int32 (modifies existing columns).
    # df[col] = ... replaces the column, so the int32 dtype is kept.
    df['bWendu'] = df['bWendu'].str.replace('℃', '').astype('int32')
    df['yWendu'] = df['yWendu'].str.replace('℃', '').astype('int32')
    print(df.head())
    print('*' * 50, ' ')


    def get_wendu_type(x):
        # x is one row of the DataFrame, passed in as a Series
        if x['bWendu'] > 33:
            return "high"
        elif x['yWendu'] < -10:
            return "low"
        else:
            return "normal"


    # Note: axis=1 is required; each row is then passed as a Series
    # whose index is the column names
    df.loc[:, 'wendu_type'] = df.apply(get_wendu_type, axis=1)

    # Print the first few rows
    print(df.head())
    print('*' * 50, ' ')

    # Count how many days fall into each temperature type
    print(df['wendu_type'].value_counts())
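Row-wise `apply` calls a Python function once per row, which can be slow on large frames. As an alternative that is not part of the original tutorial, the same rule can be sketched with `numpy.select`. The sample rows and the English labels below are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one hot day, one ordinary day, one cold day
df = pd.DataFrame({"bWendu": [35, 20, -5], "yWendu": [25, 10, -12]})


def get_wendu_type(row):
    # Same rule as the tutorial: check the high first, then the low
    if row["bWendu"] > 33:
        return "high"
    elif row["yWendu"] < -10:
        return "low"
    return "normal"


# Row-wise version: axis=1 passes each row as a Series
df["wendu_type"] = df.apply(get_wendu_type, axis=1)

# Vectorized version: conditions are checked in order, first match wins
conditions = [df["bWendu"] > 33, df["yWendu"] < -10]
labels = ["high", "low"]
df["wendu_type_vec"] = np.select(conditions, labels, default="normal")

print(df)
```

Both columns should hold the same labels, because `np.select` evaluates its conditions in list order, mirroring the `if`/`elif` chain.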
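Example: adding Fahrenheit columns with df.assign. Unlike direct assignment, df.assign returns a new DataFrame and leaves df itself unchanged.

    import pandas as pd

    file_path = "../files/beijing_tianqi_2018.csv"
    df = pd.read_csv(file_path)

    # Strip the ℃ suffix and convert to int32 (modifies existing columns).
    # df[col] = ... replaces the column, so the int32 dtype is kept.
    df['bWendu'] = df['bWendu'].str.replace('℃', '').astype('int32')
    df['yWendu'] = df['yWendu'].str.replace('℃', '').astype('int32')
    print(df.head())
    print('*' * 50, ' ')

    # Each keyword becomes a new column; the lambda receives the DataFrame
    df_huashi = df.assign(
        yWendu_huashi=lambda x: x['yWendu'] * 9 / 5 + 32,
        bWendu_huashi=lambda x: x['bWendu'] * 9 / 5 + 32
    )
    print(df_huashi.head())
    print('*' * 50, ' ')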
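The behavior of `assign` is easiest to check on a tiny frame. This is a minimal sketch with hypothetical values; the extra `wencha_huashi` column is an illustrative addition that is not in the tutorial:

```python
import pandas as pd

# Hypothetical Celsius temperatures
df = pd.DataFrame({"bWendu": [35, 0, -10], "yWendu": [25, -10, -20]})

df_huashi = df.assign(
    bWendu_huashi=lambda x: x["bWendu"] * 9 / 5 + 32,
    yWendu_huashi=lambda x: x["yWendu"] * 9 / 5 + 32,
    # A later keyword can use a column created earlier in the same call
    wencha_huashi=lambda x: x["bWendu_huashi"] - x["yWendu_huashi"],
)

print(df.columns.tolist())         # the original frame is unchanged
print(df_huashi.columns.tolist())  # the copy carries the new columns
print(df_huashi)
```

Since `assign` returns a copy, it fits well into method chains where the original data should stay untouched.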
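Select rows by condition first, then assign the new column for just those rows
Example: if the gap between the daily high and low is more than 10 degrees, treat the temperature difference as large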
    import pandas as pd

    file_path = "../files/beijing_tianqi_2018.csv"
    df = pd.read_csv(file_path)

    # Strip the ℃ suffix and convert to int32 (modifies existing columns).
    # df[col] = ... replaces the column, so the int32 dtype is kept.
    df['bWendu'] = df['bWendu'].str.replace('℃', '').astype('int32')
    df['yWendu'] = df['yWendu'].str.replace('℃', '').astype('int32')

    # Print the first few rows
    print(df.head())
    print('*' * 50, ' ')

    # Create an empty column first (this uses the first method, direct assignment)
    df['wencha_type'] = ""
    # Fill in the label for each group of rows selected by the condition
    df.loc[df['bWendu'] - df['yWendu'] > 10, 'wencha_type'] = "large"
    df.loc[df['bWendu'] - df['yWendu'] <= 10, 'wencha_type'] = "normal"

    # Print the first few rows
    print(df.head())
    print('*' * 50, ' ')

    # Count how many days fall into each temperature-difference type
    print(df['wencha_type'].value_counts())
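The two conditions above are complements of each other, so the mask can be computed once and negated. A small sketch on hypothetical rows (the name `big_gap` and the English labels are illustrative assumptions):

```python
import pandas as pd

# Hypothetical rows with differences of 9, 7 and 15 degrees
df = pd.DataFrame({"bWendu": [3, 12, 30], "yWendu": [-6, 5, 15]})

# Boolean mask with one True/False value per row
big_gap = (df["bWendu"] - df["yWendu"]) > 10

df["wencha_type"] = ""                       # start with an empty column
df.loc[big_gap, "wencha_type"] = "large"     # rows where the mask is True
df.loc[~big_gap, "wencha_type"] = "normal"   # all remaining rows

print(df)
print(df["wencha_type"].value_counts())
```

Reusing one mask keeps the two assignments from drifting apart if the threshold is changed later.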