  • Apply vs transform on a group object

    My takeaway so far is that .transform works on each Series (column) in isolation from the others.

    You asked .transform to take values from two columns and 'it' actually does not 'see' both of them at the same time (so to speak). 

     https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object

    Two major differences between apply and transform

    There are two major differences between the transform and apply groupby methods.

    • apply implicitly passes all the columns for each group as a DataFrame to the custom function, while transform passes each column for each group as a Series to the custom function
    • The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list). The custom function passed to transform must return a sequence (a one dimensional Series, array or list) the same length as the group.

    So, transform works on just one Series at a time and apply works on the entire DataFrame at once.

    Inspecting the custom function

    It can help quite a bit to inspect the input to your custom function passed to apply or transform.

    Examples

    Let's create some sample data and inspect the groups so that you can see what I am talking about:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'], 
                       'a':[4,5,1,3], 'b':[6,10,3,11]})
    df

    Let's create a simple custom function that prints out the type of the implicitly passed object and then raises an error so that execution stops. (A bare raise with no active exception produces a RuntimeError.)

    def inspect(x):
        print(type(x))
        raise

    Now let's pass this function to both the groupby apply and transform methods to see what object is passed to it:

    df.groupby('State').apply(inspect)
    
    <class 'pandas.core.frame.DataFrame'>
    <class 'pandas.core.frame.DataFrame'>
    RuntimeError

    As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. In the pandas version used for this output, the first group is run twice: pandas does this to determine whether there is a fast way to complete the computation. This is a minor detail that you shouldn't worry about, and other versions may print the type only once.

    Now, let's do the same thing with transform

    df.groupby('State').transform(inspect)
    <class 'pandas.core.series.Series'>
    <class 'pandas.core.series.Series'>
    RuntimeError

    It is passed a Series - a totally different Pandas object.
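    The same inspection can be done without raising, by collecting the types in a list. This is only a sketch: record_type and the two lists are names made up here, and selecting [['a', 'b']] before calling apply is my own choice to keep the grouping column out of the function's input. Depending on the pandas version, transform may also probe the function with the whole group as an internal shortcut, so the second list is not guaranteed to hold only Series.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

def record_type(x, seen):
    # Hypothetical helper: remember what kind of object pandas handed over,
    # then return it unchanged so both apply and transform accept the result.
    seen.append(type(x))
    return x

apply_seen, transform_seen = [], []
grouped = df.groupby('State')[['a', 'b']]

grouped.apply(lambda g: record_type(g, apply_seen))
grouped.transform(lambda s: record_type(s, transform_seen))

print({t.__name__ for t in apply_seen})
print({t.__name__ for t in transform_seen})
```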

    So, transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try and subtract column a from b inside of our custom function we would get an error with transform. See below:

    def subtract_two(x):
        return x['a'] - x['b']
    
    df.groupby('State').transform(subtract_two)
    KeyError: ('a', 'occurred at index a')

    We get a KeyError as pandas is attempting to find the Series index a which does not exist. You can complete this operation with apply as it has the entire DataFrame:

    df.groupby('State').apply(subtract_two)
    
    State     
    Florida  2   -2
             3   -8
    Texas    0   -2
             1   -5
    dtype: int64

    The output is a Series and a little confusing as the original index is kept, but we have access to all columns.
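    Because apply sees the whole group, the custom function can also reduce two columns to a single scalar per group, which gives one row per State instead of the nested index above. A minimal sketch on the same sample data; diff_total is a name invented for illustration, and the [['a', 'b']] selection is an assumption of mine rather than part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

def diff_total(group):
    # group is a DataFrame with columns 'a' and 'b' for one State,
    # so both columns are available at the same time.
    return (group['a'] - group['b']).sum()

# One scalar per group, indexed by State.
totals = df.groupby('State')[['a', 'b']].apply(diff_total)
print(totals)
```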


    Displaying the passed pandas object

    It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating on. You can use print statements, but I like to use the display function from the IPython.display module so that the DataFrames are rendered nicely as HTML in a Jupyter notebook:

    from IPython.display import display
    def subtract_two(x):
        display(x)
        return x['a'] - x['b']



    Transform must return a single dimensional sequence the same size as the group

    The other difference is that transform must return a single dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not then an error is raised:

    def return_three(x):
        return np.array([1, 2, 3])
    
    df.groupby('State').transform(return_three)
    ValueError: transform must return a scalar value for each group

    The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:

    def rand_group_len(x):
        return np.random.rand(len(x))
    
    df.groupby('State').transform(rand_group_len)
    
              a         b
    0  0.962070  0.151440
    1  0.440956  0.782176
    2  0.642218  0.483257
    3  0.056047  0.238208
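    A more typical same-length result is a group-wise centering, where each value has its group mean subtracted. The sketch below assumes the same sample data; demean is a hypothetical helper name, and selecting [['a', 'b']] explicitly is my own choice.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

def demean(col):
    # Returns an object with the same length as its input,
    # which is what transform requires.
    return col - col.mean()

demeaned = df.groupby('State')[['a', 'b']].transform(demean)
print(demeaned)
```

    The result keeps the original row index, so it can be assigned straight back to the original DataFrame.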

    Returning a single scalar object also works for transform

    If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:

    def group_sum(x):
        return x.sum()
    
    df.groupby('State').transform(group_sum)
    
       a   b
    0  9  16
    1  9  16
    2  4  14
    3  4  14
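    This broadcasting is what makes transform convenient for adding group-level values as a new column. A small sketch, assuming the same sample data: the column name a_share is invented for illustration, and the built-in 'sum' string alias is used here in place of the custom group_sum function.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

# The group total is repeated on every row of its group,
# so it lines up with the original index.
group_total = df.groupby('State')['a'].transform('sum')

# Each row's share of its State's total for column 'a'.
df['a_share'] = df['a'] / group_total
print(df)
```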
  • Original article: https://www.cnblogs.com/andy-0212/p/10020088.html