14  Advanced Data Grouping and Data Aggregation

14.1 Grouping Operations

In Chapter 6 you practiced basic grouping operations over pandas DataFrames. The Pandas DataFrame.groupby() method split the original DataFrame based on a given key into several groups, refer to the following figure.

Grouping Operation

The following example applies a groupby() operation to a DataFrame df using the column Division as the grouping key. Notice that it returns a DataFrameGroupBy.

import pandas as pd
data = {
    'EmployeeID': ['F0099','F1010','M2020','M2021','E0100','E1000','E3434'],
    'Name': ['Alice', 'Bob', 'Charlie','John','Sean','Tobias','Francois'],     # Column 'Name' with names
    'Age': [25, 30, 35,50,47,59,55],                      # Column 'Age' with ages
    'Division': ['Finance','Finance','Marketing','Marketing','Engineering','Engineering','Engineering'],
    'City': ['New York', 'Los Angeles', 'Beijing','Hong Kong','London','Beijing','London'],  # Column 'City' with cities
    'Salary': [70000, 80000, 60000,45000,125000,36000,250000],# Column 'Salary' with salaries
    'Bonus': [1000,1500,30000,23000,20000,25000,15000]

}

# Step 3: Create the DataFrame
df = pd.DataFrame(data)
df.set_index('EmployeeID',inplace=True)

# Display the DataFrame
df
Name Age Division City Salary Bonus
EmployeeID
F0099 Alice 25 Finance New York 70000 1000
F1010 Bob 30 Finance Los Angeles 80000 1500
M2020 Charlie 35 Marketing Beijing 60000 30000
M2021 John 50 Marketing Hong Kong 45000 23000
E0100 Sean 47 Engineering London 125000 20000
E1000 Tobias 59 Engineering Beijing 36000 25000
E3434 Francois 55 Engineering London 250000 15000
# grouping by Division
dfGroupedByDivision=df.groupby('Division')

dfGroupedByDivision
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11f816450>
DataFrameGroupBy objects

the previous groupby() operation returned a DataFrameGroupBy object which, as you will see next can be iterated over using for loops.

Iterating over Groups

It is possible to implement fine-grained processing of groups by iterating over them using ‘for’ loops. The following example iterates over dfGroupedByDivision, in this case just printing each group’s rows.

for division,group in dfGroupedByDivision:
    print('I have found:',division)
    print(group)
I have found: Engineering
                Name  Age     Division     City  Salary  Bonus
EmployeeID                                                    
E0100           Sean   47  Engineering   London  125000  20000
E1000         Tobias   59  Engineering  Beijing   36000  25000
E3434       Francois   55  Engineering   London  250000  15000
I have found: Finance
             Name  Age Division         City  Salary  Bonus
EmployeeID                                                 
F0099       Alice   25  Finance     New York   70000   1000
F1010         Bob   30  Finance  Los Angeles   80000   1500
I have found: Marketing
               Name  Age   Division       City  Salary  Bonus
EmployeeID                                                   
M2020       Charlie   35  Marketing    Beijing   60000  30000
M2021          John   50  Marketing  Hong Kong   45000  23000
Iterating over DataFrameGroupBy objects

Iterating over a DataFrameGroupBy object is not as complicated at it seems, just bear in mind that you are looping over a series of groups and that each group is a yet another Pandas DataFrame.

As in any other regular for loop you can implement conditional filtering to process only specific groups. The following example iterates over dfGroupedByDivision, in this case it prints each group’s rows and finds the top earner. Given the if clause inside the loop only the Engineering division is processed.

for division,group in dfGroupedByDivision:
    print('I have found:',division)
    if division=='Engineering':
        print('I am processing:',division)
        print(group)
        topEarner=group['Salary'].max()
        print("The higest salary in the group is:",topEarner)
        print("\n")
I have found: Engineering
I am processing: Engineering
                Name  Age     Division     City  Salary  Bonus
EmployeeID                                                    
E0100           Sean   47  Engineering   London  125000  20000
E1000         Tobias   59  Engineering  Beijing   36000  25000
E3434       Francois   55  Engineering   London  250000  15000
The higest salary in the group is: 250000


I have found: Finance
I have found: Marketing
Caution when iterating over DataFrameGroupBy objects

We advise caution when deciding to iterate over a DataFrameGroupBy object as it usually degrades performance and memory bottlenecks when dealing with large datasets. We recommend, whenever possible, to rely on aggregation, transformation and filtration operations. We will review these operations in subsequent sections of this chapter.

14.2 Grouping-Aggregation Operations

In Chapter 6 you practiced basic aggregation and grouping of pandas DataFrames. A grouping-aggregation operation takes a DataFrameGroupBy object and applies a method, or your own function, to each group, refer to the following figure. Advanced grouping-aggregation operations are implemented combining the DataFrame.groupby() with the DataFrameGroupBy.agg() method.

Grouping-Aggregation Operation

Single column grouping, single column aggregation

You know already how to to perform grouping operations on a single column and subsequent aggregation of another column. For instance the following example: (1) groups the dataframe df on the column Division, (2) aggregates the Salary column using the sum() method.

df.groupby('Division').agg({'Salary':'sum'})
Salary
Division
Engineering 411000
Finance 150000
Marketing 105000

Multiple column grouping, single column aggregation

Grouping by several columns is quite a common requirement in data science. The following example (1) groups the dataframe df on the column Division and City, (2) aggregates the Salary column using the sum() method.

df.groupby(['Division','City']).agg({'Salary':'sum'})
Salary
Division City
Engineering Beijing 36000
London 375000
Finance Los Angeles 80000
New York 70000
Marketing Beijing 60000
Hong Kong 45000

Please notice that the output of this operation is a DataFrame with two indexes: Division and City.

Multiple column grouping, multiple column aggregation

Indeed it is possible to perform grouping operations over several columns and aggregation over other columns. The following example (1) groups the dataframe df on the column Division and City, (2) aggregates both Salary and Bonus columns using the DataFrameGroupBy.sum() method.

df.groupby(['Division','City']).agg({'Salary':'sum','Bonus':'sum'})
Salary Bonus
Division City
Engineering Beijing 36000 25000
London 375000 35000
Finance Los Angeles 80000 1500
New York 70000 1000
Marketing Beijing 60000 30000
Hong Kong 45000 23000
Determining the type of Object

Oftentimes you will need to determine the type of object being created after an operation in Python, for this the method type() is useful, for instance for the previous example:

result1=df.groupby(['Division','City']).agg({'Salary':'sum','Bonus':'sum'})
type(result1)
pandas.core.frame.DataFrame

The DataFrameGroupBy.agg() method allows aggregation operations based on built-in methods as well as your own defined functions. For instance the following example: (1) groups the dataframe df on the column Division, (2) aggregates the Salary column using the DataFrameGroupBy.sum() method, (3) aggregates the Bonus column using the DataFrameGroupBy.mean() method.

df.groupby('Division').agg({'Salary':'sum','Bonus':'mean'})
Salary Bonus
Division
Engineering 411000 20000.0
Finance 150000 1250.0
Marketing 105000 26500.0

The following code illustrates a combined grouping and aggregation operation based on the built-in method DataFrameGroupBy.mean() and a regular function called custom_func()

def custom_func(x):
    return x.max() - x.min()


df.groupby('Division').agg({'Salary':custom_func,'Bonus':'mean'})
Salary Bonus
Division
Engineering 214000 20000.0
Finance 10000 1250.0
Marketing 15000 26500.0
Renaming columns after groupby operations

After a grouping and aggregation operation it is advisable to rename the columns to avoid misinterpretations. For instance

groupedBySalaryBonus=df.groupby('Division').agg({'Salary':'sum','Bonus':'mean'})
groupedBySalaryBonus.rename(columns={'Salary':'TotalSalary','Bonus':'AverageBonus'},inplace=True)
groupedBySalaryBonus
TotalSalary AverageBonus
Division
Engineering 411000 20000.0
Finance 150000 1250.0
Marketing 105000 26500.0

Indeed you can use lambda functions to define an aggregation function, for instance:

df.groupby('Division').agg({'Bonus':'sum','Salary':lambda x: x.max() - x.min()})
Bonus Salary
Division
Engineering 60000 214000
Finance 2500 10000
Marketing 53000 15000

Refer to the following table for a list of built-in DataFrameGroupBy methods available to perform aggregation operations.

Method Description
any() Compute whether any of the values in the groups are truthy
all() Compute whether all of the values in the groups are truthy
count() Compute the number of non-NA values in the groups
cov() Compute the covariance of the groups
first() Compute the first occurring value in each group
idxmax() Compute the index of the maximum value in each group
idxmin() Compute the index of the minimum value in each group
last() Compute the last occurring value in each group
max() Compute the maximum value in each group
mean() Compute the mean of each group
median() Compute the median of each group
min() Compute the minimum value in each group
nunique() Compute the number of unique values in each group
prod() Compute the product of the values in each group
quantile() Compute a given quantile of the values in each group
sem() Compute the standard error of the mean of the values in each group
size() Compute the number of values in each group
skew() Compute the skew of the values in each group
std() Compute the standard deviation of the values in each group
sum() Compute the sum of the values in each group
var() Compute the variance of the values in each group

14.3 Grouping-Transformation Operations

Transformation operations in Pandas are useful when you want to perform some computation on groups and return an object that is the same size as the original DataFrame. This is different from grouping-aggregation operations that, usually, produces a single value per group (e.g. the mean ). Transformation operations are useful for statistical normalization, group-based feature engineering, and custom transformations.

Grouping-transformation operations are implemented combining the DataFrame.groupby() with the DataFrameGroupBy.transform() method.

Grouping-Transformation Operation

The following code illustrates a combined grouping and transformation operation based on the built-in method DataFrameGroupBy.cumsum(). It groups the dataframe df on the column Division, (2) computes the cumulative sum of the Age column within each group.

df
Name Age Division City Salary Bonus
EmployeeID
F0099 Alice 25 Finance New York 70000 1000
F1010 Bob 30 Finance Los Angeles 80000 1500
M2020 Charlie 35 Marketing Beijing 60000 30000
M2021 John 50 Marketing Hong Kong 45000 23000
E0100 Sean 47 Engineering London 125000 20000
E1000 Tobias 59 Engineering Beijing 36000 25000
E3434 Francois 55 Engineering London 250000 15000
df.groupby('Division')['Age'].transform('cumsum')
EmployeeID
F0099     25
F1010     55
M2020     35
M2021     85
E0100     47
E1000    106
E3434    161
Name: Age, dtype: int64

The following example groups the dataframe df on the column Division, (2) computes the rank of Salary column within each group.

df.groupby('Division')['Age'].transform('rank')
EmployeeID
F0099    1.0
F1010    2.0
M2020    1.0
M2021    2.0
E0100    1.0
E1000    3.0
E3434    2.0
Name: Age, dtype: float64

We usually want to add the output of a grouping-transformation operation back to the original DataFrame.

df['Age_Rank']=df.groupby('Division')['Age'].transform('rank')
df
Name Age Division City Salary Bonus Age_Rank
EmployeeID
F0099 Alice 25 Finance New York 70000 1000 1.0
F1010 Bob 30 Finance Los Angeles 80000 1500 2.0
M2020 Charlie 35 Marketing Beijing 60000 30000 1.0
M2021 John 50 Marketing Hong Kong 45000 23000 2.0
E0100 Sean 47 Engineering London 125000 20000 1.0
E1000 Tobias 59 Engineering Beijing 36000 25000 3.0
E3434 Francois 55 Engineering London 250000 15000 2.0
Rank based transformations

The previous example illustrates a quite common use case in practice: ordering within groups for instance for instance to rank customers based on their purchases or to rank commercial branches based on their sales performance.

A specially interesting use case in which transformations are indicated is the so called Z-Score Normalization whereby we need to compute the Z-score for each value within its group. A Z-score represents the number of standard deviations a data point is from the mean of its group. For this, you can use transform with a lambda function.

# Compute the Z-score within each group
df['Salary_Zscored'] = df.groupby('Division')['Salary'].transform(lambda x: (x - x.mean()) / x.std())
df
Name Age Division City Salary Bonus Age_Rank Salary_Zscored
EmployeeID
F0099 Alice 25 Finance New York 70000 1000 1.0 -0.707107
F1010 Bob 30 Finance Los Angeles 80000 1500 2.0 0.707107
M2020 Charlie 35 Marketing Beijing 60000 30000 1.0 0.707107
M2021 John 50 Marketing Hong Kong 45000 23000 2.0 -0.707107
E0100 Sean 47 Engineering London 125000 20000 1.0 -0.111624
E1000 Tobias 59 Engineering Beijing 36000 25000 3.0 -0.939504
E3434 Francois 55 Engineering London 250000 15000 2.0 1.051129

Refer to the following table for a list of built-in DataFrameGroupBy methods available to perform transformation operations.

Method Description
bfill() Back fill NA values within each group
cumcount() Compute the cumulative count within each group
cummax() Compute the cumulative max within each group
cummin() Compute the cumulative min within each group
cumprod() Compute the cumulative product within each group
cumsum() Compute the cumulative sum within each group
diff() Compute the difference between adjacent values within each group
ffill() Forward fill NA values within each group
pct_change() Compute the percent change between adjacent values within each group
rank() Compute the rank of each value within each group
shift() Shift values up or down within each group

14.4 Grouping-Filtration Operations

Filtration in Pandas allows you to filter out entire groups based on a condition. This is particularly useful when you’re working with grouped data and you want to exclude some groups based on aggregate statistics or other criteria. Refer to the following figure.

Grouping-Filtration Operation

The following example groups the dataframe df on the column Division, (2) filters the first row of each group

df.groupby('Division').head(1)
Name Age Division City Salary Bonus Age_Rank Salary_Zscored
EmployeeID
F0099 Alice 25 Finance New York 70000 1000 1.0 -0.707107
M2020 Charlie 35 Marketing Beijing 60000 30000 1.0 0.707107
E0100 Sean 47 Engineering London 125000 20000 1.0 -0.111624

The following example groups the dataframe df on the column Division, (2) extracts only groups in which the salary mean is larger than 70000.

df.groupby('Division').filter(lambda x: x['Salary'].mean() > 70000)
Name Age Division City Salary Bonus Age_Rank Salary_Zscored
EmployeeID
F0099 Alice 25 Finance New York 70000 1000 1.0 -0.707107
F1010 Bob 30 Finance Los Angeles 80000 1500 2.0 0.707107
E0100 Sean 47 Engineering London 125000 20000 1.0 -0.111624
E1000 Tobias 59 Engineering Beijing 36000 25000 3.0 -0.939504
E3434 Francois 55 Engineering London 250000 15000 2.0 1.051129

14.5 Data Grouping and Aggregation in Practice

14.5.1 First Example

Let’s have a look at the following Jupyter Notebook. This notebook illustrates the application of groupby operations to a healthcare dataset.

14.5.2 Second Example

The following Retail Example illustrates how to perform advanced data grouping and subsequent aggregations on a retail dataset

14.6 Conclusions

This chapter explores advanced grouping operations in pandas, building on the DataFrame.groupby() method to split DataFrames into groups for analysis. It introduces iterating over DataFrameGroupBy objects for fine-grained processing. Grouping-aggregation operations, implemented via agg(), allow for single or multi-column grouping with diverse aggregation methods, including built-in functions and custom or lambda functions. Grouping-transformation operations, using transform(), enable calculations such as cumulative sums, ranking within groups, and Z-score normalization, adding new columns to the original DataFrame. Lastly, grouping-filtration operations filter out entire groups based on conditions, such as excluding groups with low mean values. Practical use cases in digital marketing, healthcare, and retail highlight the real-world utility of these techniques, with links to illustrative Jupyter Notebooks.

In the next chapter you will explore how to perform advanced data retrieval from external datasources including SQL databases.

14.7 Further Readings

For those interested in additional examples and references on comprehensions feel free to check the following: