6  Data Grouping and Data Aggregation Operations

The ability to separate a DataFrame into groups based on some characteristic (e.g. customer ID) and perform computations over each group (e.g. purchases across time) is critical in data science, as it allows us to conduct group-based analyses, for instance performance benchmarking (e.g. customer churn, business unit revenues). In Pandas this is accomplished with data aggregation and grouping operations.

Data aggregation is the process of summarizing a dataset, usually to compute a metric (for instance the mean of a data series) or to condense large amounts of data into a more manageable form for subsequent processing or visualization.

Data grouping is the process of splitting a given dataset into groups based on some characteristic (e.g. regional area, hospital name, customer ID).

It is possible to perform quite complex data transformations by combining data grouping and aggregation. For example, as shown in the following figure, you might want to calculate the average sales per region from a dataset containing a large number of sales records, or simply summarize information to be consumed by business managers; a minimal code preview is given after the figure.

Uses of Data Aggregation and Grouping
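As a quick preview of what this looks like in code, here is a minimal sketch (with made-up data; each step is explained in detail in the sections below) that computes the average sales per region:

import pandas as pd

# Toy sales records (made-up values for illustration)
sales = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South'],
    'Sales': [200, 150, 250, 100]
})

# Group the rows by 'Region', then aggregate each group with mean()
print(sales.groupby('Region')['Sales'].mean())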

6.1 Data Aggregation

Pandas-based data aggregation is performed by applying a method, for instance mean(), to a specific column of a DataFrame, as shown in the following figure.

Pandas-based Data Aggregation

The following example illustrates how to compute the mean of a sales column. To do so, we apply the mean() method to the column ‘Sales’ of the DataFrame corporatesales.

import pandas as pd

# Sample data
data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South'],
    'Sales': [200, 150, 300, 400, 250, 100]
}

corporatesales = pd.DataFrame(data)

corporatesales
Region Sales
0 North 200
1 South 150
2 East 300
3 West 400
4 North 250
5 South 100
# Average Sales
mean_sales = corporatesales['Sales'].mean()
print("Average sales:\n", mean_sales)
Average sales:
 233.33333333333334

Following on from the previous example, we can compute the minimum and maximum sales by applying the min() and max() methods to the column ‘Sales’:

# Min Sales
min_sales = corporatesales['Sales'].min()
print("Minimum sales:\n", min_sales)
Minimum sales:
 100
# Max Sales
max_sales = corporatesales['Sales'].max()
print("Maximum sales:\n", max_sales)
Maximum sales:
 400

Pandas offers convenient methods to aggregate data from a given DataFrame, as summarized in the following table; a short demonstration of some of these methods follows the table. In future chapters you will learn how to develop and apply your own customized methods.

Function Name   Description
count           Number of non-null values
cumsum          Cumulative sum of non-null values
cumprod         Cumulative product of non-null values
first, last     First and last non-null values
mean            Mean of non-null values
median          Arithmetic median of non-null values
min, max        Minimum and maximum of non-null values
prod            Product of non-null values
size            Compute group sizes, returning the result as a Series
sum             Sum of non-null values
std, var        Sample standard deviation and variance
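Continuing with the corporatesales DataFrame defined above, here is a short sketch of a few of these methods in action:

# Aggregation methods applied to the 'Sales' column of corporatesales
print(corporatesales['Sales'].count())   # number of non-null values: 6
print(corporatesales['Sales'].median())  # arithmetic median: 225.0
print(corporatesales['Sales'].std())     # sample standard deviation
print(corporatesales['Sales'].cumsum())  # running total (returns a Series, not a scalar)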

6.2 Data Aggregation and Grouping Operations

Pandas provides data aggregation and grouping capabilities by way of the groupby() method. Pandas groupby() operations follow the split-apply-combine paradigm, whereby (1) a given DataFrame is split into groups, (2) a function is applied to each group, and (3) the results are combined into a new object (a DataFrame or a Series).

Example (Addition across groups)

The following figure illustrates a data aggregation and grouping operation that:

  1. Splits the original DataFrame df into groups based on the key ‘CustomerID’.
  2. Applies the method sum() to each of these groups.
  3. Combines the results into a new Series named total_sales.

Data Aggregation and Grouping Example

The code required to implement this operation is:

import pandas as pd

# Sample data
data = {
    'CustomerID': ['A', 'B', 'C', 'A', 'B', 'C','A', 'B', 'C'],
    'Purchases': [0,5,10,5,10,15,10,15,20]
}

df = pd.DataFrame(data)

df
CustomerID Purchases
0 A 0
1 B 5
2 C 10
3 A 5
4 B 10
5 C 15
6 A 10
7 B 15
8 C 20
# Group by CustomerID
grouped = df.groupby('CustomerID')

# Total Purchases per Customer
total_sales = grouped['Purchases'].sum()
print("Total Sales per Customer:\n", total_sales)
Total Sales per Customer:
 CustomerID
A    15
B    30
C    45
Name: Purchases, dtype: int64

It is possible to perform the same groupby operation with just one line of code, as follows:

# one liner groupby
total_sales = df.groupby('CustomerID')['Purchases'].sum()

print("Total Sales per Customer:\n", total_sales)
Total Sales per Customer:
 CustomerID
A    15
B    30
C    45
Name: Purchases, dtype: int64

Example (Counting across groups)

The following code illustrates how to count the number of elements across groups:

group_counts = df.groupby('CustomerID')['Purchases'].count()

print("Number of elements per group:\n", group_counts)
Number of elements per group:
 CustomerID
A    3
B    3
C    3
Name: Purchases, dtype: int64
Using Generative AI for coding purposes

Indeed, it is possible to rely on your favorite assistant to get starter code for any given groupby-aggregation operation. To do so you just need to submit a prompt such as the following:

“I need Python code to compute the top 3 values per group in a pandas DataFrame”
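An assistant would typically return something along the lines of the following sketch (shown here applied to the df DataFrame from the previous example); nlargest(3) keeps the three largest values within each group:

# Top 3 purchases per customer: a sketch of the kind of code an assistant might suggest
top3 = df.groupby('CustomerID')['Purchases'].nlargest(3)
print(top3)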

6.3 Data Grouping in Practice

6.3.1 First Example

Let’s have a look at the following Jupyter Notebook. This notebook processes the well-known Titanic dataset and performs several groupby operations to find differences in age and ticket price between males and females.

Challenge yourself by developing some code to determine the survival rate of females versus males. Try to work it out on your own before checking the solution below:

Code
# We split the Titanic dataset by sex and compute the mean of the 'Survived'
# column for each group; since 'Survived' is a 0/1 column, its mean is the survival rate
titanic.groupby('Sex')['Survived'].mean()
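If you do not have the notebook at hand, the following self-contained toy version (made-up rows, not the real Titanic data, but the same column names) shows the same mechanics:

import pandas as pd

# Made-up stand-in for the Titanic data (same column names, fictitious values)
titanic = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male', 'female', 'male'],
    'Survived': [1, 0, 1, 0, 0, 1]
})

# Survival rate per group: mean of the 0/1 'Survived' column
print(titanic.groupby('Sex')['Survived'].mean())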

6.3.2 Second Example

Let’s have a look at the following Jupyter Notebook. This notebook processes an econometric dataset and computes several economic metrics, such as life expectancy and gross domestic product (GDP), across countries. This allows us to determine, for instance, which countries have the largest life expectancy and whether there is any relationship between GDP and life expectancy.

Challenge yourself by developing some code to determine the relationship between GDP and life expectancy (tip: consider splitting by continent). Try to work it out on your own before checking the solution below:

Code
# We split the gapminder dataset by continent then we compute the mean
# for the columns lifeExp and gdpPercap
gapminderDataFrame.groupby('continent')[['lifeExp','gdpPercap']].mean()
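To quantify the relationship rather than just compare group means, one option (a sketch, assuming gapminderDataFrame is loaded as in the notebook with columns continent, lifeExp and gdpPercap) is to compute the correlation between the two columns within each continent:

# Pearson correlation between GDP per capita and life expectancy, per continent
corr_by_continent = gapminderDataFrame.groupby('continent')[['gdpPercap', 'lifeExp']].apply(
    lambda g: g['gdpPercap'].corr(g['lifeExp'])
)
print(corr_by_continent)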
Using Generative AI for coding purposes

In the previous example, try the following prompt on Google Gemini:

“How to determine the relationship between GDP and life expectancy?”

6.4 Conclusion

Data scientists must be proficient in the application of aggregation and grouping operations to process data and derive meaningful insights from it. Pandas groupby() is an extremely versatile operation, capable of processing complex time series, econometric panel data and much more. We have just touched the tip of the iceberg of what is possible to accomplish with groupby(); future chapters will illustrate more advanced uses of grouping operations. It is advisable to invest a significant amount of time mastering groupby operations; they are the bread and butter of any data scientist.

In the next chapter you will learn how to filter data based on conditions, which will allow you to select rows based on, for instance, gender or salary.

6.5 Further Readings

For those of you in need of additional, more advanced topics, please refer to the following references: