17  Advanced Data Transformations

At this point you have a good command on how to ingest, combine transform, select, filter and visualize datasets. Moreover you are already familiar with time series specific operations.

In this chapter you will learn how to perform advanced transformations on existing dataframes so that they can be further processed by a machine learning pipeline, a statistical technique or just ingested into a dashboarding service.

17.1 Long v.s. Wide format DataFrames

In data analysis, long format and wide format refer to two ways of organizing and representing datasets. Each has its use cases, depending on the type of analysis or visualization you need.

Long format

Long format DataFrames have a tidy data structure where each row represents one observation of a single variable or event. Typically, a long format has one identifier column, one variable name column, and one value column, this makes the number of columns fewer compared to a wide format. Since each variable is represented as a separate row, the total number of rows can be larger.

Long format works well when you need to track multiple observations (e.g., stock prices, sensor readings) over time. It is also easier to apply groupby(), aggregation, and pivoting operations on a long format dataset than on their wide format counterparts.

Long format is suitable for multidimensional data as it can handle multiple levels of categorical data (e.g., subject, condition, time) through additional identifier columns.

Wide format

Wide format DataFrames have only one row per entity or observation. Each row represents a single entity (e.g., a person, product, or event) with multiple variables as columns. Wide format has multiple columns, where each variable is stored in a separate column.

Wide format is more compact when you have multiple measurements or variables associated with each entity, making it easy to read and display as a table. Wide format facilitates cross-sectional analysis or summary tables as each row contains all relevant data points for an observation. Wide format is usually the preferred format for machine learning training.

The following figure illustrates the differences between long and wide format.

long v.s. wide formats

Long to Wide Transformation

Pandas offers several methods to transform long format DataFrames to wide format. These methods reshape the data, allowing you to summarize or pivot variables into columns for easier analysis. Main methods are DataFrame.pivot_table() and DataFrame.pivot().

The following example illustrates the use of DataFrame.pivot(). It transforms the dataframe df_long into a wide format, refer to the following figure.

long-wide-example
import pandas as pd

data = {'ID': ['A', 'A', 'B', 'B','C','C'], 
        'Subject': ['Maths', 'English', 'Maths', 'English','Maths','English'], 
        'Score': [85, 92, 78, 88, 95, 97]}

df_long = pd.DataFrame(data)

df_long
ID Subject Score
0 A Maths 85
1 A English 92
2 B Maths 78
3 B English 88
4 C Maths 95
5 C English 97
# Use pivot to reshape
df_wide = df_long.pivot(index='ID', columns='Subject', values='Score')
df_wide
Subject English Maths
ID
A 92 85
B 88 78
C 97 95

The DataFrame.pivot_table() works like DataFrame.pivot() but with the ability to aggregate duplicate entries. It is useful when the data contains duplicate rows for the same index-column pair, and you need to aggregate them.

# Use pivot_table with aggregation
df_wide = df_long.pivot_table(index='ID', columns='Subject', values='Score', aggfunc='mean')
df_wide
Subject English Maths
ID
A 92.0 85.0
B 88.0 78.0
C 97.0 95.0

Wide to Long Transformation

The DataFrame.melt() method is used to reshape data from wide format to long format. It “unpivots” a DataFrame, turning multiple columns into rows, with new columns to hold the variable names and their corresponding values.

The following example transforms a wide format dataframe into its long wide counterpart, refer to the following figure.

wide-long-example
# Sample DataFrame in wide format
df_wide = pd.DataFrame({
    'ID': ['A', 'B', 'C'],
    'Math': [85, 78, 95],
    'English': [92, 88, 97]
})

df_wide
ID Math English
0 A 85 92
1 B 78 88
2 C 95 97
# Convert wide format to long format using melt()

# id_vars='ID': The ID column remains unchanged, acting as the identifier for each observation.
# var_name='Subject': A new column named 'Subject' is created to hold the column names ('Math', 'English') from the original DataFrame.
# value_name='Score': A new column named 'Score' is created to store the values from the original columns.


df_long = pd.melt(df_wide, id_vars='ID', var_name='Subject', value_name='Score')
df_long
ID Subject Score
0 A Math 85
1 B Math 78
2 C Math 95
3 A English 92
4 B English 88
5 C English 97

An alternative to reshape DataFrames from wide format to long format is DataFrame.wide_to_long(). It is particularly useful when the column names contain patterns or suffixes. It converts a DataFrame with columns sharing the same prefix (or structure) into a long format by stacking them based on those patterns.

The following example illustrates the use of DataFrame.wide_to_long() to transform a wide format dataframe, this time however we need to interpret the content of some columns in advance.

import pandas as pd

# Sample DataFrame in wide format
df_wide = pd.DataFrame({
    'ID': [1, 2, 3],
    'Math_2020': [85, 78, 90],
    'Math_2021': [88, 82, 92]
})

df_wide
ID Math_2020 Math_2021
0 1 85 88
1 2 78 82
2 3 90 92
# Convert wide to long format using wide_to_long()

# stubnames='Math': This indicates that we want to reshape columns starting with "Math".
# i='ID':The ID column remains as the identifier for each row.
# j='Year':A new Year column is created to hold the suffix values (2020, 2021) extracted from the original column names.
# sep='_':The separator between the stub name ('Math') and the suffix ('2020', '2021') is _.

df_long = pd.wide_to_long(df_wide, stubnames='Math', i='ID', j='Year', sep='_')
df_long
Math
ID Year
1 2020 85
2 2020 78
3 2020 90
1 2021 88
2 2021 82
3 2021 92

17.2 Hierarchical Formats

As data practitioner you need to be prepared to deal with complex formats combining both long and wide formats such as the one shown in the following figure.

hierachical formats

DataFrame flattening, in data science jargon, is the process of transforming a complex, hierarchical structure (like multi-level columns or nested lists within columns) into a simpler, more straightforward structure. Depending on the type of structure you are working with, different methods can be used to “flatten” a DataFrame.

Explode

The DataFrame.explode() method in pandas is used to transform a column in a DataFrame that contains list-like or iterable elements (such as lists, tuples, or sets) into separate rows. This is useful when you have data in a single cell that needs to be expanded across multiple rows for further analysis. The rest of the DataFrame’s columns are repeated for each expanded row.

The following code illustrates the use of DataFrame.explode() to transform the original format into another that can be subsequently converted into a long format by using DataFrame.melt()

import pandas as pd

# Creating a hybrid DataFrame
df_hybrid = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],  # String (object) type
    'Age': [25, 30, 35],                  # Integer type
    'Salary': [70000.50, 80000, 90000.75],# Float type
    'Skills': [['Python', 'SQL'], ['Excel'], ['Python', 'Java']], # List of strings
    'Employed': [True, False, True],      # Boolean type
    'StartDate': pd.to_datetime(['2020-01-01', '2021-06-15', '2019-03-12']) # DateTime type
})

df_hybrid
Name Age Salary Skills Employed StartDate
0 Alice 25 70000.50 [Python, SQL] True 2020-01-01
1 Bob 30 80000.00 [Excel] False 2021-06-15
2 Charlie 35 90000.75 [Python, Java] True 2019-03-12
df_exploded = df_hybrid.explode('Skills')
df_exploded
Name Age Salary Skills Employed StartDate
0 Alice 25 70000.50 Python True 2020-01-01
0 Alice 25 70000.50 SQL True 2020-01-01
1 Bob 30 80000.00 Excel False 2021-06-15
2 Charlie 35 90000.75 Python True 2019-03-12
2 Charlie 35 90000.75 Java True 2019-03-12
df_long = pd.melt(df_exploded, id_vars=['Name'], 
                  value_vars=['Age', 'Salary', 'Skills', 'Employed', 'StartDate'], 
                  var_name='Attribute', value_name='Value')

df_long
Name Attribute Value
0 Alice Age 25
1 Alice Age 25
2 Bob Age 30
3 Charlie Age 35
4 Charlie Age 35
5 Alice Salary 70000.5
6 Alice Salary 70000.5
7 Bob Salary 80000.0
8 Charlie Salary 90000.75
9 Charlie Salary 90000.75
10 Alice Skills Python
11 Alice Skills SQL
12 Bob Skills Excel
13 Charlie Skills Python
14 Charlie Skills Java
15 Alice Employed True
16 Alice Employed True
17 Bob Employed False
18 Charlie Employed True
19 Charlie Employed True
20 Alice StartDate 2020-01-01 00:00:00
21 Alice StartDate 2020-01-01 00:00:00
22 Bob StartDate 2021-06-15 00:00:00
23 Charlie StartDate 2019-03-12 00:00:00
24 Charlie StartDate 2019-03-12 00:00:00

apply(pd.Series)

We use apply(pd.Series) to split elements within a column (e.g. lists, tuples, dictionaries) into multiple columns.

the following code illustrates the use of apply(pd.Series) to flatten the hierarchical dataframe df_complex.

# DataFrame with more complex nested data
df_complex = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Scores': [{'Math': 90, 'Science': 85}, {'Math': 75, 'Science': 80}]
})

df_complex
Name Scores
0 Alice {'Math': 90, 'Science': 85}
1 Bob {'Math': 75, 'Science': 80}
df_complex['Scores'].apply(pd.Series)
Math Science
0 90 85
1 75 80

Indeed, it is possible to attach the flattened structure to the original dataframe as follows:

scores_flattened=df_complex['Scores'].apply(pd.Series)

df_flattened=df_complex.join(scores_flattened)
df_flattened
Name Scores Math Science
0 Alice {'Math': 90, 'Science': 85} 90 85
1 Bob {'Math': 75, 'Science': 80} 75 80
Using Generative AI for coding purposes

Flattening complex dataframe structures can be time consuming and frustrating. For the previous example it is possible to submit the following prompt to your preferred generative AI assistant.

Flatten the following dataframe:

# DataFrame with more complex nested data
df_complex = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Scores': [{'Math': 90, 'Science': 85}, {'Math': 75, 'Science': 80}]
})

17.3 Handling Categorical Variables

Handling categorical variables in pandas is essential when working with datasets that contain categorical data (i.e., data that can take on a limited number of categories or values). Processing categorical information is relevant in statistical modeling as well as in machine learning (e.g. codification of an image into a discrete set of bits).

The good news is that Pandas provides several tools to work efficiently with categorical variables.

One-Hot Encoding

In Pandas, DataFrame.get_dummies() is a method used for one-hot encoding of categorical variables. It converts categorical columns (with discrete values) into multiple binary columns (0s and 1s) for each unique value in the original column.

The following code performs a one-hot encoding of the variable ‘City’. You will notice that for each city category a new column is created. This format is the one required to train machine learning systems or perform statistical analyses over the data.

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'City': ['New York', 'Los Angeles', 'New York', 'San Francisco', 'Los Angeles']
})

df
Name City
0 Alice New York
1 Bob Los Angeles
2 Charlie New York
3 Alice San Francisco
4 Bob Los Angeles
df_encoded = pd.get_dummies(df, columns=['City'],dtype='int')

df_encoded
Name City_Los Angeles City_New York City_San Francisco
0 Alice 0 1 0
1 Bob 1 0 0
2 Charlie 0 1 0
3 Alice 0 0 1
4 Bob 1 0 0

The method DataFrame.from_dummies() creates a categorical DataFrame from a DataFrame of dummy variables as illustrated in the following example:

df_encoded.set_index('Name',inplace=True)

pd.from_dummies(df_encoded)
Name
Alice City_New York
Bob City_Los Angeles
Charlie City_New York
Alice City_San Francisco
Bob City_Los Angeles

Label Encoding

Label encoding replaces each category with an integer value. This is useful when you want to preserve the ordinal relationship between categories, or when one-hot encoding would result in too many columns.

The following example converts a categorical variable to a numeric code.

# Convert the 'City' column to categorical type
df['City'] = df['City'].astype('category')

# Convert categorical 'City' to numeric codes
df['City_Code'] = df['City'].cat.codes

df
Name City City_Code
0 Alice New York 1
1 Bob Los Angeles 0
2 Charlie New York 1
3 Alice San Francisco 2
4 Bob Los Angeles 0

Manual Encoding

You can manually map categories to numbers using the Series.map() method. This is especially useful when categories have a specific order or meaning.

# Manual mapping
city_mapping = {'New York': 101, 'Los Angeles': 202, 'San Francisco': 303}
df['City_Mapped'] = df['City'].map(city_mapping)

df
Name City City_Code City_Mapped
0 Alice New York 1 101
1 Bob Los Angeles 0 202
2 Charlie New York 1 101
3 Alice San Francisco 2 303
4 Bob Los Angeles 0 202

17.4 Data Binning

Binning, also known as discretisation, is the process of converting continuous numerical variables into discrete categories or intervals. This technique is commonly used in data preprocessing and analysis to group continuous data into “bins” or ranges, making the data easier to interpret, analyze, or use in machine learning models.

Binning is Important in Data Science as it: (1) smooths out noise in data by grouping individual data points into broader categories, (2) reduces the number of unique values, binning makes the data more interpretable and easier to work with, (3) creates new features for machine learning models, especially in algorithms that prefer categorical data (e.g., decision trees).

Pandas provides the pandas.cut() and pandas.qcut() methods for binning.

Pandas cut

The pandas.cut() method is used to bin continuous data into discrete intervals. It is a way of segmenting or sorting data values into bins or categories, which makes it useful for grouping and analyzing continuous data.

The following example bins the column ‘Salary’ into three discrete categories

# Define bins for salary discretization
bins = [0, 75000, 85000, float('inf')]
labels = ['Low', 'Medium', 'High']

# Discretize the 'Salary' column
df_hybrid['Salary_Level'] = pd.cut(df_hybrid['Salary'], bins=bins, labels=labels)

df_hybrid
Name Age Salary Skills Employed StartDate Salary_Level
0 Alice 25 70000.50 [Python, SQL] True 2020-01-01 Low
1 Bob 30 80000.00 [Excel] False 2021-06-15 Medium
2 Charlie 35 90000.75 [Python, Java] True 2019-03-12 High

Pandas qcut

Pandas.qcut() is used to divide data into equal-sized quantiles. Instead of specifying the exact bin edges, qcut() automatically divides the data into equal-sized groups based on the quantiles.

## we drop the column 'Salary_Level'
df_hybrid.drop('Salary_Level',inplace=True,axis=1)

## We bin our data using qcut()
df_hybrid['Salary_Discretized'] = pd.qcut(df_hybrid['Salary'], q=2, labels=['Low', 'High'])
df_hybrid
Name Age Salary Skills Employed StartDate Salary_Discretized
0 Alice 25 70000.50 [Python, SQL] True 2020-01-01 Low
1 Bob 30 80000.00 [Excel] False 2021-06-15 Low
2 Charlie 35 90000.75 [Python, Java] True 2019-03-12 High

17.5 Advanced Data Transformation in Practice

First Example (Digital Marketing)

Let’s have a look at the following Jupyter Notebook. This notebook illustrates how to process real information from customer’s preferences regarding digital streaming content. This kind of transformations are required in machine learning contexts, for instance to train a recommender system.

Second Example (Healthcare)

The following notebook Jupyter Notebook illustrates some data transformations performed over a healthcare dataset for the purposes of operational benchmarking and visualization.

Third Example (Retail)

The following notebook Jupyter Notebook illustrates some data transformations performed over a retail dataset for the purposes of operational benchmarking and visualization.

17.6 Conclusion

This chapter delves into advanced data transformations, essential for preparing datasets for machine learning pipelines, statistical analyses, or dashboard integration. It contrasts long and wide data formats, explaining that long formats, with fewer columns but more rows, are ideal for multidimensional data and operations like aggregation and pivoting. Wide formats, on the other hand, are compact and better suited for machine learning models and summary analyses. Methods like DataFrame.pivot() and DataFrame.pivot_table() convert long data to wide formats, while methods like DataFrame.melt() and DataFrame.wide_to_long() reverse this transformation. Hierarchical and complex structures, such as nested or multi-level columns, can be flattened using DataFrame.explode(), apply(pd.Series), or related methods for streamlined processing.

The chapter also covers handling categorical variables through encoding techniques. One-hot encoding creates binary columns for each category, while label encoding assigns numeric codes to categories. Manual mapping is also explored for custom encodings. It introduces binning, which discretises continuous variables into categorical intervals using DataFrame.cut() for defined bins or DataFrame.qcut() for quantile-based bins, aiding in noise reduction and feature creation. Practical examples from digital marketing, healthcare, and retail highlight these techniques’ applications, illustrating their role in transforming raw data into formats ready for advanced analysis or modeling.

17.7 Further Readings

For those of you in need of additional, more advanced, topics please refer to the following references: