17  Advanced Data Transformations

At this point you have a good command of how to ingest, combine, transform, select, filter, and visualize datasets. Moreover, you are already familiar with time-series-specific operations.

In this chapter you will learn how to perform advanced transformations on existing dataframes so that they can be further processed by a machine learning pipeline, a statistical technique, or simply ingested into a dashboarding service.

17.1 Long vs. Wide Format DataFrames

In data analysis, long format and wide format refer to two ways of organizing and representing datasets. Each has its use cases, depending on the type of analysis or visualization you need.

Long format

Long format DataFrames have a tidy data structure where each row represents one observation of a single variable or event. Typically, a long format has one identifier column, one variable name column, and one value column. This makes the number of columns fewer than in wide format; since each variable is represented as a separate row, the total number of rows can be larger.

Long format works well when you need to track multiple observations (e.g., stock prices, sensor readings) over time. It is also easier to apply groupby(), aggregation, and pivoting operations on a long format dataset than on its wide format counterpart.

Long format is suitable for multidimensional data as it can handle multiple levels of categorical data (e.g., subject, condition, time) through additional identifier columns.
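As a sketch of why long format plays well with groupby(), consider the following hypothetical sensor-reading example (the sensor names and variables are invented for illustration). One groupby call summarizes every variable at once, with no per-column logic:

```python
import pandas as pd

# Hypothetical long-format sensor readings: one row per observation
df_long = pd.DataFrame({
    'sensor': ['s1', 's1', 's2', 's2', 's1', 's2'],
    'variable': ['temp', 'humidity', 'temp', 'humidity', 'temp', 'temp'],
    'value': [21.0, 40.0, 19.5, 42.0, 22.5, 20.0]
})

# A single groupby covers every variable at once
summary = df_long.groupby(['sensor', 'variable'])['value'].mean()
print(summary)
```

In wide format the same summary would require iterating over, or naming, each measurement column separately.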

Wide format

Wide format DataFrames have one row per entity or observation. Each row represents a single entity (e.g., a person, product, or event), and each variable is stored in a separate column, which typically results in more columns than in long format.

Wide format is more compact when you have multiple measurements or variables associated with each entity, making it easy to read and display as a table. Wide format facilitates cross-sectional analysis or summary tables as each row contains all relevant data points for an observation. Wide format is usually the preferred format for machine learning training.

The following figure illustrates the differences between long and wide format.

long vs. wide formats

Long to Wide Transformation

Pandas offers several methods to transform long format DataFrames to wide format. These methods reshape the data, allowing you to summarize or pivot variables into columns for easier analysis. The main methods are DataFrame.pivot() and DataFrame.pivot_table().

The following example illustrates the use of DataFrame.pivot(). It transforms the dataframe df_long into wide format, as shown in the following figure.

long-wide-example
import pandas as pd

data = {'ID': ['A', 'A', 'B', 'B','C','C'], 
        'Subject': ['Maths', 'English', 'Maths', 'English','Maths','English'], 
        'Score': [85, 92, 78, 88, 95, 97]}

df_long = pd.DataFrame(data)

df_long
ID Subject Score
0 A Maths 85
1 A English 92
2 B Maths 78
3 B English 88
4 C Maths 95
5 C English 97
# Use pivot to reshape
df_wide = df_long.pivot(index='ID', columns='Subject', values='Score')
df_wide
Subject English Maths
ID
A 92 85
B 88 78
C 97 95

DataFrame.pivot_table() works like DataFrame.pivot() but can additionally aggregate duplicate entries. It is useful when the data contains duplicate rows for the same index-column pair and you need to aggregate them.

# Use pivot_table with aggregation
df_wide = df_long.pivot_table(index='ID', columns='Subject', values='Score', aggfunc='mean')
df_wide
Subject English Maths
ID
A 92.0 85.0
B 88.0 78.0
C 97.0 95.0
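The df_long above contains no duplicates, so both methods give the same result. To see the difference, suppose student A had retaken the Maths exam (a duplicate ID-Subject pair, added here purely for illustration). DataFrame.pivot() rejects such data, while pivot_table() resolves it with the aggregation function:

```python
import pandas as pd

# Hypothetical duplicate: student A took the Maths exam twice
df_dup = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B', 'B'],
    'Subject': ['Maths', 'Maths', 'English', 'Maths', 'English'],
    'Score': [85, 91, 92, 78, 88]
})

try:
    df_dup.pivot(index='ID', columns='Subject', values='Score')
except ValueError as e:
    print('pivot() failed:', e)  # duplicate index/column pairs are rejected

# pivot_table() aggregates the duplicates instead of failing
df_wide = df_dup.pivot_table(index='ID', columns='Subject',
                             values='Score', aggfunc='mean')
print(df_wide)  # A's Maths entry becomes (85 + 91) / 2 = 88.0
```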

Wide to Long Transformation

The DataFrame.melt() method is used to reshape data from wide format to long format. It “unpivots” a DataFrame, turning multiple columns into rows, with new columns to hold the variable names and their corresponding values.

The following example transforms a wide format dataframe into its long format counterpart, as shown in the following figure.

wide-long-example
# Sample DataFrame in wide format
df_wide = pd.DataFrame({
    'ID': ['A', 'B', 'C'],
    'Math': [85, 78, 95],
    'English': [92, 88, 97]
})

df_wide
ID Math English
0 A 85 92
1 B 78 88
2 C 95 97
# Convert wide format to long format using melt()

# id_vars='ID': The ID column remains unchanged, acting as the identifier for each observation.
# var_name='Subject': A new column named 'Subject' is created to hold the column names ('Math', 'English') from the original DataFrame.
# value_name='Score': A new column named 'Score' is created to store the values from the original columns.


df_long = pd.melt(df_wide, id_vars='ID', var_name='Subject', value_name='Score')
df_long
ID Subject Score
0 A Math 85
1 B Math 78
2 C Math 95
3 A English 92
4 B English 88
5 C English 97

An alternative for reshaping DataFrames from wide format to long format is pandas.wide_to_long(). It is particularly useful when the column names contain patterns or suffixes: it converts a DataFrame with columns sharing the same prefix (or structure) into long format by stacking them based on those patterns.

The following example illustrates the use of pandas.wide_to_long() to transform a wide format dataframe; this time, however, the column names themselves carry information (a subject prefix and a year suffix) that must be interpreted during the reshape.

import pandas as pd

# Sample DataFrame in wide format
df_wide = pd.DataFrame({
    'ID': [1, 2, 3],
    'Math_2020': [85, 78, 90],
    'Math_2021': [88, 82, 92]
})

df_wide
ID Math_2020 Math_2021
0 1 85 88
1 2 78 82
2 3 90 92
# Convert wide to long format using wide_to_long()

# stubnames='Math': This indicates that we want to reshape columns starting with "Math".
# i='ID':The ID column remains as the identifier for each row.
# j='Year':A new Year column is created to hold the suffix values (2020, 2021) extracted from the original column names.
# sep='_':The separator between the stub name ('Math') and the suffix ('2020', '2021') is _.

df_long = pd.wide_to_long(df_wide, stubnames='Math', i='ID', j='Year', sep='_')
df_long
Math
ID Year
1 2020 85
2 2020 78
3 2020 90
1 2021 88
2 2021 82
3 2021 92
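When several measurements share the same suffix pattern, stubnames also accepts a list, and all stubs are reshaped in one call. The sketch below extends the previous example with hypothetical English columns:

```python
import pandas as pd

# Hypothetical wide data with two stub prefixes per year
df_wide = pd.DataFrame({
    'ID': [1, 2],
    'Math_2020': [85, 78], 'Math_2021': [88, 82],
    'English_2020': [90, 84], 'English_2021': [93, 86]
})

# Both stubs are stacked together; 'Year' collects the shared suffixes
df_long = pd.wide_to_long(df_wide, stubnames=['Math', 'English'],
                          i='ID', j='Year', sep='_')
print(df_long)
```

The result is indexed by (ID, Year) with one column per stub, i.e. Math and English.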

17.2 Hierarchical Formats

As a data practitioner you need to be prepared to deal with complex structures combining both long and wide formats, such as the one shown in the following figure.

hierarchical formats

DataFrame flattening, in data science jargon, is the process of transforming a complex, hierarchical structure (like multi-level columns or nested lists within columns) into a simpler, more straightforward structure. Depending on the type of structure you are working with, different methods can be used to “flatten” a DataFrame.

Explode

The DataFrame.explode() method in pandas is used to transform a column in a DataFrame that contains list-like or iterable elements (such as lists, tuples, or sets) into separate rows. This is useful when you have data in a single cell that needs to be expanded across multiple rows for further analysis. The rest of the DataFrame’s columns are repeated for each expanded row.

The following code illustrates the use of DataFrame.explode() to transform the original format into another that can subsequently be converted into long format by using DataFrame.melt().

import pandas as pd

# Creating a hybrid DataFrame
df_hybrid = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],  # String (object) type
    'Age': [25, 30, 35],                  # Integer type
    'Salary': [70000.50, 80000, 90000.75],# Float type
    'Skills': [['Python', 'SQL'], ['Excel'], ['Python', 'Java']], # List of strings
    'Employed': [True, False, True],      # Boolean type
    'StartDate': pd.to_datetime(['2020-01-01', '2021-06-15', '2019-03-12']) # DateTime type
})

df_hybrid
Name Age Salary Skills Employed StartDate
0 Alice 25 70000.50 [Python, SQL] True 2020-01-01
1 Bob 30 80000.00 [Excel] False 2021-06-15
2 Charlie 35 90000.75 [Python, Java] True 2019-03-12
df_exploded = df_hybrid.explode('Skills')
df_exploded
Name Age Salary Skills Employed StartDate
0 Alice 25 70000.50 Python True 2020-01-01
0 Alice 25 70000.50 SQL True 2020-01-01
1 Bob 30 80000.00 Excel False 2021-06-15
2 Charlie 35 90000.75 Python True 2019-03-12
2 Charlie 35 90000.75 Java True 2019-03-12
df_long = pd.melt(df_exploded, id_vars=['Name'], 
                  value_vars=['Age', 'Salary', 'Skills', 'Employed', 'StartDate'], 
                  var_name='Attribute', value_name='Value')

df_long
Name Attribute Value
0 Alice Age 25
1 Alice Age 25
2 Bob Age 30
3 Charlie Age 35
4 Charlie Age 35
5 Alice Salary 70000.5
6 Alice Salary 70000.5
7 Bob Salary 80000.0
8 Charlie Salary 90000.75
9 Charlie Salary 90000.75
10 Alice Skills Python
11 Alice Skills SQL
12 Bob Skills Excel
13 Charlie Skills Python
14 Charlie Skills Java
15 Alice Employed True
16 Alice Employed True
17 Bob Employed False
18 Charlie Employed True
19 Charlie Employed True
20 Alice StartDate 2020-01-01 00:00:00
21 Alice StartDate 2020-01-01 00:00:00
22 Bob StartDate 2021-06-15 00:00:00
23 Charlie StartDate 2019-03-12 00:00:00
24 Charlie StartDate 2019-03-12 00:00:00

apply(pd.Series)

We use apply(pd.Series) to split elements within a column (e.g. lists, tuples, dictionaries) into multiple columns.

The following code illustrates the use of apply(pd.Series) to flatten the hierarchical dataframe df_complex.

# DataFrame with more complex nested data
df_complex = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Scores': [{'Math': 90, 'Science': 85}, {'Math': 75, 'Science': 80}]
})

df_complex
Name Scores
0 Alice {'Math': 90, 'Science': 85}
1 Bob {'Math': 75, 'Science': 80}
df_complex['Scores'].apply(pd.Series)
Math Science
0 90 85
1 75 80

It is also possible to attach the flattened structure to the original dataframe as follows:

scores_flattened=df_complex['Scores'].apply(pd.Series)

df_flattened=df_complex.join(scores_flattened)
df_flattened
Name Scores Math Science
0 Alice {'Math': 90, 'Science': 85} 90 85
1 Bob {'Math': 75, 'Science': 80} 75 80
Using Generative AI for coding purposes

Flattening complex dataframe structures can be time-consuming and frustrating. For the previous example, it is possible to submit the following prompt to your preferred generative AI assistant.

Flatten the following dataframe:

# DataFrame with more complex nested data
df_complex = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Scores': [{'Math': 90, 'Science': 85}, {'Math': 75, 'Science': 80}]
})

17.3 Handling Categorical Variables

Handling categorical variables in pandas is essential when working with datasets that contain categorical data (i.e., data that can take on a limited number of categories or values). Processing categorical information is relevant in statistical modeling as well as in machine learning (e.g. codification of an image into a discrete set of bits).

The good news is that Pandas provides several tools to work efficiently with categorical variables.

One-Hot Encoding

In Pandas, pd.get_dummies() is a function used for one-hot encoding of categorical variables. It converts categorical columns (with discrete values) into multiple binary columns (0s and 1s), one for each unique value in the original column.

The following code performs one-hot encoding of the variable 'City'. Notice that a new column is created for each city category. This is the format required to train machine learning systems or to perform statistical analyses on the data.

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'City': ['New York', 'Los Angeles', 'New York', 'San Francisco', 'Los Angeles']
})

df
Name City
0 Alice New York
1 Bob Los Angeles
2 Charlie New York
3 Alice San Francisco
4 Bob Los Angeles
df_encoded = pd.get_dummies(df, columns=['City'],dtype='int')

df_encoded
Name City_Los Angeles City_New York City_San Francisco
0 Alice 0 1 0
1 Bob 1 0 0
2 Charlie 0 1 0
3 Alice 0 0 1
4 Bob 1 0 0

The function pd.from_dummies() reconstructs a categorical DataFrame from a DataFrame of dummy variables, as illustrated in the following example:

df_encoded.set_index('Name',inplace=True)

pd.from_dummies(df_encoded)
Name
Alice City_New York
Bob City_Los Angeles
Charlie City_New York
Alice City_San Francisco
Bob City_Los Angeles

Label Encoding

Label encoding replaces each category with an integer value. This is useful when one-hot encoding would result in too many columns, or when you want to preserve an ordinal relationship between categories as numbers.

The following example converts a categorical variable to a numeric code.

# Convert the 'City' column to categorical type
df['City'] = df['City'].astype('category')

# Convert categorical 'City' to numeric codes
df['City_Code'] = df['City'].cat.codes

df
Name City City_Code
0 Alice New York 1
1 Bob Los Angeles 0
2 Charlie New York 1
3 Alice San Francisco 2
4 Bob Los Angeles 0
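Note that cat.codes assigns integers alphabetically by default, which may not match a meaningful order. When the categories are genuinely ordinal, you can declare the order explicitly before taking the codes. The sketch below uses a hypothetical 'Size' column added for illustration:

```python
import pandas as pd

df_sizes = pd.DataFrame({'Size': ['Medium', 'Small', 'Large', 'Small']})

# Declare the ordinal relationship explicitly via an ordered categorical dtype
size_type = pd.CategoricalDtype(categories=['Small', 'Medium', 'Large'],
                                ordered=True)
df_sizes['Size'] = df_sizes['Size'].astype(size_type)

# Codes now follow the declared order: Small=0, Medium=1, Large=2
df_sizes['Size_Code'] = df_sizes['Size'].cat.codes
print(df_sizes)
```

An ordered categorical also makes comparisons like df_sizes['Size'] >= 'Medium' meaningful.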

Manual Encoding

You can manually map categories to numbers using the Series.map() method. This is especially useful when categories have a specific order or meaning.

# Manual mapping
city_mapping = {'New York': 101, 'Los Angeles': 202, 'San Francisco': 303}
df['City_Mapped'] = df['City'].map(city_mapping)

df
Name City City_Code City_Mapped
0 Alice New York 1 101
1 Bob Los Angeles 0 202
2 Charlie New York 1 101
3 Alice San Francisco 2 303
4 Bob Los Angeles 0 202

17.4 Data Binning

Binning, also known as discretisation, is the process of converting continuous numerical variables into discrete categories or intervals. This technique is commonly used in data preprocessing and analysis to group continuous data into “bins” or ranges, making the data easier to interpret, analyze, or use in machine learning models.

Binning is important in data science because it: (1) smooths out noise in data by grouping individual data points into broader categories, (2) reduces the number of unique values, making the data more interpretable and easier to work with, and (3) creates new features for machine learning models, especially for algorithms that prefer categorical data (e.g., decision trees).

Pandas provides the pandas.cut() and pandas.qcut() functions for binning.

Pandas cut

The pandas.cut() method is used to bin continuous data into discrete intervals. It is a way of segmenting or sorting data values into bins or categories, which makes it useful for grouping and analyzing continuous data.

The following example bins the column 'Salary' into three discrete categories.

# Define bins for salary discretization
bins = [0, 75000, 85000, float('inf')]
labels = ['Low', 'Medium', 'High']

# Discretize the 'Salary' column
df_hybrid['Salary_Level'] = pd.cut(df_hybrid['Salary'], bins=bins, labels=labels)

df_hybrid
Name Age Salary Skills Employed StartDate Salary_Level
0 Alice 25 70000.50 [Python, SQL] True 2020-01-01 Low
1 Bob 30 80000.00 [Excel] False 2021-06-15 Medium
2 Charlie 35 90000.75 [Python, Java] True 2019-03-12 High

Pandas qcut

pandas.qcut() is used to divide data into quantile-based groups. Instead of specifying exact bin edges, qcut() computes the edges automatically so that each bin contains approximately the same number of observations.

## we drop the column 'Salary_Level'
df_hybrid.drop('Salary_Level',inplace=True,axis=1)

## We bin our data using qcut()
df_hybrid['Salary_Discretized'] = pd.qcut(df_hybrid['Salary'], q=2, labels=['Low', 'High'])
df_hybrid
Name Age Salary Skills Employed StartDate Salary_Discretized
0 Alice 25 70000.50 [Python, SQL] True 2020-01-01 Low
1 Bob 30 80000.00 [Excel] False 2021-06-15 Low
2 Charlie 35 90000.75 [Python, Java] True 2019-03-12 High

17.5 Code Interpretation Challenge

Below are some examples of Python code. Try to understand what each snippet accomplishes before checking the explanation that follows it:

Example: Expanding the Titanic dataset

import pandas as pd

# URL of the Titanic dataset
url = 'https://raw.githubusercontent.com/thousandoaks/Python4DS-I/refs/heads/main/datasets/Titanic-Dataset.csv'

# Load the dataset into a DataFrame
df = pd.read_csv(url)



# Take a random sample of 5 observations from the dataset
titanic_sample = df.sample(n=5, random_state=42)

# Transform the DataFrame into long format
titanic_long = titanic_sample.melt(id_vars=['PassengerId'], var_name='Attribute', value_name='Value')

# Display the transformed DataFrame
display(titanic_long)

The melt() function transforms the sampled DataFrame from wide to long format: id_vars=['PassengerId'] keeps PassengerId as the identifier column, var_name='Attribute' stores the original column names in an Attribute column, and value_name='Value' stores the corresponding values in a Value column.

PassengerId Attribute Value
0 710 Survived 1
1 440 Survived 0
2 841 Survived 0
3 721 Survived 1
4 40 Survived 1
5 710 Pclass 3
6 440 Pclass 2
7 841 Pclass 3
8 721 Pclass 2
9 40 Pclass 3
10 710 Name Moubarek, Master. Halim Gonios ("William George")
11 440 Name Kvillner, Mr. Johan Henrik Johannesson
12 841 Name Alhomaki, Mr. Ilmari Rudolf
13 721 Name Harper, Miss. Annie Jessie "Nina"
14 40 Name Nicola-Yarred, Miss. Jamila
15 710 Sex male
16 440 Sex male
17 841 Sex male
18 721 Sex female
19 40 Sex female
20 710 Age NaN
21 440 Age 31.0
22 841 Age 20.0
23 721 Age 6.0
24 40 Age 14.0
25 710 SibSp 1
26 440 SibSp 0
27 841 SibSp 0
28 721 SibSp 0
29 40 SibSp 1
30 710 Parch 1
31 440 Parch 0
32 841 Parch 0
33 721 Parch 1
34 40 Parch 0
35 710 Ticket 2661
36 440 Ticket C.A. 18723
37 841 Ticket SOTON/O2 3101287
38 721 Ticket 248727
39 40 Ticket 2651
40 710 Fare 15.2458
41 440 Fare 10.5
42 841 Fare 7.925
43 721 Fare 33.0
44 40 Fare 11.2417
45 710 Cabin NaN
46 440 Cabin NaN
47 841 Cabin NaN
48 721 Cabin NaN
49 40 Cabin NaN
50 710 Embarked C
51 440 Embarked S
52 841 Embarked S
53 721 Embarked S
54 40 Embarked C

Example: One Hot encoding of the Titanic dataset

import pandas as pd

# URL of the Titanic dataset
url = 'https://raw.githubusercontent.com/thousandoaks/Python4DS-I/refs/heads/main/datasets/Titanic-Dataset.csv'

# Load the dataset into a DataFrame
df = pd.read_csv(url)

df_reduced=df[['PassengerId','Survived','Name','Pclass']]



# Perform one-hot encoding on the 'Pclass' column
titanic_encoded = pd.get_dummies(df_reduced, columns=['Pclass'], prefix='Class')

# Display the resulting DataFrame
display(titanic_encoded.head())

The code applies one-hot encoding to the Pclass column using pd.get_dummies() to create binary columns for each class, and displays the transformed DataFrame.

PassengerId Survived Name Class_1 Class_2 Class_3
0 1 0 Braund, Mr. Owen Harris False False True
1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... True False False
2 3 1 Heikkinen, Miss. Laina False False True
3 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) True False False
4 5 0 Allen, Mr. William Henry False False True

Example: Binning the Titanic dataset

import pandas as pd

# URL of the Titanic dataset
url = 'https://raw.githubusercontent.com/thousandoaks/Python4DS-I/refs/heads/main/datasets/Titanic-Dataset.csv'

# Load the dataset into a DataFrame
df = pd.read_csv(url)

# Define age bins and labels
age_bins = [0, 10, 20, 40, 60, 100]
age_labels = ['0-10', '10-20', '20-40', '40-60', '60-100']

# Create a new column 'AgeGroup' with binned age values
df['AgeGroup'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=False)

# Display the updated DataFrame with the new 'AgeGroup' column
print(df[['Name','Age', 'AgeGroup']].head(10))

The code bins the continuous Age column into discrete age groups using pd.cut() with explicit bin edges and labels, and displays the new AgeGroup column. With right=False, each bin includes its left edge and excludes its right edge.

                                                Name   Age AgeGroup
0                            Braund, Mr. Owen Harris  22.0    20-40
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0    20-40
2                             Heikkinen, Miss. Laina  26.0    20-40
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0    20-40
4                           Allen, Mr. William Henry  35.0    20-40
5                                   Moran, Mr. James   NaN      NaN
6                            McCarthy, Mr. Timothy J  54.0    40-60
7                     Palsson, Master. Gosta Leonard   2.0     0-10
8  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  27.0    20-40
9                Nasser, Mrs. Nicholas (Adele Achem)  14.0    10-20

17.6 Advanced Data Transformation in Practice

First Example (Digital Marketing)

Let’s have a look at the following Jupyter Notebook. This notebook illustrates how to process real information about customers’ preferences regarding digital streaming content. These kinds of transformations are required in machine learning contexts, for instance to train a recommender system.

Second Example (Healthcare)

The following Jupyter Notebook illustrates some data transformations performed over a healthcare dataset for the purposes of operational benchmarking and visualization.

Third Example (Retail)

The following Jupyter Notebook illustrates some data transformations performed over a retail dataset for the purposes of operational benchmarking and visualization.

17.7 Conclusion

This chapter delves into advanced data transformations, essential for preparing datasets for machine learning pipelines, statistical analyses, or dashboard integration. It contrasts long and wide data formats, explaining that long formats, with fewer columns but more rows, are ideal for multidimensional data and operations like aggregation and pivoting. Wide formats, on the other hand, are compact and better suited for machine learning models and summary analyses. Methods like DataFrame.pivot() and DataFrame.pivot_table() convert long data to wide format, while DataFrame.melt() and pandas.wide_to_long() reverse this transformation. Hierarchical and complex structures, such as nested or multi-level columns, can be flattened using DataFrame.explode(), apply(pd.Series), or related methods for streamlined processing.

The chapter also covers handling categorical variables through encoding techniques. One-hot encoding creates binary columns for each category, while label encoding assigns numeric codes to categories. Manual mapping is also explored for custom encodings. It introduces binning, which discretises continuous variables into categorical intervals using pandas.cut() for defined bins or pandas.qcut() for quantile-based bins, aiding in noise reduction and feature creation. Practical examples from digital marketing, healthcare, and retail highlight these techniques’ applications, illustrating their role in transforming raw data into formats ready for advanced analysis or modeling.

17.8 Further Readings

For those of you in need of additional, more advanced topics, please refer to the following references: