10 Data Joining Operations

So far we have taken for granted that the data we required to carry out our analysis was available to us in a single, convenient format (e.g. csv,xls). This is usually not the case, data sources are scattered around the organization or the internet and require prior integration before any analysis or AI training can take place.

Some not so good news is that you will be spending a significant amount of your time as a data scientist, or ML engineer for that matter, chasing data and have it patched together before any meaningful analysis or AI training can take place. The good news is that Pandas offers several methods that allow you to manipulate and combine external datasets into ready for productions Pandas DataFrames.

In previous chapters we have practiced the processing and manipulation of individual DataFrames, in this chapter we will broaden our expertise by learning how to combine different datasets into a single DataFrame which is ready for further processing or analysis.

10.1 Concat Operations

In the case of DataFrames which are similar in the sense of having the same column structure, you can use the Pandas.concat() method to combine several DataFrames into one. The Pandas.concat() operation in pandas is used to stitch together or stack multiple dataframes along a particular axis (either rows or columns), refer to the following figure:

Example

Say we have two DataFrames, df1 and df2 and we need to concatenate both, one after the other, into a single DataFrame. We use the method Pandas.concat() as follows:

import pandas as pd

# Sample data
data1 = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000, 80000, 120000],
    'Bonus' : [15000, 10000, 80000]

}

df1 = pd.DataFrame(data1)
df1

	Name	Age	Salary	Bonus
0	Alice	25	70000	15000
1	Bob	30	80000	10000
2	Charlie	35	120000	80000

# Sample data
data2 = {
    'Name': ['Vicent', 'Mahmoud', 'FangFang'],
    'Age': [40, 39, 35],
    'Salary': [30000, 84000, 140000],
    'Bonus' : [1500, 8000, 5500]

}

df2 = pd.DataFrame(data2)
df2

	Name	Age	Salary	Bonus
0	Vicent	40	30000	1500
1	Mahmoud	39	84000	8000
2	FangFang	35	140000	5500

singledataframe=pd.concat([df1,df2])
singledataframe

	Name	Age	Salary	Bonus
0	Alice	25	70000	15000
1	Bob	30	80000	10000
2	Charlie	35	120000	80000
0	Vicent	40	30000	1500
1	Mahmoud	39	84000	8000
2	FangFang	35	140000	5500

Example

In the previous example you might have noticed that in singledataframe the index of the DataFrame is not unique (e.g. Bob and Mahmoud have the same index number ‘1’). If you need to generate a new index that is unique for each row you can specify the argument ignore_index=True as follows:

singledataframe=pd.concat([df1,df2],ignore_index=True)
singledataframe

	Name	Age	Salary	Bonus
0	Alice	25	70000	15000
1	Bob	30	80000	10000
2	Charlie	35	120000	80000
3	Vicent	40	30000	1500
4	Mahmoud	39	84000	8000
5	FangFang	35	140000	5500

Example

Now for a more realistic example, the following example processes two external datasets and have then concatenated for subsequent analysis.

import pandas as pd

air_quality_no2 = pd.read_csv("https://raw.githubusercontent.com/thousandoaks/Python4DS102/main/data/air_quality_no2_long.csv",
                              parse_dates=True)



air_quality_no2 = air_quality_no2[["date.utc", "location",
                                   "parameter", "value"]]


air_quality_no2.head()

	date.utc	location	parameter	value
0	2019-06-21 00:00:00+00:00	FR04014	no2	20.0
1	2019-06-20 23:00:00+00:00	FR04014	no2	21.8
2	2019-06-20 22:00:00+00:00	FR04014	no2	26.5
3	2019-06-20 21:00:00+00:00	FR04014	no2	24.9
4	2019-06-20 20:00:00+00:00	FR04014	no2	21.4

air_quality_pm25 = pd.read_csv("https://raw.githubusercontent.com/thousandoaks/Python4DS102/main/data/air_quality_pm25_long.csv",
                               parse_dates=True)


air_quality_pm25 = air_quality_pm25[["date.utc", "location",
                                     "parameter", "value"]]

air_quality_pm25.head()

	date.utc	location	parameter	value
0	2019-06-18 06:00:00+00:00	BETR801	pm25	18.0
1	2019-06-17 08:00:00+00:00	BETR801	pm25	6.5
2	2019-06-17 07:00:00+00:00	BETR801	pm25	18.5
3	2019-06-17 06:00:00+00:00	BETR801	pm25	16.0
4	2019-06-17 05:00:00+00:00	BETR801	pm25	7.5

By way of Pandas.concatenate() We combine the measurements of NO2 and PM25, two tables with a similar structure, in a single table.

air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0)
air_quality

	date.utc	location	parameter	value
0	2019-06-18 06:00:00+00:00	BETR801	pm25	18.0
1	2019-06-17 08:00:00+00:00	BETR801	pm25	6.5
2	2019-06-17 07:00:00+00:00	BETR801	pm25	18.5
3	2019-06-17 06:00:00+00:00	BETR801	pm25	16.0
4	2019-06-17 05:00:00+00:00	BETR801	pm25	7.5
...	...	...	...	...
2063	2019-05-07 06:00:00+00:00	London Westminster	no2	26.0
2064	2019-05-07 04:00:00+00:00	London Westminster	no2	16.0
2065	2019-05-07 03:00:00+00:00	London Westminster	no2	19.0
2066	2019-05-07 02:00:00+00:00	London Westminster	no2	19.0
2067	2019-05-07 01:00:00+00:00	London Westminster	no2	23.0

3178 rows × 4 columns

Feel free to click on the following Jupyter Notebook to run the code on Googe Colab.

10.2 Merge Operations

Pandas concat operations are useful when we need to integrate DataFrames with similar structures (e.g. similar columns). Combining DaFrames which are not similar, in the sense of not having the same column structure, can be performed using the Pandas.merge() method. This method allows you to combine Pandas DataFrames based on a common key (e.g. customer ID), refer to the following figure.

Example

Say we have two DataFrames, df1 and df2 and we need to combine both into a single DataFrame result based on a common key Name. We use the method Pandas.concat() as follows:

df1 = pd.DataFrame({'Name': ['Anne', 'Bob', 'Charlie'], 'Age': [31, 42, 55]})
df2 = pd.DataFrame({'Name': ['Bob', 'Charlie', 'Doron'], 'Salary': [45000, 55500, 66700]})
df1

	Name	Age
0	Anne	31
1	Bob	42
2	Charlie	55

df2

	Name	Salary
0	Bob	45000
1	Charlie	55500
2	Doron	66700

result = pd.merge(df1, df2, on='Name')
result

	Name	Age	Salary
0	Bob	42	45000
1	Charlie	55	55500

10.2.1 Inner Merge operations

In the previous example you might have noticed that only those observations having the same key Name were merged into the result DataFrame. This type of merging is called ‘inner’ and it is useful when you nee to keep only the rows where keys from both DataFrames match, the syntax is:

Merge Type	Description	Example Syntax
Inner Join	Keeps only the rows where keys from both dataframes match	`pd.merge(df1, df2, on='key', how='inner')`

Example

Say we have two DataFrames left and right each containing information from our company employees. We want to combine both DataFrames into a single one name result. We use the EmployeeID as the key which to merge on.

 left = pd.DataFrame(
   {
    "EmployeeID": ["K0", "K1", "K2", "K3","K4"],
     "Name": ["Anne", "Bob", "Lola", "Martin","Zoe"],
       "Unit": ["Marketing", "Sales", "Engineering", "Finance","Human Resources"],
    }
 )
left

	EmployeeID	Name	Unit
0	K0	Anne	Marketing
1	K1	Bob	Sales
2	K2	Lola	Engineering
3	K3	Martin	Finance
4	K4	Zoe	Human Resources

 right = pd.DataFrame(
    {
        "EmployeeID": ["K0", "K1", "K2", "K3"],
        "Salary": [52000, 75200, 150000, 25000],
        "Bonus": ["5500", "32600", "25000", "15000"],
    }
 )
right

	EmployeeID	Salary	Bonus
0	K0	52000	5500
1	K1	75200	32600
2	K2	150000	25000
3	K3	25000	15000

 result = pd.merge(left, right, on="EmployeeID",how='inner')

 result

	EmployeeID	Name	Unit	Salary	Bonus
0	K0	Anne	Marketing	52000	5500
1	K1	Bob	Sales	75200	32600
2	K2	Lola	Engineering	150000	25000
3	K3	Martin	Finance	25000	15000

Where is Zoe?

Notice that in the previous example the employee named ‘Zoe’ is missing from the resulting DataFrame. This is the case as her key ‘K4’ is missing in the second DataFrame and inner operations keep only the rows where keys from both dataframes match

10.2.2 Outer Merge operations

Sometimes you need to merge two DataFrames keeping all rows from both and filling in NaN (Not a Number) where there’s no match. This type of merging is called ‘outer’ and the syntax is:

Merge Type	Description	Example Syntax
Outer Join	Keeps all rows from both dataframes, filling in `NaN` where there’s no match	`pd.merge(df1, df2, on='key', how='outer')`

Example

Continuing with the previous example in this case we want to keep all employees in the resulting DataFrame result making sure that missing information is flagged as NaN

 result = pd.merge(left, right, on="EmployeeID",how='outer')

 result

	EmployeeID	Name	Unit	Salary	Bonus
0	K0	Anne	Marketing	52000.0	5500
1	K1	Bob	Sales	75200.0	32600
2	K2	Lola	Engineering	150000.0	25000
3	K3	Martin	Finance	25000.0	15000
4	K4	Zoe	Human Resources	NaN	NaN

Zoe is back

Notice that this time the employee named ‘Zoe’ shows up in the resulting DataFrame yet there is no information on her salary and bonus. This is the case as in the second DataFrame right there isn’t any entry for Zoe’s key K4.

10.2.3 Left Merge operations

Sometimes you need to merge two DataFrames keeping all rows from the left DataFrame both and filling in NaN where there’s no match. This type of merging is called ‘left’ and the syntax is:

Merge Type	Description	Example Syntax
Left Join	Keeps all rows from the left dataframe and matches from the right dataframe	`pd.merge(df1, df2, on='key', how='left')`

Example

Continuing with the previous example in this case we want to keep all rows from the left DataFrame.

in the resulting DataFrame result making sure that missing information is flagged as NaN

 result = pd.merge(left, right, on="EmployeeID",how='left')

 result

	EmployeeID	Name	Unit	Salary	Bonus
0	K0	Anne	Marketing	52000.0	5500
1	K1	Bob	Sales	75200.0	32600
2	K2	Lola	Engineering	150000.0	25000
3	K3	Martin	Finance	25000.0	15000
4	K4	Zoe	Human Resources	NaN	NaN

10.2.4 Right Merge operations

Sometimes you need to merge two DataFrames keeping all rows from the right DataFrame both and filling in NaN where there’s no match. This type of merging is called ‘right’ and the syntax is:

Merge Type	Description	Example Syntax
Right Join	Keeps all rows from the right dataframe and matches from the left dataframe	`pd.merge(df1, df2, on='key', how='right')`

Example

Continuing with the previous example in this case we want to keep all rows from the right DataFrame.

in the resulting DataFrame result making sure that missing information is flagged as NaN

 result = pd.merge(left, right, on="EmployeeID",how='right')

 result

	EmployeeID	Name	Unit	Salary	Bonus
0	K0	Anne	Marketing	52000	5500
1	K1	Bob	Sales	75200	32600
2	K2	Lola	Engineering	150000	25000
3	K3	Martin	Finance	25000	15000

Where is Zoe?

Notice that in the previous example the employee named ‘Zoe’ is missing from the resulting DataFrame. This is the case as her key ‘K4’ is missing in the second DataFrame and right operations keep only rows from the right DataFrame

10.3 Join Operations

An alternative approach to combine Pandas DataFrames based on a common key is ‘DataFrame.join()’. It is quite similar to Pandas.merge().

Example

Given our previous dataframes, left and right and we need to combine both into a single DataFrame result based on a common key Name. We use the method DataFrame.join() as follows:

result = left.merge( right, on='EmployeeID',how='inner')
result

	EmployeeID	Name	Unit	Salary	Bonus
0	K0	Anne	Marketing	52000	5500
1	K1	Bob	Sales	75200	32600
2	K2	Lola	Engineering	150000	25000
3	K3	Martin	Finance	25000	15000

The following code illustrates a join operation over the same pair of DataFrames

result = left.merge( right, on='EmployeeID',how='outer')
result

	EmployeeID	Name	Unit	Salary	Bonus
0	K0	Anne	Marketing	52000.0	5500
1	K1	Bob	Sales	75200.0	32600
2	K2	Lola	Engineering	150000.0	25000
3	K3	Martin	Finance	25000.0	15000
4	K4	Zoe	Human Resources	NaN	NaN

10.4 Data Joining in Practice

TODO include tip boxes suggesting the use of gemini to address software challenges

First Example

Let’s have a look at the following Jupyter Notebook. This notebook accesses, via API, external datasets relative to the quality of air in cities across the world, then concatenates different datasets and conducts a basic benchmark of air quality of different cities.

Second Example

Let’s have a look at the following Jupyter Notebook. This notebook continues with the analysis of air quality this time performing more advanced transformation on the datasets.

Using Generative AI for coding purposes

Don’t forget to ask Gemini in case you need some help on DataFrame transformations or visualizations

Third Example (Retail Lab)

Let’s have a look at the following Jupyter Notebook. This notebook illustrates the processing of a series of retail related datasets.

10.5 Conclusion

The methods Pandas.concat(), Pandas.merge() and DataFrame.join() are important in any data science or ML related project as they allow you to combine real world, heterogeneous, datasets. Newcomers to data transformation tend to be confused as to which method to use to combine DataFrames:

Pandas.concat() applies when you just want to glue together DataFrames vertically (stacking rows) or horizontally (stacking columns). Think of it as just putting pieces next to each other.
Pandas.merge(), a bit more advanced, merge() is like SQL joins. It allows you to combine two DataFrames based on a common column(s). This is your go-to for combining data where there’s a common link but potentially with complex relationships.
DataFrame.join() is similar to merge(), but it is more for combining DataFrames based on their indices instead of columns. It’s super useful when your data is already aligned by index.

The following table provides a summary of the different types of merge operations

Merge Type	Description	Example Syntax
Inner Join	Keeps only the rows where keys from both dataframes match	`pd.merge(df1, df2, on='key', how='inner')`
Outer Join	Keeps all rows from both dataframes, filling in `NaN` where there’s no match	`pd.merge(df1, df2, on='key', how='outer')`
Left Join	Keeps all rows from the left dataframe and matches from the right dataframe	`pd.merge(df1, df2, on='key', how='left')`
Right Join	Keeps all rows from the right dataframe and matches from the left dataframe	`pd.merge(df1, df2, on='key', how='right')`
Multi-key	Merging on multiple keys	`pd.merge(df1, df2, on=['key1', 'key2'], how='inner')`
Different Columns	Merging on different columns from left and right dataframes	`pd.merge(df1, df2, left_on='key1', right_on='key2')`

10.6 Further reading

For those interested in additional examples and documentation, please refer to the following resources:

Quick Tutorial: Tutorial
The official reference: Pandas API
Additional examples: Kaggle
Advanced material: Python for Data Science: Chapter 8