10  Data Joining Operations

So far we have taken for granted that the data we required to carry out our analysis was available to us in a single, convenient format (e.g. csv,xls). This is usually not the case, data sources are scattered around the organization or the internet and require prior integration before any analysis or AI training can take place.

Some not so good news is that you will be spending a significant amount of your time as a data scientist, or ML engineer for that matter, chasing data and have it patched together before any meaningful analysis or AI training can take place. The good news is that Pandas offers several methods that allow you to manipulate and combine external datasets into ready for productions Pandas DataFrames.

In previous chapters we have practiced the processing and manipulation of individual DataFrames, in this chapter we will broaden our expertise by learning how to combine different datasets into a single DataFrame which is ready for further processing or analysis.

10.1 Concat Operations

In the case of DataFrames which are similar in the sense of having the same column structure, you can use the Pandas.concat() method to combine several DataFrames into one. The Pandas.concat() operation in pandas is used to stitch together or stack multiple dataframes along a particular axis (either rows or columns), refer to the following figure:

Pandas concat operations

Example

Say we have two DataFrames, df1 and df2 and we need to concatenate both, one after the other, into a single DataFrame. We use the method Pandas.concat() as follows:

import pandas as pd

# Sample data
data1 = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000, 80000, 120000],
    'Bonus' : [15000, 10000, 80000]

}

df1 = pd.DataFrame(data1)
df1
Name Age Salary Bonus
0 Alice 25 70000 15000
1 Bob 30 80000 10000
2 Charlie 35 120000 80000
# Sample data
data2 = {
    'Name': ['Vicent', 'Mahmoud', 'FangFang'],
    'Age': [40, 39, 35],
    'Salary': [30000, 84000, 140000],
    'Bonus' : [1500, 8000, 5500]

}

df2 = pd.DataFrame(data2)
df2
Name Age Salary Bonus
0 Vicent 40 30000 1500
1 Mahmoud 39 84000 8000
2 FangFang 35 140000 5500
singledataframe=pd.concat([df1,df2])
singledataframe
Name Age Salary Bonus
0 Alice 25 70000 15000
1 Bob 30 80000 10000
2 Charlie 35 120000 80000
0 Vicent 40 30000 1500
1 Mahmoud 39 84000 8000
2 FangFang 35 140000 5500

Example

In the previous example you might have noticed that in singledataframe the index of the DataFrame is not unique (e.g. Bob and Mahmoud have the same index number ‘1’). If you need to generate a new index that is unique for each row you can specify the argument ignore_index=True as follows:

singledataframe=pd.concat([df1,df2],ignore_index=True)
singledataframe
Name Age Salary Bonus
0 Alice 25 70000 15000
1 Bob 30 80000 10000
2 Charlie 35 120000 80000
3 Vicent 40 30000 1500
4 Mahmoud 39 84000 8000
5 FangFang 35 140000 5500

Example

Now for a more realistic example, the following example processes two external datasets and have then concatenated for subsequent analysis.

import pandas as pd

air_quality_no2 = pd.read_csv("https://raw.githubusercontent.com/thousandoaks/Python4DS102/main/data/air_quality_no2_long.csv",
                              parse_dates=True)



air_quality_no2 = air_quality_no2[["date.utc", "location",
                                   "parameter", "value"]]


air_quality_no2.head()
date.utc location parameter value
0 2019-06-21 00:00:00+00:00 FR04014 no2 20.0
1 2019-06-20 23:00:00+00:00 FR04014 no2 21.8
2 2019-06-20 22:00:00+00:00 FR04014 no2 26.5
3 2019-06-20 21:00:00+00:00 FR04014 no2 24.9
4 2019-06-20 20:00:00+00:00 FR04014 no2 21.4
air_quality_pm25 = pd.read_csv("https://raw.githubusercontent.com/thousandoaks/Python4DS102/main/data/air_quality_pm25_long.csv",
                               parse_dates=True)


air_quality_pm25 = air_quality_pm25[["date.utc", "location",
                                     "parameter", "value"]]

air_quality_pm25.head()
date.utc location parameter value
0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
1 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5
2 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5
3 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0
4 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5

By way of Pandas.concatenate() We combine the measurements of NO2 and PM25, two tables with a similar structure, in a single table.

air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0)
air_quality
date.utc location parameter value
0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
1 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5
2 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5
3 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0
4 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5
... ... ... ... ...
2063 2019-05-07 06:00:00+00:00 London Westminster no2 26.0
2064 2019-05-07 04:00:00+00:00 London Westminster no2 16.0
2065 2019-05-07 03:00:00+00:00 London Westminster no2 19.0
2066 2019-05-07 02:00:00+00:00 London Westminster no2 19.0
2067 2019-05-07 01:00:00+00:00 London Westminster no2 23.0

3178 rows × 4 columns

Feel free to click on the following Jupyter Notebook to run the code on Googe Colab.

10.2 Merge Operations

Pandas concat operations are useful when we need to integrate DataFrames with similar structures (e.g. similar columns). Combining DaFrames which are not similar, in the sense of not having the same column structure, can be performed using the Pandas.merge() method. This method allows you to combine Pandas DataFrames based on a common key (e.g. customer ID), refer to the following figure.

Pandas Merge Operation

Example

Say we have two DataFrames, df1 and df2 and we need to combine both into a single DataFrame result based on a common key Name. We use the method Pandas.concat() as follows:

df1 = pd.DataFrame({'Name': ['Anne', 'Bob', 'Charlie'], 'Age': [31, 42, 55]})
df2 = pd.DataFrame({'Name': ['Bob', 'Charlie', 'Doron'], 'Salary': [45000, 55500, 66700]})
df1
Name Age
0 Anne 31
1 Bob 42
2 Charlie 55
df2
Name Salary
0 Bob 45000
1 Charlie 55500
2 Doron 66700
result = pd.merge(df1, df2, on='Name')
result
Name Age Salary
0 Bob 42 45000
1 Charlie 55 55500

10.2.1 Inner Merge operations

In the previous example you might have noticed that only those observations having the same key Name were merged into the result DataFrame. This type of merging is called ‘inner’ and it is useful when you nee to keep only the rows where keys from both DataFrames match, the syntax is:

Merge Type Description Example Syntax
Inner Join Keeps only the rows where keys from both dataframes match pd.merge(df1, df2, on='key', how='inner')

Example

Say we have two DataFrames left and right each containing information from our company employees. We want to combine both DataFrames into a single one name result. We use the EmployeeID as the key which to merge on.

 left = pd.DataFrame(
   {
    "EmployeeID": ["K0", "K1", "K2", "K3","K4"],
     "Name": ["Anne", "Bob", "Lola", "Martin","Zoe"],
       "Unit": ["Marketing", "Sales", "Engineering", "Finance","Human Resources"],
    }
 )
left
EmployeeID Name Unit
0 K0 Anne Marketing
1 K1 Bob Sales
2 K2 Lola Engineering
3 K3 Martin Finance
4 K4 Zoe Human Resources
 right = pd.DataFrame(
    {
        "EmployeeID": ["K0", "K1", "K2", "K3"],
        "Salary": [52000, 75200, 150000, 25000],
        "Bonus": ["5500", "32600", "25000", "15000"],
    }
 )
right
EmployeeID Salary Bonus
0 K0 52000 5500
1 K1 75200 32600
2 K2 150000 25000
3 K3 25000 15000
 result = pd.merge(left, right, on="EmployeeID",how='inner')

 result
EmployeeID Name Unit Salary Bonus
0 K0 Anne Marketing 52000 5500
1 K1 Bob Sales 75200 32600
2 K2 Lola Engineering 150000 25000
3 K3 Martin Finance 25000 15000
Where is Zoe?

Notice that in the previous example the employee named ‘Zoe’ is missing from the resulting DataFrame. This is the case as her key ‘K4’ is missing in the second DataFrame and inner operations keep only the rows where keys from both dataframes match

10.2.2 Outer Merge operations

Sometimes you need to merge two DataFrames keeping all rows from both and filling in NaN (Not a Number) where there’s no match. This type of merging is called ‘outer’ and the syntax is:

Merge Type Description Example Syntax
Outer Join Keeps all rows from both dataframes, filling in NaN where there’s no match pd.merge(df1, df2, on='key', how='outer')

Example

Continuing with the previous example in this case we want to keep all employees in the resulting DataFrame result making sure that missing information is flagged as NaN

 result = pd.merge(left, right, on="EmployeeID",how='outer')

 result
EmployeeID Name Unit Salary Bonus
0 K0 Anne Marketing 52000.0 5500
1 K1 Bob Sales 75200.0 32600
2 K2 Lola Engineering 150000.0 25000
3 K3 Martin Finance 25000.0 15000
4 K4 Zoe Human Resources NaN NaN
Zoe is back

Notice that this time the employee named ‘Zoe’ shows up in the resulting DataFrame yet there is no information on her salary and bonus. This is the case as in the second DataFrame right there isn’t any entry for Zoe’s key K4.

10.2.3 Left Merge operations

Sometimes you need to merge two DataFrames keeping all rows from the left DataFrame both and filling in NaN where there’s no match. This type of merging is called ‘left’ and the syntax is:

Merge Type Description Example Syntax
Left Join Keeps all rows from the left dataframe and matches from the right dataframe pd.merge(df1, df2, on='key', how='left')

Example

Continuing with the previous example in this case we want to keep all rows from the left DataFrame.

in the resulting DataFrame result making sure that missing information is flagged as NaN

 result = pd.merge(left, right, on="EmployeeID",how='left')

 result
EmployeeID Name Unit Salary Bonus
0 K0 Anne Marketing 52000.0 5500
1 K1 Bob Sales 75200.0 32600
2 K2 Lola Engineering 150000.0 25000
3 K3 Martin Finance 25000.0 15000
4 K4 Zoe Human Resources NaN NaN

10.2.4 Right Merge operations

Sometimes you need to merge two DataFrames keeping all rows from the right DataFrame both and filling in NaN where there’s no match. This type of merging is called ‘right’ and the syntax is:

Merge Type Description Example Syntax
Right Join Keeps all rows from the right dataframe and matches from the left dataframe pd.merge(df1, df2, on='key', how='right')

Example

Continuing with the previous example in this case we want to keep all rows from the right DataFrame.

in the resulting DataFrame result making sure that missing information is flagged as NaN

 result = pd.merge(left, right, on="EmployeeID",how='right')

 result
EmployeeID Name Unit Salary Bonus
0 K0 Anne Marketing 52000 5500
1 K1 Bob Sales 75200 32600
2 K2 Lola Engineering 150000 25000
3 K3 Martin Finance 25000 15000
Where is Zoe?

Notice that in the previous example the employee named ‘Zoe’ is missing from the resulting DataFrame. This is the case as her key ‘K4’ is missing in the second DataFrame and right operations keep only rows from the right DataFrame

10.3 Join Operations

An alternative approach to combine Pandas DataFrames based on a common key is ‘DataFrame.join()’. It is quite similar to Pandas.merge().

Example

Given our previous dataframes, left and right and we need to combine both into a single DataFrame result based on a common key Name. We use the method DataFrame.join() as follows:

result = left.merge( right, on='EmployeeID',how='inner')
result
EmployeeID Name Unit Salary Bonus
0 K0 Anne Marketing 52000 5500
1 K1 Bob Sales 75200 32600
2 K2 Lola Engineering 150000 25000
3 K3 Martin Finance 25000 15000

The following code illustrates a join operation over the same pair of DataFrames

result = left.merge( right, on='EmployeeID',how='outer')
result
EmployeeID Name Unit Salary Bonus
0 K0 Anne Marketing 52000.0 5500
1 K1 Bob Sales 75200.0 32600
2 K2 Lola Engineering 150000.0 25000
3 K3 Martin Finance 25000.0 15000
4 K4 Zoe Human Resources NaN NaN

10.4 Data Transformation in Practice

TODO include tip boxes suggesting the use of gemini to address software challenges

First Example

Let’s have a look at the following Jupyter Notebook. This notebook accesses, via API, external datasets relative to the quality of air in cities across the world, then concatenates different datasets and conducts a basic benchmark of air quality of different cities.

Second Example

Let’s have a look at the following Jupyter Notebook. This notebook continues with the analysis of air quality this time performing more advanced transformation on the datasets.

Using Generative AI for coding purposes

Don’t forget to ask Gemini in case you need some help on DataFrame transformations or visualizations

Third Example (Retail Lab)

Let’s have a look at the following Jupyter Notebook. This notebook illustrates the processing of a series of retail related datasets.

10.5 Conclusion

The methods Pandas.concat(), Pandas.merge() and DataFrame.join() are important in any data science or ML related project as they allow you to combine real world, heterogeneous, datasets. Newcomers to data transformation tend to be confused as to which method to use to combine DataFrames:

  • Pandas.concat() applies when you just want to glue together DataFrames vertically (stacking rows) or horizontally (stacking columns). Think of it as just putting pieces next to each other.

  • Pandas.merge(), a bit more advanced, merge() is like SQL joins. It allows you to combine two DataFrames based on a common column(s). This is your go-to for combining data where there’s a common link but potentially with complex relationships.

  • DataFrame.join() is similar to merge(), but it is more for combining DataFrames based on their indices instead of columns. It’s super useful when your data is already aligned by index.

The following table provides a summary of the different types of merge operations

Merge Type Description Example Syntax
Inner Join Keeps only the rows where keys from both dataframes match pd.merge(df1, df2, on='key', how='inner')
Outer Join Keeps all rows from both dataframes, filling in NaN where there’s no match pd.merge(df1, df2, on='key', how='outer')
Left Join Keeps all rows from the left dataframe and matches from the right dataframe pd.merge(df1, df2, on='key', how='left')
Right Join Keeps all rows from the right dataframe and matches from the left dataframe pd.merge(df1, df2, on='key', how='right')
Multi-key Merging on multiple keys pd.merge(df1, df2, on=['key1', 'key2'], how='inner')
Different Columns Merging on different columns from left and right dataframes pd.merge(df1, df2, left_on='key1', right_on='key2')

10.6 Further reading

For those interested in additional examples and documentation, please refer to the following resources: