13  Advanced Data Filtering Operations

In you practiced how to perform basic filtering on a pandas DataFrame. Filtering allows you to select observations based on a condition as shown in the following example.

# Step 1 import pandas
import pandas as pd

# Step 2: Define the dictionary
data = {
    'EmployeeID': ['X0099','M1010','J2020','J2021','X0100','X1000','F3434'],
    'Name': ['Alice', 'Bob', 'Charlie','John','Sean','Tobias','Francois'],     # Column 'Name' with names
    'Age': [25, 30, 35,50,47,59,55],                      # Column 'Age' with ages
    'City': ['New York', 'Los Angeles', 'Beijing','Hong Kong','London','Beijing','London'],  # Column 'City' with cities
    'Salary': [70000, 80000, 60000,45000,125000,36000,250000]           # Column 'Salary' with salaries
}

# Step 3: Create the DataFrame
df = pd.DataFrame(data)
df.set_index('EmployeeID',inplace=True)

# Display the DataFrame
df
Name Age City Salary
EmployeeID
X0099 Alice 25 New York 70000
M1010 Bob 30 Los Angeles 80000
J2020 Charlie 35 Beijing 60000
J2021 John 50 Hong Kong 45000
X0100 Sean 47 London 125000
X1000 Tobias 59 Beijing 36000
F3434 Francois 55 London 250000
df[df['City']== 'Beijing']
Name Age City Salary
EmployeeID
J2020 Charlie 35 Beijing 60000
X1000 Tobias 59 Beijing 36000

At this stage you know already how to iterate and apply conditionals to a variety of data structures, you are also familiar with lambda functions and how to have them applied to manipulate pandas Dataframes. In this chapter you will apply all these techniques to develop advanced filtering and fine-grained manipulation of your datasets.

13.1 Conditional Filtering by Name: the .loc[] operator

The DataFrame.loc[] operator in pandas is used for label-based indexing and selection. It allows you to access a group of rows and columns by label(s) or a boolean array, refer to the following figure.

dataframe.loc[] operator

13.1.1 Column Selection by Name

The DataFrame.loc[] operator is quite versatile, it allows column selection. The following example selects ALL rows and ALL columns from the df DataFrame.

df.loc[:,:]
Name Age City Salary
EmployeeID
X0099 Alice 25 New York 70000
M1010 Bob 30 Los Angeles 80000
J2020 Charlie 35 Beijing 60000
J2021 John 50 Hong Kong 45000
X0100 Sean 47 London 125000
X1000 Tobias 59 Beijing 36000
F3434 Francois 55 London 250000

If we want to be more specific and select ALL rows and columns named ‘City’ and ‘Salary’ then:

df.loc[:,['City','Salary']]
City Salary
EmployeeID
X0099 New York 70000
M1010 Los Angeles 80000
J2020 Beijing 60000
J2021 Hong Kong 45000
X0100 London 125000
X1000 Beijing 36000
F3434 London 250000

13.1.2 Row Selection by Name

To peform a row-based selection we use the indexing operator DataFrame.loc[] with the name of the row we want to retrieve.

For instance the following example retrieves the row with the name ‘X0099’ of the DataFrame df

df.loc['X0099']
Name         Alice
Age             25
City      New York
Salary       70000
Name: X0099, dtype: object

To select several rows by name we use the indexing operator DataFrame.loc[] this time with a list of names of the rows we want to retrieve. The following example retrieves rows with names ‘X0099’ and ‘J2021’.

df.loc[['X0099','J2021']]
Name Age City Salary
EmployeeID
X0099 Alice 25 New York 70000
J2021 John 50 Hong Kong 45000

You might want to select some specific rows and columns named ‘City’ and ‘Salary’

df.loc['X0099':'J2021',['City','Salary']]
City Salary
EmployeeID
X0099 New York 70000
M1010 Los Angeles 80000
J2020 Beijing 60000
J2021 Hong Kong 45000

13.1.3 Conditional selection by Name

The DataFrame.loc[] operator allows conditional filtering, for instance:

df.loc[df['City']=='Beijing']
Name Age City Salary
EmployeeID
J2020 Charlie 35 Beijing 60000
X1000 Tobias 59 Beijing 36000
df.loc[df['Salary']<50000]
Name Age City Salary
EmployeeID
J2021 John 50 Hong Kong 45000
X1000 Tobias 59 Beijing 36000

You can combine conditions using the & operator (AND) for a more nuanced data selection, for instance:

df.loc[(df['City']=='Beijing')&(df['Salary']>50000)]
Name Age City Salary
EmployeeID
J2020 Charlie 35 Beijing 60000

You can also combine conditions using the | operator (OR), for instance:

df.loc[(df['City']=='Beijing') | (df['City']=='New York')]
Name Age City Salary
EmployeeID
X0099 Alice 25 New York 70000
J2020 Charlie 35 Beijing 60000
X1000 Tobias 59 Beijing 36000

13.1.4 Conditional replacement by Name

In real scenarios you oftentimes need to transform your data based on certain criteria, the DataFrame.loc[] operator is specially useful for this. The following example updates the salary of the employee named ‘X099’.

df.loc['X0099', 'Salary'] = 95000
df
Name Age City Salary
EmployeeID
X0099 Alice 25 New York 95000
M1010 Bob 30 Los Angeles 80000
J2020 Charlie 35 Beijing 60000
J2021 John 50 Hong Kong 45000
X0100 Sean 47 London 125000
X1000 Tobias 59 Beijing 36000
F3434 Francois 55 London 250000

Oftentimes you need to modify the values of columns based on some condition. Say that all employees from the Beijing office are to be relocated to a new office in Shanghai, we can update our dataset as follows:

df.loc[df['City']=='Beijing', 'City'] = 'Shanghai'
df
Name Age City Salary
EmployeeID
X0099 Alice 25 New York 95000
M1010 Bob 30 Los Angeles 80000
J2020 Charlie 35 Shanghai 60000
J2021 John 50 Hong Kong 45000
X0100 Sean 47 London 125000
X1000 Tobias 59 Shanghai 36000
F3434 Francois 55 London 250000

The following example illustrates how to create a new column Category and assign values to it based on another column Salary.

df['Rank']='None'

df.loc[df['Salary']>=75000,'Rank']='Director'
df.loc[df['Salary']<75000,'Rank']='Manager'
df
Name Age City Salary Rank
EmployeeID
X0099 Alice 25 New York 95000 Director
M1010 Bob 30 Los Angeles 80000 Director
J2020 Charlie 35 Shanghai 60000 Manager
J2021 John 50 Hong Kong 45000 Manager
X0100 Sean 47 London 125000 Director
X1000 Tobias 59 Shanghai 36000 Manager
F3434 Francois 55 London 250000 Director
Conditional replacement by Name

Conditional replacement by Name is a quite powerful feature that allows you to transform your datasets prior to any subsequent analysis or AI training. For instance to derive new columns based on information from existing columns.

The following example illustrates how to create a categorical variable (dummy) derived from the values of the variable ZZZ. This is quite useful in econometrics and machine learning.

df['Seniority']='None'

df.loc[df['Age']>=50,'Seniority']=1
df.loc[df['Age']<50,'Seniority']=0

df
Name Age City Salary Rank Seniority
EmployeeID
X0099 Alice 25 New York 95000 Director 0
M1010 Bob 30 Los Angeles 80000 Director 0
J2020 Charlie 35 Shanghai 60000 Manager 0
J2021 John 50 Hong Kong 45000 Manager 1
X0100 Sean 47 London 125000 Director 0
X1000 Tobias 59 Shanghai 36000 Manager 1
F3434 Francois 55 London 250000 Director 1

13.2 Conditional Filtering by Position: the .iloc[] operator

The DataFrame.iloc[] operator in pandas is used for integer-location based indexing and selection. It allows you to access a group of rows and columns by their position rather than by name. This is particularly useful when you need to work with the position of elements in the DataFrame rather than their labels, refer to the following figure.

dataframe.iloc[] operator

13.2.1 Column Selection by Position

The DataFrame.iloc[] operator allows you to select columns based on the position in the DataFrame. The following example selects ALL rows and ALL columns from the df DataFrame.

df.iloc[:,:]
Name Age City Salary Rank Seniority
EmployeeID
X0099 Alice 25 New York 95000 Director 0
M1010 Bob 30 Los Angeles 80000 Director 0
J2020 Charlie 35 Shanghai 60000 Manager 0
J2021 John 50 Hong Kong 45000 Manager 1
X0100 Sean 47 London 125000 Director 0
X1000 Tobias 59 Shanghai 36000 Manager 1
F3434 Francois 55 London 250000 Director 1

If we want to be more specific and select ALL rows and columns in the first and third place then:

df.iloc[:,[1,3]]
Age Salary
EmployeeID
X0099 25 95000
M1010 30 80000
J2020 35 60000
J2021 50 45000
X0100 47 125000
X1000 59 36000
F3434 55 250000

You might be interested in selecting a range of columns, for instance to select columns from the second position up to the fourth one:

df.iloc[:,1:4]
Age City Salary
EmployeeID
X0099 25 New York 95000
M1010 30 Los Angeles 80000
J2020 35 Shanghai 60000
J2021 50 Hong Kong 45000
X0100 47 London 125000
X1000 59 Shanghai 36000
F3434 55 London 250000

13.2.2 Row Selection by Position

To perform a row-based selection by position we use DataFrame.iloc[] with the indices of the rows we want to retrieve.

For instance the following example retrieves rows from the third position up to the sixth one.

df.iloc[2:6,:]
Name Age City Salary Rank Seniority
EmployeeID
J2020 Charlie 35 Shanghai 60000 Manager 0
J2021 John 50 Hong Kong 45000 Manager 1
X0100 Sean 47 London 125000 Director 0
X1000 Tobias 59 Shanghai 36000 Manager 1

13.2.3 Conditional replacement by Position

It is possible, yet uncommon in practice, to implement a conditional replacement based on position using the DataFrame.iloc[] operator.

The following example updates the fifth column of the first two rows with the value ‘Partner’.

df.iloc[0:2,5]='Partner'
df
Name Age City Salary Rank Seniority
EmployeeID
X0099 Alice 25 New York 95000 Director Partner
M1010 Bob 30 Los Angeles 80000 Director Partner
J2020 Charlie 35 Shanghai 60000 Manager 0
J2021 John 50 Hong Kong 45000 Manager 1
X0100 Sean 47 London 125000 Director 0
X1000 Tobias 59 Shanghai 36000 Manager 1
F3434 Francois 55 London 250000 Director 1

13.3 Code Interpretation Challenge

Following please find some examples of python code. Try to understand what the code is trying to accomplish before checking the solution below:

Example: Titanic Conditional Filtering by Position

import pandas as pd

# URL of the Titanic dataset
url = 'https://raw.githubusercontent.com/thousandoaks/Python4DS-I/refs/heads/main/datasets/Titanic-Dataset.csv'

# Load the dataset into a DataFrame
df = pd.read_csv(url)


# Selecting the first 10 rows and specific columns (Name, Age, Sex) using iloc
first_10_passengers = df.iloc[:10, [3, 5, 4]]  # Index positions for 'Name', 'Age', 'Sex'

# Display the selected rows
display(first_10_passengers)

The .iloc[] operator selects the first 10 rows (:10) and specific columns based on their index positions: 3 corresponds to Name, 5 corresponds to Age, 4 corresponds to Sex.

Name Sex
0 Braund, Mr. Owen Harris M
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female
2 Heikkinen, Miss. Laina female
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
4 Allen, Mr. William Henry M

Example: Titanic Conditional Replacement by Value

import pandas as pd

# URL of the Titanic dataset
url = 'https://raw.githubusercontent.com/thousandoaks/Python4DS-I/refs/heads/main/datasets/Titanic-Dataset.csv'

# Load the dataset into a DataFrame
df = pd.read_csv(url)

# Replace all 'male' values in the 'Sex' column with 'M'
df.loc[df['Sex'] == 'male', 'Sex'] = 'M'

# Display the updated 'Sex' column
display(df[['Name', 'Sex']].head())

The .loc[] operator is used to filter rows where the Sex column equals ‘male’. The values in the Sex column with value==male are replaced with ‘M’.

Name Sex
0 Braund, Mr. Owen Harris M
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female
2 Heikkinen, Miss. Laina female
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
4 Allen, Mr. William Henry M

Example: Titanic Conditional Replacement by Value

import pandas as pd

# URL of the Titanic dataset
url = 'https://raw.githubusercontent.com/thousandoaks/Python4DS-I/refs/heads/main/datasets/Titanic-Dataset.csv'

# Load the dataset into a DataFrame
df = pd.read_csv(url)
df['Service']='None'
df.loc[df['Fare'] > 50, 'Service'] = 'Gold'
df.loc[df['Fare'] <= 50, 'Service'] = 'Silver'

# Display the updated 'Service' column
display(df[['Name','Fare', 'Service']].head())

The code uses the .loc[] operator to conditionally assign ‘Gold’ to the ‘Service’ column where ‘Fare’ is greater than 50 and assigns ‘Silver’ where ‘Fare’ is less than or equal to 50.

Name Fare Service
0 Braund, Mr. Owen Harris 7.2500 Silver
1 Cumings, Mrs. John Bradley (Florence Briggs Th... 71.2833 Gold
2 Heikkinen, Miss. Laina 7.9250 Silver
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 53.1000 Gold
4 Allen, Mr. William Henry 8.0500 Silver

13.4 Advanced Filtering in Practice

First Example (Digital Marketing)

Let’s have a look at the following Jupyter Notebook. This notebook illustrates how to select and change information based on conditions

Second Example (Healthcare)

The following notebook Jupyter Notebook illustrates some data filtering performed over a healthcare dataset.

Third Example (Retail)

The following notebook Jupyter Notebook illustrates the application of data filtering over a retail dataset.

13.5 Conclusions

This chapter builds upon foundational data filtering concepts in pandas, focusing on advanced techniques for fine-grained manipulation of DataFrames. It highlights the use of the .loc[] operator for label-based filtering and conditional transformations, such as selecting rows and columns by name, applying conditions to filter or update data, and deriving new columns based on existing data. Similarly, the .iloc[] operator enables filtering by positional indexing, allowing selection and updates based on row and column indices. Practical applications of these methods include conditional replacements, creating categorical variables, and processing subsets of data. The chapter concludes by showcasing real-world use cases in digital marketing, healthcare, and retail, with links to Jupyter Notebooks for further exploration.

In the next chapter you will learn how to perform advanced data grouping and aggregation

13.6 Further Reading

For those interested in additional examples and references on comprehensions feel free to check the following: