13 Advanced Data Filtering Operations

In Chapter 7 you practiced how to perform basic filtering on a pandas DataFrame. Filtering allows you to select observations based on a condition as shown in the following example.

# Step 1 import pandas
import pandas as pd

# Step 2: Define the dictionary
data = {
    'EmployeeID': ['X0099','M1010','J2020','J2021','X0100','X1000','F3434'],
    'Name': ['Alice', 'Bob', 'Charlie','John','Sean','Tobias','Francois'],     # Column 'Name' with names
    'Age': [25, 30, 35,50,47,59,55],                      # Column 'Age' with ages
    'City': ['New York', 'Los Angeles', 'Beijing','Hong Kong','London','Beijing','London'],  # Column 'City' with cities
    'Salary': [70000, 80000, 60000,45000,125000,36000,250000]           # Column 'Salary' with salaries
}

# Step 3: Create the DataFrame
df = pd.DataFrame(data)
df.set_index('EmployeeID',inplace=True)

# Display the DataFrame
df

	Name	Age	City	Salary
EmployeeID
X0099	Alice	25	New York	70000
M1010	Bob	30	Los Angeles	80000
J2020	Charlie	35	Beijing	60000
J2021	John	50	Hong Kong	45000
X0100	Sean	47	London	125000
X1000	Tobias	59	Beijing	36000
F3434	Francois	55	London	250000

df[df['City']== 'Beijing']

	Name	Age	City	Salary
EmployeeID
J2020	Charlie	35	Beijing	60000
X1000	Tobias	59	Beijing	36000

At this stage you know already how to iterate and apply conditionals to a variety of data structures, you are also familiar with lambda functions and how to have them applied to manipulate pandas Dataframes. In this chapter you will apply all these techniques to develop advanced filtering and fine-grained manipulation of your datasets.

13.1 Conditional Filtering by Name: the .loc[] operator

The DataFrame.loc[] operator in pandas is used for label-based indexing and selection. It allows you to access a group of rows and columns by label(s) or a boolean array, refer to the following figure.

13.1.1 Column Selection by Name

The DataFrame.loc[] operator is quite versatile, it allows column selection. The following example selects ALL rows and ALL columns from the df DataFrame.

df.loc[:,:]

	Name	Age	City	Salary
EmployeeID
X0099	Alice	25	New York	70000
M1010	Bob	30	Los Angeles	80000
J2020	Charlie	35	Beijing	60000
J2021	John	50	Hong Kong	45000
X0100	Sean	47	London	125000
X1000	Tobias	59	Beijing	36000
F3434	Francois	55	London	250000

If we want to be more specific and select ALL rows and columns named ‘City’ and ‘Salary’ then:

df.loc[:,['City','Salary']]

	City	Salary
EmployeeID
X0099	New York	70000
M1010	Los Angeles	80000
J2020	Beijing	60000
J2021	Hong Kong	45000
X0100	London	125000
X1000	Beijing	36000
F3434	London	250000

13.1.2 Row Selection by Name

To peform a row-based selection we use the indexing operator DataFrame.loc[] with the name of the row we want to retrieve.

For instance the following example retrieves the row with the name ‘X0099’ of the DataFrame df

df.loc['X0099']

Name         Alice
Age             25
City      New York
Salary       70000
Name: X0099, dtype: object

To select several rows by name we use the indexing operator DataFrame.loc[] this time with a list of names of the rows we want to retrieve. The following example retrieves rows with names ‘X0099’ and ‘J2021’.

df.loc[['X0099','J2021']]

	Name	Age	City	Salary
EmployeeID
X0099	Alice	25	New York	70000
J2021	John	50	Hong Kong	45000

You might want to select some specific rows and columns named ‘City’ and ‘Salary’

df.loc['X0099':'J2021',['City','Salary']]

	City	Salary
EmployeeID
X0099	New York	70000
M1010	Los Angeles	80000
J2020	Beijing	60000
J2021	Hong Kong	45000

13.1.3 Conditional selection by Name

The DataFrame.loc[] operator allows conditional filtering, for instance:

df.loc[df['City']=='Beijing']

	Name	Age	City	Salary
EmployeeID
J2020	Charlie	35	Beijing	60000
X1000	Tobias	59	Beijing	36000

df.loc[df['Salary']<50000]

	Name	Age	City	Salary
EmployeeID
J2021	John	50	Hong Kong	45000
X1000	Tobias	59	Beijing	36000

You can combine conditions using the & operator (AND) for a more nuanced data selection, for instance:

df.loc[(df['City']=='Beijing')&(df['Salary']>50000)]

	Name	Age	City	Salary
EmployeeID
J2020	Charlie	35	Beijing	60000

You can also combine conditions using the | operator (OR), for instance:

df.loc[(df['City']=='Beijing') | (df['City']=='New York')]

	Name	Age	City	Salary
EmployeeID
X0099	Alice	25	New York	70000
J2020	Charlie	35	Beijing	60000
X1000	Tobias	59	Beijing	36000

13.1.4 Conditional replacement by Name

In real scenarios you oftentimes need to transform your data based on certain criteria, the DataFrame.loc[] operator is specially useful for this. The following example updates the salary of the employee named ‘X099’.

df.loc['X0099', 'Salary'] = 95000
df

	Name	Age	City	Salary
EmployeeID
X0099	Alice	25	New York	95000
M1010	Bob	30	Los Angeles	80000
J2020	Charlie	35	Beijing	60000
J2021	John	50	Hong Kong	45000
X0100	Sean	47	London	125000
X1000	Tobias	59	Beijing	36000
F3434	Francois	55	London	250000

Oftentimes you need to modify the values of columns based on some condition. Say that all employees from the Beijing office are to be relocated to a new office in Shanghai, we can update our dataset as follows:

df.loc[df['City']=='Beijing', 'City'] = 'Shanghai'
df

	Name	Age	City	Salary
EmployeeID
X0099	Alice	25	New York	95000
M1010	Bob	30	Los Angeles	80000
J2020	Charlie	35	Shanghai	60000
J2021	John	50	Hong Kong	45000
X0100	Sean	47	London	125000
X1000	Tobias	59	Shanghai	36000
F3434	Francois	55	London	250000

The following example illustrates how to create a new column Category and assign values to it based on another column Salary.

df['Rank']='None'

df.loc[df['Salary']>=75000,'Rank']='Director'
df.loc[df['Salary']<75000,'Rank']='Manager'
df

	Name	Age	City	Salary	Rank
EmployeeID
X0099	Alice	25	New York	95000	Director
M1010	Bob	30	Los Angeles	80000	Director
J2020	Charlie	35	Shanghai	60000	Manager
J2021	John	50	Hong Kong	45000	Manager
X0100	Sean	47	London	125000	Director
X1000	Tobias	59	Shanghai	36000	Manager
F3434	Francois	55	London	250000	Director

Conditional replacement by Name

Conditional replacement by Name is a quite powerful feature that allows you to transform your datasets prior to any subsequent analysis or AI training. For instance to derive new columns based on information from existing columns.

The following example illustrates how to create a categorical variable (dummy) derived from the values of the variable ZZZ. This is quite useful in econometrics and machine learning.

df['Seniority']='None'

df.loc[df['Age']>=50,'Seniority']=1
df.loc[df['Age']<50,'Seniority']=0

df

	Name	Age	City	Salary	Rank	Seniority
EmployeeID
X0099	Alice	25	New York	95000	Director	0
M1010	Bob	30	Los Angeles	80000	Director	0
J2020	Charlie	35	Shanghai	60000	Manager	0
J2021	John	50	Hong Kong	45000	Manager	1
X0100	Sean	47	London	125000	Director	0
X1000	Tobias	59	Shanghai	36000	Manager	1
F3434	Francois	55	London	250000	Director	1

13.2 Conditional Filtering by Position: the .iloc[] operator

The DataFrame.iloc[] operator in pandas is used for integer-location based indexing and selection. It allows you to access a group of rows and columns by their position rather than by name. This is particularly useful when you need to work with the position of elements in the DataFrame rather than their labels, refer to the following figure.

13.2.1 Column Selection by Position

The DataFrame.iloc[] operator allows you to select columns based on the position in the DataFrame. The following example selects ALL rows and ALL columns from the df DataFrame.

df.iloc[:,:]

	Name	Age	City	Salary	Rank	Seniority
EmployeeID
X0099	Alice	25	New York	95000	Director	0
M1010	Bob	30	Los Angeles	80000	Director	0
J2020	Charlie	35	Shanghai	60000	Manager	0
J2021	John	50	Hong Kong	45000	Manager	1
X0100	Sean	47	London	125000	Director	0
X1000	Tobias	59	Shanghai	36000	Manager	1
F3434	Francois	55	London	250000	Director	1

If we want to be more specific and select ALL rows and columns in the first and third place then:

df.iloc[:,[1,3]]

	Age	Salary
EmployeeID
X0099	25	95000
M1010	30	80000
J2020	35	60000
J2021	50	45000
X0100	47	125000
X1000	59	36000
F3434	55	250000

You might be interested in selecting a range of columns, for instance to select columns from the second position up to the fourth one:

df.iloc[:,1:4]

	Age	City	Salary
EmployeeID
X0099	25	New York	95000
M1010	30	Los Angeles	80000
J2020	35	Shanghai	60000
J2021	50	Hong Kong	45000
X0100	47	London	125000
X1000	59	Shanghai	36000
F3434	55	London	250000

13.2.2 Row Selection by Position

To perform a row-based selection by position we use DataFrame.iloc[] with the indices of the rows we want to retrieve.

For instance the following example retrieves rows from the third position up to the sixth one.

df.iloc[2:6,:]

	Name	Age	City	Salary	Rank	Seniority
EmployeeID
J2020	Charlie	35	Shanghai	60000	Manager	0
J2021	John	50	Hong Kong	45000	Manager	1
X0100	Sean	47	London	125000	Director	0
X1000	Tobias	59	Shanghai	36000	Manager	1

13.2.3 Conditional replacement by Position

It is possible, yet uncommon in practice, to implement a conditional replacement based on position using the DataFrame.iloc[] operator.

The following example updates the fifth column of the first two rows with the value ‘Partner’.

df.iloc[0:2,5]='Partner'
df

	Name	Age	City	Salary	Rank	Seniority
EmployeeID
X0099	Alice	25	New York	95000	Director	Partner
M1010	Bob	30	Los Angeles	80000	Director	Partner
J2020	Charlie	35	Shanghai	60000	Manager	0
J2021	John	50	Hong Kong	45000	Manager	1
X0100	Sean	47	London	125000	Director	0
X1000	Tobias	59	Shanghai	36000	Manager	1
F3434	Francois	55	London	250000	Director	1

13.3 Code Interpretation Challenge

Following please find some examples of python code. Try to understand what the code is trying to accomplish before checking the solution below:

Example: Titanic Conditional Filtering by Position

import pandas as pd

# URL of the Titanic dataset
url = 'https://raw.githubusercontent.com/thousandoaks/Python4DS-I/refs/heads/main/datasets/Titanic-Dataset.csv'

# Load the dataset into a DataFrame
df = pd.read_csv(url)


# Selecting the first 10 rows and specific columns (Name, Age, Sex) using iloc
first_10_passengers = df.iloc[:10, [3, 5, 4]]  # Index positions for 'Name', 'Age', 'Sex'

# Display the selected rows
display(first_10_passengers)

Explanation:

The .iloc[] operator selects the first 10 rows (:10) and specific columns based on their index positions: 3 corresponds to Name, 5 corresponds to Age, 4 corresponds to Sex.

	Name	Sex
0	Braund, Mr. Owen Harris	M
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female
2	Heikkinen, Miss. Laina	female
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female
4	Allen, Mr. William Henry	M

Example: Titanic Conditional Replacement by Value

import pandas as pd

# URL of the Titanic dataset
url = 'https://raw.githubusercontent.com/thousandoaks/Python4DS-I/refs/heads/main/datasets/Titanic-Dataset.csv'

# Load the dataset into a DataFrame
df = pd.read_csv(url)

# Replace all 'male' values in the 'Sex' column with 'M'
df.loc[df['Sex'] == 'male', 'Sex'] = 'M'

# Display the updated 'Sex' column
display(df[['Name', 'Sex']].head())

Explanation:

The .loc[] operator is used to filter rows where the Sex column equals ‘male’. The values in the Sex column with value==male are replaced with ‘M’.

	Name	Sex
0	Braund, Mr. Owen Harris	M
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female
2	Heikkinen, Miss. Laina	female
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female
4	Allen, Mr. William Henry	M

Example: Titanic Conditional Replacement by Value

import pandas as pd

# URL of the Titanic dataset
url = 'https://raw.githubusercontent.com/thousandoaks/Python4DS-I/refs/heads/main/datasets/Titanic-Dataset.csv'

# Load the dataset into a DataFrame
df = pd.read_csv(url)
df['Service']='None'
df.loc[df['Fare'] > 50, 'Service'] = 'Gold'
df.loc[df['Fare'] <= 50, 'Service'] = 'Silver'

# Display the updated 'Service' column
display(df[['Name','Fare', 'Service']].head())

Explanation:

The code uses the .loc[] operator to conditionally assign ‘Gold’ to the ‘Service’ column where ‘Fare’ is greater than 50 and assigns ‘Silver’ where ‘Fare’ is less than or equal to 50.

	Name	Fare	Service
0	Braund, Mr. Owen Harris	7.2500	Silver
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	71.2833	Gold
2	Heikkinen, Miss. Laina	7.9250	Silver
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	53.1000	Gold
4	Allen, Mr. William Henry	8.0500	Silver

13.4 Advanced Filtering in Practice

First Example (Digital Marketing)

Let’s have a look at the following Jupyter Notebook. This notebook illustrates how to select and change information based on conditions

Second Example (Healthcare)

The following notebook Jupyter Notebook illustrates some data filtering performed over a healthcare dataset.

Third Example (Retail)

The following notebook Jupyter Notebook illustrates the application of data filtering over a retail dataset.

13.5 Conclusions

This chapter builds upon foundational data filtering concepts in pandas, focusing on advanced techniques for fine-grained manipulation of DataFrames. It highlights the use of the .loc[] operator for label-based filtering and conditional transformations, such as selecting rows and columns by name, applying conditions to filter or update data, and deriving new columns based on existing data. Similarly, the .iloc[] operator enables filtering by positional indexing, allowing selection and updates based on row and column indices. Practical applications of these methods include conditional replacements, creating categorical variables, and processing subsets of data. The chapter concludes by showcasing real-world use cases in digital marketing, healthcare, and retail, with links to Jupyter Notebooks for further exploration.

In the next chapter you will learn how to perform advanced data grouping and aggregation

13.6 Further Reading

For those interested in additional examples and references on comprehensions feel free to check the following:

Pandas Filtering and Selection Tutorial