13  Advanced Data Filtering Operations

In Chapter 7 you practiced how to perform basic filtering on a pandas DataFrame. Filtering allows you to select observations based on a condition as shown in the following example.

# Step 1 import pandas
import pandas as pd

# Step 2: Define the dictionary
data = {
    'EmployeeID': ['X0099','M1010','J2020','J2021','X0100','X1000','F3434'],
    'Name': ['Alice', 'Bob', 'Charlie','John','Sean','Tobias','Francois'],     # Column 'Name' with names
    'Age': [25, 30, 35,50,47,59,55],                      # Column 'Age' with ages
    'City': ['New York', 'Los Angeles', 'Beijing','Hong Kong','London','Beijing','London'],  # Column 'City' with cities
    'Salary': [70000, 80000, 60000,45000,125000,36000,250000]           # Column 'Salary' with salaries
}

# Step 3: Create the DataFrame
df = pd.DataFrame(data)
df.set_index('EmployeeID',inplace=True)

# Display the DataFrame
df
Name Age City Salary
EmployeeID
X0099 Alice 25 New York 70000
M1010 Bob 30 Los Angeles 80000
J2020 Charlie 35 Beijing 60000
J2021 John 50 Hong Kong 45000
X0100 Sean 47 London 125000
X1000 Tobias 59 Beijing 36000
F3434 Francois 55 London 250000
df[df['City']== 'Beijing']
Name Age City Salary
EmployeeID
J2020 Charlie 35 Beijing 60000
X1000 Tobias 59 Beijing 36000

At this stage you know already how to iterate and apply conditionals to a variety of data structures, you are also familiar with lambda functions and how to have them applied to manipulate pandas Dataframes. In this chapter you will apply all these techniques to develop advanced filtering and fine-grained manipulation of your datasets.

13.1 Conditional Filtering by Name: the .loc[] operator

The DataFrame.loc[] operator in pandas is used for label-based indexing and selection. It allows you to access a group of rows and columns by label(s) or a boolean array, refer to the following figure.

dataframe.loc[] operator

13.1.1 Column Selection by Name

The DataFrame.loc[] operator is quite versatile, it allows column selection. The following example selects ALL rows and ALL columns from the df DataFrame.

df.loc[:,:]
Name Age City Salary
EmployeeID
X0099 Alice 25 New York 70000
M1010 Bob 30 Los Angeles 80000
J2020 Charlie 35 Beijing 60000
J2021 John 50 Hong Kong 45000
X0100 Sean 47 London 125000
X1000 Tobias 59 Beijing 36000
F3434 Francois 55 London 250000

If we want to be more specific and select ALL rows and columns named ‘City’ and ‘Salary’ then:

df.loc[:,['City','Salary']]
City Salary
EmployeeID
X0099 New York 70000
M1010 Los Angeles 80000
J2020 Beijing 60000
J2021 Hong Kong 45000
X0100 London 125000
X1000 Beijing 36000
F3434 London 250000

13.1.2 Row Selection by Name

To peform a row-based selection we use the indexing operator DataFrame.loc[] with the name of the row we want to retrieve.

For instance the following example retrieves the row with the name ‘X0099’ of the DataFrame df

df.loc['X0099']
Name         Alice
Age             25
City      New York
Salary       70000
Name: X0099, dtype: object

To select several rows by name we use the indexing operator DataFrame.loc[] this time with a list of names of the rows we want to retrieve. The following example retrieves rows with names ‘X0099’ and ‘J2021’.

df.loc[['X0099','J2021']]
Name Age City Salary
EmployeeID
X0099 Alice 25 New York 70000
J2021 John 50 Hong Kong 45000

You might want to select some specific rows and columns named ‘City’ and ‘Salary’

df.loc['X0099':'J2021',['City','Salary']]
City Salary
EmployeeID
X0099 New York 70000
M1010 Los Angeles 80000
J2020 Beijing 60000
J2021 Hong Kong 45000

13.1.3 Conditional selection by Name

The DataFrame.loc[] operator allows conditional filtering, for instance:

df.loc[df['City']=='Beijing']
Name Age City Salary
EmployeeID
J2020 Charlie 35 Beijing 60000
X1000 Tobias 59 Beijing 36000
df.loc[df['Salary']<50000]
Name Age City Salary
EmployeeID
J2021 John 50 Hong Kong 45000
X1000 Tobias 59 Beijing 36000

You can combine conditions using the & operator (AND) for a more nuanced data selection, for instance:

df.loc[(df['City']=='Beijing')&(df['Salary']>50000)]
Name Age City Salary
EmployeeID
J2020 Charlie 35 Beijing 60000

You can also combine conditions using the | operator (OR), for instance:

df.loc[(df['City']=='Beijing') | (df['City']=='New York')]
Name Age City Salary
EmployeeID
X0099 Alice 25 New York 70000
J2020 Charlie 35 Beijing 60000
X1000 Tobias 59 Beijing 36000

13.1.4 Conditional replacement by Name

In real scenarios you oftentimes need to transform your data based on certain criteria, the DataFrame.loc[] operator is specially useful for this. The following example updates the salary of the employee named ‘X099’.

df.loc['X0099', 'Salary'] = 95000
df
Name Age City Salary
EmployeeID
X0099 Alice 25 New York 95000
M1010 Bob 30 Los Angeles 80000
J2020 Charlie 35 Beijing 60000
J2021 John 50 Hong Kong 45000
X0100 Sean 47 London 125000
X1000 Tobias 59 Beijing 36000
F3434 Francois 55 London 250000

Oftentimes you need to modify the values of columns based on some condition. Say that all employees from the Beijing office are to be relocated to a new office in Shanghai, we can update our dataset as follows:

df.loc[df['City']=='Beijing', 'City'] = 'Shanghai'
df
Name Age City Salary
EmployeeID
X0099 Alice 25 New York 95000
M1010 Bob 30 Los Angeles 80000
J2020 Charlie 35 Shanghai 60000
J2021 John 50 Hong Kong 45000
X0100 Sean 47 London 125000
X1000 Tobias 59 Shanghai 36000
F3434 Francois 55 London 250000

The following example illustrates how to create a new column Category and assign values to it based on another column Salary.

df['Category']='None'

df.loc[df['Salary']>=75000,'Rank']='Director'
df.loc[df['Salary']<75000,'Rank']='Manager'
df
Name Age City Salary Category Rank
EmployeeID
X0099 Alice 25 New York 95000 None Director
M1010 Bob 30 Los Angeles 80000 None Director
J2020 Charlie 35 Shanghai 60000 None Manager
J2021 John 50 Hong Kong 45000 None Manager
X0100 Sean 47 London 125000 None Director
X1000 Tobias 59 Shanghai 36000 None Manager
F3434 Francois 55 London 250000 None Director
Conditional replacement by Name

Conditional replacement by Name is a quite powerful feature that allows you to transform your datasets prior to any subsequent analysis or AI training. For instance to derive new columns based on information from existing columns.

The following example illustrates how to create a categorical variable (dummy) derived from the values of the variable ZZZ. This is quite useful in econometrics and machine learning.

df['Seniority']='None'

df.loc[df['Age']>=50,'Seniority']=1
df.loc[df['Age']<50,'Seniority']=0

df
Name Age City Salary Category Rank Seniority
EmployeeID
X0099 Alice 25 New York 95000 None Director 0
M1010 Bob 30 Los Angeles 80000 None Director 0
J2020 Charlie 35 Shanghai 60000 None Manager 0
J2021 John 50 Hong Kong 45000 None Manager 1
X0100 Sean 47 London 125000 None Director 0
X1000 Tobias 59 Shanghai 36000 None Manager 1
F3434 Francois 55 London 250000 None Director 1

13.2 Conditional Filtering by Position: the .iloc[] operator

The DataFrame.iloc[] operator in pandas is used for integer-location based indexing and selection. It allows you to access a group of rows and columns by their position rather than by name. This is particularly useful when you need to work with the position of elements in the DataFrame rather than their labels, refer to the following figure.

dataframe.iloc[] operator

13.2.1 Column Selection by Position

The DataFrame.iloc[] operator allows you to select columns based on the position in the DataFrame. The following example selects ALL rows and ALL columns from the df DataFrame.

df.iloc[:,:]
Name Age City Salary Category Rank Seniority
EmployeeID
X0099 Alice 25 New York 95000 None Director 0
M1010 Bob 30 Los Angeles 80000 None Director 0
J2020 Charlie 35 Shanghai 60000 None Manager 0
J2021 John 50 Hong Kong 45000 None Manager 1
X0100 Sean 47 London 125000 None Director 0
X1000 Tobias 59 Shanghai 36000 None Manager 1
F3434 Francois 55 London 250000 None Director 1

If we want to be more specific and select ALL rows and columns in the first and third place then:

df.iloc[:,[1,3]]
Age Salary
EmployeeID
X0099 25 95000
M1010 30 80000
J2020 35 60000
J2021 50 45000
X0100 47 125000
X1000 59 36000
F3434 55 250000

You might be interested in selecting a range of columns, for instance to select columns from the second position up to the fourth one:

df.iloc[:,1:4]
Age City Salary
EmployeeID
X0099 25 New York 95000
M1010 30 Los Angeles 80000
J2020 35 Shanghai 60000
J2021 50 Hong Kong 45000
X0100 47 London 125000
X1000 59 Shanghai 36000
F3434 55 London 250000

13.2.2 Row Selection by Position

To perform a row-based selection by position we use DataFrame.iloc[] with the indices of the rows we want to retrieve.

For instance the following example retrieves rows from the third position up to the sixth one.

df.iloc[2:6,:]
Name Age City Salary Category Rank Seniority
EmployeeID
J2020 Charlie 35 Shanghai 60000 None Manager 0
J2021 John 50 Hong Kong 45000 None Manager 1
X0100 Sean 47 London 125000 None Director 0
X1000 Tobias 59 Shanghai 36000 None Manager 1

13.2.3 Conditional replacement by Position

It is possible, yet uncommon in practice, to implement a conditional replacement based on position using the DataFrame.iloc[] operator.

The following example updates the fifth column of the first two rows with the value ‘Partner’.

df.iloc[0:2,5]='Partner'
df
Name Age City Salary Category Rank Seniority
EmployeeID
X0099 Alice 25 New York 95000 None Partner 0
M1010 Bob 30 Los Angeles 80000 None Partner 0
J2020 Charlie 35 Shanghai 60000 None Manager 0
J2021 John 50 Hong Kong 45000 None Manager 1
X0100 Sean 47 London 125000 None Director 0
X1000 Tobias 59 Shanghai 36000 None Manager 1
F3434 Francois 55 London 250000 None Director 1

13.3 Advanced Filtering in Practice

First Example (Digital Marketing)

Let’s have a look at the following Jupyter Notebook. This notebook illustrates how to select and change information based on conditions

Second Example (Healthcare)

The following notebook Jupyter Notebook illustrates some data filtering performed over a healthcare dataset.

Third Example (Retail)

The following notebook Jupyter Notebook illustrates the application of data filtering over a retail dataset.

13.4 Conclusions

This chapter builds upon foundational data filtering concepts in pandas, focusing on advanced techniques for fine-grained manipulation of DataFrames. It highlights the use of the .loc[] operator for label-based filtering and conditional transformations, such as selecting rows and columns by name, applying conditions to filter or update data, and deriving new columns based on existing data. Similarly, the .iloc[] operator enables filtering by positional indexing, allowing selection and updates based on row and column indices. Practical applications of these methods include conditional replacements, creating categorical variables, and processing subsets of data. The chapter concludes by showcasing real-world use cases in digital marketing, healthcare, and retail, with links to Jupyter Notebooks for further exploration.

In the next chapter you will learn how to perform advanced data grouping and aggregation

13.5 Further Reading

For those interested in additional examples and references on comprehensions feel free to check the following: