In Chapter 7 you practiced how to perform basic filtering on a pandas DataFrame. Filtering allows you to select observations based on a condition as shown in the following example.
# Step 1 import pandasimport pandas as pd# Step 2: Define the dictionarydata = {'EmployeeID': ['X0099','M1010','J2020','J2021','X0100','X1000','F3434'],'Name': ['Alice', 'Bob', 'Charlie','John','Sean','Tobias','Francois'], # Column 'Name' with names'Age': [25, 30, 35,50,47,59,55], # Column 'Age' with ages'City': ['New York', 'Los Angeles', 'Beijing','Hong Kong','London','Beijing','London'], # Column 'City' with cities'Salary': [70000, 80000, 60000,45000,125000,36000,250000] # Column 'Salary' with salaries}# Step 3: Create the DataFramedf = pd.DataFrame(data)df.set_index('EmployeeID',inplace=True)# Display the DataFramedf
Name
Age
City
Salary
EmployeeID
X0099
Alice
25
New York
70000
M1010
Bob
30
Los Angeles
80000
J2020
Charlie
35
Beijing
60000
J2021
John
50
Hong Kong
45000
X0100
Sean
47
London
125000
X1000
Tobias
59
Beijing
36000
F3434
Francois
55
London
250000
df[df['City']=='Beijing']
Name
Age
City
Salary
EmployeeID
J2020
Charlie
35
Beijing
60000
X1000
Tobias
59
Beijing
36000
At this stage you know already how to iterate and apply conditionals to a variety of data structures, you are also familiar with lambda functions and how to have them applied to manipulate pandas Dataframes. In this chapter you will apply all these techniques to develop advanced filtering and fine-grained manipulation of your datasets.
13.1 Conditional Filtering by Name: the .loc[] operator
The DataFrame.loc[] operator in pandas is used for label-based indexing and selection. It allows you to access a group of rows and columns by label(s) or a boolean array, refer to the following figure.
13.1.1 Column Selection by Name
The DataFrame.loc[] operator is quite versatile, it allows column selection. The following example selects ALL rows and ALL columns from the df DataFrame.
df.loc[:,:]
Name
Age
City
Salary
EmployeeID
X0099
Alice
25
New York
70000
M1010
Bob
30
Los Angeles
80000
J2020
Charlie
35
Beijing
60000
J2021
John
50
Hong Kong
45000
X0100
Sean
47
London
125000
X1000
Tobias
59
Beijing
36000
F3434
Francois
55
London
250000
If we want to be more specific and select ALL rows and columns named ‘City’ and ‘Salary’ then:
df.loc[:,['City','Salary']]
City
Salary
EmployeeID
X0099
New York
70000
M1010
Los Angeles
80000
J2020
Beijing
60000
J2021
Hong Kong
45000
X0100
London
125000
X1000
Beijing
36000
F3434
London
250000
13.1.2 Row Selection by Name
To peform a row-based selection we use the indexing operator DataFrame.loc[] with the name of the row we want to retrieve.
For instance the following example retrieves the row with the name ‘X0099’ of the DataFrame df
df.loc['X0099']
Name Alice
Age 25
City New York
Salary 70000
Name: X0099, dtype: object
To select several rows by name we use the indexing operator DataFrame.loc[] this time with a list of names of the rows we want to retrieve. The following example retrieves rows with names ‘X0099’ and ‘J2021’.
df.loc[['X0099','J2021']]
Name
Age
City
Salary
EmployeeID
X0099
Alice
25
New York
70000
J2021
John
50
Hong Kong
45000
You might want to select some specific rows and columns named ‘City’ and ‘Salary’
df.loc['X0099':'J2021',['City','Salary']]
City
Salary
EmployeeID
X0099
New York
70000
M1010
Los Angeles
80000
J2020
Beijing
60000
J2021
Hong Kong
45000
13.1.3 Conditional selection by Name
The DataFrame.loc[] operator allows conditional filtering, for instance:
df.loc[df['City']=='Beijing']
Name
Age
City
Salary
EmployeeID
J2020
Charlie
35
Beijing
60000
X1000
Tobias
59
Beijing
36000
df.loc[df['Salary']<50000]
Name
Age
City
Salary
EmployeeID
J2021
John
50
Hong Kong
45000
X1000
Tobias
59
Beijing
36000
You can combine conditions using the & operator (AND) for a more nuanced data selection, for instance:
In real scenarios you oftentimes need to transform your data based on certain criteria, the DataFrame.loc[] operator is specially useful for this. The following example updates the salary of the employee named ‘X099’.
df.loc['X0099', 'Salary'] =95000df
Name
Age
City
Salary
EmployeeID
X0099
Alice
25
New York
95000
M1010
Bob
30
Los Angeles
80000
J2020
Charlie
35
Beijing
60000
J2021
John
50
Hong Kong
45000
X0100
Sean
47
London
125000
X1000
Tobias
59
Beijing
36000
F3434
Francois
55
London
250000
Oftentimes you need to modify the values of columns based on some condition. Say that all employees from the Beijing office are to be relocated to a new office in Shanghai, we can update our dataset as follows:
Conditional replacement by Name is a quite powerful feature that allows you to transform your datasets prior to any subsequent analysis or AI training. For instance to derive new columns based on information from existing columns.
The following example illustrates how to create a categorical variable (dummy) derived from the values of the variable ZZZ. This is quite useful in econometrics and machine learning.
13.2 Conditional Filtering by Position: the .iloc[] operator
The DataFrame.iloc[] operator in pandas is used for integer-location based indexing and selection. It allows you to access a group of rows and columns by their position rather than by name. This is particularly useful when you need to work with the position of elements in the DataFrame rather than their labels, refer to the following figure.
13.2.1 Column Selection by Position
The DataFrame.iloc[] operator allows you to select columns based on the position in the DataFrame. The following example selects ALL rows and ALL columns from the df DataFrame.
df.iloc[:,:]
Name
Age
City
Salary
Category
Rank
Seniority
EmployeeID
X0099
Alice
25
New York
95000
None
Director
0
M1010
Bob
30
Los Angeles
80000
None
Director
0
J2020
Charlie
35
Shanghai
60000
None
Manager
0
J2021
John
50
Hong Kong
45000
None
Manager
1
X0100
Sean
47
London
125000
None
Director
0
X1000
Tobias
59
Shanghai
36000
None
Manager
1
F3434
Francois
55
London
250000
None
Director
1
If we want to be more specific and select ALL rows and columns in the first and third place then:
df.iloc[:,[1,3]]
Age
Salary
EmployeeID
X0099
25
95000
M1010
30
80000
J2020
35
60000
J2021
50
45000
X0100
47
125000
X1000
59
36000
F3434
55
250000
You might be interested in selecting a range of columns, for instance to select columns from the second position up to the fourth one:
df.iloc[:,1:4]
Age
City
Salary
EmployeeID
X0099
25
New York
95000
M1010
30
Los Angeles
80000
J2020
35
Shanghai
60000
J2021
50
Hong Kong
45000
X0100
47
London
125000
X1000
59
Shanghai
36000
F3434
55
London
250000
13.2.2 Row Selection by Position
To perform a row-based selection by position we use DataFrame.iloc[] with the indices of the rows we want to retrieve.
For instance the following example retrieves rows from the third position up to the sixth one.
df.iloc[2:6,:]
Name
Age
City
Salary
Category
Rank
Seniority
EmployeeID
J2020
Charlie
35
Shanghai
60000
None
Manager
0
J2021
John
50
Hong Kong
45000
None
Manager
1
X0100
Sean
47
London
125000
None
Director
0
X1000
Tobias
59
Shanghai
36000
None
Manager
1
13.2.3 Conditional replacement by Position
It is possible, yet uncommon in practice, to implement a conditional replacement based on position using the DataFrame.iloc[] operator.
The following example updates the fifth column of the first two rows with the value ‘Partner’.
df.iloc[0:2,5]='Partner'df
Name
Age
City
Salary
Category
Rank
Seniority
EmployeeID
X0099
Alice
25
New York
95000
None
Partner
0
M1010
Bob
30
Los Angeles
80000
None
Partner
0
J2020
Charlie
35
Shanghai
60000
None
Manager
0
J2021
John
50
Hong Kong
45000
None
Manager
1
X0100
Sean
47
London
125000
None
Director
0
X1000
Tobias
59
Shanghai
36000
None
Manager
1
F3434
Francois
55
London
250000
None
Director
1
13.3 Advanced Filtering in Practice
First Example (Digital Marketing)
Let’s have a look at the following Jupyter Notebook. This notebook illustrates how to select and change information based on conditions
Second Example (Healthcare)
The following notebook Jupyter Notebook illustrates some data filtering performed over a healthcare dataset.
Third Example (Retail)
The following notebook Jupyter Notebook illustrates the application of data filtering over a retail dataset.
13.4 Conclusions
This chapter builds upon foundational data filtering concepts in pandas, focusing on advanced techniques for fine-grained manipulation of DataFrames. It highlights the use of the .loc[] operator for label-based filtering and conditional transformations, such as selecting rows and columns by name, applying conditions to filter or update data, and deriving new columns based on existing data. Similarly, the .iloc[] operator enables filtering by positional indexing, allowing selection and updates based on row and column indices. Practical applications of these methods include conditional replacements, creating categorical variables, and processing subsets of data. The chapter concludes by showcasing real-world use cases in digital marketing, healthcare, and retail, with links to Jupyter Notebooks for further exploration.
In the next chapter you will learn how to perform advanced data grouping and aggregation
13.5 Further Reading
For those interested in additional examples and references on comprehensions feel free to check the following: