Filtering in Pandas is crucial for managing and analyzing complex datasets, as it allows you to isolate specific subsets of data that meet various conditions. In real-world scenarios, datasets can be enormous and contain a myriad of features, making it essential to query them effectively to extract meaningful insights. For instance, when working with sales data from a large retail company, a data analyst might need to investigate transactions that occurred within a specific time frame, involved certain products, or were made by specific customer demographics. By leveraging filtering, users can harness the immense power of Pandas to navigate and manipulate their data in a targeted manner, thus enhancing their analysis.
Consider a scenario involving a customer dataset with numerous attributes such as age, purchase frequency, and location. Imagine you want to extract data for customers who are between the ages of 18 and 35 and live in urban areas. Without proper filtering techniques, your analysis could be muddled with irrelevant entries, making it difficult to derive conclusions or formulate strategic recommendations. By applying filters, like combining multiple conditions using the & operator, you can efficiently curate the dataset to focus solely on that age group from urban locales. This streamlined approach not only reduces noise but also empowers data scientists to develop more tailored marketing strategies or product recommendations.
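As a preview of what this looks like in code, here is a minimal sketch; the customers DataFrame and its age and location columns are hypothetical and serve only to illustrate the idea:

import pandas as pd

# Hypothetical customer data; the column names are illustrative only
customers = pd.DataFrame({
    'age': [22, 41, 30, 19, 55],
    'location': ['urban', 'rural', 'urban', 'urban', 'rural'],
})

# Keep customers aged 18 to 35 who live in urban areas
ageFilter = (customers['age'] >= 18) & (customers['age'] <= 35)
urbanFilter = customers['location'] == 'urban'
customers[ageFilter & urbanFilter]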
Moreover, filtering enhances exploratory data analysis (EDA) by letting you examine the data from different perspectives. Suppose you’re analyzing a company’s employee performance dataset and want to identify employees who exceed their performance targets by a specific threshold while also belonging to particular departments. With filtering, you can express such conditions easily and view correlations, trends, or any anomalies present within those subsets. For example, a user could swiftly pull the records of all employees exceeding their targets, along with their performance scores and their managers’ departments, in one go. This capability to drill down into complex conditions not only improves the efficiency of data insights but also fosters more informed decision-making across various business functions.
In a way, you already know how to carry out some important data filtering and selection operations. In Section 5.2.5 and Section 5.2.6 you practiced selecting specific columns or rows of a DataFrame using the DataFrame.loc[] and DataFrame.iloc[] operators. In Section 6.2 you processed DataFrames using DataFrame.groupby().
In this chapter we will further explore how to formally define and apply filtering conditions to a Pandas DataFrame.
7.1 Boolean Operations on Pandas DataFrames
Remember those comparison and logical operators from Section 3.1.3? They can be applied to Pandas DataFrames as well.
The following example illustrates the output of the boolean operator == applied to the column Region of the DataFrame corporatesales. We observe that the output is True for observations from the 'North' region and False otherwise.
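A minimal sketch of this comparison, assuming corporatesales has already been loaded:

# Returns a boolean Series: True where Region equals 'North', False otherwise
corporatesales['Region'] == 'North'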
Boolean operators provide almost endless possibilities to select observations from a DataFrame. The following table lists some common boolean operations.
Boolean Operation | Description
DataFrame['Column'] == 5 | Returns True if the value in Column is equal to 5
DataFrame['Column'] != 5 | Returns True if the value in Column is not equal to 5
DataFrame['Column'] > 20 | Returns True if the value in Column is greater than 20
DataFrame['Column'] >= 20 | Returns True if the value in Column is greater than or equal to 20
DataFrame['Column'] == 'Dorothy' | Returns True if the value in Column is equal to 'Dorothy'
Following up on our previous example, the following boolean operation returns True if Sales are larger than 220 and False otherwise.
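As a sketch, again assuming the corporatesales DataFrame from the example above:

# Returns True for rows where Sales is larger than 220, False otherwise
corporatesales['Sales'] > 220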
7.2 Basic Filtering Operations

Pandas-based data filtering is performed by combining the boolean operators described in the previous section with the slicing operator [] to extract the rows fulfilling the boolean condition; refer to the following figure.
The following figure illustrates a data filtering operation that:
Defines a boolean condition df['Rank']==16 that returns True if the observation has a Rank value equal to 16.
Using the slicing operator [], it applies the boolean condition to the DataFrame df to extract the rows fulfilling the boolean condition.
The code required to implement this filtering operation is:
# We define the filter: True for observations with Rank equal to 16
rankfilter = df['Rank'] == 16

# We apply the filter to the dataframe to extract the observations fulfilling the filter
df[rankfilter]
   Year  Gender  Rank       Name  Count
7  1983  Female    16    Vanessa   1517
8  1995    Male    16  Christian   2761
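Note that the boolean condition does not need to be stored in a variable; it can also be written directly inside the slicing operator:

# Equivalent inline form of the same filter
df[df['Rank'] == 16]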
7.3 Composite Filtering Operations
It is quite common to define more complex filtering conditions by combining boolean conditions. To do so, we rely on the logical operators & (and) and | (or).
By way of composite filtering you can develop advanced logic such as filtering on several attributes at once, filtering on a time window, or filtering based on the values of other columns.
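The logical operator | works analogously but returns True if at least one of the conditions holds. A minimal sketch, reusing the DataFrame df from the previous section with arbitrary year thresholds:

# Rows whose Year lies outside the window 1985-2010
earlyFilter = df['Year'] < 1985
lateFilter = df['Year'] > 2010
df[earlyFilter | lateFilter]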
The following figure illustrates a composite data filtering operation that:
Defines four boolean conditions: df['Gender']=='Female', df['Name'].str.contains('a'), df['Year']>1980, and df['Count']>1500.
Using the logical operator &, it combines the previous boolean conditions into a composite condition that returns True only if ALL boolean conditions return True.
Using the slicing operator [], it applies the composite condition to the DataFrame df to extract the rows fulfilling the composite condition.
The code required to implement this composite filtering operation is:
# We define the filters
genderFilter = df['Gender'] == "Female"
letterAFilter = df['Name'].str.contains('a')
yearFilter = df['Year'] > 1980
countFilter = df['Count'] > 1500

# We define the composite filter
compositefilter = genderFilter & letterAFilter & yearFilter & countFilter

# We observe that the filter returns True only for the rows fulfilling all four conditions
compositefilter
# We apply the composite filter to the dataframe to extract the observations fulfilling the filter
df[compositefilter]
   Year  Gender  Rank     Name  Count
2  1992  Female     6   Amanda   2776
5  2013  Female     1  Sophina   3447
7  1983  Female    16  Vanessa   1517
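When a composite condition is written inline, each individual condition must be wrapped in parentheses, because & and | bind more tightly than comparison operators such as == and >:

# Inline form of the composite filter; note the parentheses around each condition
df[(df['Gender'] == "Female") & (df['Name'].str.contains('a')) & (df['Year'] > 1980) & (df['Count'] > 1500)]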
7.4 Data Filtering in Practice
7.4.1 First Example
The following Jupyter Notebook illustrates how to define and apply basic filters to the titanic dataset for the purposes of extracting a subset of the population based on age and outcome of the shipwreck (survived/perished). This would allow us, for instance, to explore the impact of age on survival.
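To give a flavour of what such a filter looks like, here is a minimal sketch; the DataFrame name titanicDataFrame and the column names Age and Survived are assumptions that may differ from those used in the notebook:

# Passengers younger than 18 who survived the shipwreck
childFilter = titanicDataFrame['Age'] < 18
survivedFilter = titanicDataFrame['Survived'] == 1
titanicDataFrame[childFilter & survivedFilter]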
Using Generative AI for coding purposes
In the previous example try the following prompt on Google Gemini:
“Filter rows with Name having Thomas in it”
Execute the code provided by Gemini. Did it work?
7.4.2 Second Example
The following Jupyter Notebook illustrates the process of defining and applying filters to the gapminder dataset to extract observations based on country and year. The ability to filter observations based on temporal characteristics is quite useful, for instance, in econometric analysis, as it allows us to measure the impact of events on the economy. In Pandas this is done by defining a filter that splits the dataset into a “before-the-event” and an “after-the-event” subset, over which some statistical analysis is subsequently performed.
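A minimal sketch of such a before/after split (the event year 2000 is arbitrary, and the column names year and lifeExp are assumptions about the gapminder dataset loaded below as gapminderDataFrame):

# Split the observations into before-the-event and after-the-event subsets
beforeFilter = gapminderDataFrame['year'] < 2000
afterFilter = gapminderDataFrame['year'] >= 2000

# Compare average life expectancy across the two subsets
print(gapminderDataFrame[beforeFilter]['lifeExp'].mean())
print(gapminderDataFrame[afterFilter]['lifeExp'].mean())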
Challenge yourself by developing code to provide insights on the following question: How has life expectancy evolved in China?
Code
# We filter by country == 'China'
chinaFilter = gapminderDataFrame['country'] == 'China'
chinaFilter

# We apply the filter to extract the observations for China
gapminderDataFrame[chinaFilter]
Using Generative AI for coding purposes
We haven’t learned how to visualize data just yet, however you can try to plot the evolution of life expectancy in China by submitting the following prompt to Google Gemini:
“plot life expectancy evolution of china”
Hopefully it will return working code that you can use.
7.4.3 Third Example
The following Jupyter Notebook provides examples of data selection, data grouping and data filtering in a healthcare context. Please execute the code making sure that you understand how it works and the logic behind the analysis.
Once you have understood the code provided, try to develop new code to answer the following question: Which Revenue Codes have the largest Average Total Expenses? Try to work it out on your own before checking the solution below (Tip: consider the DataFrame.sort_values() method).
Code
# We group the dataset by RevenueCode and apply mean() to the column TotalExpenses
totalexpensesbyRevenueCode = HealthCareDataSet.groupby('RevenueCode')['TotalExpenses'].mean()

# We sort the result in descending order
totalexpensesbyRevenueCode.sort_values(ascending=False)
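If you only need the single Revenue Code with the largest average, the Series method idxmax() returns the index label of the maximum value; a short sketch based on the result computed above:

# Revenue Code whose average Total Expenses is the largest
totalexpensesbyRevenueCode.idxmax()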
Using Generative AI for coding purposes
In the previous example try the following prompt on Google Gemini:
“Compute the mean of total expenses across revenuecode, sort it in descending order”
Worked like a charm, didn’t it?
Now for another challenge, try to develop code for the following question: Which Medical events lasted longer than 1 week? Again, try to work it out on your own before checking the solution below (Tip: review the following Documentation to see how to extract days from a series of datetime objects).
Code
# We create a new column 'Duration' containing the length of each hospitalization
HealthCareDataSet['Duration'] = HealthCareDataSet['EndDate'] - HealthCareDataSet['StartDate']

# Prior to filtering we extract the number of days from the timedelta objects as integers
HealthCareDataSet['Days'] = HealthCareDataSet['Duration'].dt.days

# We extract rows having 'Days' larger than 7
HealthCareDataSet[HealthCareDataSet['Days'] > 7]
Using Generative AI for coding purposes
In the previous example try the following prompt on Google Gemini:
“Which Medical events lasted longer than 1 week?”
Hopefully it will return working code similar to what you developed on your own.
7.5 Conclusion
This chapter delves into the essential skill of filtering in Pandas, which enables users to isolate and analyze specific subsets of data based on defined conditions. Filtering proves invaluable in managing large and complex datasets, such as identifying specific customer demographics, transaction details, or employee performance metrics. By leveraging boolean operations and logical operators like & and |, users can create precise filters for targeted data analysis, thus enhancing the relevance and clarity of insights. Boolean comparisons, such as checking if values in a column meet certain conditions (e.g., ==, >, or <), are the foundation of filtering. When combined with slicing [], these boolean conditions allow users to extract only the rows of interest from a DataFrame, significantly improving data processing efficiency.
The chapter also introduces the concept of composite filtering, which combines multiple boolean conditions to handle more complex scenarios. For instance, users can filter data based on multiple attributes simultaneously, such as gender, name patterns, year thresholds, or numeric values. This capability enables more advanced analyses, such as identifying specific trends, anomalies, or subsets of interest. Practical applications are explored through examples like filtering the Titanic dataset to analyze survival rates by age and creating temporal filters for econometric studies.
7.6 Further Readings
For those of you in need of additional, more advanced topics, please refer to the following references: