import pandas as pd
# Create a Series
= pd.Series([1, 3, 5, 7, 9])
s s
0 1
1 3
2 5
3 7
4 9
dtype: int64
Pandas is a powerful, open-source library for data manipulation and analysis in Python. It’s widely used in the field of data science because it provides essential data structures and functions that make working with structured data easy and efficient.
Think of Pandas as a toolkit that allows you to handle, manipulate, and analyse data in a way that’s both intuitive and high-performing.
Pandas is important in data science for several reasons:
Data Cleaning and Preparation: Data scientists often work with real-world data, which can be messy. Pandas provides functions to easily handle missing values, filter out unnecessary information, and format data into a more useful structure. This step is crucial since clean data leads to more accurate analysis.
Data Analysis: Pandas allows you to perform powerful data analysis quickly. You can compute summary statistics, filter data based on conditions, and perform aggregations efficiently. This is very useful for exploratory data analysis (EDA), where you seek to understand the underlying patterns in your data.
Manipulating and Reshaping Data: It offers robust functionality for transforming data, such as merging, concatenating, and reshaping datasets. This is essential when trying to combine multiple data sources or rearranging data to fit analysis needs.
Integration with Other Libraries: Pandas works seamlessly with other libraries like NumPy (for numerical computations), Matplotlib (for visualization), and Scikit-learn (for machine learning). This interoperability makes it a foundational tool in the data science ecosystem.
Performance: Despite being a high-level library, Pandas is optimised for performance. It leverages efficient data structures based on NumPy arrays, making operations on large datasets faster than you might expect.
Ease of Use:The syntax and functions in Pandas are designed for usability. Even if you are new to programming or data analysis, you can perform complex data manipulations with just a few lines of code. This makes it accessible for beginners and useful for experienced data professionals alike.
For instance Pandas can be used in the following use cases:
Analysing Sales Data: Imagine you have a CSV file containing sales data with columns for product, price, and quantity sold. With Pandas, you can easily read the file, clean the data (like removing NaN values), filter products above a certain price, and compute total sales.
Weather Data Analysis: If you’re analysing climate data over several years, you can use Pandas to read the datasets, aggregate monthly averages, and visualise trends.
Customer Data Management: When managing customer information for a business, Pandas can help you analyse customer behaviour, segment users, and predict trends based on historical data.
The two primary data structures in Pandas are:
Series: A one-dimensional labelled array that can hold any data type (integers, floats, strings, etc.). You can think of it as a single column from a spreadsheet. The axis labels are collectively referred to as the index.
import pandas as pd
# Create a Series
= pd.Series([1, 3, 5, 7, 9])
s s
0 1
1 3
2 5
3 7
4 9
dtype: int64
DataFrame: A two-dimensional labelled data structure with columns of potentially different types. It’s similar to a spreadsheet or SQL table, making it very easy to visualise and understand. A DataFrame also has an index.
# Create a DataFrame
import pandas as pd
= {
data 'Name': ['Bob', 'Alice', 'Charlie'],
'Age': [25, 35, 30],
'City': ['New York', 'Los Angeles', 'Chicago']
}= pd.DataFrame(data)
df df
Name | Age | City | |
---|---|---|---|
0 | Bob | 25 | New York |
1 | Alice | 35 | Los Angeles |
2 | Charlie | 30 | Chicago |
In Pandas, basic operations over DataFrames include several essential tasks for data manipulation and analysis. First, you can create a DataFrame from various data sources, such as dictionaries or CSV files. Inspecting the DataFrame can be done using methods like head()
, tail()
, info()
, and describe()
to understand its structure and summary statistics. You can then select specific data by accessing columns with bracket notation or using methods like loc
and iloc
for row and column selection. Filtering allows you to retrieve rows based on conditions, complemented by the ability to add or modify columns easily. For data cleaning, removing columns or rows can be done using the drop()
method. You can sort data using the sort_values()
method and group data with groupby()
to perform aggregation operations. Merging and joining multiple DataFrames can be achieved using join()
, merge()
or concat()
, and finally, you can save your DataFrame to various file formats, such as CSV or Excel, using methods like to_csv()
. Together, these operations form the backbone of data manipulation in Pandas.
Usually you will be creating DataFrames by importing existing datasets, Pandas offers several methods to import external datasets, refer to the following figure.
For instance, to import CSV files we use pandas.read_csv()
. In the example following we access a URL "https://download.mlcc.google.com/mledu-datasets/"
to download a CSV file named california_housing_train.csv
and have it converted into a Pandas DataFrame called new_dataframe
.
# Create a new DataFrame by importing an external CSV file
import pandas as pd
= pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
new_dataframe new_dataframe
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|---|
0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
1 | -114.47 | 34.40 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8200 | 80100.0 |
2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9250 | 65500.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
16995 | -124.26 | 40.58 | 52.0 | 2217.0 | 394.0 | 907.0 | 369.0 | 2.3571 | 111400.0 |
16996 | -124.27 | 40.69 | 36.0 | 2349.0 | 528.0 | 1194.0 | 465.0 | 2.5179 | 79000.0 |
16997 | -124.30 | 41.84 | 17.0 | 2677.0 | 531.0 | 1244.0 | 456.0 | 3.0313 | 103600.0 |
16998 | -124.30 | 41.80 | 19.0 | 2672.0 | 552.0 | 1298.0 | 478.0 | 1.9797 | 85800.0 |
16999 | -124.35 | 40.54 | 52.0 | 1820.0 | 300.0 | 806.0 | 270.0 | 3.0147 | 94600.0 |
17000 rows × 9 columns
To import XLS or JSON files we would use pandas.read_excel()
and pandas.read_json()
respectively. Refer to Pandas IO methods for additional methods to import external datasets.
Indeed DataFrames can be created from scratch using the pandas.DataFrame()
method. The following example shows how to create a new DataFrame from a Dictionary.
# Step 1 import pandas
import pandas as pd
# Step 2: Define the dictionary
= {
data 'Name': ['Alice', 'Bob', 'Charlie','John','Sean','Tobias'], # Column 'Name' with names
'Age': [25, 30, 35,50,47,59], # Column 'Age' with ages
'City': ['New York', 'Los Angeles', 'Chicago','Hong Kong','London','Beijing'], # Column 'City' with cities
'Salary': [70000, 80000, 60000,45000,125000,36000] # Column 'Salary' with salaries
}
# Step 3: Create the DataFrame
= pd.DataFrame(data)
df
# Display the DataFrame
df
Name | Age | City | Salary | |
---|---|---|---|---|
0 | Alice | 25 | New York | 70000 |
1 | Bob | 30 | Los Angeles | 80000 |
2 | Charlie | 35 | Chicago | 60000 |
3 | John | 50 | Hong Kong | 45000 |
4 | Sean | 47 | London | 125000 |
5 | Tobias | 59 | Beijing | 36000 |
You will notice that the DataFrame df
has an additional column at the beginning that runs from zero to five. This is called the index
of the DataFrame and provides a convenient mechanism to name each row automatically.
Pandas offers useful methods to inspect existing DataFrames. This is useful to conduct exploratory data analysis (EDA) and gain a preliminary understanding of the dataset structure.
To have a look at the first rows of a DataFrame we use the pandas.head()
method
# Displays the first 5 rows by default df.head()
Name | Age | City | Salary | |
---|---|---|---|---|
0 | Alice | 25 | New York | 70000 |
1 | Bob | 30 | Los Angeles | 80000 |
2 | Charlie | 35 | Chicago | 60000 |
3 | John | 50 | Hong Kong | 45000 |
4 | Sean | 47 | London | 125000 |
To have a look at the last rows of a DataFrame we use the pandas.tail()
method
# Displays the last 5 rows by default df.tail()
Name | Age | City | Salary | |
---|---|---|---|---|
1 | Bob | 30 | Los Angeles | 80000 |
2 | Charlie | 35 | Chicago | 60000 |
3 | John | 50 | Hong Kong | 45000 |
4 | Sean | 47 | London | 125000 |
5 | Tobias | 59 | Beijing | 36000 |
To have a look at a random sample of rows of a DataFrame we use the pandas.sample()
method
=2) # Displays a random sample of 2 rows df.sample(n
Name | Age | City | Salary | |
---|---|---|---|---|
5 | Tobias | 59 | Beijing | 36000 |
3 | John | 50 | Hong Kong | 45000 |
To get a summary of a DataFrame we use the pandas.info()
method .
# Displays a summary of the DataFrame df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 6 non-null object
1 Age 6 non-null int64
2 City 6 non-null object
3 Salary 6 non-null int64
dtypes: int64(2), object(2)
memory usage: 324.0+ bytes
Based on the previous information we learn that the df
DataFrame has 4 columns: ‘Name’, ‘Age’, ‘City’ and ‘Salary’. ‘Name’ and ‘City’ are objects
, ‘Age’ and ‘Salary’ are int64
(this is integers with a precision of 64 bits). the DataFrame has 3 rows and requires 224 bytes of memory. It also has an index
consisting of 6 elements [0-5].
To get a basic statistical description of a DataFrame we use the pandas.describe()
method .
# Displays a basic statistical description of numerical columns in the DataFrame df.describe()
Age | Salary | |
---|---|---|
count | 6.000000 | 6.000000 |
mean | 41.000000 | 69333.333333 |
std | 13.069047 | 31633.315771 |
min | 25.000000 | 36000.000000 |
25% | 31.250000 | 48750.000000 |
50% | 41.000000 | 65000.000000 |
75% | 49.250000 | 77500.000000 |
max | 59.000000 | 125000.000000 |
By default Pandas associates a numerical index by default, it is possible however to specify your own index when creating new DataFrames, as illustrated in the following two examples:
= pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",",index_col='population')
california_dataframe california_dataframe
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|
population | ||||||||
1015.0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 472.0 | 1.4936 | 66900.0 |
1129.0 | -114.47 | 34.40 | 19.0 | 7650.0 | 1901.0 | 463.0 | 1.8200 | 80100.0 |
333.0 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 117.0 | 1.6509 | 85700.0 |
515.0 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 226.0 | 3.1917 | 73400.0 |
624.0 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 262.0 | 1.9250 | 65500.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
907.0 | -124.26 | 40.58 | 52.0 | 2217.0 | 394.0 | 369.0 | 2.3571 | 111400.0 |
1194.0 | -124.27 | 40.69 | 36.0 | 2349.0 | 528.0 | 465.0 | 2.5179 | 79000.0 |
1244.0 | -124.30 | 41.84 | 17.0 | 2677.0 | 531.0 | 456.0 | 3.0313 | 103600.0 |
1298.0 | -124.30 | 41.80 | 19.0 | 2672.0 | 552.0 | 478.0 | 1.9797 | 85800.0 |
806.0 | -124.35 | 40.54 | 52.0 | 1820.0 | 300.0 | 270.0 | 3.0147 | 94600.0 |
17000 rows × 8 columns
# Create a DataFrame
import pandas as pd
= {
data 'Name': ['Bob', 'Alice', 'Charlie'],
'Age': [25, 35, 30],
'City': ['New York', 'Los Angeles', 'Chicago'],
'Salary': [30000, 80000, 60000]
}=['Employee_001','Employee_002','Employee_003']
labels= pd.DataFrame(data,index=labels)
df2 df2
Name | Age | City | Salary | |
---|---|---|---|---|
Employee_001 | Bob | 25 | New York | 30000 |
Employee_002 | Alice | 35 | Los Angeles | 80000 |
Employee_003 | Charlie | 30 | Chicago | 60000 |
The index of an existing DataFrame can be manipulated via DataFrame.set_index()
and DataFrame.reset_index()
, for instance the following code resets the existing index creating a new one.
=df2.reset_index()
df2_new df2_new
index | Name | Age | City | Salary | |
---|---|---|---|---|---|
0 | Employee_001 | Bob | 25 | New York | 30000 |
1 | Employee_002 | Alice | 35 | Los Angeles | 80000 |
2 | Employee_003 | Charlie | 30 | Chicago | 60000 |
You can use the option drop=True
parameter to avoid the old index being added as a column:
=df2.reset_index(drop=True)
df2_new df2_new
Name | Age | City | Salary | |
---|---|---|---|---|
0 | Bob | 25 | New York | 30000 |
1 | Alice | 35 | Los Angeles | 80000 |
2 | Charlie | 30 | Chicago | 60000 |
Even better you can use the option inplace=True
to apply the changes in the original dataframe as follows:
=True,drop=True)
df2.reset_index(inplace df2
Name | Age | City | Salary | |
---|---|---|---|---|
0 | Bob | 25 | New York | 30000 |
1 | Alice | 35 | Los Angeles | 80000 |
2 | Charlie | 30 | Chicago | 60000 |
Given an existing DataFrame you can set an existing column as the new index with DataFrame.set_index()
, for instance:
=df2.set_index('City')
df2_new df2_new
Name | Age | Salary | |
---|---|---|---|
City | |||
New York | Bob | 25 | 30000 |
Los Angeles | Alice | 35 | 80000 |
Chicago | Charlie | 30 | 60000 |
Again you can use the option inplace=True
to apply the changes to the existing dataframe:
'City',inplace=True)
df2.set_index( df2
Name | Age | Salary | |
---|---|---|---|
City | |||
New York | Bob | 25 | 30000 |
Los Angeles | Alice | 35 | 80000 |
Chicago | Charlie | 30 | 60000 |
Two useful methods are DataFrame.sort_values()
and DataFrame.sort_index()
as they allow you to sort DataFrames by columns or by index.
The following examples order the df2
DataFrame by the column Name
in ascending and descending order respectively.
='Name',inplace=True)
df2.sort_values(by df2
Name | Age | Salary | |
---|---|---|---|
City | |||
Los Angeles | Alice | 35 | 80000 |
New York | Bob | 25 | 30000 |
Chicago | Charlie | 30 | 60000 |
='Salary',inplace=True, ascending=False)
df2.sort_values(by df2
Name | Age | Salary | |
---|---|---|---|
City | |||
Los Angeles | Alice | 35 | 80000 |
Chicago | Charlie | 30 | 60000 |
New York | Bob | 25 | 30000 |
It is also possible to sort DataFrames by index, for instance:
=True)
df2.sort_index(inplace df2
Name | Age | Salary | |
---|---|---|---|
City | |||
Chicago | Charlie | 30 | 60000 |
Los Angeles | Alice | 35 | 80000 |
New York | Bob | 25 | 30000 |
Pandas offers several methods to select columns from existing DataFrames. For instance, to select a single column we use the indexing operator []
with the name of the column we are interested in.
= df2['Age']
ages ages
City
Chicago 30
Los Angeles 35
New York 25
Name: Age, dtype: int64
To select several columns we use the indexing operator []
with a list of the names of the columns we want to select.
= df2[['Name','Age']]
names_and_ages names_and_ages
Name | Age | |
---|---|---|
City | ||
Chicago | Charlie | 30 |
Los Angeles | Alice | 35 |
New York | Bob | 25 |
Pandas offers several methods to select rows from an existing DataFrame. For instance to select a single row we use the indexing operator DataFrame.loc[]
with the name of the row we want to retrieve. Remember that the index provides a naming for each row. For instance the following example retrieves the row with the name ‘Chicago’ frim the DataFrame df2
.
= df2.loc['Chicago']
row_named_Chicago row_named_Chicago
Name Charlie
Age 30
Salary 60000
Name: Chicago, dtype: object
To select several rows by name we use the indexing operator DataFrame.loc[]
this time with a list of names of the rows we want to retrieve. The following example retrieves rows with names ‘1’ and ‘5’.
= df2.loc[['Chicago','New York']]
rows_named_Chicago_NewYork rows_named_Chicago_NewYork
Name | Age | Salary | |
---|---|---|---|
City | |||
Chicago | Charlie | 30 | 60000 |
New York | Bob | 25 | 30000 |
In future chapters we will see other uses of the operator df.loc[]
, for instance to filter rows based on conditions.
Pandas offers many numerical operations that allow you to perform calculations on DataFrames efficiently. For instance:
'Age'].mean() df2[
30.0
'Salary'].max() df2[
80000
'Salary']/1000000 df2[
City
Chicago 0.06
Los Angeles 0.08
New York 0.03
Name: Salary, dtype: float64
'Salary']/df2['Age'] df2[
City
Chicago 2000.000000
Los Angeles 2285.714286
New York 1200.000000
dtype: float64
'Salary']/df2['Salary'].min() df2[
City
Chicago 2.000000
Los Angeles 2.666667
New York 1.000000
Name: Salary, dtype: float64
The following table lists some common numerical operations available in pandas:
Operation | Description | Example Code |
---|---|---|
Addition | Adds a value to each element or adds two DataFrames. | df + 10 or df + other_df |
Subtraction | Subtracts a value from each element or subtracts two DataFrames. | df - 5 or df - other_df |
Multiplication | Multiplies each element by a value or multiplies two DataFrames element-wise. | df * 2 or df * other_df |
Division | Divides each element by a value or divides two DataFrames element-wise. | df / 3 or df / other_df |
Sum | Computes the sum for each column (or row with axis=1 ). |
df.sum() |
Mean | Computes the mean of each column (or row with axis=1 ). |
df.mean() |
Median | Computes the median of each column (or row with axis=1 ). |
df.median() |
Min/Max | Computes the minimum or maximum of each column (or row with axis=1 ). |
df.min() , df.max() |
Standard Deviation | Computes the standard deviation of each column (or row with axis=1 ). |
df.std() |
Variance | Computes the variance of each column (or row with axis=1 ). |
df.var() |
Cumulative Sum | Computes the cumulative sum of each column (or row with axis=1 ). |
df.cumsum() |
Cumulative Product | Computes the cumulative product of each column (or row with axis=1 ). |
df.cumprod() |
Difference | Computes the difference between consecutive elements in each column (or row with axis=1 ). |
df.diff() |
Correlation | Computes the pairwise correlation between columns. | df.corr() |
Covariance | Computes the pairwise covariance between columns. | df.cov() |
Rank | Assigns ranks to values in each column (or row with axis=1 ). |
df.rank() |
Quantile | Returns the specified quantile for each column. | df.quantile(0.5) |
Rounding | Rounds values to the specified number of decimal places. | df.round(2) |
Absolute Value | Takes the absolute value of each element. | df.abs() |
Exponential | Applies the exponential function to each element. | df.exp() |
Standardization (Z-score) | Scales values to have a mean of 0 and standard deviation of 1. | (df - df.mean()) / df.std() |
Min-Max Scaling | Scales values between 0 and 1. | (df - df.min()) / (df.max() - df.min()) |
The best approach to master Pandas is by using it in your data science endeavours. Remember the motto: “Practice makes perfect”.
The following Jupyter Notebook illustrates Pandas’ core functionalities for data manipulation and analysis. It covers essential operations like loading data from CSV files into a DataFrame, accessing specific rows and columns, and selecting subsets of data based on their names or positions.
The following Jupyter Notebook provides a basic introduction to the pandas library in Python, focusing on data manipulation and analysis using the Titanic dataset. It covers essential operations such as loading data into a DataFrame, exploring the dataset with descriptive statistics, manipulating indexes, sorting data, selecting specific columns and rows, and performing basic arithmetic operations on columns.
This chapter has introduced Pandas as a fundamental tool for data manipulation and analysis in Python, essential for modern data science workflows. Through its intuitive data structures, Series and DataFrame, Pandas provides a versatile platform for handling and analysing structured data. We explored various aspects of Pandas, from importing and cleaning datasets to performing complex data manipulations such as filtering, reshaping, and merging data. Additionally, Pandas offers efficient handling of missing values, powerful numerical operations, and seamless integration with other libraries, enhancing its utility for exploratory data analysis (EDA) and machine learning tasks.
The chapter emphasised how Pandas simplifies the process of transforming raw data into insights. By automating repetitive data wrangling tasks and providing tools for visualisation and statistical analysis, Pandas allows data scientists to focus more on deriving meaning from data rather than manual data preparation. For those beginning with data science or seeking to refine their analytical skills, mastering Pandas is an invaluable step. Moving forward, applying these techniques in real-world scenarios, as suggested in the practical exercises, will solidify your understanding and open up new possibilities for efficient, insightful data analysis.
In the next chapter you will learn how to group and aggregate DataFrames using Pandas.
Data scientists must be proficient in the use of the Pandas library to harness the complexity arising in industrial contexts and be able to have data that is ready for statistical analysis or AI-based inference. We strongly advice you to have a look at the following references:
If you want a good reference site: Official Pandas Reference
Gentle introduction to Pandas: 10 minutes to Pandas
Additional learning resources (Advanced): Python for Data Analysis