A time series is a sequence of data points indexed in time order. It is typically a set of observations recorded sequentially over time at consistent intervals. In its most common form, a time series consists of a series of time-stamped data points, each reflecting the value of a variable at a specific time.
Characteristics of Time Series Data:
Time series data is characterized by several key features: (1) temporal order is crucial, as the data points are ordered in time and the sequence matters; (2) time series data is usually recorded at regular intervals, such as hourly, daily, or monthly; (3) over time, a time series may exhibit trends, which are long-term movements in the data; (4) some datasets may show seasonal patterns based on periodic fluctuations, like sales spikes during holidays; and (5) time series data may also reflect cyclical patterns, which occur over longer, non-fixed periods and often tie into economic cycles.
Importance of Time Series in Data Science:
Forecasting: Time series analysis is crucial for predicting future values based on previously observed values. This is widely applied in economics, stock market analysis, weather forecasting, and resource consumption prediction.
Identifying Trends and Seasonality: Understanding trends (long-term movements in a data series) and seasonality (regular variations per time period) helps in strategic decision-making. Businesses can adjust their operations, inventory, and marketing campaigns based on these insights.
Monitoring and Anomaly Detection: Time series data are often used to monitor the performance of systems and detect anomalies. For example, monitoring server performance metrics over time can help identify when resources are under strain or behaving unexpectedly.
Causal Analysis: Time series can help establish relationships between variables. For example, when assessing economic indicators’ effects, time series analysis can help determine how changes in one area (e.g., interest rates) affect another (e.g., unemployment rates).
Event Analysis: Time series analysis allows data scientists to assess the impact of particular events (like marketing campaigns or policy changes) on variables of interest over time.
Financial Analysis: In finance, time series analysis is crucial for modeling and forecasting stock prices, interest rates, and other financial metrics. Advanced methods are used for modeling volatility and trends in financial time series.
Sports Analytics: Time series can be used to analyze the performance of players or teams over time, providing insights for training, strategies, and fan engagement.
Manufacturing and Operations: Time series data are used to analyze production trends, equipment performance, and supply chain dynamics which can lead to optimized operations and reduced costs.
16.1 Time processing using native Python libraries
The datetime module in Python is designed to handle date and time manipulation. It provides classes for working with time-related data: date, time, datetime, timedelta, and timezone.
Date and datetime Objects
The date and datetime types are the standard way to handle time-related information in Python. The following code creates a date object and accesses its elements.
from datetime import date

# Create a date object
d = date(2024, 10, 9)
print(f"I have created a new date: {d}")
print(f"Its elements are year:{d.year}, month:{d.month} and day:{d.day}")
I have created a new date: 2024-10-09
Its elements are year:2024, month:10 and day:9
The following code creates a datetime object and accesses its elements.
from datetime import datetime

# Create a datetime object
dt = datetime(2024, 10, 9, 14, 30, 45)
print(f"I have created a new datetime: {dt}")
print(f"Its elements are year:{dt.year}, month:{dt.month}, day:{dt.day}, hour:{dt.hour}, minute:{dt.minute} and second:{dt.second}")
I have created a new datetime: 2024-10-09 14:30:45
Its elements are year:2024, month:10, day:9, hour:14, minute:30 and second:45
Python allows multiple alternatives to represent time-related information as text via the strftime() method, as shown in the following example.
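A minimal illustration of strftime(), which renders the same datetime in several textual forms (the inverse operation, strptime(), is covered later in this section):

```python
from datetime import datetime

dt = datetime(2024, 10, 9, 14, 30, 45)

# The same moment rendered with different format codes
print(dt.strftime("%Y-%m-%d"))        # 2024-10-09
print(dt.strftime("%A, %B %d, %Y"))   # Wednesday, October 09, 2024
print(dt.strftime("%I:%M %p"))        # 02:30 PM
```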
The following table provides a description of some of the available format codes you can use to display datetime objects.
Format Code   Description                        Example Output
%Y            Four-digit year                    2024
%y            Two-digit year                     24
%m            Month as a zero-padded decimal     01-12
%B            Full month name                    October
%b            Abbreviated month name             Oct
%d            Day of the month, zero-padded      01-31
%A            Full weekday name                  Wednesday
%a            Abbreviated weekday name           Wed
%H            Hour (24-hour clock)               00-23
%I            Hour (12-hour clock)               01-12
%M            Minute                             00-59
%S            Second                             00-59
%f            Microsecond                        000000-999999
%p            AM/PM indicator                    AM/PM
%z            UTC offset                         +0100
%Z            Timezone name                      UTC, EST, CST, etc.
To convert strings to datetime objects in Python, you can use the datetime.strptime() method from the datetime module. This method parses a string representing a date and/or time according to a specified format and returns a datetime object.
from datetime import datetime

# Define the date string and the format
date_string = "2024-10-09 14:30:45"
date_format = "%Y-%m-%d %H:%M:%S"

# Convert the string to a datetime object
datetime_object = datetime.strptime(date_string, date_format)
print(datetime_object)        # Output: 2024-10-09 14:30:45
print(type(datetime_object))  # Output: <class 'datetime.datetime'>
2024-10-09 14:30:45
<class 'datetime.datetime'>
Timedelta Objects
A timedelta object in Python represents a duration, i.e., the difference between two dates or times. It is part of the datetime module and is useful for performing date and time arithmetic.
The following code creates a timedelta object by specifying the duration in days, seconds, microseconds, milliseconds, minutes, hours, or weeks. All arguments are optional and default to 0.
from datetime import timedelta

# Creating a timedelta object
delta = timedelta(days=5, hours=3, minutes=30)
print(delta)  # Output: 5 days, 3:30:00
5 days, 3:30:00
timedelta objects are useful for performing time-difference calculations, for instance computing the time elapsed between two events:
from datetime import datetime, timedelta

# Event start and end times
start_time = datetime(2024, 10, 9, 9, 0)
end_time = datetime(2024, 10, 9, 17, 30)

# Duration of the event
duration = end_time - start_time
print(duration)
print(type(duration))
8:30:00
<class 'datetime.timedelta'>
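timedelta objects can also be added to or subtracted from datetime objects. A short sketch (the deadline variable is illustrative) computing a date two weeks and one day after a start time:

```python
from datetime import datetime, timedelta

start_time = datetime(2024, 10, 9, 9, 0)

# Adding a timedelta to a datetime produces a new datetime
deadline = start_time + timedelta(weeks=2, days=1)
print(deadline)  # 2024-10-24 09:00:00
```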
Timezone-Aware Objects
Thanks to the zoneinfo module, it is possible to handle timezone-related information directly with datetime. The zoneinfo module provides support for the IANA time zone database, the industry standard for time zone information. This module allows you to use named time zones (e.g., 'America/New_York', 'Europe/London'). It also automatically handles daylight saving time transitions, adjusting times according to the specified time zone.
from datetime import datetime
from zoneinfo import ZoneInfo

# Create a datetime object with a specific time zone
londondatetime = datetime(2024, 10, 9, 14, 30, tzinfo=ZoneInfo('UTC'))
newyorkdatetime = datetime(2024, 10, 9, 14, 30, tzinfo=ZoneInfo('America/New_York'))
print(f"The time in London is: {londondatetime}")
print(f"The time in New York is: {newyorkdatetime}")
The time in London is: 2024-10-09 14:30:00+00:00
The time in New York is: 2024-10-09 14:30:00-04:00
Timezone-aware datetimes support arithmetic operations as usual:
print(f"the time difference between New York and London is {newyorkdatetime-londondatetime}")
the time difference between New York and London is 4:00:00
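You can also re-express a timezone-aware datetime in another zone with astimezone(). A minimal sketch, assuming the IANA time zone data is available on your system:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

londondatetime = datetime(2024, 10, 9, 14, 30, tzinfo=ZoneInfo('UTC'))

# astimezone() keeps the instant but changes the zone in which it is expressed
tokyodatetime = londondatetime.astimezone(ZoneInfo('Asia/Tokyo'))
print(tokyodatetime)  # 2024-10-09 23:30:00+09:00
```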
16.2 Time Series processing with Pandas
Time series manipulation in Pandas is a common task when working with timestamped data. Pandas has built-in support for datetime objects. You can easily convert columns to datetime format, set them as index, and perform various operations on them.
import pandas as pd

# Create a sample dataframe with a datetime column
data = {
    'date': ['2024-09-01', '2024-09-15', '2024-10-01', '2024-10-15',
             '2024-12-01', '2024-12-15', '2024-11-01', '2024-11-14'],
    'customers': [50, 75, 100, 200, 150, 250, 235, 134],
    'revenue': [120000, 180000, 230000, 450000, 200000, 500000, 450000, 250000]
}
df = pd.DataFrame(data)

# Convert 'date' column to datetime
df['date'] = pd.to_datetime(df['date'])
df
        date  customers  revenue
0 2024-09-01         50   120000
1 2024-09-15         75   180000
2 2024-10-01        100   230000
3 2024-10-15        200   450000
4 2024-12-01        150   200000
5 2024-12-15        250   500000
6 2024-11-01        235   450000
7 2024-11-14        134   250000
16.2.1 Basic operations
When dealing with timestamped data it is advisable to set the datetime column as the index, as this facilitates time-related operations. The following code illustrates how to use the DataFrame.set_index() method.
# Set 'date' as index
df.set_index('date', inplace=True)
df
            customers  revenue
date
2024-09-01         50   120000
2024-09-15         75   180000
2024-10-01        100   230000
2024-10-15        200   450000
2024-12-01        150   200000
2024-12-15        250   500000
2024-11-01        235   450000
2024-11-14        134   250000
Sorting
It is advisable to keep Pandas time series ordered, especially before applying any transformation to them. For this we have the DataFrame.sort_index() method:
df.sort_index(inplace=True)
df
            customers  revenue
date
2024-09-01         50   120000
2024-09-15         75   180000
2024-10-01        100   230000
2024-10-15        200   450000
2024-11-01        235   450000
2024-11-14        134   250000
2024-12-01        150   200000
2024-12-15        250   500000
df.sort_index(ascending=False)
            customers  revenue
date
2024-12-15        250   500000
2024-12-01        150   200000
2024-11-14        134   250000
2024-11-01        235   450000
2024-10-15        200   450000
2024-10-01        100   230000
2024-09-15         75   180000
2024-09-01         50   120000
Selection and Filtering
Once the data is sorted and indexed by date, it is possible to perform the usual operations, such as selection by label, filtering, or slicing, among many others.
df.loc['2024-10-01':'2024-11-03']
            customers  revenue
date
2024-10-01        100   230000
2024-10-15        200   450000
2024-11-01        235   450000
df.loc['2024-10']
            customers  revenue
date
2024-10-01        100   230000
2024-10-15        200   450000
df.loc[df.index > '2024-12-03']
            customers  revenue
date
2024-12-15        250   500000
Timezone localization
Pandas also allows you to specify the timezone your data is expressed in by way of the DataFrame.tz_localize() method, for instance:
# Localize the datetime index to a specific timezone (e.g., New York)
df_tz_ny = df.tz_localize('America/New_York')
print("DataFrame with New York-aware datetime index:")
df_tz_ny
DataFrame with New York-aware datetime index:
                           customers  revenue
date
2024-09-01 00:00:00-04:00         50   120000
2024-09-15 00:00:00-04:00         75   180000
2024-10-01 00:00:00-04:00        100   230000
2024-10-15 00:00:00-04:00        200   450000
2024-11-01 00:00:00-04:00        235   450000
2024-11-14 00:00:00-05:00        134   250000
2024-12-01 00:00:00-05:00        150   200000
2024-12-15 00:00:00-05:00        250   500000
After making the index timezone-aware, you can convert it to a different timezone using the DataFrame.tz_convert() method, like this:
# Convert the timezone to Shanghai
df_tz_shanghai = df_tz_ny.tz_convert('Asia/Shanghai')
print("\nDataFrame with datetime index converted to Shanghai timezone:")
print(df_tz_shanghai)
Shifting
Shifting and lagging operations in Pandas are useful for time series analysis, allowing you to move data points forward or backward in time. The method for this is DataFrame.shift(), which is available on all Pandas objects. For instance:
df.shift(3, freq='D')
            customers  revenue
date
2024-09-04         50   120000
2024-09-18         75   180000
2024-10-04        100   230000
2024-10-18        200   450000
2024-11-04        235   450000
2024-11-17        134   250000
2024-12-04        150   200000
2024-12-18        250   500000
df.shift(-3, freq='W')
            customers  revenue
date
2024-08-11         50   120000
2024-08-25         75   180000
2024-09-15        100   230000
2024-09-29        200   450000
2024-10-13        235   450000
2024-10-27        134   250000
2024-11-10        150   200000
2024-11-24        250   500000
The following is a list of commonly used Pandas offset aliases:
Alias     Description
B         Business day
C         Custom business day (requires calendar)
D         Calendar day
W         Weekly, defaults to Sunday
ME        Month end
SME       Semi-month end (15th and end of month)
BME       Business month end
CBME      Custom business month end
MS        Month start
BMS       Business month start
QE        Quarter end (default fiscal year end)
QS        Quarter start
YE        Year end (defaults to December)
YS        Year start
h         Hourly
min       Minutely
s         Secondly
ms        Milliseconds
us        Microseconds
ns        Nanoseconds
Resampling
In Pandas, resampling is the process of changing the frequency of your time-series data by aggregating or interpolating it to a new frequency. Resampling is commonly used to move from:
High frequency to low frequency (e.g., daily data to monthly data). This operation is known as downsampling.
Low frequency to high frequency (e.g., yearly data to daily data). This operation is known as upsampling.
Resampling is used, for instance, to summarize data at different time periods (e.g., monthly sales from daily data), to handle irregular time series by converting them to a regular frequency, or to detect trends and seasonality by looking at data at broader intervals.
Resampling operations are carried out in Pandas with the DataFrame.resample() method. This method is similar to DataFrame.groupby(): you call resample() to group the data, then call an aggregation function:
Downsampling
Downsampling in Pandas refers to the process of reducing the frequency of a time-series data. This involves aggregating data points to a lower frequency (e.g., from daily data to monthly or yearly data). You typically need to specify an aggregation function (e.g., sum(), mean(), count()) to summarize the data within the new frequency interval.
The following example performs a downsampling operation on the dataframe df: first it changes the initial biweekly frequency to a monthly one, then it applies a sum aggregation.
df.resample('ME').sum()
            customers  revenue
date
2024-09-30        125   300000
2024-10-31        300   680000
2024-11-30        369   700000
2024-12-31        400   700000
The following example is similar to the previous one, but this time the frequency is changed to quarterly.
df.resample('QE').sum()
            customers  revenue
date
2024-09-30        125   300000
2024-12-31       1069  2080000
Using Generative AI for coding purposes
Don’t forget that you can use your favorite GenAI-based assistant to list built-in aggregation functions for resampled data. You can submit the following prompt:
“Built in methods available with pandas resample”
Upsampling
Upsampling refers to increasing the frequency of time-series data. This means converting data from a lower frequency (e.g., monthly) to a higher frequency (e.g., daily).
The following example performs an upsampling operation on the dataframe df. We have increased the frequency by creating weekly entries containing NaN values.
df.resample('W').asfreq()
            customers   revenue
date
2024-09-01       50.0  120000.0
2024-09-08        NaN       NaN
2024-09-15       75.0  180000.0
2024-09-22        NaN       NaN
2024-09-29        NaN       NaN
2024-10-06        NaN       NaN
2024-10-13        NaN       NaN
2024-10-20        NaN       NaN
2024-10-27        NaN       NaN
2024-11-03        NaN       NaN
2024-11-10        NaN       NaN
2024-11-17        NaN       NaN
2024-11-24        NaN       NaN
2024-12-01      150.0  200000.0
2024-12-08        NaN       NaN
2024-12-15      250.0  500000.0
Since higher-frequency intervals introduce new time points, upsampling typically involves filling or interpolating missing values. You can fill the newly created time intervals using several methods; refer to the following table.
Method          Description
ffill()         Forward-fill: propagates the last valid value
bfill()         Backward-fill: fills with the next valid value
interpolate()   Interpolates values between known points
fillna()        Fills missing data with a constant or method
pad()           Alias for ffill()
nearest()       Fills with the nearest valid value
asfreq()        Sets the frequency without filling, leaving NaNs in gaps
For the previous upsampling example we could use interpolate() to estimate new values as follows:
df.resample('W').interpolate()
             customers        revenue
date
2024-09-01   50.000000  120000.000000
2024-09-08   62.500000  150000.000000
2024-09-15   75.000000  180000.000000
2024-09-22   81.818182  181818.181818
2024-09-29   88.636364  183636.363636
2024-10-06   95.454545  185454.545455
2024-10-13  102.272727  187272.727273
2024-10-20  109.090909  189090.909091
2024-10-27  115.909091  190909.090909
2024-11-03  122.727273  192727.272727
2024-11-10  129.545455  194545.454545
2024-11-17  136.363636  196363.636364
2024-11-24  143.181818  198181.818182
2024-12-01  150.000000  200000.000000
2024-12-08  200.000000  350000.000000
2024-12-15  250.000000  500000.000000
Or, alternatively, use ffill() to propagate the last known value:
df.resample('W').ffill()
            customers  revenue
date
2024-09-01         50   120000
2024-09-08         50   120000
2024-09-15         75   180000
2024-09-22         75   180000
2024-09-29         75   180000
2024-10-06        100   230000
2024-10-13        100   230000
2024-10-20        200   450000
2024-10-27        200   450000
2024-11-03        235   450000
2024-11-10        235   450000
2024-11-17        134   250000
2024-11-24        134   250000
2024-12-01        150   200000
2024-12-08        150   200000
2024-12-15        250   500000
The following code illustrates how to apply different methods to each column:
# Resample to weekly frequency
df_resampled = df.resample('W').asfreq()

# Interpolate 'revenue' column and forward-fill 'customers' column
df_resampled['revenue'] = df_resampled['revenue'].interpolate(method='linear')
df_resampled['customers'] = df_resampled['customers'].ffill()

print("\nResampled DataFrame with Different Fill Strategies:")
print(df_resampled)
16.2.2 Windowing
Windowing operations in Pandas allow you to perform calculations on rolling or expanding windows of a dataset. These operations are particularly useful in time series analysis for smoothing data, computing moving statistics (such as a moving average), or identifying trends.
Rolling window operations, DataFrame.rolling(), apply calculations over a fixed-size moving window, such as a 7-day moving average, and are useful for smoothing data or detecting short-term trends. Expanding windows, DataFrame.expanding(), compute cumulative statistics from the beginning of the data to the current point, which helps in tracking cumulative sums or averages over time. Exponentially weighted windows, DataFrame.ewm(), assign more weight to recent data points, giving more importance to recent observations while diminishing the influence of older ones. This is particularly helpful when analyzing time series data with a recency bias, as in exponential moving averages.
Rolling
The DataFrame.rolling() method creates a fixed-size moving window over the data. You can then apply an aggregation function to each window.
For instance, the following code computes the mean over a window of size two.
df.rolling(window=2).mean()
            customers   revenue
date
2024-09-01        NaN       NaN
2024-09-15       62.5  150000.0
2024-10-01       87.5  205000.0
2024-10-15      150.0  340000.0
2024-11-01      217.5  450000.0
2024-11-14      184.5  350000.0
2024-12-01      142.0  225000.0
2024-12-15      200.0  350000.0
It is possible to define your own aggregation function, for instance to compute the harmonic mean over a window of size four:
import numpy as np

# Define a custom function for harmonic mean
def harmonic_mean(series):
    n = len(series)
    return n / (1 / series).sum()

df.rolling(window=4).apply(harmonic_mean, raw=True)
             customers        revenue
date
2024-09-01         NaN            NaN
2024-09-15         NaN            NaN
2024-10-01         NaN            NaN
2024-10-15   82.758621  195513.577332
2024-11-01  122.742111  278787.878788
2024-11-14  149.711773  312688.821752
2024-12-01  171.052215  297520.661157
2024-12-15  178.693703  302521.008403
Expanding
The DataFrame.expanding() method calculates statistics cumulatively from the beginning of the dataset up to each point.
The following example tracks the maximum value from the beginning of the series to each point.
df.expanding().max()
            customers   revenue
date
2024-09-01       50.0  120000.0
2024-09-15       75.0  180000.0
2024-10-01      100.0  230000.0
2024-10-15      200.0  450000.0
2024-11-01      235.0  450000.0
2024-11-14      235.0  450000.0
2024-12-01      235.0  450000.0
2024-12-15      250.0  500000.0
16.2.2.1 Exponentially Weighted
The DataFrame.ewm() method in Pandas allows you to apply exponentially weighted moving statistics. This is useful when you want to give more weight to recent observations while reducing the influence of older values. The following example calculates the exponentially weighted mean over the series.
df.ewm(span=3).mean()
             customers        revenue
date
2024-09-01   50.000000  120000.000000
2024-09-15   66.666667  160000.000000
2024-10-01   85.714286  200000.000000
2024-10-15  146.666667  333333.333333
2024-11-01  192.258065  393548.387097
2024-11-14  162.666667  320634.920635
2024-12-01  156.283465  259842.519685
2024-12-15  203.325490  380392.156863
16.3 Time Series in Practice
First Example (Retail I)
Let’s have a look at the following Jupyter Notebook. This notebook illustrates how to process real transactions from a retailer and compute some basic key performance indicators (KPIs).
Second Example (HealthCare)
Let’s have a look at the following Jupyter Notebook. This notebook illustrates how to process medical records and develop basic visualizations over time series data.
Third Example (Retail II)
Let’s have a look at the following Jupyter Notebook. This notebook illustrates how to process real transactions from a retailer and perform advanced operations over time series data.
Using Generative AI for coding purposes
This example performs an advanced time series operation: multi-column grouping and aggregation.
You can submit the following prompt to your favourite assistant for more information:
groupby a time series and a column in pandas
16.4 Conclusion
A time series represents data points ordered by time, often collected at consistent intervals like daily or monthly. Key characteristics include temporal order, trends, seasonality, and cyclical patterns. Time series data is widely used in various fields for forecasting, trend and seasonality analysis, anomaly detection, and event impact assessment. Applications range from finance, where it aids in stock and interest rate forecasting, to healthcare, sports, and manufacturing, where it informs decision-making by analyzing historical patterns and future predictions.
Python’s datetime module enables robust handling of time and date, including operations on datetime, date, timedelta, and timezone-aware objects. Pandas enhances time series processing with features like datetime conversion, indexing, sorting, filtering, and advanced operations like resampling (aggregating or interpolating data), shifting, and windowing (rolling, expanding, or exponentially weighted statistics). Techniques like downsampling summarize data at broader intervals, while upsampling fills gaps in higher-frequency data using methods like forward-fill or interpolation.
16.5 Reading
For those in need of additional, more advanced topics, please refer to the following references: