16  Time Series

A time series is a sequence of data points indexed in time order, typically a set of observations recorded sequentially at consistent time intervals. In its most common form, a time series consists of a series of time-stamped data points, each reflecting the value of a variable at a specific time.

Characteristics of Time Series Data:

Time series data is characterized by several key features:

  1. Temporal order: the data points are ordered in time, so the sequence matters.
  2. Regular intervals: time series data is usually recorded at fixed intervals, such as hourly, daily, or monthly.
  3. Trends: over time, a series may exhibit long-term movements in the data.
  4. Seasonality: some datasets show periodic fluctuations, like sales spikes during holidays.
  5. Cyclical patterns: fluctuations that occur over longer, non-fixed periods, often tied to economic cycles.

Importance of Time Series in Data Science:

Forecasting: Time series analysis is crucial for predicting future values based on previously observed values. This is widely applied in economics, stock market analysis, weather forecasting, and resource consumption prediction.

Identifying Trends and Seasonality: Understanding trends (long-term movements in a data series) and seasonality (regular variations per time period) helps in strategic decision-making. Businesses can adjust their operations, inventory, and marketing campaigns based on these insights.

Monitoring and Anomaly Detection: Time series data are often used to monitor the performance of systems and detect anomalies. For example, monitoring server performance metrics over time can help identify when resources are under strain or behaving unexpectedly.

Causal Analysis: Time series can help establish relationships between variables. For example, when assessing economic indicators’ effects, time series analysis can help determine how changes in one area (e.g., interest rates) affect another (e.g., unemployment rates).

Event Analysis: Time series analysis allows data scientists to assess the impact of particular events (like marketing campaigns or policy changes) on variables of interest over time.

Financial Analysis: In finance, time series analysis is crucial for modeling and forecasting stock prices, interest rates, and other financial metrics. Advanced methods are used for modeling volatility and trends in financial time series.

Sports Analytics: Time series can be used to analyze the performance of players or teams over time, providing insights for training, strategies, and fan engagement.

Manufacturing and Operations: Time series data are used to analyze production trends, equipment performance, and supply chain dynamics, which can lead to optimized operations and reduced costs.

16.1 Time processing using native Python libraries

The datetime module in Python is a library designed to handle date and time manipulation. It provides classes to manipulate time-related data: date, time, datetime, timedelta, and timezone.

Date and datetime Objects

date and datetime types are the standard approach to handle time-related information in Python. The following code creates a date object and accesses its elements.

from datetime import date

# Create a date object
d = date(2024, 10, 9)
print(f"I have created a new date: {d}")  
print(f"Its elements are year:{d.year}, month:{d.month} and day:{d.day}")
I have created a new date: 2024-10-09
Its elements are year:2024, month:10 and day:9

The following code creates a datetime object and accesses its elements.

from datetime import datetime

# Create a datetime object
dt = datetime(2024, 10, 9, 14, 30, 45)

print(f"I have created a new datetime: {dt}")  
print(f"Its elements are year:{dt.year}, month:{dt.month}, day:{dt.day}, hour:{dt.hour}, minute:{dt.minute} and second:{dt.second}")
I have created a new datetime: 2024-10-09 14:30:45
Its elements are year:2024, month:10, day:9, hour:14, minute:30 and second:45

Python allows multiple alternatives to represent time-related information, as shown in the following example:

format1 = dt.strftime("%Y-%m-%d %H:%M:%S")
print(format1)  

format2 = dt.strftime("%d-%m-%Y %H:%M:%S")
print(format2)  


format3 = dt.strftime("%Y-%m-%a %H:%M")
print(format3)  
2024-10-09 14:30:45
09-10-2024 14:30:45
2024-10-Wed 14:30

The following table provides a description of some of the available format codes you can use to display datetime objects.

Format Code Description Example Output
%Y Four-digit year 2024
%y Two-digit year 24
%m Month as a zero-padded decimal 01-12
%B Full month name October
%b Abbreviated month name Oct
%d Day of the month as zero-padded 01-31
%A Full weekday name Wednesday
%a Abbreviated weekday name Wed
%H Hour (24-hour clock) 00-23
%I Hour (12-hour clock) 01-12
%M Minute 00-59
%S Second 00-59
%f Microsecond 000000-999999
%p AM/PM indicator AM/PM
%z UTC offset +0100
%Z Timezone name UTC, EST, CST, etc.

To convert strings to datetime objects in Python, you can use the datetime.strptime() method from the datetime module. This method parses a string representing a date and/or time according to a specified format and returns a datetime object.

from datetime import datetime

# Define the date string and the format
date_string = "2024-10-09 14:30:45"
date_format = "%Y-%m-%d %H:%M:%S"

# Convert the string to a datetime object
datetime_object = datetime.strptime(date_string, date_format)

print(datetime_object)  # Output: 2024-10-09 14:30:45
print(type(datetime_object))  # Output: <class 'datetime.datetime'>
2024-10-09 14:30:45
<class 'datetime.datetime'>
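As a quick check of the format codes in the table above, the following sketch (the string and its layout are illustrative) parses a date written with a 12-hour clock and an AM/PM indicator:

```python
from datetime import datetime

# Parse a European-style string using %d, %m, %Y, %I and %p from the table
date_string = "09/10/2024 02:30 PM"
dt = datetime.strptime(date_string, "%d/%m/%Y %I:%M %p")
print(dt)  # 2024-10-09 14:30:00
```

Note how %I together with %p resolves to hour 14 on the 24-hour clock.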

Timedelta Objects

A timedelta object in Python represents a duration, i.e., the difference between two dates or times. It is part of the datetime module and is useful for performing date and time arithmetic.

The following code creates a timedelta object by specifying the duration in days, seconds, microseconds, milliseconds, minutes, hours, or weeks. All arguments are optional and default to 0.

from datetime import timedelta

# Creating a timedelta object
delta = timedelta(days=5, hours=3, minutes=30)
print(delta)  # Output: 5 days, 3:30:00
5 days, 3:30:00

timedelta objects are useful to perform time difference calculations, for instance to compute the time lapse between two events:

from datetime import datetime, timedelta

# Event start and end times
start_time = datetime(2024, 10, 9, 9, 0)
end_time = datetime(2024, 10, 9, 17, 30)

# Duration of the event
duration = end_time - start_time
print(duration)
print(type(duration))
8:30:00
<class 'datetime.timedelta'>
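timedelta objects can also be added to or subtracted from dates and datetimes to compute new points in time, and a duration can be expressed as a total number of seconds. A minimal sketch (the dates are illustrative):

```python
from datetime import datetime, timedelta

start_time = datetime(2024, 10, 9, 9, 0)

# Adding a timedelta to a datetime yields a new datetime
reminder = start_time + timedelta(weeks=2)
print(reminder)  # 2024-10-23 09:00:00

# A timedelta can be converted to seconds with total_seconds()
duration = timedelta(hours=8, minutes=30)
print(duration.total_seconds())  # 30600.0
```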

Timezone-Aware Objects

Thanks to the zoneinfo module, it is possible to handle timezone-related information directly with datetime. The zoneinfo module provides support for the IANA time zone database, the industry standard for time zone information. This module allows you to use named time zones (e.g. ‘America/New_York’, ‘Europe/London’). It also automatically handles daylight saving time transitions, adjusting times according to the specified time zone.

from datetime import datetime
from zoneinfo import ZoneInfo

# Create a datetime object with a specific time zone

londondatetime = datetime(2024, 10, 9, 14, 30, tzinfo=ZoneInfo('UTC'))
newyorkdatetime = datetime(2024, 10, 9, 14, 30, tzinfo=ZoneInfo('America/New_York'))
print(f"The time in London is: {londondatetime}")
print(f"The time in New York is: {newyorkdatetime}")
The time in London is: 2024-10-09 14:30:00+00:00
The time in New York is: 2024-10-09 14:30:00-04:00

Indeed, it is possible to operate with timezone-aware datetimes as usual:

print(f"the time difference between New York and London is {newyorkdatetime-londondatetime}")
the time difference between New York and London is 4:00:00
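To express an existing timezone-aware datetime in a different time zone, you can use the datetime.astimezone() method. A minimal sketch (Tokyo is chosen purely for illustration):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# A timezone-aware datetime in UTC
londondatetime = datetime(2024, 10, 9, 14, 30, tzinfo=ZoneInfo('UTC'))

# Convert it to Japan Standard Time (UTC+9, no daylight saving)
in_tokyo = londondatetime.astimezone(ZoneInfo('Asia/Tokyo'))
print(in_tokyo)  # 2024-10-09 23:30:00+09:00
```

Both objects represent the same instant; only the wall-clock representation changes.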

16.2 Time Series processing with Pandas

Time series manipulation in Pandas is a common task when working with timestamped data. Pandas has built-in support for datetime objects: you can easily convert columns to datetime format, set them as the index, and perform various operations on them.

import pandas as pd

# Create a sample dataframe with a datetime column
data = {
    'date': ['2024-09-01','2024-09-15','2024-10-01', '2024-10-15', '2024-12-01', '2024-12-15','2024-11-01', '2024-11-14'],
    'customers': [50,75,100, 200, 150, 250, 235, 134],
    'revenue':[120000,180000,230000,450000,200000, 500000,450000,250000]
}

df = pd.DataFrame(data)

# Convert 'date' column to datetime
df['date'] = pd.to_datetime(df['date'])


df
date customers revenue
0 2024-09-01 50 120000
1 2024-09-15 75 180000
2 2024-10-01 100 230000
3 2024-10-15 200 450000
4 2024-12-01 150 200000
5 2024-12-15 250 500000
6 2024-11-01 235 450000
7 2024-11-14 134 250000

16.2.1 Basic operations

When dealing with timestamped data it is advisable to set the datetime column as the index, as this facilitates time-related operations. The following code illustrates how to use the method DataFrame.set_index().

# Set 'date' as index
df.set_index('date', inplace=True)
df
customers revenue
date
2024-09-01 50 120000
2024-09-15 75 180000
2024-10-01 100 230000
2024-10-15 200 450000
2024-12-01 150 200000
2024-12-15 250 500000
2024-11-01 235 450000
2024-11-14 134 250000

Sorting

It is advisable to keep Pandas time series ordered, especially before applying any transformation over them. For this we have the DataFrame.sort_index() method:

df.sort_index(inplace=True)
df
customers revenue
date
2024-09-01 50 120000
2024-09-15 75 180000
2024-10-01 100 230000
2024-10-15 200 450000
2024-11-01 235 450000
2024-11-14 134 250000
2024-12-01 150 200000
2024-12-15 250 500000
df.sort_index(ascending=False)
customers revenue
date
2024-12-15 250 500000
2024-12-01 150 200000
2024-11-14 134 250000
2024-11-01 235 450000
2024-10-15 200 450000
2024-10-01 100 230000
2024-09-15 75 180000
2024-09-01 50 120000

Selection and Filtering

Once the data is sorted and indexed by datetime, it is possible to perform the usual operations such as selection by label, filtering, or slicing, among many others.

df.loc['2024-10-01':'2024-11-03']
customers revenue
date
2024-10-01 100 230000
2024-10-15 200 450000
2024-11-01 235 450000
df.loc['2024-10']
customers revenue
date
2024-10-01 100 230000
2024-10-15 200 450000
df.loc[df.index > '2024-12-03']
customers revenue
date
2024-12-15 250 500000
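Since the index is a DatetimeIndex, you can also filter on its components, such as .month or .year. The following self-contained sketch uses a subset of the chapter's data:

```python
import pandas as pd

# A small, date-indexed sample (subset of the chapter's data)
df_sample = pd.DataFrame(
    {'customers': [50, 75, 100, 200]},
    index=pd.to_datetime(['2024-09-01', '2024-09-15',
                          '2024-10-01', '2024-10-15']),
)

# Keep only the rows whose index falls in October
october = df_sample[df_sample.index.month == 10]
print(october)
```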

Timezone localization

Pandas also allows you to specify the timezone your data is expressed in by way of the DataFrame.tz_localize() method, for instance:

# Localize the datetime index to a specific timezone (e.g., New York)
df_tz_ny = df.tz_localize('America/New_York')

print("DataFrame with New York-aware datetime index:")
df_tz_ny
DataFrame with New York-aware datetime index:
customers revenue
date
2024-09-01 00:00:00-04:00 50 120000
2024-09-15 00:00:00-04:00 75 180000
2024-10-01 00:00:00-04:00 100 230000
2024-10-15 00:00:00-04:00 200 450000
2024-11-01 00:00:00-04:00 235 450000
2024-11-14 00:00:00-05:00 134 250000
2024-12-01 00:00:00-05:00 150 200000
2024-12-15 00:00:00-05:00 250 500000

After making the index timezone-aware, you can convert it to a different timezone using the DataFrame.tz_convert() method, like this:

# Convert the timezone to Shanghai
df_tz_shanghai = df_tz_ny.tz_convert('Asia/Shanghai')

print("\nDataFrame with datetime index converted to Shanghai timezone:")
print(df_tz_shanghai)

DataFrame with datetime index converted to Shanghai timezone:
                           customers  revenue
date                                         
2024-09-01 12:00:00+08:00         50   120000
2024-09-15 12:00:00+08:00         75   180000
2024-10-01 12:00:00+08:00        100   230000
2024-10-15 12:00:00+08:00        200   450000
2024-11-01 12:00:00+08:00        235   450000
2024-11-14 13:00:00+08:00        134   250000
2024-12-01 13:00:00+08:00        150   200000
2024-12-15 13:00:00+08:00        250   500000

16.2.2 Advanced operations

Shifting/Lagging

Shifting and lagging operations in pandas are useful for time series analysis, allowing you to move data points forward or backward in time. The method for this is DataFrame.shift(), which is available on all of the pandas objects. For instance:

df.shift(3, freq='D')
customers revenue
date
2024-09-04 50 120000
2024-09-18 75 180000
2024-10-04 100 230000
2024-10-18 200 450000
2024-11-04 235 450000
2024-11-17 134 250000
2024-12-04 150 200000
2024-12-18 250 500000
df.shift(-3, freq='W')
customers revenue
date
2024-08-11 50 120000
2024-08-25 75 180000
2024-09-15 100 230000
2024-09-29 200 450000
2024-10-13 235 450000
2024-10-27 134 250000
2024-11-10 150 200000
2024-11-24 250 500000
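A common use of shifting is to compare each observation with the previous one, for instance to compute period-over-period growth. A minimal sketch on an illustrative series:

```python
import pandas as pd

s = pd.Series([100, 150, 120],
              index=pd.to_datetime(['2024-01-31', '2024-02-29', '2024-03-31']))

# Period-over-period change: current value vs. the previous (shifted) value
growth = s / s.shift(1) - 1
print(growth)  # NaN, 0.5, -0.2
```

The first entry is NaN because there is no earlier observation to compare against.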

The following table lists some commonly used Pandas offset aliases:

Alias Description
B Business day
C Custom business day (requires calendar)
D Calendar day
W Weekly, defaults to Sunday
ME Month end
SME Semi-month end (15th and end of month)
BME Business month end
CBME Custom business month end
MS Month start
BMS Business month start
QE Quarter end (defaults to December)
QS Quarter start
YE Year end (defaults to December)
YS Year start
h Hourly
min Minutely
s Secondly
ms Milliseconds
us Microseconds
ns Nanoseconds

Note that recent versions of Pandas deprecate the older uppercase aliases (M, Q, A, H, T, S, L, U, N) in favor of the ones listed above.

Resampling

In Pandas, resampling is the process of changing the frequency of your time-series data by aggregating or interpolating it to a new frequency. Resampling is commonly used to move from:

  1. High frequency to low frequency (e.g., daily data to monthly data). This operation is known as Downsampling
  2. Low frequency to high frequency (e.g., yearly data to daily data). This operation is known as Upsampling

Resampling is used, for instance, to summarize data at different time periods (e.g., monthly sales from daily data), to handle irregular time-series data by converting it to a regular frequency, or to detect trends or seasonality by looking at data at broader intervals.

Resampling operations are carried out in Pandas with the DataFrame.resample() method. This method works much like DataFrame.groupby(): you call resample() to group the data, then call an aggregation function.

Downsampling

Downsampling in Pandas refers to the process of reducing the frequency of time-series data. This involves aggregating data points to a lower frequency (e.g., from daily data to monthly or yearly data). You typically need to specify an aggregation function (e.g., sum(), mean(), count()) to summarize the data within the new frequency interval.

The following example performs a downsampling operation on the dataframe df: first it changes the initial biweekly frequency into a monthly one, then it applies a sum function.

df.resample('ME').sum()
customers revenue
date
2024-09-30 125 300000
2024-10-31 300 680000
2024-11-30 369 700000
2024-12-31 400 700000

The following example is similar to the previous one, but in this case we change the frequency to a quarterly one.

df.resample('QE').sum()
customers revenue
date
2024-09-30 125 300000
2024-12-31 1069 2080000
Using Generative AI for coding purposes

Don’t forget that you can use your favorite GenAI-based assistant to list built-in aggregation functions for resampled data. You can submit the following prompt:

“Built in methods available with pandas resample”

Upsampling

Upsampling refers to increasing the frequency of time-series data. This means converting data from a lower frequency (e.g., monthly) to a higher frequency (e.g., daily).

The following example performs an upsampling operation on the dataframe df, increasing the frequency to weekly. The newly created time points contain NaN values; note that with asfreq() the original observations that do not fall exactly on the weekly (Sunday) anchor are dropped.

df.resample('W').asfreq()
customers revenue
date
2024-09-01 50.0 120000.0
2024-09-08 NaN NaN
2024-09-15 75.0 180000.0
2024-09-22 NaN NaN
2024-09-29 NaN NaN
2024-10-06 NaN NaN
2024-10-13 NaN NaN
2024-10-20 NaN NaN
2024-10-27 NaN NaN
2024-11-03 NaN NaN
2024-11-10 NaN NaN
2024-11-17 NaN NaN
2024-11-24 NaN NaN
2024-12-01 150.0 200000.0
2024-12-08 NaN NaN
2024-12-15 250.0 500000.0

Since higher frequency intervals require filling new time points, upsampling typically involves filling or interpolating missing values. You can fill the newly created time intervals using several methods; refer to the following table.

Method Description
ffill() Forward-fill: propagates the last valid value
bfill() Backward-fill: fills with the next valid value
interpolate() Interpolates values between known points
fillna() Fill missing data with a constant or method
pad() Alias for ffill()
nearest() Fill with the nearest valid value
asfreq() Set frequency without filling, leaving NaNs in gaps

For the previous upsampling example we could use interpolate() to estimate new values as follows:

df.resample('W').interpolate()
customers revenue
date
2024-09-01 50.000000 120000.000000
2024-09-08 62.500000 150000.000000
2024-09-15 75.000000 180000.000000
2024-09-22 81.818182 181818.181818
2024-09-29 88.636364 183636.363636
2024-10-06 95.454545 185454.545455
2024-10-13 102.272727 187272.727273
2024-10-20 109.090909 189090.909091
2024-10-27 115.909091 190909.090909
2024-11-03 122.727273 192727.272727
2024-11-10 129.545455 194545.454545
2024-11-17 136.363636 196363.636364
2024-11-24 143.181818 198181.818182
2024-12-01 150.000000 200000.000000
2024-12-08 200.000000 350000.000000
2024-12-15 250.000000 500000.000000

Or, alternatively, ffill() to propagate the last known value:

df.resample('W').ffill()
customers revenue
date
2024-09-01 50 120000
2024-09-08 50 120000
2024-09-15 75 180000
2024-09-22 75 180000
2024-09-29 75 180000
2024-10-06 100 230000
2024-10-13 100 230000
2024-10-20 200 450000
2024-10-27 200 450000
2024-11-03 235 450000
2024-11-10 235 450000
2024-11-17 134 250000
2024-11-24 134 250000
2024-12-01 150 200000
2024-12-08 150 200000
2024-12-15 250 500000

The following code illustrates how to apply different methods to each column:

# Resample to weekly frequency
df_resampled = df.resample('W').asfreq()

# Interpolate 'revenue' column and forward-fill 'customers' column
df_resampled['revenue'] = df_resampled['revenue'].interpolate(method='linear')
df_resampled['customers'] = df_resampled['customers'].ffill()

print("\nResampled DataFrame with Different Fill Strategies:")
print(df_resampled)

Resampled DataFrame with Different Fill Strategies:
            customers        revenue
date                                
2024-09-01       50.0  120000.000000
2024-09-08       50.0  150000.000000
2024-09-15       75.0  180000.000000
2024-09-22       75.0  181818.181818
2024-09-29       75.0  183636.363636
2024-10-06       75.0  185454.545455
2024-10-13       75.0  187272.727273
2024-10-20       75.0  189090.909091
2024-10-27       75.0  190909.090909
2024-11-03       75.0  192727.272727
2024-11-10       75.0  194545.454545
2024-11-17       75.0  196363.636364
2024-11-24       75.0  198181.818182
2024-12-01      150.0  200000.000000
2024-12-08      150.0  350000.000000
2024-12-15      250.0  500000.000000

Windowing

Windowing operations in Pandas allow you to perform calculations on rolling windows or expanding windows of a dataset. These operations are particularly useful in time-series analysis to smooth data, compute moving statistics (like moving average), or identify trends.

Rolling window operations, DataFrame.rolling(), apply calculations over a fixed-size moving window, such as a 7-day moving average, and are useful for smoothing data or detecting short-term trends. Expanding windows, DataFrame.expanding(), compute cumulative statistics from the beginning of the data to the current point, which helps in tracking cumulative sums or averages over time. Exponentially weighted windows, DataFrame.ewm(), assign more weight to recent data points, giving more importance to recent observations while diminishing the influence of older ones. This is particularly helpful when analyzing time-series data with a recency bias, like exponential moving averages.

Rolling

The DataFrame.rolling() method creates a fixed-size moving window over the data. You can then apply an aggregation function to each window.

For instance, the following code computes the mean over a window of size two.

df.rolling(window=2).mean()
customers revenue
date
2024-09-01 NaN NaN
2024-09-15 62.5 150000.0
2024-10-01 87.5 205000.0
2024-10-15 150.0 340000.0
2024-11-01 217.5 450000.0
2024-11-14 184.5 350000.0
2024-12-01 142.0 225000.0
2024-12-15 200.0 350000.0

It is possible to define your own aggregation function, for instance to compute the harmonic mean over a window of size four:

import numpy as np


# Define a custom function for harmonic mean
def harmonic_mean(series):
    n = len(series)
    return n / (1 / series).sum()


df.rolling(window=4).apply(harmonic_mean, raw=True)
customers revenue
date
2024-09-01 NaN NaN
2024-09-15 NaN NaN
2024-10-01 NaN NaN
2024-10-15 82.758621 195513.577332
2024-11-01 122.742111 278787.878788
2024-11-14 149.711773 312688.821752
2024-12-01 171.052215 297520.661157
2024-12-15 178.693703 302521.008403

Expanding

The DataFrame.expanding() method calculates statistics cumulatively from the beginning of the dataset up to each point.

The following example tracks the maximum value from the beginning of the series to each point.

df.expanding().max()
customers revenue
date
2024-09-01 50.0 120000.0
2024-09-15 75.0 180000.0
2024-10-01 100.0 230000.0
2024-10-15 200.0 450000.0
2024-11-01 235.0 450000.0
2024-11-14 235.0 450000.0
2024-12-01 235.0 450000.0
2024-12-15 250.0 500000.0

Exponentially Weighted

The DataFrame.ewm() method in Pandas allows you to apply exponentially weighted moving statistics. This method is useful when you want to give more weight to recent observations while reducing the influence of older values. The following example calculates the exponentially weighted mean over the series.

df.ewm(span=3).mean()
customers revenue
date
2024-09-01 50.000000 120000.000000
2024-09-15 66.666667 160000.000000
2024-10-01 85.714286 200000.000000
2024-10-15 146.666667 333333.333333
2024-11-01 192.258065 393548.387097
2024-11-14 162.666667 320634.920635
2024-12-01 156.283465 259842.519685
2024-12-15 203.325490 380392.156863

16.3 Time Series in Practice

First Example (Retail I)

Let’s have a look at the following Jupyter Notebook. This notebook illustrates how to process real transactions from a retailer and compute some basic key performance indicators (KPIs).

Second Example (HealthCare)

Let’s have a look at the following Jupyter Notebook. This notebook illustrates how to process medical records and develop basic visualizations over time series data.

Third Example (Retail II)

Let’s have a look at the following Jupyter Notebook. This notebook illustrates how to process real transactions from a retailer and perform advanced operations over time series data.

Using Generative AI for coding purposes

This example performs an advanced time series operation: multi-column grouping and aggregation.

You can submit the following prompt to your favourite assistant for more information:

“groupby a time series and a column in pandas”

16.4 Conclusion

A time series represents data points ordered by time, often collected at consistent intervals like daily or monthly. Key characteristics include temporal order, trends, seasonality, and cyclical patterns. Time series data is widely used in various fields for forecasting, trend and seasonality analysis, anomaly detection, and event impact assessment. Applications range from finance, where it aids in stock and interest rate forecasting, to healthcare, sports, and manufacturing, where it informs decision-making by analyzing historical patterns and future predictions.

Python’s datetime module enables robust handling of time and date, including operations on datetime, date, timedelta, and timezone-aware objects. Pandas enhances time series processing with features like datetime conversion, indexing, sorting, filtering, and advanced operations like resampling (aggregating or interpolating data), shifting, and windowing (rolling, expanding, or exponentially weighted statistics). Techniques like downsampling summarize data at broader intervals, while upsampling fills gaps in higher-frequency data using methods like forward-fill or interpolation.

16.5 Reading

For those of you in need of additional, more advanced topics, please refer to the following references: