Time series forecasting is a fundamental task in data science, applied statistics, and econometrics: given a time series dataset, we aim to predict its future values.
A time series dataset consists of variables recorded at regular time intervals and has two main attributes:
- Trend: the long-term average direction of a variable (upward or downward).
- Seasonality: fluctuation that recurs at specific time periods (e.g., US retail sales spiking during the Christmas season).
Sometimes, time series have seasonal patterns with multiple periods, a property known as complex seasonality. This can be a significant problem for many forecasting models.
Additionally, some time series datasets lack a discernible trend and/or seasonality, so models that assume both are present tend to perform poorly on them. The TBATS forecasting model helps solve this problem: it was created to deal with complex seasonality and offers state-of-the-art accuracy for such datasets.
TBATS stands for:
- Trigonometric seasonality
- Box-Cox transformation
- ARMA errors
- Trend
- Seasonal components
The TBATS model was introduced by Alysha De Livera, Rob J Hyndman, and Ralph Snyder. Hyndman, a statistician and leading authority in time series forecasting, has published numerous papers and books on the topic.
TBATS applies the Box-Cox transformation and then models the time series as a combination of an exponentially smoothed trend, seasonal components, and an autoregressive moving average (ARMA) error component. TBATS also tunes its hyperparameters based on the Akaike information criterion (AIC), discarding any suboptimal components from the final model.
The rest of this article will delve deeper into time series forecasting with TBATS in the Python scientific ecosystem. Specifically, we will present a practical example of complex seasonality forecasting using the sktime library and the PJM electricity load dataset.
Python TBATS key concepts
Before we begin, here’s a summary of the key terms and concepts behind TBATS in Python.
The sktime Python Library
In recent years, time series forecasting in the Python scientific ecosystem has matured significantly. Numerous libraries have emerged, offering forecasting models that range from statistical approaches to machine learning and deep learning.
One of those libraries is sktime, an open-source framework that supports various time series tasks, including forecasting, classification, regression, and clustering. The sktime developers mostly focus on machine learning but also support statistical models, including autoregressive integrated moving average (ARIMA), exponential smoothing, and Theta.
This diverse approach has established sktime as one of the best libraries for time series tasks, and it can be extremely useful to data scientists and Python developers!
The PJM Electricity Load Dataset
PJM is a regional transmission organization (RTO) in the United States that belongs to the Eastern Interconnection grid, serving numerous states, including Kentucky, Maryland, Michigan, New Jersey, North Carolina, Ohio, and many others.
In this article, we are going to use the PJM electricity load dataset, which Kaggle provides for free. This dataset has complex seasonality that is manifested in multiple time periods, making it ideal for a case study of TBATS in Python. Electricity load time series typically have complex seasonality, because energy demand fluctuates depending on the day, weather, and other seasonal factors.
Creating a TBATS Model with Python
In this section of the article, we will examine a complete case study of time series forecasting with the TBATS model. This will help us better understand TBATS workflows and practical applications.
You can execute the sample code in a Jupyter notebook or a Python IDE like PyCharm, Spyder, or Visual Studio Code. We also recommend creating an environment for the project using Anaconda or another similar tool. For our examples, we use the following library versions:
- pandas 1.4.4
- matplotlib 3.5.3
- statsmodels 0.14.0 (installed from the main GitHub repository)
- sktime 0.13.2
- tbats 1.1.0
Importing the Necessary Python Libraries
We begin by importing some standard libraries of the Python scientific ecosystem, namely pandas and Matplotlib. Afterwards, we import various classes and functions from the statsmodels and sktime libraries that will be useful for the analysis and forecasting sections.
Statsmodels is an established statistical library, including numerous functions that are helpful for time series analysis. We will utilize this library to apply time series decomposition and plot the autocorrelation function of the dataset.
Loading the PJM Dataset
After importing the Python libraries, we load the PJM dataset into a pandas dataframe and apply some basic preprocessing. More specifically, we sort the index and interpolate any missing values using the sort_index() and interpolate() pandas functions.
Next, we slice the pandas dataframe to keep the values of October 2001 and discard the rest. Finally, we use the info() function to display some basic information about the resulting dataframe. As we can see, the time series has 744 values and covers the month of October 2001 with an hourly frequency.
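The loading and preprocessing steps can be sketched as follows. The Kaggle CSV filename is a placeholder, and to keep the sketch self-contained and runnable we build a synthetic hourly stand-in for the PJM series instead of reading the file:

```python
import numpy as np
import pandas as pd

# In the real workflow, the Kaggle CSV would be loaded with something like:
#   df = pd.read_csv("PJM_Load_hourly.csv", index_col=0, parse_dates=True)
# (the filename is a placeholder). Here we build a synthetic stand-in
# so the example runs without the dataset.
index = pd.date_range("2001-10-01", periods=744, freq="H")
df = pd.DataFrame(
    {"load": 30000 + 5000 * np.sin(np.arange(744) * 2 * np.pi / 24)},
    index=index,
)
df.iloc[10, 0] = np.nan  # simulate a missing reading

# Basic preprocessing: order the index and fill any gaps
df = df.sort_index().interpolate()

# Keep only October 2001 (744 hourly values) and inspect the result
df = df.loc["2001-10"]
df.info()
```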
Applying Multiple Seasonal Trend Decomposition
Decomposition is a standard technique in time series analysis. It extracts the trend and seasonal components of a dataset. There are multiple decomposition techniques, including moving averages and the STL method.
Here we use the multiple seasonal trend decomposition with LOESS (MSTL) method. This technique is suitable for time series with complex seasonality. Furthermore, the MSTL() statsmodels function lets us specify seasonal periods by using the periods parameter. We can also refine the seasonal components by changing the iterate parameter. As seen in the decomposition plot, the series has seasonality on multiple periods, including daily and weekly.
Plotting the Autocorrelation Function
The autocorrelation function (ACF) is used in time series analysis to identify patterns by calculating the correlation between lagged values, i.e., values one or more time steps apart. In this case, we create an ACF plot for 170 lags, enough to cover an entire week of hourly data (168 hours).
As we can see, the autocorrelation values peak at 24 hours, indicating daily seasonality. After that, the pattern repeats daily, with autocorrelation values steadily decreasing. At the 168th lag, the autocorrelation peaks again, indicating the time series has weekly seasonality.
Creating a Utility Metrics Function
The sktime library provides numerous kinds of metrics that help us evaluate the accuracy of forecasting models, including mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE).
Here, we create the print_metrics() function, which calculates various metrics and returns a pandas dataframe with the results. This will help us easily evaluate the forecasting models by printing the metrics in a Jupyter notebook cell. Here’s a brief description of each included metric:
- MAE: the average absolute difference between forecasts and actual values, expressed in the units of the data.
- RMSE: the square root of the average squared error, which penalizes large errors more heavily.
- MAPE: the average absolute error expressed as a percentage of the actual values.
Creating a Baseline Forecasting Model
Setting a baseline accuracy is standard practice in time series forecasting, so we will accomplish that with a naive model. First of all, we use the temporal_train_test_split() function to split the series into a train and test set. Afterwards, we use the NaiveForecaster() class to create a naive model that always predicts the mean value of the train set. Finally, we plot the forecast and display the associated metrics with the print_metrics() function.
Creating the TBATS Forecasting Model
After establishing the baseline accuracy, we move on to create the TBATS forecasting model with the TBATS() sktime class. Furthermore, we specify the daily and weekly seasonal periods by setting the sp parameter. After fitting the model on the train set, we plot the forecast and display the results with the print_metrics() function.
As we can see, the TBATS model identified the seasonal patterns and generated a forecast very similar to the test set. Apart from the plot, the model’s accuracy is also evident in the metrics table, as all of its error values are significantly lower than the naive model’s. This verifies that TBATS modeled our dataset accurately!
Time series forecasting with Python has advanced considerably in recent years, making the language an excellent choice for data scientists and other professionals. This article explored various concepts and tools related to time series and the Python scientific ecosystem in general.
We studied a practical example of time series forecasting with the TBATS model. Standard forecasting models like ARIMA might fail to deliver a good result on time series with complex seasonality, so it’s helpful to acquaint ourselves with this specialized tool.
TBATS is ideal for time series datasets with complex seasonality but isn’t a general-purpose forecasting model. You should always conduct a time series analysis to understand the data and then choose the optimal forecasting model. Hopefully, this article provided you with valuable knowledge to help you tackle challenging business problems!