Chapter 1:

Vector Autoregression with Python

December 19, 2022
10 min to read

Time series forecasting entails predicting the future values of a time series based on historical values.  Time series can be of two types:

  • Univariate consists of a single dimension of data changing over time 
  • Multivariate data exists in multiple dimensions

Vector auto-regression (VAR) is used for modeling and forecasting multivariate time series data. The well-known auto-regression (AR) model is a special case of VAR for univariate data.  Each of the multivariate data dimensions is modeled as a linear combination of its past values and the historical values of other dimensions. This combination captures the interaction between the components of the series.

Consider the case of forecasting a large commercial building’s energy consumption with the goals of cost reduction and improving environmental sustainability. Forecasts can be made by considering just the historical energy consumption values, leading to a univariate model.  On the other hand, we could consider a multivariate time series consisting of values like energy consumption, number of occupants, weather, and workload. 

Intuitively, we know these variables interact with each other. For example, weather impacts the number of occupants on a given day, which then affects energy consumption.  The weather may also directly influence energy consumption.  A VAR model for forecasting energy consumption will account for all these interactions simply.

VAR should be used when the following conditions are satisfied.

  1. The target of the forecast depends on several other time-varying variables, leading to a multivariate time series.
  2. The interactions between the target and the other variables are simple and linear.
  3. The time series satisfies the stationarity requirement (which we’ll review later in this article).  

In this article, we will explore the topic of vector autoregression with Python, including modeling with VAR, model validation, and best practices to help improve your models.  

Python vector autoregression key concepts

Below is an overview of the key Python vector autoregression concepts we will review in this article.

Understanding vector autoregression

Before we dive into a more hands-on approach, let’s quickly cover some theory that will help you understand the background of VAR.

The univariate case

Auto-regression is a fundamental principle used in time series forecasting: it models the value of a time series at time t, denoted by x (t), as a linear combination of the past values.  In the univariate case, this can be expressed as follows:

{"mathml":"<math style=\"font-family:stix;font-size:16px;\" xmlns=\"\"><mstyle mathsize=\"16px\"><mi>x</mi><mfenced><mi>t</mi></mfenced><mo> </mo><mo>=</mo><mo> </mo><msub><mi>α</mi><mn>0</mn></msub><mo> </mo><mo>+</mo><mo> </mo><msub><mi>α</mi><mn>1</mn></msub><mi>x</mi><mfenced><mrow><mi>t</mi><mo>-</mo><mn>1</mn></mrow></mfenced><mo> </mo><mo>+</mo><mo> </mo><msub><mi>α</mi><mn>2</mn></msub><mi>x</mi><mfenced><mrow><mi>t</mi><mo>-</mo><mn>2</mn></mrow></mfenced><mo> </mo><mo>+</mo><mo> </mo><mo>.</mo><mo>.</mo><mo>.</mo><mo> </mo><msub><mi>α</mi><mi>p</mi></msub><mi>x</mi><mfenced><mrow><mi>t</mi><mo>-</mo><mi>p</mi></mrow></mfenced><mo> </mo></mstyle></math>","truncated":false}

The first term is a constant that is independent of time.  The remaining terms are coefficients, multiplied by a lagged value of the time series: αp is multiplied by the p-th lag x(t — p) .

Multivariate Vector Auto-Regression

In the multivariate case, we are concerned with multiple variables.  Let’s start with two variables x(t) and y(t). The first task is to identify the target variable.  Let’s say that we are interested in forecasting x(t) . The multivariate extension models the value of x(t) as a linear combination of past values of x(t) as well as  y(t) .

{"mathml":"<math style=\"font-family:stix;font-size:16px;\" xmlns=\"\"><mstyle mathsize=\"16px\"><mi>x</mi><mfenced><mi>t</mi></mfenced><mo> </mo><mo>=</mo><mo> </mo><msub><mi>α</mi><mn>0</mn></msub><mo> </mo><mo>+</mo><mo> </mo><msubsup><mi>α</mi><mn>1</mn><mi>x</mi></msubsup><mi>x</mi><mfenced><mrow><mi>t</mi><mo>-</mo><mn>1</mn></mrow></mfenced><mo> </mo><mo>+</mo><mo> </mo><msubsup><mi>α</mi><mn>2</mn><mi>x</mi></msubsup><mi>x</mi><mfenced><mrow><mi>t</mi><mo>-</mo><mn>2</mn></mrow></mfenced><mo> </mo><mo>+</mo><mo> </mo><mo>.</mo><mo>.</mo><mo>.</mo><mo> </mo><msubsup><mi>α</mi><msub><mi>p</mi><mrow><mi>x</mi><mo> </mo></mrow></msub><mi>x</mi></msubsup><mi>x</mi><mfenced><mrow><mi>t</mi><mo>-</mo><msub><mi>p</mi><mi>x</mi></msub></mrow></mfenced><mo> </mo><mo>+</mo><mo> </mo><msubsup><mi>α</mi><mn>1</mn><mi>y</mi></msubsup><mi>y</mi><mfenced><mrow><mi>t</mi><mo>-</mo><mn>1</mn></mrow></mfenced><mo> </mo><mo>+</mo><mo> </mo><msubsup><mi>α</mi><mn>2</mn><mi>y</mi></msubsup><mi>y</mi><mfenced><mrow><mi>t</mi><mo>-</mo><mn>2</mn></mrow></mfenced><mo> </mo><mo>+</mo><mo> </mo><mo>.</mo><mo>.</mo><mo>.</mo><mo> </mo><msubsup><mi>α</mi><msub><mi>p</mi><mrow><mi>y</mi><mo> </mo></mrow></msub><mi>y</mi></msubsup><mi>y</mi><mfenced><mrow><mi>t</mi><mo>-</mo><msub><mi>p</mi><mi>y</mi></msub></mrow></mfenced><mo> </mo></mstyle></math>","truncated":false}

In the equation above, the first term is the scalar bias. The terms

{"mathml":"<math style=\"font-family:stix;font-size:16px;\" xmlns=\"\"><mstyle mathsize=\"16px\"><msubsup><mi>α</mi><mn>1</mn><mi>x</mi></msubsup><mo>,</mo><msubsup><mi>α</mi><mn>2</mn><mi>x</mi></msubsup><mo> </mo><mo>,</mo><mo> </mo><mo>.</mo><mo>.</mo><mo>.</mo><mo> </mo><msubsup><mi>α</mi><msub><mi>p</mi><mrow><mi>x</mi><mo> </mo></mrow></msub><mi>x</mi></msubsup></mstyle></math>","truncated":false}

are coefficients for the previous values of the target  x(t) . The terms

{"mathml":"<math style=\"font-family:stix;font-size:16px;\" xmlns=\"\"><mstyle mathsize=\"16px\"><msubsup><mi>α</mi><mn>1</mn><mi>y</mi></msubsup><mo>,</mo><mo> </mo><msubsup><mi>α</mi><mn>2</mn><mi>y</mi></msubsup><mo>,</mo><mo> </mo><mo>.</mo><mo>.</mo><mo>.</mo><mo> </mo><msubsup><mi>α</mi><msub><mi>p</mi><mrow><mi>y</mi><mo> </mo></mrow></msub><mi>y</mi></msubsup></mstyle></math>","truncated":false}

are coefficients for the previous values of the other variable y(t) .

The goal of fitting a VAR model is to optimize the values of these coefficients that maximize the model's performance.

Performance metrics

Forecasting models are typically evaluated using several metrics. The most important of these is the Mean Absolute Error (MAE).  The mean absolute error for forecasting n periods is given as follows.

{"mathml":"<math style=\"font-family:stix;font-size:16px;\" xmlns=\"\"><mstyle mathsize=\"16px\"><mi>M</mi><mi>A</mi><mi>E</mi><mo> </mo><mo>=</mo><mo> </mo><mfrac><mn>1</mn><mi>n</mi></mfrac><munderover><mo>∑</mo><mrow><mi>t</mi><mo> </mo><mo>=</mo><mo> </mo><mn>1</mn></mrow><mi>n</mi></munderover><mo>|</mo><mi>x</mi><mfenced><mi>t</mi></mfenced><mo> </mo><mo>-</mo><mo> </mo><mover><mi>x</mi><mo>^</mo></mover><mfenced><mi>t</mi></mfenced><mo>|</mo></mstyle></math>","truncated":false}

Here,  x(t) is the actual value of the target, and x̂(t) is the forecast value. We will use this metric later in our example.

Training VAR models

In this section, we present an overview of a VAR model training process for a given time series dataset.  Each of the steps of the training process will be detailed in subsequent sections.

A VAR training model workflow. 
  1. Preprocessing, which is shown in the yellow boxes above, entails modifying the time series to make it stationary.  The Augmented Dickey-Fuller test is a hypothesis test used to test the stationarity of the time series.
  2. Train-test splitting entails creating a hold-out test dataset to validate the model.
  3. Modeling is the core step of estimating the number of auto and cross lags, the hyperparameters of the model, and then fitting the corresponding parameters of the regression model.
  4. Validation entails testing the learned model on the hold-out set.  This step may need the predictions made by the model to be converted back to the original data space by inverting the preprocessing steps.
  5. Microservice architectures are typically used to serve the learned forecasting models in real-world systems.

Next, we look at each of the above steps in detail.

For our discussion, we will use a toy dataset containing the monthly sales of ice cream and heaters between January 2004 and June 2020.  Our goal is to forecast future heater sales based on historical heater and ice cream sales.

Preprocessing and visualization 

The first step of building time series models is to visualize them and gain an understanding of the data.Fig 1 shows the two time series in the ice cream vs heater dataset plotted against each other.  The following things standout:

Fig 1: Data Visualization for the Ice Cream Vs Heater Sales dataset.
  • Observation 1: There are cyclical sales patterns.  Moreover, the peak sales periods for ice cream correspond to troughs in heater sales and vice versa.  Intuitively, this makes sense.  Ice cream sales tend to be high in summer months and the demand for heaters spikes in winter.
  • Observation 2: The data trends upwards.  In the more recent years, we see that the sales for both heater and ice cream have gone up.
  • Observation 3: The data's variability has also increased over time. The spread between the summer and winter months has become larger.

Stationarity analysis

The observations for the ice cream vs. heater dataset above imply that the properties of the signal change with time, resulting in the time series being non-stationary.  This can be confirmed using the Augmented Dickey Fuller (ADF) hypothesis test.  The ADF tests the null hypothesis that the data has a unit root, which implies non-stationarity.  If the p-value of the test is low, then the null hypothesis can be rejected, and we can consider that the data is stationary. 

The following code snippet shows the procedure.

It results in the following output.

Inducing stationarity if needed by normalizing, detrending, removing seasonality, etc

Stationarity is a prerequisite for VAR modeling to work. In this step, we proceed with preprocessing steps that induce stationarity, thereby making the time series eligible for VAR modeling to work accurately.


The next step is to normalize the data. This can be achieved by subtracting the mean and dividing the standard deviation for each column. Fig. 2 illustrates the effect of normalization.

Fig. 2: Normalized version of the ice cream vs heater dataset.

The next step entails removing the trend in the data by taking the difference and dropping the first result NAN value. This step may need to be repeated multiple times in general. The differenced version is illustrated in Fig. 3.

Fig. 3: Detrended version of Fig. 2.

For educational purposes, we difference the data and visualize it again in Fig. 4 to emphasize that differencing can be repeated multiple times to induce stationarity.

Fig. 4: Version of Fig. 3 that has had data differenced again.

After detrending, we take care of the time-dependent variance in the data.  We see that the variability in the data increases with time.  We can remove this by dividing each data point by the variance of the data for the corresponding year.  The following code snippet describes this process.

The first few rows of the resulting dataframe are shown in Fig. 5 and the data is visualized in Fig. 6.

Fig. 5: Dataframe after removing the standard deviation for the corresponding year.
Fig 6: Removing time-dependent variance from Fig. 3.

The next step is to remove the seasonality in the data.  We follow the same strategy as the previous step of removing time-dependent variance.  Here, the mean value changes as a function of time, specifically, the month of the year.  Thus, we will first accumulate the average values for each of the 12 months and then remove them from the data points.

The dataframe and the data after these steps are shown in Fig. 7 and Fig. 8, respectively.

Fig 7: Dataframe after removing the monthly averages.

Fig 8: Removing seasonality from Fig. 5.

Now that we have performed all the preprocessing steps, we can check if the data has become stationary. The code for checking stationarity remains the same as above, but the new results are shown below.

As we can see, the data has now become stationary and is ready to be modeled using VAR.

Modeling with VAR

After inducing stationarity in our data, we can proceed with modeling. The model will contain auto-regressive terms for both ice cream and heater. Given that we want to forecast the heater sales, we must first find the heater AR terms. We can use a Partial Auto-Correlation Function (PACF) plot to find the AR terms.

Fig 9: PACF plot for preprocessed heater sales.

Fig. 9 shows that the first three terms are significant and outside the confidence intervals denoted by the shaded blue region. Thus it is appropriate to take 3 heater AR terms for modeling/forecasting the heater sales. The first term is a constant and the next two terms correspond to lags 1 and 2.

Next, we find the appropriate number of AR terms for ice cream. We do this by measuring the p-value of the correlation between the heater sales and different lags of ice cream sales. We can achieve this with the following code snippet.

This results in the table of correlation values shown in Fig. 8. The Pearson's correlation value and its p-value are plotted for the various lags in the table. The correlations where the p-value is less than 0.05 are relevant for us. 

Note that this is only the case for lag = 13. Thus, we estimate that the lag of 13 for ice cream sales may be important for predicting the sales of heaters in the VAR model.

We will confirm this by fitting a VAR model to the data in the next step.

Fig 10: Pearson correlation and p-values for different lags of ice cream.

Next, we proceed with fitting the model with the data. We will hold out the last 12 rows as test data and fit a model to the remaining rows. We set the maxlags parameter to 13 based on our analysis above. 

The call to the VAR model will assume the columns of the dataframe to contain the time series. The summary of the model is shown in Fig. 11 and discussed next.

Fig 11: Summary of the model fit on the data.

The following points are important for interpreting the results and forming the final model.

  1. The VAR package will fit a model for heater, as well as ice cream, resulting in two sets of results: “Results for equation ice cream” and “Results for equation heater”. Given that we are interested in predicting the sales of heaters, we will focus on the second set.
  2. For regressing, the model used 13 lags of both the heater and ice cream sales. We will use only the lags with a low p-value, of less than 0.05, for our model.   
  3. For the heater, we see that lags 1 and 2 have a low p-value. This confirms our interpretation of the PACF plot above. Similarly, the lag of 13 for ice cream has a low p-value, which again confirms our interpretation of the correlation table in Fig. 8. Note that the constant value does not have a very low probability and will be left out of the model.
  4. Thus, we are left with 3 main terms: lags 1 and 2 for the heater and lag 13 for ice cream. We take the corresponding coefficients to arrive at the following VAR model, where Ĥ denotes the forecasted value and H and I

{"mathml":"<math style=\"font-family:stix;font-size:16px;\" xmlns=\"\"><mstyle mathsize=\"16px\"><mover><mi>H</mi><mo>^</mo></mover><mfenced><mi>T</mi></mfenced><mo> </mo><mo>=</mo><mo> </mo><mo>-</mo><mn>0</mn><mo>.</mo><mn>5341</mn><mo> </mo><mi>H</mi><mfenced><mrow><mi>T</mi><mo>-</mo><mn>1</mn></mrow></mfenced><mo> </mo><mo>-</mo><mo> </mo><mn>0</mn><mo>.</mo><mn>293</mn><mo> </mo><mi>H</mi><mfenced><mrow><mi>T</mi><mo>-</mo><mn>2</mn></mrow></mfenced><mo> </mo><mo>+</mo><mo> </mo><mn>0</mn><mo>.</mo><mn>196</mn><mo> </mo><mi>I</mi><mfenced><mrow><mi>T</mi><mo>-</mo><mn>13</mn></mrow></mfenced></mstyle></math>","truncated":false}

Validating our model

We apply the model learned above to the last 12 data points held out from training. The results are shown in Fig. 12 and the code to produce it is shown below.

Fig 12: Validation of the VAR forecasting model.

The Mean Absolute Error (MAE) is 0.61. Note that the predictions need to be inverted back to their original space by reversing the preprocessing steps.

Build complex forecasting models
in a fraction of the time
Learn More
Save time by leveraging a portfolio of pre-built connectors to third-party data sources 
Use aiMatch™ to stitch multiple datasets when there’s no common entities or uniform formatting
Built SaaS applications using an intuitive user interface and our library of advanced algorithms
Learn More

Best practices for vector autoregression with Python

The following best practices will help extend the concepts from the previous exercise to other real-world problems.

  • Spend time understanding the time series using visualization, decomposition, and ACF/PACF plots. Developing an intuition for the patterns in data will help in the modeling phase. For example, aspects like trends, seasonality and time-dependent variability are important to spot to make the data stationary.
  • Consider a data-driven approach. In addition to the statistical methods covered in this article for estimating the lag parameters, it is possible to take a more data-driven approach. This entails creating a grid of possible lag values and trying them out. The parameters resulting in the best performance are chosen for forecasting. This approach is better suited for more complex time series, where statistical methods do not scale.
  • Spend time interpreting the results of fitting a VAR model to the data. It is very easy for the model complexity, in terms of the number of lags, to explode with increasing dimensions. Here, it is important to study the summary of the model fitting process and choose the most relevant coefficients to control the complexity of the model. 


We created a VAR model for predicting the heater sales from its past values, as well as the past sales of ice cream. We proceeded by first visualizing the data and identifying the patterns that would contribute to the data not being stationary. We then preprocessed the data and induced stationarity, thereby making the data eligible for VAR modeling. 

The parameters for the model were estimated using statistical analysis, like PACF plots and the p-values of correlation. These estimates were confirmed after fitting the model and interpreting the model fitting summary. The resulting model was then validated on a holdout set. The process shown in this article can be applied to other datasets based on the best practices above.

Subscribe to our LinkedIn Newsletter to receive more educational content
Subscribe now
Subscribe to our Linkedin Newsletter to receive more educational content
Subscribe now