Chapter 4:

Python residual sum of squares: Tutorial & Examples

December 23, 2022

Residual sum of squares (RSS) is a statistical measure of the variation in a dependent variable that a regression model does not explain. It is the sum of the squared distances between a regression model’s predictions and the ground truth values. Therefore, all else being equal, the higher the RSS, the worse the model fits the data.

Calculating RSS with the Python programming language is useful for applications where validating a model’s predictive capabilities is essential. Financial analysis and financial modeling are typical examples.

There are multiple ways to implement RSS using Python. This article will explain how to implement RSS in pure Python and use the statsmodels API on a real-world dataset. We will also summarize the different ways to implement RSS in Python, and the best practices for utilizing RSS in a real-world project.

Python residual sum of squares key concepts

Below is a summary of key RSS concepts that will help you better understand the following sections.

RSS in regression analysis and time series analysis

The residual sum of squares is a key metric that gives a numerical representation of how well your regression model fits the data. Specifically, RSS indicates how closely the model’s predictions, made from the independent variables, match the dependent variable. It applies to long-term forecasting as well as to forecasting with limited data (aiCasting).

Residual sum of squares is widely used in time series analysis, especially in finance and FinTech. RSS helps firms assess and iterate on their financial models to gain a competitive advantage and improve the quality of their predictions.

For example, investors and financial analysts who track an asset’s price over time and try to predict its future value can use RSS to gauge how accurate a regression model’s predictions are.

RSS can also be used in other time-series analysis applications, like sales forecasting, predicting real estate prices over time, and modeling biological signals such as electrical activity in the brain.

How is RSS calculated?

The term “residual” is just another word for “error”. Error is the distance between two data points, usually the predicted and ground truth values. As the name suggests, RSS is calculated by summing the squared residuals. The formula is as follows:

$$\mathrm{RSS} = \sum_{i} (y_i - \hat{y}_i)^2$$

Where yi denotes the ith ground truth value, and ŷi denotes the ith predicted value.

The relationship between RSS, SSR, and SST

There is often some confusion between the sum of squares regression (SSR), the sum of squares total (SST), and the residual sum of squares. While the names are similar, they each play a different role and are commonly used in regression analysis.

SST is the sum of squared differences between each value of the dependent variable (target variable) and its mean. SST provides insight into the overall variance of the target variable.

SSR is the sum of squared differences between the predictions and the mean of the dependent variable. It measures how far the predictions spread around the center of the dependent variable, that is, the variation the model explains.

The relationship between the three metrics is as follows:

SST = SSR + RSS

In words, the total variability of the data equals the variability explained by the regression line plus the unexplained variability in the dataset (noise). This decomposition holds exactly for an ordinary least squares fit that includes an intercept.
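The identity can be checked numerically. The sketch below is only an illustration: it assumes an ordinary least squares line with an intercept, uses made-up data, and fits the line with NumPy's `polyfit`, none of which comes from the original examples.

```python
import numpy as np

# Made-up illustrative data (an assumption, not from the article)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Fit a straight line by ordinary least squares (slope and intercept)
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = np.sum((y_pred - y.mean()) ** 2)  # sum of squares regression (explained)
rss = np.sum((y - y_pred) ** 2)         # residual sum of squares (unexplained)

print(f"SST       = {sst:.4f}")
print(f"SSR + RSS = {ssr + rss:.4f}")
```

The two printed values should agree up to floating-point rounding. For models fitted without an intercept, or by a method other than least squares, the equality is not guaranteed.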

How to calculate RSS in Python

In the sections below, we’ll provide sample code to help you get started with Python residual sum of squares calculations. 

Calculating RSS in pure Python
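A minimal sketch of the calculation using only built-in Python follows. The function name `residual_sum_of_squares` and the sample values are assumptions chosen for illustration.

```python
def residual_sum_of_squares(y_true, y_pred):
    """Return the sum of squared differences between observed and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))


# Made-up values for illustration
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]

rss = residual_sum_of_squares(y_true, y_pred)
print(rss)  # 0.25 + 0.25 + 0.0 + 1.0 = 1.5
```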

Calculating RSS from a linear regression model
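One way to do this, sketched here with NumPy rather than any particular modeling library, is to fit the line with `np.linalg.lstsq` and then sum the squared residuals. The data is made up for illustration.

```python
import numpy as np

# Made-up data: y is roughly 2x + 1 plus some noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 2.9, 5.1, 7.2, 8.8, 11.1])

# Design matrix with a column of ones so the model has an intercept
X = np.column_stack([np.ones_like(x), x])

# Least squares fit; lstsq also returns the sum of squared residuals
coef, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# Compute RSS explicitly from the model's predictions
y_pred = X @ coef
rss = float(np.sum((y - y_pred) ** 2))

print(f"intercept={intercept:.3f}, slope={slope:.3f}, RSS={rss:.4f}")
```

For a full-rank problem with more rows than columns, the `residuals` value returned by `lstsq` is the same quantity, so it can serve as a cross-check on the manual calculation.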

Calculating RSS using the statsmodels API

The output when using statsmodels includes a general summary of the OLS model, but what’s relevant to this article is the RSS score in the last line.


Python residual sum of squares in practice

For this example, we’ll apply a regression model using statsmodels on the Swedish auto insurance dataset. This small and simple dataset is found here.

The dataset has one dependent variable and one independent variable. The independent variable is the number of insurance claims. The dependent variable is the total payment for all claims, in thousands of Swedish kronor.

First, import the dataset and clean it up.

Raw dataset

Dataset after cleaning

Next, plot the dataset with Plotly express.

A scatterplot of the dataset showing individual data points

Next, fit the regression model and calculate the RSS using statsmodels. For simplicity, we skip the train/test split here, but you should always evaluate your model on a test set and ideally a validation set as well.

Note that here the RSS is 78796.74, much larger than in our previous example. Because the residuals are squared, RSS is heavily influenced by the magnitude of the variables.
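The effect of scale can be seen with a small sketch. The values below are made up. Multiplying both the targets and the predictions by 1,000 (for example, switching from thousands of a currency to single units) multiplies the RSS by 1,000 squared, even though the quality of the fit is unchanged.

```python
def rss(y_true, y_pred):
    """Sum of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


# Made-up values, imagined as being expressed in thousands
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 1.5, 3.5]

scale = 1000  # convert thousands to single units
rss_small = rss(y_true, y_pred)
rss_large = rss([v * scale for v in y_true], [v * scale for v in y_pred])

print(rss_small)              # 0.75
print(rss_large)              # 750000.0
print(rss_large / rss_small)  # 1000000.0, i.e. scale ** 2
```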

If you test only one model, its RSS may not tell you directly how well the model performs, in contrast to metrics such as the Mean Absolute Error (MAE), which is expressed in the units of the target. RSS is most useful for comparison: among candidate regression lines fitted to the same data, the one with the lowest RSS is the best fit.

Next, plot the regression line and data using matplotlib.

Plot of the dataset and resulting regression line

Tips for optimizing RSS calculations

Our basic code is a practical way to get started with Python RSS. However, you’ll need to do more data processing to achieve optimal performance. Required processing includes cleaning the dataset of any null values, duplicates, infinite values, and other unusable entries.
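As a rough sketch of these cleaning steps with pandas, consider the example below. The DataFrame contents and the column names `claims` and `payment` are hypothetical and chosen only to show each kind of unusable entry.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data containing a null, an infinite value, and a duplicate row
raw = pd.DataFrame({
    "claims":  [10.0, 20.0, 30.0, 30.0, np.nan, 40.0, 50.0],
    "payment": [55.0, 98.5, 150.2, 150.2, 20.0, np.inf, 260.1],
})

clean = (
    raw.replace([np.inf, -np.inf], np.nan)  # treat infinite values as missing
       .dropna()                            # drop rows with any missing value
       .drop_duplicates()                   # drop exact duplicate rows
       .reset_index(drop=True)
)

print(clean)
```

Whether duplicates should be dropped depends on the dataset: two identical rows can be legitimate repeated observations, so this step is a judgment call rather than a rule.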

Next, perform hyperparameter optimization using grid search with cross-validation, which iterates over a grid of hyperparameters, tests each combination, and keeps the highest-performing set of parameters.
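The idea can be sketched by hand with NumPy. In this illustration, the only "hyperparameter" is the degree of a polynomial fit, the data is synthetic, and the score is the mean held-out RSS over five folds. All of these choices are assumptions made for the example; libraries such as scikit-learn provide ready-made tools for the same task.

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 40)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)

# Shuffle once so every candidate is scored on the same folds
indices = rng.permutation(x.size)
folds = np.array_split(indices, 5)


def cross_validated_rss(degree):
    """Mean held-out RSS across the folds for a polynomial of this degree."""
    scores = []
    for fold in folds:
        train = np.setdiff1d(indices, fold)  # every index not in this fold
        coefs = np.polyfit(x[train], y[train], deg=degree)
        preds = np.polyval(coefs, x[fold])
        scores.append(np.sum((y[fold] - preds) ** 2))
    return float(np.mean(scores))


# Grid of candidate hyperparameter values
grid = [1, 2, 3]
results = {degree: cross_validated_rss(degree) for degree in grid}
best_degree = min(results, key=results.get)

print(results)
print("Best degree:", best_degree)
```

Scoring on held-out folds matters here because RSS on the training data never increases as the polynomial degree grows, so training RSS alone would always favor the most complex model.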

Additionally, it is best practice to evaluate your regression model with multiple metrics, such as MAE, Mean Squared Error (MSE), and R² score. Each metric gives you an idea of the model’s performance from a different point of view.
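As an illustration, these metrics can all be derived from the same residuals. The sketch below uses plain Python and made-up values; the function name `regression_metrics` is an assumption.

```python
def regression_metrics(y_true, y_pred):
    """Compute RSS, MAE, MSE, and R² from observed and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    rss = sum(r ** 2 for r in residuals)           # residual sum of squares
    sst = sum((t - mean_y) ** 2 for t in y_true)   # total sum of squares
    return {
        "rss": rss,
        "mae": sum(abs(r) for r in residuals) / n, # mean absolute error
        "mse": rss / n,                            # mean squared error
        "r2": 1 - rss / sst,                       # share of variance explained
    }


# Made-up values for illustration
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MSE is simply RSS divided by the number of observations, and R² rescales RSS by SST, which is why both are easier to compare across datasets of different sizes and scales than raw RSS.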

Conclusion 

We went over what RSS is, why and where it is used, how to calculate RSS in Python, and the relationship between RSS and other similar metrics.

While the conceptual knowledge you build writing code from scratch is valuable, it is usually better to leverage existing frameworks. Popular data analysis frameworks are often optimized to run calculations faster, handle exceptions, and have well-written documentation that can help you hit the ground running in real-world projects. 
