Linear Regression With Bootstrapping

For data scientists and machine learning engineers, bootstrapping is an important tool for sampling data. It helps us consider what underlies the variation of numbers and of distributions. We use resampling when we have a small amount of data, as it allows us to see how much variation there would have been.

One easy and effective way to estimate the sampling distribution of a statistic, or of model parameters, is to draw additional samples, with replacement, from the sample itself and recalculate the statistic or model for each resample. This procedure is called the bootstrap, and it does not necessarily involve any assumptions about the data or the sample statistic being normally distributed.

Article Overview:

  1. Mean, trimmed mean, and outliers.
  2. The bootstrap method.
  3. Bootstrapping regression, with a worked example.

Trimmed mean

A variation of the mean is the trimmed mean, which you calculate by dropping a fixed number of sorted values at each end and then taking the average of the remaining values. Representing the sorted values by x_(1), x_(2), ..., x_(n), where x_(1) is the smallest value and x_(n) the largest, the formula for the trimmed mean with the p smallest and the p largest values omitted is:

$$\bar{x} = \frac{1}{n - 2p} \sum_{i=p+1}^{n-p} x_{(i)}$$

The trimmed mean is robust to outliers, since the extreme values at both ends are dropped before averaging.
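As a quick sanity check, here is a minimal sketch in Python (the values are made up for illustration):

import numpy as np
from scipy import stats

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # one large outlier
p = 2  # drop the 2 smallest and the 2 largest values

print(np.mean(values))                              # 14.5, pulled up by the outlier
print(np.mean(np.sort(values)[p:len(values) - p]))  # 5.5, the trimmed mean by hand
print(stats.trim_mean(values, 0.2))                 # 5.5, the same via scipy (20% cut from each end)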

Bootstrap Method

The bootstrap method is a statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples.

Importantly, samples are constructed by drawing observations from a large data sample one at a time and returning them to the data sample after they have been chosen. This allows a given observation to be included in a given small sample more than once. This approach to sampling is called sampling with replacement.

The process for building one sample can be summarized as follows (a code sketch follows the list):

  1. Choose the size of the sample.
  2. While the size of the sample is less than the chosen size:
     - Randomly select an observation from the dataset.
     - Add it to the sample.
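In code, building one such sample might look like the following minimal sketch (the function name is illustrative, not from the article):

import random

def one_bootstrap_sample(dataset, sample_size):
    sample = []
    # keep drawing until the sample reaches the chosen size
    while len(sample) < sample_size:
        # each draw leaves the dataset unchanged, so this is sampling with replacement
        sample.append(random.choice(dataset))
    return sample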

The bootstrap method can be used to estimate a population quantity. This is done by repeatedly taking small samples, calculating the statistic on each, and averaging the calculated statistics. We can summarize this procedure as follows (see the sketch after the list):

  1. Choose the number of bootstrap samples to perform.
  2. Choose a sample size.
  3. For each bootstrap sample:
     - Draw a sample with replacement of the chosen size.
     - Calculate the statistic on the sample.
  4. Calculate the mean of the calculated sample statistics.
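Putting the whole procedure together, a minimal sketch might look like this (function and parameter names are illustrative):

import numpy as np

def bootstrap_estimate(dataset, statistic=np.mean, n_boots=100):
    # draw n_boots resamples, compute the statistic on each, then average
    estimates = []
    for _ in range(n_boots):
        resample = np.random.choice(dataset, size=len(dataset), replace=True)
        estimates.append(statistic(resample))
    return np.mean(estimates)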

The procedure can also be used to estimate the skill of a machine learning model.
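One common way to do this (my own sketch, not the article's code) is to fit the model on each bootstrap sample and score it on the rows left out of that sample, the out-of-bag observations:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def bootstrap_skill(X, y, n_boots=100):
    # estimate model skill as the average out-of-bag score across bootstrap fits
    n = len(X)
    scores = []
    for _ in range(n_boots):
        idx = np.random.randint(0, n, size=n)   # bootstrap row indices (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)   # rows never drawn this round
        if oob.size == 0:
            continue
        model = LinearRegression().fit(X[idx], y[idx])
        scores.append(r2_score(y[oob], model.predict(X[oob])))
    return np.mean(scores)

On average, roughly 63% of the rows appear in each bootstrap sample, so about a third of the data is available as out-of-bag test data on every iteration.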

The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.


Bootstrapping Regression with worked example

In this series I look at applying bootstrapping techniques to linear regression in two ways, drawing on my own experience:

  1. Parametric bootstrapping (this article).
  2. Non-parametric bootstrapping (next article).

Data

I treat the data sample we have as the only representation of the population available. Then, to get more datasets from it, we resample the data with replacement.

From the population of drivers who crashed in one month, I consider a sample of 30 drivers and record the age of each. Their ages are simulated below:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.utils import resample
%matplotlib inline

# ages of the 30 drivers
driver_age = [24, 40, 27, 33, 31, 33, 35, 32, 29, 34, 39, 40, 41, 36, 34, 35, 29, 30, 35, 98, 24, 40, 27, 65, 71, 33, 54, 32, 29, 87]

# the expected age could be the average
driver_avg = np.mean(driver_age)
print('Average of driver age guesses: {} years old'.format(driver_avg))
driver_std = np.std(driver_age)
print('Std_Dev of driver age guesses: {0:.2f} years'.format(driver_std))
print(len(driver_age))

Average of driver age guesses: 36.06 years old
Std_Dev of driver age guesses: 17.50 years

[Figure: plot of the simulated driver ages]

Considering the ages, most of them fall below the average. Visually inspecting the data, we see that this is because of a few large outliers. While we might consider dropping these outliers to get a better estimate, we can also use bootstrap resampling to get more data that approaches the parent distribution. Here we will repeatedly sample with replacement to get a set of subsamples, each with the same number of data points as the original.

Sampling with replacement

To run bootstrap sampling, I used the replace=True flag in np.random.choice.

Notes:

- The size of each bootstrap sample should be equal to the size of the original sample.

- 100 samples are drawn from the original sample.

n_sets = 100
n_samples = len(driver_age)

def generate_samples(dataset, n):
    return list(np.random.choice(dataset, size=n, replace=True))

boot_samples = [generate_samples(driver_age, n_samples) for _ in range(n_sets)]
print('Here are the top 2 samples generated:')
print('{}, …'.format(boot_samples[0:2]))

Here are the top 2 samples generated:
[[32, 33, 40, 36, 33, 46, 29, 29, 36, 40, 36, 39, 49, 33, 31, 35, 36, 27, 24, 39, 31, 24, 24, 34, 37, 34, 39, 40, 29, 39, 34, 46, 37], [30, 27, 39, 46, 34, 34, 35, 41, 24, 34, 46, 41, 34, 27, 35, 35, 35, 24, 47, 35, 40, 35, 37, 35, 33, 29, 32, 33, 32, 29, 49, 40, 80]], …

Now I calculate the mean and standard deviation of each sample and then output the average of the 100 subsampled means and the average of the 100 subsampled standard deviations.

sample_means = [np.mean(x) for x in boot_samples]
sample_stdev = [np.std(x) for x in boot_samples]

# take the average of all the means
set_mean = np.mean(sample_means)
# average of all the std_devs
set_stdev = np.mean(sample_stdev)

print('Average of the sample averages: {0:.2f}'.format(set_mean))
print('Average of the sample st. devs: {0:.2f}'.format(set_stdev))

Average of the sample averages: 35.84
Average of the sample st. devs: 8.94

So far, bootstrapping has barely changed the mean (36 to 35.84). The sample averages of the sets are very similar, which we expect, and the standard deviation has changed (17.50 to 8.94), but we have not yet done much with the 100 subsamples we created.

Trimmed mean

The lack of improvement in the average is due to the outlier data, so I use the trimmed mean, which is robust to outliers.

# the trimmed mean of each sample, dropping the smallest and largest 10%
trimmed_means = [stats.trim_mean(x, 0.1) for x in boot_samples]
# average of all the trimmed means
trimmed_mean_avg = np.mean(trimmed_means)
print('Average of the sample averages: {:.2f}'.format(trimmed_mean_avg))

Average of the sample averages: 34.76

Histogram of the means

[Figure: histogram of the bootstrap sample means]

As you can see, the trimmed mean significantly improves the bootstrap estimate on the resampled data, and the effect of the outliers has been removed (it slices off the leftmost and rightmost 10% of the sorted values).

Bootstrapping Regression

The bootstrap method can be applied to regression models. Bootstrapping a regression model gives insight into how variable the model parameters are. It is useful to know how much random variation there is in the regression coefficients simply because of small changes in data values. As with most statistics, it is possible to bootstrap almost any regression model. However, since bootstrap resampling uses a large number of subsamples, it can be computationally intensive.

Let’s first fit a simple regression model to some synthetic sample data:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as sm

# synthetic sample data
n_points = 50
x = np.linspace(0, 20, n_points)
y = x + (np.random.rand(len(x)) * 5)
data_df = pd.DataFrame({'x': x, 'y': y})

# fit an OLS model
ols_model = sm.ols(formula='y ~ x', data=data_df)
results = ols_model.fit()
# coefficients
print('Intercept, x-Slope : {}'.format(results.params))

# predict from the already-fitted model (no need to refit)
y_pred = results.predict(data_df)

# plot results
plt.scatter(x, y)
plt.plot(x, y_pred, linewidth=2)
plt.grid(True)
plt.xlabel('x')
plt.ylabel('y')
plt.title('x vs y')
plt.show()

[Figure: scatter of x vs y with the fitted OLS line]

Resample bootstrap

The bootstrap approach asks: if we resample the data with replacement and re-estimate the coefficients each time, how much do the estimates vary?

Here is a simple loop of 100 trials, which resamples with replacement the 50 observations from our sample dataset, runs the regression model, and saves the resulting coefficients. In the end, we have 100 pairs of coefficients.

# resample the rows with replacement
boot_slopes = []
boot_interc = []
n_boots = 100

plt.figure()
for _ in range(n_boots):
    # sample the rows, same size, with replacement
    sample_df = data_df.sample(n=n_points, replace=True)
    # fit a linear regression
    ols_model_temp = sm.ols(formula='y ~ x', data=sample_df)
    results_temp = ols_model_temp.fit()
    # save the coefficients
    boot_interc.append(results_temp.params['Intercept'])
    boot_slopes.append(results_temp.params['x'])
    # plot a greyed-out line for this resample
    y_pred_temp = results_temp.predict(sample_df)
    plt.plot(sample_df['x'], y_pred_temp, color='grey', alpha=0.2)

# add the data points and the original fit
plt.scatter(x, y)
plt.plot(x, y_pred, linewidth=2)
plt.grid(True)
plt.xlabel('x')
plt.ylabel('y')
plt.title('x vs y')
plt.show()

[Figure: 100 grey bootstrap regression lines plotted over the data and the original fit]

Distribution of the slope and intercept coefficients, with the 5th and 95th percentiles (a 90% confidence interval) marked in red.

import seaborn as sns  # note: newer seaborn versions replace distplot with histplot/displot

sns.distplot(boot_slopes)
plt.axvline(np.percentile(boot_slopes, 5), color='red', linewidth=2)
plt.axvline(np.percentile(boot_slopes, 95), color='red', linewidth=2)

[Figure: distribution of the bootstrap slopes with the 5th and 95th percentiles marked]

sns.distplot(boot_interc)
plt.axvline(np.percentile(boot_interc, 5), color='red', linewidth=2)
plt.axvline(np.percentile(boot_interc, 95), color='red', linewidth=2)

[Figure: distribution of the bootstrap intercepts with the 5th and 95th percentiles marked]

np.mean(boot_slopes)
# 1.02512633638174
np.mean(boot_interc)
# 3.4724101847830746
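To report these intervals numerically rather than only as red lines on the plots, we can take the same percentiles directly (a small addition beyond the article's code):

slope_lo, slope_hi = np.percentile(boot_slopes, [5, 95])
interc_lo, interc_hi = np.percentile(boot_interc, [5, 95])
print('90% CI for slope:     [{:.3f}, {:.3f}]'.format(slope_lo, slope_hi))
print('90% CI for intercept: [{:.3f}, {:.3f}]'.format(interc_lo, interc_hi))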

The data above paints a pretty picture from bootstrap resampling. However, if we had sparse data, which is more likely in practice, there is a chance that our random selection of points falls entirely in one area and not in another; recall that an outlier can be sampled several times despite being a single outlying point.

Conclusion

In this article, I explored the bootstrap approach for estimating regression coefficients. I used a simple regression model for simplicity and a clear illustration of this powerful technique. The bootstrap estimates come out essentially equal to those of the OLS model, but without relying on its distributional assumptions. The bootstrap is a powerful method for estimating the uncertainty of the coefficients and can be used alongside traditional methods to check the stability of a model.

"What you learn with pleasure, you will never forget."
