Linear Regression With Bootstrapping

For data scientists and machine learning engineers, bootstrapping is an important tool for sampling data. It helps us consider what underlies the variation of numbers and of distributions. We use resampling when we have a small amount of data, as it allows us to see how much variation there would have been.

One easy and effective way to estimate the sampling distribution of a statistic, or of model parameters, is to draw additional samples, with replacement, from the sample itself and recalculate the statistic or model for each resample. This procedure is called the bootstrap, and it does not necessarily involve any assumptions about the data or the sample statistic being normally distributed.

Article Overview:

  1. Mean, trimmed mean, and outliers.
  2. The bootstrap method.
  3. Bootstrapping regression, with a worked example.

Trimmed mean

A variation of the mean is the trimmed mean, which you calculate by dropping a fixed number of sorted values at each end and then taking the average of the remaining values. Representing the sorted values by x_(1), x_(2), ..., x_(n), where x_(1) is the smallest value and x_(n) the largest, the formula for the trimmed mean with the p smallest and the p largest values omitted is:

$$\bar{x} = \frac{1}{n - 2p} \sum_{i=p+1}^{n-p} x_{(i)}$$

The trimmed mean is robust to outliers, since the extreme values at both ends are dropped before averaging.
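As a quick sanity check, here is a minimal sketch in Python (the values are made up for illustration):

import numpy as np
from scipy import stats

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # one large outlier
p = 2  # drop the 2 smallest and the 2 largest values

print(np.mean(values))                              # 14.5, pulled up by the outlier
print(np.mean(np.sort(values)[p:len(values) - p]))  # 5.5, the trimmed mean by hand
print(stats.trim_mean(values, 0.2))                 # 5.5, the same via scipy (20% cut from each end)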

Bootstrap Method

The bootstrap method is a statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples.

Importantly, samples are constructed by drawing observations from a large data sample one at a time and returning them to the data sample after they have been chosen. This allows a given observation to be included in a given small sample more than once. This approach to sampling is called sampling with replacement.

The process for building one sample can be summarized as follows (a code sketch follows the list):

  1. Choose the size of the sample.
  2. While the size of the sample is less than the chosen size:
     - Randomly select an observation from the dataset.
     - Add it to the sample.
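In code, building one such sample might look like the following minimal sketch (the function name is illustrative, not from the article):

import random

def one_bootstrap_sample(dataset, sample_size):
    sample = []
    # keep drawing until the sample reaches the chosen size
    while len(sample) < sample_size:
        # each draw leaves the dataset unchanged, so this is sampling with replacement
        sample.append(random.choice(dataset))
    return sample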

The bootstrap method can be used to estimate a population quantity. This is done by repeatedly taking small samples, calculating the statistic on each, and averaging the calculated statistics. We can summarize this procedure as follows (see the sketch after the list):

  1. Choose the number of bootstrap samples to perform.
  2. Choose a sample size.
  3. For each bootstrap sample:
     - Draw a sample with replacement of the chosen size.
     - Calculate the statistic on the sample.
  4. Calculate the mean of the calculated sample statistics.
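Putting the whole procedure together, a minimal sketch might look like this (function and parameter names are illustrative):

import numpy as np

def bootstrap_estimate(dataset, statistic=np.mean, n_boots=100):
    # draw n_boots resamples, compute the statistic on each, then average
    estimates = []
    for _ in range(n_boots):
        resample = np.random.choice(dataset, size=len(dataset), replace=True)
        estimates.append(statistic(resample))
    return np.mean(estimates)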

The procedure can also be used to estimate the skill of a machine learning model.
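One common way to do this (my own sketch, not the article's code) is to fit the model on each bootstrap sample and score it on the rows left out of that sample, the out-of-bag observations:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def bootstrap_skill(X, y, n_boots=100):
    # estimate model skill as the average out-of-bag score across bootstrap fits
    n = len(X)
    scores = []
    for _ in range(n_boots):
        idx = np.random.randint(0, n, size=n)   # bootstrap row indices (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)   # rows never drawn this round
        if oob.size == 0:
            continue
        model = LinearRegression().fit(X[idx], y[idx])
        scores.append(r2_score(y[oob], model.predict(X[oob])))
    return np.mean(scores)

On average, roughly 63% of the rows appear in each bootstrap sample, so about a third of the data is available as out-of-bag test data on every iteration.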

The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.


Bootstrapping Regression with worked example

In this series I look at applying bootstrapping techniques to linear regression in two ways, drawing on my own experience:

  1. Parametric bootstrapping (this article).
  2. Non-parametric bootstrapping (next article).

Data

I treat the data sample we have as the only representation of the population available. Then, to get more datasets from it, we resample the data with replacement.

From the population of drivers who crashed in one month, I consider a sample of 30 drivers and record the age of each. Their ages are simulated below:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.utils import resample
%matplotlib inline

# ages of the 30 drivers
driver_age = [24, 40, 27, 33, 31, 33, 35, 32, 29, 34, 39, 40, 41, 36, 34, 35, 29, 30, 35, 98, 24, 40, 27, 65, 71, 33, 54, 32, 29, 87]

# the expected age could be the average
driver_avg = np.mean(driver_age)
print('Average of driver age guesses: {} years old'.format(driver_avg))
driver_std = np.std(driver_age)
print('Std_Dev of driver age guesses: {0:.2f} years'.format(driver_std))
print(len(driver_age))

Average of driver age guesses: 36.06 years old
Std_Dev of driver age guesses: 17.50 years

[Figure: plot of the simulated driver ages]

Considering the ages, most of them fall below the average. Visually inspecting the data, we see that this is because of a few large outliers. While we might consider dropping these outliers to get a better estimate, we can also use bootstrap resampling to get more data that approaches the parent distribution. Here we will repeatedly sample with replacement to get a set of subsamples, each with the same number of data points as the original.

Sampling with replacement

To run bootstrap sampling, I used the replace=True flag in np.random.choice.

Notes:

- The size of each bootstrap sample should be equal to the size of the original sample.

- 100 samples are drawn from the original sample.

n_sets = 100
n_samples = len(driver_age)

def generate_samples(dataset, n):
    return list(np.random.choice(dataset, size=n, replace=True))

boot_samples = [generate_samples(driver_age, n_samples) for _ in range(n_sets)]
print('Here are the top 2 samples generated:')
print('{}, …'.format(boot_samples[0:2]))

Here are the top 2 samples generated:
[[32, 33, 40, 36, 33, 46, 29, 29, 36, 40, 36, 39, 49, 33, 31, 35, 36, 27, 24, 39, 31, 24, 24, 34, 37, 34, 39, 40, 29, 39, 34, 46, 37], [30, 27, 39, 46, 34, 34, 35, 41, 24, 34, 46, 41, 34, 27, 35, 35, 35, 24, 47, 35, 40, 35, 37, 35, 33, 29, 32, 33, 32, 29, 49, 40, 80]], …

Now I calculate the mean and standard deviation of each sample and then output the average of the 100 subsampled means and the average of the 100 subsampled standard deviations.

sample_means = [np.mean(x) for x in boot_samples]
sample_stdev = [np.std(x) for x in boot_samples]

# take the average of all the means
set_mean = np.mean(sample_means)
# average of all the std_devs
set_stdev = np.mean(sample_stdev)

print('Average of the sample averages: {0:.2f}'.format(set_mean))
print('Average of the sample st. devs: {0:.2f}'.format(set_stdev))

Average of the sample averages: 35.84
Average of the sample st. devs: 8.94

So far, bootstrapping has barely changed the mean (36 to 35.84). The sample averages of the sets are very similar, which we expect, and the standard deviation has changed (17.50 to 8.94), but we have not yet done much with the 100 subsamples we created.

Trimmed mean

The lack of improvement in the average is due to the outlier data, so I use the trimmed mean, which is robust to outliers.

# the trimmed mean of each sample, dropping the smallest and largest 10%
trimmed_means = [stats.trim_mean(x, 0.1) for x in boot_samples]
# average of all the trimmed means
trimmed_mean_avg = np.mean(trimmed_means)
print('Average of the sample averages: {:.2f}'.format(trimmed_mean_avg))

Average of the sample averages: 34.76

Histogram of the means

[Figure: histogram of the bootstrap sample means]

As you can see, the trimmed mean significantly improves the bootstrap estimate on the resampled data, and the effect of the outliers has been removed (it slices off the leftmost and rightmost 10% of the sorted values).

Bootstrapping Regression

The bootstrap method can be applied to regression models. Bootstrapping a regression model gives insight into how variable the model parameters are. It is useful to know how much random variation there is in the regression coefficients simply because of small changes in data values. As with most statistics, it is possible to bootstrap almost any regression model. However, since bootstrap resampling uses a large number of subsamples, it can be computationally intensive.

Let’s first fit a simple regression model to some synthetic sample data:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as sm

# synthetic sample data
n_points = 50
x = np.linspace(0, 20, n_points)
y = x + (np.random.rand(len(x)) * 5)
data_df = pd.DataFrame({'x': x, 'y': y})

# fit an OLS model
ols_model = sm.ols(formula='y ~ x', data=data_df)
results = ols_model.fit()
# coefficients
print('Intercept, x-Slope : {}'.format(results.params))

# predict from the already-fitted model (no need to refit)
y_pred = results.predict(data_df)

# plot results
plt.scatter(x, y)
plt.plot(x, y_pred, linewidth=2)
plt.grid(True)
plt.xlabel('x')
plt.ylabel('y')
plt.title('x vs y')
plt.show()

[Figure: scatter of x vs y with the fitted OLS line]

Resample bootstrap

The bootstrap approach asks: if we resample the data with replacement and re-estimate the coefficients each time, how much do the estimates vary?

Here is a simple loop of 100 trials, which resamples with replacement the 50 observations from our sample dataset, runs the regression model, and saves the resulting coefficients. In the end, we have 100 pairs of coefficients.

# resample the rows with replacement
boot_slopes = []
boot_interc = []
n_boots = 100

plt.figure()
for _ in range(n_boots):
    # sample the rows, same size, with replacement
    sample_df = data_df.sample(n=n_points, replace=True)
    # fit a linear regression
    ols_model_temp = sm.ols(formula='y ~ x', data=sample_df)
    results_temp = ols_model_temp.fit()
    # save the coefficients
    boot_interc.append(results_temp.params['Intercept'])
    boot_slopes.append(results_temp.params['x'])
    # plot a greyed-out line for this resample
    y_pred_temp = results_temp.predict(sample_df)
    plt.plot(sample_df['x'], y_pred_temp, color='grey', alpha=0.2)

# add the data points and the original fit
plt.scatter(x, y)
plt.plot(x, y_pred, linewidth=2)
plt.grid(True)
plt.xlabel('x')
plt.ylabel('y')
plt.title('x vs y')
plt.show()

[Figure: 100 grey bootstrap regression lines plotted over the data and the original fit]

Distribution of the slope and intercept coefficients, with the 5th and 95th percentiles (a 90% confidence interval) marked in red.

import seaborn as sns  # note: newer seaborn versions replace distplot with histplot/displot

sns.distplot(boot_slopes)
plt.axvline(np.percentile(boot_slopes, 5), color='red', linewidth=2)
plt.axvline(np.percentile(boot_slopes, 95), color='red', linewidth=2)

[Figure: distribution of the bootstrap slopes with the 5th and 95th percentiles marked]

sns.distplot(boot_interc)
plt.axvline(np.percentile(boot_interc, 5), color='red', linewidth=2)
plt.axvline(np.percentile(boot_interc, 95), color='red', linewidth=2)

[Figure: distribution of the bootstrap intercepts with the 5th and 95th percentiles marked]

np.mean(boot_slopes)
# 1.02512633638174
np.mean(boot_interc)
# 3.4724101847830746
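To report these intervals numerically rather than only as red lines on the plots, we can take the same percentiles directly (a small addition beyond the article's code):

slope_lo, slope_hi = np.percentile(boot_slopes, [5, 95])
interc_lo, interc_hi = np.percentile(boot_interc, [5, 95])
print('90% CI for slope:     [{:.3f}, {:.3f}]'.format(slope_lo, slope_hi))
print('90% CI for intercept: [{:.3f}, {:.3f}]'.format(interc_lo, interc_hi))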

The data above paints a pretty picture from bootstrap resampling. However, if we had sparse data, which is more likely in practice, there is a chance that our random selection of points falls entirely in one area and not in another; recall that an outlier can be sampled several times despite being a single outlying point.

Conclusion

In this article, I explored the bootstrap approach for estimating regression coefficients. I used a simple regression model for simplicity and a clear illustration of this powerful technique. The bootstrap estimates come out essentially equal to those of the OLS model, but without relying on its distributional assumptions. The bootstrap is a powerful method for estimating the uncertainty of the coefficients and can be used alongside traditional methods to check the stability of a model.

"What you learn with pleasure, you will never forget."
