Regression
Purpose
This chapter demonstrates how to solve a simple linear regression problem: fitting the model, computing residuals, computing the coefficient of determination (r²) and related quantities, conducting a model utility test, and computing predictions and prediction intervals for the model.
The notebook builds on, and is intended to support, the study of linear regression in Chapter 12 of Jay L. Devore's Probability and Statistics for Engineering and the Sciences.
Mathematica Solution
The starting point is a data set, either read in from disk or typed in manually. This data set is the weight of terrestrial animals' brains against total body weight (data provided by Cadet Kaiser, from <>).
Note: this is relatively advanced Mathematica manipulation, using the "DataManipulation" subpackage (part of the Statistics package). For more discussion of how to use the data manipulation package, see Abell et al., particularly Chapter 2, where the basic approach is discussed and several examples are given. The specific approach used here is slightly different.
Obviously, the specific file name will vary depending on where the data file is located. This particular example is copied from the data disk supplied with Devore into my lesson planning directory.
In[232]:=
Out[232]=
If we were reading the data from a file, the following shows how:
In[74]:=
filename="E:\\Kaiser_HW#2_MA206_Z3.txt"
axesnames=ReadList[filename,Word,2,WordSeparators→{",","\t"}];
textdata=ReadList[filename,Word,WordSeparators→{"\t",","},RecordLists→True]
dataTable=Delete[ToExpression[textdata],1]
x=Flatten[ColumnTake[dataTable,1]]
y=Flatten[ColumnTake[dataTable,{2}]]
In[234]:=
Out[234]=
Out[236]=
Out[237]=
Out[238]=
As a first step, plot the data to see if they have a reasonably linear relationship. If not, it is inappropriate to use the simple linear regression model we are discussing; other, more advanced techniques beyond the scope of this worksheet will be needed.
In[239]:=
Required
Fit a linear model using the Principle of Least Squares.
Compute statistics related to model quality.
Conduct a model utility test.
Predict values of the response variable given values of the predictor variable. Create prediction intervals on the predicted value.
Model
The model appropriate to this problem is

Y = β₀ + β₁x + ε

where β₀ and β₁ are the (unknown) coefficients of the true regression line, and ε is assumed to be N(0, σε²).
For clarity, this notebook uses the following notation (similar, but not identical, to Devore's notation).
β₀: the y-intercept of the true equation. It is estimated by β̂₀, and the specific value of the point estimate in this problem is defined to be b0.
β₁: the slope of the true equation. It is estimated by β̂₁, and the specific value of the point estimate in this problem is defined to be b1.
σε: the standard deviation of ε, the error term in the model. It is estimated by σ̂ε; the point estimate of σε is s. Note that Devore uses σ as his symbol for this quantity; using σε is a reminder of which standard deviation we are dealing with.
In many equations, Devore uses the general symbol for the estimator he is interested in; β̂₁, for example. Unless there is a specific reason for following Devore's (more abstractly correct) notation, the computational formulæ in this notebook use the specific value of the point estimate.
Fitting the Model
Following Devore, we compute the intermediate quantities n, Sxy, and Sxx (which will be used later as well):
In[240]:=
Out[242]//TableForm=
n | = | 62 |
Sxy | = | |
Sxx | = | |
Note: Be very cautious in entering the formulas shown here. While using the "traditional" summation notation makes the analogy to Devore's statement of the equations clearer, it is easy to introduce subtle errors in Mathematica using this form. A less error-prone approach is to state the summations as, e.g., Sum[ x[[ i ]]*y[[ i ]], {i,1,n}], where the limits of each part of the expression are explicitly shown.
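For readers who want to check these sums outside Mathematica, the computational forms can be sketched in a few lines of Python. The data below are a hypothetical five-point set, purely for illustration; only the formulas mirror the notebook:

```python
# Devore's computational forms for the intermediate sums (hypothetical data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative predictor values, NOT the animal data
y = [2.0, 4.0, 5.0, 4.0, 5.0]  # illustrative response values

n = len(x)
# Sxy = sum(x_i * y_i) - (sum x_i)(sum y_i) / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
# Sxx = sum(x_i^2) - (sum x_i)^2 / n
Sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
print(n, Sxy, Sxx)  # → 5 6.0 10.0
```

Writing the sums with explicit, elementwise loops sidesteps exactly the subtle broadcasting errors the note above warns about.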
and finally the actual coefficients of the fitted model.
In[243]:=
Out[245]//TableForm=
(The estimated value of the third parameter of the model, σε, will be computed later.)
These are based on the computational forms given by Devore. While Mathematica can handle the original forms directly, the computational forms are more numerically stable, which seems desirable.
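As a cross-check, the same coefficient formulas can be sketched in Python, again on the hypothetical five-point data (b1 = Sxy/Sxx and b0 = ȳ − b1·x̄ are Devore's computational forms):

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data, for illustration only
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
Sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n

b1 = Sxy / Sxx                     # estimated slope
b0 = sum(y) / n - b1 * sum(x) / n  # estimated intercept
print(round(b0, 10), round(b1, 10))  # → 2.2 0.6
```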
Immediately, we display the resulting model superimposed on a plot of the data to get a visual check that the model is actually appropriate for the data.
See the discussion of graphics in the appendix for more detail on how this display is created.
In[246]:=
Computing Residuals
The residuals are defined to be the actual value of the data minus the predicted value. They are useful as a check on how good the model is, and their numerical values will be useful in later computations.
In[248]:=
Out[249]=
In[250]:=
Based on our model, the residuals are the observed values of ε . We assumed ε to be normally distributed with mean zero and a constant standard deviation. If the plot of the residuals does not support this assumption, caution is indicated.
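The residual computation is a one-liner in any language. This Python sketch (hypothetical data and coefficients) also illustrates a useful sanity check: least-squares residuals always sum to (numerically) zero:

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = 2.2, 0.6              # least-squares coefficients for these data

# residual_i = observed y_i minus fitted value b0 + b1*x_i
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print([round(e, 10) for e in residuals])  # → [-0.8, 0.6, 1.0, -0.6, -0.2]
```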
Estimating σε² and σε
From the definition on page 512 we can compute estimates of the standard deviation of ε (noting that the summand is simply the ith residual):
In[251]:=
Out[254]//TableForm=
SSE: | |
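A Python sketch of the same estimate, on the hypothetical five-point example (SSE is the sum of squared residuals; s estimates σε with n − 2 degrees of freedom):

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = 2.2, 0.6
n = len(x)

# SSE = sum of squared residuals; two degrees of freedom are "spent" on b0 and b1
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(SSE / (n - 2))   # point estimate of sigma_eps
print(round(SSE, 10), round(s, 6))  # → 2.4 0.894427
```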
Coefficient of Determination
From the definitions on page 514, the total sum of squares and the coefficient of determination (r², given the Mathematica name r2) are:
In[255]:=
Out[257]//TableForm=
SST: | |
SSE: | |
SSM: | |
Values of r² near 1 indicate that the simple linear model explains most of the variability in the data. Be careful, though; departures from this can be caused by variability in the data (σε is large) or by a mismatch between the model actually underlying the data and the assumed model of a line.
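The corresponding computation, sketched in Python on the hypothetical example (SST from the computational form, r² = 1 − SSE/SST):

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1, n = 2.2, 0.6, 5

SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
SST = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n  # total sum of squares
r2 = 1 - SSE / SST                                # coefficient of determination
print(round(SST, 10), round(r2, 10))  # → 6.0 0.6
```

Here the hypothetical line explains 60% of the variability; the remaining 40% is residual scatter.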
Inferences on the Slope Parameter
The slope parameter β₁ is a measure of how much y changes for a unit change in x. If β₁ is actually zero, it means that changes in x do not really correspond to changes in y at all. So we compute a confidence interval on β₁ to see if it's reasonable to say that β₁ really is zero.
First we compute the critical value. This is the same problem as finding a confidence interval on the mean—except that in this case the number of degrees of freedom is different.
In[258]:=
Out[261]//TableForm=
The function f[x,ν] is the probability density function of a Student's T random variable with ν degrees of freedom (defined in an initialization cell in an appendix to this notebook). A graph of the relevant T distribution is shown below.
NOTE: The Mathematica commands to produce and label this graph are more complicated than the plot itself is worth. Rather than try to teach this command, plot the distribution by itself, then draw in the relevant shading and labels.
In[262]:=
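The critical-value computation can also be sketched numerically without a statistics library: integrate the Student's T density (the analogue of the notebook's f[x,ν]) with Simpson's rule and invert the CDF by bisection. This minimal Python illustration uses hypothetical values (3 degrees of freedom, 95% confidence), not those of the animal data:

```python
import math

def t_pdf(t, nu):
    """Student's T probability density with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + t * t / nu) ** (-(nu + 1) / 2)

def t_cdf(t, nu, steps=2000):
    """CDF via Simpson's rule on [0, t], using symmetry about zero."""
    if t < 0:
        return 1 - t_cdf(-t, nu, steps)
    h = t / steps
    acc = t_pdf(0, nu) + t_pdf(t, nu)
    for i in range(1, steps):
        acc += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 0.5 + acc * h / 3

def t_crit(alpha, nu):
    """Upper alpha/2 critical value t*, found by bisection on the CDF."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_crit(0.05, 3), 3))  # → 3.182, matching the t table
```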
A 100(1-α)% confidence interval for β₁, the slope parameter of the regression model, is given by the formula on page 523 of Devore (noting that b1 is the value of β̂₁):
In[263]:=
Out[263]=
If this confidence interval includes zero, the explanatory power of the model is suspect; the possibility that the supposedly dependent variable is in fact independent of the predictor variable is plausible based on the data.
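A sketch of the interval computation in Python, on the hypothetical five-point example, using the tabulated critical value t(0.025, 3) ≈ 3.182:

```python
import math

# Hypothetical five-point example: n = 5, so n - 2 = 3 degrees of freedom
n, b1 = 5, 0.6
Sxx, s = 10.0, math.sqrt(0.8)
tstar = 3.182                   # t critical value, 95%, 3 df (from a t table)

# CI for beta1: b1 +/- t* * s / sqrt(Sxx)   (Devore p. 523)
half_width = tstar * s / math.sqrt(Sxx)
ci = (b1 - half_width, b1 + half_width)
print(round(ci[0], 3), round(ci[1], 3))  # → -0.3 1.5
```

Note that this illustrative interval contains zero, which is exactly the warning case described above: with only five points there is little evidence of a nonzero slope.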
Model Utility Test
The model utility test continues the same investigation as the confidence interval just computed. It is a hypothesis test comparing
H₀: β₁ = 0
Hₐ: β₁ ≠ 0
The test statistic is
In[264]:=
Out[264]=
With the corresponding P-value (recalling that since Hₐ is β₁ ≠ 0, this is a two-sided test)
In[265]:=
Out[265]=
which has the usual interpretation that smaller P-values provide stronger support for the alternate hypothesis that β₁ is not zero. In general, rejecting the null hypothesis provides support that the predictor variable is useful in explaining the changes in the response variable.
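The test statistic and two-sided P-value can be sketched in Python on the hypothetical example; the T CDF is computed by numerical integration rather than a statistics library:

```python
import math

def t_pdf(t, nu):
    """Student's T density with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + t * t / nu) ** (-(nu + 1) / 2)

def t_cdf(t, nu, steps=2000):
    """CDF via Simpson's rule on [0, t], using symmetry about zero."""
    if t < 0:
        return 1 - t_cdf(-t, nu, steps)
    h = t / steps
    acc = t_pdf(0, nu) + t_pdf(t, nu)
    for i in range(1, steps):
        acc += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 0.5 + acc * h / 3

# Hypothetical example: b1 = 0.6, s = sqrt(0.8), Sxx = 10, n = 5
n, b1, Sxx = 5, 0.6, 10.0
s = math.sqrt(0.8)

t_stat = b1 / (s / math.sqrt(Sxx))             # test statistic for H0: beta1 = 0
p_value = 2 * (1 - t_cdf(abs(t_stat), n - 2))  # two-sided P-value
print(round(t_stat, 3), round(p_value, 3))     # → 2.121 0.124
```

With a P-value near 0.12, the hypothetical data would not reject H₀ at the usual α = 0.05 level.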
Prediction of Y
Specific Values of Y Given Specific Values of X
Predicting a specific value of Y given a specific value x is exactly what our model is about. So making this computation is as simple as substituting into the basic model (setting ε to 0).
In[266]:=
Out[266]=
In[267]:=
Out[267]=
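The substitution itself, sketched in Python with hypothetical coefficients and a hypothetical predictor value:

```python
# Predicted Y at a specific x: substitute into the fitted model, with epsilon
# set to its mean of zero. All values here are illustrative assumptions.
b0, b1 = 2.2, 0.6
xstar = 3.5                 # hypothetical value of the predictor
yhat = b0 + b1 * xstar
print(round(yhat, 10))      # → 4.3
```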
Extrapolation—predicting a y value based on an x value that is beyond the observed data in either direction—is mechanically easy, but dangerous. Since the pair of interest is outside our experience, we have no way to judge whether extending the linear model out to this point is reasonable or not. Don't extrapolate!
In[268]:=
Out[268]//TableForm=
Data range: | 0.005 | ≤ x ≤ | 6654 |
Prediction Interval on a Future Value
As shown above, a predicted value for Y at a specific value x* is found by substituting x* into the model. Devore shows that (assuming β̂₀ and β̂₁ are themselves unbiased, which is true given our computational formulæ and the assumed distribution of ε) the expected value of this predicted value is indeed β₀ + β₁x* as desired. Given this and the distribution of an appropriate standardized statistic, we can compute a 100(1-α)% confidence interval for Y(x*).
First, we compute the standard error of the predicted Y value:
In[269]:=
Out[269]=
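A Python sketch of this standard error on the hypothetical example (x̄ is the sample mean of the predictor; the form follows Devore Section 12.4):

```python
import math

# Hypothetical example: n = 5, mean(x) = 3, Sxx = 10, s = sqrt(0.8)
n, xbar, Sxx = 5, 3.0, 10.0
s = math.sqrt(0.8)
xstar = 3.5   # hypothetical point of prediction

# Standard error of the estimated mean response at x = xstar
se_yhat = s * math.sqrt(1 / n + (xstar - xbar) ** 2 / Sxx)
print(round(se_yhat, 6))  # → 0.424264
```

Note how the (x* − x̄)² term makes the standard error grow as x* moves away from the center of the data, one more reason extrapolation is risky.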
At this point we could compute a confidence interval on E(Ŷ) following Equation 12.6 on page 532. However, that's not what we're after; we want a prediction interval on a future y value.
Once again, we need to find the T critical value, labeled "tstarpredict" (to allow it to be different from the critical value we computed above).
In[270]:=
Out[271]=
The prediction interval is (Equation 12.7 on page 535)
In[272]:=
Out[272]=
and as expected, the predicted value is in the center of the interval.
In[273]:=
Out[273]=
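The whole prediction-interval computation, sketched in Python on the hypothetical example. Note the extra "1 +" under the square root, which is what distinguishes the prediction interval for a future Y from the confidence interval on the mean response:

```python
import math

# Hypothetical example continued: predict Y at xstar = 3.5
n, xbar, Sxx = 5, 3.0, 10.0
b0, b1 = 2.2, 0.6
s = math.sqrt(0.8)
xstar = 3.5
tstar = 3.182                # t critical value, 95%, n - 2 = 3 df (from a t table)

yhat = b0 + b1 * xstar
# Prediction interval: the "1 +" accounts for the variability of a new observation
half = tstar * s * math.sqrt(1 + 1 / n + (xstar - xbar) ** 2 / Sxx)
pi = (yhat - half, yhat + half)
print(round(pi[0], 3), round(pi[1], 3))  # → 1.15 7.45
```

As in the notebook, the predicted value (4.3 here) sits at the center of the interval.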
Results
To summarize our results:
We are fitting the model Y = β₀ + β₁x + ε to the observed data. Our estimates for β₀, β₁, and σε, together with other relevant measures, are:
In[274]:=
Out[274]//TableForm=
SSE: | |
SSM: | |
SST: | |
Data range: | 0.005 | ≤ x ≤ | 6654 |
The model and data together are shown below:
In[275]:=
In[276]:=
Out[276]=
Appendix: Mathematica Definitions for Regression
This appendix is for other definitions that are needed in computations. Computation cells in this appendix are initialization cells, to be evaluated when the notebook is opened.
In[2]:=
Student's T distribution
In[4]:=
In[280]:=
Excel Solution
Created by Mathematica (July 20, 2006) |