Empirical Distribution Functions

Mathematica Overview

This section discusses the mechanics of plotting and manipulating Empirical Distribution Functions (EDF) in Mathematica.

Since this subject is best discussed with the aid of an example, we start by creating the array of data for which we want to plot the EDF. This example is taken from Project II as presented during STAP 04. I have significantly reduced the number of replications in order to be able to show the data without taking too much space.

In[13]:=

Needs["Statistics`DataManipulation`"]

n = 10 ;

In[15]:=

T = RandomArray[ExponentialDistribution[1./14], n] + RandomArray[GammaDistribution[2, 14], n] + RandomArray[UniformDistribution[6, 12], n]

Out[15]=

{34.7227, 41.0089, 37.344, 41.912, 51.5236, 17.0298, 37.2253, 90.9462, 71.9425, 74.458}

Note: this section uses a number of the more advanced Mathematica manipulations available with the Statistics and DataManipulation packages; the principles involved are straightforward, but the manipulations may not be.

Now what we need to do is to create a matrix with the sorted values of the sample each appearing twice in the first column, and the frequency data appearing in the second column.  The exact organization of the data is intended to cause the plotting routines to plot it as a step function.

First, create a vector (list) of the frequency values.  This list consists of the values {0, 1/n, 1/n, 2/n, 2/n, 3/n, 3/n, ... , (n-1)/n, (n-1)/n, 1}.  We create it by first creating a list of the values {1/n, 2/n, ..., 1}

In[16]:=

Range[1., n]/n

Out[16]=

{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.}

Then we double the list and insert the value 0, converting the nested lists into a single flat list

In[17]:=

Flatten[{0, Range[1, n]/n, Range[1, n]/n}]

Out[17]=

{0, 1/10, 1/5, 3/10, 2/5, 1/2, 3/5, 7/10, 4/5, 9/10, 1, 1/10, 1/5, 3/10, 2/5, 1/2, 3/5, 7/10, 4/5, 9/10, 1}

Notice that every value appears twice except 0.  Now we sort the list and remove one of the 1's at the top of the list, assigning the result to a symbol "freq".

In[36]:=

freq = Delete[Sort[Flatten[{0, Range[1, n]/n, Range[1, n]/n}]], 2n + 1]

Out[36]=

{0, 1/10, 1/10, 1/5, 1/5, 3/10, 3/10, 2/5, 2/5, 1/2, 1/2, 3/5, 3/5, 7/10, 7/10, 4/5, 4/5, 9/10, 9/10, 1}

Each value in the sample also appears twice, and needs to be sorted;

In[19]:=

Sort[Join[T, T]]

Out[19]=

{17.0298, 17.0298, 34.7227, 34.7227, 37.2253, 37.2253, 37.344, 37.344, 41.0089, 41.0089, 41.912, 41.912, 51.5236, 51.5236, 71.9425, 71.9425, 74.458, 74.458, 90.9462, 90.9462}

If we simply make a list of the two lists, we will have a matrix with the right values in it.  However, the list will consist of two rows and n columns, while we want it to be two columns and n rows.  The Transpose function takes care of this.

In[20]:=

EDF = Transpose[{Sort[Join[T, T]], freq}]

Out[20]=

Now we can plot the resulting pairs.  The "PlotJoined→True" option causes Mathematica to join the plotted points with line segments, which is the desired form of the plot.

In[21]:=

EDFplot = ListPlot[EDF, PlotJoined→True]

[Graphics:../HTMLFiles/index_35.gif]

Out[21]=

-Graphics -

Fitting A Model to the EDF

Once the EDF has been constructed, we need to fit a model of some sort to it.  The fundamental process is to select parameters for the model that cause it to fit "well" to the data; there are several ways to approach this problem.

Estimator Approach

One approach to this problem is one that fits (pun intended) well with traditional statistics: define an estimator for the parameter(s) in question and use the point estimates resulting from it.  In this case, we will try to fit the Gamma distribution, finding point estimates for E(X) = αβ and V(X) = αβ^2

In[22]:=

LocationReport[Column[EDF, 1]]

Out[22]=

{Mean→49.8113, HarmonicMean→40.4628, Median→41.4605}

In[23]:=

Variance[Column[EDF, 1]]

Out[23]=

478.766

In[24]:=

Solve[{α β == Mean[Column[EDF, 1]], α  β^2 == Variance[Column[EDF, 1]]}, {α, β}]

Out[24]=

{{α→5.18242, β→9.61159}}

These are our estimates for α and β; check to make sure they really have the desired properties (which is mostly a check that we did the solution right).

In[25]:=

Mean[GammaDistribution[7.77008, 6.46818]]

Out[25]=

50.2583

In[26]:=

Variance[GammaDistribution[7.77008, 6.46818]]

Out[26]=

325.08

Plot the fitted graph

In[27]:=

ListPlot[Table[{Quantile[GammaDistribution[7.77008, 6.46818], i], i}, {i, 0, .999, .001}]] ;

[Graphics:../HTMLFiles/index_49.gif]

And now the fit and the EDF on the same axis.

In[28]:=

MultipleListPlot[Table[{Quantile[GammaDistribution[7.77008, 6.46818], i], i}, {i, 0, .9999, .02}], EDF, PlotJoined→True]

[Graphics:../HTMLFiles/index_51.gif]

Out[28]=

-Graphics -

Clearly this is a reasonably good fit of the distribution; since it's part of Project II, let's apply the same technique to an exponential.

In[29]:=

[Graphics:../HTMLFiles/index_54.gif]

Out[29]=

-Graphics -

Clearly, while it may have the same mean, the exponential distribution is not as good a model of the simulated data as the Gamma distribution.

Least Squares Fit

In principle, computation of the least squares fit is straightforward; the actual Mathematica implementation is much less straightforward..

The first step is to find the predicted values corresponding to the observed values.

In[35]:=

Sort[T]

Out[35]=

{17.0298, 34.7227, 37.2253, 37.344, 41.0089, 41.912, 51.5236, 71.9425, 74.458, 90.9462}

In[73]:=

frequencyList = N[Sort[Flatten[{Range[0, n - 1]/n} + 1/(2 * n)]]]

Out[73]=

{0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95}

Compare the predictions and observed values

In[78]:=

Transpose[{Quantile[GammaDistribution[7.7, 6.5], frequencyList], Sort[T]}]

Out[78]=

Compute the sum of squared errors.  

In[86]:=

Sum[(Quantile[GammaDistribution[7.7, 6.5], frequencyList][[i]] - Sort[T][[i]])^2, {i, 1, n}]

Out[86]=

407.37

Finally, minimize the sum of the squared error as a function of the family parameters.

In[85]:=

NMinimize[Sum[(Quantile[GammaDistribution[α, β], frequencyList][[i]] - Sort[T][[i]])^2, {i, 1, n}], {α, β}]

Out[85]=

In[84]:=

??NMinimize

NMinimize[f, {x, y, ... }] minimizes f numerically with respect to x, y, ... . NMinimize[{f, cons}, {x, y, ... }] minimizes f numerically subject to the constraints cons. More…

Attributes[NMinimize]={Protected,ReadProtected}
Options[NMinimize]={AccuracyGoal→Automatic,EvaluationMonitor→None,MaxIterations→100,Method→Automatic,PrecisionGoal→Automatic,StepMonitor→None,WorkingPrecision→MachinePrecision}

In[228]:=

fitted = ListPlot[Transpose[{Quantile[ExponentialDistribution[λ], Delete[Column[EDF, 2], 2n]], Delete[Column[EDF, 2], 2n]}]]/.λ->3.84125

Delete :: partw : Part  {20} of Column[Transpose[{«1», {0., «9», «10»}}], 2] does not exist. More…

Delete :: partw : Part  {20} of Column[Transpose[{«1», {0., «9», «10»}}], 2] does not exist. More…

ListPlot :: list : List expected at position 1 in ListPlot[Transpose[{Quantile[«23»[λ], «1»], «1»}]] .  More…

Transpose :: nmtx : The first two levels of the one-dimensional list  {Quantile[ExponentialDistribution[«19»], «1»], «1»} cannot be transposed. More…

ListPlot :: list : List expected at position 1 in ListPlot[Transpose[{Quantile[«23»[«19»], «1»], «1»}]] .  More…

Out[228]=

In[229]:=

2n

Out[229]=

20

In[230]:=

Transpose[{Quantile[ExponentialDistribution[λ], Delete[Column[EDF, 2], 2n]], Delete[Column[EDF, 2], 2 * n]}]/.λ->3.84125

Delete :: partw : Part  {20} of Column[Transpose[{«1», {0., «9», «10»}}], 2] does not exist. More…

Delete :: partw : Part  {20} of Column[Transpose[{«1», {0., «9», «10»}}], 2] does not exist. More…

Transpose :: nmtx : The first two levels of the one-dimensional list  {Quantile[ExponentialDistribution[«19»], «1»], «1»} cannot be transposed. More…

Out[230]=

In[231]:=

[Graphics:../HTMLFiles/index_86.gif]

Out[231]=

-Graphics -


Created by Mathematica  (July 20, 2006) Valid XHTML 1.1!