EXPLANATION FILE FOR LEAST-SQUARES APPROXIMATIONS
=================================================
1. General Introduction to Polynomial Approximations
-------------------------------------------------
The discussion here centers on one of the popular concepts in approximation
theory: least-squares curve fitting. or regression (the statistical determina-
tion of parametric dependencies, e.g., the polynomial coefficients of a least-
squares curve fit to a set of data). Although the associated techniques are
usually applied to sets of data containing noise, they may also be employed to
provide approximations in functions. This is particularly useful when high-
accuracy tables that represent the function are available. ln the subroutines
given, there is no operational distinction made between curve-fitting noisy
data or fitting function tables. Therefore, these subroutines can be used both
for statistical regression and for functional approximations.
The least-squares curve-fitting criterion is simple and is based directly on
statistical concepts. We will take the linear correlation between two varia-
bles, x and y, as an example. If x is defined as the independent variable, and
y the dependent variable, then ideally the linear relationship between the two
is
y = a + bx (1.1.1)
If we are given a set of data values (xi, yi) in which there is sorne error in
the yi measurement, we have
yi = a + bx, + Ëi (1.1.2)
Ei represents the measurement error. However, the error is usually an unknown
quantity. Therefore, a and b cannot be obtained from equation (1.1.2). In-
stead, the form of equation (1.1.1) is assumed, and by sorne means, the data
set is used to obtain approximate values for a and b. We will call these
approximations alpha and beta respectively.
If the error, E;, is normally distributed with variance a2 (this is a fairly
good assumption and is expected from the Central Limit Theorem-see Ref. 3),
then the probability that the estimate, yi = alpha + beta xi, is within dy of
yi is
2 dy 2 2
P(yi - dy <= yi <= yi + dy) = ----------- exp(-(yi-alpha-beta xi) /2s )
s*sqrt(2pi)
(1.1.3)
2
ln addition, if the variance, s, is independent of yi (i.e., the noise is
additive and not signal dependent), and the errors are not correlaied (i.e.,
Ei is independent of Ej, i <> j), then the probability that aIl the predicted
values are within some dy of the measured values is
2dy N N 2
P = (-----------) exp( Sum [-(yi-alpha-beta xi) ] ) (1.1.4)
s*sqrt(2pi) i=1
P is called the likelihood. The values of alpha and beta that maximize P are
called the maximum-likelihood estimators (see Ref. 4). lt can further be shown
that the values of alpha and beta that maximize P are the most precise
unbiased estimates of a and b statistically attainable using the given data.
The maximum-likelihood criterion can be restated in the more familiar least-
squares form as follows. The "best" values of alpha and beta are those that
minimize S(alpha,beta):
N 2
S(alpha,beta) = Sum (yi-alpha-beta xi) (1.1.5)
i=1
As we will see later, the above least-squares example can be generalized to
an Mth-degree polynomial (M <= N - 1).
The approximation algorithms that result from the least-squares criterion
will be considered in subsequent sections. The discussion for the remainder
of this section will concentrate on the differences between least-squares
curve fitting and several of the other approximation methods, notably Taylor
series and min-max polynomials. An understanding of the intrinsic error
characteristics of these distinctly different approximation methods is very
helpful in choosing the best algorithm for the task.
Perhaps the most frequently used polynomial approximation method is the
Taylor eries expansion:
2 2
df| (x - x0) d f |
f(x) = f(xo) + (x - x0) --| + -------- ---2| + ... (1.1.6)
dx|x0 2! dx |x0
Its popularity is based on generality: any analytical function can be expanded
in a Taylor series. The standard mathematical handbooks (e.g., References 5,
6, and 7) contain tables of Taylor series expansions for many functions. lt
is therefore not surprising to find that functions are often evaluated in
computers by means of these expansions.
ln particular, this seems to be the case with much of the software for small
computers in which the basic trigonometrie and exponential functions (and
their inverses) are implemented in assembly language. The programmers who
write this type of software are usually not aware of the significant short-
falls of Taylor series expansions.
As discussed in the last chapter of Volume 1 of BASIC Scientific Subrou-
tines, Taylor series expansions are very convenient, but they are usually
among the poorest of the polynomial approximation techniques.
A much better approximation can be obtained simply by using a fourth-degree
least-squares polynomial. ln this case, the fourth-degree polynomial was gene-
rated by accurately calculating values for [ln (1 + x)] / x at 101 points (100
intervals) in the range 0 <= x <= 1. A least-squares cubic polynomial was then
fitted to this "data." The result was a much better fourth-degree approxi-
mation than that obtained by using the Taylor series, with the maximum error
reduced by a factor of 200 (figure not shown here).
The uneven distribution of error in least-squares approximations is cha-
racteristic of the method. Becket and Hurt (Ref. 8) point out that there are
theoretical reasons for not expecting the error near the endpoints of the
interval to go to zero regardless of the degree of the fit. This is remini-
scent of the Gibbs' Phenomenon in Fourier transform theory; a little ringing
is always present. Therefore, if least-squares fitting is employed, it is a
good practice to approximate the function over a range wider than that even-
tually needed.
It is possible to force the truncation error to be more evenly distributed.
This is accomplished by employing min-max polynomials. Approximations to these
special polynomials can be obtained by using the truncated Chebyshev series
discussed in Chapter 2. For the ln (1 + x) example, a four-term truncated
Chebyshev series approximation results in a maximum error about the same as
that observed for the quartic least-squares fit. If the next term in the
Chebyshev series were included, the maximum error would be reduced sevenfold.
Truncated Chebyshev series do not give true min-max polynomials. An exact
min-max fit would have equal-ripple error. ln sorne situations (e.g., Bessel
functions), it is desirable to evalua te the function for large values of
the argument. One class of approximations is particularly weIl suited for
this application: asymptotic series.
To have more general information about available methods, see chapter one
of [BIBLI 01].
2. First-Order Least Squares
-------------------------
The simplest and most common fitting function is the straight line. The
reasons for its popularity are that real data (e.g., population studies) are
often very noisy and do not warrant higher-order fits, and that linear fits
are ideal for sensitivity analysis. For example, in many control system models,
the key parametric inputs to the analysis are the derivatives (slopes) of the
input/output responses at the nominal control point. The derivation of the
coefficients for this case is very simple and forms a good basis for under-
standing the mathematics for higher-order fits.
Recall from above the S(alpha,beta) function:
N 2
S(alpha,beta) = Sum (yi-alpha-beta xi)
i=1
The parameters alpha and beta are to be chosen such that S(alpha,beta) is
minimized. At this minimum, it must be the case that dS/dalpha = 0 and
dS/dbeta = o. Thus, we have the following two equations:
N
Sum (yi - alpha - beta xi) = 0
i=1
N
Sum xi (yi - alpha - beta xi) = 0
i=1
(See Ref. 4, "Fitting Equations to Data", for a detailed discussion of this
method). We establish the following definitions:
_ N
x = (1/N) Sum x
i=1 i
(1.2.1)
_ N
y = (1/N) Sum y
i=1 i
The solutions for the coefficients are then
N _ _ N _ 2
beta = Sum (xi - x)(yi - y)/Sum (xi - x) (1.2.2)
i=1 i=1
_ _
alpha = y - beta x
The final fitted equation is y = alpha + beta x (1.2.3)
In addition, the unbiased estima te of the standard deviation is
N 2
| Sum (yi - alpha beta xi) |1/2
| i=1 |
Sd = | ------------------------ | (1.2.4)
| N - 2 |
| |
The above equations can be implemented as shown in program Lstsqr1. The
inputs to the program are the number of data points (N) and the data pairs
[X(M), Y(M)]. The results returned are the coefficients A and B (Y = A + BX),
and an unbiased estimate of the standard deviation, D.
It is instructive to apply Lstsqr1 to the linear approximation of sin X in
the first octant. Eleven equally spaced points were used to obtain the fol-
lowing least-squares estimate:
sin x == 0.104 + 0.662x for 0 <= x <= pi/2
The standard deviation of this fit is s = 0.08. It is apparent that the
least-squares fit is much better on the whole than the Taylor series
approximation. As expected from the earlier discussion, the truncated Taylor
series polynomial is accurate near the expansion point (x = 0), but the error
grows rapidly as x increases. The least-squares fit also behaves as expected;
the maximum relative errors are at the endpoints of the interval.
The least-squares line-fitting subroutine can be applied to create an even
better fit by observing that x is a common factor in the Taylor series expan-
sion. Thus, a more accurate fit is expected by using the function x(a + bx).
This is accomplished by replacing sin x with sin x / x, and approximating the
latter function. The resulting approximation is
sin x == x(1.078 - 0.3966x)
The standard deviation of this estimate is s = 0.05, which is a definite
improvement over the previous example. ln addition, the relative error near
x = 0 has been greatly reduced.
It is interesting to note that using the first two terms of the Taylor
series to approximate sin x, y = x - x^3/3!, gives a higher standard deviation
(s = 0.31) than that obtained using just the first term (s = 0.27). The
addition of the second term causes an overcorrection as measured by the
least-squares error metric.
Program Lstsqr1 is very easy to apply and is not very sensitive to round-off
error. lt also executes quickly. It requires as input at least three data
points, two of which must be distinct. The three-point requirement is due to
the M - 2 divisor in the standard-deviation calculation. The distinct-points
requirement is due to a simple geometrical consideration: at least two dis-
tinct points are needed to define a line. Because these conditions are seldom
violated, the subroutine does not check to see if they are met.
Below, we will examine second-order least-squares fitting using the parabo-
la.
3. Second-Order Least Squares
--------------------------
The concepts of the previous section can easily be extended to parabolic
fits. If we assume that the true dependence is y = a + bx + cx^2, then the
problem is one of determining estimates for the coefficients of that quadra-
tic. Continuing with the notation of the previous section, we will cali these
estimates alpha, beta, and gamma. The function to minimize then is
N 2 2
S(alpha,beta,gamma) = Sum (yi - alpha - beta xi - gamma xi ) (1.3.1)
i=1
The partial derivative approach can again be employed to generate three simul-
taneous equations. The solution to the se equations is a little cumbersome
unless some additional definitions are established:
N _ 2
S = (1/N) Sum (xi - x)
xx i=1
N _ _
S = (1/N) Sum (xi - x)(yi - y)
xy i=1
N _ 2
S = (1/N) Sum (yi - y)
yy i=1
(1.3.2)
N _ _
S = (1/N) Sum (xi - x)(xi² - x²)
xx² i=1
N _ 2
S = (1/N) Sum (xi² - x²)
x²x² i=1
N _ _
S = (1/N) Sum (yi - y)(yi² - y²)
yx² i=1
(For further reference, see any standard statistics text, e.g., Ref. 11.)
Using these definitions, it can be shown that the least-squares solutions for
the coefficients are
Sxy Sx²x² - Syx² Sxx²
beta = ---------------------
Sxx Sx²x² - (Sxx²)²
Sxx Sy²x² - Sxx² Sxy (1.3.3)
gamma = ---------------------
Sxx Sx²x² - (Sxx²)²
_ _ _
alpha = y - beta x - gamma x²
Program Lstsqr2 performs these laborious calculations. The inputs are the
number of data points (N) and the data pairs [X(M),Y(M)]. The returned values
are the quadratic coefficients (A, B, and C) and the standard deviation (D).
The given examples correspond to those in previous section in which a linear
fit to sin x was obtained. lncreasing the order of the fit to a quadratic
greatly improves the standard deviation.
The only warning regarding the use of program Lstsqr2 is relative to the
minimum number of data points. There must be at least four data pairs; at
least three of them must be distinct.
The generalization of this least-squares approach is the subject of the next
two sections. In those sections, polynomial least-squares fits of arbitrary
order are considered by repeating the previous analysis using matrices. The
real utility of this generalization is the extension to more than one dimen-
sion. For one-dimensional applications, this methodology is not recommended
because of round-off error. Rather, the orthogonal polynomial algorithm dis-
cussed in another file is suggested.
4. Nth-Order Least Squares
-----------------------
The linear and parabolic least-squares fitting algorithms discussed in the
previous sections represent special (and simple) cases of the general least-
squares polynomial approximation procedure. The special case subroutines were
presented separately for two reasons. First, those algorithms are easy for the
novice to understand. Second, and more importantly, the general case subrou-
tine is slower in execution because much more overhead exists in the calcula-
tion. The power of the general approach, however, is that it opens the door to
multidimensional curve fitting, as weIl as to mixed functional dependencies.
This will become clear as we go on.
We will develop the mathematics of the general algorithm using matrix alge-
bra. For those who are familiar with (or who simply dislike) this subject,
the ensuing discussion can be bypassed.
Consider the following polynomial approximation to y(x):
0 1 n
y(x) == D0x + D1x + ... + Dnx (1.4.1)
If the measurement (or table) pairs are (xi, y,), and if there are M of them,
then the approximation can be written in the following matrix form:
Y == X D (1.4.2)
In this notation, Y is a column vector of length M; X is an M row by N + 1
column matrix, and D is a column vector of length N + 1:
|y1|
|y2|
|..|
Y = |..| = Value vector
|..|
|..|
|ym|
0 1 n
|x1 x1 ... x1 |
| 0 1 n|
|x2 x2 ... x2 |
|..............|
X = |..............| = input data array
|..............|
|..............|
| 0 1 n|
|ym ym ... ym |
|D0|
|D1|
|..|
D = |..| = Coefficient vector
|..|
|..|
|Dn|
Recall that the error function [e.g., S(alpha,beta) in § 2) had the appearance
of the square of the length of a vector in Euclidean space. We can set up the
same structure in matrix notation:
t
Error = (Y - XO) (Y - XO) (1.4.3)
The variable in this equation is the coefficient vector. D. As we did earlier,
we can calculate the differential
t
d(Error) = (Y - XO) X dD
We want this differential to be zero. Because dD is arbitrary, this equation
reduces to
t
(Y - XO) X = 0
The coefficient vector can th en be solved for as
t -1 t
D = (X X) X Y (1.4.4)
Equation (1.4.4) has a very simple structure and permits easy solution for
the coefficients through very fundamental matrix operations.
The entire procedure has been implemented as shown in program Lstsqr. That
program consists of several modular parts, The main program is responsible for
obtaining the M data pairs in (X(I), Y(I)) form, and the degree of the fit, N.
However, a two-dimensional input data array, Z(I,J) is needed. This array cor-
responds to the matrix X in equation (1.4.4).The POLYCM (Polyomial Coefficient
Matrix) subroutine performs that task and creates the Z(I,J) array using the
X(I) data values. This array, along with M, N, and Y(I), is passed to LSTSQR,
a supervisory subroutine that implements equation (1.4.4) by calling several
matrix-operation subroutines. Note that N + 1, not N, is passed to Lstsqr.
Program Lstsqr returns the coefficients to the calling program in the vector
D(I). The calling program then obtains the standard deviation from subroutine
Sigma. The reason for choosing this modular structure will become apparent in
the next section when we consider multidimensional regression.
The main operational precautions to be observed in using these subroutines
center around the validity of the input data. The calling program must ensure
that at least N + 1 distinct data pairs are passed to POLYCM and LEASTSQR.
Also, there are several arrays to be dimensioned. '\
The practical precaution is simple-watch out for round-off error. ln the
three examples given, the input data set was perfectly linear. By choosing
fits higher than linear, larger matrices were processed (in particular,
inverted) than necessary, leading to increased round-off error. It is there-
fore suggested that low-order fits be attempted first, and that the standard
deviation be examined as higher orders are tried.
Equation (1.4.4) represents a very elegant and powerful solution to the ge-
neral least-squares problem. However, because matrices are involved, round-off
error has a strong influence on the results. ln later sections, means by which
this source of error can be greatly reduced will be discussed. Before doing
so, however, we will examine the extension of the ideas in this section to
more th an one dimension.
5. Multidimensional Least Squares
------------------------------
In the previous section we examined the use of matrix equations to find a
one-dimensional polynomial fit to a given set of data. We will proceed now to
generalize the algorithm for multidimensional use.
The key to the problem is the two-dimensional input data array X (actually,
Z(I,J) in the subroutines). It decides the outcome of equation (1.4.4) along
with the measured result vector, Y. The form of this matrix was established by
equation (1.4.1):
0 1 n
y(x) == D0x + D1x + ... + Dnx
The alteration to treat multiple dimensions can be made at this step.
One approach might be to redefine equation (1.4.1) as follows (for two
dimensions, x1 and x2):
0 1 m1 0 1
y(x1,x2) == Dox1 + D1x1 + ... + Dm1x1 + Dm1+1x2 + Dm1+2x2 + ...
m2
+ Dm1+m2+1x2
The implied assumptions in this approximation are that the degree of the poly-
nomial is m1, in x1 and m2 in x2, and that there are no cross-dependencies,
This is the equivalent of assuming y(x) == P1(x1) + P2(x2), and is very unre-
alistic. Another approach might be to assume the form y(x1,x2) == P1(x1) x
P2(x2). This permits a measure of the interdependence, but is much too limi-
ting an assumption.
The complete form is much more complicated, but it can be laid out in a lo-
gical structure. For example, in two dimensions the structure is
0 1 m1 0
y(x1,x2) == (D1x1 + D2x1 + ... + Dm1x1 ) x2
0 m1 1
+ (Dm1+2x1 + ... + D2m1+2x1 ) x2 (1.5.1)
+ ...
0 m1 m2
+ (Dm2(m1+1)+1x1 + ... + D(m2+1)(m1+1)x1 )x2
The number of coefficients required in this representation is N=(M1+1)(M2+1),
where M1 and M2 are the degrees of fit in each dimension. If three dimensions
were employed, there would be an additional level of brackets required, and
the number of coefficients would be N = (M1 + 1)(M2 + 1)(M3 + 1).
Undoubtedly the most difficult problem in using the least-squares curve-
fitting algorithm in several dimensions is deciphering the resulting coeffi-
cients. Therefore, if you are actually going to use the subroutine that will
be presented shortly, the encoding scheme must be clearly understood.
As an example, we will consider the form of the X array [actuaIly, 2(I,J) in
the subroutines] for the three-dimensional case in which the degree of fit is
linear in the first two dimensions, but quadratic in the third. The number of
coefficients to be found is N = 2x2x3 = 12. Each row of the X array corres-
ponds to a particular data point, say y(l). Therefore, we have
Row I of x: 1, x1(I), x2(I), x1(I)x2(I), x3(I), xl(I)x3(I), x2(I)x3(I),
x1(I)x2 I)x3(I), x3^2(I), xl(I)x3^2(I), x2(I)x3^2(I),
x1(I)x2(I)x3^2(I)
Because there are 12 coefficients to be de termine d, there must be at least
12 distinct data points. Therefore, the minimum size of the X array is 12 X 12.
As an exercise, try to de termine what the X array would look like if the
number of dimensions were three, but the fits were quadratic, linear, and
linear respectively.
Although generating such an array and processing it according to matrix
equation (1.4.4) would be almost impossible by hand, computers are weIl suited
to such tasks.
We can use the computer algorithm given in the previous section for one-
dimensional least-squares fitting simply by replacing a few modules. First,
the subroutine that generates the X array [actually, 2(I,J) in the program] is
replaced with that shown in program MLTNLREG). This subroutine takes the data
and generates the 2(I,J) matrix used by LEASTSQR. Also, this replacement sub-
routine calculates the standard deviation on the second calling, and thereby
also replaces SIGMA for multidimensional problems. AlI the other subroutines
used in the previous program (LEASTSQR and aIl the matrix-operation programs)
are kept intact.
The multidimensional least-squares procedure is demonstrated in program
Mltnlreg. In the given examples, the data was generated from the equation y =
x1 + x2. The algorithm clearly identified the linear dependencies, and with
reasonably good round-off error. Note, however, that the round-off error was
higher for the case in which a parabolic fit was attempted in one of the
dimensions. Large arrays lead to increased round-off error.
There are several precautions associated with the use of this collection of
programs. First, the degree of fit in each dimension must be linear or higher.
This is certainly not a restriction on the utility of the calculation, but it
does put the responsibility on the calling program that realistic inputs be
supplied. Second, there must be at least N + 1 data points, with N of them
distinct, where N = (M1 + 1)(M2 + 1) ... (ML + 1). Third, the number of di-
mensions that can be processed is limited to nine. Beyond that point, the
coefficient-generation subroutine will abort and set L = O. However, it it
very unlikely that this limit would ever be approached since the minimum size
of the 2(I,J) array corresponding to nine linear dimensions would be (29 + 1)
X (29), or 513 rows by 512 columns. Finally, remember to dimension the arrays
in the calling program.
In the next section, we will consider an alternative approach to polynomial
regression in one dimension. It will have the advantage of low round-off
error, high execution speed, and an automatic termination option (see file
Lsqply.txt).
From [BIBLI 01].
-----------------------------------------------
End of file Lstsqr.txt