The "Best Fit" to some \((x_n, y_n)\) Data

If we assume that our investment grows according to:

\[ A(N) = A(0)\,(1 + R)^N \]

where we start with $A(0) and, after N months (or days or years ...), we have $A(N), then it's easy to compute R, the monthly Rate of Return (or daily or yearly ...):

\[ R = \left\{ \frac{A(N)}{A(0)} \right\}^{1/N} - 1 \]
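Here's a minimal Python sketch of that endpoint formula. The function name `endpoint_rate` and the dollar figures are made up for illustration.

```python
def endpoint_rate(a0, aN, N):
    """Per-period rate R such that a0 * (1 + R)**N == aN."""
    return (aN / a0) ** (1.0 / N) - 1.0

# Hypothetical numbers: $1,000 grows to $2,000 over 120 months.
R = endpoint_rate(1000.0, 2000.0, 120)
print(f"monthly R = {R:.6f}")

# Plugging R back into A(0)(1+R)^N recovers the final value (up to rounding).
print(f"check: {1000.0 * (1 + R) ** 120:.2f}")
```

Notice that only the first and last values enter the calculation.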

This formula ignores everything that's happened between the beginning and the end (after N months). What we really want is the following:

  1. Consider \(C(1+R)^n\) as an approximation to \(A(n)\).
  2. Then, taking logarithms, we want \(\log\{C(1+R)^n\} = \log(C) + n \log(1+R)\) to be an approximation to \(\log(A(n))\) for every n ... not just the first (n=0) and last (n=N).
  3. Choose the numbers C and R so that this is the best approximation, meaning that the errors are minimized (in some optimal way).

For convenience we let \(K = \log(C)\) and \(M = \log(1+R)\) and consider the straight line \(nM + K\) which, for each value of n, is supposed to approximate the numbers
log(A(0)), log(A(1)), log(A(2)), ..., log(A(N))

For sanitary reasons, we let these numbers be called
\(y_0, y_1, y_2, \ldots, y_{100}\) and we also assume N = 100.

So far, so good.
Here are the errors in our approximation

\[ e_0 = y_0 - K,\quad e_1 = y_1 - (M+K),\quad e_2 = y_2 - (2M+K),\quad \ldots,\quad e_{100} = y_{100} - (100M+K) \]

and the sum of the squares of these errors, \(\sum e_n^2\), is:

\[ \{y_0 - K\}^2 + \{y_1 - (M+K)\}^2 + \{y_2 - (2M+K)\}^2 + \cdots + \{y_{100} - (100M+K)\}^2 \]

Remember, we know the numbers \(y_0, y_1, y_2\), etc. (they're the logarithms of the dollar values of our investment, after 0, 1, 2, 3, etc. months). What we want is to choose just two numbers, M and K, so the sum of the squares of these errors is as small as possible (which, by the way, defines what we mean by the optimal or "Best" ... but you may have another definition).

We'll call this error E(M,K), so:

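\[ E(M,K) = \sum_{n=0}^{N} \{y_n - (nM+K)\}^2 \]
where the sum is from n = 0 to n = N (= 100, say)

and minimize like so (careful ... some Calculus here):

\[ \frac{\partial}{\partial M} E(M,K) = 0 \quad\text{or}\quad \sum n\,\{y_n - (nM+K)\} = 0 \quad\text{or}\quad M \sum n^2 + K \sum n = \sum n\,y_n \]

\[ \frac{\partial}{\partial K} E(M,K) = 0 \quad\text{or}\quad \sum \{y_n - (nM+K)\} = 0 \quad\text{or}\quad M \sum n + K \sum 1 = \sum y_n \]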

The solution is:

M = ??     K = ??


... left as an exercise ...*


Okay, time for a picture.
Here's the logarithm of the TSE 300 over some 14-year period
and the straight line Mn + K
(with M and K as per formula, above, calculated from the logarithms of the TSE)

You can use \(\log_{10}\) or \(\log_e\) or \(\log_\pi\) or whatever
and the horizontal axis could be labelled 0, 1, 2, 3, ... 14 and that's n.

Now plot exp(Mn+K) versus A(n)
(A(n)? Them's the actual TSE values we want to approximate), and get:

The green curve is the exponential which simply goes from the first TSE value to the last.
The magenta curve is our approximation which incorporates all values of the TSE over this period (cuz M and K are obtained from all those nice \(y_n = \log(A(n))\)).
Nice, eh?

*
In general, if we want the "best" straight line fit to points \((x_1,y_1), (x_2,y_2), (x_3,y_3), \ldots, (x_N,y_N)\)
like so:

we start with a line y = Mx + K and minimize the error
\[ E(M,K) = \sum \{y_n - (M x_n + K)\}^2 \]
and get equations similar to those above, namely:

\[ M \sum x_n^2 + K \sum x_n = \sum x_n y_n \]
\[ M \sum x_n + K \sum 1 = \sum y_n \]

where the sum is from n = 1 to n = N

The solution is
where we drop the subscripts,
writing x in place of \(x_n\), etc. cuz it looks neater
... but you understand x really means \(x_n\) ...

\[ M = \frac{N \sum xy - \sum x \sum y}{N \sum x^2 - \left( \sum x \right)^2} \]

\[ K = \frac{\sum x^2 \sum y - \sum x \sum xy}{N \sum x^2 - \left( \sum x \right)^2} \]


Mamma mia!
Uh ... did I mention that Σ1 = 1+1+1+...+1 = N ?
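If you'd rather let a computer do the sums, the M and K formulas can be sketched in a few lines of Python. The function name `best_line` and the sample points are invented for illustration. The points lie exactly on y = 2x + 1, so the fit should hand back M = 2 and K = 1.

```python
def best_line(xs, ys):
    """Least-squares slope M and intercept K for the line y = M*x + K."""
    N = len(xs)
    Sx = sum(xs)                                # sum of x
    Sy = sum(ys)                                # sum of y
    Sxx = sum(x * x for x in xs)                # sum of x^2
    Sxy = sum(x * y for x, y in zip(xs, ys))    # sum of x*y
    D = N * Sxx - Sx ** 2                       # common denominator
    M = (N * Sxy - Sx * Sy) / D
    K = (Sxx * Sy - Sx * Sxy) / D
    return M, K

# Made-up points lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
M, K = best_line(xs, ys)
print(M, K)   # 2.0 1.0
```

The denominator D is zero when all the x's are equal (a vertical stack of points), so this sketch assumes at least two distinct x values.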


Of course, if you have MS Excel, the calculation of M=SLOPE and K=INTERCEPT is easy. Just put the logarithms of the Data, LN(Data), in column A and the x's in column B and use the Excel commands:

=SLOPE(A1:A100,B1:B100)
=INTERCEPT(A1:A100,B1:B100)
This'll give a line: y = Mx+K.
The "best fit" to the Data is then EXP(y) = EXP(Mx+K) vs x ... for example:
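For readers without Excel, the same recipe (take logs, fit a line, exponentiate) might be sketched in Python like so. This is an illustration only: the function name `fit_exponential` is hypothetical, the x's are assumed to be 0, 1, 2, ... and the data values are invented, not real market numbers.

```python
import math

def fit_exponential(values):
    """Fit log(values[n]) ~ M*n + K by least squares; return (M, K)."""
    ys = [math.log(v) for v in values]   # the LN(Data) column
    xs = list(range(len(values)))        # x = 0, 1, 2, ...
    N = len(xs)
    Sx, Sy = sum(xs), sum(ys)
    Sxx = sum(x * x for x in xs)
    Sxy = sum(x * y for x, y in zip(xs, ys))
    D = N * Sxx - Sx ** 2
    M = (N * Sxy - Sx * Sy) / D          # plays the role of SLOPE
    K = (Sxx * Sy - Sx * Sxy) / D        # plays the role of INTERCEPT
    return M, K

# Hypothetical index values, one per period.
data = [100.0, 104.0, 101.0, 110.0, 118.0, 115.0, 126.0]
M, K = fit_exponential(data)

# The "best fit" curve EXP(Mx + K), evaluated at each x.
fitted = [math.exp(M * n + K) for n in range(len(data))]

R = math.exp(M) - 1   # because M = log(1 + R)
C = math.exp(K)       # because K = log(C)
print(f"C = {C:.2f}, R per period = {R:.4%}")
```

Natural logs are used here to match LN and EXP. Any other base works as long as the same base is used to undo the logarithm.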

   

Note: our "Best Fit" minimized the mean squared error for the logarithm of the data.
It did NOT minimize the mean squared error for the original S&P 500 data
which is one reason for putting it inside "quotes"

That's because we're looking at best "straight-line" fits
... and the logarithm is close to a straight line, eh?

We could also try to mimic the S&P directly, with \(y = C(1+R)^n\), where n is the number of days (weeks? years?) and R is the gain per day (week? year?), and try to minimize \(E(C,R) = \sum \{y_n - C(1+R)^n\}^2\) by choice of C and R.

Good luck! (But check out Best Fit to stock prices.)
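For the curious, here is one rough way to attack that direct minimization numerically. It is only a sketch, not the method from the page mentioned above. For any fixed R the error is quadratic in C, so the best C has a simple formula, and the code just tries lots of R values on a grid. The function name `direct_fit`, the search range and the data are all assumptions.

```python
def direct_fit(values, r_lo=-0.5, r_hi=0.5, steps=10001):
    """Grid search for (E, C, R) minimizing sum (values[n] - C*(1+R)**n)**2."""
    best = None
    for i in range(steps):
        R = r_lo + (r_hi - r_lo) * i / (steps - 1)
        g = [(1.0 + R) ** n for n in range(len(values))]
        # For fixed R, setting dE/dC = 0 gives C = sum(y*g) / sum(g*g).
        C = sum(v * gn for v, gn in zip(values, g)) / sum(gn * gn for gn in g)
        E = sum((v - C * gn) ** 2 for v, gn in zip(values, g))
        if best is None or E < best[0]:
            best = (E, C, R)
    return best

# Hypothetical values, one per period (not real S&P data).
data = [100.0, 104.0, 101.0, 110.0, 118.0, 115.0, 126.0]
E, C, R = direct_fit(data)
print(f"C = {C:.2f}, R = {R:.4f}, squared error = {E:.2f}")
```

A grid search is crude: the answer is only as fine as the grid spacing, and the true R has to lie inside the chosen range.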


Oh, one more thingy:
The Standard Deviation (SD) of any set of numbers \(x_1, x_2, \ldots, x_N\) (not necessarily those considered above!) is given by:
\[ SD^2 = \frac{1}{N} \sum (x_n - A)^2 \]
where A is the average of the x's, namely \(A = \frac{1}{N} \sum x_n\).

Note: The Standard Deviation is ALWAYS positive (or, at least it's never negative).

Okay, to calculate SD, we

  1. Calculate the average of the N numbers \(x_1, x_2, \ldots\)
    That's \(A = \frac{1}{N} \{x_1 + x_2 + \cdots\}\)
  2. Calculate the deviations of the numbers \(x_1, x_2, \ldots\) from their average.
    That's \((x_1 - A), (x_2 - A), \ldots\)
  3. Square each of these deviations.
    That's \((x_1 - A)^2, (x_2 - A)^2, \ldots\)
  4. Calculate the average of these squares.
    That's \(SD^2\)!
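The four steps above translate almost word for word into Python. The function name `standard_deviation` and the sample numbers are made up for illustration. The numbers were picked so the average is 5 and SD squared is 4.

```python
import math

def standard_deviation(xs):
    N = len(xs)
    A = sum(xs) / N                          # step 1: the average
    deviations = [x - A for x in xs]         # step 2: deviations from the average
    squares = [d * d for d in deviations]    # step 3: square each deviation
    sd2 = sum(squares) / N                   # step 4: average of the squares = SD^2
    return math.sqrt(sd2)

print(standard_deviation([2, 4, 4, 4, 5, 5, 7, 9]))   # 2.0
```

Python's standard library function `statistics.pstdev` uses the same 1/N definition, if you'd rather not roll your own.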
Okay, here's your homework:
Show that \(SD^2 = \frac{1}{N} \sum x^2 - A^2\)
namely the difference between the
"average of the squares" and the "square of the average"!

If we write this out it looks like this:
\[ SD^2 = \frac{1}{N} \sum x^2 - \left\{ \frac{1}{N} \sum x \right\}^2 \]
which is just the denominator in the M & K equations, above, divided by \(N^2\)!!
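A numerical check is no substitute for doing the homework, but it's reassuring. This little sketch, with made-up numbers, computes SD squared from the definition and from the "average of the squares minus square of the average" shortcut. It then compares both with the denominator \(N \sum x^2 - (\sum x)^2\) from the M and K formulas, divided by \(N^2\).

```python
xs = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up numbers
N = len(xs)
Sx = sum(xs)                    # sum of x
Sxx = sum(x * x for x in xs)    # sum of x^2

A = Sx / N
sd2_definition = sum((x - A) ** 2 for x in xs) / N
sd2_shortcut = Sxx / N - A ** 2
denominator = N * Sxx - Sx ** 2   # denominator in the M and K formulas

print(sd2_definition, sd2_shortcut, denominator / N ** 2)   # 4.0 4.0 4.0
```

All three numbers agree, so the denominator equals \(N^2\) times \(SD^2\).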

Neat, eh what?


Of course, who's to say that minimizing the Mean Squared Error is really the "Best Fit"?

Suppose \(e_1, e_2, \ldots, e_N\) are the absolute values of the errors and we wish to minimize "something", by appropriate choice of M = SLOPE and K = INTERCEPT.

Normally we wish to minimize \(\sqrt{\frac{1}{N} \sum e_n^2}\).
Instead, let's minimize \(\frac{1}{N} \sum e_n\) (which is a sum of positive terms since each \(e_n\) is an absolute value).

Let's compare these two "error measures". Under what conditions will \(\frac{1}{N} \sum e_n < \sqrt{\frac{1}{N} \sum e_n^2}\) ?

This will be true if: \(\left\{ \frac{1}{N} \sum e_n \right\}^2 < \frac{1}{N} \sum e_n^2\)

And this will be true if: \(\frac{1}{N} \sum e_n^2 - \left\{ \frac{1}{N} \sum e_n \right\}^2 > 0\)

which we recognize as the square of a Standard Deviation, so the inequality is true if the Standard Deviation of the errors, \(e_n\), is positive.

But a Standard Deviation is NEVER negative, and it's zero only when all the \(e_n\) are equal !!

Conclusion? The MEAN of the absolute values of the errors is never larger than the ROOT MEAN SQUARE error measure (and it's strictly smaller unless every error has the same absolute value).
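To see the two error measures side by side, here's a small Python sketch. The function names `mean_abs` and `rms` and the error values are invented for illustration. The second list has errors that all share the same absolute value, which is the one case where the two measures coincide.

```python
import math

def mean_abs(errors):
    """Mean of the absolute values of the errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rms(errors):
    """Root mean square of the errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

errors = [1.0, -2.0, 0.5, 3.0, -1.5]   # made-up errors of varying size
print(mean_abs(errors), rms(errors))   # mean absolute value is the smaller one

same = [2.0, -2.0, 2.0]                # every error has absolute value 2
print(mean_abs(same), rms(same))       # 2.0 2.0
```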

