"Best Fit" to Data

the "Best Fit" to some (x_n, y_n) Data

If we assume that our investment grows according to:

A(N) = A(0) (1 + R)^N

where we start with $A(0) and, after N months (or days or years ...), we have $A(N), then it's easy to compute R, the monthly Rate of Return (or daily or yearly ...):

R = {A(N)/A(0)}^(1/N) - 1
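Here's a minimal Python sketch of that calculation. The function name endpoint_rate and the dollar values are made up for illustration.

```python
def endpoint_rate(a_start, a_end, periods):
    """Per-period return R such that a_start * (1 + R)**periods == a_end."""
    return (a_end / a_start) ** (1.0 / periods) - 1.0

# Hypothetical example: $1000 grows to $2000 over 72 months.
R = endpoint_rate(1000.0, 2000.0, 72)
print(f"monthly R = {R:.4%}")
```

Only the first and last values go in; anything in between is never looked at.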

This formula ignores everything that's happened between the beginning and the end (after N months). What we really want is the following:

Consider C(1+R)ⁿ as an approximation to A(n)
Then, taking logarithms, we want log{C(1+R)ⁿ} = log(C) + n log(1+R) to be an approximation to log(A(n)) for every n ... not just the first (n=0) and last (n=N).
Choose the numbers C and R so that this is the best approximation, meaning that the errors are minimized (in some optimal way).

For convenience we let K = log(C) and M = log(1+R) and consider the straight line nM + K which, for each value of n, is supposed to approximate the numbers
log(A(0)), log(A(1)), log(A(2)), ..., log(A(N))

For sanitary reasons, we let these numbers be called
y₀, y₁, y₂, ..., y₁₀₀ and we also assume N = 100.

So far, so good.
Here are the errors in our approximation

e₀={y₀ - K}, e₁={y₁ - (M+K)}, e₂={y₂ - (2M+K)}, ..., e₁₀₀={y₁₀₀ - (100M+K)}

and the sum of the squares of these errors, Σe_n², is:

{y₀ - K}²+{y₁ - (M+K)}²+ {y₂ - (2M+K)}²+ ...+{y₁₀₀ - (100M+K)}²

Remember, we know the numbers y₀, y₁, y₂, etc. (they're the logarithms of the dollar values of our investment, after 0, 1, 2, 3, etc. months). What we want is to choose just two numbers, M and K, so the sum of the squares of these errors is as small as possible (which, by the way, defines what we mean by the optimal or "Best" ... but you may have another definition).

We'll call this error E(M,K), so:

E(M,K) = Σ {y_n - (nM+K)}² where the sum is from n = 0 to n = N (= 100, say)

and minimize it like so (careful ... some Calculus here):

∂E/∂M = 0 or Σ n{y_n - (nM+K)} = 0 or MΣn² + KΣn = Σny_n

∂E/∂K = 0 or Σ {y_n - (nM+K)} = 0 or MΣn + KΣ1 = Σy_n

The solution is:

M = ?? K = ??

... left as an exercise ...*
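If you'd like to check your answer to the exercise numerically, here's a minimal Python sketch. It builds the sums in the two equations above and solves the 2-by-2 system by Cramer's rule. The function name fit_growth and the sample values are invented for illustration, and natural logs are assumed.

```python
import math

def fit_growth(values):
    """Least-squares line n*M + K through y_n = log(values[n]), n = 0..N.

    Returns (C, R) with C = exp(K) and R = exp(M) - 1, using natural logs.
    """
    ys = [math.log(v) for v in values]
    count = len(ys)                               # Σ1 (that's N + 1 terms)
    s_n = sum(range(count))                       # Σn
    s_nn = sum(n * n for n in range(count))       # Σn²
    s_y = sum(ys)                                 # Σy_n
    s_ny = sum(n * y for n, y in enumerate(ys))   # Σn·y_n
    det = s_nn * count - s_n * s_n                # determinant of the system
    M = (s_ny * count - s_n * s_y) / det
    K = (s_nn * s_y - s_n * s_ny) / det
    return math.exp(K), math.exp(M) - 1.0

# Made-up data: exactly 1% growth per period starting from $100, so the
# fit should hand back C close to 100 and R close to 0.01.
values = [100.0 * 1.01 ** n for n in range(25)]
C, R = fit_growth(values)
print(f"C = {C:.2f}, R = {R:.4%}")
```

With real (noisy) data the fitted C generally won't equal the first value A(0), since every point gets a say.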

Okay, time for a picture.
Here's the logarithm of the TSE 300 over some 14-year period
and the straight line Mn + K
(with M and K as per formula, above, calculated from the logarithms of the TSE)

You can use log₁₀ or log_e or log_π or whatever
and the horizontal axis could be labelled 0, 1, 2, 3, ... 14 and that's n.

Now plot exp(Mn+K) versus A(n)
(A(n)? Them's the actual TSE values we want to approximate), and get:

The green curve is the exponential which simply goes from the first TSE value to the last.
The magenta curve is our approximation which incorporates all values of the TSE over this period (cuz M and K are obtained from all those nice y_n = log(A(n))).
Nice, eh?

*
In general, if we want the "best" straight line fit to points (x₁,y₁), (x₂,y₂), (x₃,y₃), ..., (x_N,y_N)
like so:

we start with a line y = Mx + K and minimize the error
E(M,K) = Σ {y_n - (Mx_n+K)}²
and get equations similar to those above, namely:

MΣx_n² + KΣx_n = Σx_ny_n
MΣx_n + KΣ1 = Σy_n

where the sum is from n = 1 to n = N

The solution is below, where we drop the subscripts, writing x in place of x_n, etc. cuz it looks neater
... but you understand x really means x_n ...

M = { N Σxy - Σx Σy } / { N Σx² - ( Σx )² }

K = { Σx² Σy - Σx Σxy } / { N Σx² - ( Σx )² }

Mamma mia!
Uh ... did I mention that Σ1 = 1+1+1+...+1 = N ?
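For those without a spreadsheet handy, the two formulas for M and K translate almost word for word into Python. This is just a sketch; the function name best_line and the three sample points are made up.

```python
def best_line(xs, ys):
    """Slope M and intercept K of the least-squares line y = M*x + K."""
    n = len(xs)                                  # Σ1 = N
    sx = sum(xs)                                 # Σx
    sy = sum(ys)                                 # Σy
    sxy = sum(x * y for x, y in zip(xs, ys))     # Σxy
    sxx = sum(x * x for x in xs)                 # Σx²
    denom = n * sxx - sx * sx                    # N Σx² - (Σx)²
    if denom == 0:
        raise ValueError("all x values are equal, so the slope is undefined")
    M = (n * sxy - sx * sy) / denom
    K = (sxx * sy - sx * sxy) / denom
    return M, K

# Three made-up points that do not lie on one line.
# Here M = 0.5 and K = 1/6.
M, K = best_line([0.0, 1.0, 2.0], [0.0, 1.0, 1.0])
print(M, K)
```

The guard on the denominator covers the one case where the formulas break down: every x being the same number.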

Of course, if you have MS Excel, the calculation of M=SLOPE and K=INTERCEPT is easy. Just put the logarithms of the Data, LN(Data), in column A and the x's in column B and use the Excel commands:

=SLOPE(A1:A100,B1:B100)
=INTERCEPT(A1:A100,B1:B100)

This'll give a line: y = Mx+K.
The "best fit" to the Data is then EXP(y) = EXP(Mx+K) vs x ... for example:

Note: our "Best Fit" minimized the mean squared error for the logarithm of the data.
It did NOT minimize the mean squared error for the original S&P 500 data
which is one reason for putting it inside "quotes"

That's because we're looking at best "straight-line" fits
... and the logarithm is close to a straight line, eh?

We could also try to mimic the S&P directly, with y = C (1+R)ⁿ, where n is the number of days (weeks? years?) and R is the gain per day (week? year?), and try to minimize E(C,R) = Σ {A(n) - C (1+R)ⁿ}² by choice of C and R, where A(n) is the actual value after n days.

Good luck! (But check out Best Fit to stock prices.)
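One possible way to attack that direct fit, sketched in Python under some loud assumptions. For a fixed growth factor g = 1+R, the best C has a closed form (set the derivative with respect to C to zero), so only g needs a search. The sketch uses a golden-section search, which assumes the error has a single minimum between g_lo and g_hi; that holds for exactly exponential data but is not guaranteed for real prices. The names direct_fit, g_lo and g_hi are invented here.

```python
import math

def direct_fit(values, g_lo=0.5, g_hi=1.5, iterations=200):
    """Minimize Σ {values[n] - C*g**n}² over C and g = 1 + R.

    Assumes the error has a single minimum for g in [g_lo, g_hi].
    Returns (C, R).
    """
    def best_c(g):
        # For fixed g the optimal C is Σ A(n) gⁿ / Σ g²ⁿ.
        powers = [g ** n for n in range(len(values))]
        return sum(a * p for a, p in zip(values, powers)) / sum(p * p for p in powers)

    def error(g):
        c = best_c(g)
        return sum((a - c * g ** n) ** 2 for n, a in enumerate(values))

    ratio = (math.sqrt(5.0) - 1.0) / 2.0     # golden ratio conjugate, about 0.618
    lo, hi = g_lo, g_hi
    for _ in range(iterations):
        left = hi - ratio * (hi - lo)
        right = lo + ratio * (hi - lo)
        if error(left) < error(right):
            hi = right                       # minimum lies in [lo, right]
        else:
            lo = left                        # minimum lies in [left, hi]
    g = 0.5 * (lo + hi)
    return best_c(g), g - 1.0

# Made-up data: exactly 2% growth per period starting from 50.
values = [50.0 * 1.02 ** n for n in range(31)]
C, R = direct_fit(values)
print(f"C = {C:.4f}, R = {R:.4%}")
```

On noisy data the C and R found this way will generally differ from the ones obtained by fitting a straight line to the logarithms, since the two methods minimize different errors.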

Oh, one more thingy:
The Standard Deviation (SD) of any set of numbers x₁, x₂, ..., x_N (not necessarily those considered above!) is given by:
SD² = (1/N) Σ (x_n - A)² where A is the average of the x's, namely A = (1/N) Σ x_n.

Note: The Standard Deviation is ALWAYS positive (or, at least it's never negative).

Okay, to calculate SD, we

Calculate the average of the N numbers x₁, x₂, ...
That's A = (1/N) {x₁ + x₂ + ...}
Calculate the deviations of the numbers x₁, x₂, ... from their average.
That's (x₁-A), (x₂-A), ...
Square each of these deviations.
That's (x₁-A)², (x₂-A)², ...
Calculate the average of these squares.
That's SD²!
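The same four steps as a small Python sketch. The function name and the sample numbers are made up; note that this divides by N, as in the formula above, not by N-1.

```python
import math

def standard_deviation(xs):
    """SD of the numbers xs, dividing by N, following the four steps."""
    n = len(xs)
    average = sum(xs) / n                        # step 1: the average A
    deviations = [x - average for x in xs]       # step 2: deviations from A
    squares = [d * d for d in deviations]        # step 3: square each one
    sd_squared = sum(squares) / n                # step 4: average the squares
    return math.sqrt(sd_squared)

# Made-up numbers whose average is 5 and whose SD is 2.
sd = standard_deviation([2, 4, 4, 4, 5, 5, 7, 9])
print(sd)
```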

Okay, here's your homework:
Show that SD² = (1/N) Σx² - A²
namely the difference between the
"average of the squares" and the "square of the average"!

If we write this out it looks like this:
SD² = (1/N)Σx² - {(1/N) Σx }²
which is just the denominator in the M & K equations, above, divided by N²!!

Neat, eh what?
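In case the homework is giving you trouble, here's one way to see it, written in LaTeX. The only facts used are the definition of SD² and A = (1/N) Σ x_n; all sums run from n = 1 to N.

```latex
\begin{aligned}
SD^2 &= \frac{1}{N}\sum_{n=1}^{N}(x_n - A)^2 \\
     &= \frac{1}{N}\sum x_n^2 \;-\; \frac{2A}{N}\sum x_n \;+\; \frac{1}{N}\,N A^2 \\
     &= \frac{1}{N}\sum x_n^2 \;-\; 2A^2 \;+\; A^2 \\
     &= \frac{1}{N}\sum x_n^2 \;-\; A^2
\end{aligned}
```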

Of course, who's to say that minimizing the Mean Squared Error is really the "Best Fit"?

Suppose e₁, e₂, ... e_N are the absolute values of the errors and we wish to minimize "something", by appropriate choice of M = SLOPE and K = INTERCEPT.

Normally we wish to minimize SQRT{(1/N)Σ e_n²}.
Instead, let's minimize (1/N)Σ e_n (which is a sum of positive terms since each e_n is an absolute value).

Let's compare these two "error measures". Under what conditions will (1/N)Σ e_n < SQRT{(1/N)Σ e_n²} ?

This will be true if: {(1/N)Σ e_n}² < (1/N)Σ e_n²

And this will be true if: (1/N)Σe_n² - {(1/N)Σ e_n}² > 0

which we recognize as the square of a Standard Deviation, so the inequality is true if the Standard Deviation of the errors, e_n, is positive.

But a Standard Deviation is never negative, and it's zero only when all the e_n are equal !!

Conclusion? The MEAN of the absolute values of the errors is never larger than the ROOT MEAN SQUARE error measure, and it's strictly smaller unless every error has the same absolute value.
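A quick numerical check of that comparison, as a Python sketch with made-up errors. The function names are invented for illustration.

```python
import math

def mean_abs_error(errors):
    """(1/N) Σ |e_n|"""
    return sum(abs(e) for e in errors) / len(errors)

def rms_error(errors):
    """SQRT{(1/N) Σ e_n²}"""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Made-up errors of different sizes: the mean absolute error is smaller.
errors = [0.5, -1.0, 2.0, -0.25]
print(mean_abs_error(errors), rms_error(errors))

# Errors that all have the same size: the two measures coincide.
same_size = [1.0, -1.0, 1.0]
print(mean_abs_error(same_size), rms_error(same_size))
```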

