Least squares

# Linearization

This chapter explains how to find the least squares minimum by linearizing the problem.

## The problem with the minimization of \( \mathcal{S} \)

As we saw in the previous page, our problem consists of finding \( p_{min} \) such that:

##### Model \( p_{min} \) that best fits the observations

\[ \mathcal{S}(p_{min}) = \mathcal{S}_{min} = \min_p \sum_{ i=1 }^{ n } \left( y_i - f(p)_i \right)^2 \]

How do we do this?

In practice, finding the minimum of \( \mathcal{S} \) is not straightforward. There is no simple, analytical expression for \( \mathcal{S} \), nor even for \( f \). So how could we possibly find the minimum if we can't even express the function?

In the case of orbit determination, calculating \( f(p) \) is a complex process involving orbit propagation and numerical integration. We can get a value of \( \mathcal{S} \) for each parameter \(p\) we provide as an input; it only costs computing time. But unless we run thousands and thousands of orbit calculations, we will never have a continuous sampling of \( \mathcal{S} \) values over an extensive domain of \( p \) values in which to find the minimum.

So what can we do?

## Least squares minimization in practice

### What do we do in practice?

One solution is to tweak the problem into a similar problem which we *know* how to solve. For example, we could try to approximate \( \mathcal{S} \) by another function, which we know *has* a minimum, which we know *how* to determine. This minimum would not be the solution of the initial problem, but it could be *close enough* to bring us satisfaction. Maybe.

So our next question is: how can we approximate \( \mathcal{S} \)?

This one is easy. We already hear the crowd scream: linearization.

### Linearization of \( f \)

The easiest solution is to linearize the problem. If \( p_0 \) is the initial parameter (the initial model), then we could linearize \( f \) around \( p_0 \). Let's call \( \ell \) the linear approximation of \( f \) around \( p_0 \). We would then have an approximation of \(f\) and subsequently an approximation of \(\mathcal{S}\). Let's call \(Q\) the approximation of \(\mathcal{S}\) (you will understand why \(Q\) in a minute).

##### Linearization of f and approximation of \( \mathcal{S} \)

\[ f(p) \underset{ p_0 }{ \sim } \ell(p) \]

\[ \mathcal{S}(p) \underset{ p_0 }{ \sim } Q(p) = \sum_{ i=1 }^{ n } \left( y_i - \ell(p)_i \right)^2 \]

### Advantages

We can already see the advantages of such an approximation. If we linearize \( f \), the approximation of \( \mathcal{S} \) will be a sum of squares of linear terms, which is something quadratic. A quadratic function (\(Q\)) is simple and well-known in mathematical terms.

- If the parameter has one dimension, this is a parabola.
- If the parameter has two dimensions, this is a paraboloid.
- If the parameter has \( k \) dimensions, this is a "hyper-paraboloid".

This becomes very interesting, because we know very well where the minimum of a parabola or a paraboloid is. This minimum exists, is unique and is very simple to determine analytically. So our new problem *has* a solution.

### Is this approximation valid?

We can make the assumption *at our own risk* that this is reasonably valid, *as long as we stay close to \( p_0 \)*. It is very important to keep this assumption in the back of our minds. Meanwhile, for lack of a better option, we'll do it *anyway*.

Let's now look at the expressions of \( \ell \) and \( Q \).

## Linearization

### Linear approximation of \( f \)

We can express \( \ell \) as a first-order Taylor expansion of \( f \) around \( p_0 \).

##### Linear approximation of f around \(p_0\)

\[ \ell(p) = f(p_0) + \frac{ \partial f }{ \partial p}(p_0) (p - p_0) \]
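As a quick numerical illustration (a sketch assuming NumPy, with a made-up scalar-parameter \( f \) standing in for the real orbit-propagation function), we can check that the first-order expansion tracks \( f \) near \( p_0 \) but degrades far from it:

```python
import numpy as np

# Hypothetical nonlinear "measurement" function: one parameter, 3 observations.
def f(p):
    return np.array([np.sin(p), p**2, np.exp(-p)])

def jacobian(p):
    # Partial derivatives of each component of f with respect to p.
    return np.array([np.cos(p), 2 * p, -np.exp(-p)])

p0 = 1.0

def ell(p):
    # First-order Taylor expansion of f around p0.
    return f(p0) + jacobian(p0) * (p - p0)

# Close to p0 the linear model tracks f well; far away it does not.
near, far = 1.01, 3.0
err_near = np.linalg.norm(f(near) - ell(near))
err_far = np.linalg.norm(f(far) - ell(far))
```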

### Quadratic approximation of \( \mathcal{S} \)

If we replace \( f \) by \( \ell \), we obtain the approximation of \( \mathcal{S} \) around \( p_0 \).

\[ \mathcal{S}(p) \underset{ p_0 }{ \sim } Q(p) = \sum_{ i=1 }^{ n } { \left( y_i - \ell(p)_i \right) }^2 \]

Substituting the Taylor expansion and its partial derivatives:

\[ Q(p) = \sum_{ i=1 }^{ n } { \left( y_i - \left[ f(p_0) + \frac{ \partial f }{ \partial p}(p_0) (p - p_0) \right]_i \right) }^2 \]

Or also:

##### Quadratic approximation of \( \mathcal{S} \) around \(p_0\)

\[ Q(p) = \sum_{ i=1 }^{ n } { \left[ \left( y - f(p_0) \right) - \frac{ \partial f }{ \partial p}(p_0) (p - p_0) \right]_i }^2 \]

Note that the vector of initial residuals \( b = y - f(p_0) \) appears in the expression.

### Finding the minimum of \(Q\)

We may not be able to find the minimum of \( \mathcal{S} \), but we can find the minimum of \( Q \) instead. Let's do it.

Let's just invert the sign of the term inside the square to anticipate the matrix notation in the next step:

\[ Q(p) = \sum_{ i=1 }^{ n } { \left[ \frac{ \partial f }{ \partial p}(p_0) (p - p_0) - ( y - f(p_0) ) \right]_i }^2 \]

## Matrix notation

All this being linear, we can introduce matrices to simplify the notation. This will help streamline our calculations, but we should emphasize that it is *strictly* equivalent to developing all the sums and products term by term.

If we call \( \mathcal{B}_m \) and \( \mathcal{B}_n \) the canonical bases of \( \mathbb{R}^m \) and \( \mathbb{R}^n \), we can define the following matrices and vectors:

- \( A = mat_{\mathcal{B}_n,\mathcal{B}_m}\left(\frac{ \partial f }{ \partial p }(p_0)\right)\), the matrix of the partial derivatives of \( f \) with respect to the parameters at \( p_0 \): \( A_{ij} = \frac{ \partial f_i }{ \partial p_j}(p_0) \). \( A \) is called the "observation matrix". It has \( n \) rows (the number of measurements) and \( m \) columns (the number of parameters). In our example, \(A\) is a tall rectangular matrix with 3 million rows and 8281 columns.
- \( X = mat_{\mathcal{B}_m}(p - p_0) \). It is the vector of "corrections" to the parameters. It has \( m \) rows (the number of parameters).
- \( B = mat_{\mathcal{B}_n}( y - f(p_0))\). It is the vector of initial residuals. It has \( n \) rows (the number of measurements).

With those notations, the previous expression of \(Q\) can now advantageously be rewritten as:

##### Approximation of \( \mathcal{S} \) — Matrix notation (1)

\[ Q(p) = \sum_{ i=1 }^{ n } { \left( AX - B \right)_i }^2 \]

This is none other than the square of the Euclidean norm of the vector \( AX - B \), or equivalently its dot product with itself:

##### Approximation of \( \mathcal{S} \) — Matrix notation (2)

\[ Q(p) = || AX - B ||^2 = \,^t(AX - B)(AX-B) \]

If we develop this product term by term (4 terms), and introduce the following notations:

- \( N = \,^tAA \)
- \( S = \,^tAB \)
- \( \alpha = \,^tBB \)

We can rewrite the equation again as:

##### Approximation of \( \mathcal{S} \) — Matrix notation (3)

\[ Q(p) = \,^tXNX - 2\,^tSX + \alpha \]

This looks very nice. It is obviously a polynomial of degree 2. Let's add the following comments:

- \( N = \,^tAA \) is called the normal matrix. By construction, it is square, symmetric, positive, with \( m \) rows and \( m \) columns (the number of parameters).
- \( S = \,^tAB \) is a vector with \( m \) rows.
- \( \alpha = \,^tBB = || y - f(p_0) ||^2 = \sum_{ i=1 }^{ n } \left( y_i - f(p_0)_i \right)^2 = Q(p_0) = \mathcal{S}(p_0) \) is the sum of the squares of the residuals obtained for the initial parameter \( p_0 \) (when \( X = 0 \)). It is a positive number.

As an exercise, you can develop the matrix products and express the full sum \(Q(p)\) as a function of the individual terms \( (a_{ij}) \), \( (x_i) \) and \( (b_i) \).
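The development above can also be checked numerically. Here is a minimal sketch (assuming NumPy, with random matrices standing in for a real problem) verifying that \( \,^tXNX - 2\,^tSX + \alpha \) matches \( || AX - B ||^2 \):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 3  # hypothetical sizes: 8 measurements, 3 parameters
A = rng.standard_normal((n, m))  # observation matrix
B = rng.standard_normal(n)       # initial residuals y - f(p0)
X = rng.standard_normal(m)       # parameter correction p - p0

N = A.T @ A    # normal matrix
S = A.T @ B    # right-hand side vector
alpha = B @ B  # sum of squared initial residuals

q_norm = np.linalg.norm(A @ X - B) ** 2   # matrix notation (2)
q_poly = X @ N @ X - 2 * S @ X + alpha    # matrix notation (3)
```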

Now let's have a look at the minimum of \(Q\).

## Minimum of \( Q \)

Everybody will agree that, in dimension one, with \( x \in \mathbb{R} \) and \( n \) *strictly positive*, the expression \( q(x) = nx^2 - 2sx + \alpha \) has a unique minimum \( q_{min} = \alpha - s^2/n \), attained at \( x_{min} = s/n \).

##### Minimum of \(q\)

###### Dimension 1

\[ q = nx^2 - 2sx + \alpha \]

\[ x_{min} = \frac{s}{n} \]

\[ q_{min} = \alpha - \frac{s^2}{n} \]
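A quick numerical sanity check of the one-dimensional formulas (a sketch assuming NumPy; the coefficients are arbitrary):

```python
import numpy as np

# Arbitrary coefficients with n strictly positive.
n, s, alpha = 2.0, 3.0, 5.0

def q(x):
    return n * x**2 - 2 * s * x + alpha

x_min = s / n
q_min = alpha - s**2 / n

# Sampling around x_min confirms it is indeed the minimum of the parabola.
xs = np.linspace(x_min - 2, x_min + 2, 1001)
```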

Similarly, we can easily demonstrate that, in dimension \(m\), with \( X \in \mathbb{R}^m \) and \( N \) *positive definite*, the function \( \,^tXNX - 2\,^tSX + \alpha \) has a unique minimum \( Q_{min} \), attained at \( X_{min}\), with the following expressions:

##### Minimum of \(Q\)

###### Dimension m

\[Q = \,^tXNX - 2\,^tSX + \alpha \]

\[ X_{min} = N^{-1}S \]

\[ Q_{min} = \alpha - \,^tSN^{-1}S\]

This is simply the minimum of a paraboloid. And this is an elliptic paraboloid, not a hyperbolic paraboloid, because \(N\) is positive.
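In code, the minimum of \( Q \) is one linear solve away. A minimal sketch (assuming NumPy, with random data) that also cross-checks the result against NumPy's own least squares solver:

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, m = 10, 4  # hypothetical sizes
A = rng.standard_normal((n_obs, m))  # full column rank with probability 1
B = rng.standard_normal(n_obs)

N = A.T @ A
S = A.T @ B
alpha = B @ B

X_min = np.linalg.solve(N, S)  # X_min = N^{-1} S
Q_min = alpha - S @ X_min      # alpha - tS N^{-1} S

# Cross-check against NumPy's built-in least squares solver.
X_lstsq, *_ = np.linalg.lstsq(A, B, rcond=None)
```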

## Inversion of the normal matrix

Mathematically, the only step that remains in order to find the minimum of \( Q \) is to invert the normal matrix \( N \). How do we do this? \( N = \,^tAA \) is square, symmetric and positive. It is also definite (except in degenerate cases). There are several techniques to invert such a matrix.

- The most common is the Cholesky factorization.
- Another one is to diagonalize the matrix and invert its eigenvalues. This method requires more computing power, but it can prove very useful, as we shall see later in this course.
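Both techniques can be sketched with NumPy (this is an illustration on a small random matrix, not production code):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((10, 4))
B = rng.standard_normal(10)
N = A.T @ A  # symmetric positive definite (A has full column rank)
S = A.T @ B

# 1. Cholesky factorization: N = L tL, then solve the two triangular systems.
L = np.linalg.cholesky(N)
z = np.linalg.solve(L, S)         # solve L z = S
x_chol = np.linalg.solve(L.T, z)  # solve tL x = z

# 2. Eigendecomposition: N = V diag(w) tV, invert the eigenvalues.
w, V = np.linalg.eigh(N)
x_eig = V @ ((V.T @ S) / w)
```

In production, the triangular systems would be solved with dedicated forward/back substitution routines (e.g. `scipy.linalg.solve_triangular`); the generic `np.linalg.solve` is used here only to keep the sketch dependency-free.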

We now have \(N^{-1}\).

## Least squares minimum

Once we have inverted \(N\), we finally calculate \( X_{min} = N^{-1}S \) and easily obtain the solution of our problem. If we define \(P_{min} = mat_{\mathcal{B}_m}(p_{min}) \) and \(P_0 = mat_{\mathcal{B}_m}(p_0)\), we have:

##### Least squares minimum

\[ P_{min} = P_0 + N^{-1}S \]

Our goal is finally achieved. We have found \(p_{min}\).

Congratulations.
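As a sanity check, here is a sketch (assuming NumPy, with a hypothetical straight-line model) showing that when \( f \) is exactly linear, the single step \( P_{min} = P_0 + N^{-1}S \) lands exactly on the least squares minimum:

```python
import numpy as np

# Hypothetical linear model: y = a + b*t, with parameters p = (a, b).
t = np.linspace(0.0, 1.0, 15)
y = 2.0 + 0.5 * t + 0.01 * np.sin(7 * t)  # data with a small deterministic wiggle

def f(p):
    return p[0] + p[1] * t

p0 = np.zeros(2)                           # initial model
A = np.column_stack([np.ones_like(t), t])  # df/dp (constant, since f is linear)
B = y - f(p0)                              # initial residuals

N = A.T @ A
S = A.T @ B
p_min = p0 + np.linalg.solve(N, S)  # P_min = P_0 + N^{-1} S

# Because f is exactly linear, this single step reaches the true minimum of S.
check, *_ = np.linalg.lstsq(A, y, rcond=None)
```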

### A word of caution

At this point, we would like to add a small word of caution. Remember that \(p_{min}\) is only the minimum of \(Q\), not the minimum of \(\mathcal{S}\). In other words,

- \(Q(p_{min}) = Q_{min}\), but \(\mathcal{S}(p_{min}) \ne \mathcal{S}_{min}\).

This was expected from the start, but let's not forget it.

## Corrections

We call "correction" the difference \( \Delta p \) between the newly determined parameter (the "corrected" model) and the initial parameter (the initial "model").

Similarly there is also a \( \Delta Q \), which represents the improvement in the value of \( Q \) between \( Q(p_0) \) and \( Q_{min} \). This is, if you will, the "decrease in residuals" (more on this in the next section).

##### Corrections

\[ \Delta p = N^{-1}S \in \mathbb{R}^m \]

\[ \Delta Q = - \,^tSN^{-1}S \in \mathbb{R}^- \]

\( \Delta p \) can be seen as a move in the "horizontal" plane, i.e. \( \mathbb{R}^m \), the space of parameters, and \( \Delta Q\) can be seen as a move along the "vertical axis", i.e. \( \mathbb{R} \), the space of the sum of the squares of the residuals.

## Decrease in residuals

As we said above, \(p_{min}\) is the minimum of \(Q\), but it is only an approximation of the minimum of \(\mathcal{S}\). Therefore, if you want to know the exact gain in terms of residuals, you cannot rigorously use \( \Delta Q \); you have to recalculate \( \mathcal{S} \) at \(p_{min}\). You will then obtain the exact value \( \Delta \mathcal{S} = \mathcal{S}(p_{min}) - \mathcal{S}(p_0) \). In our case, recalculating \( \mathcal{S} \) at \(p_{min}\) means rerunning the orbit computation using the new gravity field model as an input.

##### Decrease in residuals

\[ \Delta \mathcal{S} = \mathcal{S}(p_{min}) - \mathcal{S}(p_0) \]

\(\Delta Q\) and \( \Delta \mathcal{S} \) may be close, but it is necessary to mention this difference for the sake of accuracy. The advantage of \(\Delta Q\) is that it is obtained immediately upon inversion of the normal matrix. \( \Delta \mathcal{S} \), on the other hand, requires another full round of calculations. But it is the exact number.
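To make the distinction concrete, here is a sketch (assuming NumPy, with a hypothetical one-parameter exponential model in place of a real orbit computation) comparing the predicted \( \Delta Q \) with the recomputed \( \Delta \mathcal{S} \):

```python
import numpy as np

# Hypothetical nonlinear model: y_i = exp(p * t_i), one parameter p.
t = np.linspace(0.0, 1.0, 20)
p_true = 1.3
y = np.exp(p_true * t)

def f(p):
    return np.exp(p * t)

def S_of(p):
    # Sum of the squares of the residuals (the "expensive" computation).
    return np.sum((y - f(p)) ** 2)

p0 = 1.0
A = (t * np.exp(p0 * t)).reshape(-1, 1)  # df/dp at p0, as an n x 1 matrix
B = y - f(p0)                            # initial residuals

N = A.T @ A
Svec = A.T @ B
dp = np.linalg.solve(N, Svec)  # correction Delta p = N^{-1} S
p_min = p0 + dp[0]

dQ = -(Svec @ np.linalg.solve(N, Svec))  # predicted decrease (exact for Q only)
dS = S_of(p_min) - S_of(p0)              # actual decrease (requires re-running f)
```

Both quantities are negative and close, but only \( \Delta \mathcal{S} \) is exact; \( \Delta Q \) is the linearized prediction.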

## Validity of the linear approximation

Before closing this chapter on linearization, we would like to add another word of caution. Let's remember the assumption we made at the beginning. If we wander too far away from \(p_0\), i.e. if we leave the zone where the linear approximation is valid (in other words, if the correction \( ||p_{min} - p_0|| \) is too large), then the minimum of \(Q\) may actually have nothing to do with the minimum of \(\mathcal{S}\).

It means that the solution we have just found may be completely off track. That would be embarrassing, after all this work. So how can we address this issue? There are several approaches:

- A good (and classic) one is to iterate the algorithm, which we will see immediately in the next chapter.
- Another one, when iterations are not possible or require too much computing power, is diagonalization and truncation of eigenvalues, which we shall see in the second part of this course.