This is a quick note showing how the optimal projection can be derived in terms of the inverse of the Gram matrix used in the note on the connection between sloppiness and projection.
Let's derive \(\min_{w\in \mathbb{R}^{n+1}}\left\lVert f-\sum_{i=0}^n w_i t^i \right\rVert^2\) using projections (the minimizer gives the best approximation of a function \(f\) by a polynomial of degree at most \(n\)).
\[\min_{w\in \mathbb{R}^{n+1}}\left\lVert f-\sum_{i=0}^n w_i t^i \right\rVert^2 = \min_{w\in \mathbb{R}^{n+1}}\left\langle f-\sum_{i=0}^n w_i t^i,\; f-\sum_{i=0}^n w_i t^i \right\rangle\]
by definition of the norm in terms of the inner product. Taking \(\left\langle g,h \right\rangle = \int_0^1 g(t)h(t)\,dt\), the objective expands as \[ \begin{aligned} \left\langle f-\sum_{i=0}^n w_i t^i,\; f-\sum_{i=0}^n w_i t^i \right\rangle &= \int_0^1\left(f-\sum_{i=0}^n w_i t^i\right)^2 dt \\ &= \int_0^1\left(f^2-2f\left(\sum_{i=0}^n w_i t^i\right) + \left(\sum_{i=0}^n w_i t^i\right)^2\right) dt \\ &= \int_0^1 f^2\, dt -\int_0^1 2f\left(\sum_{i=0}^n w_i t^i\right) dt + \int_0^1\left(\sum_{i=0}^n w_i t^i\right)^2 dt \end{aligned}\] Writing each integral as an inner product, with \(w = (w_0, w_1, \dots, w_n)^\intercal\), gives \[F(w) = \left\langle f,f \right\rangle-2w^\intercal \begin{bmatrix} \left\langle f,1 \right\rangle \\ \left\langle f,t\right\rangle\\ \left\langle f,t^2 \right\rangle \\ \vdots \\ \left\langle f,t^n \right\rangle\end{bmatrix} +w^\intercal \begin{bmatrix} \left\langle 1,1 \right\rangle & \left\langle t,1 \right\rangle & \cdots & \left\langle t^n,1 \right\rangle \\ \left\langle 1,t\right\rangle & \left\langle t,t\right\rangle & \cdots & \left\langle t^n,t\right\rangle \\ \vdots & \vdots & \ddots & \vdots \\ \left\langle 1,t^n \right\rangle & \left\langle t,t^n \right\rangle & \cdots &\left\langle t^n,t^n \right\rangle \end{bmatrix}w\] remembering that we are minimizing \(F\) with respect to \(w\). We can find the minimum by solving \(\nabla_w F=0\). The gradient of the first term is zero because it does not depend on \(w\). Let's expand the gradient of the second term: \[\begin{aligned} \nabla_w\left( -2w^\intercal \begin{bmatrix} \left\langle f,1 \right\rangle \\ \left\langle f,t\right\rangle\\ \left\langle f,t^2 \right\rangle \\ \vdots \\ \left\langle f,t^n \right\rangle\end{bmatrix} \right) &= -2\begin{bmatrix} \frac{\partial}{\partial w_0} \\ \frac{\partial}{\partial w_1} \\ \vdots \\ \frac{\partial}{\partial w_n} \end{bmatrix} \left( w_0\left\langle f,1 \right\rangle +w_1\left\langle f,t \right\rangle + w_2\left\langle f,t^2 \right\rangle + \cdots + w_n\left\langle f,t^n \right\rangle \right) \\ &= -2\begin{bmatrix} \left\langle f,1 \right\rangle \\ \left\langle f,t\right\rangle\\ \left\langle f,t^2 \right\rangle \\ \vdots \\ \left\langle f,t^n \right\rangle\end{bmatrix} \end{aligned}\]
The third term is a bit more complicated. Because the matrix is symmetric (\(\left\langle t^i,t^j \right\rangle = \left\langle t^j,t^i \right\rangle\)), the gradient of the quadratic form is twice the matrix times \(w\): \[\nabla_w\left(w^\intercal \begin{bmatrix} \left\langle 1,1 \right\rangle & \left\langle t,1 \right\rangle & \cdots & \left\langle t^n,1 \right\rangle \\ \left\langle 1,t\right\rangle & \left\langle t,t\right\rangle & \cdots & \left\langle t^n,t\right\rangle \\ \vdots & \vdots & \ddots & \vdots \\ \left\langle 1,t^n \right\rangle & \left\langle t,t^n \right\rangle & \cdots &\left\langle t^n,t^n \right\rangle \end{bmatrix}w \right)= 2\begin{bmatrix} \left\langle 1,1 \right\rangle & \left\langle t,1 \right\rangle & \cdots & \left\langle t^n,1 \right\rangle \\ \left\langle 1,t\right\rangle & \left\langle t,t\right\rangle & \cdots & \left\langle t^n,t\right\rangle \\ \vdots & \vdots & \ddots & \vdots \\ \left\langle 1,t^n \right\rangle & \left\langle t,t^n \right\rangle & \cdots &\left\langle t^n,t^n \right\rangle \end{bmatrix}w\] So at the minimizer \(w^{\star}\) we have \[ 0 = -2\begin{bmatrix} \left\langle f,1 \right\rangle \\ \left\langle f,t\right\rangle\\ \left\langle f,t^2 \right\rangle \\ \vdots \\ \left\langle f,t^n \right\rangle\end{bmatrix} + 2\begin{bmatrix} \left\langle 1,1 \right\rangle & \left\langle t,1 \right\rangle & \cdots & \left\langle t^n,1 \right\rangle \\ \left\langle 1,t\right\rangle & \left\langle t,t\right\rangle & \cdots & \left\langle t^n,t\right\rangle \\ \vdots & \vdots & \ddots & \vdots \\ \left\langle 1,t^n \right\rangle & \left\langle t,t^n \right\rangle & \cdots &\left\langle t^n,t^n \right\rangle \end{bmatrix}w^{\star}\] and thus, \[ \begin{bmatrix} \left\langle f,1 \right\rangle \\ \left\langle f,t\right\rangle\\ \left\langle f,t^2 \right\rangle \\ \vdots \\ \left\langle f,t^n \right\rangle\end{bmatrix} = \begin{bmatrix} \left\langle 1,1 \right\rangle & \left\langle t,1 \right\rangle & \cdots & \left\langle t^n,1 \right\rangle \\ \left\langle 1,t\right\rangle & \left\langle t,t\right\rangle & \cdots & \left\langle t^n,t\right\rangle \\ \vdots & \vdots & \ddots & \vdots \\ \left\langle 1,t^n \right\rangle & \left\langle t,t^n \right\rangle & \cdots &\left\langle t^n,t^n \right\rangle \end{bmatrix}w^{\star}\]
Finally, since the monomials \(1, t, \dots, t^n\) are linearly independent, the matrix is invertible and \[ w^{\star} = \begin{bmatrix} \left\langle 1,1 \right\rangle & \left\langle t,1 \right\rangle & \cdots & \left\langle t^n,1 \right\rangle \\ \left\langle 1,t\right\rangle & \left\langle t,t\right\rangle & \cdots & \left\langle t^n,t\right\rangle \\ \vdots & \vdots & \ddots & \vdots \\ \left\langle 1,t^n \right\rangle & \left\langle t,t^n \right\rangle & \cdots &\left\langle t^n,t^n \right\rangle \end{bmatrix}^{-1} \begin{bmatrix} \left\langle f,1 \right\rangle \\ \left\langle f,t\right\rangle\\ \left\langle f,t^2 \right\rangle \\ \vdots \\ \left\langle f,t^n \right\rangle\end{bmatrix}\] In general, the matrix \(G_{ij} = \langle b_i,b_j\rangle\) is called the Gram matrix, and it depends only on the basis functions \(b_i\). Thus, for any linearly independent basis \(b_0, \dots, b_n\), the “best approximation” of \(f\) in that basis is given by \[ w^{\star} = G^{-1}\begin{bmatrix} \langle f,b_0\rangle \\ \langle f,b_1\rangle \\ \langle f,b_2\rangle \\ \vdots \\ \langle f,b_n\rangle \end{bmatrix}\]
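As a sanity check on the formula, here is a minimal sketch that solves \(Gw^{\star} = \left[\langle f,t^i\rangle\right]_i\) for the monomial basis on \([0,1]\), where \(\langle t^i,t^j\rangle = 1/(i+j+1)\). It assumes \(f\) is itself a polynomial so that every inner product is an exact rational number. The function names (`gram_monomials`, `moments_of_polynomial`, `solve`, `best_polynomial_fit`) are made up for this illustration, and exact fractions are used because this particular Gram matrix (the Hilbert matrix) is badly conditioned in floating point.

```python
from fractions import Fraction


def gram_monomials(n):
    """Gram matrix G[i][j] = <t^i, t^j> = 1/(i+j+1) on [0, 1]."""
    return [[Fraction(1, i + j + 1) for j in range(n + 1)]
            for i in range(n + 1)]


def moments_of_polynomial(coeffs, n):
    """Right-hand side <f, t^i>, i = 0..n, for f(t) = sum_k coeffs[k] t^k.

    Assumption of this sketch: f is a polynomial, so each integral is exact.
    """
    return [sum(Fraction(c) * Fraction(1, k + i + 1)
                for k, c in enumerate(coeffs))
            for i in range(n + 1)]


def solve(G, b):
    """Solve G w = b by Gauss-Jordan elimination in exact arithmetic."""
    m = len(b)
    A = [row[:] + [rhs] for row, rhs in zip(G, b)]  # augmented matrix
    for col in range(m):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, m) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        # Scale the pivot row so the pivot equals 1.
        A[col] = [x / A[col][col] for x in A[col]]
        # Eliminate this column from every other row.
        for r in range(m):
            if r != col:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[r][m] for r in range(m)]


def best_polynomial_fit(coeffs, n):
    """Coefficients w* of the best degree-<=n approximation of f on [0, 1]."""
    return solve(gram_monomials(n), moments_of_polynomial(coeffs, n))


# Best linear approximation of f(t) = t^2 on [0, 1].
w_star = best_polynomial_fit([0, 0, 1], 1)
print(w_star)  # [Fraction(-1, 6), Fraction(1, 1)]
```

The result says the best linear approximation of \(t^2\) on \([0,1]\) is \(t - \tfrac{1}{6}\), which can be checked by hand from \(G = \begin{bmatrix} 1 & 1/2 \\ 1/2 & 1/3 \end{bmatrix}\) and the right-hand side \(\begin{bmatrix} 1/3 \\ 1/4 \end{bmatrix}\). For a non-polynomial \(f\), the right-hand side would have to come from numerical or symbolic integration instead.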