# Orthogonalization

## Why is orthogonalization important?

### 1) Vector to span{vector}

Let's consider two vectors $$v, w \in \mathbb{R}^n$$.

The projection of $$v$$ onto $$\mathrm{span}\{w\}$$ is

$$
Proj\_w v=\dfrac{v \cdot w}{||w||^2}w=\dfrac{v\cdot w}{||w||}\dfrac{w}{||w||}
$$

$$\dfrac{v \cdot w}{||w||}$$ is the length, and $$\dfrac{w}{||w||}$$ is the (unit) direction.
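As a quick sketch of this formula in NumPy (the helper name `proj` is my own):

```python
import numpy as np

def proj(v, w):
    """Project v onto span{w}: (v . w / ||w||^2) * w."""
    return (v @ w) / (w @ w) * w

v = np.array([3.0, 4.0])
w = np.array([1.0, 0.0])
p = proj(v, w)   # component of v along w
r = v - p        # residual, orthogonal to w
```

The residual `r` is orthogonal to `w` by construction, which is exactly the property Gram-Schmidt exploits below.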

### Gram-Schmidt process

Every non-zero subspace of $$\mathbb{R}^n$$ has an orthonormal basis.

1. $$v\_1=w\_1$$
2. $$v\_2=w\_2-Proj\_{v\_1}w\_2$$
3. $$v\_3=w\_3-Proj\_{v\_1}w\_3-Proj\_{v\_2}w\_3$$

In general,

$$
v\_k=w\_k-\sum\_{i=1}^{k-1}Proj\_{v\_i}w\_k
$$

Normalizing each $$v\_i$$ gives $$\{v\_1,v\_2,\dots,v\_k\}$$, an orthonormal basis for a subspace $$W$$ of $$\mathbb{R}^n$$, and any $$w \in W$$ expands as

$$
w=(w\cdot v\_1)v\_1+(w \cdot v\_2)v\_2+\cdots+(w \cdot v\_k)v\_k
$$
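A minimal NumPy sketch of the process, normalizing each $$v\_i$$ as we go (the function name `gram_schmidt` is my own):

```python
import numpy as np

def gram_schmidt(W):
    """Orthonormalize the columns of W via Gram-Schmidt."""
    V = []
    for j in range(W.shape[1]):
        v = W[:, j].astype(float)
        for u in V:
            v = v - (v @ u) * u   # subtract Proj_u(w_j); u is already unit-norm
        V.append(v / np.linalg.norm(v))
    return np.column_stack(V)

W = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
Q = gram_schmidt(W)   # columns are orthonormal: Q.T @ Q = I
```

This sketch assumes the columns of `W` are linearly independent; otherwise some `v` would be (numerically) zero and the normalization would fail.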

### Projection Matrix

$$A\mathbf{x}=b$$ may have no solution. Instead we solve $$A\hat{\mathbf{x}}=p$$, where $$p$$ is the projection of $$b$$ onto $$\mathrm{col}(A)$$. Since the residual $$b-A\hat{\mathbf{x}}$$ is orthogonal to every column of $$A$$:

$$
A\hat{\mathbf{x}}=p \\
A^T(b-A\hat{\mathbf{x}})=0 \\
\hat{\mathbf{x}}=(A^TA)^{-1}A^Tb \\
p=A(A^TA)^{-1}A^Tb=Pb
$$

$$P=A(A^TA)^{-1}A^T$$ is symmetric ($$P^T=P$$) and idempotent ($$P^2=P$$).

When the columns of $$A$$ are orthonormal, $$A^TA=I$$, so

$$
P=A(A^TA)^{-1}A^T=AA^T=\begin{bmatrix}v\_1 & \cdots & v\_n\end{bmatrix}\begin{bmatrix}v\_1^T \\ \vdots \\ v\_n^T \end{bmatrix}= v\_1 v\_1^T + \cdots + v\_n v\_n^T
$$

$$
Pb=(b\cdot v\_1)v\_1 +\cdots +(b\cdot v\_n)v\_n
$$
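The derivation can be checked numerically; the matrix `A` and vector `b` below are an arbitrary small example:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])          # b is not in col(A)

P = A @ np.linalg.inv(A.T @ A) @ A.T   # projection matrix onto col(A)
p = P @ b                              # projection of b onto col(A)
x_hat = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations
```

Here `P` is symmetric and idempotent, `p = A @ x_hat`, and the residual `b - p` is orthogonal to both columns of `A`.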

![](https://1943863620-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MZyJM_SVjd9SJ3tuTbA%2F-MZyJcxkfrzxJIMVg9Jf%2F-M_0YTLlslxCxEwIloBe%2Fch3_3.jpg?alt=media\&token=dd8bad2a-2ba3-4cf1-8ae3-27f2e2ee6932)

🦊 Regression by Successive Orthogonalization 🦊

**Why do we use orthogonalization?**

$$
Y=X\beta +\varepsilon \\
\hat{\beta}=(X^TX)^{-1}X^T\mathbf{y} \\
\hat{\beta}\_j=\dfrac{\langle \mathbf{x}\_j,\mathbf{y}\rangle}{\langle \mathbf{x}\_j,\mathbf{x}\_j\rangle} \quad \text{when the columns of } X \text{ are orthogonal}
$$

When $$X$$ has orthogonal columns, each coefficient is simply the inner product of the $$j$$th input vector with $$\mathbf{y}$$, divided by that input's squared norm. This is computationally cheaper than inverting $$X^TX$$. In practice, however, real-world data are almost never orthogonal, so we must orthogonalize them ourselves.
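A small numerical check of this shortcut, using a random matrix with orthonormal columns (obtained here via `np.linalg.qr`; the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
Z, _ = np.linalg.qr(rng.normal(size=(50, 3)))   # Z has orthonormal columns
y = rng.normal(size=50)

# Per-column inner-product coefficients: <x_j, y> / <x_j, x_j>
beta_inner = np.array([(Z[:, j] @ y) / (Z[:, j] @ Z[:, j]) for j in range(3)])

# Full least-squares fit for comparison
beta_ls, *_ = np.linalg.lstsq(Z, y, rcond=None)
```

Because the columns are orthogonal, `beta_inner` and `beta_ls` coincide.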

1. Initialize $$\mathbf{z}\_0=\mathbf{x}\_0=\mathbf{1}$$
2. For $$j=1,2,\dots,p$$, regress $$\mathbf{x}\_j$$ on $$\mathbf{z}\_0,\mathbf{z}\_1,\dots,\mathbf{z}\_{j-1}$$:

   $$
   \mathbf{z}\_j=\mathbf{x}\_j-\sum\_{k=0}^{j-1}\hat{\gamma}\_{kj}\mathbf{z}\_k, \quad \hat{\gamma}\_{kj}=\dfrac{\langle \mathbf{z}\_k,\mathbf{x}\_j \rangle}{\langle \mathbf{z}\_k, \mathbf{z}\_k \rangle} \\
   \mathbf{z}\_j=\mathbf{x}\_j-\sum\_{k=0}^{j-1}\dfrac{\langle \mathbf{z}\_k,\mathbf{x}\_j \rangle}{\langle \mathbf{z}\_k, \mathbf{z}\_k \rangle} \mathbf{z}\_k=\mathbf{x}\_j-\sum\_{k=0}^{j-1} Proj\_{\mathbf{z}\_k}\mathbf{x}\_j
   $$
3. Regress $$\mathbf{y}$$ on the residual $$\mathbf{z}\_p$$ to give the estimate $$\hat{\beta}\_p$$.

The result of this algorithm is $$\hat{\beta}\_p=\dfrac{\langle \mathbf{z}\_p,\mathbf{y} \rangle}{\langle \mathbf{z}\_p,\mathbf{z}\_p \rangle}$$
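The three steps above can be sketched as follows; the function name is my own, and the check against a full least-squares fit illustrates the stated result:

```python
import numpy as np

def successive_orthogonalization(X, y):
    """Estimate the coefficient of the LAST column of X (a sketch, not an
    optimized implementation)."""
    n, p = X.shape
    Z = [np.ones(n)]                             # z_0 = x_0 = 1 (intercept)
    for j in range(p):
        z = X[:, j].astype(float)
        for zk in Z:                             # subtract Proj_{z_k} x_j
            z = z - (zk @ z) / (zk @ zk) * zk
        Z.append(z)
    zp = Z[-1]
    return (zp @ y) / (zp @ zp)                  # <z_p, y> / <z_p, z_p>

# The result matches the last coefficient of a full least-squares fit:
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = rng.normal(size=40)
beta_p = successive_orthogonalization(X, y)
beta_full, *_ = np.linalg.lstsq(np.column_stack([np.ones(40), X]), y, rcond=None)
```

Since the columns can be reordered, this gives every multiple-regression coefficient: each $$\hat{\beta}\_j$$ is the regression of $$\mathbf{y}$$ on $$\mathbf{x}\_j$$ after adjusting for all the other inputs.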

The residuals $$\mathbf{z}\_0,\dots,\mathbf{z}\_p$$ are mutually orthogonal and span the column space of $$X$$, so each input vector can be expressed as a linear combination of them (with $$\hat{\gamma}\_{jj}=1$$):

$$
\mathbf{x}\_j=\sum\_{k=0}^{j}\hat{\gamma}\_{kj}\mathbf{z}\_k
$$

We can extend this to matrix form.

$$
\begin{split}
\mathbf{X} &=\mathbf{Z}\mathbf{\Gamma} \\
&= \begin{bmatrix} z\_{11} & \cdots & z\_{1p} \\ \vdots & \ddots & \vdots \\ z\_{n1} & \cdots & z\_{np} \end{bmatrix}
\begin{bmatrix} \gamma\_{11} & \gamma\_{12} & \cdots & \gamma\_{1p} \\ 0 & \gamma\_{22} & \cdots & \gamma\_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \gamma\_{pp} \end{bmatrix} \\
&=\mathbf{Z}\mathbf{D}^{-1}\mathbf{D}\mathbf{\Gamma} \\
&=\mathbf{Q}\mathbf{R} \\
&= \begin{bmatrix} q\_1 & \cdots & q\_p \end{bmatrix}
\begin{bmatrix} (x\_1 \cdot q\_1) & (x\_2 \cdot q\_1) & \cdots & (x\_p \cdot q\_1) \\ 0 & (x\_2 \cdot q\_2) & \cdots & (x\_p \cdot q\_2) \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (x\_p \cdot q\_p) \end{bmatrix}
\end{split}
$$

Here $$\mathbf{D}$$ is the diagonal matrix with $$D\_{jj}=||\mathbf{z}\_j||$$, so $$\mathbf{Q}=\mathbf{Z}\mathbf{D}^{-1}$$ has orthonormal columns and $$\mathbf{R}=\mathbf{D}\mathbf{\Gamma}$$ is upper triangular.

$$\mathbf{Q}$$: direction, $$\mathbf{R}$$: length(scale)

$$
\hat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}=(\mathbf{R}^T\mathbf{Q}^T\mathbf{Q}\mathbf{R})^{-1}\mathbf{R}^T\mathbf{Q}^T\mathbf{y}=
\mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y} \\
\hat{\mathbf{y}}=\mathbf{Q}\mathbf{Q}^T\mathbf{y}
$$
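A sketch of this QR-based fit with NumPy's built-in `np.linalg.qr`, on arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)

Q, R = np.linalg.qr(X)              # reduced QR: Q (30x4) orthonormal, R (4x4) upper triangular
beta = np.linalg.solve(R, Q.T @ y)  # solve R beta = Q^T y (back-substitution in practice)
y_hat = Q @ (Q.T @ y)               # fitted values, without forming (X^T X)^{-1}

# Same answer as the normal-equations route:
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Solving the triangular system $$\mathbf{R}\hat{\beta}=\mathbf{Q}^T\mathbf{y}$$ avoids explicitly inverting $$\mathbf{X}^T\mathbf{X}$$, which is both faster and numerically better conditioned.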

> It is so computationally efficient!
