Orthogonalization is important.
1) Projecting a vector onto span{w}
Let's consider two vectors $v, w$ in $\mathbb{R}^n$.
$$Project \; v \; onto \; span\{w\}=Proj_w v=\dfrac{v \cdot w}{||w||^2}w=\dfrac{v\cdot w}{||w||}\dfrac{w}{||w||}$$

$\dfrac{v \cdot w}{||w||}$ : length, $\dfrac{w}{||w||}$ : direction
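A quick NumPy sanity check of this formula (the helper name `project_onto` is just for illustration):

```python
import numpy as np

def project_onto(v, w):
    """Projection of v onto span{w}: (v.w / ||w||^2) * w."""
    return (v @ w) / (w @ w) * w

v = np.array([3.0, 4.0])
w = np.array([1.0, 0.0])
p = project_onto(v, w)
print(p)                             # [3. 0.]
print(np.isclose((v - p) @ w, 0.0))  # True: the residual v - p is orthogonal to w
```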
Gram-Schmidt process
Every non-zero subspace of $\mathbb{R}^n$ has an orthonormal basis.
Given a basis $\{w_1, w_2, \dots, w_k\}$, build an orthogonal set:

$$v_1=w_1$$

$$v_2=w_2-Proj_{v_1}w_2$$

$$v_3=w_3-Proj_{v_1}w_3-Proj_{v_2}w_3$$

$$v_k=w_k-\sum^{k-1}_{i=1}Proj_{v_i}w_k$$

Normalizing each $v_i$ then gives an orthonormal basis.
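A minimal sketch of the process in NumPy, assuming the vectors $w_k$ are the columns of a matrix `W` (the function name `gram_schmidt` is hypothetical):

```python
import numpy as np

def gram_schmidt(W):
    """Orthonormalize the columns of W: subtract projections onto
    previously built vectors, then normalize."""
    V = []
    for k in range(W.shape[1]):
        v = W[:, k].astype(float)
        for u in V:                  # subtract the projection onto each earlier v_i
            v = v - (v @ u) * u      # u already has unit length
        norm = np.linalg.norm(v)
        if norm > 1e-12:             # drop (near-)linearly dependent inputs
            V.append(v / norm)
    return np.column_stack(V)

W = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
Q = gram_schmidt(W)
print(np.round(Q.T @ Q, 8))          # identity: the columns are orthonormal
```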
If $\{v_1,v_2,...,v_k\}$ is an orthonormal basis for a subspace $W$ of $\mathbb{R}^n$, then for any $w \in W$:

$$w=(w\cdot v_1)v_1+(w \cdot v_2)v_2+\cdots +(w \cdot v_k)v_k$$

Projection Matrix
When $b$ is not in $col(A)$, $A\mathbf{x}=b$ has no solution. Instead we solve $A\hat{\mathbf{x}}=p$, where $p$ is the projection of $b$ onto $col(A)$.
Since the residual $b-A\hat{\mathbf{x}}$ is orthogonal to every column of $A$:

$$A\hat{\mathbf{x}}=p$$

$$A^T(b-A\hat{\mathbf{x}})=0$$

$$\hat{\mathbf{x}}=(A^TA)^{-1}A^Tb$$

$$p=A(A^TA)^{-1}A^Tb=Pb$$

$P=A(A^TA)^{-1}A^T$ is a symmetric and idempotent matrix.
When the columns of $A$ are orthonormal, $A^TA=I$, so

$$P=A(A^TA)^{-1}A^T=AA^T=\begin{bmatrix}v_1 & \cdots & v_n\end{bmatrix} \begin{bmatrix}v_1^T\\ \vdots \\v_n^T \end{bmatrix}= v_1 v_1^T + \cdots + v_n v_n^T$$

$$Pb=(b\cdot v_1)v_1 +\cdots +(b\cdot v_n)v_n$$
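A short NumPy check of the projection matrix and its properties ($A$ and $b$ here are arbitrary toy values):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])                 # full column rank; col(A) is a plane in R^3
b = np.array([1.0, 0.0, 2.0])

P = A @ np.linalg.inv(A.T @ A) @ A.T       # P = A (A^T A)^{-1} A^T
print(np.allclose(P, P.T))                 # True: symmetric
print(np.allclose(P @ P, P))               # True: idempotent
p = P @ b                                  # projection of b onto col(A)
print(np.allclose(A.T @ (b - p), 0.0))     # True: residual is orthogonal to col(A)
```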
🦊 Regression by Successive Orthogonalization 🦊
Why do we use orthogonalization?
$$Y=X\beta +\varepsilon$$

$$\hat{\beta}=(X^TX)^{-1}X^Ty$$

$$\hat{\beta}_j=\dfrac{\langle \mathbf{x}_j,\mathbf{y}\rangle}{\langle \mathbf{x}_j,\mathbf{x}_j\rangle}, \quad when \; the \; columns \; of \; X \; are \; orthogonal$$

When the columns of $X$ are orthogonal, each coefficient is simply the inner product of the $j$-th input vector with $\mathbf{y}$, scaled by that input's squared norm. This is computationally cheap compared to inverting $X^TX$. In practice, however, we almost never have orthogonal data, so we need to make our data orthogonal first.
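For example, with a small made-up dataset whose columns are orthogonal (but not unit length), the per-column formula matches the full least-squares solution:

```python
import numpy as np

# Columns are orthogonal (x1 . x2 = 0) but not unit length
X = np.array([[1.0,  1.0],
              [1.0, -1.0],
              [1.0,  1.0],
              [1.0, -1.0]])
y = np.array([2.0, 0.0, 3.0, 1.0])

beta_full = np.linalg.inv(X.T @ X) @ X.T @ y     # general formula
beta_percol = np.array([(X[:, j] @ y) / (X[:, j] @ X[:, j])
                        for j in range(X.shape[1])])
print(beta_full, beta_percol)                    # identical: [1.5 1. ] [1.5 1. ]
```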
Regression by successive orthogonalization proceeds as follows.

Initialize $\mathbf{z}_0=\mathbf{x}_0=\mathbf{1}$

For $j=1,2,\dots,p$:

Regress $\mathbf{x}_j$ on $\mathbf{z}_0,\mathbf{z}_1,\dots,\mathbf{z}_{j-1}$:

$$\mathbf{z}_j=\mathbf{x}_j-\sum^{j-1}_{k=0}\hat{\gamma}_{kj}\mathbf{z}_k, \quad \hat{\gamma}_{lj}=\dfrac{\langle \mathbf{z}_l,\mathbf{x}_j \rangle}{\langle \mathbf{z}_l, \mathbf{z}_l \rangle}$$

$$\mathbf{z}_j=\mathbf{x}_j-\sum^{j-1}_{k=0}\dfrac{\langle \mathbf{z}_k,\mathbf{x}_j \rangle}{\langle \mathbf{z}_k, \mathbf{z}_k \rangle} \mathbf{z}_k=\mathbf{x}_j-\sum^{j-1}_{k=0} Proj_{\mathbf{z}_k}\mathbf{x}_j$$

Regress $\mathbf{y}$ on the residual $\mathbf{z}_p$ to give the estimate $\hat{\beta}_p$.
The result of this algorithm is $\hat{\beta}_p=\dfrac{\langle \mathbf{z}_p,\mathbf{y} \rangle}{\langle \mathbf{z}_p,\mathbf{z}_p \rangle}$.
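A sketch of this procedure in NumPy, under the assumption that `X` holds the inputs $\mathbf{x}_1,\dots,\mathbf{x}_p$ as columns without an intercept column (names like `successive_orthogonalization` are mine):

```python
import numpy as np

def successive_orthogonalization(X, y):
    """Return beta_p, the coefficient of the last input x_p,
    via regression by successive orthogonalization."""
    n, p = X.shape
    Z = [np.ones(n)]                         # z_0 = x_0 = 1
    for j in range(p):
        x_j = X[:, j].astype(float)
        z_j = x_j.copy()
        for z_k in Z:                        # regress x_j on z_0, ..., z_{j-1}
            gamma = (z_k @ x_j) / (z_k @ z_k)
            z_j = z_j - gamma * z_k          # subtract Proj_{z_k} x_j
        Z.append(z_j)
    z_p = Z[-1]
    return (z_p @ y) / (z_p @ z_p)           # beta_p = <z_p, y> / <z_p, z_p>

# beta_p agrees with the last coefficient of an ordinary least-squares fit
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)
beta_ls = np.linalg.lstsq(np.column_stack([np.ones(50), X]), y, rcond=None)[0]
print(successive_orthogonalization(X, y), beta_ls[-1])   # equal up to rounding
```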
The residuals $\mathbf{z}_0, \dots, \mathbf{z}_p$ are mutually orthogonal and span the same column space as the inputs, so each input vector can be written as a linear combination of the $\mathbf{z}_k$:

$$\mathbf{x}_j=\mathbf{z}_j+\sum^{j-1}_{k=0}\hat{\gamma}_{kj}\mathbf{z}_k$$

We can extend this to matrix form.
$$\begin{split}
\mathbf{X} &=\mathbf{Z}\mathbf{\Gamma}
\\ &= \begin{bmatrix} z_{11} & \cdots & z_{1p} \\ \vdots & \ddots & \vdots \\ z_{n1} & \cdots & z_{np} \end{bmatrix}
\begin{bmatrix} \gamma_{11} & \gamma_{12} &\cdots & \gamma_{1p} \\ 0 & \gamma_{22} & \cdots & \gamma_{2p} \\ 0 & 0 & \cdots & \gamma_{3p} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \gamma_{pp} \end{bmatrix}
\\ &=\mathbf{Z}\mathbf{D}^{-1}\mathbf{D}\mathbf{\Gamma} \\ &=\mathbf{Q}\mathbf{R} \\
&= \begin{bmatrix} q_1 & \cdots & q_p \end{bmatrix} \begin{bmatrix} (x_1 \cdot q_1) & (x_2 \cdot q_1) & \cdots & (x_p \cdot q_1) \\ 0 & (x_2 \cdot q_2) & \cdots & (x_p \cdot q_2) \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & (x_p \cdot q_p)
\end{bmatrix}
\end{split}$$

Here $\mathbf{D}$ is the diagonal matrix with $D_{jj}=||\mathbf{z}_j||$, so $\mathbf{Q}=\mathbf{Z}\mathbf{D}^{-1}$ has orthonormal columns and $\mathbf{R}=\mathbf{D}\mathbf{\Gamma}$ is upper triangular. $\mathbf{Q}$ : direction, $\mathbf{R}$ : length (scale)
$$\hat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}=(\mathbf{R}^T\mathbf{Q}^T\mathbf{Q}\mathbf{R})^{-1}\mathbf{R}^T\mathbf{Q}^T\mathbf{y}=
\mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y}$$

$$\hat{\mathbf{y}}=\mathbf{Q}\mathbf{Q}^T\mathbf{y}$$

This is computationally efficient: since $\mathbf{Q}^T\mathbf{Q}=I$ and $\mathbf{R}$ is upper triangular, $\hat{\beta}$ can be found by back-substitution instead of inverting $\mathbf{X}^T\mathbf{X}$.
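A hedged NumPy sketch of the QR route (`numpy.linalg.qr` returns a $\mathbf{Q}$ with orthonormal columns and an upper-triangular $\mathbf{R}$), compared against the normal-equations solution on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

Q, R = np.linalg.qr(X)                           # X = QR
beta_qr = np.linalg.solve(R, Q.T @ y)            # beta_hat = R^{-1} Q^T y
y_hat = Q @ (Q.T @ y)                            # fitted values: Q Q^T y

beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y   # (X^T X)^{-1} X^T y, for comparison
print(np.allclose(beta_qr, beta_normal))         # True
```

Because $\mathbf{R}$ is triangular, a triangular solver (e.g. `scipy.linalg.solve_triangular`) could replace the general `solve` to exploit the back-substitution structure.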