Orthogonalization is important.
1) Projecting a vector onto span{w}
Let's consider two vectors $v, w \in \mathbb{R}^n$.
$$
\text{Project } v \text{ onto } \mathrm{span}\{w\} = \mathrm{Proj}_w v = \dfrac{v \cdot w}{\|w\|^2}\,w = \dfrac{v \cdot w}{\|w\|}\,\dfrac{w}{\|w\|}
$$

$\dfrac{v \cdot w}{\|w\|}$ : length, $\qquad \dfrac{w}{\|w\|}$ : direction
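As a quick sanity check, here is a minimal NumPy sketch of this projection (the helper name `project_onto` is made up for this example):

```python
import numpy as np

def project_onto(v, w):
    """Projection of v onto span{w}: (v.w / ||w||^2) * w."""
    return (v @ w) / (w @ w) * w

v = np.array([3.0, 4.0])
w = np.array([2.0, 0.0])
print(project_onto(v, w))                    # [3. 0.]

# Same thing written as length * direction
length = (v @ w) / np.linalg.norm(w)         # v.w / ||w||
direction = w / np.linalg.norm(w)            # w / ||w||
print(np.allclose(length * direction, project_onto(v, w)))   # True
```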
Gram-Schmidt process
Every non-zero subspace of $\mathbb{R}^n$ has an orthonormal basis.
Given a basis $\{w_1, w_2, \dots, w_k\}$ of such a subspace, subtract from each $w_j$ its projections onto the orthogonal vectors already constructed:

$$
v_1 = w_1 \\
v_2 = w_2 - \mathrm{Proj}_{v_1} w_2 \\
v_3 = w_3 - \mathrm{Proj}_{v_1} w_3 - \mathrm{Proj}_{v_2} w_3 \\
\vdots \\
v_k = w_k - \sum^{k-1}_{i=1} \mathrm{Proj}_{v_i} w_k
$$

Normalizing each $v_i$ then gives an orthonormal basis.
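A minimal NumPy sketch of the process (the `gram_schmidt` name and the toy basis are just for illustration; the columns of `W` are assumed linearly independent):

```python
import numpy as np

def gram_schmidt(W):
    """Orthonormalize the columns of W column by column."""
    n, k = W.shape
    V = np.zeros((n, k))
    for j in range(k):
        v = W[:, j].astype(float)
        for i in range(j):
            v -= (v @ V[:, i]) * V[:, i]       # subtract Proj_{v_i} w_j (v_i has unit norm)
        V[:, j] = v / np.linalg.norm(v)        # normalize to get an orthonormal basis
    return V

W = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])   # basis of a 2-D subspace of R^3
V = gram_schmidt(W)
print(np.round(V.T @ V, 8))                   # identity matrix: columns are orthonormal
```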
$\{v_1, v_2, \dots, v_k\}$ : orthonormal basis for a subspace $W$ of $\mathbb{R}^n$. For any $w \in W$,

$$
w = (w \cdot v_1)v_1 + (w \cdot v_2)v_2 + \cdots + (w \cdot v_k)v_k
$$
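A quick numerical check of this expansion on a hypothetical orthonormal basis:

```python
import numpy as np

# Orthonormal basis {v_1, v_2} of a plane in R^3, stored as columns of V
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
w = V @ np.array([2.0, -1.0])        # some vector lying in the subspace W
coords = V.T @ w                     # (w . v_1, w . v_2)
print(np.allclose(V @ coords, w))    # True: w = (w . v_1) v_1 + (w . v_2) v_2
```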
Projection Matrix
When $b \notin \mathrm{col}(A)$, the system $A\mathbf{x} = b$ has no solution. Instead we solve

$$
A\hat{\mathbf{x}} = p, \quad \text{where } p \text{ is the projection of } b \text{ onto } \mathrm{col}(A)
$$

Since the residual $b - A\hat{\mathbf{x}}$ is orthogonal to $\mathrm{col}(A)$:

$$
A^T(b - A\hat{\mathbf{x}}) = 0 \\
\hat{\mathbf{x}} = (A^TA)^{-1}A^Tb \\
p = A(A^TA)^{-1}A^Tb = Pb
$$

$P = A(A^TA)^{-1}A^T$ is a symmetric and idempotent matrix.
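A small NumPy sketch of the normal equations and the projection matrix (the data are made up, and $(A^TA)^{-1}$ is formed explicitly only for illustration):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0, 2.0])        # b is not in col(A), so Ax = b has no exact solution

x_hat = np.linalg.solve(A.T @ A, A.T @ b)     # normal equations: A^T (b - A x_hat) = 0
P = A @ np.linalg.inv(A.T @ A) @ A.T          # projection matrix onto col(A)
p = P @ b                                     # projection of b onto col(A)

print(np.allclose(A @ x_hat, p))                      # True
print(np.allclose(P, P.T), np.allclose(P @ P, P))     # symmetric and idempotent
```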
When the columns of $A$ are orthonormal, $A^TA = I$, so

$$
P = A(A^TA)^{-1}A^T = AA^T = \begin{bmatrix} v_1 & \cdots & v_n \end{bmatrix}
\begin{bmatrix} v_1^T \\ \vdots \\ v_n^T \end{bmatrix}
= v_1 v_1^T + \cdots + v_n v_n^T
$$

$$
Pb = (b \cdot v_1)v_1 + \cdots + (b \cdot v_n)v_n
$$
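A short check of the orthonormal-column case (`np.linalg.qr` is used here only as a convenient way to obtain orthonormal columns):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
Q, _ = np.linalg.qr(A)               # columns of Q are an orthonormal basis of col(A)
b = np.array([1.0, 0.0, 2.0])

print(np.allclose(Q.T @ Q, np.eye(2)))                        # A^T A = I
P = Q @ Q.T                                                   # = v_1 v_1^T + v_2 v_2^T
Pb = sum((b @ Q[:, i]) * Q[:, i] for i in range(Q.shape[1]))  # (b . v_i) v_i summed
print(np.allclose(P @ b, Pb))                                 # True
```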
š¦ Regression by Successive Orthogonalization š¦
Why do we use orthogonalization?
$$
Y = X\beta + \varepsilon \\
\hat{\beta} = (X^TX)^{-1}X^Ty \\
\hat{\beta}_j = \dfrac{\langle \mathbf{x}_j, \mathbf{y} \rangle}{\langle \mathbf{x}_j, \mathbf{x}_j \rangle}, \quad \text{when the columns of } X \text{ are orthogonal}
$$

When the columns of $X$ are orthogonal, each coefficient is simply the inner product of the $j$-th input vector with $\mathbf{y}$, divided by that column's squared norm. This is computationally much cheaper than inverting $X^TX$. However, real-world data are essentially never orthogonal, so we have to orthogonalize them ourselves.
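A small simulated example of this shortcut (the data are hypothetical; orthonormal columns are obtained via QR just for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.normal(size=(100, 3)))   # design matrix with orthonormal columns
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=100)

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)                                  # full least squares
beta_ip = np.array([(X[:, j] @ y) / (X[:, j] @ X[:, j]) for j in range(3)])  # inner products
print(np.allclose(beta_ls, beta_ip))             # True: the two estimates coincide
```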
Initialize $\mathbf{z}_0 = \mathbf{x}_0 = \mathbf{1}$.

For $j = 1, 2, \dots, p$: regress $\mathbf{x}_j$ on $\mathbf{z}_0, \mathbf{z}_1, \dots, \mathbf{z}_{j-1}$ and keep the residual

$$
\mathbf{z}_j = \mathbf{x}_j - \sum^{j-1}_{k=0}\hat{\gamma}_{kj}\mathbf{z}_k, \quad \hat{\gamma}_{lj} = \dfrac{\langle \mathbf{z}_l, \mathbf{x}_j \rangle}{\langle \mathbf{z}_l, \mathbf{z}_l \rangle} \\
\mathbf{z}_j = \mathbf{x}_j - \sum^{j-1}_{k=0}\dfrac{\langle \mathbf{z}_k, \mathbf{x}_j \rangle}{\langle \mathbf{z}_k, \mathbf{z}_k \rangle}\mathbf{z}_k = \mathbf{x}_j - \sum^{j-1}_{k=0}\mathrm{Proj}_{\mathbf{z}_k}\mathbf{x}_j
$$

Regress $\mathbf{y}$ on the residual $\mathbf{z}_p$ to give the estimate $\hat{\beta}_p$.

The result of this algorithm is

$$
\hat{\beta}_p = \dfrac{\langle \mathbf{z}_p, \mathbf{y} \rangle}{\langle \mathbf{z}_p, \mathbf{z}_p \rangle}
$$
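A minimal NumPy sketch of this algorithm (the function name and the simulated data are made up for illustration):

```python
import numpy as np

def beta_p_by_orthogonalization(X, y):
    """Estimate the coefficient of the last column of X by successive orthogonalization."""
    n, p = X.shape
    Xa = np.column_stack([np.ones(n), X])       # prepend x_0 = 1 (intercept)
    Z = np.empty_like(Xa)
    Z[:, 0] = 1.0                               # z_0 = x_0 = 1
    for j in range(1, p + 1):
        z = Xa[:, j].copy()
        for k in range(j):                      # subtract Proj_{z_k} x_j
            z -= (Z[:, k] @ Xa[:, j]) / (Z[:, k] @ Z[:, k]) * Z[:, k]
        Z[:, j] = z
    zp = Z[:, p]
    return (zp @ y) / (zp @ zp)                 # regress y on the residual z_p

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=50)

beta_full = np.linalg.lstsq(np.column_stack([np.ones(50), X]), y, rcond=None)[0]
print(np.allclose(beta_p_by_orthogonalization(X, y), beta_full[-1]))   # True
```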
The residuals $\mathbf{z}_0, \dots, \mathbf{z}_p$ are orthogonal and span the column space of $\mathbf{X}$, so each input vector can be expressed as a linear combination of them:

$$
\mathbf{x}_j = \mathbf{z}_j + \sum^{j-1}_{k=0}\hat{\gamma}_{kj}\mathbf{z}_k
$$

We can extend this to matrix form.
$$
\begin{split}
\mathbf{X} &= \mathbf{Z}\boldsymbol{\Gamma} \\
&= \begin{bmatrix} z_{11} & \cdots & z_{1p} \\ \vdots & \ddots & \vdots \\ z_{n1} & \cdots & z_{np} \end{bmatrix}
\begin{bmatrix} \gamma_{11} & \gamma_{12} & \cdots & \gamma_{1p} \\ 0 & \gamma_{22} & \cdots & \gamma_{2p} \\ 0 & 0 & \cdots & \gamma_{3p} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \gamma_{pp} \end{bmatrix} \\
&= \mathbf{Z}\mathbf{D}^{-1}\mathbf{D}\boldsymbol{\Gamma} \\
&= \mathbf{Q}\mathbf{R} \\
&= \begin{bmatrix} q_1 & \cdots & q_p \end{bmatrix}
\begin{bmatrix} (x_1 \cdot q_1) & (x_2 \cdot q_1) & \cdots & (x_p \cdot q_1) \\ 0 & (x_2 \cdot q_2) & \cdots & (x_p \cdot q_2) \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (x_p \cdot q_p) \end{bmatrix}
\end{split}
$$

Here $\mathbf{D} = \mathrm{diag}(\|\mathbf{z}_1\|, \dots, \|\mathbf{z}_p\|)$, so $\mathbf{Q} = \mathbf{Z}\mathbf{D}^{-1}$ has orthonormal columns and $\mathbf{R} = \mathbf{D}\boldsymbol{\Gamma}$ is upper triangular. $\mathbf{Q}$ : direction, $\mathbf{R}$ : length (scale).
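A hedged NumPy check that QR is exactly the Gram-Schmidt construction up to scaling (`np.linalg.qr` may flip column signs, hence the sign fix; the data are random):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))

# Gram-Schmidt residuals Z, then Q = Z D^{-1} with D = diag(||z_j||)
Z = np.empty_like(X)
for j in range(3):
    z = X[:, j].copy()
    for k in range(j):
        z -= (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k]) * Z[:, k]
    Z[:, j] = z
Q_gs = Z / np.linalg.norm(Z, axis=0)

Q, R = np.linalg.qr(X)
signs = np.sign(np.diag(R))                           # fix LAPACK's sign convention
print(np.allclose(Q * signs, Q_gs))                   # True: Q is Z with columns normalized
print(np.allclose(signs[:, None] * R, Q_gs.T @ X))    # True: R_{jk} = x_k . q_j
```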
$$
\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = (\mathbf{R}^T\mathbf{Q}^T\mathbf{Q}\mathbf{R})^{-1}\mathbf{R}^T\mathbf{Q}^T\mathbf{y} = \mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y} \\
\hat{\mathbf{y}} = \mathbf{Q}\mathbf{Q}^T\mathbf{y}
$$

This is computationally efficient: since $\mathbf{R}$ is upper triangular, $\hat{\beta}$ can be obtained by back-substitution instead of inverting $\mathbf{X}^T\mathbf{X}$.
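A short sketch of solving least squares through QR (random toy data; `scipy.linalg.solve_triangular` does the back-substitution):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

Q, R = np.linalg.qr(X)
beta_qr = solve_triangular(R, Q.T @ y)          # beta_hat = R^{-1} Q^T y, no (X^T X)^{-1}
y_hat = Q @ (Q.T @ y)                           # fitted values: projection of y onto col(X)

beta_ne = np.linalg.solve(X.T @ X, X.T @ y)     # normal equations, for comparison
print(np.allclose(beta_qr, beta_ne))            # True
print(np.allclose(y_hat, X @ beta_ne))          # True
```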