Linear Regression
Regression with heterogeneous noise
1.1.a Log-likelihood function of independent but not identically distributed data
Here we are given the following:
\begin{aligned}
y = \textbf{x}^T\boldsymbol{\beta} + \varepsilon\end{aligned}
and each data point has a different noise variance:
\begin{aligned}
\varepsilon_n \sim N(0,\sigma_n^2)\end{aligned}
It follows that
\begin{aligned}
y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta} \sim N(0,\sigma_i^2)\end{aligned}
Writing the log-likelihood of the data:
\begin{split}
\ell (\boldsymbol{\beta}) & = \log (P(Y| X,\boldsymbol{\beta}))\\
& = \log\left(\prod_{i = 1 }^{N} \frac{1}{\sqrt{2\pi} \sigma_i} \exp\left( \frac{- (y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta} )^2}{2\sigma_i^2}
\right)
\right)\\
& = \sum_{i = 1}^{N}\left(\log\left(\frac{1}{\sqrt{2\pi} \sigma_i}
\right) - \frac{(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta} )^2}{2\sigma_i^2}
\right)\\
& = -\frac{N}{2}\log 2\pi - \sum_{i = 1}^{N}\log(\sigma_i) - \sum_{i = 1}^{N}\frac{(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta} )^2}{2\sigma_i^2}
\end{split}
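For concreteness, this log-likelihood is easy to evaluate numerically. A minimal sketch, assuming X, y, and the per-point standard deviations sigma are available as NumPy arrays (the names are illustrative):

```python
import numpy as np

def hetero_log_likelihood(beta, X, y, sigma):
    """Log-likelihood of y ~ N(X beta, diag(sigma^2)) with independent noise."""
    resid = y - X @ beta
    N = len(y)
    return (-0.5 * N * np.log(2.0 * np.pi)
            - np.sum(np.log(sigma))
            - np.sum(resid**2 / (2.0 * sigma**2)))
```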
1.1.b Maximum likelihood derivation for \boldsymbol{\beta}
Differentiating the above equation w.r.t. \beta_j we get:
\begin{split}
\frac{\partial\ell(\boldsymbol{\beta})}{\partial \beta_j} & = - \sum_{i = 1}^{N}\frac{2(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta})(-x_{ij})}{2\sigma_i^2}
\end{split}
Equating the above to zero, we get:
\begin{split}
\sum_{i = 1}^{N} \frac{(y_i -\textbf{x}_\textbf{i}^T\boldsymbol{\beta})\,x_{ij}}{\sigma_i^2} & = 0\\
\end{split}
In matrix form, the above equation can be written as:
\begin{split}
\begin{bmatrix}
x_{11} & x_{21} & . & .& . & x_{N1}
\end{bmatrix}
\begin{bmatrix}
\frac{y_1 - \textbf{x}_\textbf{1}^T\boldsymbol{\beta} }{\sigma_1^2} \\
\frac{y_2 - \textbf{x}_\textbf{2}^T\boldsymbol{\beta} }{\sigma_2^2} \\
.\\
.\\
\frac{y_N - \textbf{x}_\textbf{N}^T\boldsymbol{\beta} }{\sigma_N^2}
\end{bmatrix} = 0
\end{split}
Similarly, the other \beta_j's yield equations of the same form. Stacking all of
them into one matrix equation gives the following.
\begin{split}
\begin{bmatrix}
x_{11} & x_{21} & . & .& . & x_{N1}\\
x_{12} & x_{22} & . & .& . & x_{N2}\\
. & . & . & . & . & . \\
. & . & . & . & . & . \\
x_{1p} & x_{2p} & . & .& . & x_{Np}
\end{bmatrix}
\begin{bmatrix}
\frac{y_1 - \textbf{x}_\textbf{1}^T\boldsymbol{\beta} }{\sigma_1^2} \\
\frac{y_2 - \textbf{x}_\textbf{2}^T\boldsymbol{\beta} }{\sigma_2^2} \\
.\\
.\\
\frac{y_N - \textbf{x}_\textbf{N}^T\boldsymbol{\beta} }{\sigma_N^2}
\end{bmatrix} =
\begin{bmatrix}
0\\
0\\
.\\
.\\
0
\end{bmatrix}
\end{split}
Denote the left PxN matrix by A. Moving the Y terms to the R.H.S. we
get the following:
\begin{split}
A \begin{bmatrix}
\frac{\textbf{x}_\textbf{1}^T\boldsymbol{\beta} }{\sigma_1^2} \\
\frac{\textbf{x}_\textbf{2}^T\boldsymbol{\beta} }{\sigma_2^2} \\
.\\
.\\
\frac{\textbf{x}_\textbf{N}^T\boldsymbol{\beta} }{\sigma_N^2}
\end{bmatrix} = A \begin{bmatrix}
\frac{y_1}{\sigma_1^2}\\
\frac{y_2}{\sigma_2^2}\\
.\\
.\\
\frac{y_N}{\sigma_N^2}
\end{bmatrix}
\end{split}
The column vector on the L.H.S. can be decomposed further by expanding
each of its entries:
\begin{bmatrix}
\frac{\textbf{x}_\textbf{1}^T\boldsymbol{\beta} }{\sigma_1^2} \\
\frac{\textbf{x}_\textbf{2}^T\boldsymbol{\beta} }{\sigma_2^2} \\
.\\
.\\
\frac{\textbf{x}_\textbf{N}^T\boldsymbol{\beta} }{\sigma_N^2}
\end{bmatrix} = \begin{bmatrix}
\frac{x_{11}}{\sigma_1^2} & \frac{x_{12}}{\sigma_1^2} & . & . & \frac{x_{1p}}{\sigma_1^2}\\
\frac{x_{21}}{\sigma_2^2} & \frac{x_{22}}{\sigma_2^2} & . & . & \frac{x_{2p}}{\sigma_2^2}\\
. & . & . & . & . \\
\frac{x_{N1}}{\sigma_N^2} & \frac{x_{N2}}{\sigma_N^2} & . & . & \frac{x_{Np}}{\sigma_N^2}\\
\end{bmatrix} \begin{bmatrix}
\beta_1\\
\beta_2\\
.\\
.\\
\beta_p
\end{bmatrix}
Denote by B the NxP matrix on the R.H.S. of the above equation and by
\boldsymbol{Y_{\sigma}} the column vector with entries y_i/\sigma_i^2. Now we
can write eq. (1.3) as
\boldsymbol{AB}\boldsymbol{\beta} =\boldsymbol{AY_{\sigma}}
\boldsymbol{AB} will be a PxP matrix, and if it is invertible
then we can write the solution for \boldsymbol{\beta} as follows:
\begin{split}
\boldsymbol{\beta} = \boldsymbol{(AB)^{-1}}\boldsymbol{AY_{\sigma}}
\end{split}
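Equivalently, since AB = X^T W X and A Y_\sigma = X^T W y with W = diag(1/\sigma_i^2), this is the usual weighted least-squares estimate. A minimal NumPy sketch (array names are illustrative):

```python
import numpy as np

def weighted_ls(X, y, sigma):
    """Closed-form beta = (AB)^{-1} A Y_sigma, i.e. weighted least squares."""
    w = 1.0 / sigma**2                  # per-point weights 1/sigma_i^2
    XtWX = X.T @ (w[:, None] * X)       # the P x P matrix AB
    XtWy = X.T @ (w * y)                # the vector A Y_sigma
    return np.linalg.solve(XtWX, XtWy)
```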
Smooth coefficients
1.2.a
To encode the natural ordering we add an extra regularization term to the
residual sum of squares. Without the ordering penalty,
\boldsymbol{RSS(\beta)} looks like the following:
\begin{split}
\boldsymbol{RSS(\beta)} = \sum_{i =1 }^{N} \frac{(y_i - \boldsymbol{x_i^T\beta})^2}{2\sigma^2} + \frac{1}{2}\lambda||\boldsymbol{\beta}||^2
\end{split}
Adding the additional natural-ordering penalty:
\begin{split}
\boldsymbol{RSS(\beta)} = \sum_{i =1 }^{N} \frac{(y_i - \boldsymbol{x_i^T\beta})^2}{2\sigma^2} + \frac{1}{2}\lambda||\boldsymbol{\beta}||^2 + \frac{1}{2}\gamma\sum_{i=1}^{p-1} \left( \beta_i - \beta_{i+1}
\right)^2
\end{split}
1.2.b
Taking the first derivative and equating it to zero. Note: the derivatives
w.r.t. \beta_1 and \beta_p differ from those of the remaining \beta's.
\begin{split}
\frac{\partial \boldsymbol{RSS(\beta)} }{\partial \beta_1} & = \sum_{i = 1}^{N}\frac{2(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta})(-x_{i1})}{2\sigma^2} + \lambda \beta_1 + \gamma\left(\beta_1 - \beta_2
\right) = 0\\
\frac{\partial \boldsymbol{RSS(\beta)} }{\partial \beta_2} & = \sum_{i = 1}^{N}\frac{2(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta})(-x_{i2})}{2\sigma^2} + \lambda \beta_2 + \gamma\left(2\beta_2 - \beta_1 - \beta_3
\right) = 0 \\
\frac{\partial \boldsymbol{RSS(\beta)} }{\partial \beta_3} & = \sum_{i = 1}^{N}\frac{2(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta})(-x_{i3})}{2\sigma^2} + \lambda \beta_3 + \gamma\left(2\beta_3 - \beta_2 - \beta_4
\right) = 0 \\
... \\
...\\
\frac{\partial \boldsymbol{RSS(\beta)} }{\partial \beta_p} & = \sum_{i = 1}^{N}\frac{2(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta})(-x_{ip})}{2\sigma^2} + \lambda \beta_p + \gamma\left(\beta_p - \beta_{p-1}
\right) = 0
\end{split}
Multiplying through by \sigma^2 (which only rescales \lambda and \gamma, so we absorb it into them) and writing the above system of linear equations in matrix form:
\begin{split}
-A \left(Y - A^T\boldsymbol{\beta}\right) + \lambda\boldsymbol{\beta} + \gamma\begin{bmatrix}
\beta_1 - \beta_2\\
2\beta_2 - \beta_1 - \beta_3\\
.\\
.\\
\beta_p - \beta_{p-1}
\end{bmatrix} & = \begin{bmatrix}
0\\
0\\
.\\
.\\
0
\end{bmatrix}\\
-A \left(Y - A^T\boldsymbol{\beta}\right) + \lambda\boldsymbol{\beta} + \gamma \begin{bmatrix}
1 & -1 & 0 & 0 & . & . & 0 \\
-1 & 2 & -1 & 0 & . & . & 0 \\
0 & -1 & 2 & -1 & . & . & 0 \\
. & . & . & . & . & .& . \\
. & . & . & . & . & .& . \\
0 & 0 & 0 & 0 & . & -1 & 1 \\
\end{bmatrix}\boldsymbol{\beta} = \boldsymbol{0}
\end{split}
Denoting the difference-penalty matrix above by K, simplifying leads to the following:
\begin{split}
AA^T\boldsymbol{\beta} + \lambda\boldsymbol{\beta} + \gamma K \boldsymbol{\beta} & = AY\\
\left( AA^T + \lambda\boldsymbol{I} + \gamma K
\right)\boldsymbol{\beta} & = AY \\
\implies \boldsymbol{\beta} & = \left( AA^T + \lambda\boldsymbol{I} + \gamma K
\right)^ {-1} AY
\end{split}
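With A = X^T, we have AA^T = X^TX and AY = X^Ty, and the penalty matrix K is L^T L where L is the first-difference operator. A minimal sketch of the solution, with lam and gam as illustrative regularization strengths:

```python
import numpy as np

def smooth_ridge(X, y, lam, gam):
    """beta = (X^T X + lam I + gam K)^{-1} X^T y with a first-difference penalty."""
    p = X.shape[1]
    L = np.diff(np.eye(p), axis=0)      # (p-1) x p rows of the form [..., -1, 1, ...]
    K = L.T @ L                         # tridiagonal penalty matrix from the derivation
    lhs = X.T @ X + lam * np.eye(p) + gam * K
    return np.linalg.solve(lhs, X.T @ y)
```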
Linearly constrained linear regression
Here we are given the following:
y = \boldsymbol{x^T\beta} + \varepsilon
and the parameters satisfy an additional linear constraint:
A\boldsymbol{\beta} = \boldsymbol{b}
The above constraint is assumed to have a non-empty solution set. Writing the
log-likelihood as in (1.1), now with a common variance \sigma^2 and the
constrained parameter denoted \boldsymbol{\beta}^\star:
\begin{split}
\ell (\boldsymbol{\beta}) & = -\frac{N}{2}\log 2\pi - N\log(\sigma) - \sum_{i = 1}^{N}\frac{(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta^\star} )^2}{2\sigma^2}\\
& = -\frac{N}{2}\log 2\pi - \frac{N}{2}\log(\sigma^2) - \sum_{i = 1}^{N}\frac{(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta^\star} )^2}{2\sigma^2}\\
& = -\frac{N}{2}\log 2\pi - \frac{N}{2}\log(\sigma^2) - \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{2\sigma^2}
\end{split}
Where
\begin{split}
Y = \begin{bmatrix}
y_1 & y_2 & . &. &. & y_N
\end{bmatrix}^T
\boldsymbol{X} = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & . & . & . & x_{1P}\\
x_{21} & x_{22} & x_{23} & . & . & . & x_{2P}\\
. & . & . & . & . & . & .\\
. & . & . & . & . & . & .\\
x_{N1} & x_{N2} & x_{N3} & . & . & . & x_{NP}
\end{bmatrix}
\end{split}
The idea here is to use a Lagrange multiplier to handle the linear
constraint. Since the variance itself is unconstrained, we can first obtain
its maximum-likelihood estimate, plug that back into the likelihood, and then
apply the Lagrange multiplier. First, the estimate of the variance:
\begin{split}
\frac{\partial\ell(\boldsymbol{\beta})}{\partial\sigma^2} & = -\frac{N}{2\sigma^2} - \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{2}*\left(\frac{-1}{\sigma^4}
\right)\\
& = -\frac{N}{2\sigma^2} + \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{2\sigma^4}
\end{split}
Equating the above to zero gives the estimate of \sigma^2
as a function of the constrained \boldsymbol{\beta^\star}:
\sigma ^2 = \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{N}
Substituting this value of \sigma^2 into eq. (1.4):
\begin{split}
\ell (\boldsymbol{\beta}) & = -\frac{N}{2}\log 2\pi - \frac{N}{2}\log\left( \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{N}\right) - \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{2 \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{N}} \\
& = -\frac{N}{2}\log 2\pi - \frac{N}{2}\log\left( \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{N}\right) - \frac{N}{2}
\end{split}
Now we have to maximize this subject to the linear constraint on the parameters.
Equivalently, we only need to minimize the term inside the logarithm above.
That is, we minimize
\begin{split}
&(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})\\
& \text{subject to } A\boldsymbol{\beta^\star} = \boldsymbol{b}
\end{split}
Writing the Lagrangian for the above problem, with multiplier vector \boldsymbol{\lambda}:
\begin{split}
J & = (Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star}) + \boldsymbol{\lambda}^T ( A\boldsymbol{\beta^\star} -\boldsymbol{b} )\\
& = Y^TY - Y^T\boldsymbol{X\beta^\star} - \boldsymbol{\beta^{\star T} X^T}Y + \boldsymbol{\beta^{\star T} X^TX\beta^\star} + \boldsymbol{\lambda}^T ( A\boldsymbol{\beta^\star} -\boldsymbol{b} )\\
& = Y^TY - 2Y^T\boldsymbol{X\beta^\star} + \boldsymbol{\beta^{\star T} X^TX\beta^\star} +\boldsymbol{\lambda}^T ( A\boldsymbol{\beta^\star} -\boldsymbol{b} )\vspace{5pt}\\
& \hspace{20pt}\text{since the two middle terms are equal scalars (each is the transpose of the other)}
\end{split}
Differentiating w.r.t. \boldsymbol{\beta^\star}:
\begin{split}
\frac{\partial J}{\partial \boldsymbol{\beta^\star}} & = -2Y^T\boldsymbol{X} + 2\boldsymbol{\beta^{\star T}X^TX} + \boldsymbol{\lambda}^T A = 0
\end{split}
To ease the calculation, take the transpose on both sides.
\begin{split}
-2\boldsymbol{X^T}Y + 2\boldsymbol{X^TX\beta^\star} + A^T\boldsymbol{\lambda} & = 0 \\
A^T\boldsymbol{\lambda} & = 2\left(\boldsymbol{X^T}Y - \boldsymbol{X^TX\beta^\star} \right)
\end{split}
Multiplying on the left by A\left(\boldsymbol{X^TX}\right)^{-1}:
\begin{split}
A\left(\boldsymbol{X^TX}\right)^{-1} A^T\boldsymbol{\lambda} & = 2 \left( A\left(\boldsymbol{X^TX}\right)^{-1}\boldsymbol{X^T}Y - A\boldsymbol{\beta^\star}
\right)\\
& = 2 \left( A\left(\boldsymbol{X^TX}\right)^{-1}\boldsymbol{X^T}Y - \boldsymbol{b}
\right)\\
\implies
\boldsymbol{\lambda} & = 2 \left( A\left(\boldsymbol{X^TX}\right)^{-1} A^T
\right)^ {-1} \left( A\left(\boldsymbol{X^TX}\right)^{-1}\boldsymbol{X^T}Y - \boldsymbol{b}
\right)
\end{split}
Substituting \boldsymbol{\lambda} back into the first line of eq. (1.6) and solving for \boldsymbol{\beta^\star}:
\begin{split}
2\boldsymbol{X^TX\beta^\star} & = 2\boldsymbol{X^T}Y - A^T\boldsymbol{\lambda}\\
& = 2\boldsymbol{X^T}Y - 2 A^T\left( A\left(\boldsymbol{X^TX}\right)^{-1} A^T
\right)^ {-1} \left( A\left(\boldsymbol{X^TX}\right)^{-1}\boldsymbol{X^T}Y - \boldsymbol{b}
\right)\\
\implies
\boldsymbol{\beta^\star} & = \left(\boldsymbol{X^TX}\right)^{-1}\boldsymbol{X^T}Y - \left(\boldsymbol{X^TX}\right)^{-1} A^T \left( A\left(\boldsymbol{X^TX}\right)^{-1} A^T
\right)^ {-1} \left( A\left(\boldsymbol{X^TX}\right)^{-1}\boldsymbol{X^T}Y - \boldsymbol{b}
\right)
\end{split}
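A minimal sketch of this constrained estimate, assuming X^TX and A(X^TX)^{-1}A^T are invertible (array names are illustrative):

```python
import numpy as np

def constrained_ls(X, y, A, b):
    """Least squares subject to A beta = b, via the Lagrange-multiplier formula above."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_ols = XtX_inv @ X.T @ y                       # unconstrained estimate
    S = A @ XtX_inv @ A.T
    correction = XtX_inv @ A.T @ np.linalg.solve(S, A @ beta_ols - b)
    return beta_ols - correction
```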
Online learning
Here the objective is to minimize the L2 norm of the difference between the
updated and current weight vectors. For the perceptron to classify the current
input correctly, the following condition must be satisfied:
\begin{split}
y_n\boldsymbol{w_{i+1}}^T\boldsymbol{x_n} > 0
\end{split}
We have to minimize the following subject to the above constraint:
\begin{split}
\frac{1}{2}||\boldsymbol{w_{i+1}} - \boldsymbol{w_i}||_2^2
\end{split}
Minimizing the squared L2 norm is equivalent to minimizing the L2 norm. Writing
the problem in Lagrangian form (with the constraint active at the boundary):
\begin{split}
\boldsymbol{L} = \frac{1}{2}||\boldsymbol{w_{i+1}} - \boldsymbol{w_i}||_2^2 + \lambda\, y_n\boldsymbol{w_{i+1}}^T\boldsymbol{x_n}
\end{split}
Differentiating w.r.t. \boldsymbol{w_{i+1}} and equating it to
zero:
\begin{split}
\frac{\partial \boldsymbol{L}}{\partial \boldsymbol{w_{i+1}}} & = \boldsymbol{w_{i+1}} - \boldsymbol{w_i} + \lambda\, y_n\, \boldsymbol{x_n} = 0 \\
\boldsymbol{w_{i+1}} & = \boldsymbol{w_i} - \lambda\, y_n\,\boldsymbol{x_n}
\end{split}
Differentiating L w.r.t. \lambda leads to the following:
\begin{split}
\frac{\partial \boldsymbol{L}}{\partial \lambda} & = y_n\boldsymbol{w_{i+1}}^T \boldsymbol{x_n} = 0 \\
\implies y_n\left( \boldsymbol{w_i}^T - \lambda\, y_n\,\boldsymbol{x_n}^T
\right) \boldsymbol{x_n} & = 0 \\
\implies \lambda & = \frac{\boldsymbol{w_i}^T\boldsymbol{x_n}}{y_n||\boldsymbol{x_n}||^2}
\end{split}
Substituting \lambda back into the update gives:
\begin{split}
\boldsymbol{w_{i+1}} = \boldsymbol{w_i} - \left(\frac{\boldsymbol{w_i}^T\boldsymbol{x_n}}{y_n||\boldsymbol{x_n}||^2}
\right) y_n\,\boldsymbol{x_n}
\end{split}
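In code, this update projects w_i onto the boundary w^T x_n = 0 and is applied only when the current point is misclassified. A minimal sketch, assuming labels in {-1, +1}:

```python
import numpy as np

def online_update(w, x, y):
    """Minimal-change update: leave w unchanged if y * w^T x > 0, else project."""
    if y * (w @ x) > 0:
        return w
    # w_{i+1} = w_i - (w_i^T x_n / (y_n ||x_n||^2)) * y_n * x_n; with y in {-1, +1}
    # this equals w_i - (w_i^T x_n / ||x_n||^2) * x_n
    return w - ((w @ x) / np.dot(x, x)) * x
```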
Kernels
3.a
\begin{aligned}
K_3 & = a_1K_1 + a_2K_2\\
& a_1\geq 0 , a_2 \geq 0\end{aligned}
Since K_1 and K_2 are positive semidefinite matrices, for any
vector X we have the following inequalities:
\begin{split}
X^TK_1X \geq 0, \quad
X^TK_2X \geq 0
\end{split}
Taking any vector X and evaluating X^TK_3X
\begin{split}
X^TK_3X & = X^T (a_1K_1 + a_2K_2)X\\
& = X^Ta_1K_1X + X^Ta_2K_2X\\
& = a_1X^TK_1X + a_2X^TK_2X \\
&\geq 0
\end{split}
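Hence K_3 is positive semidefinite for any a_1, a_2 >= 0. A quick numerical illustration of this argument (the RBF and linear kernels here are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K1 = np.exp(-0.5 * sq_dists)          # RBF kernel matrix (PSD)
K2 = X @ X.T                          # linear kernel matrix (PSD)
K3 = 2.0 * K1 + 0.5 * K2              # conic combination, a1, a2 >= 0
print(np.linalg.eigvalsh(K3).min() >= -1e-9)   # True up to round-off: K3 is PSD
```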
3.b
K_4 is defined by the kernel function k_4(x_1,x_2) = f(x_1)f(x_2);
its matrix representation looks like the following:
\begin{split}
K_4 & = \begin{bmatrix}
k_4(x_1,x_1) & k_4(x_1,x_2) & . & .& . & k_4(x_1, x_N)\\
k_4(x_2,x_1) & k_4(x_2,x_2) & . & .& . & k_4(x_2, x_N)\\
. & . & . & . & . & . \\
. & . & . & . & . & . \\
k_4(x_N,x_1) & k_4(x_N,x_2) & . & .& . & k_4(x_N, x_N)
\end{bmatrix}\\
& = \begin{bmatrix}
f(x_1)f(x_1) & f(x_1)f(x_2) & . & .& . & f(x_1)f(x_N)\\
f(x_2)f(x_1) & f(x_2)f(x_2) & . & .& . & f(x_2)f(x_N)\\
. & . & . & . & . & . \\
. & . & . & . & . & . \\
f(x_N)f(x_1) & f(x_N)f(x_2) & . & .& . & f(x_N)f(x_N)
\end{bmatrix}\\
& = \begin{bmatrix}
f(x_1)\\
f(x_2)\\
. \\
. \\
f(x_N)
\end{bmatrix} \begin{bmatrix}
f(x_1) & f(x_2) & . & . & f(x_N)
\end{bmatrix}\\
& = \boldsymbol{FF^T}
\end{split}
Now taking any vector X and evaluating X^TK_4X
\begin{split}
X^TK_4X & = X^T\boldsymbol{FF^T}X \\
& = (\boldsymbol{F^T}X)^T \boldsymbol{F^T}X\\
& = ||\boldsymbol{F^T}X||^2\\
& \geq 0
\end{split}
Therefore K_4 is also a positive semidefinite matrix.
3.c
K_5 is defined by the kernel function
k_5(x_1,x_2) = k_1(x_1, x_2)k_2(x_1,x_2); its matrix representation
looks like the following:
\begin{split}
K_5 & = \begin{bmatrix}
k_5(x_1,x_1) & k_5(x_1,x_2) & . & .& . & k_5(x_1, x_N)\\
k_5(x_2,x_1) & k_5(x_2,x_2) & . & .& . & k_5(x_2, x_N)\\
. & . & . & . & . & . \\
. & . & . & . & . & . \\
k_5(x_N,x_1) & k_5(x_N,x_2) & . & .& . & k_5(x_N, x_N)
\end{bmatrix}\\
& = \begin{bmatrix}
k_1(x_1,x_1)k_2(x_1,x_1) & k_1(x_1,x_2)k_2(x_1,x_2) & . & .& . & k_1(x_1,x_N)k_2(x_1,x_N)\\
k_1(x_2,x_1)k_2(x_2,x_1) & k_1(x_2,x_2)k_2(x_2,x_2) & . & .& . & k_1(x_2, x_N)k_2(x_2,x_N)\\
. & . & . & . & . & . \\
. & . & . & . & . & . \\
k_1(x_N,x_1)k_2(x_N,x_1) & k_1(x_N,x_2)k_2(x_N,x_2) & . & .& . & k_1(x_N, x_N)k_2(x_N,x_N)
\end{bmatrix}
\end{split}
The matrix above is the entrywise (Hadamard) product of K_1 and K_2. Since
k_1 and k_2 are kernel functions, K_1 and K_2 are positive semidefinite, and
by the Schur product theorem the Hadamard product of two PSD matrices is
again PSD. Therefore K_5 is a PSD matrix.
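A quick numerical illustration of the Schur product theorem used above (the RBF and polynomial kernels are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(25, 4))
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K1 = np.exp(-0.5 * sq_dists)          # RBF kernel matrix (PSD)
K2 = (X @ X.T + 1.0)**2               # degree-2 polynomial kernel matrix (PSD)
K5 = K1 * K2                          # entrywise (Hadamard) product
print(np.linalg.eigvalsh(K5).min() >= -1e-9)   # True up to round-off: K5 is PSD
```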
Bias Variance Tradeoff
Programming
Data preparation
Feature representation
(1)
The top 3 most frequent words found in the spam/ham data set are the following:
enron: 600
will: 351
please: 291
In the ionosphere data I have used 1 for label “g” and 0 for label “b”.
Initialization and extreme conditions have been considered during the
implementation.
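The word counts above can be reproduced with a simple bag-of-words pass over the training messages. A minimal sketch, assuming the spam/ham messages are available as a list of strings named messages (the actual preprocessing and file layout are not shown here):

```python
from collections import Counter

def top_words(messages, k=3):
    """Return the k most frequent tokens across all messages."""
    counts = Counter()
    for msg in messages:
        counts.update(msg.lower().split())
    return counts.most_common(k)   # e.g. [('enron', 600), ('will', 351), ('please', 291)]
```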
Batch Gradient descent
(2)
Update equations for \boldsymbol{w} and b. Here \boldsymbol{x} is an
NxP matrix, where N is the number of data points and P is the feature
dimension, y is an Nx1 vector, \boldsymbol{w} is a Px1 vector, and
\boldsymbol{b}_t is an Nx1 vector with every entry equal to the bias b_t.
Without regularization:
\begin{split}
\boldsymbol{w_{t+1}} & = \boldsymbol{w_t} - \eta\left(\boldsymbol{x}^T\left(\sigma(\boldsymbol{b}_t+\boldsymbol{x}\boldsymbol{w}_t) - \boldsymbol{y}
\right)
\right)\\
b_{t+1} & = b_{t} - \eta\, \boldsymbol{1}^T\left(\sigma(\boldsymbol{b}_t + \boldsymbol{x}\boldsymbol{w}_t) - \boldsymbol{y}
\right)
\end{split}
With Regularization:
\begin{split}
\boldsymbol{w_{t+1}} & = \boldsymbol{w_t} - \eta\left(\boldsymbol{x}^T\left(\sigma(\boldsymbol{b}_t+\boldsymbol{x}\boldsymbol{w}_t) - \boldsymbol{y}
\right)
+ 2\lambda\boldsymbol{w}_t
\right)\\
b_{t+1} & = b_{t} - \eta\, \boldsymbol{1}^T\left(\sigma(\boldsymbol{b}_t + \boldsymbol{x}\boldsymbol{w}_t) - \boldsymbol{y}
\right)
\end{split}
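A minimal sketch of these batch updates (eta, lam, and the iteration count are illustrative; y is assumed to contain 0/1 labels, and lam = 0 gives the unregularized case):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_descent(x, y, eta=0.01, lam=0.0, iters=50):
    """Logistic regression trained with the batch updates above."""
    N, P = x.shape
    w, b = np.zeros(P), 0.0
    for _ in range(iters):
        p = sigmoid(b + x @ w)                          # predicted probabilities
        w = w - eta * (x.T @ (p - y) + 2.0 * lam * w)   # weight update
        b = b - eta * np.sum(p - y)                     # bias update (unregularized)
    return w, b
```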
(3)a
(3)b
L2 norm of \boldsymbol{w} after 50 iterations for each step size
\eta_i
(4)a
(4)b
(4)c
(5)
Note: here we prepend a column of ones to the input data matrix and place the
seed intercept b as the first row of the weight vector to incorporate the
bias. Without regularization (using the notation and assumptions of (2)):
I use the Hessian directly because its derivation was already given in
assignment #1. Here H_b denotes the second derivative w.r.t. the bias.
\begin{split}
H & = \frac{\partial^2 \varepsilon(\boldsymbol{w},b)}{\partial\boldsymbol{w}\,\partial\boldsymbol{w}^T} \\
& = \boldsymbol{x^T}\,\mathrm{diag}\left(\sigma(b+ \boldsymbol{xw})(1-\sigma(b+\boldsymbol{xw}))\right)\boldsymbol{x}\\
\frac{\partial\varepsilon(\boldsymbol{w},b)}{\partial b} & = - \sum_{i = 1}^{N} \left( y_i - \sigma(b + \boldsymbol{x_i w})
\right) \\
H_b = \frac{\partial^2\varepsilon(\boldsymbol{w},b)}{\partial b^2} & = \sum_{i=1}^{N} \sigma(b + \boldsymbol{x_i w})\left(1-\sigma(b + \boldsymbol{x_i w})
\right) \\
\boldsymbol{w_{t+1}} & = \boldsymbol{w_t} - H_t^{-1}\nabla\varepsilon_t \\
b_{t+1} & = b_t - H_b^{-1}\, \frac{\partial\varepsilon(\boldsymbol{w},b)}{\partial b}
\end{split}
With regularization: in this case we add 2\lambda to the diagonal entries of
the Hessian found in the unregularized case (because for i \neq j the second
derivative of the regularization term is zero), and 2\lambda\boldsymbol{w} is
added to the first derivative with respect to the weight vector. The bias is
not affected by the regularization.
\begin{split}
H & = \frac{\partial^2 \varepsilon(\boldsymbol{w},b)}{\partial\boldsymbol{w}\,\partial\boldsymbol{w}^T} \\
& = \boldsymbol{x^T}\,\mathrm{diag}\left(\sigma(b+ \boldsymbol{xw})(1-\sigma(b+\boldsymbol{xw}))\right)\boldsymbol{x} + 2\lambda\boldsymbol{I}\\
\frac{\partial\varepsilon(\boldsymbol{w},b)}{\partial b} & = - \sum_{i = 1}^{N} \left( y_i - \sigma(b + \boldsymbol{x_i w})
\right) \\
H_b = \frac{\partial^2\varepsilon(\boldsymbol{w},b)}{\partial b^2} & = \sum_{i=1}^{N} \sigma(b + \boldsymbol{x_i w})\left(1-\sigma(b + \boldsymbol{x_i w})
\right) \\
\boldsymbol{w_{t+1}} & = \boldsymbol{w_t} - H_t^{-1}\nabla\varepsilon_t \\
b_{t+1} & = b_t - H_b^{-1}\, \frac{\partial\varepsilon(\boldsymbol{w},b)}{\partial b}
\end{split}
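A minimal sketch of these Newton updates, using the same separate treatment of w and b (lam = 0 recovers the unregularized case; names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(x, y, lam=0.0, iters=10):
    """Newton's method for logistic regression with optional L2 regularization."""
    N, P = x.shape
    w, b = np.zeros(P), 0.0
    for _ in range(iters):
        p = sigmoid(b + x @ w)
        s = p * (1.0 - p)                                   # sigma(1 - sigma) per point
        grad_w = x.T @ (p - y) + 2.0 * lam * w
        H = x.T @ (s[:, None] * x) + 2.0 * lam * np.eye(P)  # Hessian w.r.t. w
        w = w - np.linalg.solve(H, grad_w)
        b = b - np.sum(p - y) / np.sum(s)                   # H_b^{-1} times dE/db
    return w, b
```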
(6)a
(6)b
L2 norm of w without regularization:
(6)c
Cross entropy of the test data using Newton's method (shown up to the step at
which it converges).
(7)a
(7)b
L2 norm of vector \boldsymbol{w}
(8)
In (3)a we see that as the step size increases the iterates start to overshoot
and oscillate. Even with regularization (in (4)a) we see these spikes. The same
behavior appears for higher values of \lambda in the training and testing cross
entropies across the different step sizes: at large step sizes and large
\lambda we expect high cross-entropy values, which can be seen in the plots.
With Newton's method we see very sharp convergence within 5 steps, but there is
a trade-off in per-iteration time complexity compared with gradient descent. It
is not reported here, but I also observed that, given a good initial weight
vector, Newton's method converges very quickly.