Linear Regression

Regression with heterogeneous noise

1.1.a Log-likelihood function of independent but not identically distributed data

Here we are given the following:

\begin{aligned} y_i = \textbf{x}_\textbf{i}^T\boldsymbol{\beta} + \varepsilon_i\end{aligned}

and each data point has a different noise variance:

\begin{aligned} \varepsilon_i \sim N(0,\sigma_i^2)\end{aligned}

Hence we have the following:

\begin{aligned} y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta} \sim N(0,\sigma_i^2)\end{aligned}

Writing the log-likelihood of the data points:

\begin{split} \ell (\boldsymbol{\beta}) & = \log (P(Y| X,\boldsymbol{\beta}))\\ & = \log\left(\prod_{i = 1 }^{N} \frac{1}{\sqrt{2\pi} \sigma_i} \exp\left( \frac{- (y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta} )^2}{2\sigma_i^2} \right) \right)\\ & = \sum_{i = 1}^{N}\left(\log\left(\frac{1}{\sqrt{2\pi} \sigma_i} \right) - \frac{(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta} )^2}{2\sigma_i^2} \right)\\ & = -\frac{N}{2}\log 2\pi - \sum_{i = 1}^{N}\log(\sigma_i) - \sum_{i = 1}^{N}\frac{(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta} )^2}{2\sigma_i^2} \end{split}
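
As a quick numerical sanity check, the log-likelihood above can be evaluated directly. Below is a minimal NumPy sketch (the names `X`, `y`, `sigma`, `beta` are illustrative, not from the assignment):

```python
import numpy as np

def hetero_log_likelihood(X, y, sigma, beta):
    """Log-likelihood of y_i ~ N(x_i^T beta, sigma_i^2) with per-point variances."""
    N = len(y)
    resid = y - X @ beta                              # y_i - x_i^T beta
    return (-0.5 * N * np.log(2 * np.pi)
            - np.sum(np.log(sigma))                   # - sum_i log(sigma_i)
            - np.sum(resid ** 2 / (2 * sigma ** 2)))  # - sum_i r_i^2 / (2 sigma_i^2)
```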

1.1.b Maximum likelihood derivation for \boldsymbol{\beta}

Differentiating the above equation w.r.t. \beta_j we get the following:

\begin{split} \frac{\partial\ell(\boldsymbol{\beta})}{\partial \beta_j} & = -0 - 0 - \sum_{i = 1}^{N}\frac{2*(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta})*(-x_{ij})}{2\sigma_i^2} \end{split}

Equating the above expression to 0 we get:

\begin{split} \sum_{i = 1}^{N} \frac{(y_i -\textbf{x}_\textbf{i}^T\boldsymbol{\beta})*(x_{ij})}{\sigma_i^2} & = 0\\ \end{split}

In matrix form, the above equation (for j = 1) can be written as:

\begin{split} \begin{bmatrix} x_{11} & x_{21} & . & .& . & x_{N1} \end{bmatrix} \begin{bmatrix} \frac{y_1 - \textbf{x}_\textbf{1}^T\boldsymbol{\beta} }{\sigma_1^2} \\ \frac{y_2 - \textbf{x}_\textbf{2}^T\boldsymbol{\beta} }{\sigma_2^2} \\ .\\ .\\ \frac{y_N - \textbf{x}_\textbf{N}^T\boldsymbol{\beta} }{\sigma_N^2} \end{bmatrix} = 0 \end{split}

Similarly, for the other \beta's we get equations of the same form. Stacking all of them into one matrix equation gives the following:

\begin{split} \begin{bmatrix} x_{11} & x_{21} & . & .& . & x_{N1}\\ x_{12} & x_{22} & . & .& . & x_{N2}\\ . & . & . & . & . & . \\ . & . & . & . & . & . \\ x_{1p} & x_{2p} & . & .& . & x_{Np} \end{bmatrix} \begin{bmatrix} \frac{y_1 - \textbf{x}_\textbf{1}^T\boldsymbol{\beta} }{\sigma_1^2} \\ \frac{y_2 - \textbf{x}_\textbf{2}^T\boldsymbol{\beta} }{\sigma_2^2} \\ .\\ .\\ \frac{y_N - \textbf{x}_\textbf{N}^T\boldsymbol{\beta} }{\sigma_N^2} \end{bmatrix} = \begin{bmatrix} 0\\ 0\\ .\\ .\\ 0 \end{bmatrix} \end{split}

Denote the left PxN matrix by A. Moving the y terms to the R.H.S. we get the following:

\begin{split} A \begin{bmatrix} \frac{\textbf{x}_\textbf{1}^T\boldsymbol{\beta} }{\sigma_1^2} \\ \frac{\textbf{x}_\textbf{2}^T\boldsymbol{\beta} }{\sigma_2^2} \\ .\\ .\\ \frac{\textbf{x}_\textbf{N}^T\boldsymbol{\beta} }{\sigma_N^2} \end{bmatrix} = A \begin{bmatrix} \frac{y_1}{\sigma_1^2}\\ \frac{y_2}{\sigma_2^2}\\ .\\ .\\ \frac{y_N}{\sigma_N^2} \end{bmatrix} \end{split}

The above equation can be decomposed further by expanding every entry of the column vector on the L.H.S.:

\begin{bmatrix} \frac{\textbf{x}_\textbf{1}^T\boldsymbol{\beta} }{\sigma_1^2} \\ \frac{\textbf{x}_\textbf{2}^T\boldsymbol{\beta} }{\sigma_2^2} \\ .\\ .\\ \frac{\textbf{x}_\textbf{N}^T\boldsymbol{\beta} }{\sigma_N^2} \end{bmatrix} = \begin{bmatrix} \frac{x_{11}}{\sigma_1^2} & \frac{x_{12}}{\sigma_1^2} & . & . & \frac{x_{1p}}{\sigma_1^2}\\ \frac{x_{21}}{\sigma_2^2} & \frac{x_{22}}{\sigma_2^2} & . & . & \frac{x_{2p}}{\sigma_2^2}\\ . & . & . & . & . \\ \frac{x_{N1}}{\sigma_N^2} & \frac{x_{N2}}{\sigma_N^2} & . & . & \frac{x_{Np}}{\sigma_N^2}\\ \end{bmatrix} \begin{bmatrix} \beta_1\\ \beta_2\\ .\\ .\\ \beta_p \end{bmatrix}

Let B be the NxP matrix on the R.H.S. of the above equation. Now we can write eq. (1.3) as

\boldsymbol{AB}\boldsymbol{\beta} =\boldsymbol{AY_{\sigma}}

Here \boldsymbol{Y_{\sigma}} denotes the column vector with entries y_i/\sigma_i^2. \boldsymbol{AB} is a PxP matrix, and if it is invertible we can write the solution for \boldsymbol{\beta} as follows:

\begin{split} \boldsymbol{\beta} = \boldsymbol{(AB)^{-1}}\boldsymbol{AY_{\sigma}} \end{split}
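
Equivalently, with W = diag(1/\sigma_1^2, ..., 1/\sigma_N^2) this is the weighted least-squares estimator (X^T W X)^{-1} X^T W Y, since A = X^T and B = WX. A minimal NumPy sketch under that reading (function and variable names are hypothetical):

```python
import numpy as np

def wls_estimate(X, y, sigma):
    """Closed-form MLE under heteroscedastic noise:
    beta = (X^T W X)^{-1} X^T W y with W = diag(1 / sigma_i^2),
    which matches the (AB)^{-1} A Y_sigma form above."""
    w = 1.0 / sigma ** 2                   # per-point weights 1/sigma_i^2
    XtWX = X.T @ (w[:, None] * X)          # corresponds to A B   (P x P)
    XtWy = X.T @ (w * y)                   # corresponds to A Y_sigma  (P,)
    return np.linalg.solve(XtWX, XtWy)     # solve rather than forming an inverse
```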

Smooth coefficients

1.2.a

To encourage natural ordering we add an extra regularization term to the residual sum of squares. Without the natural-ordering term, \boldsymbol{RSS(\beta)} looks like the following:

\begin{split} \boldsymbol{RSS(\beta)} = \sum_{i =1 }^{n} \frac{(y_i - \boldsymbol{x_i^T\beta})^2}{2\sigma^2} + \frac{1}{2}\lambda||\boldsymbol{\beta}||^2 \end{split}

Adding the additional natural-ordering penalty:

\begin{split} \boldsymbol{RSS(\beta)} = \sum_{i =1 }^{n} \frac{(y_i - \boldsymbol{x_i^T\beta})^2}{2\sigma^2} + \frac{1}{2}\lambda||\boldsymbol{\beta}||^2 + \frac{1}{2}\gamma\sum_{i=1}^{p-1} \left( \beta_i - \beta_{i+1} \right)^2 \end{split}

1.2.b

Taking the first derivative and equating it to zero. Note: the derivatives w.r.t. \beta_1 and \beta_p differ from those for the remaining \beta's.

\begin{split} \frac{\partial \boldsymbol{RSS(\beta)} }{\partial \beta_1} & = \sum_{i = 1}^{N}\frac{2*(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta})*(-x_{i1})}{2\sigma^2} + \lambda \beta_1 + \gamma\left(\beta_1 - \beta_2 \right) = 0\\ \frac{\partial \boldsymbol{RSS(\beta)} }{\partial \beta_2} & = \sum_{i = 1}^{N}\frac{2*(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta})*(-x_{i2})}{2\sigma^2} + \lambda \beta_2 + \gamma\left(2\beta_2 - \beta_1 - \beta_3 \right) = 0 \\ \frac{\partial \boldsymbol{RSS(\beta)} }{\partial \beta_3} & = \sum_{i = 1}^{N}\frac{2*(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta})*(-x_{i3})}{2\sigma^2} + \lambda \beta_3 + \gamma\left(2\beta_3 - \beta_2 - \beta_4 \right) = 0 \\ ... \\ ...\\ \frac{\partial \boldsymbol{RSS(\beta)} }{\partial \beta_p} & = \sum_{i = 1}^{N}\frac{2*(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta})*(-x_{ip})}{2\sigma^2} + \lambda \beta_p + \gamma\left(\beta_p - \beta_{p-1} \right) = 0 \end{split}

Writing the above system of linear equations in matrix form. Since \sigma^2 is a constant here, it can be absorbed into \lambda and \gamma; equivalently, we set \sigma^2 = 1 below.

\begin{split} -A \left(Y - A^T\boldsymbol{\beta}\right) + \lambda\boldsymbol{\beta} + \gamma\begin{bmatrix} \beta_1 - \beta_2\\ 2\beta_2 - \beta_1 - \beta_3\\ .\\ .\\ \beta_p - \beta_{p-1} \end{bmatrix} & = \begin{bmatrix} 0\\ 0\\ .\\ .\\ 0 \end{bmatrix}\\ -A \left(Y - A^T\boldsymbol{\beta}\right) + \lambda\boldsymbol{\beta} + \gamma \begin{bmatrix} 1 & -1 & 0 & 0 & . & . & 0 \\ -1 & 2 & -1 & 0 & . & . & 0 \\ 0 & -1 & 2 & -1 & . & . & 0 \\ . & . & . & . & . & .& . \\ . & . & . & . & . & .& . \\ 0 & 0 & 0 & 0 & . & -1 & 1 \\ \end{bmatrix}\boldsymbol{\beta} = \boldsymbol{0} \end{split}

Denoting the tridiagonal penalty matrix above by K and simplifying leads to the following:

\begin{split} AA^T\boldsymbol{\beta} + \lambda\boldsymbol{\beta} + \gamma K \boldsymbol{\beta} & = AY\\ \left( AA^T + \lambda\boldsymbol{I} + \gamma K \right)\boldsymbol{\beta} & = AY \\ \implies \boldsymbol{\beta} & = \left( AA^T + \lambda\boldsymbol{I} + \gamma K \right)^ {-1} AY \end{split}
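
Since A = X^T, this amounts to solving (X^T X + \lambda I + \gamma K)\boldsymbol{\beta} = X^T Y, with K the tridiagonal second-difference penalty matrix above (\sigma^2 absorbed into \lambda and \gamma as noted). A minimal NumPy sketch of this closed form (names are illustrative):

```python
import numpy as np

def smooth_ridge_estimate(X, y, lam, gamma):
    """Solve (X^T X + lam*I + gamma*K) beta = X^T y, where K encodes the
    smoothness penalty sum_i (beta_i - beta_{i+1})^2."""
    p = X.shape[1]
    # K = D^T D, with D the (p-1) x p first-difference operator
    D = np.diff(np.eye(p), axis=0)         # rows are differences of consecutive unit vectors
    K = D.T @ D                            # tridiagonal: diag [1, 2, ..., 2, 1], off-diag -1
    lhs = X.T @ X + lam * np.eye(p) + gamma * K
    return np.linalg.solve(lhs, X.T @ y)
```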

Linearly constrained linear regression

Here we are given the following:

y = \boldsymbol{x^T\beta} + \varepsilon

and the parameters satisfy an additional linear constraint:

A\boldsymbol{\beta} = \boldsymbol{b}

The above system is assumed to have a non-empty solution set. Referring to eq. (1.1), we write the log-likelihood, denoting the constrained parameter by \boldsymbol{\beta}^\star:

\begin{split} \ell (\boldsymbol{\beta}) & = -\frac{N}{2}\log 2\pi - N\log(\sigma) - \sum_{i = 1}^{N}\frac{(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta^\star} )^2}{2\sigma^2}\\ & = -\frac{N}{2}\log 2\pi - \frac{N}{2}\log(\sigma^2) - \sum_{i = 1}^{N}\frac{(y_i - \textbf{x}_\textbf{i}^T\boldsymbol{\beta^\star} )^2}{2\sigma^2}\\ & = -\frac{N}{2}\log 2\pi - \frac{N}{2}\log(\sigma^2) - \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{2\sigma^2} \end{split}

Where

\begin{split} Y = \begin{bmatrix} y_1 & y_2 & . &. &. & y_N \end{bmatrix}^T, \qquad \boldsymbol{X} = \begin{bmatrix} x_{11} & x_{12} & x_{13} & . & . & . & x_{1P}\\ x_{21} & x_{22} & x_{23} & . & . & . & x_{2P}\\ . & . & . & . & . & . & .\\ . & . & . & . & . & . & .\\ x_{N1} & x_{N2} & x_{N3} & . & . & . & x_{NP} \end{bmatrix} \end{split}

The idea is to use a Lagrange multiplier to handle the linear constraint. Since the variance itself is unconstrained, we first estimate it, plug the estimate back into the likelihood, and only then apply the Lagrange multiplier. First, the estimate of the variance:

\begin{split} \frac{\partial\ell(\boldsymbol{\beta})}{\partial\sigma^2} & = -\frac{N}{2\sigma^2} - \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{2}*\left(\frac{-1}{\sigma^4} \right)\\ & = -\frac{N}{2\sigma^2} + \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{2\sigma^4} \end{split}

Equating the above to zero gives the estimate of \sigma^2 as a function of the constrained \boldsymbol{\beta^\star}:

\sigma ^2 = \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{N}

Substituting this value of \sigma^2 into eq. (1.4):

\begin{split} \ell (\boldsymbol{\beta}) & = -\frac{N}{2}\log 2\pi - \frac{N}{2}\log\left( \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{N}\right) - \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{2 \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{N}} \\ & = -\frac{N}{2}\log 2\pi - \frac{N}{2}\log\left( \frac{(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})}{N}\right) - \frac{N}{2} \end{split}

Now we have to maximize the likelihood subject to the linear constraint on the parameters. Equivalently, we only need to minimize the term inside the logarithm in the above equation. That is, we have to minimize

\begin{split} &(Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star})\\ & \text{subject to } A\boldsymbol{\beta^\star} = \boldsymbol{b} \end{split}

Introducing a Lagrange multiplier vector \boldsymbol{\lambda} for the above problem:

\begin{split} J & = (Y - \boldsymbol{X\beta^\star})^T(Y - \boldsymbol{X\beta^\star}) + \boldsymbol{\lambda}^T ( A\boldsymbol{\beta^\star} -\boldsymbol{b} )\\ & = Y^TY - Y^T\boldsymbol{X\beta^\star} - \boldsymbol{\beta^{\star T} X^T}Y + \boldsymbol{\beta^{\star T} X^TX\beta^\star} + \boldsymbol{\lambda}^T ( A\boldsymbol{\beta^\star} -\boldsymbol{b} )\\ & = Y^TY - 2Y^T\boldsymbol{X\beta^\star} + \boldsymbol{\beta^{\star T} X^TX\beta^\star} +\boldsymbol{\lambda}^T ( A\boldsymbol{\beta^\star} -\boldsymbol{b} )\vspace{5pt}\\ & \hspace{20pt}\text{because the two middle terms are scalars and transposes of each other} \end{split}

Differentiating w.r.t. \boldsymbol{\beta^\star}:

\begin{split} \frac{\partial J}{\partial \boldsymbol{\beta^\star}} & = -2Y^T\boldsymbol{X} + 2\boldsymbol{\beta^{\star T}X^TX} + \boldsymbol{\lambda}^T A = 0 \\ \end{split}

Taking the transpose of both sides to simplify the calculation:

\begin{split} -2\boldsymbol{X^T}Y + 2\boldsymbol{X^TX\beta^\star} + A^T\boldsymbol{\lambda} & = 0 \\ A^T\boldsymbol{\lambda} & = 2\left(\boldsymbol{X^T}Y - \boldsymbol{X^TX\beta^\star} \right) \end{split}

Multiplying on the left by A\left(\boldsymbol{X^TX}\right)^{-1} (assuming \boldsymbol{X^TX} is invertible):

\begin{split} A\left(\boldsymbol{X^TX}\right)^{-1} A^T\boldsymbol{\lambda} & = 2 \left( A\left(\boldsymbol{X^TX}\right)^{-1}\boldsymbol{X^T}Y - A\boldsymbol{\beta^\star} \right)\\ & = 2 \left( A\left(\boldsymbol{X^TX}\right)^{-1}\boldsymbol{X^T}Y - \boldsymbol{b} \right)\\ \implies \boldsymbol{\lambda} & = 2 \left( A\left(\boldsymbol{X^TX}\right)^{-1} A^T \right)^ {-1} \left( A\left(\boldsymbol{X^TX}\right)^{-1}\boldsymbol{X^T}Y - \boldsymbol{b} \right) \end{split}

Substituting \boldsymbol{\lambda} back into the first line of eq. (1.6) and solving for \boldsymbol{\beta^\star}:

\begin{split} 2\boldsymbol{X^TX\beta^\star} & = 2\boldsymbol{X^T}Y - A^T\boldsymbol{\lambda}\\ & = 2\boldsymbol{X^T}Y - 2 A^T\left( A\left(\boldsymbol{X^TX}\right)^{-1} A^T \right)^ {-1} \left( A\left(\boldsymbol{X^TX}\right)^{-1}\boldsymbol{X^T}Y - \boldsymbol{b} \right)\\ \implies \boldsymbol{\beta^\star} & = \left(\boldsymbol{X^TX}\right)^{-1}\boldsymbol{X^T}Y - \left(\boldsymbol{X^TX}\right)^{-1} A^T \left( A\left(\boldsymbol{X^TX}\right)^{-1} A^T \right)^ {-1} \left( A\left(\boldsymbol{X^TX}\right)^{-1}\boldsymbol{X^T}Y - \boldsymbol{b} \right) \end{split}
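
The result is the unconstrained least-squares solution plus a correction term that enforces A\boldsymbol{\beta^\star} = \boldsymbol{b}. A minimal NumPy sketch of this formula, assuming \boldsymbol{X^TX} and A(\boldsymbol{X^TX})^{-1}A^T are invertible (names are hypothetical):

```python
import numpy as np

def constrained_ls(X, y, A, b):
    """Equality-constrained least squares:
    beta* = beta_ols - (X^T X)^{-1} A^T (A (X^T X)^{-1} A^T)^{-1} (A beta_ols - b)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_ols = XtX_inv @ X.T @ y                 # unconstrained estimator
    M = A @ XtX_inv @ A.T                        # assumed invertible
    correction = XtX_inv @ A.T @ np.linalg.solve(M, A @ beta_ols - b)
    return beta_ols - correction
```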

Online learning

Here the objective is to minimize the L2 norm of the difference between the updated and the current weight vector. For the perceptron to classify the current input correctly, the following condition must be satisfied:

\begin{split} y_n\boldsymbol{w_{i+1}}^T\boldsymbol{x_n} > 0 \end{split}

We have to minimize the following subject to the above constraint:

\begin{split} 1/2*||\boldsymbol{w_{i+1}} - \boldsymbol{w_i}||_2^2 \end{split}

Minimizing the squared L2 norm is equivalent to minimizing the L2 norm. Writing the expression in Lagrangian form:

\begin{split} \boldsymbol{L} = 1/2*||\boldsymbol{w_{i+1}} - \boldsymbol{w_i}||_2^2 + \lambda * (y_n\boldsymbol{w_{i+1}}^T\boldsymbol{x_n} ) \end{split}

Differentiating w.r.t. \boldsymbol{w_{i+1}} and equating it to zero:

\begin{split} \frac{\partial \boldsymbol{L}}{\partial \boldsymbol{w_{i+1}}} & = \boldsymbol{w_{i+1}} - \boldsymbol{w_i} + \lambda*y_n*\boldsymbol{x_n} = 0 \\ \boldsymbol{w_{i+1}} & = \boldsymbol{w_i} - \lambda*y_n*\boldsymbol{x_n} \end{split}

Differentiating \boldsymbol{L} w.r.t. \lambda leads to the following:

\begin{split} \frac{\partial \boldsymbol{L}}{\partial \lambda} & = y_n\boldsymbol{w_{i+1}}^T \boldsymbol{x_n} = 0 \\ \implies y_n\left( \boldsymbol{w_i}^T - \lambda*y_n*\boldsymbol{x_n}^T \right) \boldsymbol{x_n} & = 0 \\ \implies \lambda = \frac{\boldsymbol{w_i}^T\boldsymbol{x_n}}{y_n||\boldsymbol{x_n}||^2} \end{split}
\begin{split} \boldsymbol{w_{i+1}} = \boldsymbol{w_i} - \left(\frac{\boldsymbol{w_i}^T\boldsymbol{x_n}}{y_n||\boldsymbol{x_n}||^2} \right)* y_n*\boldsymbol{x_n} \end{split}
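
In code, this update projects \boldsymbol{w_i} onto the hyperplane y_n\boldsymbol{w}^T\boldsymbol{x_n} = 0 whenever the current point is not classified correctly. A minimal NumPy sketch, assuming labels y_n \in \{-1, +1\} (names are illustrative):

```python
import numpy as np

def online_update(w, x_n, y_n):
    """Minimal-norm update: if y_n * w^T x_n <= 0, set
    w <- w - (w^T x_n / (y_n * ||x_n||^2)) * y_n * x_n, as derived above."""
    if y_n * (w @ x_n) > 0:
        return w                                   # already classified correctly
    lam = (w @ x_n) / (y_n * (x_n @ x_n))          # Lagrange multiplier from above
    return w - lam * y_n * x_n
```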

Kernels

3.a

\begin{aligned} K_3 & = a_1K_1 + a_2K_2\\ & a_1\geq 0 , a_2 \geq 0\end{aligned}

Since K_1 and K_2 are positive semidefinite matrices, for any vector X we have the following inequalities:

\begin{split} X^TK_1X \geq 0,\space X^TK_2X \geq 0 \end{split}

Taking any vector X and evaluating X^TK_3X

\begin{split} X^TK_3X & = X^T (a_1K_1 + a_2K_2)X\\ & = X^Ta_1K_1X + X^Ta_2K_2X\\ & = a_1X^TK_1X + a_2X^TK_2X \\ &\geq 0 \end{split}

3.b

K_4 is defined by the kernel function k_4(x_1,x_2) = f(x_1)f(x_2); its matrix representation looks like the following:

\begin{split} K_4 & = \begin{bmatrix} k_4(x_1,x_1) & k_4(x_1,x_2) & . & .& . & k_4(x_1, x_N)\\ k_4(x_2,x_1) & k_4(x_2,x_2) & . & .& . & k_4(x_2, x_N)\\ . & . & . & . & . & . \\ . & . & . & . & . & . \\ k_4(x_N,x_1) & k_4(x_N,x_2) & . & .& . & k_4(x_N, x_N) \end{bmatrix}\\ & = \begin{bmatrix} f(x_1)f(x_1) & f(x_1)f(x_2) & . & .& . & f(x_1)f(x_N)\\ f(x_2)f(x_1) & f(x_2)f(x_2) & . & .& . & f(x_2)f(x_N)\\ . & . & . & . & . & . \\ . & . & . & . & . & . \\ f(x_N)f(x_1) & f(x_N)f(x_2) & . & .& . & f(x_N)f(x_N) \end{bmatrix}\\ & = \begin{bmatrix} f(x_1)\\ f(x_2)\\ . \\ . \\ f(x_N) \end{bmatrix} \begin{bmatrix} f(x_1) & f(x_2) & . & . & f(x_N) \end{bmatrix}\\ & = \boldsymbol{FF^T} \end{split}

Now taking any vector X and evaluating X^TK_4X

\begin{split} X^TK_4X & = X^T\boldsymbol{FF^T}X \\ & = (\boldsymbol{F^T}X)^T \boldsymbol{F^T}X\\ & = ||\boldsymbol{F^T}X||^2\\ & \geq 0 \end{split}

Therefore K_4 is also a positive semidefinite matrix.

3.c

K_5 is defined by the kernel function k_5(x_1,x_2) = k_1(x_1, x_2)k_2(x_1,x_2); its matrix representation looks like the following:

\begin{split} K_5 & = \begin{bmatrix} k_5(x_1,x_1) & k_5(x_1,x_2) & . & .& . & k_5(x_1, x_N)\\ k_5(x_2,x_1) & k_5(x_2,x_2) & . & .& . & k_5(x_2, x_N)\\ . & . & . & . & . & . \\ . & . & . & . & . & . \\ k_5(x_N,x_1) & k_5(x_N,x_2) & . & .& . & k_5(x_N, x_N) \end{bmatrix}\\ & = \begin{bmatrix} k_1(x_1,x_1)k_2(x_1,x_1) & k_1(x_1,x_2)k_2(x_1,x_2) & . & .& . & k_1(x_1,x_N)k_2(x_1,x_N)\\ k_1(x_2,x_1)k_2(x_2,x_1) & k_1(x_2,x_2)k_2(x_2,x_2) & . & .& . & k_1(x_2, x_N)k_2(x_2,x_N)\\ . & . & . & . & . & . \\ . & . & . & . & . & . \\ k_1(x_N,x_1)k_2(x_N,x_1) & k_1(x_N,x_2)k_2(x_N,x_2) & . & .& . & k_1(x_N, x_N)k_2(x_N,x_N) \end{bmatrix} \end{split}

The matrix above is the entrywise (Hadamard) product of K_1 and K_2. Since k_1 and k_2 are kernel functions, K_1 and K_2 are PSD, and by the Schur product theorem (one proof of which uses Wick's theorem) the Hadamard product of two PSD matrices is PSD. Therefore K_5 is also a positive semidefinite matrix.
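
These closure properties can also be checked numerically: build Gram matrices K_1 and K_2 from two known kernels, form K_3, K_4, K_5 as above, and verify that their smallest eigenvalues are non-negative up to floating-point error. A small NumPy sketch with illustrative choices of the kernels and of f:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                      # 20 sample points in R^3

K1 = X @ X.T                                      # linear kernel (PSD)
K2 = np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))  # RBF kernel (PSD)
f = np.sin(X[:, 0])                               # an arbitrary real-valued function f

K3 = 2.0 * K1 + 3.0 * K2                          # a1*K1 + a2*K2 with a1, a2 >= 0
K4 = np.outer(f, f)                               # entries f(x_i) f(x_j) = F F^T
K5 = K1 * K2                                      # Hadamard (entrywise) product

for name, K in [("K3", K3), ("K4", K4), ("K5", K5)]:
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())  # ~ >= 0 numerically
```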

Bias Variance Tradeoff

Programming

Data preparation

Feature representation

(1)

The top 3 most frequent words found in the spam/ham data set are the following:

enron: 600
will: 351
please: 291


In the ionosphere data I used 1 for label “g” and 0 for label “b”. Initialization and extreme conditions were handled during the implementation.

Batch Gradient descent

(2)

Update equations for \boldsymbol{w} and b. Here \boldsymbol{x} is an NxP matrix, where N is the number of data points and P is the feature dimension; y is an Nx1 vector; \boldsymbol{w} is a Px1 vector; and \boldsymbol{b}_t is an Nx1 vector whose entries all equal the bias. Below, \boldsymbol{1} denotes the Nx1 vector of ones. Without regularization:

\begin{split} \boldsymbol{w_{t+1}} & = \boldsymbol{w_t} - \eta*\left(\boldsymbol{x}^T\left(\sigma(\boldsymbol{b}_t+\boldsymbol{x}\boldsymbol{w}_t) - \boldsymbol{y} \right) \right)\\ b_{t+1} & = b_{t} - \eta * \boldsymbol{1}^T\left(\sigma(\boldsymbol{b}_t + \boldsymbol{x}\boldsymbol{w}_t) - \boldsymbol{y} \right) \end{split}

With Regularization:

\begin{split} \boldsymbol{w_{t+1}} & = \boldsymbol{w_t} - \eta*\left(\boldsymbol{x}^T\left(\sigma(\boldsymbol{b}_t+\boldsymbol{x}\boldsymbol{w}_t) - \boldsymbol{y} \right) + 2*\lambda*\boldsymbol{w}_t \right)\\ b_{t+1} & = b_{t} - \eta * \boldsymbol{1}^T\left(\sigma(\boldsymbol{b}_t + \boldsymbol{x}\boldsymbol{w}_t) - \boldsymbol{y} \right) \end{split}
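
A minimal NumPy sketch of one batch gradient-descent step implementing the updates above (\sigma is the logistic sigmoid; set lam = 0 for the unregularized case; names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gd_step(w, b, X, y, eta, lam=0.0):
    """One batch gradient-descent step for logistic regression with optional
    L2 penalty lam * ||w||^2 (the bias b is not regularized)."""
    p = sigmoid(b + X @ w)                         # predicted probabilities, shape (N,)
    grad_w = X.T @ (p - y) + 2.0 * lam * w         # gradient w.r.t. w
    grad_b = np.sum(p - y)                         # gradient w.r.t. b
    return w - eta * grad_w, b - eta * grad_b
```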

(3)a

[Figure]

(3)b

L2 norm of \boldsymbol{w} after 50 iterations for each step size \eta_i



(4)a

[Figure]

(4)b



(4)c

[Figure]

(5)

Note: here we add one column (with all entries = 1) at the beginning of the input data matrix and place the seed intercept b as row 1 of the weight vector to incorporate the bias. Without regularization (using the notation and assumptions of (2)): here I use the Hessian directly because its derivation was already given in assignment #1. H_b denotes the second derivative w.r.t. the bias.

\begin{split} H & = \frac{\partial^2 \varepsilon(\boldsymbol{w},b)}{\partial\boldsymbol{w}\partial\boldsymbol{w}^T} \\ & = \boldsymbol{x^T}S\boldsymbol{x}, \quad S = \mathrm{diag}\left(\sigma(b+ \boldsymbol{xw})\odot(1-\sigma(b+\boldsymbol{xw}))\right)\\ \frac{\partial\varepsilon(\boldsymbol{w},b)}{\partial b} & = - \sum_{i = 1}^{N} \left( y_i - \sigma(b + \boldsymbol{x_iw}) \right) \\ H_b = \frac{\partial^2\varepsilon(\boldsymbol{w},b)}{\partial b^2} & = \sum_{i=1}^{N} \sigma(b + \boldsymbol{x_iw})*\left(1-\sigma(b + \boldsymbol{x_iw}) \right) \\ \boldsymbol{w_{t+1}} & = \boldsymbol{w_t} - H_t^{-1}\nabla\varepsilon_t \\ b_{t+1} & = b_t - (H_b)^{-1} * \frac{\partial\varepsilon(\boldsymbol{w},b)}{\partial b} \end{split}

With Regularization: In this case we add 2*\lambda to the diagonal entries of the Hessian found in the unregularized case (because for i \neq j the second derivative of the regularization term is zero), and 2*\lambda*\boldsymbol{w} is added to the gradient. The bias is not affected by the regularization.

\begin{split} H & = \frac{\partial^2 \varepsilon(\boldsymbol{w},b)}{\partial\boldsymbol{w}\partial\boldsymbol{w}^T} \\ & = \boldsymbol{x^T}S\boldsymbol{x} + 2*\lambda*\boldsymbol{I}, \quad S = \mathrm{diag}\left(\sigma(b+ \boldsymbol{xw})\odot(1-\sigma(b+\boldsymbol{xw}))\right)\\ \frac{\partial\varepsilon(\boldsymbol{w},b)}{\partial b} & = - \left( \sum_{i = 1}^{N} \left( y_i - \sigma(b + \boldsymbol{x_iw}) \right) \right) \\ H_b = \frac{\partial^2\varepsilon(\boldsymbol{w},b)}{\partial b^2} & = \sum_{i=1}^{N} \sigma(b + \boldsymbol{x_iw})*\left(1-\sigma(b + \boldsymbol{x_iw}) \right) \\ \boldsymbol{w_{t+1}} & = \boldsymbol{w_t} - H_t^{-1}\nabla\varepsilon_t \\ b_{t+1} & = b_t - (H_b)^{-1} * \frac{\partial\varepsilon(\boldsymbol{w},b)}{\partial b} \end{split}
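
A minimal NumPy sketch of one Newton step under the same notation, keeping the bias as a separate scalar as in the equations above (set lam = 0 for the unregularized case; names are illustrative):

```python
import numpy as np

def newton_step(w, b, X, y, lam=0.0):
    """One Newton update for L2-regularized logistic regression.
    H = X^T S X + 2*lam*I with S = diag(p*(1-p)); the bias gets a scalar Newton step."""
    p = 1.0 / (1.0 + np.exp(-(b + X @ w)))         # sigmoid(b + X w)
    s = p * (1.0 - p)                              # diagonal of S
    grad_w = X.T @ (p - y) + 2.0 * lam * w         # gradient w.r.t. w
    H = X.T @ (s[:, None] * X) + 2.0 * lam * np.eye(X.shape[1])
    w_new = w - np.linalg.solve(H, grad_w)         # Newton step for w
    grad_b = np.sum(p - y)                         # gradient w.r.t. b
    H_b = np.sum(s)                                # second derivative w.r.t. b
    b_new = b - grad_b / H_b                       # Newton step for b
    return w_new, b_new
```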

(6)a

[Figure]

(6)b

L2 norm of w without regularization:



(6)c

Cross-entropy of the test data using Newton's method (shown up to the step at which it converges):



(7)a

[Figure]

(7)b

L2 norm of vector \boldsymbol{w}



(8)

In (3)a we see that, as the step size increases, an overdamped situation arises; even with regularization (in (4)a) we see the spikes. The behavior at higher values of \lambda can be seen in the training and testing cross-entropies for the different step sizes: at larger step sizes combined with high values of \lambda we expect high cross-entropy, which is visible in the plots. With Newton's method we see very sharp convergence within about 5 steps, but there is a trade-off in per-iteration time complexity compared with gradient descent. Although not reported here, I also observed that, given a good initial weight vector, Newton's method converges very quickly.