From Bayesian inference to L2-regularized MLR

Posted on Fri 09 March 2018 in Concepts

I was really happy to read Chapter 1 of Bishop's Pattern Recognition book and realize the relationship between L2-regularized MLR and the Bayesian perspective. On page 30, he shows that if the weights follow a Gaussian prior governed by a hyperparameter \(\alpha\), then the maximum of the posterior is found by minimizing:

$$ \frac{\beta}{2}\sum\limits_{n = 1}^{N} \{y(x_{n}, \mathbf{w}) - t_{n}\}^2 + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} $$

Up to an overall factor of \(\beta\), this is exactly the L2-regularized MLR objective with:

$$ \lambda = \frac{\alpha}{\beta}$$
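To make this concrete, here is a minimal numerical sketch (the synthetic data, precision values, and variable names are my own, not Bishop's) checking that the MAP solution under a Gaussian prior with precision \(\alpha\) and Gaussian noise with precision \(\beta\) matches the closed-form ridge solution with \(\lambda = \alpha/\beta\):

```python
import numpy as np

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
N, D = 50, 4
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
t = X @ w_true + rng.normal(scale=0.3, size=N)

alpha, beta = 2.0, 25.0  # prior and noise precisions (assumed values)
lam = alpha / beta       # equivalent regularization strength

# MAP estimate: minimize (beta/2)*||X w - t||^2 + (alpha/2)*w^T w,
# whose normal equations are (beta * X^T X + alpha * I) w = beta * X^T t.
w_map = np.linalg.solve(beta * X.T @ X + alpha * np.eye(D), beta * X.T @ t)

# Ridge estimate: minimize (1/2)*||X w - t||^2 + (lam/2)*w^T w,
# whose normal equations are (X^T X + lam * I) w = X^T t.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ t)

print(np.allclose(w_map, w_ridge))  # True: the two solutions coincide
```

Dividing the first linear system by \(\beta\) gives exactly the second, which is the algebraic content of \(\lambda = \alpha/\beta\).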

NOTE: This finding was the result of a healthy discussion with Tsu-Pei.

A good question asked in this context was:

  • Why do I need to bother about the distribution in cases where the weight vector \(\mathbf{w}\) is very short, for example, 5 or fewer than 10 elements?

And my response to this question was:

  • Well, there is probably no need for regularization in cases like that. Typically, we use regularization when we have a considerable number of features relative to the number of training examples, since that is when overfitting becomes a concern (see the sketch below).
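To back that up, here is a small hypothetical experiment on synthetic data (all dimensions, noise levels, and \(\lambda\) values are made up for illustration): with 5 features, the regularized and (nearly) unregularized fits score about the same on held-out data, while with 25 features and only 30 training points, the ridge penalty tends to lower the test error.

```python
import numpy as np

# Since w_true ~ N(0, 1) and the noise std is 0.5, the generating model
# has alpha = 1 and beta = 1/0.25 = 4, so lambda = alpha/beta = 0.25 is
# a natural regularization strength under this (assumed) setup.
rng = np.random.default_rng(1)

def fit_and_score(D, N_train=30, N_test=500, noise=0.5):
    w_true = rng.normal(size=D)
    X_tr = rng.normal(size=(N_train, D))
    t_tr = X_tr @ w_true + rng.normal(scale=noise, size=N_train)
    X_te = rng.normal(size=(N_test, D))
    t_te = X_te @ w_true + rng.normal(scale=noise, size=N_test)
    for lam in (1e-8, 0.25):  # near-zero vs. matched regularization
        w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(D), X_tr.T @ t_tr)
        mse = np.mean((X_te @ w - t_te) ** 2)
        print(f"D={D:2d}, lambda={lam:g}: test MSE = {mse:.3f}")

fit_and_score(D=5)   # few features: regularization barely matters
fit_and_score(D=25)  # many features vs. N=30: regularization helps
```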