From Bayesian inference to L2-regularized MLR
Posted on Fri 09 March 2018 in Concepts
I was really happy to read Chapter 1 of Bishop's Pattern Recognition book
and realize the relationship between L2-regularized MLR and the Bayesian
perspective. On page 30, he shows that if the weights follow a Gaussian
prior distribution governed by a chosen hyper-parameter \(\alpha\), then
maximizing the posterior is equivalent to minimizing the following function:
$$ \frac{\beta}{2}\sum\limits_{n = 1}^{N} \{y(x_{n}, \mathbf{w}) - t_{n}\}^2 + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} $$
This is the objective of L2-regularized MLR with:
$$ \lambda = \frac{\alpha}{\beta}$$
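This correspondence can be checked numerically. Below is a minimal sketch (with made-up data and hyper-parameters) comparing the closed-form ridge solution using \(\lambda = \alpha/\beta\) against the MAP solution obtained by minimizing \(\frac{\beta}{2}\lVert X\mathbf{w} - \mathbf{t}\rVert^2 + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w}\) directly:

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
t = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=N)

alpha, beta = 2.0, 25.0   # prior precision and noise precision (assumed values)
lam = alpha / beta

# Ridge (L2-regularized least squares) closed form:
#   (X^T X + lambda I) w = X^T t
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ t)

# MAP estimate: setting the gradient of the penalized objective to zero gives
#   (beta X^T X + alpha I) w = beta X^T t
w_map = np.linalg.solve(beta * X.T @ X + alpha * np.eye(D), beta * X.T @ t)

print(np.allclose(w_ridge, w_map))  # the two solutions coincide
```

Dividing the MAP normal equations through by \(\beta\) recovers the ridge normal equations exactly, which is why only the ratio \(\alpha/\beta\) matters for the point estimate.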
NOTE: This finding was a result of a healthy discussion with Tsu-Pei.
A good question asked in this context was:
- Why do I need to bother about the prior distribution in cases where the weight vector \(\mathbf{w}\) is very small, for example, of length 5 or less than 10?
And my response to this question was:
- Well, there probably isn't much need for regularization in these types of cases. Ideally, we use regularization when we have a considerable number of features relative to the amount of data, since that is when overfitting becomes a concern.