Why sum of squared errors?

Averages correspond to evenly distributing the pie. Averages play nicely with affine transformations. Higher-dimensional averages correspond to the centre of mass. So I think it makes most sense to go from averages to squared error, normality, etc. Ridge regression vs. Lasso is about the type of regularization, not about the loss, so that example doesn't fit with the rest of your post.

Thanks for a good post on a point I care about; I'm still trying to understand why I care about expected values (and hence squared errors), and how I might convince myself not to. One reason you might care about expected values is the von Neumann-Morgenstern theorem, which roughly states that any decision-maker whose decisions satisfy certain consistency axioms behaves as if maximizing the expected value of some utility function.
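For reference, and in my own notation since the comment states it only informally: the theorem says that if a preference relation over lotteries satisfies completeness, transitivity, continuity and independence, then there is a utility function u, unique up to positive affine transformation, such that

```latex
L \succeq M \quad\Longleftrightarrow\quad \mathbb{E}_{L}[u] \;\ge\; \mathbb{E}_{M}[u].
```

In other words, any "consistent" ranking of gambles is equivalent to comparing expected values of some score, which is one route from decision-making to caring about expectations in the first place.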

And because the Gaussian arises as the large-sample limit of means (the central limit theorem), the squared error becomes a central quantity in statistical theory. This is a great post that tries to resolve this commonly posed question in a variety of different ways. I think one of the reasons we naturally regard the squared error as more mathematically amenable is that our mathematics education has traditionally placed calculus at the pinnacle.

This is because a career in science or engineering, which fundamentally depends on calculus, has typically been viewed more favourably than a career in statistics, and so discrete mathematics has traditionally been treated as a poor cousin of calculus. This, in turn, has meant that in many ways the absolute-value function has been a poor cousin of the quadratic function. Two quick examples come to mind.

First, in the deep-learning space, most neural networks were originally based on the classic differentiable sigmoid activations, such as the logistic function or the hyperbolic tangent, whereas the non-differentiable rectified linear unit (ReLU) has now become the standard, default choice. Similarly, before deep learning became all the rage, data scientists were discovering that many classification and regression algorithms originally formulated for the quadratic error, such as LASSO, have much nicer and more intuitive results (e.g. variable selection) when recast in terms of the absolute error.

That is, I do not think the value of differentiability, and of mathematical formulations that admit closed-form solutions (including the quadratic loss), will decrease per se, but I do believe we are only recently starting to discover (rediscover?) the merits of the alternatives. Why squared error? The absolute error may seem like the more natural choice; however, the squared error has much nicer mathematical properties.
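To make "nicer mathematical properties" concrete with one standard example (a minimal sketch, not from the original post; the data are made up): the value c minimising the sum of squared errors is the sample mean, obtained in closed form by setting the derivative to zero, while the minimiser of the sum of absolute errors is the median, which has no such smooth closed-form derivation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

data = np.array([1.0, 2.0, 2.5, 4.0, 100.0])  # note the gross outlier

# Minimiser of the sum of squared errors: coincides with the mean.
c_sq = minimize_scalar(lambda c: ((data - c) ** 2).sum()).x
# Minimiser of the sum of absolute errors: coincides with the median.
c_abs = minimize_scalar(lambda c: np.abs(data - c).sum()).x

print(c_sq, data.mean())        # ~21.9 for both
print(c_abs, np.median(data))   # ~2.5 for both
```

The same toy example also previews the robustness point made further down: the mean (squared error) is dragged far out by the single outlier, while the median (absolute error) barely moves.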

Looking deeper

One might well ask whether there is some deep mathematical truth underlying the many different conveniences of the squared error. As far as I know, there are a few, which are related in some sense but are not, I would say, the same:

Differentiability. The squared error is everywhere differentiable, while the absolute error is not (its derivative is undefined at 0).

Inner products. The squared error is induced by an inner product on the underlying space.

I can quickly summarise: your regression parameters are the solutions of a maximum-likelihood optimisation.
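To spell out that summary (standard textbook algebra, in my own notation since the answer doesn't show it): if y_i = x_i^T beta + eps_i with the errors i.i.d. Gaussian with variance sigma^2, the log-likelihood is

```latex
\ell(\beta, \sigma^2)
  = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^\top\beta\right)^2 ,
```

so for any fixed sigma^2, maximising over beta is exactly minimising the sum of squared errors; setting the gradient to zero gives the normal equations X^T X beta = X^T y and hence a closed-form solution.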

That involves taking derivatives, but the absolute difference has no derivative at zero. There is also no unique solution for least-absolute-deviations regression in general. Least absolute regression is an alternative to the usual sum-of-squares regression and is commonly classified as one of the robust statistical methods. You'd prefer least absolute regression if you are worried about outliers; otherwise the regular regression is generally better.
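A minimal sketch of that trade-off (toy data and variable names are my own, not from the answer): ordinary least squares has a unique closed-form solution but gets pulled toward an outlier, while least-absolute-deviations regression must be solved numerically but largely ignores it.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy data from y = 1 + 2x plus noise, with one gross outlier tacked on.
x = np.linspace(0, 10, 50)
y = 1 + 2 * x + rng.normal(scale=0.5, size=x.size)
y[-1] += 30                                  # the outlier

X = np.column_stack([np.ones_like(x), x])    # design matrix [1, x]

# OLS: unique solution of the normal equations (closed form).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# LAD: no closed form and no derivative at zero residuals, so minimise numerically.
lad_loss = lambda beta: np.abs(y - X @ beta).sum()
beta_lad = minimize(lad_loss, x0=beta_ols, method="Nelder-Mead").x

print("OLS intercept, slope:", beta_ols)   # noticeably shifted by the outlier
print("LAD intercept, slope:", beta_lad)   # stays close to (1, 2)
```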

Thank you.

I tried googling something like "why do we use squares instead of absolute value in linear regression," but maybe that was too specific. Thank you for the information! I can do some reading to let it sink in a bit more. See: "Why squared residuals instead of absolute residuals in OLS estimation?" The former is actually a duplicate of the latter. You may also benefit from the answers to the post "square things in statistics - generalized rationale".

This will determine the distance of each of cell i's variables v from the corresponding variable of the mean vector (x̄_v), and add it to the same quantity for cell j.

This is actually the same as taking equation 5 and dividing by 2; the '2' is there because it is an average over the 2 cells. This cluster is never going to be broken apart again for the rest of the clustering stages; only single cells or cells in other clusters may join it. This again has to be added, giving a total SSE at stage 3 of 1.

At the 4th stage something different happens, and this is why equation 3 has to be used for the purposes of Ward's Method (the distance d_k). I've calculated this in the Excel spreadsheet here. Now these are the clusters at stage 4 (the rest are single cells and don't contribute to the SSE). Equation 3 can be used at all stages; it's just that when only 2 cells are being joined it reduces to equation 7.
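Since equations 3, 5 and 7 themselves don't appear above, here is a small sketch of the bookkeeping they describe, under the usual definition that a cluster's SSE is the sum of squared deviations of its members from the cluster mean, and that Ward's method merges the pair whose union increases the total SSE the least (the example cells are made up, not the spreadsheet's values):

```python
import numpy as np

def sse(points):
    """Sum of squared deviations of the points from their own centroid,
    summed over all variables: the per-cluster error sum of squares."""
    pts = np.asarray(points, dtype=float)
    return ((pts - pts.mean(axis=0)) ** 2).sum()

def ward_merge_cost(cluster_a, cluster_b):
    """Increase in total SSE caused by merging the two clusters;
    Ward's method joins the pair for which this increase is smallest."""
    merged = np.vstack([cluster_a, cluster_b])
    return sse(merged) - sse(cluster_a) - sse(cluster_b)

# Hypothetical two-variable cells standing in for the spreadsheet data.
cell_i = np.array([[1.0, 2.0]])
cell_j = np.array([[2.0, 1.0]])
cluster = np.vstack([cell_i, cell_j])

print(sse(cluster))                     # SSE of the merged 2-cell cluster
print(ward_merge_cost(cell_i, cell_j))  # same number: single cells have SSE 0
```

When one of the things being merged is already a multi-cell cluster (as at stage 4 above), the same merge-cost calculation still applies; the 2-cell shortcut is just the special case where both inputs have zero SSE of their own.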


