ELI5: The Logic Behind Coefficient Estimation in OLS Regression


























Like a lot of people, I understand how to run a linear regression, I understand how to interpret its output, and I understand its limitations.



My understanding of the mathematical underpinnings of linear regression, however, is less developed. In particular, I do not understand the logic behind how we estimate $\beta$ using the following formula:



$$ \beta = (X'X)^{-1}X'Y $$



Would anyone care to offer an intuitive explanation of why and how this process works? For example, what function does each step in the equation perform, and why is it necessary?
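In code, the formula above is only a couple of lines of linear algebra. A minimal numpy sketch on made-up data (the numbers and variable names are purely illustrative):

```python
# Minimal numpy sketch of the formula, using made-up data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)      # true intercept 2, slope 3

X = np.column_stack([np.ones_like(x), x])     # design matrix: intercept column + predictor
beta = np.linalg.inv(X.T @ X) @ X.T @ y       # beta = (X'X)^{-1} X'Y
print(beta)                                   # close to [2, 3]
```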










Tags: regression, theory






asked Dec 11 at 10:57









Jack Bailey









  • 7




    How many five year olds have learned anything about algebra, let alone matrices? I don't think it's a feasible request. Better to be clear about what kind/level of explanation you realistically seek. It would also help to clarify what it is you seek (that's not especially clear); are you asking for some outline explanation of how the formula is derived, or why a formula something like that makes sense?
    – Glen_b
    Dec 11 at 11:52
















2 Answers






































Suppose you have a model of the form:
$$X\beta = Y$$
where $X$ is a normal 2-D matrix, for ease of visualisation.
Now, if the matrix $X$ is square and invertible, then getting $\beta$ is trivial:
$$\beta = X^{-1}Y$$
And that would be the end of it.



If this is not the case, to get $\beta$ you’ll have to find a way to “approximate” the result of an inverse matrix. $X^\dagger = (X'X)^{-1}X'$ is called the (left) pseudoinverse, and it has some nice properties that make it useful for this application.



In particular, it is unique, and $XX^\dagger X = X$, so it kind of works like an inverse matrix would $(XX^{-1}X = XI = X)$. Also, for an invertible and square matrix (i.e. if the inverse matrix exists), it is equal to $X^{-1}$.



It also gets the shape of the matrix right: if $X$ has order $n \times m$, the pseudoinverse should be $m \times n$ so we can multiply it with $Y$. This is achieved by multiplying $(X'X)^{-1}$, which is square $(m \times m)$, with $X'$ $(m \times n)$.
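These properties are easy to check numerically. A minimal numpy sketch with made-up data (numpy's own `pinv` and `lstsq` are used only as cross-checks):

```python
# Numerical check of the left pseudoinverse described above (made-up data, illustrative only).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))                  # tall n x m matrix: not square, not invertible
Y = rng.normal(size=10)

X_dag = np.linalg.inv(X.T @ X) @ X.T          # left pseudoinverse, shape m x n
beta = X_dag @ Y                              # the OLS estimate

print(np.allclose(X @ X_dag @ X, X))                              # X X^dagger X = X -> True
print(np.allclose(X_dag, np.linalg.pinv(X)))                      # matches numpy's pseudoinverse
print(np.allclose(beta, np.linalg.lstsq(X, Y, rcond=None)[0]))    # matches least-squares solution
```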






answered Dec 11 at 12:08 (edited Dec 11 at 12:15)
Purple Rover























  • Thanks for your time. This was a great explanation and really useful.
    – Jack Bailey
    Dec 11 at 14:40

































If you look at sources such as Wikipedia, there are some good explanations of where this comes from. Here are some core ideas:




  1. OLS is aiming to minimize the error $\|y - X\beta\|$.


  2. The norm of a vector is minimized when its derivative is perpendicular to the vector. (Since you asked for ELI5, I won't go into a rigorous formulation of "derivative" in this context.)


  3. The error is given in terms of $y$, $X$, and $\beta$. The first two are constants; we're varying only $\beta$. Thus, the derivative can be treated as being $X\beta'$, so we're looking for $(X\beta')^T(y - X\beta) = 0$. This is equivalent to $(\beta')^TX^Ty = (\beta')^TX^TX\beta$. If we cancel the $(\beta')^T$ from both sides (normally in linear algebra, you can't just go around canceling things, but I'm not aiming for perfect rigor here, so I won't get into the justification), we're left with $X^Ty = X^TX\beta$. Now, $X^T$ isn't invertible (it isn't even square), so we can't cancel it out, but it does turn out that $X^TX$ must be invertible (assuming that the features are linearly independent). So we can get $\beta = (X^TX)^{-1}X^Ty$.



Going back to $X^Ty = X^TX\beta$, recall that $X\beta$ is the estimate $\hat y$ that is calculated from a given $\beta$. $X^Ty$ is a vector in which each entry is the dot product of one of the features with the response. So we have that $X^Ty = X^T\hat y$, i.e., for each feature, the dot product between that feature and the actual response is equal to the dot product between that feature and the estimated response: $\forall i,\ x_i^Ty = x_i^T\hat y$. We can view OLS, then, as solving $n$ equations $x_i^Ty = x_i^T\hat y$, where $n$ is the number of features. So to see why this works, we just need to show that a solution exists, and that any estimate of the response other than this solution will have larger squared error.
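These claims can also be verified numerically; a minimal numpy sketch with made-up data (names and numbers are illustrative, not part of the derivation):

```python
# Numerical check that the OLS solution satisfies X^T y = X^T y_hat (made-up data).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))                             # 4 linearly independent features
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=50)

beta = np.linalg.inv(X.T @ X) @ X.T @ y                  # beta = (X^T X)^{-1} X^T y
y_hat = X @ beta

# Each feature has the same dot product with y as with y_hat.
print(np.allclose(X.T @ y, X.T @ y_hat))                 # True

# Any other beta gives a larger squared error.
beta_other = beta + 0.1 * rng.normal(size=4)
print(np.sum((y - y_hat) ** 2) < np.sum((y - X @ beta_other) ** 2))   # True
```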






answered Dec 11 at 19:43
Acccumulation





















  • Thanks! Another good answer.
    – Jack Bailey
    Dec 12 at 11:42










