How bad is Cholesky decomposition for OLS?
$begingroup$
I've heard that I should never use Cholesky decomposition for OLS as it's super unstable, so I thought I'd try a very simple example to see how bad it could be.
First let's set up the problem in R:
X <- rbind(c(1, 1), c(1.00001, 1))
kappa(t(X) %*% X)
The condition number is 160002098109! Oooh boy, coefficients will be off by at least a billion for sure!
y <- rnorm(2)
L <- chol(t(X) %*% X)
z <- solve(t(L), t(X) %*% y)
result <- data.frame(chol = solve(L, z), qr = coefficients(lm.fit(X, y, method = 'qr')))
result
chol qr
x1 -148230.9 -148231.0
x2 148231.9 148232.1
...almost no difference. What happened? If the condition number is so big that no method can handle it well, what is the range of condition numbers at which the Cholesky solution is in fact appreciably less stable than with QR decomposition?
regression machine-learning least-squares
$endgroup$
add a comment |
$begingroup$
I've heard that I should never use Cholesky decomposition for OLS as it's super unstable, so I thought I'd try a very simple example to see how bad it could be.
First let's set up the problem in R:
X <- rbind(c(1, 1), c(1.00001, 1))
kappa(t(X) %*% X)
The condition number is 160002098109! Oooh boy, coefficients will be off by at least a billion for sure!
y <- rnorm(2)
L <- chol(t(X) %*% X)
z <- solve(t(L), t(X) %*% y)
result <- data.frame(chol = solve(L, z), qr = coefficients(lm.fit(X, y, method = 'qr')))
result
chol qr
x1 -148230.9 -148231.0
x2 148231.9 148232.1
...almost no difference. What happened? If the condition number is so big that no method can handle it well, what is the range of condition numbers at which the Cholesky solution is in fact appreciably less stable than with QR decomposition?
regression machine-learning least-squares
$endgroup$
add a comment |
$begingroup$
I've heard that I should never use Cholesky decomposition for OLS as it's super unstable, so I thought I'd try a very simple example to see how bad it could be.
First let's set up the problem in R:
X <- rbind(c(1, 1), c(1.00001, 1))
kappa(t(X) %*% X)
The condition number is 160002098109! Oooh boy, coefficients will be off by at least a billion for sure!
y <- rnorm(2)
L <- chol(t(X) %*% X)
z <- solve(t(L), t(X) %*% y)
result <- data.frame(chol = solve(L, z), qr = coefficients(lm.fit(X, y, method = 'qr')))
result
chol qr
x1 -148230.9 -148231.0
x2 148231.9 148232.1
...almost no difference. What happened? If the condition number is so big that no method can handle it well, what is the range of condition numbers at which the Cholesky solution is in fact appreciably less stable than with QR decomposition?
regression machine-learning least-squares
$endgroup$
I've heard that I should never use Cholesky decomposition for OLS as it's super unstable, so I thought I'd try a very simple example to see how bad it could be.
First let's set up the problem in R:
X <- rbind(c(1, 1), c(1.00001, 1))
kappa(t(X) %*% X)
The condition number is 160002098109! Oooh boy, coefficients will be off by at least a billion for sure!
y <- rnorm(2)
L <- chol(t(X) %*% X)
z <- solve(t(L), t(X) %*% y)
result <- data.frame(chol = solve(L, z), qr = coefficients(lm.fit(X, y, method = 'qr')))
result
chol qr
x1 -148230.9 -148231.0
x2 148231.9 148232.1
...almost no difference. What happened? If the condition number is so big that no method can handle it well, what is the range of condition numbers at which the Cholesky solution is in fact appreciably less stable than with QR decomposition?
regression machine-learning least-squares
regression machine-learning least-squares
asked yesterday
badmaxbadmax
41017
41017
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
$begingroup$
What you are seeing is exactly what one would expect. The condition number of your matrix $X$ is about $10^6$. Double precision floating point calculations give about 18 figures of accuracy so, in double precision, you would expect to get at most $18-6=12$ significant figures for the estimated coefficients from QR and at most $18-6-6=6$ significant figures from Choleski, after allowing for accuracy lost due to conditioning.
(Choleski loses twice the number of significant figures because forming $X^TX$ squares the condition number.)
Hence the coefficients from the two algorithms should start to differ in the 6th significant figure, which is exactly what you see. Your results show that for x2 the relative difference between the two algorithms is about
$$ 0.2 / 148232 = 1.35 times 10^{-6},$$
which matches the condition number you started out with surprisingly well.
The condition number of $X$ is a reasonable guide to precision when the residuals are small. If we make sure that the exact coefficients are 1 and 1.00001 by
> y <- X %*% c(1,1.00001)
and rerun your calculation we get
> result
chol qr
x1 0.9999833544961814 0.999999999968327
x2 1.0000266455870466 1.000010000031673
confirming that QR is in fact correct to 12 significant figures and Choleski is correct to 6 significant figures.
The fitted values from Choleski are generally more precise than the estimated coefficients. If the fitted values are more important to you than the coefficients, and that will often be so in a highly collinear cases, then Choleski is usually fine.
Choleski is also about as good as QR when the residuals are very large, but this is the hardest case for both algorithms.
My PhD supervisor was Australia's foremost numerical analysis (Mike Osborne) and he used to say that Choleski was not as bad as it's made out to be.
The full sensitivity analysis for Choleski vs QR is given in Section 5.3.8 of Golub and Van Loan (1996), and is very much more complex than the simple calculation I used above. The sensitivity of Choleski is roughly proportion to $kappa + rho kappa^2$ where $kappa$ is the condition number of $X$ and $rho$ is a theoretical quantity that is almost impossible to compute in practice.
$rho$ depends on the size of the residuals-- it is roughly proportional to but much smaller than the average squared residual.
The sensitivity of QR is somewhat better than Choleski although not always as good as I have suggested above. Golub and Van Loan remark:
At the very minimum, this discussion should convince you how difficult it can be to choose the "right" algorithm!
Reference
Golub, GH, and Van Loan CF (1996). Matrix Computations. Johns Hopkins University Press, Baltimore.
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "65"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f389353%2fhow-bad-is-cholesky-decomposition-for-ols%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
What you are seeing is exactly what one would expect. The condition number of your matrix $X$ is about $10^6$. Double precision floating point calculations give about 18 figures of accuracy so, in double precision, you would expect to get at most $18-6=12$ significant figures for the estimated coefficients from QR and at most $18-6-6=6$ significant figures from Choleski, after allowing for accuracy lost due to conditioning.
(Choleski loses twice the number of significant figures because forming $X^TX$ squares the condition number.)
Hence the coefficients from the two algorithms should start to differ in the 6th significant figure, which is exactly what you see. Your results show that for x2 the relative difference between the two algorithms is about
$$ 0.2 / 148232 = 1.35 times 10^{-6},$$
which matches the condition number you started out with surprisingly well.
The condition number of $X$ is a reasonable guide to precision when the residuals are small. If we make sure that the exact coefficients are 1 and 1.00001 by
> y <- X %*% c(1,1.00001)
and rerun your calculation we get
> result
chol qr
x1 0.9999833544961814 0.999999999968327
x2 1.0000266455870466 1.000010000031673
confirming that QR is in fact correct to 12 significant figures and Choleski is correct to 6 significant figures.
The fitted values from Choleski are generally more precise than the estimated coefficients. If the fitted values are more important to you than the coefficients, and that will often be so in a highly collinear cases, then Choleski is usually fine.
Choleski is also about as good as QR when the residuals are very large, but this is the hardest case for both algorithms.
My PhD supervisor was Australia's foremost numerical analysis (Mike Osborne) and he used to say that Choleski was not as bad as it's made out to be.
The full sensitivity analysis for Choleski vs QR is given in Section 5.3.8 of Golub and Van Loan (1996), and is very much more complex than the simple calculation I used above. The sensitivity of Choleski is roughly proportion to $kappa + rho kappa^2$ where $kappa$ is the condition number of $X$ and $rho$ is a theoretical quantity that is almost impossible to compute in practice.
$rho$ depends on the size of the residuals-- it is roughly proportional to but much smaller than the average squared residual.
The sensitivity of QR is somewhat better than Choleski although not always as good as I have suggested above. Golub and Van Loan remark:
At the very minimum, this discussion should convince you how difficult it can be to choose the "right" algorithm!
Reference
Golub, GH, and Van Loan CF (1996). Matrix Computations. Johns Hopkins University Press, Baltimore.
$endgroup$
add a comment |
$begingroup$
What you are seeing is exactly what one would expect. The condition number of your matrix $X$ is about $10^6$. Double precision floating point calculations give about 18 figures of accuracy so, in double precision, you would expect to get at most $18-6=12$ significant figures for the estimated coefficients from QR and at most $18-6-6=6$ significant figures from Choleski, after allowing for accuracy lost due to conditioning.
(Choleski loses twice the number of significant figures because forming $X^TX$ squares the condition number.)
Hence the coefficients from the two algorithms should start to differ in the 6th significant figure, which is exactly what you see. Your results show that for x2 the relative difference between the two algorithms is about
$$ 0.2 / 148232 = 1.35 times 10^{-6},$$
which matches the condition number you started out with surprisingly well.
The condition number of $X$ is a reasonable guide to precision when the residuals are small. If we make sure that the exact coefficients are 1 and 1.00001 by
> y <- X %*% c(1,1.00001)
and rerun your calculation we get
> result
chol qr
x1 0.9999833544961814 0.999999999968327
x2 1.0000266455870466 1.000010000031673
confirming that QR is in fact correct to 12 significant figures and Choleski is correct to 6 significant figures.
The fitted values from Choleski are generally more precise than the estimated coefficients. If the fitted values are more important to you than the coefficients, and that will often be so in a highly collinear cases, then Choleski is usually fine.
Choleski is also about as good as QR when the residuals are very large, but this is the hardest case for both algorithms.
My PhD supervisor was Australia's foremost numerical analysis (Mike Osborne) and he used to say that Choleski was not as bad as it's made out to be.
The full sensitivity analysis for Choleski vs QR is given in Section 5.3.8 of Golub and Van Loan (1996), and is very much more complex than the simple calculation I used above. The sensitivity of Choleski is roughly proportion to $kappa + rho kappa^2$ where $kappa$ is the condition number of $X$ and $rho$ is a theoretical quantity that is almost impossible to compute in practice.
$rho$ depends on the size of the residuals-- it is roughly proportional to but much smaller than the average squared residual.
The sensitivity of QR is somewhat better than Choleski although not always as good as I have suggested above. Golub and Van Loan remark:
At the very minimum, this discussion should convince you how difficult it can be to choose the "right" algorithm!
Reference
Golub, GH, and Van Loan CF (1996). Matrix Computations. Johns Hopkins University Press, Baltimore.
$endgroup$
add a comment |
$begingroup$
What you are seeing is exactly what one would expect. The condition number of your matrix $X$ is about $10^6$. Double precision floating point calculations give about 18 figures of accuracy so, in double precision, you would expect to get at most $18-6=12$ significant figures for the estimated coefficients from QR and at most $18-6-6=6$ significant figures from Choleski, after allowing for accuracy lost due to conditioning.
(Choleski loses twice the number of significant figures because forming $X^TX$ squares the condition number.)
Hence the coefficients from the two algorithms should start to differ in the 6th significant figure, which is exactly what you see. Your results show that for x2 the relative difference between the two algorithms is about
$$ 0.2 / 148232 = 1.35 times 10^{-6},$$
which matches the condition number you started out with surprisingly well.
The condition number of $X$ is a reasonable guide to precision when the residuals are small. If we make sure that the exact coefficients are 1 and 1.00001 by
> y <- X %*% c(1,1.00001)
and rerun your calculation we get
> result
chol qr
x1 0.9999833544961814 0.999999999968327
x2 1.0000266455870466 1.000010000031673
confirming that QR is in fact correct to 12 significant figures and Choleski is correct to 6 significant figures.
The fitted values from Choleski are generally more precise than the estimated coefficients. If the fitted values are more important to you than the coefficients, and that will often be so in a highly collinear cases, then Choleski is usually fine.
Choleski is also about as good as QR when the residuals are very large, but this is the hardest case for both algorithms.
My PhD supervisor was Australia's foremost numerical analysis (Mike Osborne) and he used to say that Choleski was not as bad as it's made out to be.
The full sensitivity analysis for Choleski vs QR is given in Section 5.3.8 of Golub and Van Loan (1996), and is very much more complex than the simple calculation I used above. The sensitivity of Choleski is roughly proportion to $kappa + rho kappa^2$ where $kappa$ is the condition number of $X$ and $rho$ is a theoretical quantity that is almost impossible to compute in practice.
$rho$ depends on the size of the residuals-- it is roughly proportional to but much smaller than the average squared residual.
The sensitivity of QR is somewhat better than Choleski although not always as good as I have suggested above. Golub and Van Loan remark:
At the very minimum, this discussion should convince you how difficult it can be to choose the "right" algorithm!
Reference
Golub, GH, and Van Loan CF (1996). Matrix Computations. Johns Hopkins University Press, Baltimore.
$endgroup$
What you are seeing is exactly what one would expect. The condition number of your matrix $X$ is about $10^6$. Double precision floating point calculations give about 18 figures of accuracy so, in double precision, you would expect to get at most $18-6=12$ significant figures for the estimated coefficients from QR and at most $18-6-6=6$ significant figures from Choleski, after allowing for accuracy lost due to conditioning.
(Choleski loses twice the number of significant figures because forming $X^TX$ squares the condition number.)
Hence the coefficients from the two algorithms should start to differ in the 6th significant figure, which is exactly what you see. Your results show that for x2 the relative difference between the two algorithms is about
$$ 0.2 / 148232 = 1.35 times 10^{-6},$$
which matches the condition number you started out with surprisingly well.
The condition number of $X$ is a reasonable guide to precision when the residuals are small. If we make sure that the exact coefficients are 1 and 1.00001 by
> y <- X %*% c(1,1.00001)
and rerun your calculation we get
> result
chol qr
x1 0.9999833544961814 0.999999999968327
x2 1.0000266455870466 1.000010000031673
confirming that QR is in fact correct to 12 significant figures and Choleski is correct to 6 significant figures.
The fitted values from Choleski are generally more precise than the estimated coefficients. If the fitted values are more important to you than the coefficients, and that will often be so in a highly collinear cases, then Choleski is usually fine.
Choleski is also about as good as QR when the residuals are very large, but this is the hardest case for both algorithms.
My PhD supervisor was Australia's foremost numerical analysis (Mike Osborne) and he used to say that Choleski was not as bad as it's made out to be.
The full sensitivity analysis for Choleski vs QR is given in Section 5.3.8 of Golub and Van Loan (1996), and is very much more complex than the simple calculation I used above. The sensitivity of Choleski is roughly proportion to $kappa + rho kappa^2$ where $kappa$ is the condition number of $X$ and $rho$ is a theoretical quantity that is almost impossible to compute in practice.
$rho$ depends on the size of the residuals-- it is roughly proportional to but much smaller than the average squared residual.
The sensitivity of QR is somewhat better than Choleski although not always as good as I have suggested above. Golub and Van Loan remark:
At the very minimum, this discussion should convince you how difficult it can be to choose the "right" algorithm!
Reference
Golub, GH, and Van Loan CF (1996). Matrix Computations. Johns Hopkins University Press, Baltimore.
edited yesterday
answered yesterday
Gordon SmythGordon Smyth
4,7381125
4,7381125
add a comment |
add a comment |
Thanks for contributing an answer to Cross Validated!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f389353%2fhow-bad-is-cholesky-decomposition-for-ols%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown