How bad is Cholesky decomposition for OLS?












2












$begingroup$


I've heard that I should never use Cholesky decomposition for OLS as it's super unstable, so I thought I'd try a very simple example to see how bad it could be.



First let's set up the problem in R:



X <- rbind(c(1, 1), c(1.00001, 1))
kappa(t(X) %*% X)


The condition number is 160002098109! Oooh boy, coefficients will be off by at least a billion for sure!



y <- rnorm(2)
L <- chol(t(X) %*% X)
z <- solve(t(L), t(X) %*% y)

result <- data.frame(chol = solve(L, z), qr = coefficients(lm.fit(X, y, method = 'qr')))
result

chol qr
x1 -148230.9 -148231.0
x2 148231.9 148232.1


...almost no difference. What happened? If the condition number is so big that no method can handle it well, what is the range of condition numbers at which the Cholesky solution is in fact appreciably less stable than with QR decomposition?










share|cite|improve this question









$endgroup$

















    2












    $begingroup$


    I've heard that I should never use Cholesky decomposition for OLS as it's super unstable, so I thought I'd try a very simple example to see how bad it could be.



    First let's set up the problem in R:



    X <- rbind(c(1, 1), c(1.00001, 1))
    kappa(t(X) %*% X)


    The condition number is 160002098109! Oooh boy, coefficients will be off by at least a billion for sure!



    y <- rnorm(2)
    L <- chol(t(X) %*% X)
    z <- solve(t(L), t(X) %*% y)

    result <- data.frame(chol = solve(L, z), qr = coefficients(lm.fit(X, y, method = 'qr')))
    result

    chol qr
    x1 -148230.9 -148231.0
    x2 148231.9 148232.1


    ...almost no difference. What happened? If the condition number is so big that no method can handle it well, what is the range of condition numbers at which the Cholesky solution is in fact appreciably less stable than with QR decomposition?










    share|cite|improve this question









    $endgroup$















      2












      2








      2


      2



      $begingroup$


      I've heard that I should never use Cholesky decomposition for OLS as it's super unstable, so I thought I'd try a very simple example to see how bad it could be.



      First let's set up the problem in R:



      X <- rbind(c(1, 1), c(1.00001, 1))
      kappa(t(X) %*% X)


      The condition number is 160002098109! Oooh boy, coefficients will be off by at least a billion for sure!



      y <- rnorm(2)
      L <- chol(t(X) %*% X)
      z <- solve(t(L), t(X) %*% y)

      result <- data.frame(chol = solve(L, z), qr = coefficients(lm.fit(X, y, method = 'qr')))
      result

      chol qr
      x1 -148230.9 -148231.0
      x2 148231.9 148232.1


      ...almost no difference. What happened? If the condition number is so big that no method can handle it well, what is the range of condition numbers at which the Cholesky solution is in fact appreciably less stable than with QR decomposition?










      share|cite|improve this question









      $endgroup$




      I've heard that I should never use Cholesky decomposition for OLS as it's super unstable, so I thought I'd try a very simple example to see how bad it could be.



      First let's set up the problem in R:



      X <- rbind(c(1, 1), c(1.00001, 1))
      kappa(t(X) %*% X)


      The condition number is 160002098109! Oooh boy, coefficients will be off by at least a billion for sure!



      y <- rnorm(2)
      L <- chol(t(X) %*% X)
      z <- solve(t(L), t(X) %*% y)

      result <- data.frame(chol = solve(L, z), qr = coefficients(lm.fit(X, y, method = 'qr')))
      result

      chol qr
      x1 -148230.9 -148231.0
      x2 148231.9 148232.1


      ...almost no difference. What happened? If the condition number is so big that no method can handle it well, what is the range of condition numbers at which the Cholesky solution is in fact appreciably less stable than with QR decomposition?







      regression machine-learning least-squares






      share|cite|improve this question













      share|cite|improve this question











      share|cite|improve this question




      share|cite|improve this question










      asked yesterday









      badmaxbadmax

      41017




      41017






















          1 Answer
          1






          active

          oldest

          votes


















          3












          $begingroup$

          What you are seeing is exactly what one would expect. The condition number of your matrix $X$ is about $10^6$. Double precision floating point calculations give about 18 figures of accuracy so, in double precision, you would expect to get at most $18-6=12$ significant figures for the estimated coefficients from QR and at most $18-6-6=6$ significant figures from Choleski, after allowing for accuracy lost due to conditioning.
          (Choleski loses twice the number of significant figures because forming $X^TX$ squares the condition number.)
          Hence the coefficients from the two algorithms should start to differ in the 6th significant figure, which is exactly what you see. Your results show that for x2 the relative difference between the two algorithms is about
          $$ 0.2 / 148232 = 1.35 times 10^{-6},$$
          which matches the condition number you started out with surprisingly well.



          The condition number of $X$ is a reasonable guide to precision when the residuals are small. If we make sure that the exact coefficients are 1 and 1.00001 by



          > y <- X %*% c(1,1.00001)


          and rerun your calculation we get



          > result
          chol qr
          x1 0.9999833544961814 0.999999999968327
          x2 1.0000266455870466 1.000010000031673


          confirming that QR is in fact correct to 12 significant figures and Choleski is correct to 6 significant figures.



          The fitted values from Choleski are generally more precise than the estimated coefficients. If the fitted values are more important to you than the coefficients, and that will often be so in a highly collinear cases, then Choleski is usually fine.
          Choleski is also about as good as QR when the residuals are very large, but this is the hardest case for both algorithms.
          My PhD supervisor was Australia's foremost numerical analysis (Mike Osborne) and he used to say that Choleski was not as bad as it's made out to be.



          The full sensitivity analysis for Choleski vs QR is given in Section 5.3.8 of Golub and Van Loan (1996), and is very much more complex than the simple calculation I used above. The sensitivity of Choleski is roughly proportion to $kappa + rho kappa^2$ where $kappa$ is the condition number of $X$ and $rho$ is a theoretical quantity that is almost impossible to compute in practice.
          $rho$ depends on the size of the residuals-- it is roughly proportional to but much smaller than the average squared residual.
          The sensitivity of QR is somewhat better than Choleski although not always as good as I have suggested above. Golub and Van Loan remark:




          At the very minimum, this discussion should convince you how difficult it can be to choose the "right" algorithm!




          Reference



          Golub, GH, and Van Loan CF (1996). Matrix Computations. Johns Hopkins University Press, Baltimore.






          share|cite|improve this answer











          $endgroup$













            Your Answer





            StackExchange.ifUsing("editor", function () {
            return StackExchange.using("mathjaxEditing", function () {
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
            });
            });
            }, "mathjax-editing");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "65"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: false,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f389353%2fhow-bad-is-cholesky-decomposition-for-ols%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            3












            $begingroup$

            What you are seeing is exactly what one would expect. The condition number of your matrix $X$ is about $10^6$. Double precision floating point calculations give about 18 figures of accuracy so, in double precision, you would expect to get at most $18-6=12$ significant figures for the estimated coefficients from QR and at most $18-6-6=6$ significant figures from Choleski, after allowing for accuracy lost due to conditioning.
            (Choleski loses twice the number of significant figures because forming $X^TX$ squares the condition number.)
            Hence the coefficients from the two algorithms should start to differ in the 6th significant figure, which is exactly what you see. Your results show that for x2 the relative difference between the two algorithms is about
            $$ 0.2 / 148232 = 1.35 times 10^{-6},$$
            which matches the condition number you started out with surprisingly well.



            The condition number of $X$ is a reasonable guide to precision when the residuals are small. If we make sure that the exact coefficients are 1 and 1.00001 by



            > y <- X %*% c(1,1.00001)


            and rerun your calculation we get



            > result
            chol qr
            x1 0.9999833544961814 0.999999999968327
            x2 1.0000266455870466 1.000010000031673


            confirming that QR is in fact correct to 12 significant figures and Choleski is correct to 6 significant figures.



            The fitted values from Choleski are generally more precise than the estimated coefficients. If the fitted values are more important to you than the coefficients, and that will often be so in a highly collinear cases, then Choleski is usually fine.
            Choleski is also about as good as QR when the residuals are very large, but this is the hardest case for both algorithms.
            My PhD supervisor was Australia's foremost numerical analysis (Mike Osborne) and he used to say that Choleski was not as bad as it's made out to be.



            The full sensitivity analysis for Choleski vs QR is given in Section 5.3.8 of Golub and Van Loan (1996), and is very much more complex than the simple calculation I used above. The sensitivity of Choleski is roughly proportion to $kappa + rho kappa^2$ where $kappa$ is the condition number of $X$ and $rho$ is a theoretical quantity that is almost impossible to compute in practice.
            $rho$ depends on the size of the residuals-- it is roughly proportional to but much smaller than the average squared residual.
            The sensitivity of QR is somewhat better than Choleski although not always as good as I have suggested above. Golub and Van Loan remark:




            At the very minimum, this discussion should convince you how difficult it can be to choose the "right" algorithm!




            Reference



            Golub, GH, and Van Loan CF (1996). Matrix Computations. Johns Hopkins University Press, Baltimore.






            share|cite|improve this answer











            $endgroup$


















              3












              $begingroup$

              What you are seeing is exactly what one would expect. The condition number of your matrix $X$ is about $10^6$. Double precision floating point calculations give about 18 figures of accuracy so, in double precision, you would expect to get at most $18-6=12$ significant figures for the estimated coefficients from QR and at most $18-6-6=6$ significant figures from Choleski, after allowing for accuracy lost due to conditioning.
              (Choleski loses twice the number of significant figures because forming $X^TX$ squares the condition number.)
              Hence the coefficients from the two algorithms should start to differ in the 6th significant figure, which is exactly what you see. Your results show that for x2 the relative difference between the two algorithms is about
              $$ 0.2 / 148232 = 1.35 times 10^{-6},$$
              which matches the condition number you started out with surprisingly well.



              The condition number of $X$ is a reasonable guide to precision when the residuals are small. If we make sure that the exact coefficients are 1 and 1.00001 by



              > y <- X %*% c(1,1.00001)


              and rerun your calculation we get



              > result
              chol qr
              x1 0.9999833544961814 0.999999999968327
              x2 1.0000266455870466 1.000010000031673


              confirming that QR is in fact correct to 12 significant figures and Choleski is correct to 6 significant figures.



              The fitted values from Choleski are generally more precise than the estimated coefficients. If the fitted values are more important to you than the coefficients, and that will often be so in a highly collinear cases, then Choleski is usually fine.
              Choleski is also about as good as QR when the residuals are very large, but this is the hardest case for both algorithms.
              My PhD supervisor was Australia's foremost numerical analysis (Mike Osborne) and he used to say that Choleski was not as bad as it's made out to be.



              The full sensitivity analysis for Choleski vs QR is given in Section 5.3.8 of Golub and Van Loan (1996), and is very much more complex than the simple calculation I used above. The sensitivity of Choleski is roughly proportion to $kappa + rho kappa^2$ where $kappa$ is the condition number of $X$ and $rho$ is a theoretical quantity that is almost impossible to compute in practice.
              $rho$ depends on the size of the residuals-- it is roughly proportional to but much smaller than the average squared residual.
              The sensitivity of QR is somewhat better than Choleski although not always as good as I have suggested above. Golub and Van Loan remark:




              At the very minimum, this discussion should convince you how difficult it can be to choose the "right" algorithm!




              Reference



              Golub, GH, and Van Loan CF (1996). Matrix Computations. Johns Hopkins University Press, Baltimore.






              share|cite|improve this answer











              $endgroup$
















                3












                3








                3





                $begingroup$

                What you are seeing is exactly what one would expect. The condition number of your matrix $X$ is about $10^6$. Double precision floating point calculations give about 18 figures of accuracy so, in double precision, you would expect to get at most $18-6=12$ significant figures for the estimated coefficients from QR and at most $18-6-6=6$ significant figures from Choleski, after allowing for accuracy lost due to conditioning.
                (Choleski loses twice the number of significant figures because forming $X^TX$ squares the condition number.)
                Hence the coefficients from the two algorithms should start to differ in the 6th significant figure, which is exactly what you see. Your results show that for x2 the relative difference between the two algorithms is about
                $$ 0.2 / 148232 = 1.35 times 10^{-6},$$
                which matches the condition number you started out with surprisingly well.



                The condition number of $X$ is a reasonable guide to precision when the residuals are small. If we make sure that the exact coefficients are 1 and 1.00001 by



                > y <- X %*% c(1,1.00001)


                and rerun your calculation we get



                > result
                chol qr
                x1 0.9999833544961814 0.999999999968327
                x2 1.0000266455870466 1.000010000031673


                confirming that QR is in fact correct to 12 significant figures and Choleski is correct to 6 significant figures.



                The fitted values from Choleski are generally more precise than the estimated coefficients. If the fitted values are more important to you than the coefficients, and that will often be so in a highly collinear cases, then Choleski is usually fine.
                Choleski is also about as good as QR when the residuals are very large, but this is the hardest case for both algorithms.
                My PhD supervisor was Australia's foremost numerical analysis (Mike Osborne) and he used to say that Choleski was not as bad as it's made out to be.



                The full sensitivity analysis for Choleski vs QR is given in Section 5.3.8 of Golub and Van Loan (1996), and is very much more complex than the simple calculation I used above. The sensitivity of Choleski is roughly proportion to $kappa + rho kappa^2$ where $kappa$ is the condition number of $X$ and $rho$ is a theoretical quantity that is almost impossible to compute in practice.
                $rho$ depends on the size of the residuals-- it is roughly proportional to but much smaller than the average squared residual.
                The sensitivity of QR is somewhat better than Choleski although not always as good as I have suggested above. Golub and Van Loan remark:




                At the very minimum, this discussion should convince you how difficult it can be to choose the "right" algorithm!




                Reference



                Golub, GH, and Van Loan CF (1996). Matrix Computations. Johns Hopkins University Press, Baltimore.






                share|cite|improve this answer











                $endgroup$



                What you are seeing is exactly what one would expect. The condition number of your matrix $X$ is about $10^6$. Double precision floating point calculations give about 18 figures of accuracy so, in double precision, you would expect to get at most $18-6=12$ significant figures for the estimated coefficients from QR and at most $18-6-6=6$ significant figures from Choleski, after allowing for accuracy lost due to conditioning.
                (Choleski loses twice the number of significant figures because forming $X^TX$ squares the condition number.)
                Hence the coefficients from the two algorithms should start to differ in the 6th significant figure, which is exactly what you see. Your results show that for x2 the relative difference between the two algorithms is about
                $$ 0.2 / 148232 = 1.35 times 10^{-6},$$
                which matches the condition number you started out with surprisingly well.



                The condition number of $X$ is a reasonable guide to precision when the residuals are small. If we make sure that the exact coefficients are 1 and 1.00001 by



                > y <- X %*% c(1,1.00001)


                and rerun your calculation we get



                > result
                chol qr
                x1 0.9999833544961814 0.999999999968327
                x2 1.0000266455870466 1.000010000031673


                confirming that QR is in fact correct to 12 significant figures and Choleski is correct to 6 significant figures.



                The fitted values from Choleski are generally more precise than the estimated coefficients. If the fitted values are more important to you than the coefficients, and that will often be so in a highly collinear cases, then Choleski is usually fine.
                Choleski is also about as good as QR when the residuals are very large, but this is the hardest case for both algorithms.
                My PhD supervisor was Australia's foremost numerical analysis (Mike Osborne) and he used to say that Choleski was not as bad as it's made out to be.



                The full sensitivity analysis for Choleski vs QR is given in Section 5.3.8 of Golub and Van Loan (1996), and is very much more complex than the simple calculation I used above. The sensitivity of Choleski is roughly proportion to $kappa + rho kappa^2$ where $kappa$ is the condition number of $X$ and $rho$ is a theoretical quantity that is almost impossible to compute in practice.
                $rho$ depends on the size of the residuals-- it is roughly proportional to but much smaller than the average squared residual.
                The sensitivity of QR is somewhat better than Choleski although not always as good as I have suggested above. Golub and Van Loan remark:




                At the very minimum, this discussion should convince you how difficult it can be to choose the "right" algorithm!




                Reference



                Golub, GH, and Van Loan CF (1996). Matrix Computations. Johns Hopkins University Press, Baltimore.







                share|cite|improve this answer














                share|cite|improve this answer



                share|cite|improve this answer








                edited yesterday

























                answered yesterday









                Gordon SmythGordon Smyth

                4,7381125




                4,7381125






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Cross Validated!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    Use MathJax to format equations. MathJax reference.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f389353%2fhow-bad-is-cholesky-decomposition-for-ols%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Paul Cézanne

                    UIScrollView CustomStickyHeader Resize height generates problems when scroll is too fast

                    Angular material date-picker (MatDatepicker) auto completes the date on focus out