comm -23 not deleting all common lines












I want to delete the lines of 1.txt that also appear in 2.txt and save the result to 3.txt. I am using this bash command:

comm -23 1.txt 2.txt > 3.txt

When I check 3.txt, I find that some lines common to 1.txt and 2.txt are still there, for example the word "registry". What is the problem?



You can download the two files below:

file 1.txt: https://ufile.io/n7vn6
file 2.txt: https://ufile.io/p4s58










  • I haven't checked your files but I am guessing you didn't sort them before using comm.

    – mickp
    Nov 22 '18 at 15:28













  • Is there extra whitespace you're not taking into account?

    – glenn jackman
    Nov 22 '18 at 15:43











  • They are sorted, and there is no extra whitespace.

    – Youcef
    Nov 22 '18 at 16:53
















duplicates comm






edited Nov 22 '18 at 15:42 by chepner
asked Nov 22 '18 at 15:26 by Youcef













2 Answers

I'm not sure how you generated your text files, but the problem is that the lines in 1.txt and 2.txt don't have consistent line terminators. Some lines end with a CR character (Ctrl-M, written \r) before the line feed, rather than the lone line feed that Linux expects in text files. For example, one file has registry^M, which doesn't match registry: Linux programs that examine text see ^M as just another character (or as whitespace), not as part of a line terminator to be ignored. Some text editors don't display the ^M, so registry looks the same in both places, but it isn't.



You could try:



dos2unix 1.txt 2.txt
comm -23 <(sort 1.txt) <(sort 2.txt) > 3.txt


dos2unix normalizes the line terminators (assuming the files use DOS-style CR LF endings). Removing the CR can change the sort order slightly, so I'm also re-sorting the files. You can try it without re-sorting; if there's an issue, comm (at least the GNU version) will report that one of the files is not in sorted order.
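To confirm this diagnosis before changing anything, you could count the lines that contain a CR. The following is a minimal sketch, assuming a POSIX sh with grep, tr, comm and mktemp available. The sample data and the variable names (workdir, cr, crlf_in_1, crlf_in_2, leftover) are made up for illustration and stand in for your real files:

```shell
#!/bin/sh
# Assumption for this sketch: byte-wise collation, so the result does not
# depend on how the current locale orders a trailing CR.
export LC_ALL=C

# Hypothetical sample data standing in for 1.txt and 2.txt.
workdir=$(mktemp -d)
printf 'apple\r\nregistry\r\n' > "$workdir/1.txt"   # CR LF (DOS) line endings
printf 'registry\nzebra\n'     > "$workdir/2.txt"   # LF (Unix) line endings

cr=$(printf '\r')   # a literal carriage return character

# Count the lines in each file that contain a CR.
# grep exits non-zero when the count is 0, hence the "|| true".
crlf_in_1=$(grep -c "$cr" "$workdir/1.txt" || true)
crlf_in_2=$(grep -c "$cr" "$workdir/2.txt" || true)
echo "1.txt: $crlf_in_1 line(s) with CR; 2.txt: $crlf_in_2 line(s) with CR"

# Both files are sorted, yet "registry" is still reported as unique to 1.txt,
# because the line there is really "registry" followed by a CR.
# tr strips the CR only to make the leftover lines readable.
leftover=$(comm -23 "$workdir/1.txt" "$workdir/2.txt" | tr -d '\r')
echo "$leftover"
```

On real files, a non-zero count for one file and a zero (or different) count for the other suggests mixed line endings, which is the situation described above.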






  • dos2unix solved it! Thanks man!

    – Youcef
    Nov 22 '18 at 16:53











  • Well, that would change the input files.

    – hek2mgl
    Nov 22 '18 at 16:53











  • All I need is the alphabetic characters. What would I do with those hidden characters related to some bizarre encoding? :D

    – Youcef
    Nov 22 '18 at 16:56













  • @hek2mgl that may be a good thing if the contents need fixing. Unclear what the overall use case is. The OP can choose other means of addressing the issue if needed now that they know the problem.

    – lurker
    Nov 22 '18 at 16:56











  • @Youcef it's not about alphabetic characters. This solution doesn't change the encoding, just the line endings.

    – hek2mgl
    Nov 22 '18 at 16:58




















comm needs the input to be sorted. You can use process substitution for that:



comm -23 <(sort 1.txt) <(sort 2.txt) > 3.txt


Update: if you also have a problem with line endings, you can use sed to normalize them:

comm -23 <(sed 's/\r//g' 1.txt | sort) <(sed 's/\r//g' 2.txt | sort) > 3.txt
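If your shell has no process substitution (plain sh, for example), or you prefer tr over sed, the same idea can be sketched with intermediate files. This is only an illustration: the sample data, the strip_and_sort helper and the .clean file names are assumptions, not part of the commands above. The input files are left untouched:

```shell
#!/bin/sh
# Assumption for this sketch: use the same byte-wise collation for sort and comm.
export LC_ALL=C

# Hypothetical sample inputs standing in for 1.txt and 2.txt.
tmp=$(mktemp -d)
printf 'registry\r\napple\r\nbanana\n' > "$tmp/1.txt"   # unsorted, mixed endings
printf 'zebra\nregistry\n'             > "$tmp/2.txt"   # unsorted, LF endings

# Hypothetical helper: print the file with every CR removed, sorted.
strip_and_sort() {
    tr -d '\r' < "$1" | sort
}

# Write cleaned copies next to the originals instead of editing them in place.
strip_and_sort "$tmp/1.txt" > "$tmp/1.clean"
strip_and_sort "$tmp/2.txt" > "$tmp/2.clean"

# Keep only the lines that are unique to the first file.
comm -23 "$tmp/1.clean" "$tmp/2.clean" > "$tmp/3.txt"
cat "$tmp/3.txt"
```

The trade-off compared with dos2unix is that the originals keep their CR characters, so the stripping has to be repeated each time the files are compared.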





First answer (dos2unix): answered Nov 22 '18 at 16:06 by lurker, edited Nov 22 '18 at 16:31








Second answer (sort and sed): answered Nov 22 '18 at 15:31 by hek2mgl, edited Nov 22 '18 at 16:53





























