Get specific text between a certain tag in all files in a directory











I have a few hundred .txt files in a directory that have the following format:

<DOC>
<DOCNO> 33 </DOCNO>
<SOURCE> URL v.01 </SOURCE>
<URL> www.url.com/extension.html </URL>
<DATE> 2019/12/29/ </DATE>
<TIME> </TIME>
<AUTHOR> </AUTHOR>
<HEADLINE>
The title is here
</HEADLINE>
<TEXT>
Text that I want
</TEXT>
</DOC>

I would like to modify every file so that it contains only the text between the <TEXT> and </TEXT> tags (i.e. "Text that I want").

I have tried the following command, but it does not seem to do what I need:

find /root/Desktop/data/data -type f | xargs sed -n '/<TEXT/,/<\/TEXT/p'

How can I do this using a bash script (preferably using sed)?










  • You mean between the TEXT tags, to be clear - correct?
    – kabanus
    Nov 19 at 19:34

  • Yes, that is correct. For some reason they were not showing as a part of plaintext. @kabanus
    – KaanTheGuru
    Nov 19 at 19:38

  • That is because you can embed HTML in your post, so <> should always be in a code block.
    – kabanus
    Nov 19 at 19:40

  • Works for me. Did you try running find and making sure you actually get a hit with the tags?
    – kabanus
    Nov 19 at 19:46

  • Also, consider dropping xargs for a pure find solution: -execdir sed -n '/<TEXT/,/<\/TEXT/p' {} +
    – kabanus
    Nov 19 at 19:47
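
As a rough illustration of the pure find suggestion in the last comment, the sketch below runs the range print on a scratch copy of the sample document. The scratch directory and the file name sample.txt are made up for the demo, and -exec is used instead of -execdir only to keep the demo independent of the PATH setting. It also shows one possible way to leave out the two tag lines, which the plain range print includes.

```shell
#!/bin/bash
# Hypothetical scratch directory, not the asker's real data path.
workdir=$(mktemp -d)
printf '%s\n' '<DOC>' '<HEADLINE>' 'The title is here' '</HEADLINE>' \
  '<TEXT>' 'Text that I want' '</TEXT>' '</DOC>' > "$workdir/sample.txt"

# Range print without xargs: the <TEXT> and </TEXT> lines are included.
with_tags=$(find "$workdir" -type f -exec sed -n '/<TEXT>/,/<\/TEXT>/p' {} +)

# Same range, but the two tag lines are dropped before printing.
without_tags=$(find "$workdir" -type f -exec \
  sed -n '/<TEXT>/,/<\/TEXT>/{/<TEXT>/d;/<\/TEXT>/d;p;}' {} +)

printf '%s\n' "$with_tags" '---' "$without_tags"
rm -rf "$workdir"
```

Neither command modifies the files; both only print to standard output.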















bash sed tags






asked Nov 19 at 19:21 by KaanTheGuru, edited Nov 19 at 19:38

2 Answers
accepted










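You want to remove everything but the text between the TEXT tags from your files, right? This is how you do that.

find /root/Desktop/data/data -type f -execdir sed -i '0,/<TEXT>/d;/<\/TEXT>/,/<TEXT>/d' {} +

The first expression deletes everything up to and including the first <TEXT> line. The second deletes from each </TEXT> line up to the next <TEXT> line, or to the end of the file if there is none. Note that the 0,/regex/ address is a GNU sed extension.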
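
Because sed -i rewrites files in place, it may be worth trying the command on a scratch copy first. The following is a minimal sketch under a few assumptions: GNU sed is installed (for the 0,/regex/ address and this form of -i), the scratch directory and sample.txt are invented for the demo, and -exec stands in for -execdir so the demo does not depend on the PATH setting.

```shell
#!/bin/bash
# Hypothetical scratch directory; the real data directory is left alone.
workdir=$(mktemp -d)

# Recreate a shortened version of the sample document from the question.
cat > "$workdir/sample.txt" <<'EOF'
<DOC>
<DOCNO> 33 </DOCNO>
<HEADLINE>
The title is here
</HEADLINE>
<TEXT>
Text that I want
</TEXT>
</DOC>
EOF

# Same sed script as in the answer, applied in place to the scratch copy only.
find "$workdir" -type f -name '*.txt' \
  -exec sed -i '0,/<TEXT>/d;/<\/TEXT>/,/<TEXT>/d' {} +

# Show what is left in the file, then clean up.
result=$(cat "$workdir/sample.txt")
echo "$result"
rm -rf "$workdir"
```

If the scratch file ends up holding only the wanted text, the same sed script can then be pointed at the real directory, ideally after making a backup.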





  • That's it! Thanks Oguz!
    – KaanTheGuru
    Nov 19 at 19:57

  • Won't this leave the last </TEXT> and everything after it in the file since there won't be a matching opening <TEXT>?
    – Tyler Marshall
    Nov 19 at 20:02

  • @TylerMarshall nope. sed processes its input line by line, so in this case it will delete everything until either finding a line matching /<\/TEXT>/ or hitting EOF.
    – oguzismail
    Nov 19 at 20:10
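
The behaviour discussed in these comments can be checked on a few lines of made-up input. In this sketch the range opens at the </TEXT> line and its end pattern, <TEXT>, never appears again, so sed keeps deleting until the input runs out. The input lines are invented for the demo.

```shell
#!/bin/bash
# The range starts at "</TEXT>". Its end pattern "<TEXT>" never matches
# afterwards, so every remaining line is deleted up to end of input.
result=$(printf '%s\n' 'keep me' '</TEXT>' '</DOC>' 'trailing line' |
  sed '/<\/TEXT>/,/<TEXT>/d')
echo "$result"
```

Only the line before </TEXT> survives, which is why nothing after the closing tag is left in the file.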


















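If there is at most one pair of the tags you are looking for and you don't want newline characters in the text:

#!/bin/bash

# Print the text between <TEXT> and </TEXT> for every .txt file.
for file in /root/Desktop/data/data/*.txt; do
  # tr joins the file into one line; sed then keeps only the captured text.
  echo "$(tr -d '\n' < "$file" | sed -nE 's/.*<TEXT>(.*)<\/TEXT>.*/\1/p')"
done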
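
The loop above only prints the extracted text, while the question asks for each file to end up containing only that text. One possible write-back variant is sketched below. The function name extract_text is hypothetical, the sketch assumes at most one TEXT pair per file, and it skips files where no pair is found so they are not emptied.

```shell
#!/bin/bash
# extract_text FILE: overwrite FILE with the text between <TEXT> and </TEXT>.
# Hypothetical helper; assumes at most one TEXT pair per file.
extract_text() {
  local file=$1 text
  # Join all lines, then keep only the captured group.
  text=$(tr -d '\n' < "$file" | sed -nE 's/.*<TEXT>(.*)<\/TEXT>.*/\1/p')
  # Leave files without a TEXT pair untouched.
  [ -n "$text" ] || return 0
  printf '%s\n' "$text" > "$file"
}

# Demo on a scratch file (made-up content based on the question's sample).
demo=$(mktemp)
printf '%s\n' '<DOC>' '<TEXT>' 'Text that I want' '</TEXT>' '</DOC>' > "$demo"
extract_text "$demo"
result=$(cat "$demo")
echo "$result"
rm -f "$demo"
```

To apply it to the real files, the function could be called inside the same for loop in place of the echo line, preferably on a backup copy first.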





Accepted answer: answered Nov 19 at 19:54 by oguzismail, edited Nov 19 at 20:21

Second answer: answered Nov 19 at 19:57 by Tyler Marshall, edited Nov 19 at 20:13