Get specific text between a certain tag in all files in a directory
I have a few hundred .txt files in a directory that have the following format:
<DOC>
<DOCNO> 33 </DOCNO>
<SOURCE> URL v.01 </SOURCE>
<URL> www.url.com/extension.html </URL>
<DATE> 2019/12/29/ </DATE>
<TIME> </TIME>
<AUTHOR> </AUTHOR>
<HEADLINE>
The title is here
</HEADLINE>
<TEXT>
Text that I want
</TEXT>
</DOC>
I would like to rewrite every file so that it contains only the text between the <TEXT> and </TEXT> tags (i.e. Text that I want).
I have tried the following code but it does not seem to do what I need:
find /root/Desktop/data/data -type f | xargs sed -n '/<TEXT/,/<\/TEXT/p'
How can I do this using a bash script (preferably using sed)?
bash sed tags
edited Nov 19 at 19:38
asked Nov 19 at 19:21
KaanTheGuru
You mean between the TEXT tags, to be clear - correct?
– kabanus
Nov 19 at 19:34
Yes, that is correct. For some reason they were not showing as a part of plaintext. @kabanus
– KaanTheGuru
Nov 19 at 19:38
That is because you can embed HTML in your post so <> should always be in a code block.
– kabanus
Nov 19 at 19:40
Works for me. Did you try running find and making sure you actually get a hit with the tags?
– kabanus
Nov 19 at 19:46
Also, consider dropping xargs for a pure find solution: -execdir sed -n '/<TEXT/,/<\/TEXT/p' {} +
– kabanus
Nov 19 at 19:47
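The check suggested in the comments (confirming that find really reaches files containing the tags) could look roughly like this. It is a minimal sketch, assuming GNU find, grep and sed; the scratch directory and the file names with_tag.txt and without_tag.txt are made up for illustration and are not part of the original data.

```bash
#!/bin/bash
# Hypothetical scratch directory: one file has the tags, the other does not.
workdir=$(mktemp -d)
printf '%s\n' '<DOC>' '<TEXT>' 'Text that I want' '</TEXT>' '</DOC>' > "$workdir/with_tag.txt"
printf '%s\n' '<DOC>' 'no text section here' '</DOC>' > "$workdir/without_tag.txt"

# List only the files in which the opening tag occurs.
hits=$(find "$workdir" -type f -exec grep -l '<TEXT>' {} +)
printf 'files with a <TEXT> tag:\n%s\n' "$hits"

# The range print from the question writes to stdout and keeps the tag lines;
# the files themselves are left unchanged.
printed=$(find "$workdir" -type f -exec sed -n '/<TEXT/,/<\/TEXT/p' {} +)
printf '%s\n' "$printed"
```

If the first command lists no files, the sed expression is not the problem; the path or the tag spelling is. The second command also shows why the attempt falls short of the goal: it prints the tag lines along with the text and never edits the files.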
2 Answers
accepted
You want to remove everything but the text between TEXT tags from your files, right? This is how you do that.
find /root/Desktop/data/data -type f -execdir sed -i '0,/<TEXT>/d;/<\/TEXT>/,/<TEXT>/d' {} +
edited Nov 19 at 20:21
answered Nov 19 at 19:54
oguzismail
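Because sed -i overwrites files, it may be worth trying the expression from the accepted answer on throwaway copies first. The following is only a sketch under stated assumptions: GNU sed (the 0,/re/ address and -i without a suffix are GNU extensions), a scratch directory from mktemp, and two invented files a.txt and b.txt. It calls sed directly instead of going through find, to keep the focus on the expression.

```bash
#!/bin/bash
# Assumes GNU sed: the 0,/re/ address and -i without a suffix are GNU extensions.
set -eu
workdir=$(mktemp -d)   # hypothetical scratch directory, not the real data
for name in a b; do
  cat > "$workdir/$name.txt" <<EOF
<DOC>
<DOCNO> 33 </DOCNO>
<HEADLINE>
The title is here
</HEADLINE>
<TEXT>
Text for $name
</TEXT>
</DOC>
EOF
done

# First expression: delete from the start of the file through the <TEXT> line.
# Second expression: delete from </TEXT> through the next <TEXT> or end of file.
sed -i '0,/<TEXT>/d;/<\/TEXT>/,/<TEXT>/d' "$workdir"/a.txt "$workdir"/b.txt

result_a=$(cat "$workdir/a.txt")
result_b=$(cat "$workdir/b.txt")
printf '%s\n%s\n' "$result_a" "$result_b"
```

Passing two files in one call mirrors what find does with {} +, where several file names are handed to a single sed invocation; with -i each file is treated separately.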
That's it! Thanks Oguz!
– KaanTheGuru
Nov 19 at 19:57
Won't this leave the last </TEXT> and everything after it in the file since there won't be a matching opening <TEXT>?
– Tyler Marshall
Nov 19 at 20:02
@TylerMarshall nope. sed processes its input line by line, so in this case it will delete everything until either finding a line matching /<\/TEXT>/ or hitting EOF.
– oguzismail
Nov 19 at 20:10
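The behaviour discussed in these comments can be seen on a small input with two TEXT sections. This is an illustrative sketch only, assuming GNU sed; the input is invented and the expression is run on stdin without -i, so nothing is modified. The second range starts at a </TEXT> line and runs through the next <TEXT> line or, when there is none, to the end of input.

```bash
#!/bin/bash
# Assumes GNU sed (the 0,/re/ address is a GNU extension).
# Made-up input with two TEXT sections and unrelated lines between and after them.
input='<DOC>
<TEXT>
first
</TEXT>
<HEADLINE>
skip me
</HEADLINE>
<TEXT>
second
</TEXT>
</DOC>'

output=$(printf '%s\n' "$input" | sed '0,/<TEXT>/d;/<\/TEXT>/,/<TEXT>/d')
printf '%s\n' "$output"
```

Only the lines between the tag pairs should survive: the final </TEXT> opens a range that never finds another <TEXT>, so it swallows the rest of the input.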
If there is at most one pair of the tags you are looking for and you don't want newline characters in the text:
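#!/bin/bash
for file in /root/Desktop/data/data/*.txt; do
    # Join the file into a single line, then print only what sits between the tags.
    # The leading and trailing .* discard everything outside the <TEXT> pair.
    echo "$(tr -d '\n' < "$file" | sed -nE 's/.*<TEXT>(.*)<\/TEXT>.*/\1/p')"
done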
edited Nov 19 at 20:13
answered Nov 19 at 19:57
Tyler Marshall
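The loop in this answer prints the extracted text rather than changing the files. If the files themselves should end up containing only that text, one possible variation writes each result to a temporary file and moves it into place. This is a hedged sketch, not part of the original answer: it assumes GNU sed, at most one TEXT pair per file, and a scratch directory with an invented sample.txt.

```bash
#!/bin/bash
# Assumes at most one <TEXT>...</TEXT> pair per file and GNU sed.
set -eu
workdir=$(mktemp -d)   # hypothetical scratch directory
printf '%s\n' '<DOC>' '<HEADLINE>' 'The title is here' '</HEADLINE>' \
  '<TEXT>' 'Text that I want' '</TEXT>' '</DOC>' > "$workdir/sample.txt"

for file in "$workdir"/*.txt; do
  tmp=$(mktemp)
  # Join the file into one line, then keep only what sits between the tags.
  tr -d '\n' < "$file" | sed -nE 's/.*<TEXT>(.*)<\/TEXT>.*/\1/p' > "$tmp"
  # Replace the original only after the extraction has been written out.
  mv "$tmp" "$file"
done

result=$(cat "$workdir/sample.txt")
printf '%s\n' "$result"
```

Writing to a temporary file first avoids reading and truncating the same file in one pipeline. Note that a file without a TEXT pair would end up empty with this approach.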