SED - complex text deletion and pattern matching
I searched a lot about sed. I'm new to using it. I managed to write a command that deletes a block of text between PATTERN-1 and PATTERN-2 (including the patterns) in a large (250 MB+) text file.
Now I have a much more complex task. I need to find a pattern in the text file and delete all the text from a line BEFORE the pattern up to another line matching another pattern. I will give an example:
PATTERN-1 = '<connection'
PATTERN-2 = state="wreck"
PATTERN-3 = '</connection>'
I need to search for PATTERN-2, i.e. state="wreck".
When I find PATTERN-2, I need to find the PREVIOUS PATTERN-1.
Then I need to delete all text between PATTERN-1 and PATTERN-3 (which would include deleting PATTERN-2).
So if my text is:
<connection ...
... state="wreck" ...
</connection>
I would find any instances of state="wreck" and then delete everything between <connection and </connection> (including the text <connection and </connection>).
Thank you. Hope this is a clear question.
linux command-line bash regex sed
asked Dec 13 '18 at 17:29
efraimip
Btw, the pattern matching I used previously to delete text between 2 blocks worked fine. It was simply: sed '/^<discovered>/,/<\/discovered>/d' file1 > file2
The problem now is that I need to find another string inside a block and then delete the containing block.
– efraimip
Dec 13 '18 at 17:38
This is a good introduction to both basic and more advanced features of sed, including multi-line handling.
– AFH
Dec 13 '18 at 17:58
I don't know if sed will be the fastest filter for this, but this is a way to do it: sed -r '/<connection/ { :a; N; /\/connection/ { /wreck/ {d}; p;d }; ba}'
– Paulo
Dec 13 '18 at 18:09
Hi Paulo, this is not working completely either. It deletes much more than the containing block with state="wreck", i.e. it deletes more before the previous <connection pattern. It also leaves empty lines, but that's easy to fix with another sed command, so no problem.
– efraimip
Dec 14 '18 at 18:38
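One possible reason a buffering command like the one above removes too much is that /<connection/ also matches a parent tag such as <connections> or a self-closing <connection ... />. That is an assumption about the input, not something confirmed in this thread. Under that assumption, a minimal awk sketch of the same buffer-then-decide idea could look like this (the function name drop_wreck_blocks_awk is made up for illustration):

```shell
# Hypothetical helper: reads stdin, writes the filtered text to stdout.
drop_wreck_blocks_awk() {
  awk '
    # Opening tag, but not <connections>: start a fresh buffer.
    /<connection([ \t>]|$)/ {
      printf "%s", buf      # flush an earlier block that never closed
      buf = ""
      inblock = 1
    }
    # Inside a block: collect lines until the closing tag shows up.
    inblock {
      buf = buf $0 "\n"
      if (/<\/connection>/) {
        # Print the finished block only if it has no state="wreck".
        if (buf !~ /state="wreck"/) printf "%s", buf
        buf = ""
        inblock = 0
      }
      next
    }
    # Everything outside a block passes through unchanged.
    { print }
    # Emit a trailing block that was never closed.
    END { printf "%s", buf }
  '
}
```

Usage would be something like drop_wreck_blocks_awk < file1 > file2. Only one block is held in memory at a time, so the size of the input file should not matter much.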
1 Answer
Here is a way to do it if you're OK with using Perl. This removes all <connection...</connection> blocks that include state="wreck":
cat file.txt
blah blah
<connection ...
... state="wreck" ...
</connection>
blah blah
<connection ...
... state="wreck" ...
</connection>
blah blah
blah blah
<connection ...
... state="another" ...
</connection>
blah blah
<connection ...
... state="wreck" ...
</connection>
blah blah
perl -0 -pe 's#<connection(?:(?!</connection>).)*state="wreck"(?:(?!</connection>).)*</connection>##gs' file.txt
blah blah
blah blah
blah blah
blah blah
<connection ...
... state="another" ...
</connection>
blah blah
blah blah
Explanation:
-0 # use NUL as the input record separator, so a file without NUL bytes is read as a single record (-0777 is the explicit slurp mode)
-pe # -p: loop over the input and print each record after processing; -e: execute the following code
Regex:
s# : substitute, using # as the regex delimiter
<connection : literally
(?: : start non-capturing group
(?!</connection>) : negative lookahead, make sure we don't find </connection>
. : any character, including newline because of the s flag
)* : group may appear 0 or more times
state="wreck" : literally
(?: : start non-capturing group
(?!</connection>) : negative lookahead, make sure we don't find </connection>
. : any character, including newline because of the s flag
)* : group may appear 0 or more times
</connection> : literally
##gs : replace with empty string, global, dot matches newline
answered Dec 13 '18 at 18:06
Toto
Thanks! The sed command suggested above also worked.
– efraimip
Dec 14 '18 at 2:27
Unfortunately this is not working! There are 2 issues. 1 - It is leaving empty lines. I can easily fix that with sed, no problem. 2 - It is deleting much more than the block containing state="wreck". It is deleting from 2 previous matches of <connection rather than just the previous match. Also, if it helps, the structure is like this: <connection .... ... state="wreck" ... ... </connection>
The line with <connection ... is always 1 line before the line with state="wreck", so maybe just delete from the previous line rather than pattern-matching the previous tag?
– efraimip
Dec 14 '18 at 18:37
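The idea in the last comment, dropping the line just before state="wreck" and everything up to the closing tag, can be sketched with a one-line lookbehind in awk. This assumes, as the comment states, that the <connection ... line is always exactly one line above the state="wreck" line; the function name is hypothetical:

```shell
# Hypothetical helper: reads stdin, writes the filtered text to stdout.
drop_wreck_by_prev_line() {
  awk '
    # Inside a doomed block: drop lines up to and including the closing tag.
    skipping {
      if (/<\/connection>/) skipping = 0
      next
    }
    # Found the marker: forget the held previous line (assumed to be the
    # opening <connection line) and drop this line as well.
    /state="wreck"/ {
      have_prev = 0
      if ($0 !~ /<\/connection>/) skipping = 1
      next
    }
    # Otherwise print the held line and hold the current one instead.
    {
      if (have_prev) print prev
      prev = $0
      have_prev = 1
    }
    # Emit the last held line, if any.
    END { if (have_prev) print prev }
  '
}
```

Usage would be something like drop_wreck_by_prev_line < file1 > file2. The input is processed line by line, so a 250 MB+ file is never loaded into memory at once, and no empty lines are left where blocks were removed.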