SED - complex text deletion and pattern matching












I have searched a lot about sed and I'm new to using it. I managed to make a command that deletes a block of text between PATTERN-1 and PATTERN-2 (including the patterns) in a large (250 MB+) text file.



Now I have a much more complex task. I need to find a pattern in the text file, and delete all the text from a line BEFORE the pattern (the nearest preceding line matching a second pattern) up to a following line matching a third pattern. I will give an example:



PATTERN-1 = '<connection'
PATTERN-2 = 'state="wreck"'
PATTERN-3 = '</connection>'


I need to search for PATTERN-2, i.e. state="wreck".
When I find PATTERN-2, I need to find the PREVIOUS PATTERN-1.
Then I need to delete all text between PATTERN-1 and PATTERN-3 (which would include deleting PATTERN-2).



So if my text is:



<connection ...
... state="wreck" ...
</connection>


I would find any instances of state="wreck" - and then delete everything between
<connection and </connection> (including the text <connection and </connection>).



Thank you. Hope this is a clear question.










  • By the way, the pattern matching I used previously to delete text between 2 patterns worked fine. It was simply: sed '/^<discovered>/,/<\/discovered>/d' file1 > file2. The problem now is that I need to find another string inside of a block, and then delete the containing block.
    – efraimip
    Dec 13 '18 at 17:38












  • This is a good introduction to both basic and more advanced features of sed, including multi-line handling.
    – AFH
    Dec 13 '18 at 17:58












  • I don't know if sed will be the fastest filter for this, but this is a way to do it: sed -r '/<connection/ { :a; N; /\/connection/ { /wreck/ {d}; p;d }; ba}'
    – Paulo
    Dec 13 '18 at 18:09












  • Hi Paulo, this is not working completely either. It deletes much more than the containing block with state="wreck", i.e. it also deletes text before the previous <connection pattern. It also leaves empty lines, but that's easy to fix with another sed command, so no problem.
    – efraimip
    Dec 14 '18 at 18:38
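
For reference, the sed idea from the comments can be spelled out as a small script. This is only a sketch, assuming every block starts on a line containing <connection and ends on a later line containing </connection>. The function name drop_wreck_sed is made up for illustration.

```shell
#!/bin/sh
# Sketch only: drop_wreck_sed is a hypothetical helper name.
# Assumes each line containing "<connection" is later closed by "</connection>".
drop_wreck_sed() {
    sed -n '
    /<connection/ {
        :a
        # Keep appending lines until the closing tag is in the pattern space
        # (or the input ends).
        /<\/connection>/!{
            $!{
                N
                b a
            }
        }
        # Print the collected block unless it contains the unwanted state.
        /state="wreck"/!p
        d
    }
    # Lines outside any block are printed unchanged.
    p
    ' "$1"
}
```

Usage would be something like drop_wreck_sed file1 > file2. If a <connection line is never closed by its own </connection>, this sketch keeps collecting lines until the next closing tag, which might be one way to end up deleting more than intended.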
















linux command-line bash regex sed






asked Dec 13 '18 at 17:29 by efraimip












1 Answer
If you're OK with using Perl, here is one way to do it. This removes every <connection ... </connection> block that contains state="wreck":



cat file.txt
blah blah
<connection ...
... state="wreck" ...
</connection>
blah blah
<connection ...
... state="wreck" ...
</connection>
blah blah
blah blah
<connection ...
... state="another" ...
</connection>
blah blah
<connection ...
... state="wreck" ...
</connection>
blah blah

perl -0 -pe 's#<connection(?:(?!</connection>).)*state="wreck"(?:(?!</connection>).)*</connection>##gs' file.txt
blah blah

blah blah

blah blah
blah blah
<connection ...
... state="another" ...
</connection>
blah blah

blah blah


Explanation:



-0      # set the input record separator to NUL, so a text file is read as one single record
-pe     # loop over the input and print each record (-p), execute the following instructions (-e)


Regex:



s#                  : substitute, regex delimiter
<connection         : literally
(?:                 : start non capture group
(?!</connection>)   : negative lookahead, make sure we don't find </connection>
.                   : any character, including newline because of the s flag
)*                  : group may appear 0 or more times
state="wreck"       : literally
(?:                 : start non capture group
(?!</connection>)   : negative lookahead, make sure we don't find </connection>
.                   : any character, including newline because of the s flag
)*                  : group may appear 0 or more times
</connection>       : literally
##gs                : replace with empty string, global, dot match newline





  • thanks! the sed command suggested above also worked.
    – efraimip
    Dec 14 '18 at 2:27










  • Unfortunately this is not working! There are 2 issues. 1 - It is leaving empty lines. I can easily fix that with sed, no problem. 2 - It is deleting much more than the block containing state="wreck": it is deleting from 2 previous matches of <connection rather than from just the previous match. Also, if it helps, the structure is like this: <connection .... ... state="wreck" ... ... </connection> The line with <connection ... is always 1 line before the line with state="wreck", so maybe just delete from the previous line rather than pattern matching the previous <connection?
    – efraimip
    Dec 14 '18 at 18:37
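
Given the two issues reported in the comment above, a line-by-line filter may be worth trying. This is a rough sketch in awk, not a tested solution for the real file. It assumes the opening tag, the state="wreck" attribute and the closing tag can each be recognised on a single line. Whenever a new <connection line appears, it restarts the buffer, so only the nearest preceding opening tag is ever removed. Whole lines are dropped, so no empty lines are left behind. The function name drop_wreck_blocks is hypothetical.

```shell
#!/bin/sh
# Sketch only: drop_wreck_blocks is a hypothetical helper name.
drop_wreck_blocks() {
    awk '
        # A new opening tag: flush whatever was buffered from an earlier,
        # still unclosed tag, then start buffering from this line.
        /<connection/ {
            if (inblock) printf "%s", buf
            buf = ""; inblock = 1; wreck = 0
        }
        inblock {
            buf = buf $0 "\n"
            if (index($0, "state=\"wreck\"")) wreck = 1
            if (index($0, "</connection>")) {
                # Block is complete: keep it only if it is not a wreck.
                if (!wreck) printf "%s", buf
                buf = ""; inblock = 0
            }
            next
        }
        # Lines outside any block pass through unchanged.
        { print }
        # An unclosed block at end of input is printed as it was.
        END { if (inblock) printf "%s", buf }
    ' "$1"
}
```

Usage would be something like drop_wreck_blocks file1 > file2. Only one block is held in memory at a time, so the size of the input file should not matter much.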











answered Dec 13 '18 at 18:06 by Toto











