SED - complex text deletion and pattern matching

I searched a lot on sed. I'm new to using it. I managed to write a command that deletes a block of text between PATTERN-1 and PATTERN-2 (including the patterns) in a large (250 MB+) text file.

Now I have a much more complex task. I need to find a pattern in the text file and delete all the text from a line BEFORE the pattern up to another line matching another pattern. Here is an example:

PATTERN-1 = '<connection'

PATTERN-2 = state="wreck"

PATTERN-3 = '</connection>'

I need to search for PATTERN-2, i.e. state="wreck".
When I find PATTERN-2, I need to find the PREVIOUS PATTERN-1.
Then I need to delete all text between PATTERN-1 and PATTERN-3 (which would include deleting PATTERN-2).

So if my text is:

<connection ...

... state="wreck" ...

</connection>

I would find any instances of state="wreck" - and then delete everything between
<connection and </connection> (including the text <connection and </connection>).

Thank you. Hope this is a clear question.

asked Dec 13 '18 at 17:29

efraimip

By the way, the pattern matching I used previously to delete the text between two markers worked fine. It was simply: sed '/^<discovered>/,/<\/discovered>/d' file1 > file2. The problem now is that I need to find another string inside a block and then delete the containing block.
– efraimip
Dec 13 '18 at 17:38
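Here is a small way to try the range form from the comment above before pointing it at the 250 MB file. This is only a sketch: the sample lines and the function name delete_discovered are made up for illustration, and the closing tag is written with an escaped slash (<\/discovered>) so that sed does not read the slash as the end of the address.

```shell
#!/bin/sh
# Hypothetical helper: delete every block that starts at a line beginning
# with <discovered> and ends at the next line containing </discovered>.
delete_discovered() {
  sed '/^<discovered>/,/<\/discovered>/d'
}

# Illustrative sample input (not taken from the real file).
printf '%s\n' \
  'keep 1' \
  '<discovered>' \
  'drop me' \
  '</discovered>' \
  'keep 2' | delete_discovered
```

Only the two "keep" lines should come out. Both address lines are removed together with everything between them.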

This is a good introduction to both basic and more advanced features of sed, including multi-line handling.
– AFH
Dec 13 '18 at 17:58

I don't know if sed will be the fastest filter for this, but this is a way to do it: sed -r '/<connection/ { :a; N; /\/connection/ { /wreck/ {d}; p;d }; ba}'
– Paulo
Dec 13 '18 at 18:09

Hi Paulo, this is not working completely either. It deletes much more than the containing block with state="wreck", i.e. it also deletes text before the previous <connection pattern. It also leaves empty lines, but that's easy to fix with another sed command, so no problem.
– efraimip
Dec 14 '18 at 18:38
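The loop from Paulo's comment can also be written as a multi-line script, which is easier to read and does not depend on how a given sed parses a label followed directly by a closing brace. This is a sketch under assumptions: the function name drop_wreck_blocks_sed and the sample lines are hypothetical, and every <connection is assumed to have its own </connection> later in the file. If an opening tag is never closed (a self-closing tag, for example), the loop keeps reading past it and removes more than one block.

```shell
#!/bin/sh
# Hypothetical helper: remove each <connection ... </connection> block
# that contains state="wreck"; other blocks pass through unchanged.
drop_wreck_blocks_sed() {
  sed '/<connection/{
    :a
    /<\/connection>/!{
      N
      ba
    }
    /state="wreck"/d
  }'
}

# Illustrative sample input (not taken from the real file).
printf '%s\n' \
  'x' \
  '<connection a' \
  'state="wreck"' \
  '</connection>' \
  'y' \
  '<connection b' \
  'state="ok"' \
  '</connection>' \
  'z' | drop_wreck_blocks_sed
```

Because the whole block sits in the pattern space when d runs, the lines are removed entirely and no empty lines are left behind.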

linux command-line bash regex sed


1 Answer

Here is a way to go if you're OK with using perl. This removes all <connection...</connection> blocks that include state="wreck".

cat file.txt
blah blah
<connection ...
... state="wreck" ...
</connection>
blah blah
<connection ...
... state="wreck" ...
</connection>
blah blah
blah blah
<connection ...
... state="another" ...
</connection>
blah blah
<connection ...
... state="wreck" ...
</connection>
blah blah

perl -0 -pe 's#<connection(?:(?!</connection>).)*state="wreck"(?:(?!</connection>).)*</connection>##gs' file.txt
blah blah

blah blah

blah blah
blah blah
<connection ...
... state="another" ...
</connection>
blah blah

blah blah

Explanation:

-0      # use NUL as the input record separator, so a text file with no NUL bytes is read as a single record (-0777 is the explicit slurp mode)
-p      # loop over the input and print each record after the code has run
-e      # execute the following code

Regex:

s#                      : substitute, regex delimiter
<connection             : literally
(?:                     : start non capture group
    (?!</connection>)   : negative lookahead, make sure we don't find </connection>
    .                   : any character, including newline because of the s flag
)*                      : group may appear 0 or more times
state="wreck"           : literally
(?:                     : start non capture group
    (?!</connection>)   : negative lookahead, make sure we don't find </connection>
    .                   : any character, including newline because of the s flag
)*                      : group may appear 0 or more times
</connection>           : literally
##gs                    : replace with empty string, global, dot match newline

answered Dec 13 '18 at 18:06

Toto


Thanks! The sed command suggested above also worked.
– efraimip
Dec 14 '18 at 2:27

Unfortunately this is not working! There are two issues. 1. It leaves empty lines; that is easy to fix with sed, no problem. 2. It deletes much more than the block containing state="wreck": it deletes from two matches of <connection back rather than just from the previous match. Also, if it helps, the structure is like this: <connection .... ... state="wreck" ... ... </connection>. The line with <connection ... is always one line before the line with state="wreck", so maybe just delete from the previous line rather than pattern matching the previous one?
– efraimip
Dec 14 '18 at 18:37
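One possible reason for the over-deletion described in the comment above (an assumption, since the real file is not shown) is that some <connection tags are never followed by their own </connection>, for example self-closing ones, so the match starts at an earlier opening tag. Below is a streaming sketch in awk that always restarts at the nearest <connection and drops whole lines, so no empty lines are left. The function name drop_wreck_connections and the sample lines are hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: drop each <connection ... </connection> block that
# contains state="wreck", counting from the NEAREST preceding <connection.
drop_wreck_connections() {
  awk '
    /<connection/ {
      # A new opening tag: anything buffered so far was never closed
      # (for example a self-closing tag), so emit it unchanged.
      if (inblock) printf "%s", buf
      buf = ""; inblock = 1; wreck = 0
    }
    inblock {
      buf = buf $0 "\n"
      if (index($0, "state=\"wreck\"")) wreck = 1
      if (index($0, "</connection>")) {
        # End of block: print it only if no wreck state was seen.
        if (!wreck) printf "%s", buf
        buf = ""; inblock = 0
      }
      next
    }
    { print }
    END { if (inblock) printf "%s", buf }
  '
}

# Illustrative sample: a self-closing tag precedes the block to remove.
printf '%s\n' \
  'head' \
  '<connection a="1"/>' \
  'mid' \
  '<connection b' \
  '... state="wreck" ...' \
  '</connection>' \
  'tail' | drop_wreck_connections
```

Only one block is held in memory at a time, so this should cope with a large file. On the sample, the self-closing line and "mid" are kept and only the block starting at <connection b is removed.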
