comm -23 not deleting all common lines
I want to delete the lines from 1.txt that also appear in 2.txt and save the result to 3.txt.
I am using this bash command:
comm -23 1.txt 2.txt > 3.txt
When I check 3.txt, I find that some lines common to 1.txt and 2.txt are still there; the word "registry" is one example. What is the problem?
You can download the two files below:
file 1.txt : https://ufile.io/n7vn6
file 2.txt : https://ufile.io/p4s58
duplicates comm
I haven't checked your files but I am guessing you didn't sort them before using comm.
– mickp
Nov 22 '18 at 15:28
Is there extra whitespace you're not taking into account?
– glenn jackman
Nov 22 '18 at 15:43
They are sorted, and there is no extra space.
– Youcef
Nov 22 '18 at 16:53
asked Nov 22 '18 at 15:26 by Youcef, edited Nov 22 '18 at 15:42 by chepner
2 Answers
I'm not sure how you generated your text files, but the problem is that some lines in 1.txt and 2.txt don't have consistent line terminators. Some carry a CR character (ctrl-M) instead of only the single line feed Linux expects in text files. For example, one of the files has registry^M, which doesn't match registry: Linux programs that examine text treat ^M as an ordinary character or as whitespace, not as part of a line terminator to be ignored. Some text editors don't display the ^M, so registry looks the same in both places, but it isn't.
You could try:
dos2unix 1.txt 2.txt
comm -23 <(sort 1.txt) <(sort 2.txt) > 3.txt
dos2unix will correct all of the line terminators (assuming they use the DOS CR). Note that this can change the sort order slightly, so I'm also re-sorting the files. You can try it without re-sorting; if there's an issue, comm will report that one of the files isn't sorted.
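To check whether stray carriage returns are the culprit before changing anything, you can count the lines that end in CR. This is only a minimal sketch: `count_cr_lines` is a hypothetical helper name, and the sample files are made up for illustration rather than taken from the linked downloads.

```shell
#!/bin/sh
# Hypothetical helper: print how many lines of file $1 end with a carriage return.
count_cr_lines() {
    # Build a literal CR; command substitution only strips trailing newlines.
    cr=$(printf '\r')
    # grep -c prints the number of matching lines. It exits non-zero when the
    # count is 0, so "|| true" keeps the helper from signalling failure.
    grep -c "${cr}\$" "$1" || true
}

# Made-up sample data: the first file uses CRLF endings, the second plain LF.
dir=$(mktemp -d)
printf 'apple\r\nregistry\r\n' > "$dir/1.txt"
printf 'registry\nzebra\n' > "$dir/2.txt"

count_cr_lines "$dir/1.txt"
count_cr_lines "$dir/2.txt"
```

A non-zero count for one file and zero for the other suggests mixed line endings. `cat -v 1.txt` is another quick check, since it displays each CR as `^M`.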
dos2unix solved it! Thanks man!
– Youcef
Nov 22 '18 at 16:53
well, it would change the input files.
– hek2mgl
Nov 22 '18 at 16:53
All I need is the alphabetic characters. What would I do with those hidden characters related to some bizarre encoding? :D
– Youcef
Nov 22 '18 at 16:56
@hek2mgl that may be a good thing if the contents need fixing. Unclear what the overall use case is. The OP can choose other means of addressing the issue if needed now that they know the problem.
– lurker
Nov 22 '18 at 16:56
@Youcef it's not about alphabetic characters. This solution doesn't change the encoding, just the line endings.
– hek2mgl
Nov 22 '18 at 16:58
comm needs its input to be sorted. You can use process substitution for that:
comm -23 <(sort 1.txt) <(sort 2.txt) > 3.txt
Update: if you additionally have a problem with line endings, you can use sed to normalize them:
comm -23 <(sed 's/\r//g' 1.txt | sort) <(sed 's/\r//g' 2.txt | sort) > 3.txt
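The same cleanup can be done without modifying the input files and without process substitution, which may help in shells other than bash. The following is a sketch under stated assumptions: `lines_only_in_first` is a hypothetical helper name, the sample data is invented, `tr -d '\r'` is swapped in for sed, and `LC_ALL=C` is set on both `sort` and `comm` so that they agree on the collation order.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of file $1 that do not appear in file $2,
# ignoring carriage returns and leaving both input files untouched.
lines_only_in_first() {
    tmp=$(mktemp) || return 1
    # Strip CRs from the second file, sort it, and park it in a temp file.
    tr -d '\r' < "$2" | LC_ALL=C sort > "$tmp"
    # Same cleanup for the first file, fed to comm on stdin ("-").
    # -23 suppresses lines unique to file 2 and lines common to both.
    tr -d '\r' < "$1" | LC_ALL=C sort | LC_ALL=C comm -23 - "$tmp"
    rm -f "$tmp"
}

# Made-up sample data: the first file uses CRLF endings, the second plain LF.
dir=$(mktemp -d)
printf 'apple\r\nregistry\r\n' > "$dir/1.txt"
printf 'registry\nzebra\n' > "$dir/2.txt"

# Prints only "apple"; "registry" is now recognised as common to both files.
lines_only_in_first "$dir/1.txt" "$dir/2.txt"
```

With real data, the call would be `lines_only_in_first 1.txt 2.txt > 3.txt`.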
answered Nov 22 '18 at 16:06 by lurker, edited Nov 22 '18 at 16:31
answered Nov 22 '18 at 15:31 by hek2mgl, edited Nov 22 '18 at 16:53