How to remove duplicate lines












I am trying to create a simple program that removes duplicate lines from a file. However, I am stuck. My goal is to keep one copy of each duplicated line (unlike the suggested duplicate question, which removes them all), so I still have that data. I would also like it to read from and write back to the same filename. When I tried to make both filenames the same, it just output an empty file.



input_file = "input.txt"
output_file = "input.txt"

seen_lines = set()
outfile = open(output_file, "w")

for line in open(input_file, "r"):
    if line not in seen_lines:
        outfile.write(line)
        seen_lines.add(line)

outfile.close()




input.txt



I really love christmas
Keep the change ya filthy animal
Pizza is my fav food
Keep the change ya filthy animal
Did someone say peanut butter?
Did someone say peanut butter?
Keep the change ya filthy animal




Expected output



I really love christmas
Keep the change ya filthy animal
Pizza is my fav food
Did someone say peanut butter?









Mark is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
  • You open the file twice, since input_file and output_file are the same. The second time you open as read, which is where I think your problem is. So you won't be able to write. – busybear, 2 days ago

  • @busybear Yes. Open your file as r+ to read and write to the file at the same time (they will both work). – Ethan K, 2 days ago

  • Possible duplicate of How might I remove duplicate lines from a file? – glennv, 2 days ago
















python text-files






edited 2 days ago by Mad Physicist
asked 2 days ago by Mark

6 Answers


















The line outfile = open(output_file, "w") truncates your file no matter what else you do. The reads that follow will find an empty file. My recommendation for doing this safely is to use a temporary file:
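This truncation is easy to demonstrate in isolation (demo.txt is just a throwaway name for the experiment):

```python
with open('demo.txt', 'w') as f:
    f.write('hello\n')                    # the file now has content
f = open('demo.txt', 'w')                 # opening in 'w' mode truncates immediately
f.close()
print(repr(open('demo.txt').read()))      # prints '' -- the file is already empty
```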




  1. Open a temp file for writing

  2. Process the input to the new output

  3. Close both files

  4. Move the temp file to the input file name


This is much more robust than opening the file twice for reading and writing. If anything goes wrong, you will have the original and whatever work you did so far stashed away. Your current approach can mess up your file if anything goes wrong in the process.



Here is a sample using tempfile.NamedTemporaryFile, and a with block to make sure everything is closed properly, even in case of error:



from tempfile import NamedTemporaryFile
from shutil import move

input_file = "input.txt"
output_file = "input.txt"

seen_lines = set()

with NamedTemporaryFile('w', delete=False) as output, open(input_file) as input:
    for line in input:
        sline = line.rstrip('\n')
        if sline not in seen_lines:
            output.write(line)
            seen_lines.add(sline)

move(output.name, output_file)


The move at the end will work correctly even if the input and output names are the same, since output.name is guaranteed to be something different from both.



Note also that I'm stripping the newline from each line in the set, since the last line might not have one.



Alt Solution



If you don't care about the order of the lines, you can simplify the process somewhat by doing everything directly in memory:



input_file = "input.txt"
output_file = "input.txt"

with open(input_file) as input:
    unique = set(line.rstrip('\n') for line in input)
with open(output_file, 'w') as output:
    for line in unique:
        output.write(line)
        output.write('\n')


You can compare this against



with open(input_file) as input:
    unique = set(line.rstrip('\n') for line in input.readlines())
with open(output_file, 'w') as output:
    output.write('\n'.join(unique))


The second version does exactly the same thing, but loads and writes all at once.






  • I get an error of outfile is not defined – Mark, 2 days ago

  • just a question, this way of removing duplicates is very slow if there is over 100,000 lines. Is there a better way? Also still getting the same error. – Mark, 2 days ago

  • @Mark. With that size, your I/O is the bottleneck. I doubt you can do much to speed it up. – Mad Physicist, 2 days ago

  • @Mark. Fixed the error. It was just a typo – Mad Physicist, 2 days ago

  • @Mark. I've proposed an alternative – Mad Physicist, 2 days ago



















The problem is that you're trying to write to the same file that you're reading from. You have at least two options:



Option 1



Use different filenames (e.g. input.txt and output.txt). This is, at some level, easiest.
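A sketch of this option, wrapped in a helper so it can be reused (dedupe_to_new_file is a hypothetical name, not from the question):

```python
def dedupe_to_new_file(src_path, dst_path):
    """Copy src_path to dst_path, keeping only the first occurrence of each line."""
    seen = set()
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            if line not in seen:
                seen.add(line)
                dst.write(line)
```

Because src_path and dst_path are different files, reading and writing never collide.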



Option 2



Read all data in from your input file, close that file, then open the file for writing.



with open('input.txt', 'r') as f:
    lines = f.readlines()

seen_lines = set()
with open('input.txt', 'w') as f:
    for line in lines:
        if line not in seen_lines:
            seen_lines.add(line)
            f.write(line)


Option 3



Open the file for both reading and writing using r+ mode. You need to be careful in this case to read the data you're going to process before writing. If you do everything in a single loop, the loop iterator may lose track.
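A minimal sketch of that careful read-then-write pattern (dedupe_in_place is a hypothetical helper name): read every line first, rewind, rewrite, then truncate the leftover tail.

```python
def dedupe_in_place(path):
    """Remove duplicate lines from path using a single r+ handle."""
    seen = set()
    with open(path, 'r+') as f:
        lines = f.readlines()   # read everything before writing anything
        f.seek(0)               # rewind to the start of the file
        for line in lines:
            if line not in seen:
                seen.add(line)
                f.write(line)
        f.truncate()            # discard leftover bytes past the new end
```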






  • Or use r+ for reading and writing. – Ethan K, 2 days ago



















import os

seen_lines = []

with open('input.txt', 'r') as infile:
    lines = infile.readlines()
    for line in lines:
        line_stripped = line.strip()
        if line_stripped not in seen_lines:
            seen_lines.append(line_stripped)

with open('input.txt', 'w') as outfile:
    for line in seen_lines:
        outfile.write(line)
        if line != seen_lines[-1]:
            outfile.write(os.linesep)


Output:



I really love christmas
Keep the change ya filthy animal
Pizza is my fav food
Did someone say peanut butter?





bitto is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
  • This fixes the problem and is a good solution for small input files, but note that it will be quite slow (quadratic time) for large files due to the linear search through seen_lines. – Flight Odyssey, 2 days ago

  • When I use this code, I see Keep the change ya filthy animal twice in the output? – Mark, 2 days ago

  • @Mark I tested the code and i don't see it. Can you copy the code as it is and try again? may be you made some unintentional mistake while typing it. – bitto, 2 days ago

  • Wait, I think its because the last line has the EOF at the end of the line so it sees it as not a duplicate. I tested it. If the last line is a duplicate line, it always keeps it because of the EOF. Any way around this? I am on windows by the way – Mark, 2 days ago

  • @Mark stackoverflow.com/questions/18857352/… might help. I can't say for sure. i am on Ubuntu. – bitto, 2 days ago



















I believe this is the easiest way to do what you want:



with open('FileName.txt', 'r+') as i:
    AllLines = i.readlines()
    i.seek(0)                             # rewind before writing back
    for line in dict.fromkeys(AllLines):  # keeps the first copy of each line
        i.write(line)
    i.truncate()                          # drop the leftover tail





Matt Hawkins is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
  • At that point it would be much simpler to reopen for writing. If you're removing lines, there will be a tail left in the file. – Mad Physicist, 2 days ago



















Try the below code, using list comprehension with str.join and set and sorted:



input_file = "input.txt"
output_file = "input.txt"

infile = open(input_file, "r")
l = [i.rstrip() for i in infile.readlines()]
infile.close()

outfile = open(output_file, "w")
outfile.write('\n'.join(sorted(set(l), key=l.index)))
outfile.close()





Just my two cents, in case you happen to be able to use Python3. It uses:

  • A reusable Path object which has a handy write_text() method.

  • An OrderedDict as data structure to satisfy the constraints of uniqueness and order at once.

  • A generator expression instead of Path.read_text() to save on memory.


# in-place removal of duplicate lines, while retaining order
from collections import OrderedDict
from pathlib import Path

filepath = Path("./duplicates.txt")

with filepath.open() as _file:
    no_duplicates = OrderedDict.fromkeys(line.rstrip('\n') for line in _file)

filepath.write_text("\n".join(no_duplicates))






      6 Answers
      6






      active

      oldest

      votes








      6 Answers
      6






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      4














      The line outfile = open(output_file, "w") truncates your file no matter what else you do. The reads that follow will find an empty file. My recommendation for doing this safely is to use a temporary file:




      1. Open a temp file for writing

      2. Process the input to the new output

      3. Close both files

      4. Move the temp file to the input file name


      This is much more robust than opening the file twice for reading and writing. If anything goes wrong, you will have the original and whatever work you did so far stashed away. Your current approach can mess up your file if anything goes wrong in the process.



      Here is a sample using tempfile.NamedTemporaryFile, and a with block to make sure everything is closed properly, even in case of error:



      from tempfile import NamedTemporaryFile
      from shutil import move

      input_file = "input.txt"
      output_file = "input.txt"

      seen_lines = set()

      with NamedTemporaryFile('w', delete=False) as output, open(input_file) as input:
      for line in open(input_file, "r"):
      sline = line.rstrip('n')
      if sline not in seen_lines:
      output.write(line)
      seen_lines.add(sline)
      move(output.name, output_file)


      The move at the end will work correctly even if the input and output names are the same, since output.name is guaranteed to be something different from both.



      Note also that I'm stripping the newline from each line in the set, since the last line might not have one.



      Alt Solution



      If your don't care about the order of the lines, you can simplify the process somewhat by doing everything directly in memory:



      input_file = "input.txt"
      output_file = "input.txt"

      with open(input_file) as input:
      unique = set(line.rstrip('n') for line in input)
      with open(output_file, 'w') as output:
      for line in unique:
      output.write(line)
      output.write('n')


      You can compare this against



      with open(input_file) as input:
      unique = set(line.rstrip('n') for line in input.readlines())
      with open(output_file, 'w') as output:
      output.write('n'.join(unique))


      The second version does exactly the same thing, but loads and writes all at once.






      share|improve this answer























      • I get an error of outfile is not defined
        – Mark
        2 days ago










      • just a question, this way of removing duplicates is very slow if there is over 100,000 lines. Is there a better way? Also still getting the same error.
        – Mark
        2 days ago












      • @Mark. With that size, your I/O is the bottleneck. I doubt you can do much to speed it up.
        – Mad Physicist
        2 days ago










      • @Mark. Fixed the error. It was just a typo
        – Mad Physicist
        2 days ago










      • @Mark. I've proposed an alternative
        – Mad Physicist
        2 days ago
















      4














      The line outfile = open(output_file, "w") truncates your file no matter what else you do. The reads that follow will find an empty file. My recommendation for doing this safely is to use a temporary file:




      1. Open a temp file for writing

      2. Process the input to the new output

      3. Close both files

      4. Move the temp file to the input file name


      This is much more robust than opening the file twice for reading and writing. If anything goes wrong, you will have the original and whatever work you did so far stashed away. Your current approach can mess up your file if anything goes wrong in the process.



      Here is a sample using tempfile.NamedTemporaryFile, and a with block to make sure everything is closed properly, even in case of error:



      from tempfile import NamedTemporaryFile
      from shutil import move

      input_file = "input.txt"
      output_file = "input.txt"

      seen_lines = set()

      with NamedTemporaryFile('w', delete=False) as output, open(input_file) as input:
      for line in open(input_file, "r"):
      sline = line.rstrip('n')
      if sline not in seen_lines:
      output.write(line)
      seen_lines.add(sline)
      move(output.name, output_file)


      The move at the end will work correctly even if the input and output names are the same, since output.name is guaranteed to be something different from both.



      Note also that I'm stripping the newline from each line in the set, since the last line might not have one.



      Alt Solution



      If your don't care about the order of the lines, you can simplify the process somewhat by doing everything directly in memory:



      input_file = "input.txt"
      output_file = "input.txt"

      with open(input_file) as input:
      unique = set(line.rstrip('n') for line in input)
      with open(output_file, 'w') as output:
      for line in unique:
      output.write(line)
      output.write('n')


      You can compare this against



      with open(input_file) as input:
      unique = set(line.rstrip('n') for line in input.readlines())
      with open(output_file, 'w') as output:
      output.write('n'.join(unique))


      The second version does exactly the same thing, but loads and writes all at once.






      share|improve this answer























      • I get an error of outfile is not defined
        – Mark
        2 days ago










      • just a question, this way of removing duplicates is very slow if there is over 100,000 lines. Is there a better way? Also still getting the same error.
        – Mark
        2 days ago












      • @Mark. With that size, your I/O is the bottleneck. I doubt you can do much to speed it up.
        – Mad Physicist
        2 days ago










      • @Mark. Fixed the error. It was just a typo
        – Mad Physicist
        2 days ago










      • @Mark. I've proposed an alternative
        – Mad Physicist
        2 days ago














      4












      4








      4






      The line outfile = open(output_file, "w") truncates your file no matter what else you do. The reads that follow will find an empty file. My recommendation for doing this safely is to use a temporary file:




      1. Open a temp file for writing

      2. Process the input to the new output

      3. Close both files

      4. Move the temp file to the input file name


      This is much more robust than opening the file twice for reading and writing. If anything goes wrong, you will have the original and whatever work you did so far stashed away. Your current approach can mess up your file if anything goes wrong in the process.



      Here is a sample using tempfile.NamedTemporaryFile, and a with block to make sure everything is closed properly, even in case of error:



      from tempfile import NamedTemporaryFile
      from shutil import move

      input_file = "input.txt"
      output_file = "input.txt"

      seen_lines = set()

      with NamedTemporaryFile('w', delete=False) as output, open(input_file) as input:
      for line in open(input_file, "r"):
      sline = line.rstrip('n')
      if sline not in seen_lines:
      output.write(line)
      seen_lines.add(sline)
      move(output.name, output_file)


      The move at the end will work correctly even if the input and output names are the same, since output.name is guaranteed to be something different from both.



      Note also that I'm stripping the newline from each line in the set, since the last line might not have one.



      Alt Solution



      If your don't care about the order of the lines, you can simplify the process somewhat by doing everything directly in memory:



      input_file = "input.txt"
      output_file = "input.txt"

      with open(input_file) as input:
      unique = set(line.rstrip('n') for line in input)
      with open(output_file, 'w') as output:
      for line in unique:
      output.write(line)
      output.write('n')


      You can compare this against



      with open(input_file) as input:
      unique = set(line.rstrip('n') for line in input.readlines())
      with open(output_file, 'w') as output:
      output.write('n'.join(unique))


      The second version does exactly the same thing, but loads and writes all at once.






      share|improve this answer














      The line outfile = open(output_file, "w") truncates your file no matter what else you do. The reads that follow will find an empty file. My recommendation for doing this safely is to use a temporary file:




      1. Open a temp file for writing

      2. Process the input to the new output

      3. Close both files

      4. Move the temp file to the input file name


      This is much more robust than opening the file twice for reading and writing. If anything goes wrong, you will have the original and whatever work you did so far stashed away. Your current approach can mess up your file if anything goes wrong in the process.



      Here is a sample using tempfile.NamedTemporaryFile, and a with block to make sure everything is closed properly, even in case of error:



      from tempfile import NamedTemporaryFile
      from shutil import move

      input_file = "input.txt"
      output_file = "input.txt"

      seen_lines = set()

      with NamedTemporaryFile('w', delete=False) as output, open(input_file) as input:
      for line in open(input_file, "r"):
      sline = line.rstrip('n')
      if sline not in seen_lines:
      output.write(line)
      seen_lines.add(sline)
      move(output.name, output_file)


      The move at the end will work correctly even if the input and output names are the same, since output.name is guaranteed to be something different from both.



      Note also that I'm stripping the newline from each line in the set, since the last line might not have one.



      Alt Solution



      If your don't care about the order of the lines, you can simplify the process somewhat by doing everything directly in memory:



      input_file = "input.txt"
      output_file = "input.txt"

      with open(input_file) as input:
      unique = set(line.rstrip('n') for line in input)
      with open(output_file, 'w') as output:
      for line in unique:
      output.write(line)
      output.write('n')


      You can compare this against



      with open(input_file) as input:
      unique = set(line.rstrip('n') for line in input.readlines())
      with open(output_file, 'w') as output:
      output.write('n'.join(unique))


      The second version does exactly the same thing, but loads and writes all at once.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited 2 days ago

























      answered 2 days ago









      Mad Physicist

      33.7k156894




      33.7k156894












      • I get an error of outfile is not defined
        – Mark
        2 days ago










      • just a question, this way of removing duplicates is very slow if there is over 100,000 lines. Is there a better way? Also still getting the same error.
        – Mark
        2 days ago












      • @Mark. With that size, your I/O is the bottleneck. I doubt you can do much to speed it up.
        – Mad Physicist
        2 days ago










      • @Mark. Fixed the error. It was just a typo
        – Mad Physicist
        2 days ago










      • @Mark. I've proposed an alternative
        – Mad Physicist
        2 days ago


















      • I get an error of outfile is not defined
        – Mark
        2 days ago










      • just a question, this way of removing duplicates is very slow if there is over 100,000 lines. Is there a better way? Also still getting the same error.
        – Mark
        2 days ago












      • @Mark. With that size, your I/O is the bottleneck. I doubt you can do much to speed it up.
        – Mad Physicist
        2 days ago










      • @Mark. Fixed the error. It was just a typo
        – Mad Physicist
        2 days ago










      • @Mark. I've proposed an alternative
        – Mad Physicist
        2 days ago
















      I get an error of outfile is not defined
      – Mark
      2 days ago




      I get an error of outfile is not defined
      – Mark
      2 days ago












      just a question, this way of removing duplicates is very slow if there is over 100,000 lines. Is there a better way? Also still getting the same error.
      – Mark
      2 days ago






      just a question, this way of removing duplicates is very slow if there is over 100,000 lines. Is there a better way? Also still getting the same error.
      – Mark
      2 days ago














      @Mark. With that size, your I/O is the bottleneck. I doubt you can do much to speed it up.
      – Mad Physicist
      2 days ago




      @Mark. With that size, your I/O is the bottleneck. I doubt you can do much to speed it up.
      – Mad Physicist
      2 days ago












      @Mark. Fixed the error. It was just a typo
      – Mad Physicist
      2 days ago




      @Mark. Fixed the error. It was just a typo
      – Mad Physicist
      2 days ago












      @Mark. I've proposed an alternative
      – Mad Physicist
      2 days ago




      @Mark. I've proposed an alternative
      – Mad Physicist
      2 days ago













      3














      The problem is that you're trying to write to the same file that you're reading from. You have at least two options:



      Option 1



      Use different filenames (e.g. input.txt and output.txt). This is, at some level, easiest.



      Option 2



      Read all data in from your input file, close that file, then open the file for writing.



      with open('input.txt', 'r') as f:
          lines = f.readlines()

      seen_lines = set()
      with open('input.txt', 'w') as f:
          for line in lines:
              if line not in seen_lines:
                  seen_lines.add(line)
                  f.write(line)

      Option 3



      Open the file for both reading and writing using r+ mode. You need to be careful in this case to read the data you're going to process before writing. If you do everything in a single loop, the loop iterator may lose track.
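A minimal sketch of that approach, as an illustration (the function name is hypothetical): read everything up front, seek back to the start, write the unique lines, then truncate so no stale tail from the old contents remains.

```python
def dedupe_in_place(path):
    """Remove duplicate lines from the file at path, keeping first occurrences."""
    seen_lines = set()
    with open(path, 'r+') as f:
        lines = f.readlines()   # read everything before writing anything
        f.seek(0)               # rewind to the start of the file
        for line in lines:
            if line not in seen_lines:
                seen_lines.add(line)
                f.write(line)
        f.truncate()            # drop the tail left over from the old contents
```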

























      • 1




        Or use r+ for reading and writing.
        – Ethan K
        2 days ago
















      edited 2 days ago

























      answered 2 days ago









      Jonah Bishop

      1














      import os

      seen_lines = []

      with open('input.txt', 'r') as infile:
          lines = infile.readlines()
          for line in lines:
              line_stripped = line.strip()
              if line_stripped not in seen_lines:
                  seen_lines.append(line_stripped)

      with open('input.txt', 'w') as outfile:
          for line in seen_lines:
              outfile.write(line)
              if line != seen_lines[-1]:
                  outfile.write(os.linesep)


      Output:



      I really love christmas
      Keep the change ya filthy animal
      Pizza is my fav food
      Did someone say peanut butter?















      New contributor




      bitto is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.


















      • This fixes the problem and is a good solution for small input files, but note that it will be quite slow (quadratic time) for large files due to the linear search through seen_lines.
        – Flight Odyssey
        2 days ago










      • When I use this code, I see Keep the change ya filthy animal twice in the output?
        – Mark
        2 days ago










      • @Mark I tested the code and I don't see it. Can you copy the code as-is and try again? Maybe you made an unintentional mistake while typing it.
        – bitto
        2 days ago










      • Wait, I think it's because the last line has no newline before the EOF, so it's seen as not a duplicate. I tested it: if the last line is a duplicate line, it always keeps it because of that. Any way around this? I am on Windows, by the way
        – Mark
        2 days ago












      • @Mark stackoverflow.com/questions/18857352/… might help. I can't say for sure. i am on Ubuntu.
        – bitto
        2 days ago
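Following up on the newline issue discussed above: comparing lines with the trailing newline stripped sidesteps it, because the final line of a file often lacks a '\n' and would otherwise never compare equal to an earlier duplicate. A small sketch (the helper name is made up for illustration):

```python
def unique_lines(lines):
    """Dedupe lines by content, ignoring trailing-newline differences."""
    seen = set()
    result = []
    for line in lines:
        key = line.rstrip('\n')   # 'foo\n' and 'foo' now compare equal
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result
```

For example, `unique_lines(["a\n", "b\n", "a"])` returns `["a", "b"]` even though the last `"a"` has no newline.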


















      edited 2 days ago





















      New contributor




      bitto is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.









      answered 2 days ago









      bitto












      0














      I believe this is the easiest way to do what you want:



      with open('FileName.txt', 'r+') as i:
          AllLines = i.readlines()
          for line in AllLines:
              ...  # write to file















      New contributor




      Matt Hawkins is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.


















      • At that point it would be much simpler to reopen for writing. If you're removing lines, there will be a tail left in the file.
        – Mad Physicist
        2 days ago
















      edited 2 days ago






























      answered 2 days ago









      Matt Hawkins

      0














      Try the code below, using a list comprehension with str.join, set and sorted. Note that the input must be read before the same file is opened for writing, since opening in "w" mode truncates it:

      input_file = "input.txt"
      output_file = "input.txt"

      infile = open(input_file, "r")
      l = [i.rstrip() for i in infile.readlines()]
      infile.close()

      outfile = open(output_file, "w")
      outfile.write('\n'.join(sorted(set(l), key=l.index)))
      outfile.close()





          answered 2 days ago









          U9-Forward

              0














              Just my two cents, in case you happen to be able to use Python3. It uses:




              • A reusable Path object which has a handy write_text() method.

              • An OrderedDict as data structure to satisfy the constraints of uniqueness and order at once.

              • A generator expression instead of Path.read_text() to save on memory.




              # in-place removal of duplicate lines, while retaining order
              from collections import OrderedDict
              from pathlib import Path

              filepath = Path("./duplicates.txt")

              with filepath.open() as _file:
                  no_duplicates = OrderedDict.fromkeys(line.rstrip('\n') for line in _file)

              filepath.write_text("\n".join(no_duplicates))
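As a side note (assuming Python 3.7+, where plain dicts preserve insertion order), dict.fromkeys gives the same order-preserving dedup without the import:

```python
lines = ["a", "b", "a", "c", "b"]
unique = list(dict.fromkeys(lines))  # order-preserving dedup
print(unique)  # prints ['a', 'b', 'c']
```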





                  edited 2 days ago

























                  answered 2 days ago









                  timmwagener
