Python: replace part of file path using pandas match












2














Data frame with 2 columns: old_path and new_path. Data frame can contain hundreds of rows.



The script iterates over a list of files.



For each file in the list, check if any part of its folder path matches a value in the old_path column. If there is a match, replace the file's matched old_path with the corresponding new_path value.



I achieved this with for index, row in df.iterrows(): or for row in df.itertuples():, but I'm thinking there should be a more efficient way to do it without having to use the second for loop.



Any help is appreciated. Sample below uses df.iterrows()



import pandas as pd
import os

df = pd.read_csv('path_lookup.csv')
# df:
# old_path new_path
# 0 F:BusinessBudget & Forecasting M:BusinessFinanceForecast
# 1 F:BusinessTreasury Shared M:BusinessFinanceTreasury
# 2 C:Temp C:NewTemp

excel_link_analysis_list = [
{'excel_filename': 'C:\Temp\12345\Distribution Adjusted Claim.xlsx',
'file_read': 'OK'},
{'excel_filename': 'C:\Temp\SubFolder\cost estimates.xlsx',
'file_read': 'OK'}
]

for i in excel_link_analysis_list:
for index, row in df.iterrows():
if row['old_path'].lower() in i['excel_filename'].lower():
dest_path_and_file = i['excel_filename'].lower().replace(row['old_path'].lower(),
row['new_path'].lower())
print(dest_path_and_file)


prints:




c:newtemp12345distribution adjusted claim.xlsx



c:newtempsubfoldercost estimates.xlsx











share|improve this question





























    2














    Data frame with 2 columns: old_path and new_path. Data frame can contain hundreds of rows.



    The script iterates over a list of files.



    For each file in the list, check if any part of its folder path matches a value in the old_path column. If there is a match, replace the file's matched old_path with the corresponding new_path value.



    I achieved this with for index, row in df.iterrows(): or for row in df.itertuples():, but I'm thinking there should be a more efficient way to do it without having to use the second for loop.



    Any help is appreciated. Sample below uses df.iterrows()



    import pandas as pd
    import os

    df = pd.read_csv('path_lookup.csv')
    # df:
    # old_path new_path
    # 0 F:BusinessBudget & Forecasting M:BusinessFinanceForecast
    # 1 F:BusinessTreasury Shared M:BusinessFinanceTreasury
    # 2 C:Temp C:NewTemp

    excel_link_analysis_list = [
    {'excel_filename': 'C:\Temp\12345\Distribution Adjusted Claim.xlsx',
    'file_read': 'OK'},
    {'excel_filename': 'C:\Temp\SubFolder\cost estimates.xlsx',
    'file_read': 'OK'}
    ]

    for i in excel_link_analysis_list:
    for index, row in df.iterrows():
    if row['old_path'].lower() in i['excel_filename'].lower():
    dest_path_and_file = i['excel_filename'].lower().replace(row['old_path'].lower(),
    row['new_path'].lower())
    print(dest_path_and_file)


    prints:




    c:newtemp12345distribution adjusted claim.xlsx



    c:newtempsubfoldercost estimates.xlsx











    share|improve this question



























      2












      2








      2







      Data frame with 2 columns: old_path and new_path. Data frame can contain hundreds of rows.



      The script iterates over a list of files.



      For each file in the list, check if any part of its folder path matches a value in the old_path column. If there is a match, replace the file's matched old_path with the corresponding new_path value.



      I achieved this with for index, row in df.iterrows(): or for row in df.itertuples():, but I'm thinking there should be a more efficient way to do it without having to use the second for loop.



      Any help is appreciated. Sample below uses df.iterrows()



      import pandas as pd
      import os

      df = pd.read_csv('path_lookup.csv')
      # df:
      # old_path new_path
      # 0 F:BusinessBudget & Forecasting M:BusinessFinanceForecast
      # 1 F:BusinessTreasury Shared M:BusinessFinanceTreasury
      # 2 C:Temp C:NewTemp

      excel_link_analysis_list = [
      {'excel_filename': 'C:\Temp\12345\Distribution Adjusted Claim.xlsx',
      'file_read': 'OK'},
      {'excel_filename': 'C:\Temp\SubFolder\cost estimates.xlsx',
      'file_read': 'OK'}
      ]

      for i in excel_link_analysis_list:
      for index, row in df.iterrows():
      if row['old_path'].lower() in i['excel_filename'].lower():
      dest_path_and_file = i['excel_filename'].lower().replace(row['old_path'].lower(),
      row['new_path'].lower())
      print(dest_path_and_file)


      prints:




      c:newtemp12345distribution adjusted claim.xlsx



      c:newtempsubfoldercost estimates.xlsx











      share|improve this question















      Data frame with 2 columns: old_path and new_path. Data frame can contain hundreds of rows.



      The script iterates over a list of files.



      For each file in the list, check if any part of its folder path matches a value in the old_path column. If there is a match, replace the file's matched old_path with the corresponding new_path value.



      I achieved this with for index, row in df.iterrows(): or for row in df.itertuples():, but I'm thinking there should be a more efficient way to do it without having to use the second for loop.



      Any help is appreciated. Sample below uses df.iterrows()



      import pandas as pd
      import os

      df = pd.read_csv('path_lookup.csv')
      # df:
      # old_path new_path
      # 0 F:BusinessBudget & Forecasting M:BusinessFinanceForecast
      # 1 F:BusinessTreasury Shared M:BusinessFinanceTreasury
      # 2 C:Temp C:NewTemp

      excel_link_analysis_list = [
      {'excel_filename': 'C:\Temp\12345\Distribution Adjusted Claim.xlsx',
      'file_read': 'OK'},
      {'excel_filename': 'C:\Temp\SubFolder\cost estimates.xlsx',
      'file_read': 'OK'}
      ]

      for i in excel_link_analysis_list:
      for index, row in df.iterrows():
      if row['old_path'].lower() in i['excel_filename'].lower():
      dest_path_and_file = i['excel_filename'].lower().replace(row['old_path'].lower(),
      row['new_path'].lower())
      print(dest_path_and_file)


      prints:




      c:newtemp12345distribution adjusted claim.xlsx



      c:newtempsubfoldercost estimates.xlsx








      python pandas loops for-loop filepath






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 20 at 1:48









      oetoni

      749621




      749621










      asked Nov 20 at 0:23









      Ruan

      284




      284
























          1 Answer
          1






          active

          oldest

          votes


















          0














          Yes, pandas has nice built in string comparison functions, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains



          This is how you could use Series.str.contains to get the index of the matching value (i.e. from the column old_path). You could then use that index to go back and get the value of new_path



          Edit: updated to handle the case where new_path_matches has one value.



          import pandas as pd

          old_path = df['old_path']
          new_path = df['new_path']

          for filename in filenames:
          b = old_path.str.contains(filename)

          # Get the index of matches from `old_path` column
          indeces_of_matches = b[b].index.values

          # use the index of matches to get the corresponding `new_path' values
          new_path_matches = old_path.loc[indeces_of_matches]

          if (new_path_matches.value.shape[0]>0):
          print new_path_matches.values[0] # print the new_path value





          share|improve this answer























          • Thanks for the fast reply. When running the above, slighty modified, I get an error: raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2 on line: b = old_path.str.contains(i['excel_filename']). Thinking it has to do with the backslashes in the filepaths.
            – Ruan
            Nov 20 at 1:06










          • Fixed it by changing to b = old_path.str.contains(i['excel_filename'], regex=False, case=False). This still does not match though. All iterations return False if you do a print(b) immediately afterwards.
            – Ruan
            Nov 20 at 1:16












          • It now successfully matches after adding the following: file_path = os.path.dirname(os.path.abspath(i['excel_filename'])) and then changed the next line to read b = old_path.str.contains(file_path, case=False, regex=False).any(). Now getting a new error indices_of_matches = b[b].index.values gives error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
            – Ruan
            Nov 20 at 2:11










          • That's because the call to .any() returns a numpy array rather than a series. What happens if you print b?
            – killian95
            Nov 20 at 2:49










          • Removing the call to .any() works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool, which is great. Last line: print(new_path_matches[0]) gives KeyError: 0 though.
            – Ruan
            Nov 20 at 3:13













          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53384525%2fpython-replace-part-of-file-path-using-pandas-match%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          0














          Yes, pandas has nice built in string comparison functions, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains



          This is how you could use Series.str.contains to get the index of the matching value (i.e. from the column old_path). You could then use that index to go back and get the value of new_path



          Edit: updated to handle the case where new_path_matches has one value.



          import pandas as pd

          old_path = df['old_path']
          new_path = df['new_path']

          for filename in filenames:
          b = old_path.str.contains(filename)

          # Get the index of matches from `old_path` column
          indeces_of_matches = b[b].index.values

          # use the index of matches to get the corresponding `new_path' values
          new_path_matches = old_path.loc[indeces_of_matches]

          if (new_path_matches.value.shape[0]>0):
          print new_path_matches.values[0] # print the new_path value





          share|improve this answer























          • Thanks for the fast reply. When running the above, slighty modified, I get an error: raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2 on line: b = old_path.str.contains(i['excel_filename']). Thinking it has to do with the backslashes in the filepaths.
            – Ruan
            Nov 20 at 1:06










          • Fixed it by changing to b = old_path.str.contains(i['excel_filename'], regex=False, case=False). This still does not match though. All iterations return False if you do a print(b) immediately afterwards.
            – Ruan
            Nov 20 at 1:16












          • It now successfully matches after adding the following: file_path = os.path.dirname(os.path.abspath(i['excel_filename'])) and then changed the next line to read b = old_path.str.contains(file_path, case=False, regex=False).any(). Now getting a new error indices_of_matches = b[b].index.values gives error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
            – Ruan
            Nov 20 at 2:11










          • That's because the call to .any() returns a numpy array rather than a series. What happens if you print b?
            – killian95
            Nov 20 at 2:49










          • Removing the call to .any() works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool, which is great. Last line: print(new_path_matches[0]) gives KeyError: 0 though.
            – Ruan
            Nov 20 at 3:13


















          0














          Yes, pandas has nice built in string comparison functions, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains



          This is how you could use Series.str.contains to get the index of the matching value (i.e. from the column old_path). You could then use that index to go back and get the value of new_path



          Edit: updated to handle the case where new_path_matches has one value.



          import pandas as pd

          old_path = df['old_path']
          new_path = df['new_path']

          for filename in filenames:
          b = old_path.str.contains(filename)

          # Get the index of matches from `old_path` column
          indeces_of_matches = b[b].index.values

          # use the index of matches to get the corresponding `new_path' values
          new_path_matches = old_path.loc[indeces_of_matches]

          if (new_path_matches.value.shape[0]>0):
          print new_path_matches.values[0] # print the new_path value





          share|improve this answer























          • Thanks for the fast reply. When running the above, slighty modified, I get an error: raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2 on line: b = old_path.str.contains(i['excel_filename']). Thinking it has to do with the backslashes in the filepaths.
            – Ruan
            Nov 20 at 1:06










          • Fixed it by changing to b = old_path.str.contains(i['excel_filename'], regex=False, case=False). This still does not match though. All iterations return False if you do a print(b) immediately afterwards.
            – Ruan
            Nov 20 at 1:16












          • It now successfully matches after adding the following: file_path = os.path.dirname(os.path.abspath(i['excel_filename'])) and then changed the next line to read b = old_path.str.contains(file_path, case=False, regex=False).any(). Now getting a new error indices_of_matches = b[b].index.values gives error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
            – Ruan
            Nov 20 at 2:11










          • That's because the call to .any() returns a numpy array rather than a series. What happens if you print b?
            – killian95
            Nov 20 at 2:49










          • Removing the call to .any() works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool, which is great. Last line: print(new_path_matches[0]) gives KeyError: 0 though.
            – Ruan
            Nov 20 at 3:13
















          0












          0








          0






          Yes, pandas has nice built in string comparison functions, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains



          This is how you could use Series.str.contains to get the index of the matching value (i.e. from the column old_path). You could then use that index to go back and get the value of new_path



          Edit: updated to handle the case where new_path_matches has one value.



          import pandas as pd

          old_path = df['old_path']
          new_path = df['new_path']

          for filename in filenames:
          b = old_path.str.contains(filename)

          # Get the index of matches from `old_path` column
          indeces_of_matches = b[b].index.values

          # use the index of matches to get the corresponding `new_path' values
          new_path_matches = old_path.loc[indeces_of_matches]

          if (new_path_matches.value.shape[0]>0):
          print new_path_matches.values[0] # print the new_path value





          share|improve this answer














          Yes, pandas has nice built in string comparison functions, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains



          This is how you could use Series.str.contains to get the index of the matching value (i.e. from the column old_path). You could then use that index to go back and get the value of new_path



          Edit: updated to handle the case where new_path_matches has one value.



          import pandas as pd

          old_path = df['old_path']
          new_path = df['new_path']

          for filename in filenames:
          b = old_path.str.contains(filename)

          # Get the index of matches from `old_path` column
          indeces_of_matches = b[b].index.values

          # use the index of matches to get the corresponding `new_path' values
          new_path_matches = old_path.loc[indeces_of_matches]

          if (new_path_matches.value.shape[0]>0):
          print new_path_matches.values[0] # print the new_path value






          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Nov 20 at 4:53

























          answered Nov 20 at 0:34









          killian95

          658310




          658310












          • Thanks for the fast reply. When running the above, slighty modified, I get an error: raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2 on line: b = old_path.str.contains(i['excel_filename']). Thinking it has to do with the backslashes in the filepaths.
            – Ruan
            Nov 20 at 1:06










          • Fixed it by changing to b = old_path.str.contains(i['excel_filename'], regex=False, case=False). This still does not match though. All iterations return False if you do a print(b) immediately afterwards.
            – Ruan
            Nov 20 at 1:16












          • It now successfully matches after adding the following: file_path = os.path.dirname(os.path.abspath(i['excel_filename'])) and then changed the next line to read b = old_path.str.contains(file_path, case=False, regex=False).any(). Now getting a new error indices_of_matches = b[b].index.values gives error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
            – Ruan
            Nov 20 at 2:11










          • That's because the call to .any() returns a numpy array rather than a series. What happens if you print b?
            – killian95
            Nov 20 at 2:49










          • Removing the call to .any() works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool, which is great. Last line: print(new_path_matches[0]) gives KeyError: 0 though.
            – Ruan
            Nov 20 at 3:13




















          • Thanks for the fast reply. When running the above, slighty modified, I get an error: raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2 on line: b = old_path.str.contains(i['excel_filename']). Thinking it has to do with the backslashes in the filepaths.
            – Ruan
            Nov 20 at 1:06










          • Fixed it by changing to b = old_path.str.contains(i['excel_filename'], regex=False, case=False). This still does not match though. All iterations return False if you do a print(b) immediately afterwards.
            – Ruan
            Nov 20 at 1:16












          • It now successfully matches after adding the following: file_path = os.path.dirname(os.path.abspath(i['excel_filename'])) and then changed the next line to read b = old_path.str.contains(file_path, case=False, regex=False).any(). Now getting a new error indices_of_matches = b[b].index.values gives error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
            – Ruan
            Nov 20 at 2:11










          • That's because the call to .any() returns a numpy array rather than a series. What happens if you print b?
            – killian95
            Nov 20 at 2:49










          • Removing the call to .any() works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool, which is great. Last line: print(new_path_matches[0]) gives KeyError: 0 though.
            – Ruan
            Nov 20 at 3:13


















          Thanks for the fast reply. When running the above, slighty modified, I get an error: raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2 on line: b = old_path.str.contains(i['excel_filename']). Thinking it has to do with the backslashes in the filepaths.
          – Ruan
          Nov 20 at 1:06




          Thanks for the fast reply. When running the above, slighty modified, I get an error: raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2 on line: b = old_path.str.contains(i['excel_filename']). Thinking it has to do with the backslashes in the filepaths.
          – Ruan
          Nov 20 at 1:06












          Fixed it by changing to b = old_path.str.contains(i['excel_filename'], regex=False, case=False). This still does not match though. All iterations return False if you do a print(b) immediately afterwards.
          – Ruan
          Nov 20 at 1:16






          Fixed it by changing to b = old_path.str.contains(i['excel_filename'], regex=False, case=False). This still does not match though. All iterations return False if you do a print(b) immediately afterwards.
          – Ruan
          Nov 20 at 1:16














          It now successfully matches after adding the following: file_path = os.path.dirname(os.path.abspath(i['excel_filename'])) and then changed the next line to read b = old_path.str.contains(file_path, case=False, regex=False).any(). Now getting a new error indices_of_matches = b[b].index.values gives error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
          – Ruan
          Nov 20 at 2:11




          It now successfully matches after adding the following: file_path = os.path.dirname(os.path.abspath(i['excel_filename'])) and then changed the next line to read b = old_path.str.contains(file_path, case=False, regex=False).any(). Now getting a new error indices_of_matches = b[b].index.values gives error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
          – Ruan
          Nov 20 at 2:11












          That's because the call to .any() returns a numpy array rather than a series. What happens if you print b?
          – killian95
          Nov 20 at 2:49




          That's because the call to .any() returns a numpy array rather than a series. What happens if you print b?
          – killian95
          Nov 20 at 2:49












          Removing the call to .any() works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool, which is great. Last line: print(new_path_matches[0]) gives KeyError: 0 though.
          – Ruan
          Nov 20 at 3:13






          Removing the call to .any() works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool, which is great. Last line: print(new_path_matches[0]) gives KeyError: 0 though.
          – Ruan
          Nov 20 at 3:13




















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.





          Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


          Please pay close attention to the following guidance:


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53384525%2fpython-replace-part-of-file-path-using-pandas-match%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          "Incorrect syntax near the keyword 'ON'. (on update cascade, on delete cascade,)

          Alcedinidae

          Origin of the phrase “under your belt”?