Python: replace part of file path using pandas match

Data frame with 2 columns: old_path and new_path. Data frame can contain hundreds of rows.

The script iterates over a list of files.

For each file in the list, check if any part of its folder path matches a value in the old_path column. If there is a match, replace the file's matched old_path with the corresponding new_path value.

I achieved this with for index, row in df.iterrows(): or for row in df.itertuples():, but I'm thinking there should be a more efficient way to do it without having to use the second for loop.

Any help is appreciated. Sample below uses df.iterrows()

import pandas as pd

import os



df = pd.read_csv('path_lookup.csv')

# df:

#                                         old_path                      new_path

# 0               F:BusinessBudget & Forecasting  M:BusinessFinanceForecast

# 1                    F:BusinessTreasury Shared  M:BusinessFinanceTreasury

# 2                                        C:Temp                    C:NewTemp



excel_link_analysis_list = [

    {'excel_filename': 'C:\Temp\12345\Distribution Adjusted Claim.xlsx',

     'file_read': 'OK'},

    {'excel_filename': 'C:\Temp\SubFolder\cost estimates.xlsx',

     'file_read': 'OK'}

]



for i in excel_link_analysis_list:

    for index, row in df.iterrows():

        if row['old_path'].lower() in i['excel_filename'].lower():

            dest_path_and_file = i['excel_filename'].lower().replace(row['old_path'].lower(), 

                                                                     row['new_path'].lower())

            print(dest_path_and_file)

prints:

c:newtemp12345distribution adjusted claim.xlsx

c:newtempsubfoldercost estimates.xlsx

edited Nov 20 at 1:48

oetoni

749621

asked Nov 20 at 0:23

Ruan

284

add a comment |

Data frame with 2 columns: old_path and new_path. Data frame can contain hundreds of rows.

The script iterates over a list of files.

I achieved this with for index, row in df.iterrows(): or for row in df.itertuples():, but I'm thinking there should be a more efficient way to do it without having to use the second for loop.

Any help is appreciated. Sample below uses df.iterrows()

import pandas as pd

import os



df = pd.read_csv('path_lookup.csv')

# df:

#                                         old_path                      new_path

# 0               F:BusinessBudget & Forecasting  M:BusinessFinanceForecast

# 1                    F:BusinessTreasury Shared  M:BusinessFinanceTreasury

# 2                                        C:Temp                    C:NewTemp



excel_link_analysis_list = [

    {'excel_filename': 'C:\Temp\12345\Distribution Adjusted Claim.xlsx',

     'file_read': 'OK'},

    {'excel_filename': 'C:\Temp\SubFolder\cost estimates.xlsx',

     'file_read': 'OK'}

]



for i in excel_link_analysis_list:

    for index, row in df.iterrows():

        if row['old_path'].lower() in i['excel_filename'].lower():

            dest_path_and_file = i['excel_filename'].lower().replace(row['old_path'].lower(), 

                                                                     row['new_path'].lower())

            print(dest_path_and_file)

prints:

c:newtemp12345distribution adjusted claim.xlsx

c:newtempsubfoldercost estimates.xlsx

edited Nov 20 at 1:48

oetoni

749621

asked Nov 20 at 0:23

Ruan

284

add a comment |

Data frame with 2 columns: old_path and new_path. Data frame can contain hundreds of rows.

The script iterates over a list of files.

I achieved this with for index, row in df.iterrows(): or for row in df.itertuples():, but I'm thinking there should be a more efficient way to do it without having to use the second for loop.

Any help is appreciated. Sample below uses df.iterrows()

import pandas as pd

import os



df = pd.read_csv('path_lookup.csv')

# df:

#                                         old_path                      new_path

# 0               F:BusinessBudget & Forecasting  M:BusinessFinanceForecast

# 1                    F:BusinessTreasury Shared  M:BusinessFinanceTreasury

# 2                                        C:Temp                    C:NewTemp



excel_link_analysis_list = [

    {'excel_filename': 'C:\Temp\12345\Distribution Adjusted Claim.xlsx',

     'file_read': 'OK'},

    {'excel_filename': 'C:\Temp\SubFolder\cost estimates.xlsx',

     'file_read': 'OK'}

]



for i in excel_link_analysis_list:

    for index, row in df.iterrows():

        if row['old_path'].lower() in i['excel_filename'].lower():

            dest_path_and_file = i['excel_filename'].lower().replace(row['old_path'].lower(), 

                                                                     row['new_path'].lower())

            print(dest_path_and_file)

prints:

c:newtemp12345distribution adjusted claim.xlsx

c:newtempsubfoldercost estimates.xlsx

edited Nov 20 at 1:48

oetoni

749621

asked Nov 20 at 0:23

Ruan

284

Data frame with 2 columns: old_path and new_path. Data frame can contain hundreds of rows.

The script iterates over a list of files.

I achieved this with for index, row in df.iterrows(): or for row in df.itertuples():, but I'm thinking there should be a more efficient way to do it without having to use the second for loop.

Any help is appreciated. Sample below uses df.iterrows()

import pandas as pd

import os



df = pd.read_csv('path_lookup.csv')

# df:

#                                         old_path                      new_path

# 0               F:BusinessBudget & Forecasting  M:BusinessFinanceForecast

# 1                    F:BusinessTreasury Shared  M:BusinessFinanceTreasury

# 2                                        C:Temp                    C:NewTemp



excel_link_analysis_list = [

    {'excel_filename': 'C:\Temp\12345\Distribution Adjusted Claim.xlsx',

     'file_read': 'OK'},

    {'excel_filename': 'C:\Temp\SubFolder\cost estimates.xlsx',

     'file_read': 'OK'}

]



for i in excel_link_analysis_list:

    for index, row in df.iterrows():

        if row['old_path'].lower() in i['excel_filename'].lower():

            dest_path_and_file = i['excel_filename'].lower().replace(row['old_path'].lower(), 

                                                                     row['new_path'].lower())

            print(dest_path_and_file)

prints:

c:newtemp12345distribution adjusted claim.xlsx

c:newtempsubfoldercost estimates.xlsx

python pandas loops for-loop filepath

edited Nov 20 at 1:48

oetoni

749621

asked Nov 20 at 0:23

Ruan

284

edited Nov 20 at 1:48

oetoni

749621

asked Nov 20 at 0:23

Ruan

284

edited Nov 20 at 1:48

oetoni

749621

edited Nov 20 at 1:48

oetoni

749621

edited Nov 20 at 1:48

oetoni

749621

asked Nov 20 at 0:23

Ruan

284

asked Nov 20 at 0:23

Ruan

284

asked Nov 20 at 0:23

Ruan

284

add a comment |

1 Answer
1

active

oldest

votes

Yes, pandas has nice built in string comparison functions, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains

This is how you could use Series.str.contains to get the index of the matching value (i.e. from the column old_path). You could then use that index to go back and get the value of new_path

Edit: updated to handle the case where new_path_matches has one value.

import pandas as pd



old_path = df['old_path']

new_path = df['new_path']



for filename in filenames:

    b = old_path.str.contains(filename)



    # Get the index of matches from `old_path` column

    indeces_of_matches = b[b].index.values    



    # use the index of matches to get the corresponding `new_path' values 

    new_path_matches = old_path.loc[indeces_of_matches]



    if (new_path_matches.value.shape[0]>0):

        print new_path_matches.values[0]   # print the new_path value

edited Nov 20 at 4:53

answered Nov 20 at 0:34

killian95

658310

Thanks for the fast reply. When running the above, slighty modified, I get an error: raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2 on line: b = old_path.str.contains(i['excel_filename']). Thinking it has to do with the backslashes in the filepaths.
– Ruan
Nov 20 at 1:06

Fixed it by changing to b = old_path.str.contains(i['excel_filename'], regex=False, case=False). This still does not match though. All iterations return False if you do a print(b) immediately afterwards.
– Ruan
Nov 20 at 1:16

It now successfully matches after adding the following: file_path = os.path.dirname(os.path.abspath(i['excel_filename'])) and then changed the next line to read b = old_path.str.contains(file_path, case=False, regex=False).any(). Now getting a new error indices_of_matches = b[b].index.values gives error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
– Ruan
Nov 20 at 2:11

That's because the call to .any() returns a numpy array rather than a series. What happens if you print b?
– killian95
Nov 20 at 2:49

Removing the call to .any() works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool, which is great. Last line: print(new_path_matches[0]) gives KeyError: 0 though.
– Ruan
Nov 20 at 3:13

|
show 3 more comments

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53384525%2fpython-replace-part-of-file-path-using-pandas-match%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Yes, pandas has nice built in string comparison functions, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains

This is how you could use Series.str.contains to get the index of the matching value (i.e. from the column old_path). You could then use that index to go back and get the value of new_path

Edit: updated to handle the case where new_path_matches has one value.

import pandas as pd



old_path = df['old_path']

new_path = df['new_path']



for filename in filenames:

    b = old_path.str.contains(filename)



    # Get the index of matches from `old_path` column

    indeces_of_matches = b[b].index.values    



    # use the index of matches to get the corresponding `new_path' values 

    new_path_matches = old_path.loc[indeces_of_matches]



    if (new_path_matches.value.shape[0]>0):

        print new_path_matches.values[0]   # print the new_path value

edited Nov 20 at 4:53

answered Nov 20 at 0:34

killian95

658310

Thanks for the fast reply. When running the above, slighty modified, I get an error: raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2 on line: b = old_path.str.contains(i['excel_filename']). Thinking it has to do with the backslashes in the filepaths.
– Ruan
Nov 20 at 1:06

Fixed it by changing to b = old_path.str.contains(i['excel_filename'], regex=False, case=False). This still does not match though. All iterations return False if you do a print(b) immediately afterwards.
– Ruan
Nov 20 at 1:16

It now successfully matches after adding the following: file_path = os.path.dirname(os.path.abspath(i['excel_filename'])) and then changed the next line to read b = old_path.str.contains(file_path, case=False, regex=False).any(). Now getting a new error indices_of_matches = b[b].index.values gives error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
– Ruan
Nov 20 at 2:11

That's because the call to .any() returns a numpy array rather than a series. What happens if you print b?
– killian95
Nov 20 at 2:49

Removing the call to .any() works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool, which is great. Last line: print(new_path_matches[0]) gives KeyError: 0 though.
– Ruan
Nov 20 at 3:13

|
show 3 more comments

Yes, pandas has nice built in string comparison functions, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains

This is how you could use Series.str.contains to get the index of the matching value (i.e. from the column old_path). You could then use that index to go back and get the value of new_path

Edit: updated to handle the case where new_path_matches has one value.

import pandas as pd



old_path = df['old_path']

new_path = df['new_path']



for filename in filenames:

    b = old_path.str.contains(filename)



    # Get the index of matches from `old_path` column

    indeces_of_matches = b[b].index.values    



    # use the index of matches to get the corresponding `new_path' values 

    new_path_matches = old_path.loc[indeces_of_matches]



    if (new_path_matches.value.shape[0]>0):

        print new_path_matches.values[0]   # print the new_path value

edited Nov 20 at 4:53

answered Nov 20 at 0:34

killian95

658310

Thanks for the fast reply. When running the above, slighty modified, I get an error: raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2 on line: b = old_path.str.contains(i['excel_filename']). Thinking it has to do with the backslashes in the filepaths.
– Ruan
Nov 20 at 1:06

Fixed it by changing to b = old_path.str.contains(i['excel_filename'], regex=False, case=False). This still does not match though. All iterations return False if you do a print(b) immediately afterwards.
– Ruan
Nov 20 at 1:16

It now successfully matches after adding the following: file_path = os.path.dirname(os.path.abspath(i['excel_filename'])) and then changed the next line to read b = old_path.str.contains(file_path, case=False, regex=False).any(). Now getting a new error indices_of_matches = b[b].index.values gives error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
– Ruan
Nov 20 at 2:11

That's because the call to .any() returns a numpy array rather than a series. What happens if you print b?
– killian95
Nov 20 at 2:49

Removing the call to .any() works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool, which is great. Last line: print(new_path_matches[0]) gives KeyError: 0 though.
– Ruan
Nov 20 at 3:13

|
show 3 more comments

Yes, pandas has nice built in string comparison functions, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains

This is how you could use Series.str.contains to get the index of the matching value (i.e. from the column old_path). You could then use that index to go back and get the value of new_path

Edit: updated to handle the case where new_path_matches has one value.

import pandas as pd



old_path = df['old_path']

new_path = df['new_path']



for filename in filenames:

    b = old_path.str.contains(filename)



    # Get the index of matches from `old_path` column

    indeces_of_matches = b[b].index.values    



    # use the index of matches to get the corresponding `new_path' values 

    new_path_matches = old_path.loc[indeces_of_matches]



    if (new_path_matches.value.shape[0]>0):

        print new_path_matches.values[0]   # print the new_path value

edited Nov 20 at 4:53

answered Nov 20 at 0:34

killian95

658310

Yes, pandas has nice built in string comparison functions, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains

This is how you could use Series.str.contains to get the index of the matching value (i.e. from the column old_path). You could then use that index to go back and get the value of new_path

Edit: updated to handle the case where new_path_matches has one value.

import pandas as pd



old_path = df['old_path']

new_path = df['new_path']



for filename in filenames:

    b = old_path.str.contains(filename)



    # Get the index of matches from `old_path` column

    indeces_of_matches = b[b].index.values    



    # use the index of matches to get the corresponding `new_path' values 

    new_path_matches = old_path.loc[indeces_of_matches]



    if (new_path_matches.value.shape[0]>0):

        print new_path_matches.values[0]   # print the new_path value

edited Nov 20 at 4:53

answered Nov 20 at 0:34

killian95

658310

edited Nov 20 at 4:53

answered Nov 20 at 0:34

killian95

658310

answered Nov 20 at 0:34

killian95

658310

answered Nov 20 at 0:34

killian95

658310

Thanks for the fast reply. When running the above, slighty modified, I get an error: raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2 on line: b = old_path.str.contains(i['excel_filename']). Thinking it has to do with the backslashes in the filepaths.
– Ruan
Nov 20 at 1:06

Fixed it by changing to b = old_path.str.contains(i['excel_filename'], regex=False, case=False). This still does not match though. All iterations return False if you do a print(b) immediately afterwards.
– Ruan
Nov 20 at 1:16

It now successfully matches after adding the following: file_path = os.path.dirname(os.path.abspath(i['excel_filename'])) and then changed the next line to read b = old_path.str.contains(file_path, case=False, regex=False).any(). Now getting a new error indices_of_matches = b[b].index.values gives error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
– Ruan
Nov 20 at 2:11

That's because the call to .any() returns a numpy array rather than a series. What happens if you print b?
– killian95
Nov 20 at 2:49

Removing the call to .any() works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool, which is great. Last line: print(new_path_matches[0]) gives KeyError: 0 though.
– Ruan
Nov 20 at 3:13

|
show 3 more comments

Thanks for the fast reply. When running the above, slighty modified, I get an error: raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2 on line: b = old_path.str.contains(i['excel_filename']). Thinking it has to do with the backslashes in the filepaths.
– Ruan
Nov 20 at 1:06

Fixed it by changing to b = old_path.str.contains(i['excel_filename'], regex=False, case=False). This still does not match though. All iterations return False if you do a print(b) immediately afterwards.
– Ruan
Nov 20 at 1:16

It now successfully matches after adding the following: file_path = os.path.dirname(os.path.abspath(i['excel_filename'])) and then changed the next line to read b = old_path.str.contains(file_path, case=False, regex=False).any(). Now getting a new error indices_of_matches = b[b].index.values gives error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
– Ruan
Nov 20 at 2:11

That's because the call to .any() returns a numpy array rather than a series. What happens if you print b?
– killian95
Nov 20 at 2:49

Removing the call to .any() works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool, which is great. Last line: print(new_path_matches[0]) gives KeyError: 0 though.
– Ruan
Nov 20 at 3:13

Thanks for the fast reply. When running the above, slighty modified, I get an error: raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2 on line: b = old_path.str.contains(i['excel_filename']). Thinking it has to do with the backslashes in the filepaths.
– Ruan
Nov 20 at 1:06

Fixed it by changing to b = old_path.str.contains(i['excel_filename'], regex=False, case=False). This still does not match though. All iterations return False if you do a print(b) immediately afterwards.
– Ruan
Nov 20 at 1:16

It now successfully matches after adding the following: file_path = os.path.dirname(os.path.abspath(i['excel_filename'])) and then changed the next line to read b = old_path.str.contains(file_path, case=False, regex=False).any(). Now getting a new error indices_of_matches = b[b].index.values gives error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
– Ruan
Nov 20 at 2:11

That's because the call to .any() returns a numpy array rather than a series. What happens if you print b?
– killian95
Nov 20 at 2:49

Removing the call to .any() works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool, which is great. Last line: print(new_path_matches[0]) gives KeyError: 0 though.
– Ruan
Nov 20 at 3:13

|
show 3 more comments

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Argthtjtr