Python: replace part of file path using pandas match
Data frame with 2 columns: old_path
and new_path
. Data frame can contain hundreds of rows.
The script iterates over a list of files.
For each file in the list, check if any part of its folder path matches a value in the old_path
column. If there is a match, replace the file's matched old_path
with the corresponding new_path
value.
I achieved this with for index, row in df.iterrows():
or for row in df.itertuples():
, but I'm thinking there should be a more efficient way to do it without having to use the second for
loop.
Any help is appreciated. Sample below uses df.iterrows()
import pandas as pd
import os
df = pd.read_csv('path_lookup.csv')
# df:
# old_path new_path
# 0 F:BusinessBudget & Forecasting M:BusinessFinanceForecast
# 1 F:BusinessTreasury Shared M:BusinessFinanceTreasury
# 2 C:Temp C:NewTemp
excel_link_analysis_list = [
{'excel_filename': 'C:\Temp\12345\Distribution Adjusted Claim.xlsx',
'file_read': 'OK'},
{'excel_filename': 'C:\Temp\SubFolder\cost estimates.xlsx',
'file_read': 'OK'}
]
for i in excel_link_analysis_list:
for index, row in df.iterrows():
if row['old_path'].lower() in i['excel_filename'].lower():
dest_path_and_file = i['excel_filename'].lower().replace(row['old_path'].lower(),
row['new_path'].lower())
print(dest_path_and_file)
prints:
c:newtemp12345distribution adjusted claim.xlsx
c:newtempsubfoldercost estimates.xlsx
python pandas loops for-loop filepath
add a comment |
Data frame with 2 columns: old_path
and new_path
. Data frame can contain hundreds of rows.
The script iterates over a list of files.
For each file in the list, check if any part of its folder path matches a value in the old_path
column. If there is a match, replace the file's matched old_path
with the corresponding new_path
value.
I achieved this with for index, row in df.iterrows():
or for row in df.itertuples():
, but I'm thinking there should be a more efficient way to do it without having to use the second for
loop.
Any help is appreciated. Sample below uses df.iterrows()
import pandas as pd
import os
df = pd.read_csv('path_lookup.csv')
# df:
# old_path new_path
# 0 F:BusinessBudget & Forecasting M:BusinessFinanceForecast
# 1 F:BusinessTreasury Shared M:BusinessFinanceTreasury
# 2 C:Temp C:NewTemp
excel_link_analysis_list = [
{'excel_filename': 'C:\Temp\12345\Distribution Adjusted Claim.xlsx',
'file_read': 'OK'},
{'excel_filename': 'C:\Temp\SubFolder\cost estimates.xlsx',
'file_read': 'OK'}
]
for i in excel_link_analysis_list:
for index, row in df.iterrows():
if row['old_path'].lower() in i['excel_filename'].lower():
dest_path_and_file = i['excel_filename'].lower().replace(row['old_path'].lower(),
row['new_path'].lower())
print(dest_path_and_file)
prints:
c:newtemp12345distribution adjusted claim.xlsx
c:newtempsubfoldercost estimates.xlsx
python pandas loops for-loop filepath
add a comment |
Data frame with 2 columns: old_path
and new_path
. Data frame can contain hundreds of rows.
The script iterates over a list of files.
For each file in the list, check if any part of its folder path matches a value in the old_path
column. If there is a match, replace the file's matched old_path
with the corresponding new_path
value.
I achieved this with for index, row in df.iterrows():
or for row in df.itertuples():
, but I'm thinking there should be a more efficient way to do it without having to use the second for
loop.
Any help is appreciated. Sample below uses df.iterrows()
import pandas as pd
import os
df = pd.read_csv('path_lookup.csv')
# df:
# old_path new_path
# 0 F:BusinessBudget & Forecasting M:BusinessFinanceForecast
# 1 F:BusinessTreasury Shared M:BusinessFinanceTreasury
# 2 C:Temp C:NewTemp
excel_link_analysis_list = [
{'excel_filename': 'C:\Temp\12345\Distribution Adjusted Claim.xlsx',
'file_read': 'OK'},
{'excel_filename': 'C:\Temp\SubFolder\cost estimates.xlsx',
'file_read': 'OK'}
]
for i in excel_link_analysis_list:
for index, row in df.iterrows():
if row['old_path'].lower() in i['excel_filename'].lower():
dest_path_and_file = i['excel_filename'].lower().replace(row['old_path'].lower(),
row['new_path'].lower())
print(dest_path_and_file)
prints:
c:newtemp12345distribution adjusted claim.xlsx
c:newtempsubfoldercost estimates.xlsx
python pandas loops for-loop filepath
Data frame with 2 columns: old_path
and new_path
. Data frame can contain hundreds of rows.
The script iterates over a list of files.
For each file in the list, check if any part of its folder path matches a value in the old_path
column. If there is a match, replace the file's matched old_path
with the corresponding new_path
value.
I achieved this with for index, row in df.iterrows():
or for row in df.itertuples():
, but I'm thinking there should be a more efficient way to do it without having to use the second for
loop.
Any help is appreciated. Sample below uses df.iterrows()
import pandas as pd
import os
df = pd.read_csv('path_lookup.csv')
# df:
# old_path new_path
# 0 F:BusinessBudget & Forecasting M:BusinessFinanceForecast
# 1 F:BusinessTreasury Shared M:BusinessFinanceTreasury
# 2 C:Temp C:NewTemp
excel_link_analysis_list = [
{'excel_filename': 'C:\Temp\12345\Distribution Adjusted Claim.xlsx',
'file_read': 'OK'},
{'excel_filename': 'C:\Temp\SubFolder\cost estimates.xlsx',
'file_read': 'OK'}
]
for i in excel_link_analysis_list:
for index, row in df.iterrows():
if row['old_path'].lower() in i['excel_filename'].lower():
dest_path_and_file = i['excel_filename'].lower().replace(row['old_path'].lower(),
row['new_path'].lower())
print(dest_path_and_file)
prints:
c:newtemp12345distribution adjusted claim.xlsx
c:newtempsubfoldercost estimates.xlsx
python pandas loops for-loop filepath
python pandas loops for-loop filepath
edited Nov 20 at 1:48
oetoni
749621
749621
asked Nov 20 at 0:23
Ruan
284
284
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
Yes, pandas has nice built in string comparison functions, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains
This is how you could use Series.str.contains
to get the index of the matching value (i.e. from the column old_path
). You could then use that index to go back and get the value of new_path
Edit: updated to handle the case where new_path_matches has one value.
import pandas as pd
old_path = df['old_path']
new_path = df['new_path']
for filename in filenames:
b = old_path.str.contains(filename)
# Get the index of matches from `old_path` column
indeces_of_matches = b[b].index.values
# use the index of matches to get the corresponding `new_path' values
new_path_matches = old_path.loc[indeces_of_matches]
if (new_path_matches.value.shape[0]>0):
print new_path_matches.values[0] # print the new_path value
Thanks for the fast reply. When running the above, slighty modified, I get an error:raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2
on line:b = old_path.str.contains(i['excel_filename'])
. Thinking it has to do with the backslashes in the filepaths.
– Ruan
Nov 20 at 1:06
Fixed it by changing tob = old_path.str.contains(i['excel_filename'], regex=False, case=False)
. This still does not match though. All iterations returnFalse
if you do aprint(b)
immediately afterwards.
– Ruan
Nov 20 at 1:16
It now successfully matches after adding the following:file_path = os.path.dirname(os.path.abspath(i['excel_filename']))
and then changed the next line to readb = old_path.str.contains(file_path, case=False, regex=False).any()
. Now getting a new errorindices_of_matches = b[b].index.values
gives error:AttributeError: 'numpy.ndarray' object has no attribute 'index'
– Ruan
Nov 20 at 2:11
That's because the call to.any()
returns a numpy array rather than a series. What happens if you print b?
– killian95
Nov 20 at 2:49
Removing the call to.any()
works. Printing b gives0 False 1 False 2 True Name: old_path, dtype: bool
, which is great. Last line:print(new_path_matches[0])
givesKeyError: 0
though.
– Ruan
Nov 20 at 3:13
|
show 3 more comments
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53384525%2fpython-replace-part-of-file-path-using-pandas-match%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Yes, pandas has nice built in string comparison functions, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains
This is how you could use Series.str.contains
to get the index of the matching value (i.e. from the column old_path
). You could then use that index to go back and get the value of new_path
Edit: updated to handle the case where new_path_matches has one value.
import pandas as pd
old_path = df['old_path']
new_path = df['new_path']
for filename in filenames:
b = old_path.str.contains(filename)
# Get the index of matches from `old_path` column
indeces_of_matches = b[b].index.values
# use the index of matches to get the corresponding `new_path' values
new_path_matches = old_path.loc[indeces_of_matches]
if (new_path_matches.value.shape[0]>0):
print new_path_matches.values[0] # print the new_path value
Thanks for the fast reply. When running the above, slighty modified, I get an error:raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2
on line:b = old_path.str.contains(i['excel_filename'])
. Thinking it has to do with the backslashes in the filepaths.
– Ruan
Nov 20 at 1:06
Fixed it by changing tob = old_path.str.contains(i['excel_filename'], regex=False, case=False)
. This still does not match though. All iterations returnFalse
if you do aprint(b)
immediately afterwards.
– Ruan
Nov 20 at 1:16
It now successfully matches after adding the following:file_path = os.path.dirname(os.path.abspath(i['excel_filename']))
and then changed the next line to readb = old_path.str.contains(file_path, case=False, regex=False).any()
. Now getting a new errorindices_of_matches = b[b].index.values
gives error:AttributeError: 'numpy.ndarray' object has no attribute 'index'
– Ruan
Nov 20 at 2:11
That's because the call to.any()
returns a numpy array rather than a series. What happens if you print b?
– killian95
Nov 20 at 2:49
Removing the call to.any()
works. Printing b gives0 False 1 False 2 True Name: old_path, dtype: bool
, which is great. Last line:print(new_path_matches[0])
givesKeyError: 0
though.
– Ruan
Nov 20 at 3:13
|
show 3 more comments
Yes, pandas has nice built in string comparison functions, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains
This is how you could use Series.str.contains
to get the index of the matching value (i.e. from the column old_path
). You could then use that index to go back and get the value of new_path
Edit: updated to handle the case where new_path_matches has one value.
import pandas as pd
old_path = df['old_path']
new_path = df['new_path']
for filename in filenames:
b = old_path.str.contains(filename)
# Get the index of matches from `old_path` column
indeces_of_matches = b[b].index.values
# use the index of matches to get the corresponding `new_path' values
new_path_matches = old_path.loc[indeces_of_matches]
if (new_path_matches.value.shape[0]>0):
print new_path_matches.values[0] # print the new_path value
Thanks for the fast reply. When running the above, slighty modified, I get an error:raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2
on line:b = old_path.str.contains(i['excel_filename'])
. Thinking it has to do with the backslashes in the filepaths.
– Ruan
Nov 20 at 1:06
Fixed it by changing tob = old_path.str.contains(i['excel_filename'], regex=False, case=False)
. This still does not match though. All iterations returnFalse
if you do aprint(b)
immediately afterwards.
– Ruan
Nov 20 at 1:16
It now successfully matches after adding the following:file_path = os.path.dirname(os.path.abspath(i['excel_filename']))
and then changed the next line to readb = old_path.str.contains(file_path, case=False, regex=False).any()
. Now getting a new errorindices_of_matches = b[b].index.values
gives error:AttributeError: 'numpy.ndarray' object has no attribute 'index'
– Ruan
Nov 20 at 2:11
That's because the call to.any()
returns a numpy array rather than a series. What happens if you print b?
– killian95
Nov 20 at 2:49
Removing the call to.any()
works. Printing b gives0 False 1 False 2 True Name: old_path, dtype: bool
, which is great. Last line:print(new_path_matches[0])
givesKeyError: 0
though.
– Ruan
Nov 20 at 3:13
|
show 3 more comments
Yes, pandas has nice built in string comparison functions, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains
This is how you could use Series.str.contains
to get the index of the matching value (i.e. from the column old_path
). You could then use that index to go back and get the value of new_path
Edit: updated to handle the case where new_path_matches has one value.
import pandas as pd
old_path = df['old_path']
new_path = df['new_path']
for filename in filenames:
b = old_path.str.contains(filename)
# Get the index of matches from `old_path` column
indeces_of_matches = b[b].index.values
# use the index of matches to get the corresponding `new_path' values
new_path_matches = old_path.loc[indeces_of_matches]
if (new_path_matches.value.shape[0]>0):
print new_path_matches.values[0] # print the new_path value
Yes, pandas has nice built in string comparison functions, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains
This is how you could use Series.str.contains
to get the index of the matching value (i.e. from the column old_path
). You could then use that index to go back and get the value of new_path
Edit: updated to handle the case where new_path_matches has one value.
import pandas as pd
old_path = df['old_path']
new_path = df['new_path']
for filename in filenames:
b = old_path.str.contains(filename)
# Get the index of matches from `old_path` column
indeces_of_matches = b[b].index.values
# use the index of matches to get the corresponding `new_path' values
new_path_matches = old_path.loc[indeces_of_matches]
if (new_path_matches.value.shape[0]>0):
print new_path_matches.values[0] # print the new_path value
edited Nov 20 at 4:53
answered Nov 20 at 0:34
killian95
658310
658310
Thanks for the fast reply. When running the above, slighty modified, I get an error:raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2
on line:b = old_path.str.contains(i['excel_filename'])
. Thinking it has to do with the backslashes in the filepaths.
– Ruan
Nov 20 at 1:06
Fixed it by changing tob = old_path.str.contains(i['excel_filename'], regex=False, case=False)
. This still does not match though. All iterations returnFalse
if you do aprint(b)
immediately afterwards.
– Ruan
Nov 20 at 1:16
It now successfully matches after adding the following:file_path = os.path.dirname(os.path.abspath(i['excel_filename']))
and then changed the next line to readb = old_path.str.contains(file_path, case=False, regex=False).any()
. Now getting a new errorindices_of_matches = b[b].index.values
gives error:AttributeError: 'numpy.ndarray' object has no attribute 'index'
– Ruan
Nov 20 at 2:11
That's because the call to.any()
returns a numpy array rather than a series. What happens if you print b?
– killian95
Nov 20 at 2:49
Removing the call to.any()
works. Printing b gives0 False 1 False 2 True Name: old_path, dtype: bool
, which is great. Last line:print(new_path_matches[0])
givesKeyError: 0
though.
– Ruan
Nov 20 at 3:13
|
show 3 more comments
Thanks for the fast reply. When running the above, slighty modified, I get an error:raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2
on line:b = old_path.str.contains(i['excel_filename'])
. Thinking it has to do with the backslashes in the filepaths.
– Ruan
Nov 20 at 1:06
Fixed it by changing tob = old_path.str.contains(i['excel_filename'], regex=False, case=False)
. This still does not match though. All iterations returnFalse
if you do aprint(b)
immediately afterwards.
– Ruan
Nov 20 at 1:16
It now successfully matches after adding the following:file_path = os.path.dirname(os.path.abspath(i['excel_filename']))
and then changed the next line to readb = old_path.str.contains(file_path, case=False, regex=False).any()
. Now getting a new errorindices_of_matches = b[b].index.values
gives error:AttributeError: 'numpy.ndarray' object has no attribute 'index'
– Ruan
Nov 20 at 2:11
That's because the call to.any()
returns a numpy array rather than a series. What happens if you print b?
– killian95
Nov 20 at 2:49
Removing the call to.any()
works. Printing b gives0 False 1 False 2 True Name: old_path, dtype: bool
, which is great. Last line:print(new_path_matches[0])
givesKeyError: 0
though.
– Ruan
Nov 20 at 3:13
Thanks for the fast reply. When running the above, slighty modified, I get an error:
raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2
on line: b = old_path.str.contains(i['excel_filename'])
. Thinking it has to do with the backslashes in the filepaths.– Ruan
Nov 20 at 1:06
Thanks for the fast reply. When running the above, slighty modified, I get an error:
raise source.error("bad escape %s" % escape, len(escape)) sre_constants.error: bad escape T at position 2
on line: b = old_path.str.contains(i['excel_filename'])
. Thinking it has to do with the backslashes in the filepaths.– Ruan
Nov 20 at 1:06
Fixed it by changing to
b = old_path.str.contains(i['excel_filename'], regex=False, case=False)
. This still does not match though. All iterations return False
if you do a print(b)
immediately afterwards.– Ruan
Nov 20 at 1:16
Fixed it by changing to
b = old_path.str.contains(i['excel_filename'], regex=False, case=False)
. This still does not match though. All iterations return False
if you do a print(b)
immediately afterwards.– Ruan
Nov 20 at 1:16
It now successfully matches after adding the following:
file_path = os.path.dirname(os.path.abspath(i['excel_filename']))
and then changed the next line to read b = old_path.str.contains(file_path, case=False, regex=False).any()
. Now getting a new error indices_of_matches = b[b].index.values
gives error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
– Ruan
Nov 20 at 2:11
It now successfully matches after adding the following:
file_path = os.path.dirname(os.path.abspath(i['excel_filename']))
and then changed the next line to read b = old_path.str.contains(file_path, case=False, regex=False).any()
. Now getting a new error indices_of_matches = b[b].index.values
gives error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
– Ruan
Nov 20 at 2:11
That's because the call to
.any()
returns a numpy array rather than a series. What happens if you print b?– killian95
Nov 20 at 2:49
That's because the call to
.any()
returns a numpy array rather than a series. What happens if you print b?– killian95
Nov 20 at 2:49
Removing the call to
.any()
works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool
, which is great. Last line: print(new_path_matches[0])
gives KeyError: 0
though.– Ruan
Nov 20 at 3:13
Removing the call to
.any()
works. Printing b gives 0 False 1 False 2 True Name: old_path, dtype: bool
, which is great. Last line: print(new_path_matches[0])
gives KeyError: 0
though.– Ruan
Nov 20 at 3:13
|
show 3 more comments
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53384525%2fpython-replace-part-of-file-path-using-pandas-match%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown