Pandas DataFrame: copy the contents of a column if it is empty
I have the following DataFrame with named columns and index:
'a' 'a*' 'b' 'b*'
1 5 NaN 9 NaN
2 NaN 3 3 NaN
3 4 NaN 1 NaN
4 NaN 9 NaN 7
The data source has caused some column headings to be copied slightly differently. For example, as above, some column headings are a string and some are the same string with an additional '*' character.
I want to copy any values (which are not null) from a*
and b*
columns to a
and b
, respectively.
Is there an efficient way to do such an operation?
python pandas dataframe
add a comment |
I have the following DataFrame with named columns and index:
'a' 'a*' 'b' 'b*'
1 5 NaN 9 NaN
2 NaN 3 3 NaN
3 4 NaN 1 NaN
4 NaN 9 NaN 7
The data source has caused some column headings to be copied slightly differently. For example, as above, some column headings are a string and some are the same string with an additional '*' character.
I want to copy any values (which are not null) from a*
and b*
columns to a
and b
, respectively.
Is there an efficient way to do such an operation?
python pandas dataframe
add a comment |
I have the following DataFrame with named columns and index:
'a' 'a*' 'b' 'b*'
1 5 NaN 9 NaN
2 NaN 3 3 NaN
3 4 NaN 1 NaN
4 NaN 9 NaN 7
The data source has caused some column headings to be copied slightly differently. For example, as above, some column headings are a string and some are the same string with an additional '*' character.
I want to copy any values (which are not null) from a*
and b*
columns to a
and b
, respectively.
Is there an efficient way to do such an operation?
python pandas dataframe
I have the following DataFrame with named columns and index:
'a' 'a*' 'b' 'b*'
1 5 NaN 9 NaN
2 NaN 3 3 NaN
3 4 NaN 1 NaN
4 NaN 9 NaN 7
The data source has caused some column headings to be copied slightly differently. For example, as above, some column headings are a string and some are the same string with an additional '*' character.
I want to copy any values (which are not null) from a*
and b*
columns to a
and b
, respectively.
Is there an efficient way to do such an operation?
python pandas dataframe
python pandas dataframe
edited Nov 20 '18 at 9:30
Sociopath
3,46781635
3,46781635
asked Nov 20 '18 at 9:16
hamslice
133
133
add a comment |
add a comment |
3 Answers
3
active
oldest
votes
Use np.where
df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])
df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])
Output:
a a* b b*
0 5.0 NaN 9.0 NaN
1 3.0 3.0 3.0 NaN
2 4.0 NaN 1.0 NaN
3 9.0 9.0 7.0 7.0
add a comment |
Using fillna()
is a lot slower than np.where
but has the advantage of being pandas
only. If you want a faster method and keep it pandas
pure, you can use combine_first()
which according to the documentation is used to:
Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes
Translation: this is a method designed to do exactly what is asked in the question.
How do I use it?
df['a'].combine_first(df['a*'])
Performance:
df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
return df['A'].fillna(df['A*'])
def using_combine_first(df):
return df['A'].combine_first(df['A*'])
def using_np_where(df):
return np.where(df['A'].isnull(), df['A*'], df['A'])
def using_np_where_numpy(df):
return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_np_where_numpy(df)
1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
add a comment |
For better performance is possible use numpy.isnan
and convert Series to numpy arrays by values
:
df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)
df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)
Another general solution if exist only pairs with/without *
in columns of DataFrame and is necessary remove *
columns:
First create MultiIndex
by split
with append *val
:
df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)
And then select by DataFrame.xs
for DataFrames, so DataFrame.fillna
working very nice:
df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))
print (df)
a b
1 5.0 9.0
2 3.0 3.0
3 4.0 1.0
4 9.0 7.0
Performance: (depends of number of missing values and length of DataFrame)
df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000,
'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
df['A'] = df['A'].fillna(df['A*'])
return df
def using_np_where(df):
df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])
return df
def using_np_where_numpy(df):
df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
return df
def using_combine_first(df):
df['D'] = df['A'].combine_first(df['A*'])
return df
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where_numpy(df)
1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
– hamslice
Nov 20 '18 at 9:53
@hamslice - it is python3.6
syntax, for lowwer versions needfillna(df['{}*'.format(c)])
– jezrael
Nov 20 '18 at 9:54
@hamslice - so not working for you?
– jezrael
Nov 20 '18 at 9:56
I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't usefillna
".
– user3471881
Nov 20 '18 at 10:10
@user3471881 - removed :)
– jezrael
Nov 20 '18 at 10:11
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53389699%2fpandas-dataframe-copy-the-contents-of-a-column-if-it-is-empty%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
Use np.where
df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])
df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])
Output:
a a* b b*
0 5.0 NaN 9.0 NaN
1 3.0 3.0 3.0 NaN
2 4.0 NaN 1.0 NaN
3 9.0 9.0 7.0 7.0
add a comment |
Use np.where
df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])
df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])
Output:
a a* b b*
0 5.0 NaN 9.0 NaN
1 3.0 3.0 3.0 NaN
2 4.0 NaN 1.0 NaN
3 9.0 9.0 7.0 7.0
add a comment |
Use np.where
df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])
df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])
Output:
a a* b b*
0 5.0 NaN 9.0 NaN
1 3.0 3.0 3.0 NaN
2 4.0 NaN 1.0 NaN
3 9.0 9.0 7.0 7.0
Use np.where
df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])
df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])
Output:
a a* b b*
0 5.0 NaN 9.0 NaN
1 3.0 3.0 3.0 NaN
2 4.0 NaN 1.0 NaN
3 9.0 9.0 7.0 7.0
answered Nov 20 '18 at 9:21
Sociopath
3,46781635
3,46781635
add a comment |
add a comment |
Using fillna()
is a lot slower than np.where
but has the advantage of being pandas
only. If you want a faster method and keep it pandas
pure, you can use combine_first()
which according to the documentation is used to:
Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes
Translation: this is a method designed to do exactly what is asked in the question.
How do I use it?
df['a'].combine_first(df['a*'])
Performance:
df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
return df['A'].fillna(df['A*'])
def using_combine_first(df):
return df['A'].combine_first(df['A*'])
def using_np_where(df):
return np.where(df['A'].isnull(), df['A*'], df['A'])
def using_np_where_numpy(df):
return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_np_where_numpy(df)
1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
add a comment |
Using fillna()
is a lot slower than np.where
but has the advantage of being pandas
only. If you want a faster method and keep it pandas
pure, you can use combine_first()
which according to the documentation is used to:
Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes
Translation: this is a method designed to do exactly what is asked in the question.
How do I use it?
df['a'].combine_first(df['a*'])
Performance:
df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
return df['A'].fillna(df['A*'])
def using_combine_first(df):
return df['A'].combine_first(df['A*'])
def using_np_where(df):
return np.where(df['A'].isnull(), df['A*'], df['A'])
def using_np_where_numpy(df):
return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_np_where_numpy(df)
1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
add a comment |
Using fillna()
is a lot slower than np.where
but has the advantage of being pandas
only. If you want a faster method and keep it pandas
pure, you can use combine_first()
which according to the documentation is used to:
Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes
Translation: this is a method designed to do exactly what is asked in the question.
How do I use it?
df['a'].combine_first(df['a*'])
Performance:
df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
return df['A'].fillna(df['A*'])
def using_combine_first(df):
return df['A'].combine_first(df['A*'])
def using_np_where(df):
return np.where(df['A'].isnull(), df['A*'], df['A'])
def using_np_where_numpy(df):
return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_np_where_numpy(df)
1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using fillna()
is a lot slower than np.where
but has the advantage of being pandas
only. If you want a faster method and keep it pandas
pure, you can use combine_first()
which according to the documentation is used to:
Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes
Translation: this is a method designed to do exactly what is asked in the question.
How do I use it?
df['a'].combine_first(df['a*'])
Performance:
df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
return df['A'].fillna(df['A*'])
def using_combine_first(df):
return df['A'].combine_first(df['A*'])
def using_np_where(df):
return np.where(df['A'].isnull(), df['A*'], df['A'])
def using_np_where_numpy(df):
return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_np_where_numpy(df)
1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
edited Nov 20 '18 at 10:12
answered Nov 20 '18 at 9:59
user3471881
1,0842619
1,0842619
add a comment |
add a comment |
For better performance is possible use numpy.isnan
and convert Series to numpy arrays by values
:
df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)
df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)
Another general solution if exist only pairs with/without *
in columns of DataFrame and is necessary remove *
columns:
First create MultiIndex
by split
with append *val
:
df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)
And then select by DataFrame.xs
for DataFrames, so DataFrame.fillna
working very nice:
df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))
print (df)
a b
1 5.0 9.0
2 3.0 3.0
3 4.0 1.0
4 9.0 7.0
Performance: (depends of number of missing values and length of DataFrame)
df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000,
'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
df['A'] = df['A'].fillna(df['A*'])
return df
def using_np_where(df):
df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])
return df
def using_np_where_numpy(df):
df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
return df
def using_combine_first(df):
df['D'] = df['A'].combine_first(df['A*'])
return df
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where_numpy(df)
1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
– hamslice
Nov 20 '18 at 9:53
@hamslice - it is python3.6
syntax, for lowwer versions needfillna(df['{}*'.format(c)])
– jezrael
Nov 20 '18 at 9:54
@hamslice - so not working for you?
– jezrael
Nov 20 '18 at 9:56
I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't usefillna
".
– user3471881
Nov 20 '18 at 10:10
@user3471881 - removed :)
– jezrael
Nov 20 '18 at 10:11
add a comment |
For better performance is possible use numpy.isnan
and convert Series to numpy arrays by values
:
df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)
df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)
Another general solution if exist only pairs with/without *
in columns of DataFrame and is necessary remove *
columns:
First create MultiIndex
by split
with append *val
:
df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)
And then select by DataFrame.xs
for DataFrames, so DataFrame.fillna
working very nice:
df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))
print (df)
a b
1 5.0 9.0
2 3.0 3.0
3 4.0 1.0
4 9.0 7.0
Performance: (depends of number of missing values and length of DataFrame)
df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000,
'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
df['A'] = df['A'].fillna(df['A*'])
return df
def using_np_where(df):
df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])
return df
def using_np_where_numpy(df):
df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
return df
def using_combine_first(df):
df['D'] = df['A'].combine_first(df['A*'])
return df
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where_numpy(df)
1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
– hamslice
Nov 20 '18 at 9:53
@hamslice - it is python3.6
syntax, for lowwer versions needfillna(df['{}*'.format(c)])
– jezrael
Nov 20 '18 at 9:54
@hamslice - so not working for you?
– jezrael
Nov 20 '18 at 9:56
I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't usefillna
".
– user3471881
Nov 20 '18 at 10:10
@user3471881 - removed :)
– jezrael
Nov 20 '18 at 10:11
add a comment |
For better performance is possible use numpy.isnan
and convert Series to numpy arrays by values
:
df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)
df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)
Another general solution if exist only pairs with/without *
in columns of DataFrame and is necessary remove *
columns:
First create MultiIndex
by split
with append *val
:
df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)
And then select by DataFrame.xs
for DataFrames, so DataFrame.fillna
working very nice:
df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))
print (df)
a b
1 5.0 9.0
2 3.0 3.0
3 4.0 1.0
4 9.0 7.0
Performance: (depends of number of missing values and length of DataFrame)
df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000,
'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
df['A'] = df['A'].fillna(df['A*'])
return df
def using_np_where(df):
df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])
return df
def using_np_where_numpy(df):
df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
return df
def using_combine_first(df):
df['D'] = df['A'].combine_first(df['A*'])
return df
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where_numpy(df)
1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For better performance is possible use numpy.isnan
and convert Series to numpy arrays by values
:
df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)
df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)
Another general solution if exist only pairs with/without *
in columns of DataFrame and is necessary remove *
columns:
First create MultiIndex
by split
with append *val
:
df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)
And then select by DataFrame.xs
for DataFrames, so DataFrame.fillna
working very nice:
df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))
print (df)
a b
1 5.0 9.0
2 3.0 3.0
3 4.0 1.0
4 9.0 7.0
Performance: (depends of number of missing values and length of DataFrame)
df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000,
'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
df['A'] = df['A'].fillna(df['A*'])
return df
def using_np_where(df):
df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])
return df
def using_np_where_numpy(df):
df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
return df
def using_combine_first(df):
df['D'] = df['A'].combine_first(df['A*'])
return df
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where_numpy(df)
1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
edited Nov 20 '18 at 10:23
answered Nov 20 '18 at 9:21
jezrael
320k22259338
320k22259338
oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
– hamslice
Nov 20 '18 at 9:53
@hamslice - it is python3.6
syntax, for lowwer versions needfillna(df['{}*'.format(c)])
– jezrael
Nov 20 '18 at 9:54
@hamslice - so not working for you?
– jezrael
Nov 20 '18 at 9:56
I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't usefillna
".
– user3471881
Nov 20 '18 at 10:10
@user3471881 - removed :)
– jezrael
Nov 20 '18 at 10:11
add a comment |
oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
– hamslice
Nov 20 '18 at 9:53
@hamslice - it is python3.6
syntax, for lowwer versions needfillna(df['{}*'.format(c)])
– jezrael
Nov 20 '18 at 9:54
@hamslice - so not working for you?
– jezrael
Nov 20 '18 at 9:56
I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't usefillna
".
– user3471881
Nov 20 '18 at 10:10
@user3471881 - removed :)
– jezrael
Nov 20 '18 at 10:11
oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
– hamslice
Nov 20 '18 at 9:53
oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
– hamslice
Nov 20 '18 at 9:53
@hamslice - it is python
3.6
syntax, for lowwer versions need fillna(df['{}*'.format(c)])
– jezrael
Nov 20 '18 at 9:54
@hamslice - it is python
3.6
syntax, for lowwer versions need fillna(df['{}*'.format(c)])
– jezrael
Nov 20 '18 at 9:54
@hamslice - so not working for you?
– jezrael
Nov 20 '18 at 9:56
@hamslice - so not working for you?
– jezrael
Nov 20 '18 at 9:56
I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't use
fillna
".– user3471881
Nov 20 '18 at 10:10
I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't use
fillna
".– user3471881
Nov 20 '18 at 10:10
@user3471881 - removed :)
– jezrael
Nov 20 '18 at 10:11
@user3471881 - removed :)
– jezrael
Nov 20 '18 at 10:11
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53389699%2fpandas-dataframe-copy-the-contents-of-a-column-if-it-is-empty%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown