Pandas DataFrame: copy the contents of a column if it is empty

I have the following DataFrame with named columns and index:

  'a'     'a*'    'b'    'b*'

1  5      NaN     9      NaN

2  NaN    3       3      NaN

3  4      NaN     1      NaN

4  NaN    9       NaN    7

The data source has caused some column headings to be copied slightly differently. For example, as above, some column headings are a string and some are the same string with an additional '*' character.

I want to copy any values (which are not null) from a* and b* columns to a and b, respectively.

Is there an efficient way to do such an operation?

edited Nov 20 '18 at 9:30

Sociopath

3,46781635

asked Nov 20 '18 at 9:16

hamslice

133

add a comment |

I have the following DataFrame with named columns and index:

  'a'     'a*'    'b'    'b*'

1  5      NaN     9      NaN

2  NaN    3       3      NaN

3  4      NaN     1      NaN

4  NaN    9       NaN    7

I want to copy any values (which are not null) from a* and b* columns to a and b, respectively.

Is there an efficient way to do such an operation?

edited Nov 20 '18 at 9:30

Sociopath

3,46781635

asked Nov 20 '18 at 9:16

hamslice

133

add a comment |

I have the following DataFrame with named columns and index:

  'a'     'a*'    'b'    'b*'

1  5      NaN     9      NaN

2  NaN    3       3      NaN

3  4      NaN     1      NaN

4  NaN    9       NaN    7

I want to copy any values (which are not null) from a* and b* columns to a and b, respectively.

Is there an efficient way to do such an operation?

edited Nov 20 '18 at 9:30

Sociopath

3,46781635

asked Nov 20 '18 at 9:16

hamslice

133

I have the following DataFrame with named columns and index:

  'a'     'a*'    'b'    'b*'

1  5      NaN     9      NaN

2  NaN    3       3      NaN

3  4      NaN     1      NaN

4  NaN    9       NaN    7

I want to copy any values (which are not null) from a* and b* columns to a and b, respectively.

Is there an efficient way to do such an operation?

python pandas dataframe

edited Nov 20 '18 at 9:30

Sociopath

3,46781635

asked Nov 20 '18 at 9:16

hamslice

133

edited Nov 20 '18 at 9:30

Sociopath

3,46781635

asked Nov 20 '18 at 9:16

hamslice

133

edited Nov 20 '18 at 9:30

Sociopath

3,46781635

edited Nov 20 '18 at 9:30

Sociopath

3,46781635

edited Nov 20 '18 at 9:30

Sociopath

3,46781635

asked Nov 20 '18 at 9:16

hamslice

133

asked Nov 20 '18 at 9:16

hamslice

133

asked Nov 20 '18 at 9:16

hamslice

133

add a comment |

3 Answers
3

active

oldest

votes

Use np.where

df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])

df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])

Output:

     a  a*  b   b*

0   5.0 NaN 9.0 NaN

1   3.0 3.0 3.0 NaN

2   4.0 NaN 1.0 NaN

3   9.0 9.0 7.0 7.0

answered Nov 20 '18 at 9:21

Sociopath

3,46781635

add a comment |

Using fillna() is a lot slower than np.where but has the advantage of being pandas only. If you want a faster method and keep it pandas pure, you can use combine_first() which according to the documentation is used to:

Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes

Translation: this is a method designed to do exactly what is asked in the question.

How do I use it?

df['a'].combine_first(df['a*'])

Performance:

df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})



def using_fillna(df):

    return df['A'].fillna(df['A*'])



def using_combine_first(df):

    return df['A'].combine_first(df['A*'])



def using_np_where(df):

    return np.where(df['A'].isnull(), df['A*'], df['A'])



def using_np_where_numpy(df):

    return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)



%timeit -n 100 using_fillna(df)

%timeit -n 100 using_combine_first(df)

%timeit -n 100 using_np_where(df)

%timeit -n 100 using_np_where_numpy(df)



1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

edited Nov 20 '18 at 10:12

answered Nov 20 '18 at 9:59

user3471881

1,0842619

add a comment |

For better performance is possible use numpy.isnan and convert Series to numpy arrays by values:

df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)

df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)

Another general solution if exist only pairs with/without * in columns of DataFrame and is necessary remove * columns:

First create MultiIndex by split with append *val:

df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)

And then select by DataFrame.xs for DataFrames, so DataFrame.fillna working very nice:

df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))

print (df)

     a    b

1  5.0  9.0

2  3.0  3.0

3  4.0  1.0

4  9.0  7.0

Performance: (depends of number of missing values and length of DataFrame)

df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000, 

                   'A*': [4, 4, 5, 6, 7, 8] * 10000})



def using_fillna(df):

    df['A'] = df['A'].fillna(df['A*'])

    return df



def using_np_where(df):

    df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])

    return df



def using_np_where_numpy(df):

    df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)

    return df



def using_combine_first(df):

    df['D'] = df['A'].combine_first(df['A*'])

    return df



%timeit -n 100 using_fillna(df)

%timeit -n 100 using_np_where(df)

%timeit -n 100 using_combine_first(df)

%timeit -n 100 using_np_where_numpy(df)



1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

edited Nov 20 '18 at 10:23

answered Nov 20 '18 at 9:21

jezrael

320k22259338

oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
– hamslice
Nov 20 '18 at 9:53

@hamslice - it is python 3.6 syntax, for lowwer versions need fillna(df['{}*'.format(c)])
– jezrael
Nov 20 '18 at 9:54

@hamslice - so not working for you?
– jezrael
Nov 20 '18 at 9:56

I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't use fillna".
– user3471881
Nov 20 '18 at 10:10

@user3471881 - removed :)
– jezrael
Nov 20 '18 at 10:11

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53389699%2fpandas-dataframe-copy-the-contents-of-a-column-if-it-is-empty%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

Use np.where

df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])

df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])

Output:

     a  a*  b   b*

0   5.0 NaN 9.0 NaN

1   3.0 3.0 3.0 NaN

2   4.0 NaN 1.0 NaN

3   9.0 9.0 7.0 7.0

answered Nov 20 '18 at 9:21

Sociopath

3,46781635

add a comment |

Use np.where

df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])

df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])

Output:

     a  a*  b   b*

0   5.0 NaN 9.0 NaN

1   3.0 3.0 3.0 NaN

2   4.0 NaN 1.0 NaN

3   9.0 9.0 7.0 7.0

answered Nov 20 '18 at 9:21

Sociopath

3,46781635

add a comment |

Use np.where

df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])

df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])

Output:

     a  a*  b   b*

0   5.0 NaN 9.0 NaN

1   3.0 3.0 3.0 NaN

2   4.0 NaN 1.0 NaN

3   9.0 9.0 7.0 7.0

answered Nov 20 '18 at 9:21

Sociopath

3,46781635

Use np.where

df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])

df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])

Output:

     a  a*  b   b*

0   5.0 NaN 9.0 NaN

1   3.0 3.0 3.0 NaN

2   4.0 NaN 1.0 NaN

3   9.0 9.0 7.0 7.0

answered Nov 20 '18 at 9:21

Sociopath

3,46781635

answered Nov 20 '18 at 9:21

Sociopath

3,46781635

answered Nov 20 '18 at 9:21

Sociopath

3,46781635

answered Nov 20 '18 at 9:21

Sociopath

3,46781635

add a comment |

Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes

Translation: this is a method designed to do exactly what is asked in the question.

How do I use it?

df['a'].combine_first(df['a*'])

Performance:

df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})



def using_fillna(df):

    return df['A'].fillna(df['A*'])



def using_combine_first(df):

    return df['A'].combine_first(df['A*'])



def using_np_where(df):

    return np.where(df['A'].isnull(), df['A*'], df['A'])



def using_np_where_numpy(df):

    return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)



%timeit -n 100 using_fillna(df)

%timeit -n 100 using_combine_first(df)

%timeit -n 100 using_np_where(df)

%timeit -n 100 using_np_where_numpy(df)



1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

edited Nov 20 '18 at 10:12

answered Nov 20 '18 at 9:59

user3471881

1,0842619

add a comment |

Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes

Translation: this is a method designed to do exactly what is asked in the question.

How do I use it?

df['a'].combine_first(df['a*'])

Performance:

df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})



def using_fillna(df):

    return df['A'].fillna(df['A*'])



def using_combine_first(df):

    return df['A'].combine_first(df['A*'])



def using_np_where(df):

    return np.where(df['A'].isnull(), df['A*'], df['A'])



def using_np_where_numpy(df):

    return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)



%timeit -n 100 using_fillna(df)

%timeit -n 100 using_combine_first(df)

%timeit -n 100 using_np_where(df)

%timeit -n 100 using_np_where_numpy(df)



1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

edited Nov 20 '18 at 10:12

answered Nov 20 '18 at 9:59

user3471881

1,0842619

add a comment |

Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes

Translation: this is a method designed to do exactly what is asked in the question.

How do I use it?

df['a'].combine_first(df['a*'])

Performance:

df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})



def using_fillna(df):

    return df['A'].fillna(df['A*'])



def using_combine_first(df):

    return df['A'].combine_first(df['A*'])



def using_np_where(df):

    return np.where(df['A'].isnull(), df['A*'], df['A'])



def using_np_where_numpy(df):

    return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)



%timeit -n 100 using_fillna(df)

%timeit -n 100 using_combine_first(df)

%timeit -n 100 using_np_where(df)

%timeit -n 100 using_np_where_numpy(df)



1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

edited Nov 20 '18 at 10:12

answered Nov 20 '18 at 9:59

user3471881

1,0842619

Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes

Translation: this is a method designed to do exactly what is asked in the question.

How do I use it?

df['a'].combine_first(df['a*'])

Performance:

df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})



def using_fillna(df):

    return df['A'].fillna(df['A*'])



def using_combine_first(df):

    return df['A'].combine_first(df['A*'])



def using_np_where(df):

    return np.where(df['A'].isnull(), df['A*'], df['A'])



def using_np_where_numpy(df):

    return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)



%timeit -n 100 using_fillna(df)

%timeit -n 100 using_combine_first(df)

%timeit -n 100 using_np_where(df)

%timeit -n 100 using_np_where_numpy(df)



1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

edited Nov 20 '18 at 10:12

answered Nov 20 '18 at 9:59

user3471881

1,0842619

edited Nov 20 '18 at 10:12

answered Nov 20 '18 at 9:59

user3471881

1,0842619

answered Nov 20 '18 at 9:59

user3471881

1,0842619

answered Nov 20 '18 at 9:59

user3471881

1,0842619

add a comment |

For better performance is possible use numpy.isnan and convert Series to numpy arrays by values:

df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)

df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)

Another general solution if exist only pairs with/without * in columns of DataFrame and is necessary remove * columns:

First create MultiIndex by split with append *val:

df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)

And then select by DataFrame.xs for DataFrames, so DataFrame.fillna working very nice:

df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))

print (df)

     a    b

1  5.0  9.0

2  3.0  3.0

3  4.0  1.0

4  9.0  7.0

Performance: (depends of number of missing values and length of DataFrame)

df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000, 

                   'A*': [4, 4, 5, 6, 7, 8] * 10000})



def using_fillna(df):

    df['A'] = df['A'].fillna(df['A*'])

    return df



def using_np_where(df):

    df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])

    return df



def using_np_where_numpy(df):

    df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)

    return df



def using_combine_first(df):

    df['D'] = df['A'].combine_first(df['A*'])

    return df



%timeit -n 100 using_fillna(df)

%timeit -n 100 using_np_where(df)

%timeit -n 100 using_combine_first(df)

%timeit -n 100 using_np_where_numpy(df)



1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

edited Nov 20 '18 at 10:23

answered Nov 20 '18 at 9:21

jezrael

320k22259338

oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
– hamslice
Nov 20 '18 at 9:53

@hamslice - it is python 3.6 syntax, for lowwer versions need fillna(df['{}*'.format(c)])
– jezrael
Nov 20 '18 at 9:54

@hamslice - so not working for you?
– jezrael
Nov 20 '18 at 9:56

I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't use fillna".
– user3471881
Nov 20 '18 at 10:10

@user3471881 - removed :)
– jezrael
Nov 20 '18 at 10:11

add a comment |

For better performance is possible use numpy.isnan and convert Series to numpy arrays by values:

df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)

df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)

Another general solution if exist only pairs with/without * in columns of DataFrame and is necessary remove * columns:

First create MultiIndex by split with append *val:

df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)

And then select by DataFrame.xs for DataFrames, so DataFrame.fillna working very nice:

df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))

print (df)

     a    b

1  5.0  9.0

2  3.0  3.0

3  4.0  1.0

4  9.0  7.0

Performance: (depends of number of missing values and length of DataFrame)

df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000, 

                   'A*': [4, 4, 5, 6, 7, 8] * 10000})



def using_fillna(df):

    df['A'] = df['A'].fillna(df['A*'])

    return df



def using_np_where(df):

    df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])

    return df



def using_np_where_numpy(df):

    df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)

    return df



def using_combine_first(df):

    df['D'] = df['A'].combine_first(df['A*'])

    return df



%timeit -n 100 using_fillna(df)

%timeit -n 100 using_np_where(df)

%timeit -n 100 using_combine_first(df)

%timeit -n 100 using_np_where_numpy(df)



1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

edited Nov 20 '18 at 10:23

answered Nov 20 '18 at 9:21

jezrael

320k22259338

oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
– hamslice
Nov 20 '18 at 9:53

@hamslice - it is python 3.6 syntax, for lowwer versions need fillna(df['{}*'.format(c)])
– jezrael
Nov 20 '18 at 9:54

@hamslice - so not working for you?
– jezrael
Nov 20 '18 at 9:56

I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't use fillna".
– user3471881
Nov 20 '18 at 10:10

@user3471881 - removed :)
– jezrael
Nov 20 '18 at 10:11

add a comment |

For better performance is possible use numpy.isnan and convert Series to numpy arrays by values:

df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)

df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)

Another general solution if exist only pairs with/without * in columns of DataFrame and is necessary remove * columns:

First create MultiIndex by split with append *val:

df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)

And then select by DataFrame.xs for DataFrames, so DataFrame.fillna working very nice:

df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))

print (df)

     a    b

1  5.0  9.0

2  3.0  3.0

3  4.0  1.0

4  9.0  7.0

Performance: (depends of number of missing values and length of DataFrame)

df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000, 

                   'A*': [4, 4, 5, 6, 7, 8] * 10000})



def using_fillna(df):

    df['A'] = df['A'].fillna(df['A*'])

    return df



def using_np_where(df):

    df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])

    return df



def using_np_where_numpy(df):

    df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)

    return df



def using_combine_first(df):

    df['D'] = df['A'].combine_first(df['A*'])

    return df



%timeit -n 100 using_fillna(df)

%timeit -n 100 using_np_where(df)

%timeit -n 100 using_combine_first(df)

%timeit -n 100 using_np_where_numpy(df)



1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

edited Nov 20 '18 at 10:23

answered Nov 20 '18 at 9:21

jezrael

320k22259338

For better performance is possible use numpy.isnan and convert Series to numpy arrays by values:

df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)

df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)

Another general solution if exist only pairs with/without * in columns of DataFrame and is necessary remove * columns:

First create MultiIndex by split with append *val:

df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)

And then select by DataFrame.xs for DataFrames, so DataFrame.fillna working very nice:

df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))

print (df)

     a    b

1  5.0  9.0

2  3.0  3.0

3  4.0  1.0

4  9.0  7.0

Performance: (depends of number of missing values and length of DataFrame)

df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000, 

                   'A*': [4, 4, 5, 6, 7, 8] * 10000})



def using_fillna(df):

    df['A'] = df['A'].fillna(df['A*'])

    return df



def using_np_where(df):

    df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])

    return df



def using_np_where_numpy(df):

    df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)

    return df



def using_combine_first(df):

    df['D'] = df['A'].combine_first(df['A*'])

    return df



%timeit -n 100 using_fillna(df)

%timeit -n 100 using_np_where(df)

%timeit -n 100 using_combine_first(df)

%timeit -n 100 using_np_where_numpy(df)



1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

edited Nov 20 '18 at 10:23

answered Nov 20 '18 at 9:21

jezrael

320k22259338

edited Nov 20 '18 at 10:23

answered Nov 20 '18 at 9:21

jezrael

320k22259338

answered Nov 20 '18 at 9:21

jezrael

320k22259338

answered Nov 20 '18 at 9:21

jezrael

320k22259338

oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
– hamslice
Nov 20 '18 at 9:53

@hamslice - it is python 3.6 syntax, for lowwer versions need fillna(df['{}*'.format(c)])
– jezrael
Nov 20 '18 at 9:54

@hamslice - so not working for you?
– jezrael
Nov 20 '18 at 9:56

I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't use fillna".
– user3471881
Nov 20 '18 at 10:10

@user3471881 - removed :)
– jezrael
Nov 20 '18 at 10:11

add a comment |

oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
– hamslice
Nov 20 '18 at 9:53

@hamslice - it is python 3.6 syntax, for lowwer versions need fillna(df['{}*'.format(c)])
– jezrael
Nov 20 '18 at 9:54

@hamslice - so not working for you?
– jezrael
Nov 20 '18 at 9:56

I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't use fillna".
– user3471881
Nov 20 '18 at 10:10

@user3471881 - removed :)
– jezrael
Nov 20 '18 at 10:11

oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
– hamslice
Nov 20 '18 at 9:53

@hamslice - it is python 3.6 syntax, for lowwer versions need fillna(df['{}*'.format(c)])
– jezrael
Nov 20 '18 at 9:54

@hamslice - so not working for you?
– jezrael
Nov 20 '18 at 9:56

I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't use fillna".
– user3471881
Nov 20 '18 at 10:10

@user3471881 - removed :)
– jezrael
Nov 20 '18 at 10:11

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Argthtjtr