Pandas DataFrame: copy the contents of a column if it is empty












2














I have the following DataFrame with named columns and index:



  'a'     'a*'    'b'    'b*'
1 5 NaN 9 NaN
2 NaN 3 3 NaN
3 4 NaN 1 NaN
4 NaN 9 NaN 7


The data source has caused some column headings to be copied slightly differently. For example, as above, some column headings are a string and some are the same string with an additional '*' character.



I want to copy any values (which are not null) from a* and b* columns to a and b, respectively.



Is there an efficient way to do such an operation?










share|improve this question





























    2














    I have the following DataFrame with named columns and index:



      'a'     'a*'    'b'    'b*'
    1 5 NaN 9 NaN
    2 NaN 3 3 NaN
    3 4 NaN 1 NaN
    4 NaN 9 NaN 7


    The data source has caused some column headings to be copied slightly differently. For example, as above, some column headings are a string and some are the same string with an additional '*' character.



    I want to copy any values (which are not null) from a* and b* columns to a and b, respectively.



    Is there an efficient way to do such an operation?










    share|improve this question



























      2












      2








      2







      I have the following DataFrame with named columns and index:



        'a'     'a*'    'b'    'b*'
      1 5 NaN 9 NaN
      2 NaN 3 3 NaN
      3 4 NaN 1 NaN
      4 NaN 9 NaN 7


      The data source has caused some column headings to be copied slightly differently. For example, as above, some column headings are a string and some are the same string with an additional '*' character.



      I want to copy any values (which are not null) from a* and b* columns to a and b, respectively.



      Is there an efficient way to do such an operation?










      share|improve this question















      I have the following DataFrame with named columns and index:



        'a'     'a*'    'b'    'b*'
      1 5 NaN 9 NaN
      2 NaN 3 3 NaN
      3 4 NaN 1 NaN
      4 NaN 9 NaN 7


      The data source has caused some column headings to be copied slightly differently. For example, as above, some column headings are a string and some are the same string with an additional '*' character.



      I want to copy any values (which are not null) from a* and b* columns to a and b, respectively.



      Is there an efficient way to do such an operation?







      python pandas dataframe






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 20 '18 at 9:30









      Sociopath

      3,46781635




      3,46781635










      asked Nov 20 '18 at 9:16









      hamslice

      133




      133
























          3 Answers
          3






          active

          oldest

          votes


















          4














          Use np.where



          df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])
          df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])


          Output:



               a  a*  b   b*
          0 5.0 NaN 9.0 NaN
          1 3.0 3.0 3.0 NaN
          2 4.0 NaN 1.0 NaN
          3 9.0 9.0 7.0 7.0





          share|improve this answer





























            3














            Using fillna() is a lot slower than np.where but has the advantage of being pandas only. If you want a faster method and keep it pandas pure, you can use combine_first() which according to the documentation is used to:




            Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes




            Translation: this is a method designed to do exactly what is asked in the question.



            How do I use it?



            df['a'].combine_first(df['a*'])


            Performance:



            df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})

            def using_fillna(df):
            return df['A'].fillna(df['A*'])

            def using_combine_first(df):
            return df['A'].combine_first(df['A*'])

            def using_np_where(df):
            return np.where(df['A'].isnull(), df['A*'], df['A'])

            def using_np_where_numpy(df):
            return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)

            %timeit -n 100 using_fillna(df)
            %timeit -n 100 using_combine_first(df)
            %timeit -n 100 using_np_where(df)
            %timeit -n 100 using_np_where_numpy(df)

            1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
            281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
            257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
            166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)





            share|improve this answer































              2














              For better performance is possible use numpy.isnan and convert Series to numpy arrays by values:



              df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)
              df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)


              Another general solution if exist only pairs with/without * in columns of DataFrame and is necessary remove * columns:



              First create MultiIndex by split with append *val:



              df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)


              And then select by DataFrame.xs for DataFrames, so DataFrame.fillna working very nice:



              df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))
              print (df)
              a b
              1 5.0 9.0
              2 3.0 3.0
              3 4.0 1.0
              4 9.0 7.0


              Performance: (depends of number of missing values and length of DataFrame)



              df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000, 
              'A*': [4, 4, 5, 6, 7, 8] * 10000})

              def using_fillna(df):
              df['A'] = df['A'].fillna(df['A*'])
              return df

              def using_np_where(df):
              df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])
              return df

              def using_np_where_numpy(df):
              df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
              return df

              def using_combine_first(df):
              df['D'] = df['A'].combine_first(df['A*'])
              return df

              %timeit -n 100 using_fillna(df)
              %timeit -n 100 using_np_where(df)
              %timeit -n 100 using_combine_first(df)
              %timeit -n 100 using_np_where_numpy(df)

              1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
              533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
              591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
              423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)





              share|improve this answer























              • oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
                – hamslice
                Nov 20 '18 at 9:53










              • @hamslice - it is python 3.6 syntax, for lowwer versions need fillna(df['{}*'.format(c)])
                – jezrael
                Nov 20 '18 at 9:54












              • @hamslice - so not working for you?
                – jezrael
                Nov 20 '18 at 9:56










              • I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't use fillna".
                – user3471881
                Nov 20 '18 at 10:10












              • @user3471881 - removed :)
                – jezrael
                Nov 20 '18 at 10:11











              Your Answer






              StackExchange.ifUsing("editor", function () {
              StackExchange.using("externalEditor", function () {
              StackExchange.using("snippets", function () {
              StackExchange.snippets.init();
              });
              });
              }, "code-snippets");

              StackExchange.ready(function() {
              var channelOptions = {
              tags: "".split(" "),
              id: "1"
              };
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function() {
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled) {
              StackExchange.using("snippets", function() {
              createEditor();
              });
              }
              else {
              createEditor();
              }
              });

              function createEditor() {
              StackExchange.prepareEditor({
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: true,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: 10,
              bindNavPrevention: true,
              postfix: "",
              imageUploader: {
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              },
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              });


              }
              });














              draft saved

              draft discarded


















              StackExchange.ready(
              function () {
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53389699%2fpandas-dataframe-copy-the-contents-of-a-column-if-it-is-empty%23new-answer', 'question_page');
              }
              );

              Post as a guest















              Required, but never shown

























              3 Answers
              3






              active

              oldest

              votes








              3 Answers
              3






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              4














              Use np.where



              df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])
              df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])


              Output:



                   a  a*  b   b*
              0 5.0 NaN 9.0 NaN
              1 3.0 3.0 3.0 NaN
              2 4.0 NaN 1.0 NaN
              3 9.0 9.0 7.0 7.0





              share|improve this answer


























                4














                Use np.where



                df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])
                df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])


                Output:



                     a  a*  b   b*
                0 5.0 NaN 9.0 NaN
                1 3.0 3.0 3.0 NaN
                2 4.0 NaN 1.0 NaN
                3 9.0 9.0 7.0 7.0





                share|improve this answer
























                  4












                  4








                  4






                  Use np.where



                  df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])
                  df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])


                  Output:



                       a  a*  b   b*
                  0 5.0 NaN 9.0 NaN
                  1 3.0 3.0 3.0 NaN
                  2 4.0 NaN 1.0 NaN
                  3 9.0 9.0 7.0 7.0





                  share|improve this answer












                  Use np.where



                  df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])
                  df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])


                  Output:



                       a  a*  b   b*
                  0 5.0 NaN 9.0 NaN
                  1 3.0 3.0 3.0 NaN
                  2 4.0 NaN 1.0 NaN
                  3 9.0 9.0 7.0 7.0






                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Nov 20 '18 at 9:21









                  Sociopath

                  3,46781635




                  3,46781635

























                      3














                      Using fillna() is a lot slower than np.where but has the advantage of being pandas only. If you want a faster method and keep it pandas pure, you can use combine_first() which according to the documentation is used to:




                      Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes




                      Translation: this is a method designed to do exactly what is asked in the question.



                      How do I use it?



                      df['a'].combine_first(df['a*'])


                      Performance:



                      df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})

                      def using_fillna(df):
                      return df['A'].fillna(df['A*'])

                      def using_combine_first(df):
                      return df['A'].combine_first(df['A*'])

                      def using_np_where(df):
                      return np.where(df['A'].isnull(), df['A*'], df['A'])

                      def using_np_where_numpy(df):
                      return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)

                      %timeit -n 100 using_fillna(df)
                      %timeit -n 100 using_combine_first(df)
                      %timeit -n 100 using_np_where(df)
                      %timeit -n 100 using_np_where_numpy(df)

                      1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                      281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                      257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                      166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)





                      share|improve this answer




























                        3














                        Using fillna() is a lot slower than np.where but has the advantage of being pandas only. If you want a faster method and keep it pandas pure, you can use combine_first() which according to the documentation is used to:




                        Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes




                        Translation: this is a method designed to do exactly what is asked in the question.



                        How do I use it?



                        df['a'].combine_first(df['a*'])


                        Performance:



                        df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})

                        def using_fillna(df):
                        return df['A'].fillna(df['A*'])

                        def using_combine_first(df):
                        return df['A'].combine_first(df['A*'])

                        def using_np_where(df):
                        return np.where(df['A'].isnull(), df['A*'], df['A'])

                        def using_np_where_numpy(df):
                        return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)

                        %timeit -n 100 using_fillna(df)
                        %timeit -n 100 using_combine_first(df)
                        %timeit -n 100 using_np_where(df)
                        %timeit -n 100 using_np_where_numpy(df)

                        1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                        281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                        257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                        166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)





                        share|improve this answer


























                          3












                          3








                          3






                          Using fillna() is a lot slower than np.where but has the advantage of being pandas only. If you want a faster method and keep it pandas pure, you can use combine_first() which according to the documentation is used to:




                          Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes




                          Translation: this is a method designed to do exactly what is asked in the question.



                          How do I use it?



                          df['a'].combine_first(df['a*'])


                          Performance:



                          df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})

                          def using_fillna(df):
                          return df['A'].fillna(df['A*'])

                          def using_combine_first(df):
                          return df['A'].combine_first(df['A*'])

                          def using_np_where(df):
                          return np.where(df['A'].isnull(), df['A*'], df['A'])

                          def using_np_where_numpy(df):
                          return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)

                          %timeit -n 100 using_fillna(df)
                          %timeit -n 100 using_combine_first(df)
                          %timeit -n 100 using_np_where(df)
                          %timeit -n 100 using_np_where_numpy(df)

                          1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                          281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                          257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                          166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)





                          share|improve this answer














                          Using fillna() is a lot slower than np.where but has the advantage of being pandas only. If you want a faster method and keep it pandas pure, you can use combine_first() which according to the documentation is used to:




                          Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes




                          Translation: this is a method designed to do exactly what is asked in the question.



                          How do I use it?



                          df['a'].combine_first(df['a*'])


                          Performance:



                          df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})

                          def using_fillna(df):
                          return df['A'].fillna(df['A*'])

                          def using_combine_first(df):
                          return df['A'].combine_first(df['A*'])

                          def using_np_where(df):
                          return np.where(df['A'].isnull(), df['A*'], df['A'])

                          def using_np_where_numpy(df):
                          return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)

                          %timeit -n 100 using_fillna(df)
                          %timeit -n 100 using_combine_first(df)
                          %timeit -n 100 using_np_where(df)
                          %timeit -n 100 using_np_where_numpy(df)

                          1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                          281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                          257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                          166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)






                          share|improve this answer














                          share|improve this answer



                          share|improve this answer








                          edited Nov 20 '18 at 10:12

























                          answered Nov 20 '18 at 9:59









                          user3471881

                          1,0842619




                          1,0842619























                              2














                              For better performance is possible use numpy.isnan and convert Series to numpy arrays by values:



                              df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)
                              df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)


                              Another general solution if exist only pairs with/without * in columns of DataFrame and is necessary remove * columns:



                              First create MultiIndex by split with append *val:



                              df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)


                              And then select by DataFrame.xs for DataFrames, so DataFrame.fillna working very nice:



                              df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))
                              print (df)
                              a b
                              1 5.0 9.0
                              2 3.0 3.0
                              3 4.0 1.0
                              4 9.0 7.0


                              Performance: (depends of number of missing values and length of DataFrame)



                              df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000, 
                              'A*': [4, 4, 5, 6, 7, 8] * 10000})

                              def using_fillna(df):
                              df['A'] = df['A'].fillna(df['A*'])
                              return df

                              def using_np_where(df):
                              df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])
                              return df

                              def using_np_where_numpy(df):
                              df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
                              return df

                              def using_combine_first(df):
                              df['D'] = df['A'].combine_first(df['A*'])
                              return df

                              %timeit -n 100 using_fillna(df)
                              %timeit -n 100 using_np_where(df)
                              %timeit -n 100 using_combine_first(df)
                              %timeit -n 100 using_np_where_numpy(df)

                              1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                              533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                              591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                              423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)





                              share|improve this answer























                              • oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
                                – hamslice
                                Nov 20 '18 at 9:53










                              • @hamslice - it is python 3.6 syntax, for lowwer versions need fillna(df['{}*'.format(c)])
                                – jezrael
                                Nov 20 '18 at 9:54












                              • @hamslice - so not working for you?
                                – jezrael
                                Nov 20 '18 at 9:56










                              • I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't use fillna".
                                – user3471881
                                Nov 20 '18 at 10:10












                              • @user3471881 - removed :)
                                – jezrael
                                Nov 20 '18 at 10:11
















                              2














                              For better performance is possible use numpy.isnan and convert Series to numpy arrays by values:



                              df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)
                              df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)


                              Another general solution if exist only pairs with/without * in columns of DataFrame and is necessary remove * columns:



                              First create MultiIndex by split with append *val:



                              df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)


                              And then select by DataFrame.xs for DataFrames, so DataFrame.fillna working very nice:



                              df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))
                              print (df)
                              a b
                              1 5.0 9.0
                              2 3.0 3.0
                              3 4.0 1.0
                              4 9.0 7.0


                              Performance: (depends of number of missing values and length of DataFrame)



                              df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000, 
                              'A*': [4, 4, 5, 6, 7, 8] * 10000})

                              def using_fillna(df):
                              df['A'] = df['A'].fillna(df['A*'])
                              return df

                              def using_np_where(df):
                              df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])
                              return df

                              def using_np_where_numpy(df):
                              df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
                              return df

                              def using_combine_first(df):
                              df['D'] = df['A'].combine_first(df['A*'])
                              return df

                              %timeit -n 100 using_fillna(df)
                              %timeit -n 100 using_np_where(df)
                              %timeit -n 100 using_combine_first(df)
                              %timeit -n 100 using_np_where_numpy(df)

                              1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                              533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                              591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                              423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)





                              share|improve this answer























                              • oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
                                – hamslice
                                Nov 20 '18 at 9:53










                              • @hamslice - it is python 3.6 syntax, for lowwer versions need fillna(df['{}*'.format(c)])
                                – jezrael
                                Nov 20 '18 at 9:54












                              • @hamslice - so not working for you?
                                – jezrael
                                Nov 20 '18 at 9:56










                              • I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't use fillna".
                                – user3471881
                                Nov 20 '18 at 10:10












                              • @user3471881 - removed :)
                                – jezrael
                                Nov 20 '18 at 10:11














                              2












                              2








                              2






                              For better performance is possible use numpy.isnan and convert Series to numpy arrays by values:



                              df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)
                              df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)


                              Another general solution if exist only pairs with/without * in columns of DataFrame and is necessary remove * columns:



                              First create MultiIndex by split with append *val:



                              df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)


                              And then select by DataFrame.xs for DataFrames, so DataFrame.fillna working very nice:



                              df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))
                              print (df)
                              a b
                              1 5.0 9.0
                              2 3.0 3.0
                              3 4.0 1.0
                              4 9.0 7.0


                              Performance: (depends of number of missing values and length of DataFrame)



                              df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000, 
                              'A*': [4, 4, 5, 6, 7, 8] * 10000})

                              def using_fillna(df):
                              df['A'] = df['A'].fillna(df['A*'])
                              return df

                              def using_np_where(df):
                              df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])
                              return df

                              def using_np_where_numpy(df):
                              df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
                              return df

                              def using_combine_first(df):
                              df['D'] = df['A'].combine_first(df['A*'])
                              return df

                              %timeit -n 100 using_fillna(df)
                              %timeit -n 100 using_np_where(df)
                              %timeit -n 100 using_combine_first(df)
                              %timeit -n 100 using_np_where_numpy(df)

                              1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                              533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                              591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                              423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)





                              share|improve this answer














                              For better performance is possible use numpy.isnan and convert Series to numpy arrays by values:



                              df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)
                              df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['a'].values)


                              Another general solution if exist only pairs with/without * in columns of DataFrame and is necessary remove * columns:



                              First create MultiIndex by split with append *val:



                              df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)


                              And then select by DataFrame.xs for DataFrames, so DataFrame.fillna working very nice:



                              df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))
                              print (df)
                              a b
                              1 5.0 9.0
                              2 3.0 3.0
                              3 4.0 1.0
                              4 9.0 7.0


                              Performance: (depends of number of missing values and length of DataFrame)



                              df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000, 
                              'A*': [4, 4, 5, 6, 7, 8] * 10000})

                              def using_fillna(df):
                              df['A'] = df['A'].fillna(df['A*'])
                              return df

                              def using_np_where(df):
                              df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])
                              return df

                              def using_np_where_numpy(df):
                              df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
                              return df

                              def using_combine_first(df):
                              df['D'] = df['A'].combine_first(df['A*'])
                              return df

                              %timeit -n 100 using_fillna(df)
                              %timeit -n 100 using_np_where(df)
                              %timeit -n 100 using_combine_first(df)
                              %timeit -n 100 using_np_where_numpy(df)

                              1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                              533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                              591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                              423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)






                              share|improve this answer














                              share|improve this answer



                              share|improve this answer








                              edited Nov 20 '18 at 10:23

























                              answered Nov 20 '18 at 9:21









                              jezrael

                              320k22259338




                              320k22259338












                              • oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
                                – hamslice
                                Nov 20 '18 at 9:53










                              • @hamslice - it is python 3.6 syntax, for lowwer versions need fillna(df['{}*'.format(c)])
                                – jezrael
                                Nov 20 '18 at 9:54












                              • @hamslice - so not working for you?
                                – jezrael
                                Nov 20 '18 at 9:56










                              • I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't use fillna".
                                – user3471881
                                Nov 20 '18 at 10:10












                              • @user3471881 - removed :)
                                – jezrael
                                Nov 20 '18 at 10:11


















                              • oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
                                – hamslice
                                Nov 20 '18 at 9:53










                              • @hamslice - it is python 3.6 syntax, for lowwer versions need fillna(df['{}*'.format(c)])
                                – jezrael
                                Nov 20 '18 at 9:54












                              • @hamslice - so not working for you?
                                – jezrael
                                Nov 20 '18 at 9:56










                              • I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't use fillna".
                                – user3471881
                                Nov 20 '18 at 10:10












                              • @user3471881 - removed :)
                                – jezrael
                                Nov 20 '18 at 10:11
















                              oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
                              – hamslice
                              Nov 20 '18 at 9:53




                              oh, that gives me an 'SyntaxError: invalid syntax' on the fillna(df[f'{c}*'])
                              – hamslice
                              Nov 20 '18 at 9:53












                              @hamslice - it is python 3.6 syntax, for lowwer versions need fillna(df['{}*'.format(c)])
                              – jezrael
                              Nov 20 '18 at 9:54






                              @hamslice - it is python 3.6 syntax, for lowwer versions need fillna(df['{}*'.format(c)])
                              – jezrael
                              Nov 20 '18 at 9:54














                              @hamslice - so not working for you?
                              – jezrael
                              Nov 20 '18 at 9:56




                              @hamslice - so not working for you?
                              – jezrael
                              Nov 20 '18 at 9:56












                              I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't use fillna".
                              – user3471881
                              Nov 20 '18 at 10:10






                              I think "use numpy" if important comment is a bit too categorical. The performance difference is not HUGE, rather the conclusion should be "if performance is important, don't use fillna".
                              – user3471881
                              Nov 20 '18 at 10:10














                              @user3471881 - removed :)
                              – jezrael
                              Nov 20 '18 at 10:11




                              @user3471881 - removed :)
                              – jezrael
                              Nov 20 '18 at 10:11


















                              draft saved

                              draft discarded




















































                              Thanks for contributing an answer to Stack Overflow!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              To learn more, see our tips on writing great answers.





                              Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                              Please pay close attention to the following guidance:


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function () {
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53389699%2fpandas-dataframe-copy-the-contents-of-a-column-if-it-is-empty%23new-answer', 'question_page');
                              }
                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              Popular posts from this blog

                              "Incorrect syntax near the keyword 'ON'. (on update cascade, on delete cascade,)

                              Alcedinidae

                              RAC Tourist Trophy