Fastest way to drop rows / get subset with difference from large DataFrame in Pandas












2














Question



I'm looking for the fastest way to drop a set of rows which indices I've got or get the subset of the difference of these indices (which results in the same dataset) from a large Pandas DataFrame.



So far I have two solutions, which seem relatively slow to me:





  1. df.loc[df.difference(indices)]



    which takes ~115 sec on my dataset




  2. df.drop(indices)



    which takes ~215 sec on my dataset




Is there a faster way to do this? Preferably in Pandas.



Performance of proposed Solutions




  • ~41 sec: df[~df.index.isin(indices)] by @jezrael










share|improve this question




















  • 1




    How working df[~df.index.isin(indices)] ?
    – jezrael
    Nov 20 '18 at 13:54












  • gimme a sec. running atm
    – Ichixgo
    Nov 20 '18 at 13:55










  • uuuh nice :D 41 sec ^^. nice speed up so far.
    – Ichixgo
    Nov 20 '18 at 13:56










  • @jezrael I agree that your solution is the fastest but one as to consider how many indices you want to remove. As example if indices2remove is bigger than indices2keep the df[df.index.isin(indices2keep)] it will be faster than df[df.index.isin(indices2remove)]
    – user32185
    Nov 20 '18 at 14:21






  • 1




    @user32185 - Thanks, I try add it to answer.
    – jezrael
    Nov 20 '18 at 14:29
















2














Question



I'm looking for the fastest way to drop a set of rows which indices I've got or get the subset of the difference of these indices (which results in the same dataset) from a large Pandas DataFrame.



So far I have two solutions, which seem relatively slow to me:





  1. df.loc[df.difference(indices)]



    which takes ~115 sec on my dataset




  2. df.drop(indices)



    which takes ~215 sec on my dataset




Is there a faster way to do this? Preferably in Pandas.



Performance of proposed Solutions




  • ~41 sec: df[~df.index.isin(indices)] by @jezrael










share|improve this question




















  • 1




    How working df[~df.index.isin(indices)] ?
    – jezrael
    Nov 20 '18 at 13:54












  • gimme a sec. running atm
    – Ichixgo
    Nov 20 '18 at 13:55










  • uuuh nice :D 41 sec ^^. nice speed up so far.
    – Ichixgo
    Nov 20 '18 at 13:56










  • @jezrael I agree that your solution is the fastest but one as to consider how many indices you want to remove. As example if indices2remove is bigger than indices2keep the df[df.index.isin(indices2keep)] it will be faster than df[df.index.isin(indices2remove)]
    – user32185
    Nov 20 '18 at 14:21






  • 1




    @user32185 - Thanks, I try add it to answer.
    – jezrael
    Nov 20 '18 at 14:29














2












2








2







Question



I'm looking for the fastest way to drop a set of rows which indices I've got or get the subset of the difference of these indices (which results in the same dataset) from a large Pandas DataFrame.



So far I have two solutions, which seem relatively slow to me:





  1. df.loc[df.difference(indices)]



    which takes ~115 sec on my dataset




  2. df.drop(indices)



    which takes ~215 sec on my dataset




Is there a faster way to do this? Preferably in Pandas.



Performance of proposed Solutions




  • ~41 sec: df[~df.index.isin(indices)] by @jezrael










share|improve this question















Question



I'm looking for the fastest way to drop a set of rows which indices I've got or get the subset of the difference of these indices (which results in the same dataset) from a large Pandas DataFrame.



So far I have two solutions, which seem relatively slow to me:





  1. df.loc[df.difference(indices)]



    which takes ~115 sec on my dataset




  2. df.drop(indices)



    which takes ~215 sec on my dataset




Is there a faster way to do this? Preferably in Pandas.



Performance of proposed Solutions




  • ~41 sec: df[~df.index.isin(indices)] by @jezrael







python pandas dataframe






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 20 '18 at 14:00







Ichixgo

















asked Nov 20 '18 at 13:52









IchixgoIchixgo

1618




1618








  • 1




    How working df[~df.index.isin(indices)] ?
    – jezrael
    Nov 20 '18 at 13:54












  • gimme a sec. running atm
    – Ichixgo
    Nov 20 '18 at 13:55










  • uuuh nice :D 41 sec ^^. nice speed up so far.
    – Ichixgo
    Nov 20 '18 at 13:56










  • @jezrael I agree that your solution is the fastest but one as to consider how many indices you want to remove. As example if indices2remove is bigger than indices2keep the df[df.index.isin(indices2keep)] it will be faster than df[df.index.isin(indices2remove)]
    – user32185
    Nov 20 '18 at 14:21






  • 1




    @user32185 - Thanks, I try add it to answer.
    – jezrael
    Nov 20 '18 at 14:29














  • 1




    How working df[~df.index.isin(indices)] ?
    – jezrael
    Nov 20 '18 at 13:54












  • gimme a sec. running atm
    – Ichixgo
    Nov 20 '18 at 13:55










  • uuuh nice :D 41 sec ^^. nice speed up so far.
    – Ichixgo
    Nov 20 '18 at 13:56










  • @jezrael I agree that your solution is the fastest but one as to consider how many indices you want to remove. As example if indices2remove is bigger than indices2keep the df[df.index.isin(indices2keep)] it will be faster than df[df.index.isin(indices2remove)]
    – user32185
    Nov 20 '18 at 14:21






  • 1




    @user32185 - Thanks, I try add it to answer.
    – jezrael
    Nov 20 '18 at 14:29








1




1




How working df[~df.index.isin(indices)] ?
– jezrael
Nov 20 '18 at 13:54






How working df[~df.index.isin(indices)] ?
– jezrael
Nov 20 '18 at 13:54














gimme a sec. running atm
– Ichixgo
Nov 20 '18 at 13:55




gimme a sec. running atm
– Ichixgo
Nov 20 '18 at 13:55












uuuh nice :D 41 sec ^^. nice speed up so far.
– Ichixgo
Nov 20 '18 at 13:56




uuuh nice :D 41 sec ^^. nice speed up so far.
– Ichixgo
Nov 20 '18 at 13:56












@jezrael I agree that your solution is the fastest but one as to consider how many indices you want to remove. As example if indices2remove is bigger than indices2keep the df[df.index.isin(indices2keep)] it will be faster than df[df.index.isin(indices2remove)]
– user32185
Nov 20 '18 at 14:21




@jezrael I agree that your solution is the fastest but one as to consider how many indices you want to remove. As example if indices2remove is bigger than indices2keep the df[df.index.isin(indices2keep)] it will be faster than df[df.index.isin(indices2remove)]
– user32185
Nov 20 '18 at 14:21




1




1




@user32185 - Thanks, I try add it to answer.
– jezrael
Nov 20 '18 at 14:29




@user32185 - Thanks, I try add it to answer.
– jezrael
Nov 20 '18 at 14:29












3 Answers
3






active

oldest

votes


















3














I believe you can create boolean mask, inverting by ~ and filtering by boolean indexing:



df1 = df[~df.index.isin(indices)]


As @user3471881 mentioned for avoid chained indexing if you are planning on manipulating the filtered df later is necessary add copy:



df1 = df[~df.index.isin(indices)].copy()




This filtering depends of number of matched indices and also by length of DataFrame.



So another possible solution is create array/list of indices for keeping and then inverting is not necessary:



df1 = df[df.index.isin(need_indices)]





share|improve this answer























  • Note that if you save this in a new variable you will have to copy the filtered df, if you are planning on manipulating the filtered dflater. Don't know how this will effect performance but worth noting.
    – user3471881
    Nov 20 '18 at 14:00



















2














Using iloc (or loc, see below) and Series.drop:



df = pd.DataFrame(np.arange(0, 1000000, 1))
indices = np.arange(0, 1000000, 3)

%timeit -n 100 df[~df.index.isin(indices)]
%timeit -n 100 df.iloc[df.index.drop(indices)]

41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


As @jezrael points out you can only use iloc if index is a RangeIndex otherwise you will have to use loc. But this is still faster than df[df.isin()] (see why below).



All three options on 10 million rows:



df = pd.DataFrame(np.arange(0, 10000000, 1))
indices = np.arange(0, 10000000, 3)

%timeit -n 10 df[~df.index.isin(indices)]
%timeit -n 10 df.iloc[df.index.drop(indices)]
%timeit -n 10 df.loc[df.index.drop(indices)]

4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)




Why does super slow loc outperform boolean_indexing?



Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given above data with 10 million rows):



%timeit -n 10 ~df.index.isin(indices)
%timeit -n 10 df.index.drop(indices)

4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


We can compare this to the performance of boolean_indexing vs iloc vs loc:



boolean_mask = ~df.index.isin(indices)
dropped_index = df.index.drop(indices)

%timeit -n 10 df[boolean_mask]
%timeit -n 10 df.iloc[dropped_index]
%timeit -n 10 df.loc[dropped_index]


489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)





share|improve this answer























  • hmmm, iloc is possible use only for default RangeIndex, so not sure if possible use for OP solution.. :)
    – jezrael
    Nov 20 '18 at 14:47










  • @user3471881 I'll check out your solution later.
    – Ichixgo
    Nov 20 '18 at 14:54






  • 1




    updated with loc @jezrael. Thanks for pointing it out!
    – user3471881
    Nov 20 '18 at 15:09










  • added some extra context for why even loc outperforms boolean_indexing in this case.
    – user3471881
    Nov 20 '18 at 15:55



















1














If order of rows doesn't mind, you can arrange them in place :



n=10**7
df=pd.DataFrame(arange(4*n).reshape(n,4))
indices=np.unique(randint(0,n,size=n//2))

from numba import njit
@njit
def _dropfew(values,indices):
k=len(values)-1
for ind in indices[::-1]:
values[ind]=values[k]
k-=1

def dropfew(df,indices):
_dropfew(df.values,indices)
return df.iloc[:len(df)-len(indices)]


Runs :



In [39]: %time df.iloc[df.index.drop(indices)]
Wall time: 1.07 s

In [40]: %time dropfew(df,indices)
Wall time: 219 ms





share|improve this answer























  • upvoted for speed but OP asked for solutions using only pandas as far as I understood, this requires numba which should at least be mentioned explicitly. (not only by importing the module :D)
    – user3471881
    Nov 20 '18 at 16:50












  • @user3471881: OP said preferably ;) . But that shuffle rows.
    – B. M.
    Nov 20 '18 at 16:59











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53394523%2ffastest-way-to-drop-rows-get-subset-with-difference-from-large-dataframe-in-pa%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























3 Answers
3






active

oldest

votes








3 Answers
3






active

oldest

votes









active

oldest

votes






active

oldest

votes









3














I believe you can create boolean mask, inverting by ~ and filtering by boolean indexing:



df1 = df[~df.index.isin(indices)]


As @user3471881 mentioned for avoid chained indexing if you are planning on manipulating the filtered df later is necessary add copy:



df1 = df[~df.index.isin(indices)].copy()




This filtering depends of number of matched indices and also by length of DataFrame.



So another possible solution is create array/list of indices for keeping and then inverting is not necessary:



df1 = df[df.index.isin(need_indices)]





share|improve this answer























  • Note that if you save this in a new variable you will have to copy the filtered df, if you are planning on manipulating the filtered dflater. Don't know how this will effect performance but worth noting.
    – user3471881
    Nov 20 '18 at 14:00
















3














I believe you can create boolean mask, inverting by ~ and filtering by boolean indexing:



df1 = df[~df.index.isin(indices)]


As @user3471881 mentioned for avoid chained indexing if you are planning on manipulating the filtered df later is necessary add copy:



df1 = df[~df.index.isin(indices)].copy()




This filtering depends of number of matched indices and also by length of DataFrame.



So another possible solution is create array/list of indices for keeping and then inverting is not necessary:



df1 = df[df.index.isin(need_indices)]





share|improve this answer























  • Note that if you save this in a new variable you will have to copy the filtered df, if you are planning on manipulating the filtered dflater. Don't know how this will effect performance but worth noting.
    – user3471881
    Nov 20 '18 at 14:00














3












3








3






I believe you can create boolean mask, inverting by ~ and filtering by boolean indexing:



df1 = df[~df.index.isin(indices)]


As @user3471881 mentioned for avoid chained indexing if you are planning on manipulating the filtered df later is necessary add copy:



df1 = df[~df.index.isin(indices)].copy()




This filtering depends of number of matched indices and also by length of DataFrame.



So another possible solution is create array/list of indices for keeping and then inverting is not necessary:



df1 = df[df.index.isin(need_indices)]





share|improve this answer














I believe you can create boolean mask, inverting by ~ and filtering by boolean indexing:



df1 = df[~df.index.isin(indices)]


As @user3471881 mentioned for avoid chained indexing if you are planning on manipulating the filtered df later is necessary add copy:



df1 = df[~df.index.isin(indices)].copy()




This filtering depends of number of matched indices and also by length of DataFrame.



So another possible solution is create array/list of indices for keeping and then inverting is not necessary:



df1 = df[df.index.isin(need_indices)]






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 20 '18 at 14:28

























answered Nov 20 '18 at 13:57









jezraeljezrael

322k23265342




322k23265342












  • Note that if you save this in a new variable you will have to copy the filtered df, if you are planning on manipulating the filtered dflater. Don't know how this will effect performance but worth noting.
    – user3471881
    Nov 20 '18 at 14:00


















  • Note that if you save this in a new variable you will have to copy the filtered df, if you are planning on manipulating the filtered dflater. Don't know how this will effect performance but worth noting.
    – user3471881
    Nov 20 '18 at 14:00
















Note that if you save this in a new variable you will have to copy the filtered df, if you are planning on manipulating the filtered dflater. Don't know how this will effect performance but worth noting.
– user3471881
Nov 20 '18 at 14:00




Note that if you save this in a new variable you will have to copy the filtered df, if you are planning on manipulating the filtered dflater. Don't know how this will effect performance but worth noting.
– user3471881
Nov 20 '18 at 14:00













2














Using iloc (or loc, see below) and Series.drop:



df = pd.DataFrame(np.arange(0, 1000000, 1))
indices = np.arange(0, 1000000, 3)

%timeit -n 100 df[~df.index.isin(indices)]
%timeit -n 100 df.iloc[df.index.drop(indices)]

41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


As @jezrael points out you can only use iloc if index is a RangeIndex otherwise you will have to use loc. But this is still faster than df[df.isin()] (see why below).



All three options on 10 million rows:



df = pd.DataFrame(np.arange(0, 10000000, 1))
indices = np.arange(0, 10000000, 3)

%timeit -n 10 df[~df.index.isin(indices)]
%timeit -n 10 df.iloc[df.index.drop(indices)]
%timeit -n 10 df.loc[df.index.drop(indices)]

4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)




Why does super slow loc outperform boolean_indexing?



Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given above data with 10 million rows):



%timeit -n 10 ~df.index.isin(indices)
%timeit -n 10 df.index.drop(indices)

4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


We can compare this to the performance of boolean_indexing vs iloc vs loc:



boolean_mask = ~df.index.isin(indices)
dropped_index = df.index.drop(indices)

%timeit -n 10 df[boolean_mask]
%timeit -n 10 df.iloc[dropped_index]
%timeit -n 10 df.loc[dropped_index]


489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)





share|improve this answer























  • hmmm, iloc is possible use only for default RangeIndex, so not sure if possible use for OP solution.. :)
    – jezrael
    Nov 20 '18 at 14:47










  • @user3471881 I'll check out your solution later.
    – Ichixgo
    Nov 20 '18 at 14:54






  • 1




    updated with loc @jezrael. Thanks for pointing it out!
    – user3471881
    Nov 20 '18 at 15:09










  • added some extra context for why even loc outperforms boolean_indexing in this case.
    – user3471881
    Nov 20 '18 at 15:55
















2














Using iloc (or loc, see below) and Series.drop:



df = pd.DataFrame(np.arange(0, 1000000, 1))
indices = np.arange(0, 1000000, 3)

%timeit -n 100 df[~df.index.isin(indices)]
%timeit -n 100 df.iloc[df.index.drop(indices)]

41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


As @jezrael points out you can only use iloc if index is a RangeIndex otherwise you will have to use loc. But this is still faster than df[df.isin()] (see why below).



All three options on 10 million rows:



df = pd.DataFrame(np.arange(0, 10000000, 1))
indices = np.arange(0, 10000000, 3)

%timeit -n 10 df[~df.index.isin(indices)]
%timeit -n 10 df.iloc[df.index.drop(indices)]
%timeit -n 10 df.loc[df.index.drop(indices)]

4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)




Why does super slow loc outperform boolean_indexing?



Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given above data with 10 million rows):



%timeit -n 10 ~df.index.isin(indices)
%timeit -n 10 df.index.drop(indices)

4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


We can compare this to the performance of boolean_indexing vs iloc vs loc:



boolean_mask = ~df.index.isin(indices)
dropped_index = df.index.drop(indices)

%timeit -n 10 df[boolean_mask]
%timeit -n 10 df.iloc[dropped_index]
%timeit -n 10 df.loc[dropped_index]


489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)





share|improve this answer























  • hmmm, iloc is possible use only for default RangeIndex, so not sure if possible use for OP solution.. :)
    – jezrael
    Nov 20 '18 at 14:47










  • @user3471881 I'll check out your solution later.
    – Ichixgo
    Nov 20 '18 at 14:54






  • 1




    updated with loc @jezrael. Thanks for pointing it out!
    – user3471881
    Nov 20 '18 at 15:09










  • added some extra context for why even loc outperforms boolean_indexing in this case.
    – user3471881
    Nov 20 '18 at 15:55














2












2








2






Using iloc (or loc, see below) and Series.drop:



df = pd.DataFrame(np.arange(0, 1000000, 1))
indices = np.arange(0, 1000000, 3)

%timeit -n 100 df[~df.index.isin(indices)]
%timeit -n 100 df.iloc[df.index.drop(indices)]

41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


As @jezrael points out you can only use iloc if index is a RangeIndex otherwise you will have to use loc. But this is still faster than df[df.isin()] (see why below).



All three options on 10 million rows:



df = pd.DataFrame(np.arange(0, 10000000, 1))
indices = np.arange(0, 10000000, 3)

%timeit -n 10 df[~df.index.isin(indices)]
%timeit -n 10 df.iloc[df.index.drop(indices)]
%timeit -n 10 df.loc[df.index.drop(indices)]

4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)




Why does super slow loc outperform boolean_indexing?



Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given above data with 10 million rows):



%timeit -n 10 ~df.index.isin(indices)
%timeit -n 10 df.index.drop(indices)

4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


We can compare this to the performance of boolean_indexing vs iloc vs loc:



boolean_mask = ~df.index.isin(indices)
dropped_index = df.index.drop(indices)

%timeit -n 10 df[boolean_mask]
%timeit -n 10 df.iloc[dropped_index]
%timeit -n 10 df.loc[dropped_index]


489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)





share|improve this answer














Using iloc (or loc, see below) and Series.drop:



df = pd.DataFrame(np.arange(0, 1000000, 1))
indices = np.arange(0, 1000000, 3)

%timeit -n 100 df[~df.index.isin(indices)]
%timeit -n 100 df.iloc[df.index.drop(indices)]

41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


As @jezrael points out you can only use iloc if index is a RangeIndex otherwise you will have to use loc. But this is still faster than df[df.isin()] (see why below).



All three options on 10 million rows:



df = pd.DataFrame(np.arange(0, 10000000, 1))
indices = np.arange(0, 10000000, 3)

%timeit -n 10 df[~df.index.isin(indices)]
%timeit -n 10 df.iloc[df.index.drop(indices)]
%timeit -n 10 df.loc[df.index.drop(indices)]

4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)




Why does super slow loc outperform boolean_indexing?



Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given above data with 10 million rows):



%timeit -n 10 ~df.index.isin(indices)
%timeit -n 10 df.index.drop(indices)

4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


We can compare this to the performance of boolean_indexing vs iloc vs loc:



boolean_mask = ~df.index.isin(indices)
dropped_index = df.index.drop(indices)

%timeit -n 10 df[boolean_mask]
%timeit -n 10 df.iloc[dropped_index]
%timeit -n 10 df.loc[dropped_index]


489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 20 '18 at 15:54

























answered Nov 20 '18 at 14:36









user3471881user3471881

1,1232619




1,1232619












  • hmmm, iloc is possible use only for default RangeIndex, so not sure if possible use for OP solution.. :)
    – jezrael
    Nov 20 '18 at 14:47










  • @user3471881 I'll check out your solution later.
    – Ichixgo
    Nov 20 '18 at 14:54






  • 1




    updated with loc @jezrael. Thanks for pointing it out!
    – user3471881
    Nov 20 '18 at 15:09










  • added some extra context for why even loc outperforms boolean_indexing in this case.
    – user3471881
    Nov 20 '18 at 15:55


















  • hmmm, iloc is possible use only for default RangeIndex, so not sure if possible use for OP solution.. :)
    – jezrael
    Nov 20 '18 at 14:47










  • @user3471881 I'll check out your solution later.
    – Ichixgo
    Nov 20 '18 at 14:54






  • 1




    updated with loc @jezrael. Thanks for pointing it out!
    – user3471881
    Nov 20 '18 at 15:09










  • added some extra context for why even loc outperforms boolean_indexing in this case.
    – user3471881
    Nov 20 '18 at 15:55
















hmmm, iloc is possible use only for default RangeIndex, so not sure if possible use for OP solution.. :)
– jezrael
Nov 20 '18 at 14:47




hmmm, iloc is possible use only for default RangeIndex, so not sure if possible use for OP solution.. :)
– jezrael
Nov 20 '18 at 14:47












@user3471881 I'll check out your solution later.
– Ichixgo
Nov 20 '18 at 14:54




@user3471881 I'll check out your solution later.
– Ichixgo
Nov 20 '18 at 14:54




1




1




updated with loc @jezrael. Thanks for pointing it out!
– user3471881
Nov 20 '18 at 15:09




updated with loc @jezrael. Thanks for pointing it out!
– user3471881
Nov 20 '18 at 15:09












added some extra context for why even loc outperforms boolean_indexing in this case.
– user3471881
Nov 20 '18 at 15:55




added some extra context for why even loc outperforms boolean_indexing in this case.
– user3471881
Nov 20 '18 at 15:55











1














If order of rows doesn't mind, you can arrange them in place :



n=10**7
df=pd.DataFrame(arange(4*n).reshape(n,4))
indices=np.unique(randint(0,n,size=n//2))

from numba import njit
@njit
def _dropfew(values,indices):
k=len(values)-1
for ind in indices[::-1]:
values[ind]=values[k]
k-=1

def dropfew(df,indices):
_dropfew(df.values,indices)
return df.iloc[:len(df)-len(indices)]


Runs :



In [39]: %time df.iloc[df.index.drop(indices)]
Wall time: 1.07 s

In [40]: %time dropfew(df,indices)
Wall time: 219 ms





share|improve this answer























  • upvoted for speed but OP asked for solutions using only pandas as far as I understood, this requires numba which should at least be mentioned explicitly. (not only by importing the module :D)
    – user3471881
    Nov 20 '18 at 16:50












  • @user3471881: OP said preferably ;) . But that shuffle rows.
    – B. M.
    Nov 20 '18 at 16:59
















1














If order of rows doesn't mind, you can arrange them in place :



n=10**7
df=pd.DataFrame(arange(4*n).reshape(n,4))
indices=np.unique(randint(0,n,size=n//2))

from numba import njit
@njit
def _dropfew(values,indices):
k=len(values)-1
for ind in indices[::-1]:
values[ind]=values[k]
k-=1

def dropfew(df,indices):
_dropfew(df.values,indices)
return df.iloc[:len(df)-len(indices)]


Runs :



In [39]: %time df.iloc[df.index.drop(indices)]
Wall time: 1.07 s

In [40]: %time dropfew(df,indices)
Wall time: 219 ms





share|improve this answer























  • upvoted for speed but OP asked for solutions using only pandas as far as I understood, this requires numba which should at least be mentioned explicitly. (not only by importing the module :D)
    – user3471881
    Nov 20 '18 at 16:50












  • @user3471881: OP said preferably ;) . But that shuffle rows.
    – B. M.
    Nov 20 '18 at 16:59














1












1








1






If order of rows doesn't mind, you can arrange them in place :



n=10**7
df=pd.DataFrame(arange(4*n).reshape(n,4))
indices=np.unique(randint(0,n,size=n//2))

from numba import njit
@njit
def _dropfew(values,indices):
k=len(values)-1
for ind in indices[::-1]:
values[ind]=values[k]
k-=1

def dropfew(df,indices):
_dropfew(df.values,indices)
return df.iloc[:len(df)-len(indices)]


Runs :



In [39]: %time df.iloc[df.index.drop(indices)]
Wall time: 1.07 s

In [40]: %time dropfew(df,indices)
Wall time: 219 ms





share|improve this answer














If order of rows doesn't mind, you can arrange them in place :



n=10**7
df=pd.DataFrame(arange(4*n).reshape(n,4))
indices=np.unique(randint(0,n,size=n//2))

from numba import njit
@njit
def _dropfew(values,indices):
k=len(values)-1
for ind in indices[::-1]:
values[ind]=values[k]
k-=1

def dropfew(df,indices):
_dropfew(df.values,indices)
return df.iloc[:len(df)-len(indices)]


Runs :



In [39]: %time df.iloc[df.index.drop(indices)]
Wall time: 1.07 s

In [40]: %time dropfew(df,indices)
Wall time: 219 ms






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 20 '18 at 16:50

























answered Nov 20 '18 at 16:26









B. M.B. M.

12.9k11934




12.9k11934












  • upvoted for speed but OP asked for solutions using only pandas as far as I understood, this requires numba which should at least be mentioned explicitly. (not only by importing the module :D)
    – user3471881
    Nov 20 '18 at 16:50












  • @user3471881: OP said preferably ;) . But that shuffle rows.
    – B. M.
    Nov 20 '18 at 16:59


















  • upvoted for speed but OP asked for solutions using only pandas as far as I understood, this requires numba which should at least be mentioned explicitly. (not only by importing the module :D)
    – user3471881
    Nov 20 '18 at 16:50












  • @user3471881: OP said preferably ;) . But that shuffle rows.
    – B. M.
    Nov 20 '18 at 16:59
















upvoted for speed but OP asked for solutions using only pandas as far as I understood, this requires numba which should at least be mentioned explicitly. (not only by importing the module :D)
– user3471881
Nov 20 '18 at 16:50






upvoted for speed but OP asked for solutions using only pandas as far as I understood, this requires numba which should at least be mentioned explicitly. (not only by importing the module :D)
– user3471881
Nov 20 '18 at 16:50














@user3471881: OP said preferably ;) . But that shuffle rows.
– B. M.
Nov 20 '18 at 16:59




@user3471881: OP said preferably ;) . But that shuffle rows.
– B. M.
Nov 20 '18 at 16:59


















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53394523%2ffastest-way-to-drop-rows-get-subset-with-difference-from-large-dataframe-in-pa%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

"Incorrect syntax near the keyword 'ON'. (on update cascade, on delete cascade,)

Alcedinidae

RAC Tourist Trophy