Fastest way to drop rows / get subset with difference from large DataFrame in Pandas
Question
I'm looking for the fastest way to drop a set of rows whose indices I already have, or equivalently to select the complementary subset of those indices (which results in the same dataset), from a large Pandas DataFrame.
So far I have two solutions, which seem relatively slow to me:
df.loc[df.index.difference(indices)]
which takes ~115 sec on my dataset
df.drop(indices)
which takes ~215 sec on my dataset
Is there a faster way to do this? Preferably in Pandas.
Performance of proposed solutions
- ~41 sec: df[~df.index.isin(indices)] (by @jezrael)
A sketch for reproducing this comparison on synthetic data is shown below.
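A minimal reproduction sketch (the frame size, columns, and the construction of indices are assumptions, not the actual dataset) comparing the approaches discussed in this thread:

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data; sizes are assumptions.
n = 1_000_000
df = pd.DataFrame({"a": np.arange(n), "b": np.random.rand(n)})
indices = pd.Index(np.arange(0, n, 3))  # drop every third row

subset_via_difference = df.loc[df.index.difference(indices)]  # the OP's first approach
subset_via_drop = df.drop(indices)                            # the OP's second approach
subset_via_mask = df[~df.index.isin(indices)]                 # the proposed faster approach

assert len(subset_via_mask) == n - len(indices)

In IPython, each expression can be wrapped in %timeit to compare the approaches; the absolute numbers will differ from those quoted above, but the relative ordering should hold.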
Tags: python, pandas, dataframe
How is df[~df.index.isin(indices)] performing?
– jezrael, Nov 20 '18 at 13:54
gimme a sec. running atm
– Ichixgo, Nov 20 '18 at 13:55
uuuh nice :D 41 sec ^^. nice speed up so far.
– Ichixgo, Nov 20 '18 at 13:56
@jezrael I agree that your solution is the fastest, but one has to consider how many indices you want to remove. For example, if indices2remove is bigger than indices2keep, then df[df.index.isin(indices2keep)] will be faster than df[~df.index.isin(indices2remove)].
– user32185, Nov 20 '18 at 14:21
@user32185 - Thanks, I'll try to add it to the answer.
– jezrael, Nov 20 '18 at 14:29
Asked Nov 20 '18 at 13:52 by Ichixgo (edited Nov 20 '18 at 14:00).
3 Answers
Answer by jezrael (Nov 20 '18 at 13:57, edited Nov 20 '18 at 14:28):
I believe you can create a boolean mask, invert it with ~, and filter by boolean indexing:
df1 = df[~df.index.isin(indices)]
As @user3471881 mentioned, to avoid chained indexing it is necessary to add copy if you are planning to manipulate the filtered df later:
df1 = df[~df.index.isin(indices)].copy()
This filtering depends on the number of matched indices and also on the length of the DataFrame. So another possible solution is to create an array/list of the indices to keep; then no inverting is necessary:
df1 = df[df.index.isin(need_indices)]
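As a rough illustration of the point raised by @user32185 in the comments above (the names indices2remove and indices2keep are hypothetical), the two complementary forms look like this; which one is faster depends on which label set is smaller:

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1_000_000)})

# Hypothetical split of the index into labels to drop and labels to keep.
indices2remove = df.index[::4]                       # 25% of the rows
indices2keep = df.index.difference(indices2remove)   # the remaining 75%

# Equivalent results; the variant built from the smaller label set tends to be faster.
dropped = df[~df.index.isin(indices2remove)].copy()
kept = df[df.index.isin(indices2keep)].copy()

assert dropped.equals(kept)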
Note that if you save this in a new variable you will have to copy the filtered df if you are planning to manipulate it later. Don't know how this will affect performance, but worth noting.
– user3471881, Nov 20 '18 at 14:00
Answer by user3471881 (Nov 20 '18 at 14:36, edited Nov 20 '18 at 15:54):
Using iloc (or loc, see below) and Index.drop:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 1000000, 1))
indices = np.arange(0, 1000000, 3)
%timeit -n 100 df[~df.index.isin(indices)]
%timeit -n 100 df.iloc[df.index.drop(indices)]
41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
As @jezrael points out, you can only use iloc if the index is a RangeIndex; otherwise you will have to use loc. But this is still faster than df[~df.index.isin(indices)] (see why below).
All three options on 10 million rows:
df = pd.DataFrame(np.arange(0, 10000000, 1))
indices = np.arange(0, 10000000, 3)
%timeit -n 10 df[~df.index.isin(indices)]
%timeit -n 10 df.iloc[df.index.drop(indices)]
%timeit -n 10 df.loc[df.index.drop(indices)]
4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why does the supposedly slow loc outperform boolean indexing?
Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given the above data with 10 million rows):
%timeit -n 10 ~df.index.isin(indices)
%timeit -n 10 df.index.drop(indices)
4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can compare this to the performance of boolean indexing vs iloc vs loc once the mask and the reduced index are precomputed:
boolean_mask = ~df.index.isin(indices)
dropped_index = df.index.drop(indices)
%timeit -n 10 df[boolean_mask]
%timeit -n 10 df.iloc[dropped_index]
%timeit -n 10 df.loc[dropped_index]
489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
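To make the RangeIndex caveat concrete, a small sketch (the string index is an invented example): with a non-default index, Index.drop returns labels, so they have to be passed to loc rather than iloc:

import numpy as np
import pandas as pd

# DataFrame with a non-default (string) index -- an invented example.
df2 = pd.DataFrame({"val": np.arange(5)}, index=list("abcde"))
to_drop = pd.Index(["b", "d"])

remaining_labels = df2.index.drop(to_drop)   # Index(['a', 'c', 'e'], dtype='object')

kept = df2.loc[remaining_labels]             # works: loc selects by label
# df2.iloc[remaining_labels]                 # fails: iloc expects integer positions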
Hmmm, iloc can only be used with a default RangeIndex, so I'm not sure it can be used for the OP's case. :)
– jezrael, Nov 20 '18 at 14:47
@user3471881 I'll check out your solution later.
– Ichixgo, Nov 20 '18 at 14:54
Updated with loc, @jezrael. Thanks for pointing it out!
– user3471881, Nov 20 '18 at 15:09
Added some extra context for why even loc outperforms boolean indexing in this case.
– user3471881, Nov 20 '18 at 15:55
Answer by B. M. (Nov 20 '18 at 16:26, edited Nov 20 '18 at 16:50):
If the order of the rows doesn't matter, you can rearrange them in place:
import numpy as np
import pandas as pd
from numba import njit

n = 10**7
df = pd.DataFrame(np.arange(4 * n).reshape(n, 4))
indices = np.unique(np.random.randint(0, n, size=n // 2))

@njit
def _dropfew(values, indices):
    # Overwrite each row to drop with a row taken from the end of the array.
    k = len(values) - 1
    for ind in indices[::-1]:
        values[ind] = values[k]
        k -= 1

def dropfew(df, indices):
    _dropfew(df.values, indices)
    return df.iloc[:len(df) - len(indices)]

Runs:
In [39]: %time df.iloc[df.index.drop(indices)]
Wall time: 1.07 s
In [40]: %time dropfew(df, indices)
Wall time: 219 ms
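For reference, a pure NumPy sketch of the same swap-and-truncate idea without numba, under the assumption that indices is sorted and unique (as np.unique guarantees above); like the answer's version, it does not preserve row order:

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame(np.arange(4 * n).reshape(n, 4))
indices = np.unique(np.random.randint(0, n, size=n // 10))
m = len(indices)

values = df.values                          # typically a view for a single-dtype frame
tail = np.arange(n - m, n)
keep_tail = np.setdiff1d(tail, indices)     # surviving rows that live in the tail
fill_slots = indices[indices < n - m]       # dropped slots inside the kept region

values[fill_slots] = values[keep_tail]      # move the survivors into the holes
result = df.iloc[:n - m]                    # cut off the now-redundant tail
assert len(result) == n - m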
Upvoted for speed, but the OP asked for solutions using only pandas as far as I understood; this requires numba, which should at least be mentioned explicitly (not only by importing the module :D).
– user3471881, Nov 20 '18 at 16:50
@user3471881: OP said preferably ;). But that shuffles the rows.
– B. M., Nov 20 '18 at 16:59