Fastest way to drop rows / get subset with difference from large DataFrame in Pandas

Question

I'm looking for the fastest way to drop a set of rows which indices I've got or get the subset of the difference of these indices (which results in the same dataset) from a large Pandas DataFrame.

So far I have two solutions, which seem relatively slow to me:

df.loc[df.difference(indices)]

which takes ~115 sec on my dataset

df.drop(indices)

which takes ~215 sec on my dataset

Is there a faster way to do this? Preferably in Pandas.

Performance of proposed Solutions

~41 sec: df[~df.index.isin(indices)] by @jezrael

edited Nov 20 '18 at 14:00

asked Nov 20 '18 at 13:52

Ichixgo

1618

1

How working df[~df.index.isin(indices)] ?
– jezrael
Nov 20 '18 at 13:54

gimme a sec. running atm
– Ichixgo
Nov 20 '18 at 13:55

uuuh nice :D 41 sec ^^. nice speed up so far.
– Ichixgo
Nov 20 '18 at 13:56

@jezrael I agree that your solution is the fastest but one as to consider how many indices you want to remove. As example if indices2remove is bigger than indices2keep the df[df.index.isin(indices2keep)] it will be faster than df[df.index.isin(indices2remove)]
– user32185
Nov 20 '18 at 14:21

1

@user32185 - Thanks, I try add it to answer.
– jezrael
Nov 20 '18 at 14:29

|
show 5 more comments

Question

I'm looking for the fastest way to drop a set of rows which indices I've got or get the subset of the difference of these indices (which results in the same dataset) from a large Pandas DataFrame.

So far I have two solutions, which seem relatively slow to me:

df.loc[df.difference(indices)]

which takes ~115 sec on my dataset

df.drop(indices)

which takes ~215 sec on my dataset

Is there a faster way to do this? Preferably in Pandas.

Performance of proposed Solutions

~41 sec: df[~df.index.isin(indices)] by @jezrael

edited Nov 20 '18 at 14:00

asked Nov 20 '18 at 13:52

Ichixgo

1618

1

How working df[~df.index.isin(indices)] ?
– jezrael
Nov 20 '18 at 13:54

gimme a sec. running atm
– Ichixgo
Nov 20 '18 at 13:55

uuuh nice :D 41 sec ^^. nice speed up so far.
– Ichixgo
Nov 20 '18 at 13:56

@jezrael I agree that your solution is the fastest but one as to consider how many indices you want to remove. As example if indices2remove is bigger than indices2keep the df[df.index.isin(indices2keep)] it will be faster than df[df.index.isin(indices2remove)]
– user32185
Nov 20 '18 at 14:21

1

@user32185 - Thanks, I try add it to answer.
– jezrael
Nov 20 '18 at 14:29

|
show 5 more comments

Question

I'm looking for the fastest way to drop a set of rows which indices I've got or get the subset of the difference of these indices (which results in the same dataset) from a large Pandas DataFrame.

So far I have two solutions, which seem relatively slow to me:

df.loc[df.difference(indices)]

which takes ~115 sec on my dataset

df.drop(indices)

which takes ~215 sec on my dataset

Is there a faster way to do this? Preferably in Pandas.

Performance of proposed Solutions

~41 sec: df[~df.index.isin(indices)] by @jezrael

edited Nov 20 '18 at 14:00

asked Nov 20 '18 at 13:52

Ichixgo

1618

Question

I'm looking for the fastest way to drop a set of rows which indices I've got or get the subset of the difference of these indices (which results in the same dataset) from a large Pandas DataFrame.

So far I have two solutions, which seem relatively slow to me:

df.loc[df.difference(indices)]

which takes ~115 sec on my dataset

df.drop(indices)

which takes ~215 sec on my dataset

Is there a faster way to do this? Preferably in Pandas.

Performance of proposed Solutions

~41 sec: df[~df.index.isin(indices)] by @jezrael

python pandas dataframe

edited Nov 20 '18 at 14:00

asked Nov 20 '18 at 13:52

Ichixgo

1618

edited Nov 20 '18 at 14:00

asked Nov 20 '18 at 13:52

Ichixgo

1618

edited Nov 20 '18 at 14:00

asked Nov 20 '18 at 13:52

Ichixgo

1618

asked Nov 20 '18 at 13:52

Ichixgo

1618

asked Nov 20 '18 at 13:52

Ichixgo

1618

1

How working df[~df.index.isin(indices)] ?
– jezrael
Nov 20 '18 at 13:54

gimme a sec. running atm
– Ichixgo
Nov 20 '18 at 13:55

uuuh nice :D 41 sec ^^. nice speed up so far.
– Ichixgo
Nov 20 '18 at 13:56

@jezrael I agree that your solution is the fastest but one as to consider how many indices you want to remove. As example if indices2remove is bigger than indices2keep the df[df.index.isin(indices2keep)] it will be faster than df[df.index.isin(indices2remove)]
– user32185
Nov 20 '18 at 14:21

1

@user32185 - Thanks, I try add it to answer.
– jezrael
Nov 20 '18 at 14:29

|
show 5 more comments

1

How working df[~df.index.isin(indices)] ?
– jezrael
Nov 20 '18 at 13:54

gimme a sec. running atm
– Ichixgo
Nov 20 '18 at 13:55

uuuh nice :D 41 sec ^^. nice speed up so far.
– Ichixgo
Nov 20 '18 at 13:56

@jezrael I agree that your solution is the fastest but one as to consider how many indices you want to remove. As example if indices2remove is bigger than indices2keep the df[df.index.isin(indices2keep)] it will be faster than df[df.index.isin(indices2remove)]
– user32185
Nov 20 '18 at 14:21

1

@user32185 - Thanks, I try add it to answer.
– jezrael
Nov 20 '18 at 14:29

How working df[~df.index.isin(indices)] ?
– jezrael
Nov 20 '18 at 13:54

gimme a sec. running atm
– Ichixgo
Nov 20 '18 at 13:55

uuuh nice :D 41 sec ^^. nice speed up so far.
– Ichixgo
Nov 20 '18 at 13:56

@jezrael I agree that your solution is the fastest but one as to consider how many indices you want to remove. As example if indices2remove is bigger than indices2keep the df[df.index.isin(indices2keep)] it will be faster than df[df.index.isin(indices2remove)]
– user32185
Nov 20 '18 at 14:21

@user32185 - Thanks, I try add it to answer.
– jezrael
Nov 20 '18 at 14:29

|
show 5 more comments

3 Answers
3

active

oldest

votes

I believe you can create boolean mask, inverting by ~ and filtering by boolean indexing:

df1 = df[~df.index.isin(indices)]

As @user3471881 mentioned for avoid chained indexing if you are planning on manipulating the filtered df later is necessary add copy:

df1 = df[~df.index.isin(indices)].copy()

This filtering depends of number of matched indices and also by length of DataFrame.

So another possible solution is create array/list of indices for keeping and then inverting is not necessary:

df1 = df[df.index.isin(need_indices)]

edited Nov 20 '18 at 14:28

answered Nov 20 '18 at 13:57

jezrael

322k23265342

Note that if you save this in a new variable you will have to copy the filtered df, if you are planning on manipulating the filtered dflater. Don't know how this will effect performance but worth noting.
– user3471881
Nov 20 '18 at 14:00

add a comment |

Using iloc (or loc, see below) and Series.drop:

df = pd.DataFrame(np.arange(0, 1000000, 1))

indices = np.arange(0, 1000000, 3)



%timeit -n 100 df[~df.index.isin(indices)]

%timeit -n 100 df.iloc[df.index.drop(indices)]



41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

As @jezrael points out you can only use iloc if index is a RangeIndex otherwise you will have to use loc. But this is still faster than df[df.isin()] (see why below).

All three options on 10 million rows:

df = pd.DataFrame(np.arange(0, 10000000, 1))

indices = np.arange(0, 10000000, 3)



%timeit -n 10 df[~df.index.isin(indices)]

%timeit -n 10 df.iloc[df.index.drop(indices)]

%timeit -n 10 df.loc[df.index.drop(indices)]



4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Why does super slow loc outperform boolean_indexing?

Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given above data with 10 million rows):

%timeit -n 10 ~df.index.isin(indices)

%timeit -n 10 df.index.drop(indices)



4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

We can compare this to the performance of boolean_indexing vs iloc vs loc:

boolean_mask = ~df.index.isin(indices)

dropped_index = df.index.drop(indices)



%timeit -n 10 df[boolean_mask]

%timeit -n 10 df.iloc[dropped_index]

%timeit -n 10 df.loc[dropped_index]





489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

edited Nov 20 '18 at 15:54

answered Nov 20 '18 at 14:36

user3471881

1,1232619

hmmm, iloc is possible use only for default RangeIndex, so not sure if possible use for OP solution.. :)
– jezrael
Nov 20 '18 at 14:47

@user3471881 I'll check out your solution later.
– Ichixgo
Nov 20 '18 at 14:54

1

updated with loc @jezrael. Thanks for pointing it out!
– user3471881
Nov 20 '18 at 15:09

added some extra context for why even loc outperforms boolean_indexing in this case.
– user3471881
Nov 20 '18 at 15:55

add a comment |

If order of rows doesn't mind, you can arrange them in place :

n=10**7

df=pd.DataFrame(arange(4*n).reshape(n,4))

indices=np.unique(randint(0,n,size=n//2))



from numba import njit

@njit

def _dropfew(values,indices):

    k=len(values)-1

    for ind in indices[::-1]:

            values[ind]=values[k]

            k-=1



def dropfew(df,indices):

    _dropfew(df.values,indices)

    return df.iloc[:len(df)-len(indices)]

Runs :

In [39]: %time df.iloc[df.index.drop(indices)]

Wall time: 1.07 s



In [40]: %time dropfew(df,indices)

Wall time: 219 ms

edited Nov 20 '18 at 16:50

answered Nov 20 '18 at 16:26

B. M.

12.9k11934

upvoted for speed but OP asked for solutions using only pandas as far as I understood, this requires numba which should at least be mentioned explicitly. (not only by importing the module :D)
– user3471881
Nov 20 '18 at 16:50

@user3471881: OP said preferably ;) . But that shuffle rows.
– B. M.
Nov 20 '18 at 16:59

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53394523%2ffastest-way-to-drop-rows-get-subset-with-difference-from-large-dataframe-in-pa%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

I believe you can create boolean mask, inverting by ~ and filtering by boolean indexing:

df1 = df[~df.index.isin(indices)]

As @user3471881 mentioned for avoid chained indexing if you are planning on manipulating the filtered df later is necessary add copy:

df1 = df[~df.index.isin(indices)].copy()

This filtering depends of number of matched indices and also by length of DataFrame.

So another possible solution is create array/list of indices for keeping and then inverting is not necessary:

df1 = df[df.index.isin(need_indices)]

edited Nov 20 '18 at 14:28

answered Nov 20 '18 at 13:57

jezrael

322k23265342

Note that if you save this in a new variable you will have to copy the filtered df, if you are planning on manipulating the filtered dflater. Don't know how this will effect performance but worth noting.
– user3471881
Nov 20 '18 at 14:00

add a comment |

I believe you can create boolean mask, inverting by ~ and filtering by boolean indexing:

df1 = df[~df.index.isin(indices)]

As @user3471881 mentioned for avoid chained indexing if you are planning on manipulating the filtered df later is necessary add copy:

df1 = df[~df.index.isin(indices)].copy()

This filtering depends of number of matched indices and also by length of DataFrame.

So another possible solution is create array/list of indices for keeping and then inverting is not necessary:

df1 = df[df.index.isin(need_indices)]

edited Nov 20 '18 at 14:28

answered Nov 20 '18 at 13:57

jezrael

322k23265342

Note that if you save this in a new variable you will have to copy the filtered df, if you are planning on manipulating the filtered dflater. Don't know how this will effect performance but worth noting.
– user3471881
Nov 20 '18 at 14:00

add a comment |

I believe you can create boolean mask, inverting by ~ and filtering by boolean indexing:

df1 = df[~df.index.isin(indices)]

As @user3471881 mentioned for avoid chained indexing if you are planning on manipulating the filtered df later is necessary add copy:

df1 = df[~df.index.isin(indices)].copy()

This filtering depends of number of matched indices and also by length of DataFrame.

So another possible solution is create array/list of indices for keeping and then inverting is not necessary:

df1 = df[df.index.isin(need_indices)]

edited Nov 20 '18 at 14:28

answered Nov 20 '18 at 13:57

jezrael

322k23265342

I believe you can create boolean mask, inverting by ~ and filtering by boolean indexing:

df1 = df[~df.index.isin(indices)]

As @user3471881 mentioned for avoid chained indexing if you are planning on manipulating the filtered df later is necessary add copy:

df1 = df[~df.index.isin(indices)].copy()

This filtering depends of number of matched indices and also by length of DataFrame.

So another possible solution is create array/list of indices for keeping and then inverting is not necessary:

df1 = df[df.index.isin(need_indices)]

edited Nov 20 '18 at 14:28

answered Nov 20 '18 at 13:57

jezrael

322k23265342

edited Nov 20 '18 at 14:28

answered Nov 20 '18 at 13:57

jezrael

322k23265342

answered Nov 20 '18 at 13:57

jezrael

322k23265342

answered Nov 20 '18 at 13:57

jezrael

322k23265342

Note that if you save this in a new variable you will have to copy the filtered df, if you are planning on manipulating the filtered dflater. Don't know how this will effect performance but worth noting.
– user3471881
Nov 20 '18 at 14:00

add a comment |

Note that if you save this in a new variable you will have to copy the filtered df, if you are planning on manipulating the filtered dflater. Don't know how this will effect performance but worth noting.
– user3471881
Nov 20 '18 at 14:00

Note that if you save this in a new variable you will have to copy the filtered df, if you are planning on manipulating the filtered dflater. Don't know how this will effect performance but worth noting.
– user3471881
Nov 20 '18 at 14:00

add a comment |

Using iloc (or loc, see below) and Series.drop:

df = pd.DataFrame(np.arange(0, 1000000, 1))

indices = np.arange(0, 1000000, 3)



%timeit -n 100 df[~df.index.isin(indices)]

%timeit -n 100 df.iloc[df.index.drop(indices)]



41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

As @jezrael points out you can only use iloc if index is a RangeIndex otherwise you will have to use loc. But this is still faster than df[df.isin()] (see why below).

All three options on 10 million rows:

df = pd.DataFrame(np.arange(0, 10000000, 1))

indices = np.arange(0, 10000000, 3)



%timeit -n 10 df[~df.index.isin(indices)]

%timeit -n 10 df.iloc[df.index.drop(indices)]

%timeit -n 10 df.loc[df.index.drop(indices)]



4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Why does super slow loc outperform boolean_indexing?

Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given above data with 10 million rows):

%timeit -n 10 ~df.index.isin(indices)

%timeit -n 10 df.index.drop(indices)



4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

We can compare this to the performance of boolean_indexing vs iloc vs loc:

boolean_mask = ~df.index.isin(indices)

dropped_index = df.index.drop(indices)



%timeit -n 10 df[boolean_mask]

%timeit -n 10 df.iloc[dropped_index]

%timeit -n 10 df.loc[dropped_index]





489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

edited Nov 20 '18 at 15:54

answered Nov 20 '18 at 14:36

user3471881

1,1232619

hmmm, iloc is possible use only for default RangeIndex, so not sure if possible use for OP solution.. :)
– jezrael
Nov 20 '18 at 14:47

@user3471881 I'll check out your solution later.
– Ichixgo
Nov 20 '18 at 14:54

1

updated with loc @jezrael. Thanks for pointing it out!
– user3471881
Nov 20 '18 at 15:09

added some extra context for why even loc outperforms boolean_indexing in this case.
– user3471881
Nov 20 '18 at 15:55

add a comment |

Using iloc (or loc, see below) and Series.drop:

df = pd.DataFrame(np.arange(0, 1000000, 1))

indices = np.arange(0, 1000000, 3)



%timeit -n 100 df[~df.index.isin(indices)]

%timeit -n 100 df.iloc[df.index.drop(indices)]



41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

As @jezrael points out you can only use iloc if index is a RangeIndex otherwise you will have to use loc. But this is still faster than df[df.isin()] (see why below).

All three options on 10 million rows:

df = pd.DataFrame(np.arange(0, 10000000, 1))

indices = np.arange(0, 10000000, 3)



%timeit -n 10 df[~df.index.isin(indices)]

%timeit -n 10 df.iloc[df.index.drop(indices)]

%timeit -n 10 df.loc[df.index.drop(indices)]



4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Why does super slow loc outperform boolean_indexing?

Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given above data with 10 million rows):

%timeit -n 10 ~df.index.isin(indices)

%timeit -n 10 df.index.drop(indices)



4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

We can compare this to the performance of boolean_indexing vs iloc vs loc:

boolean_mask = ~df.index.isin(indices)

dropped_index = df.index.drop(indices)



%timeit -n 10 df[boolean_mask]

%timeit -n 10 df.iloc[dropped_index]

%timeit -n 10 df.loc[dropped_index]





489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

edited Nov 20 '18 at 15:54

answered Nov 20 '18 at 14:36

user3471881

1,1232619

hmmm, iloc is possible use only for default RangeIndex, so not sure if possible use for OP solution.. :)
– jezrael
Nov 20 '18 at 14:47

@user3471881 I'll check out your solution later.
– Ichixgo
Nov 20 '18 at 14:54

1

updated with loc @jezrael. Thanks for pointing it out!
– user3471881
Nov 20 '18 at 15:09

added some extra context for why even loc outperforms boolean_indexing in this case.
– user3471881
Nov 20 '18 at 15:55

add a comment |

Using iloc (or loc, see below) and Series.drop:

df = pd.DataFrame(np.arange(0, 1000000, 1))

indices = np.arange(0, 1000000, 3)



%timeit -n 100 df[~df.index.isin(indices)]

%timeit -n 100 df.iloc[df.index.drop(indices)]



41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

As @jezrael points out you can only use iloc if index is a RangeIndex otherwise you will have to use loc. But this is still faster than df[df.isin()] (see why below).

All three options on 10 million rows:

df = pd.DataFrame(np.arange(0, 10000000, 1))

indices = np.arange(0, 10000000, 3)



%timeit -n 10 df[~df.index.isin(indices)]

%timeit -n 10 df.iloc[df.index.drop(indices)]

%timeit -n 10 df.loc[df.index.drop(indices)]



4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Why does super slow loc outperform boolean_indexing?

Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given above data with 10 million rows):

%timeit -n 10 ~df.index.isin(indices)

%timeit -n 10 df.index.drop(indices)



4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

We can compare this to the performance of boolean_indexing vs iloc vs loc:

boolean_mask = ~df.index.isin(indices)

dropped_index = df.index.drop(indices)



%timeit -n 10 df[boolean_mask]

%timeit -n 10 df.iloc[dropped_index]

%timeit -n 10 df.loc[dropped_index]





489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

edited Nov 20 '18 at 15:54

answered Nov 20 '18 at 14:36

user3471881

1,1232619

Using iloc (or loc, see below) and Series.drop:

df = pd.DataFrame(np.arange(0, 1000000, 1))

indices = np.arange(0, 1000000, 3)



%timeit -n 100 df[~df.index.isin(indices)]

%timeit -n 100 df.iloc[df.index.drop(indices)]



41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

As @jezrael points out you can only use iloc if index is a RangeIndex otherwise you will have to use loc. But this is still faster than df[df.isin()] (see why below).

All three options on 10 million rows:

df = pd.DataFrame(np.arange(0, 10000000, 1))

indices = np.arange(0, 10000000, 3)



%timeit -n 10 df[~df.index.isin(indices)]

%timeit -n 10 df.iloc[df.index.drop(indices)]

%timeit -n 10 df.loc[df.index.drop(indices)]



4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Why does super slow loc outperform boolean_indexing?

Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given above data with 10 million rows):

%timeit -n 10 ~df.index.isin(indices)

%timeit -n 10 df.index.drop(indices)



4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

We can compare this to the performance of boolean_indexing vs iloc vs loc:

boolean_mask = ~df.index.isin(indices)

dropped_index = df.index.drop(indices)



%timeit -n 10 df[boolean_mask]

%timeit -n 10 df.iloc[dropped_index]

%timeit -n 10 df.loc[dropped_index]





489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

edited Nov 20 '18 at 15:54

answered Nov 20 '18 at 14:36

user3471881

1,1232619

edited Nov 20 '18 at 15:54

answered Nov 20 '18 at 14:36

user3471881

1,1232619

answered Nov 20 '18 at 14:36

user3471881

1,1232619

answered Nov 20 '18 at 14:36

user3471881

1,1232619

hmmm, iloc is possible use only for default RangeIndex, so not sure if possible use for OP solution.. :)
– jezrael
Nov 20 '18 at 14:47

@user3471881 I'll check out your solution later.
– Ichixgo
Nov 20 '18 at 14:54

1

updated with loc @jezrael. Thanks for pointing it out!
– user3471881
Nov 20 '18 at 15:09

added some extra context for why even loc outperforms boolean_indexing in this case.
– user3471881
Nov 20 '18 at 15:55

add a comment |

hmmm, iloc is possible use only for default RangeIndex, so not sure if possible use for OP solution.. :)
– jezrael
Nov 20 '18 at 14:47

@user3471881 I'll check out your solution later.
– Ichixgo
Nov 20 '18 at 14:54

1

updated with loc @jezrael. Thanks for pointing it out!
– user3471881
Nov 20 '18 at 15:09

added some extra context for why even loc outperforms boolean_indexing in this case.
– user3471881
Nov 20 '18 at 15:55

hmmm, iloc is possible use only for default RangeIndex, so not sure if possible use for OP solution.. :)
– jezrael
Nov 20 '18 at 14:47

@user3471881 I'll check out your solution later.
– Ichixgo
Nov 20 '18 at 14:54

updated with loc @jezrael. Thanks for pointing it out!
– user3471881
Nov 20 '18 at 15:09

added some extra context for why even loc outperforms boolean_indexing in this case.
– user3471881
Nov 20 '18 at 15:55

add a comment |

If order of rows doesn't mind, you can arrange them in place :

n=10**7

df=pd.DataFrame(arange(4*n).reshape(n,4))

indices=np.unique(randint(0,n,size=n//2))



from numba import njit

@njit

def _dropfew(values,indices):

    k=len(values)-1

    for ind in indices[::-1]:

            values[ind]=values[k]

            k-=1



def dropfew(df,indices):

    _dropfew(df.values,indices)

    return df.iloc[:len(df)-len(indices)]

Runs :

In [39]: %time df.iloc[df.index.drop(indices)]

Wall time: 1.07 s



In [40]: %time dropfew(df,indices)

Wall time: 219 ms

edited Nov 20 '18 at 16:50

answered Nov 20 '18 at 16:26

B. M.

12.9k11934

upvoted for speed but OP asked for solutions using only pandas as far as I understood, this requires numba which should at least be mentioned explicitly. (not only by importing the module :D)
– user3471881
Nov 20 '18 at 16:50

@user3471881: OP said preferably ;) . But that shuffle rows.
– B. M.
Nov 20 '18 at 16:59

add a comment |

If order of rows doesn't mind, you can arrange them in place :

n=10**7

df=pd.DataFrame(arange(4*n).reshape(n,4))

indices=np.unique(randint(0,n,size=n//2))



from numba import njit

@njit

def _dropfew(values,indices):

    k=len(values)-1

    for ind in indices[::-1]:

            values[ind]=values[k]

            k-=1



def dropfew(df,indices):

    _dropfew(df.values,indices)

    return df.iloc[:len(df)-len(indices)]

Runs :

In [39]: %time df.iloc[df.index.drop(indices)]

Wall time: 1.07 s



In [40]: %time dropfew(df,indices)

Wall time: 219 ms

edited Nov 20 '18 at 16:50

answered Nov 20 '18 at 16:26

B. M.

12.9k11934

upvoted for speed but OP asked for solutions using only pandas as far as I understood, this requires numba which should at least be mentioned explicitly. (not only by importing the module :D)
– user3471881
Nov 20 '18 at 16:50

@user3471881: OP said preferably ;) . But that shuffle rows.
– B. M.
Nov 20 '18 at 16:59

add a comment |

If order of rows doesn't mind, you can arrange them in place :

n=10**7

df=pd.DataFrame(arange(4*n).reshape(n,4))

indices=np.unique(randint(0,n,size=n//2))



from numba import njit

@njit

def _dropfew(values,indices):

    k=len(values)-1

    for ind in indices[::-1]:

            values[ind]=values[k]

            k-=1



def dropfew(df,indices):

    _dropfew(df.values,indices)

    return df.iloc[:len(df)-len(indices)]

Runs :

In [39]: %time df.iloc[df.index.drop(indices)]

Wall time: 1.07 s



In [40]: %time dropfew(df,indices)

Wall time: 219 ms

edited Nov 20 '18 at 16:50

answered Nov 20 '18 at 16:26

B. M.

12.9k11934

If order of rows doesn't mind, you can arrange them in place :

n=10**7

df=pd.DataFrame(arange(4*n).reshape(n,4))

indices=np.unique(randint(0,n,size=n//2))



from numba import njit

@njit

def _dropfew(values,indices):

    k=len(values)-1

    for ind in indices[::-1]:

            values[ind]=values[k]

            k-=1



def dropfew(df,indices):

    _dropfew(df.values,indices)

    return df.iloc[:len(df)-len(indices)]

Runs :

In [39]: %time df.iloc[df.index.drop(indices)]

Wall time: 1.07 s



In [40]: %time dropfew(df,indices)

Wall time: 219 ms

edited Nov 20 '18 at 16:50

answered Nov 20 '18 at 16:26

B. M.

12.9k11934

edited Nov 20 '18 at 16:50

answered Nov 20 '18 at 16:26

B. M.

12.9k11934

answered Nov 20 '18 at 16:26

B. M.

12.9k11934

answered Nov 20 '18 at 16:26

B. M.

12.9k11934

upvoted for speed but OP asked for solutions using only pandas as far as I understood, this requires numba which should at least be mentioned explicitly. (not only by importing the module :D)
– user3471881
Nov 20 '18 at 16:50

@user3471881: OP said preferably ;) . But that shuffle rows.
– B. M.
Nov 20 '18 at 16:59

add a comment |

upvoted for speed but OP asked for solutions using only pandas as far as I understood, this requires numba which should at least be mentioned explicitly. (not only by importing the module :D)
– user3471881
Nov 20 '18 at 16:50

@user3471881: OP said preferably ;) . But that shuffle rows.
– B. M.
Nov 20 '18 at 16:59

upvoted for speed but OP asked for solutions using only pandas as far as I understood, this requires numba which should at least be mentioned explicitly. (not only by importing the module :D)
– user3471881
Nov 20 '18 at 16:50

@user3471881: OP said preferably ;) . But that shuffle rows.
– B. M.
Nov 20 '18 at 16:59

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Argthtjtr

Fastest way to drop rows / get subset with difference from large DataFrame in Pandas

Question

Performance of proposed Solutions

Question

Performance of proposed Solutions

Question

Performance of proposed Solutions

Question

Performance of proposed Solutions

3 Answers
3

Your Answer

Post as a guest

3 Answers
3

3 Answers
3

Post as a guest

Popular posts from this blog

If I really need a card on my start hand, how many mulligans make sense? [duplicate]

Alcedinidae

Can an atomic nucleus contain both particles and antiparticles? [duplicate]

Fastest way to drop rows / get subset with difference from large DataFrame in Pandas

Question

Performance of proposed Solutions

Question

Performance of proposed Solutions

Question

Performance of proposed Solutions

Question

Performance of proposed Solutions

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

3 Answers 3

3 Answers 3

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

If I really need a card on my start hand, how many mulligans make sense? [duplicate]

Alcedinidae

Can an atomic nucleus contain both particles and antiparticles? [duplicate]

3 Answers
3

3 Answers
3

3 Answers
3