Fastest way to drop rows / get subset with difference from large DataFrame in Pandas
Question
I'm looking for the fastest way to drop a set of rows whose indices I already have, or equivalently to select the complementary subset of those indices (which results in the same dataset), from a large Pandas DataFrame.
So far I have two solutions, which seem relatively slow to me:
df.loc[df.index.difference(indices)]
which takes ~115 sec on my dataset
df.drop(indices)
which takes ~215 sec on my dataset
Is there a faster way to do this? Preferably in Pandas.
Performance of proposed solutions
- ~41 sec: df[~df.index.isin(indices)] (by @jezrael)
A sketch for reproducing this comparison on synthetic data is shown below.
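A minimal reproduction sketch (the frame size, columns, and the construction of indices are assumptions, not the actual dataset) comparing the approaches discussed in this thread:

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data; sizes are assumptions.
n = 1_000_000
df = pd.DataFrame({"a": np.arange(n), "b": np.random.rand(n)})
indices = pd.Index(np.arange(0, n, 3))  # drop every third row

subset_via_difference = df.loc[df.index.difference(indices)]  # the OP's first approach
subset_via_drop = df.drop(indices)                            # the OP's second approach
subset_via_mask = df[~df.index.isin(indices)]                 # the proposed faster approach

assert len(subset_via_mask) == n - len(indices)

In IPython, each expression can be wrapped in %timeit to compare the approaches; the absolute numbers will differ from those quoted above, but the relative ordering should hold.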
Tags: python, pandas, dataframe
How is df[~df.index.isin(indices)] performing?
– jezrael, Nov 20 '18 at 13:54
gimme a sec. running atm
– Ichixgo, Nov 20 '18 at 13:55
uuuh nice :D 41 sec ^^. nice speed up so far.
– Ichixgo, Nov 20 '18 at 13:56
@jezrael I agree that your solution is the fastest, but one has to consider how many indices you want to remove. For example, if indices2remove is bigger than indices2keep, then df[df.index.isin(indices2keep)] will be faster than df[~df.index.isin(indices2remove)].
– user32185, Nov 20 '18 at 14:21
@user32185 - Thanks, I'll try to add it to the answer.
– jezrael, Nov 20 '18 at 14:29
Asked Nov 20 '18 at 13:52 by Ichixgo (edited Nov 20 '18 at 14:00).
3 Answers
Answer by jezrael (Nov 20 '18 at 13:57, edited Nov 20 '18 at 14:28):
I believe you can create a boolean mask, invert it with ~, and filter by boolean indexing:
df1 = df[~df.index.isin(indices)]
As @user3471881 mentioned, to avoid chained indexing it is necessary to add copy if you are planning to manipulate the filtered df later:
df1 = df[~df.index.isin(indices)].copy()
This filtering depends on the number of matched indices and also on the length of the DataFrame. So another possible solution is to create an array/list of the indices to keep; then no inverting is necessary:
df1 = df[df.index.isin(need_indices)]
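As a rough illustration of the point raised by @user32185 in the comments above (the names indices2remove and indices2keep are hypothetical), the two complementary forms look like this; which one is faster depends on which label set is smaller:

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1_000_000)})

# Hypothetical split of the index into labels to drop and labels to keep.
indices2remove = df.index[::4]                       # 25% of the rows
indices2keep = df.index.difference(indices2remove)   # the remaining 75%

# Equivalent results; the variant built from the smaller label set tends to be faster.
dropped = df[~df.index.isin(indices2remove)].copy()
kept = df[df.index.isin(indices2keep)].copy()

assert dropped.equals(kept)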
Note that if you save this in a new variable you will have to copy the filtered df if you are planning to manipulate it later. Don't know how this will affect performance, but worth noting.
– user3471881, Nov 20 '18 at 14:00
Answer by user3471881 (Nov 20 '18 at 14:36, edited Nov 20 '18 at 15:54):
Using iloc (or loc, see below) and Index.drop:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 1000000, 1))
indices = np.arange(0, 1000000, 3)
%timeit -n 100 df[~df.index.isin(indices)]
%timeit -n 100 df.iloc[df.index.drop(indices)]
41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
As @jezrael points out, you can only use iloc if the index is a RangeIndex; otherwise you will have to use loc. But this is still faster than df[~df.index.isin(indices)] (see why below).
All three options on 10 million rows:
df = pd.DataFrame(np.arange(0, 10000000, 1))
indices = np.arange(0, 10000000, 3)
%timeit -n 10 df[~df.index.isin(indices)]
%timeit -n 10 df.iloc[df.index.drop(indices)]
%timeit -n 10 df.loc[df.index.drop(indices)]
4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why does the supposedly slow loc outperform boolean indexing?
Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given the above data with 10 million rows):
%timeit -n 10 ~df.index.isin(indices)
%timeit -n 10 df.index.drop(indices)
4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can compare this to the performance of boolean indexing vs iloc vs loc once the mask and the reduced index are precomputed:
boolean_mask = ~df.index.isin(indices)
dropped_index = df.index.drop(indices)
%timeit -n 10 df[boolean_mask]
%timeit -n 10 df.iloc[dropped_index]
%timeit -n 10 df.loc[dropped_index]
489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
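To make the RangeIndex caveat concrete, a small sketch (the string index is an invented example): with a non-default index, Index.drop returns labels, so they have to be passed to loc rather than iloc:

import numpy as np
import pandas as pd

# DataFrame with a non-default (string) index -- an invented example.
df2 = pd.DataFrame({"val": np.arange(5)}, index=list("abcde"))
to_drop = pd.Index(["b", "d"])

remaining_labels = df2.index.drop(to_drop)   # Index(['a', 'c', 'e'], dtype='object')

kept = df2.loc[remaining_labels]             # works: loc selects by label
# df2.iloc[remaining_labels]                 # fails: iloc expects integer positions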
Hmmm, iloc can only be used with a default RangeIndex, so I'm not sure it can be used for the OP's case. :)
– jezrael, Nov 20 '18 at 14:47
@user3471881 I'll check out your solution later.
– Ichixgo, Nov 20 '18 at 14:54
Updated with loc, @jezrael. Thanks for pointing it out!
– user3471881, Nov 20 '18 at 15:09
Added some extra context for why even loc outperforms boolean indexing in this case.
– user3471881, Nov 20 '18 at 15:55
Answer by B. M. (Nov 20 '18 at 16:26, edited Nov 20 '18 at 16:50):
If the order of the rows doesn't matter, you can rearrange them in place:
import numpy as np
import pandas as pd
from numba import njit

n = 10**7
df = pd.DataFrame(np.arange(4 * n).reshape(n, 4))
indices = np.unique(np.random.randint(0, n, size=n // 2))

@njit
def _dropfew(values, indices):
    # Overwrite each row to drop with a row taken from the end of the array.
    k = len(values) - 1
    for ind in indices[::-1]:
        values[ind] = values[k]
        k -= 1

def dropfew(df, indices):
    _dropfew(df.values, indices)
    return df.iloc[:len(df) - len(indices)]

Runs:
In [39]: %time df.iloc[df.index.drop(indices)]
Wall time: 1.07 s
In [40]: %time dropfew(df, indices)
Wall time: 219 ms
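For reference, a pure NumPy sketch of the same swap-and-truncate idea without numba, under the assumption that indices is sorted and unique (as np.unique guarantees above); like the answer's version, it does not preserve row order:

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame(np.arange(4 * n).reshape(n, 4))
indices = np.unique(np.random.randint(0, n, size=n // 10))
m = len(indices)

values = df.values                          # typically a view for a single-dtype frame
tail = np.arange(n - m, n)
keep_tail = np.setdiff1d(tail, indices)     # surviving rows that live in the tail
fill_slots = indices[indices < n - m]       # dropped slots inside the kept region

values[fill_slots] = values[keep_tail]      # move the survivors into the holes
result = df.iloc[:n - m]                    # cut off the now-redundant tail
assert len(result) == n - m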
Upvoted for speed, but the OP asked for solutions using only pandas as far as I understood; this requires numba, which should at least be mentioned explicitly (not only by importing the module :D).
– user3471881, Nov 20 '18 at 16:50
@user3471881: OP said preferably ;). But that shuffles the rows.
– B. M., Nov 20 '18 at 16:59