Pandas Dataframe selecting groups with minimal cardinality

I have a problem where I need to take groups of rows from a data frame where the number of items in a group exceeds a certain number (cutoff). For those groups, I need to take some head rows and the tail row.

I am using the code below

train = train[train.groupby('id').id.transform(len) > headRows]

groups = pd.concat([train.groupby('id').head(headRows),train.groupby('id').tail(1)]).sort_index()

This works. But the first line, it is very slow :(. 30 minutes or more.

Is there any way to make the first line faster ? If I do not use the first line, there are duplicate indices from the result of the second line, which messes up things.

Thanks in advance
Regards

Note:
My train data frame has around 70,000 groups of varying group size over around 700,000 rows . It actually follows from my other question as can be seen here Data processing with adding columns dynamically in Python Pandas Dataframe.
Jeff gave a great answer there, but it fails if the group size is less or equal to parameter I pass in head(parameter) when concatenating my rows as in Jeffs answer : In [31]: groups = concat.....

edited May 23 '17 at 10:28

Community♦

asked May 18 '14 at 6:28

Run2

9721226

add a comment |

I am using the code below

train = train[train.groupby('id').id.transform(len) > headRows]

groups = pd.concat([train.groupby('id').head(headRows),train.groupby('id').tail(1)]).sort_index()

This works. But the first line, it is very slow :(. 30 minutes or more.

Is there any way to make the first line faster ? If I do not use the first line, there are duplicate indices from the result of the second line, which messes up things.

Thanks in advance
Regards

edited May 23 '17 at 10:28

Community♦

asked May 18 '14 at 6:28

Run2

9721226

add a comment |

I am using the code below

train = train[train.groupby('id').id.transform(len) > headRows]

groups = pd.concat([train.groupby('id').head(headRows),train.groupby('id').tail(1)]).sort_index()

This works. But the first line, it is very slow :(. 30 minutes or more.

Is there any way to make the first line faster ? If I do not use the first line, there are duplicate indices from the result of the second line, which messes up things.

Thanks in advance
Regards

edited May 23 '17 at 10:28

Community♦

asked May 18 '14 at 6:28

Run2

9721226

I am using the code below

train = train[train.groupby('id').id.transform(len) > headRows]

groups = pd.concat([train.groupby('id').head(headRows),train.groupby('id').tail(1)]).sort_index()

This works. But the first line, it is very slow :(. 30 minutes or more.

Is there any way to make the first line faster ? If I do not use the first line, there are duplicate indices from the result of the second line, which messes up things.

Thanks in advance
Regards

python pandas dataframe data-processing

edited May 23 '17 at 10:28

Community♦

asked May 18 '14 at 6:28

Run2

9721226

edited May 23 '17 at 10:28

Community♦

asked May 18 '14 at 6:28

Run2

9721226

edited May 23 '17 at 10:28

Community♦

edited May 23 '17 at 10:28

Community♦

edited May 23 '17 at 10:28

Community♦

asked May 18 '14 at 6:28

Run2

9721226

asked May 18 '14 at 6:28

Run2

9721226

asked May 18 '14 at 6:28

Run2

9721226

add a comment |

1 Answer
1

active

oldest

votes

Use groupby/filter:

>>> df.groupby('id').filter(lambda x: len(x) > cutoff)

This will just return the rows of your dataframe where the size of the group is greater than your cutoff. Also, it should perform quite a bit better. I timed filter here with a dataframe with 30,039 'id' groups and a little over 4 million observations:

In [9]: %timeit df.groupby('id').filter(lambda x: len(x) > 12)

1 loops, best of 3: 12.6 s per loop

edited May 18 '14 at 7:07

answered May 18 '14 at 6:51

Karl D.

7,0512827

Hey thanks Karl. Let me evaluate that. I will get back. Thanks
– Run2
May 18 '14 at 6:54

1

Hey Karl, worked. 4 mins now. Thanks.
– Run2
May 18 '14 at 7:04

@Karl D If u your take this question (and the original) and make a combined entry for the cookbook would be great (maybe add a little explanation of the problem). thanks
– Jeff
May 18 '14 at 10:57

Yeah, I can do that @Jeff.
– Karl D.
May 18 '14 at 18:35

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f23719203%2fpandas-dataframe-selecting-groups-with-minimal-cardinality%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Use groupby/filter:

>>> df.groupby('id').filter(lambda x: len(x) > cutoff)

In [9]: %timeit df.groupby('id').filter(lambda x: len(x) > 12)

1 loops, best of 3: 12.6 s per loop

edited May 18 '14 at 7:07

answered May 18 '14 at 6:51

Karl D.

7,0512827

Hey thanks Karl. Let me evaluate that. I will get back. Thanks
– Run2
May 18 '14 at 6:54

1

Hey Karl, worked. 4 mins now. Thanks.
– Run2
May 18 '14 at 7:04

@Karl D If u your take this question (and the original) and make a combined entry for the cookbook would be great (maybe add a little explanation of the problem). thanks
– Jeff
May 18 '14 at 10:57

Yeah, I can do that @Jeff.
– Karl D.
May 18 '14 at 18:35

add a comment |

Use groupby/filter:

>>> df.groupby('id').filter(lambda x: len(x) > cutoff)

In [9]: %timeit df.groupby('id').filter(lambda x: len(x) > 12)

1 loops, best of 3: 12.6 s per loop

edited May 18 '14 at 7:07

answered May 18 '14 at 6:51

Karl D.

7,0512827

Hey thanks Karl. Let me evaluate that. I will get back. Thanks
– Run2
May 18 '14 at 6:54

1

Hey Karl, worked. 4 mins now. Thanks.
– Run2
May 18 '14 at 7:04

@Karl D If u your take this question (and the original) and make a combined entry for the cookbook would be great (maybe add a little explanation of the problem). thanks
– Jeff
May 18 '14 at 10:57

Yeah, I can do that @Jeff.
– Karl D.
May 18 '14 at 18:35

add a comment |

Use groupby/filter:

>>> df.groupby('id').filter(lambda x: len(x) > cutoff)

In [9]: %timeit df.groupby('id').filter(lambda x: len(x) > 12)

1 loops, best of 3: 12.6 s per loop

edited May 18 '14 at 7:07

answered May 18 '14 at 6:51

Karl D.

7,0512827

Use groupby/filter:

>>> df.groupby('id').filter(lambda x: len(x) > cutoff)

In [9]: %timeit df.groupby('id').filter(lambda x: len(x) > 12)

1 loops, best of 3: 12.6 s per loop

edited May 18 '14 at 7:07

answered May 18 '14 at 6:51

Karl D.

7,0512827

edited May 18 '14 at 7:07

answered May 18 '14 at 6:51

Karl D.

7,0512827

answered May 18 '14 at 6:51

Karl D.

7,0512827

answered May 18 '14 at 6:51

Karl D.

7,0512827

Hey thanks Karl. Let me evaluate that. I will get back. Thanks
– Run2
May 18 '14 at 6:54

1

Hey Karl, worked. 4 mins now. Thanks.
– Run2
May 18 '14 at 7:04

@Karl D If u your take this question (and the original) and make a combined entry for the cookbook would be great (maybe add a little explanation of the problem). thanks
– Jeff
May 18 '14 at 10:57

Yeah, I can do that @Jeff.
– Karl D.
May 18 '14 at 18:35

add a comment |

Hey thanks Karl. Let me evaluate that. I will get back. Thanks
– Run2
May 18 '14 at 6:54

1

Hey Karl, worked. 4 mins now. Thanks.
– Run2
May 18 '14 at 7:04

@Karl D If u your take this question (and the original) and make a combined entry for the cookbook would be great (maybe add a little explanation of the problem). thanks
– Jeff
May 18 '14 at 10:57

Yeah, I can do that @Jeff.
– Karl D.
May 18 '14 at 18:35

Hey thanks Karl. Let me evaluate that. I will get back. Thanks
– Run2
May 18 '14 at 6:54

Hey Karl, worked. 4 mins now. Thanks.
– Run2
May 18 '14 at 7:04

@Karl D If u your take this question (and the original) and make a combined entry for the cookbook would be great (maybe add a little explanation of the problem). thanks
– Jeff
May 18 '14 at 10:57

Yeah, I can do that @Jeff.
– Karl D.
May 18 '14 at 18:35

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Argthtjtr