Pandas optimise datetime comparison on two columns
How can I optimise the following operation, where x and y are datetime bounds?

df[(df.start <= x) & (df.end >= y)]

I tried using a MultiIndex but saw no significant speedup:

df = df.set_index(['start', 'end'])
df[(df.index.get_level_values('start') <= x) & (df.index.get_level_values('end') >= y)]


Sample data:






                         start                         end
0   2018-11-13 10:28:30.304287  2018-11-13 10:46:28.663868
1   2018-11-13 12:27:32.226550  2018-11-13 13:09:02.723869
2   2018-11-13 13:29:29.981659  2018-11-13 13:54:01.138963
3   2018-11-13 14:30:49.380554  2018-11-13 14:48:50.627830
4   2018-11-13 14:59:26.799017  2018-11-13 15:24:00.453983
5   2018-11-13 16:30:16.824188  2018-11-13 16:48:35.346318
6   2018-11-13 17:15:25.486287  2018-11-13 17:59:30.774629
7   2018-11-13 18:27:41.915379  2018-11-13 18:47:26.528320
8   2018-11-13 19:28:12.835576  2018-11-13 19:52:15.448146
9   2018-11-13 20:41:41.210849  2018-11-13 21:07:52.249831
10  2018-11-13 21:11:23.529623  2018-11-13 21:42:10.106951

python pandas performance dataframe

  • Try df.query()

    – Srce Cde
    Nov 23 '18 at 11:50

  • @Chirag - it is slower :(

    – jezrael
    Nov 23 '18 at 11:50

  • I would have easily solved the problem if there were only one column, by making that column the DatetimeIndex.

    – Lokesh
    Nov 23 '18 at 11:51

  • Can you add some sample data?

    – jezrael
    Nov 23 '18 at 11:52

  • If I faced this situation I would split the condition: first apply (df.start <= x), save the result back to df, then apply the next condition (df.end >= y) on the now smaller frame. Maybe I'm wrong; let's wait for the experts' answers.

    – Mohamed Thasin ah
    Nov 23 '18 at 11:54
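
For reference, a minimal sketch of the df.query() variant suggested in the first comment (it was reported as slower in this case); x and y are the same datetime bounds as above:

# the @ prefix lets query() reference local Python variables;
# this is equivalent to df[(df.start <= x) & (df.end >= y)]
result = df.query('start <= @x and end >= @y')
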
1 Answer
The bottleneck is construction of the Boolean series / array used for indexing.



Dropping down to NumPy seems to give a reasonable (~2x) performance improvement. See related: pd.Timestamp versus np.datetime64: are they interchangeable for selected uses?



# boundaries for testing
mindt = pd.to_datetime('2016-01-01')
maxdt = pd.to_datetime('2017-01-01')

x = ((df['start'] <= mindt) & (df['end'] >= maxdt)).values
y = (df['start'].values <= mindt.to_datetime64()) & (df['end'].values >= maxdt.to_datetime64())

# check results are the same
assert np.array_equal(x, y)

%timeit (df['start'].values <= mindt.to_datetime64()) & (df['end'].values >= maxdt.to_datetime64())
# 55.6 ms per loop

%timeit (df['start'] <= mindt) & (df['end'] >= maxdt)
# 108 ms per loop
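
For completeness (not part of the original answer), the mask built with the NumPy version can be passed straight back to the DataFrame indexer:

# equivalent to df[(df['start'] <= mindt) & (df['end'] >= maxdt)]
result = df[y]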


Setup



import numpy as np
import pandas as pd

np.random.seed(0)

def random_dates(start, end, n):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    cols = ['start', 'end']
    df = pd.DataFrame({col: pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s') for col in cols})
    df = pd.DataFrame(np.sort(df.values, axis=1), columns=cols)
    df[cols] = df[cols].apply(pd.to_datetime, errors='raise')
    return df

# construct a dataframe of random dates
df = random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-01-01'), 10**7)
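
The comments below mention that numba can push this further. A rough sketch of that idea, assuming numba is installed; the function name and the int64-view trick are illustrative choices, not part of the original answer:

import numpy as np
import numba

@numba.njit
def mask_between(starts, ends, start_max, end_min):
    # starts/ends: int64 nanosecond timestamps; start_max/end_min: int64 scalar bounds
    out = np.empty(starts.shape[0], dtype=np.bool_)
    for i in range(starts.shape[0]):
        out[i] = starts[i] <= start_max and ends[i] >= end_min
    return out

# datetime64[ns] arrays viewed as int64; Timestamp.value is the same
# nanosecond integer representation
mask = mask_between(df['start'].values.view('int64'),
                    df['end'].values.view('int64'),
                    mindt.value, maxdt.value)
result = df[mask]
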
  • Great improvement. But I think the main problem is that both start and end are unindexed, so the lookup takes O(n) time. If somehow we could index both of them it would be a lot faster. What do you think?

    – Lokesh
    Nov 23 '18 at 12:29

  • @Lokesh, O(n) is unavoidable: you have to iterate every value in both series. Doing it in NumPy is simply more efficient. You can get faster still with numba if this isn't sufficient.

    – jpp
    Nov 23 '18 at 12:30

  • Are you sure O(n) is unavoidable? I saw considerable improvement after I moved a column to a DatetimeIndex and then queried it using df[date_time_object:]. Earlier I was doing df[df.time > date_time_object] and it was painfully slow.

    – Lokesh
    Nov 23 '18 at 12:32

  • @Lokesh, yes, I'm sure. Think about it theoretically: to compare every value in an array of size n to a scalar, you must visit every one of the n values once. Don't derive complexity from performance; they aren't the same. [O(1) can be slower than O(n) in specific instances, which is another problem with this logic.]

    – jpp
    Nov 23 '18 at 12:35

  • But if we use binary search, the comparison would reduce to O(log n), and this is what I suspect DatetimeIndex does, though I'm not sure.

    – Lokesh
    Nov 23 '18 at 12:36
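
A rough sketch of the binary-search idea from the last two comments, assuming df is sorted by start (as the sample data is) and that x and y are the question's Timestamp bounds; searchsorted narrows the candidate rows in O(log n), though the end check over the remaining rows is still linear:

import numpy as np

starts = df['start'].values

# index of the first row with start > x (binary search, O(log n))
upper = np.searchsorted(starts, x.to_datetime64(), side='right')

# only the first `upper` rows can satisfy start <= x;
# the end >= y check now scans a (possibly much) smaller slice
candidates = df.iloc[:upper]
result = candidates[candidates['end'].values >= y.to_datetime64()]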