Pandas optimise datetime comparison on two columns




















How can I optimise the following operation:



df[(df.start <= x) & (df.end >= y)]


I tried using MultiIndex but saw no significant speedup.



df = df.set_index(['start', 'end'])
df[(df.index.get_level_values('start') <= x) & (df.index.get_level_values('end') >= y)]


Sample data:






    start                        end
0   2018-11-13 10:28:30.304287   2018-11-13 10:46:28.663868
1   2018-11-13 12:27:32.226550   2018-11-13 13:09:02.723869
2   2018-11-13 13:29:29.981659   2018-11-13 13:54:01.138963
3   2018-11-13 14:30:49.380554   2018-11-13 14:48:50.627830
4   2018-11-13 14:59:26.799017   2018-11-13 15:24:00.453983
5   2018-11-13 16:30:16.824188   2018-11-13 16:48:35.346318
6   2018-11-13 17:15:25.486287   2018-11-13 17:59:30.774629
7   2018-11-13 18:27:41.915379   2018-11-13 18:47:26.528320
8   2018-11-13 19:28:12.835576   2018-11-13 19:52:15.448146
9   2018-11-13 20:41:41.210849   2018-11-13 21:07:52.249831
10  2018-11-13 21:11:23.529623   2018-11-13 21:42:10.106951
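For reference, a rough way to rebuild the first few rows of this sample; the bounds x and y are illustrative assumptions, not values given in the question:

import pandas as pd

# rebuild a few rows of the sample data shown above
df = pd.DataFrame({
    'start': ['2018-11-13 10:28:30.304287',
              '2018-11-13 12:27:32.226550',
              '2018-11-13 13:29:29.981659'],
    'end':   ['2018-11-13 10:46:28.663868',
              '2018-11-13 13:09:02.723869',
              '2018-11-13 13:54:01.138963'],
})
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime)

# x and y are assumed to be Timestamps, e.g.
x = pd.Timestamp('2018-11-13 13:00:00')
y = pd.Timestamp('2018-11-13 12:30:00')
print(df[(df.start <= x) & (df.end >= y)])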












python pandas performance dataframe

asked Nov 23 '18 at 11:49 by Lokesh · edited Nov 23 '18 at 12:01

























  • Try df.query()

    – Srce Cde
    Nov 23 '18 at 11:50
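    A minimal sketch of that df.query() suggestion, assuming x and y are Timestamps (as the next comment notes, it turned out slower in practice):

    result = df.query('start <= @x and end >= @y')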











  • @Chirag - it is slower :(

    – jezrael
    Nov 23 '18 at 11:50











  • I could easily have solved the problem if there were only one column, by making that column a DatetimeIndex.

    – Lokesh
    Nov 23 '18 at 11:51
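    A minimal sketch of that one-column approach, assuming the datetime column is sorted and used as the index (x is an assumed Timestamp bound):

    # one datetime column: sort it, make it the index, then slice by label
    s = df.set_index('start').sort_index()
    subset = s.loc[:x]   # all rows whose start is <= x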











  • Can you add some sample data?

    – jezrael
    Nov 23 '18 at 11:52













  • If I faced this situation I would split the condition: first apply (df.start <= x) and save the result, then apply the next condition (df.end >= y) on the now-smaller frame (see the sketch below). I may be wrong; let's wait for the experts' answers.

    – Mohamed Thasin ah
    Nov 23 '18 at 11:54
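    An illustrative sketch of that two-step idea; it still scans both columns, so the total work is similar, but the second comparison runs on a reduced frame:

    tmp = df[df.start <= x]       # first condition shrinks the frame
    result = tmp[tmp.end >= y]    # second condition on the smaller frame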




















1 Answer
































The bottleneck is construction of the Boolean series / array used for indexing.



Dropping down to NumPy seems to give a reasonable (~2x) performance improvement. See related: pd.Timestamp versus np.datetime64: are they interchangeable for selected uses?



# boundaries for testing
mindt = pd.to_datetime('2016-01-01')
maxdt = pd.to_datetime('2017-01-01')

x = ((df['start'] <= mindt) & (df['end'] >= maxdt)).values
y = (df['start'].values <= mindt.to_datetime64()) & (df['end'].values >= maxdt.to_datetime64())

# check results are the same
assert np.array_equal(x, y)

%timeit (df['start'].values <= mindt.to_datetime64()) & (df['end'].values >= maxdt.to_datetime64())
# 55.6 ms per loop

%timeit (df['start'] <= mindt) & (df['end'] >= maxdt)
# 108 ms per loop


Setup



import numpy as np
import pandas as pd

np.random.seed(0)

def random_dates(start, end, n):
    # draw random Unix timestamps between the two bounds
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    cols = ['start', 'end']
    df = pd.DataFrame({col: pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
                       for col in cols})
    # sort each row so that start <= end
    df = pd.DataFrame(np.sort(df.values, axis=1), columns=cols)
    df[cols] = df[cols].apply(pd.to_datetime, errors='raise')
    return df

# construct a dataframe of random dates
df = random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-01-01'), 10**7)





answered Nov 23 '18 at 12:20 by jpp · edited Nov 23 '18 at 12:31


























  • Great improvement. But I think the main problem is that both start and end are unindexed, so the comparison takes O(n) time. If we could somehow index both of them it would be a lot faster. What do you think?

    – Lokesh
    Nov 23 '18 at 12:29













  • @Lokesh, O(n) is unavoidable, you have to iterate every value in both series. Just doing it in NumPy is more efficient. You can get faster too with numba if this isn't sufficient.

    – jpp
    Nov 23 '18 at 12:30
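    A hedged sketch of that numba suggestion, viewing the datetime64[ns] data as int64 nanoseconds so the loop compiles in nopython mode; the function name is illustrative and the bounds reuse mindt/maxdt from the answer above:

    import numba
    import numpy as np

    @numba.njit
    def interval_mask(starts_i8, ends_i8, start_cut_i8, end_cut_i8):
        # element-wise (start <= start_cut) and (end >= end_cut) in a compiled loop
        out = np.empty(starts_i8.shape[0], dtype=np.bool_)
        for i in range(starts_i8.shape[0]):
            out[i] = starts_i8[i] <= start_cut_i8 and ends_i8[i] >= end_cut_i8
        return out

    mask = interval_mask(df['start'].values.view('i8'),
                         df['end'].values.view('i8'),
                         mindt.value, maxdt.value)
    result = df[mask]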













  • Are you sure O(n) is unavoidable because I saw considerable improvement after I moved a column to DateTimeIndex and then queried it using df[date_time_object:]? Earlier I was doing df[df.time > date_time_object] and it was painfully slow.

    – Lokesh
    Nov 23 '18 at 12:32













  • @Lokesh, Yes, I'm sure. Think about it theoretically. To compare every value in an array of size n to a scalar, you must iterate every value in n once. Don't derive complexity from performance. They aren't the same. [O(1) can be slower than O(n) in specific instances too, which is another problem with this logic.]

    – jpp
    Nov 23 '18 at 12:35













  • But if we use binary search, the comparison would reduce to O(log n), which is what I suspect DatetimeIndex does, though I'm not sure.

    – Lokesh
    Nov 23 '18 at 12:36
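    An illustrative sketch of the binary-search idea: it only pays off when the searched column is sorted. With a sorted start column, np.searchsorted locates the cut point in O(log n); the end condition still needs a scan unless end is sorted as well. The bounds reuse mindt/maxdt from the answer:

    import numpy as np

    starts = df['start'].values                       # assumed sorted ascending
    pos = np.searchsorted(starts, mindt.to_datetime64(), side='right')
    candidates = df.iloc[:pos]                        # rows with start <= mindt
    result = candidates[candidates['end'] >= maxdt]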













