Pandas optimise datetime comparison on two columns
How can I optimise the following operation:
df[(df.start <= x) & (df.end >= y)]
I tried using a MultiIndex but saw no significant speedup:
df = df.set_index(['start', 'end'])
df[(df.index.get_level_values('start') <= x) & (df.index.get_level_values('end') >= y)]
Sample data:
'<table border="1" class="dataframe">n <thead>n <tr style="text-align: right;">n <th></th>n <th>start</th>n <th>end</th>n </tr>n </thead>n <tbody>n <tr>n <th>0</th>n <td>2018-11-13 10:28:30.304287</td>n <td>2018-11-13 10:46:28.663868</td>n </tr>n <tr>n <th>1</th>n <td>2018-11-13 12:27:32.226550</td>n <td>2018-11-13 13:09:02.723869</td>n </tr>n <tr>n <th>2</th>n <td>2018-11-13 13:29:29.981659</td>n <td>2018-11-13 13:54:01.138963</td>n </tr>n <tr>n <th>3</th>n <td>2018-11-13 14:30:49.380554</td>n <td>2018-11-13 14:48:50.627830</td>n </tr>n <tr>n <th>4</th>n <td>2018-11-13 14:59:26.799017</td>n <td>2018-11-13 15:24:00.453983</td>n </tr>n <tr>n <th>5</th>n <td>2018-11-13 16:30:16.824188</td>n <td>2018-11-13 16:48:35.346318</td>n </tr>n <tr>n <th>6</th>n <td>2018-11-13 17:15:25.486287</td>n <td>2018-11-13 17:59:30.774629</td>n </tr>n <tr>n <th>7</th>n <td>2018-11-13 18:27:41.915379</td>n <td>2018-11-13 18:47:26.528320</td>n </tr>n <tr>n <th>8</th>n <td>2018-11-13 19:28:12.835576</td>n <td>2018-11-13 19:52:15.448146</td>n </tr>n <tr>n <th>9</th>n <td>2018-11-13 20:41:41.210849</td>n <td>2018-11-13 21:07:52.249831</td>n </tr>n <tr>n <th>10</th>n <td>2018-11-13 21:11:23.529623</td>n <td>2018-11-13 21:42:10.106951</td>n </tr>n </tbody>n</table>'
python pandas performance dataframe
asked Nov 23 '18 at 11:49 by Lokesh, edited Nov 23 '18 at 12:01
Try df.query()
– Srce Cde
Nov 23 '18 at 11:50
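For reference, a minimal sketch of the query() version of the same filter (the @-prefix pulls x and y from local scope); per jezrael's timing below, it is unlikely to be faster here:
# the same filter expressed via query(); @x / @y reference local variables
out = df.query('start <= @x and end >= @y')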
@Chirag - it is slower :(
– jezrael
Nov 23 '18 at 11:50
I would have easily solved the problem if there were only 1 column, by making that column a DateTimeIndex.
– Lokesh
Nov 23 '18 at 11:51
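A sketch of the single-column approach Lokesh describes, assuming the column is sorted first; label slicing on a DatetimeIndex can then avoid a full scan:
# single-column case: set a sorted DatetimeIndex, then slice by label
df1 = df.set_index('start').sort_index()
subset = df1.loc[:x]  # all rows with start <= x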
Can you add some sample data?
– jezrael
Nov 23 '18 at 11:52
If I face this situation I will split this condition: first apply (df.start <= x), save this back to the same df, and then apply the next condition (df.end >= y) (now the df size is reduced). Maybe I'm wrong, let's wait for the experts' answers.
– Mohamed Thasin ah
Nov 23 '18 at 11:54
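A sketch of the two-step filtering Mohamed describes; note the first condition still scans the whole start column, so in the worst case nothing is saved:
# filter on start first, then apply the end condition to the smaller frame
tmp = df[df.start <= x]
out = tmp[tmp.end >= y]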
1 Answer
The bottleneck is construction of the Boolean series / array used for indexing.
Dropping down to NumPy seems to give a reasonable (~2x) performance improvement. See related: pd.Timestamp versus np.datetime64: are they interchangeable for selected uses?
# boundaries for testing
mindt = pd.to_datetime('2016-01-01')
maxdt = pd.to_datetime('2017-01-01')
x = ((df['start'] <= mindt) & (df['end'] >= maxdt)).values
y = (df['start'].values <= mindt.to_datetime64()) & (df['end'].values >= maxdt.to_datetime64())
# check results are the same
assert np.array_equal(x, y)
%timeit (df['start'].values <= mindt.to_datetime64()) & (df['end'].values >= maxdt.to_datetime64())
# 55.6 ms per loop
%timeit (df['start'] <= mindt) & (df['end'] >= maxdt)
# 108 ms per loop
Setup
import numpy as np
import pandas as pd

np.random.seed(0)
def random_dates(start, end, n):
start_u = start.value//10**9
end_u = end.value//10**9
cols = ['start', 'end']
df = pd.DataFrame({col: pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s') for col in cols})
df = pd.DataFrame(np.sort(df.values, axis=1), columns=cols)
df[cols] = df[cols].apply(pd.to_datetime, errors='raise')
return df
# construct a dataframe of random dates
df = random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-01-01'), 10**7)
Great improvement. But I think the main problem is caused by the fact that both start and end are unindexed and take O(n) time. If somehow we could index both of them it would be a lot faster. What do you think?
– Lokesh
Nov 23 '18 at 12:29
@Lokesh, O(n) is unavoidable, you have to iterate every value in both series. Just doing it in NumPy is more efficient. You can get faster still with numba if this isn't sufficient.
– jpp
Nov 23 '18 at 12:30
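A minimal sketch of the numba route jpp mentions, assuming the datetime64 columns are viewed as int64 nanoseconds so the jitted loop compares plain integers (between_mask is a made-up name, not from the answer):
import numba
import numpy as np

@numba.njit
def between_mask(starts, ends, x, y):
    # one fused pass instead of two vectorised comparisons plus an &
    out = np.empty(starts.shape[0], dtype=np.bool_)
    for i in range(starts.shape[0]):
        out[i] = starts[i] <= x and ends[i] >= y
    return out

mask = between_mask(df['start'].values.view('i8'),
                    df['end'].values.view('i8'),
                    mindt.value, maxdt.value)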
Are you sure O(n) is unavoidable? Because I saw considerable improvement after I moved a column to a DateTimeIndex and then queried it using df[date_time_object:]. Earlier I was doing df[df.time > date_time_object] and it was painfully slow.
– Lokesh
Nov 23 '18 at 12:32
@Lokesh, Yes, I'm sure. Think about it theoretically. To compare every value in an array of size n to a scalar, you must iterate every value in n once. Don't derive complexity from performance. They aren't the same. [O(1) can be slower than O(n) in specific instances too, which is another problem with this logic.]
– jpp
Nov 23 '18 at 12:35
But if we use binary search, the comparison would reduce to O(log n), and this is what I suspect DateTimeIndex does, but I'm not sure.
– Lokesh
Nov 23 '18 at 12:36
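A sketch of the binary-search idea, assuming start is sorted (as in the sample data): searchsorted finds the start <= x cut in O(log n), though the end condition still needs a linear pass over the remaining candidates:
# binary search on the sorted start column, then filter end linearly
i = df['start'].searchsorted(x, side='right')  # first position with start > x
cand = df.iloc[:i]                             # rows with start <= x
out = cand[cand.end >= y]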