Result number for Boolean queries with Apache Lucene
When benchmarking Apache Lucene v7.5 I noticed a strange behavior:
I indexed the English Wikipedia dump (5,677,776 docs) using Lucene with the SimpleAnalyzer (no stop words, no stemming).
Then I searched the index with the following queries:
the totalHits=5,382,873
who totalHits=1,687,254
the who totalHits=5,411,305
"the who" totalHits=8,827
The hit count for the two-term Boolean query (the who) is larger than the count for either single-term query (the, who), although I expected it to be smaller than both.
Is there an explanation for that?
Code snippet:
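import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.search.*;

// `index` (the Directory) and `querystr` (the query string) are defined elsewhere
Analyzer analyzer = new SimpleAnalyzer();
MultiFieldQueryParser parser = new MultiFieldQueryParser(new String[]{"title", "content", "domain", "url"}, analyzer);
// Parse (no default operator is set here, so the parser's default applies)
Query q = parser.parse(querystr);
// top-10 results
int hitsPerPage = 10;
IndexReader indexReader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(indexReader);
// Ranker
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage);
// Search
searcher.search(q, collector);
// Retrieve the top-10 documents
TopDocs topDocs = collector.topDocs();
ScoreDoc[] hits = topDocs.scoreDocs;
// totalHits counts all matching documents, not just the top 10
long totalHits = topDocs.totalHits;
System.out.println("query: " + querystr + " " + hits.length + " " + String.format("%,d", totalHits));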
elasticsearch search solr lucene full-text-search
asked Nov 22 '18 at 17:35 by Wolf Garbe (edited Nov 25 '18 at 11:36)
1 Answer
The explanation is that the default operator is OR, not AND as you assume. Searching for the who returns documents that contain either the or who, or both.
the - 5,382,873
who - 1,687,254
the OR who - 5,411,305
That is, most documents that contain who also contain the. Only 28,432 documents (5,411,305 - 5,382,873) contain who without the, and those are the ones added to the result set when you match either term.
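The arithmetic behind these counts can be checked with a small sketch. It assumes that each reported totalHits is a per-document count over the same index, so inclusion-exclusion applies. The class and method names are made up for illustration, and the AND figure is derived from the three reported counts, not measured.

```java
import java.util.Locale;

final class HitCountArithmetic {
    /** Documents matching both terms: |A AND B| = |A| + |B| - |A OR B|. */
    static long andCount(long a, long b, long union) {
        return a + b - union;
    }

    /** Documents matching the second term but not the first: |A OR B| - |A|. */
    static long onlySecond(long a, long union) {
        return union - a;
    }

    public static void main(String[] args) {
        // Counts reported in the question
        long the = 5_382_873L;
        long who = 1_687_254L;
        long theOrWho = 5_411_305L;

        System.out.println(String.format(Locale.ROOT,
                "who without the: %,d", onlySecond(the, theOrWho)));
        System.out.println(String.format(Locale.ROOT,
                "the AND who (derived): %,d", andCount(the, who, theOrWho)));
    }
}
```

Under that assumption, an AND query should return about 1,658,822 documents, which is smaller than both single-term counts and still larger than the 8,827 hits for the phrase query.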
You can change this behavior by changing the default operator:
parser.setDefaultOperator(QueryParserBase.AND_OPERATOR);
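To make the difference between the two operators concrete, here is a toy model in plain Java. It is not how Lucene evaluates queries internally: the in-memory index, the `Operator` enum and the `search` method are all hypothetical, and the document IDs are invented. It only shows that OR behaves like a set union and AND like a set intersection.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ToyBooleanSearch {
    enum Operator { OR, AND }

    /** Combines the posting sets of all terms using the given operator. */
    static Set<Integer> search(Map<String, Set<Integer>> index,
                               List<String> terms, Operator op) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> postings = index.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(postings);   // first term seeds the result
            } else if (op == Operator.OR) {
                result.addAll(postings);            // union: the set can only grow
            } else {
                result.retainAll(postings);         // intersection: the set can only shrink
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    /** Invented sample data: term -> IDs of documents containing it. */
    static Map<String, Set<Integer>> sampleIndex() {
        Map<String, Set<Integer>> index = new HashMap<>();
        index.put("the", new TreeSet<>(Arrays.asList(1, 2, 3, 4)));
        index.put("who", new TreeSet<>(Arrays.asList(3, 4, 5)));
        return index;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("the", "who");
        System.out.println("OR:  " + search(sampleIndex(), terms, Operator.OR));
        System.out.println("AND: " + search(sampleIndex(), terms, Operator.AND));
    }
}
```

With the sample data, OR matches five documents and AND matches two, mirroring why the OR count in the question exceeds both single-term counts.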
answered Nov 25 '18 at 21:40 by MatsLindh