Understanding the Elasticsearch Query DSL
Not Just Copy-Pasting From Stack Overflow But Understanding What You’re Copy-Pasting.
What This Tutorial Is About
Elasticsearch is the go-to search engine these days, but its Query DSL does have a steep learning curve. When I first started writing Elasticsearch queries, I could string together something that worked through a combination of the Elastic.co docs and Stack Overflow, but I didn’t fully understand the underlying concepts behind the syntax.
This tutorial aims to be what I wish I had read first. It assumes the reader is familiar with basic Elasticsearch concepts, can write simple queries, and understands boolean logic. It aims to give the reader a firm conceptual basis in Elasticsearch before trying to understand the syntax.
Filtering Exact Values vs Full Text Analyzed Search
An over-arching theme in Elasticsearch is that all queries can be classified into three types:
1. Filtering by exact values
2. Searching on analyzed text
3. A combination of the two
Every document field can be classified as either an exact value or analyzed text (also called full text). Exact values are fields like user_id, date, email_addresses, etc. Analyzed text is text data like product_description or email_body. As the name implies, this text data has been analyzed (more on this later). It is often in a human natural language, but not necessarily.
Querying documents can be done by specifying filters over exact values. In these cases, the question of whether the document gets returned is a binary yes or no. For example, is the document’s user_id equal to 174517? Is the document’s created_at date within the range of the last month?
On the other hand, querying documents by searching analyzed text returns results based on relevance. A document is returned not by a yes/no criterion, but by how relevant it is. For example, if a document’s analyzed text includes Johnny Depp, it should also be returned in searches for John Depp or Johnnie Depp. A search for cook should also return results for cooking and cooked.
From this behavior, we can deduce that searching by analyzed text is a highly complex operation and involves different analyzer packages depending on the type of text data. For example, some analyzer packages are language-specific and are used to analyze text in a particular language. The default analyzer package is the standard analyzer, which splits text on word boundaries, lowercases it, and removes most punctuation. Because searching by analyzed text is so much more complicated than filtering by exact values, it is also much less performant. Note that we will call searching by analyzed text analyzed search for short.
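To make "analyzed" a little more concrete, here is a minimal Python sketch of the kind of work an analyzer does. It is only a rough approximation of the standard analyzer described above, not its real implementation: the function names are made up for illustration, and the regular expression is a simplification of proper word-boundary detection.

```python
import re

def rough_standard_analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, then keep runs of word characters."""
    return re.findall(r"\w+", text.lower())

def rough_match(query, field_text):
    """Count how many query tokens also occur in the field (a crude relevance signal)."""
    field_tokens = set(rough_standard_analyze(field_text))
    return sum(1 for token in rough_standard_analyze(query) if token in field_tokens)

description = "Fourier analysis, signals & PROCESSING!"
print(rough_standard_analyze(description))
print(rough_match("signal processing", description))
```

Note that in this sketch "signal" does not match "signals". Getting such variants to match is the job of additional analysis steps such as stemming, which is part of why analyzed search is more involved than filtering.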
The Query DSL
Elasticsearch queries are composed of one or more query clauses. Query clauses can be combined to create other query clauses, called compound query clauses. All query clauses have one of these two formats:
{
  QUERY_CLAUSE: {
    ARGUMENT: VALUE,
    ARGUMENT: VALUE,...
  }
}

{
  QUERY_CLAUSE: {
    FIELD_NAME: {
      ARGUMENT: VALUE,
      ARGUMENT: VALUE,...
    }
  }
}
The syntax rule is that query clauses can be repeatedly nested inside other query clauses:
{
  QUERY_CLAUSE: {
    QUERY_CLAUSE: {
      QUERY_CLAUSE: {
        QUERY_CLAUSE: {
          ARGUMENT: VALUE,
          ARGUMENT: VALUE,...
        }
      }
    }
  }
}
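Since a query clause is just a JSON object with a single key, the nesting rule can be sketched with plain Python dictionaries. The clause() helper below is hypothetical (it is not part of any Elasticsearch client), and the example borrows the range and bool clauses that are introduced further down.

```python
import json

def clause(name, body):
    """Wrap a body in a single-key dict, mirroring { QUERY_CLAUSE: { ... } }."""
    return {name: body}

# Leaf clause in the second format: QUERY_CLAUSE -> FIELD_NAME -> arguments.
leaf = clause("range", {"age": {"gt": 30}})

# Compound clauses: the value of an argument is itself a query clause.
nested = clause("bool", {"must_not": clause("bool", {"must": leaf})})

print(json.dumps(nested, indent=2))
```

Building queries as dictionaries and serializing them at the end keeps the braces balanced, which is easy to get wrong when writing deeply nested JSON by hand.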
Here are some common query clauses.
Match Query Clause
The match query clause is the most generic and commonly used query clause. It’s fairly smart in that when it’s run on an analyzed text field, it performs an analyzed search on the text. When it is run on an exact value field, it performs a filter.
In the example below, the first query clause will perform an analyzed search since description is an analyzed text field, while the other two queries are filters over exact value fields.
{ "match": { "description": "Fourier analysis signals processing" }}
{ "match": { "date": "2014-09-01" }}
{ "match": { "visible": true }}
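The snippets in this tutorial show query clauses on their own. When sent to the search API, a clause sits under the top-level "query" key of the request body. Here is a small sketch of that wrapping in Python; the search_body() helper is a made-up name for illustration, and size is just one of several optional search parameters.

```python
import json

def search_body(query_clause, size=10):
    """Place a query clause under the top-level "query" key of a search request body."""
    return {"query": query_clause, "size": size}

body = search_body({"match": {"description": "Fourier analysis signals processing"}})
print(json.dumps(body, indent=2))
```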
The Match All Query Clause
The match all query clause returns all documents. It’s analogous to SELECT * FROM table in SQL.
{ "match_all": {} }
Term/Terms Query Clause
The term and terms query clauses are used to filter exact value fields by a single value or by multiple values, respectively. In the case of multiple values, the logical connection is OR.
For example, the first query finds all documents with the tag “math”. The second query finds all documents with the tags “math” or “statistics”.
{ "term": { "tag": "math" }}
{ "terms": { "tag": ["math", "statistics"] }}
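The OR semantics of terms can be illustrated with a small in-memory simulation. This is only a sketch of the matching logic over hypothetical sample documents, not how Elasticsearch evaluates these clauses internally.

```python
def term(doc, field, value):
    """True when the field holds exactly `value` (a list field matches if any element does)."""
    actual = doc.get(field)
    values = actual if isinstance(actual, list) else [actual]
    return value in values

def terms(doc, field, candidates):
    """True when the field matches at least one candidate, i.e. an OR over term checks."""
    return any(term(doc, field, candidate) for candidate in candidates)

# Hypothetical sample documents.
posts = [
    {"id": 1, "tag": ["math"]},
    {"id": 2, "tag": ["statistics", "r"]},
    {"id": 3, "tag": ["cooking"]},
]

math_ids = [p["id"] for p in posts if term(p, "tag", "math")]
math_or_stats_ids = [p["id"] for p in posts if terms(p, "tag", ["math", "statistics"])]
print(math_ids, math_or_stats_ids)
```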
Multi Match Query Clause
The multi match query clause is a match query that is run across multiple fields instead of just one.
{
  "multi_match": {
    "query": "probability theory",
    "fields": ["title", "body"]
  }
}
Exists and Missing Filters Query Clause
The exists filter checks that documents have a value at a specified field. The missing filter checks that documents do not have have a value at a specified field. They are analogous to SQL’s IS NULL
and IS NOT NULL
clauses.
{
  "exists" : {
    "field" : "title"
  }
}
and
{
  "missing" : {
    "field" : "title"
  }
}
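One thing worth knowing is that the missing filter was removed in Elasticsearch 5.0. On newer versions the same check is usually written as an exists clause negated with must_not inside a bool clause (covered later in this tutorial). The sketch below builds both forms as Python dictionaries. The helper names are made up, and has_value() is only an approximation of what counts as "having a value": it treats null and an empty list as missing and an empty string as present.

```python
def exists(field):
    """Build an exists clause for the given field."""
    return {"exists": {"field": field}}

def missing(field):
    """Equivalent of the old missing filter: a negated exists inside a bool clause."""
    return {"bool": {"must_not": exists(field)}}

def has_value(doc, field):
    """In-memory approximation: null and an empty list count as having no value."""
    value = doc.get(field)
    return value is not None and value != []

print(missing("title"))
print(has_value({"title": "Intro to Probability"}, "title"))
print(has_value({"title": None}, "title"))
```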
Range Filter Query Clause
The range filter query clause is used to filter number and date fields in ranges, using the operators gt, gte, lt, and lte, short for greater_than, greater_than_or_equal, less_than, and less_than_or_equal, respectively.
{ "range" : { "age" : { "gt" : 30 } } }

{
  "range": {
    "born" : {
      "gte": "01/01/2012",
      "lte": "2013",
      "format": "dd/MM/yyyy||yyyy"
    }
  }
}
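The four range operators map directly onto ordinary comparisons, and when several are given they must all hold. Here is a minimal Python sketch of that logic for numeric values only (it ignores the format argument used for dates); the names RANGE_OPERATORS and in_range are made up for illustration.

```python
import operator

# Map each range operator name to the comparison it stands for.
RANGE_OPERATORS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}

def in_range(value, bounds):
    """True when `value` satisfies every bound, e.g. {"gte": 18, "lt": 65}."""
    return all(RANGE_OPERATORS[name](value, bound) for name, bound in bounds.items())

ages = [25, 30, 31, 64, 65]
over_30 = [a for a in ages if in_range(a, {"gt": 30})]
from_30_below_65 = [a for a in ages if in_range(a, {"gte": 30, "lt": 65})]
print(over_30, from_30_below_65)
```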
Bool Query Clause
Query clauses that are built from other query clauses are called compound query clauses. Note that compound query clauses can also be comprised of other compound query clauses, allowing for multi-layer nesting.
The bool query clause is an example of a compound query clause, as it is used to combine multiple query clauses using boolean operators. The three supported boolean operators are must, must_not, and should, which correspond to AND, NOT, and OR, respectively.
For example, suppose we have an index on the posts of a popular social media site. Here is a query to find all posts with the tag math and the level beginner, whose tag is not probability, and which are either unread or have been favorited.
{
  "bool": {
    "must": [
      { "term": { "tag": "math" }},
      { "term": { "level": "beginner" }}
    ],
    "must_not": { "term": { "tag": "probability" }},
    "should": [
      { "term": { "favorite": true }},
      { "term": { "unread": true }}
    ]
  }
}
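The analogous SQL query for the above ES query would look like this, assuming at least one of the should clauses is required to match (depending on the Elasticsearch version and context, should clauses that sit next to a must may only affect scoring unless minimum_should_match is set to 1):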
SELECT * FROM posts
WHERE posts.tag = 'math'
AND posts.level = 'beginner'
AND posts.tag != 'probability'
AND (posts.favorite IS true OR posts.unread IS true);
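To see how must, must_not, and should combine, here is a small in-memory evaluator for a bool clause made of term clauses. It is a sketch of the boolean logic only, over hypothetical sample posts, and says nothing about scoring. The minimum_should_match argument mirrors the bool parameter of the same name, and defaulting it to 1 is an assumption chosen to match the SQL analogy above.

```python
def term_matches(doc, clause):
    """Check a {"term": {field: value}} clause against a document."""
    field, value = next(iter(clause["term"].items()))
    actual = doc.get(field)
    return value in actual if isinstance(actual, list) else actual == value

def as_list(clauses):
    """Accept either a single clause or a list of clauses."""
    return clauses if isinstance(clauses, list) else [clauses]

def bool_matches(doc, bool_body, minimum_should_match=1):
    """must = AND, must_not = NOT, should = OR (at least `minimum_should_match` hits)."""
    must = as_list(bool_body.get("must", []))
    must_not = as_list(bool_body.get("must_not", []))
    should = as_list(bool_body.get("should", []))
    should_hits = sum(term_matches(doc, c) for c in should)
    return (
        all(term_matches(doc, c) for c in must)
        and not any(term_matches(doc, c) for c in must_not)
        and (not should or should_hits >= minimum_should_match)
    )

query = {
    "must": [{"term": {"tag": "math"}}, {"term": {"level": "beginner"}}],
    "must_not": {"term": {"tag": "probability"}},
    "should": [{"term": {"favorite": True}}, {"term": {"unread": True}}],
}

# Hypothetical sample posts.
posts = [
    {"id": 1, "tag": ["math"], "level": "beginner", "favorite": True, "unread": False},
    {"id": 2, "tag": ["math", "probability"], "level": "beginner", "favorite": True, "unread": True},
    {"id": 3, "tag": ["math"], "level": "beginner", "favorite": False, "unread": False},
    {"id": 4, "tag": ["math"], "level": "advanced", "favorite": True, "unread": True},
]

strict_ids = [p["id"] for p in posts if bool_matches(p, query)]
lenient_ids = [p["id"] for p in posts if bool_matches(p, query, minimum_should_match=0)]
print(strict_ids, lenient_ids)
```

With minimum_should_match set to 0, post 3 is also returned even though it is neither favorited nor unread, which is the behavior to watch out for when should clauses are treated as optional.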
Combining Analyzed Search With Filters
We have been talking about exact field filters and analyzed search in separate contexts, but in real-world applications, we often want to combine the two. We combine analyzed search and exact field filters using the filtered clause.
For example, suppose we have an index on the posts of a popular web forum on mathematics. Here is a query that performs an analyzed search for “Probability Theory” but returns only posts with more than 20 upvotes and without the tag “frequentist”.
{
  "filtered": {
    "query": { "match": { "body": "Probability Theory" }},
    "filter": {
      "bool": {
        "must": {
          "range": { "upvotes" : { "gt" : 20 } }
        },
        "must_not": { "term": { "tag": "frequentist" } }
      }
    }
  }
}
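If you are on a newer Elasticsearch version, note that the filtered clause was deprecated in 2.0 and removed in 5.0. The usual replacement is a bool clause with the analyzed search under must and the exact value filters under filter. The sketch below does that rewrite mechanically in Python; filtered_to_bool() is a hypothetical helper written for this tutorial, not a function from any Elasticsearch client.

```python
import json

def filtered_to_bool(query):
    """Rewrite {"filtered": {"query": q, "filter": f}} as {"bool": {"must": q, "filter": f}}."""
    body = query["filtered"]
    bool_body = {}
    if "query" in body:
        bool_body["must"] = body["query"]    # scored, analyzed search part
    if "filter" in body:
        bool_body["filter"] = body["filter"]  # yes/no filter part, not scored
    return {"bool": bool_body}

forum_query = {
    "filtered": {
        "query": {"match": {"body": "Probability Theory"}},
        "filter": {
            "bool": {
                "must": {"range": {"upvotes": {"gt": 20}}},
                "must_not": {"term": {"tag": "frequentist"}},
            }
        },
    }
}

converted = filtered_to_bool(forum_query)
print(json.dumps(converted, indent=2))
```

The split is the same dichotomy as before: the must part decides how relevant a post is, while the filter part only decides whether a post is allowed in the results at all.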
Using the EXPLAIN API
Oftentimes, we will look at query results and wonder why we got the results we did (usually because we expected something else from an analyzed search).
Similar to SQL, Elasticsearch has an EXPLAIN API, but the similarity ends there. While SQL’s EXPLAIN describes how the query was performed, the Elasticsearch EXPLAIN API describes why the results are the way they are.
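As a rough sketch, an explain request asks about one specific document and one query. The Python below only builds the request pieces and does not send anything. The path layout shown is the one used by Elasticsearch 7 and later (older versions also include the document type in the path), and the index name posts and document id 42 are made up for illustration.

```python
import json

def explain_request(index, doc_id, query_clause):
    """Build (method, path, body) for asking why one document does or does not match a query."""
    path = f"/{index}/_explain/{doc_id}"  # assumed 7.x+ path layout
    return "GET", path, json.dumps({"query": query_clause})

method, path, body = explain_request(
    "posts", "42", {"match": {"body": "Probability Theory"}}
)
print(method, path)
print(body)
```

The response breaks the document's score down into the contributions of each matching term, which is usually enough to see why one result ranked above another.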
Summary
The conceptual backdrop of the Elasticsearch Query DSL is the dichotomy between filtering documents by exact values and searching through analyzed text. Hope y’all found this helpful.
Feel free to leave comments, questions, suggestions, corrections below. — S