Understanding the Elasticsearch Query DSL

Not Just Copy-Pasting From Stack Overflow But Understanding What You’re Copy-Pasting.

Steven Li
Apr 26, 2019

What This Tutorial Is About

Elasticsearch is the go-to search engine these days, but its Query DSL does have a steep learning curve. When I first started writing Elasticsearch queries, I could string together something that worked through a combination of the Elastic.co docs and Stack Overflow, but I didn’t fully understand the underlying concepts behind the syntax.

This tutorial aims to be what I wish I had read first. It assumes the reader is familiar with basic Elasticsearch concepts, can write simple queries, and understands boolean logic. It aims to provide the reader with a firm conceptual basis of Elasticsearch before trying to understand the syntax.

Filtering Exact Values vs Full Text Analyzed Search

An over-arching theme in Elasticsearch is that all queries can be classified into three types:

1. Filtering by exact values
2. Searching on analyzed text
3. A combination of the two

Every document field can be classified as either an exact value or analyzed text (also called full text). Exact values are fields like user_id, date, email_addresses, etc. Analyzed text is text data like product_description or email_body. As the name implies, this text data has been analyzed (more on this later). It is often in a human natural language but not necessarily.

Querying documents can be done by specifying filters over exact values. In these cases, the question of whether the document gets returned is a binary yes or no. For example, is the document’s user_id equal to 174517 ? Is the document’s created_at date within the range of the last month?
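The yes/no nature of an exact-value filter can be sketched in plain Python. This is only an illustration: the in-memory docs list, the filter_docs helper, and the fixed "today" date are assumptions made for the example and are not part of Elasticsearch.

```python
from datetime import date, timedelta

# Hypothetical in-memory documents standing in for an index.
docs = [
    {"user_id": 174517, "created_at": date(2019, 4, 20)},
    {"user_id": 174517, "created_at": date(2018, 1, 5)},
    {"user_id": 99, "created_at": date(2019, 4, 22)},
]

def filter_docs(docs, user_id, since):
    """Keep documents whose user_id matches exactly and whose
    created_at is on or after `since`. Each document is either in or out."""
    return [d for d in docs if d["user_id"] == user_id and d["created_at"] >= since]

today = date(2019, 4, 26)  # fixed so the example is reproducible
recent = filter_docs(docs, 174517, today - timedelta(days=30))
print(recent)
```

There is no notion of a "better" or "worse" match here: a document either satisfies both conditions or it does not.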

On the other hand, querying documents by searching analyzed text returns results based on relevance. A document is returned not by a yes/no criterion, but by how relevant it is. For example, if a document’s analyzed text includes Johnny Depp, it should also be returned in searches for John Depp or Johnnie Depp. Given an analyzer that stems words, such as one of the language analyzers, a search for cook should also return results for cooking and cooked.

From this behavior, we can deduce that searching by analyzed text is a highly complex operation that involves different analyzer packages depending on the type of text data. For example, some analyzer packages are language-specific and are used to analyze text in a particular language. The default analyzer package is the standard analyzer, which splits text on word boundaries, lowercases it, and removes most punctuation. Because searching by analyzed text is so much more complicated than filtering by exact values, it is also much less performant. For short, we will call searching by analyzed text analyzed search.
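To make the idea of analysis more concrete, here is a rough Python approximation of what the standard analyzer does to a piece of text. It is a simplification under stated assumptions: the real analyzer follows Unicode text segmentation rules, and simple_analyze and analyzed_match are hypothetical helper names, not Elasticsearch functions.

```python
import re

def simple_analyze(text):
    """Rough stand-in for the standard analyzer:
    lowercase the text, then split it on non-word characters."""
    return re.findall(r"\w+", text.lower())

def analyzed_match(query, field_text):
    """True if the analyzed query and the analyzed field share at least one token."""
    return bool(set(simple_analyze(query)) & set(simple_analyze(field_text)))

print(simple_analyze("Johnny Depp: The BEST film!"))
# ['johnny', 'depp', 'the', 'best', 'film']

# An exact-value comparison is case sensitive; an analyzed comparison is not.
print("Johnny Depp" == "johnny depp")               # False
print(analyzed_match("johnny depp", "Johnny Depp"))  # True
```

Note that this sketch does no stemming, so analyzed_match("cook", "cooking") is False. Matching those two words takes an analyzer that reduces them to a common root.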

The Query DSL

Elasticsearch queries are composed of one or more query clauses. Query clauses can be combined to create other query clauses, called compound query clauses. Every query clause has one of these two formats:

{
  QUERY_CLAUSE: {
    ARGUMENT: VALUE,
    ARGUMENT: VALUE,...
  }
}

{
  QUERY_CLAUSE: {
    FIELD_NAME: {
      ARGUMENT: VALUE,
      ARGUMENT: VALUE,...
    }
  }
}

The syntax rule is that query clauses can be repeatedly nested inside other query clauses:

{
  QUERY_CLAUSE: {
    QUERY_CLAUSE: {
      QUERY_CLAUSE: {
        QUERY_CLAUSE: {
          ARGUMENT: VALUE,
          ARGUMENT: VALUE,...
        }
      }
    }
  }
}
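Because every clause is just a nested key/value structure, queries can be assembled programmatically. The sketch below builds one with plain Python dicts. The clause helper is a hypothetical convenience function, and the range, bool, and match_all clauses it uses are explained later in this tutorial.

```python
import json

def clause(name, body):
    """Wrap a body in a single-key dict, i.e. { QUERY_CLAUSE: body }."""
    return {name: body}

# A leaf clause in the second format: QUERY_CLAUSE -> FIELD_NAME -> arguments.
leaf = clause("range", {"age": {"gt": 30}})

# A compound clause nests other clauses inside its arguments.
compound = clause("bool", {"must": [leaf, clause("match_all", {})]})

# A search request body places the outermost clause under the "query" key.
request_body = {"query": compound}
print(json.dumps(request_body, indent=2))
```

Building queries as data rather than as hand-edited strings makes it harder to lose a brace or a comma in deeply nested clauses.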

Here are some common query clauses.

Match Query Clause

The match query clause is the most generic and commonly used query clause. It’s fairly smart in that when it’s run on an analyzed text field, it performs an analyzed search on the text. When it is run on an exact value field, it performs a filter.

In the example below, the first query clause will perform an analyzed search since description is an analyzed text field, while the other two queries are filters over exact value fields.

{ "match": { "description": "Fourier analysis signals processing" }}
{ "match": { "date": "2014-09-01" }}
{ "match": { "visible": true }}
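The way match switches behavior based on the field type can be imitated in a few lines of Python. This is a loose sketch, not how Elasticsearch is implemented: the MAPPING table, the tokens helper, and the match function are all hypothetical, and the "at least one shared token" rule mirrors the default OR behavior of match.

```python
import re

# Hypothetical mapping: which fields are analyzed text and which hold exact values.
MAPPING = {"description": "text", "date": "date", "visible": "boolean"}

def tokens(text):
    """Crude analysis step: lowercase and split on non-word characters."""
    return set(re.findall(r"\w+", text.lower()))

def match(doc, field, query):
    """Analyzed search on text fields, exact comparison on everything else."""
    if MAPPING[field] == "text":
        return bool(tokens(query) & tokens(doc[field]))
    return doc[field] == query

doc = {
    "description": "Signal processing with Fourier analysis",
    "date": "2014-09-01",
    "visible": True,
}
print(match(doc, "description", "Fourier analysis signals processing"))  # True
print(match(doc, "date", "2014-09-01"))                                  # True
print(match(doc, "visible", False))                                      # False
```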

The Match All Query Clause

The match all query clause returns all documents. It’s analogous to SELECT * FROM table in SQL.

{ "match_all": {} }

Term/Terms Query Clause

The term and terms query clauses are used to filter exact value fields by a single value or by multiple values, respectively. In the case of multiple values, the logical connection is OR.

For example, the first query finds all documents with the tag “math”. The second query finds all documents with the tags “math” or “statistics”.

{ "term": { "tag": "math" }}
{ "terms": { "tag": ["math", "statistics"] }}
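The OR semantics of terms can be spelled out with two small Python predicates. These are illustrative stand-ins only: the function names mirror the clauses, and the assumption that a tag field may hold either a single value or a list of values is made for the example.

```python
def term(doc, field, value):
    """Exact match. A multi-valued field matches if any of its values equals `value`."""
    stored = doc.get(field)
    values = stored if isinstance(stored, list) else [stored]
    return value in values

def terms(doc, field, candidates):
    """OR over several exact values: one hit is enough."""
    return any(term(doc, field, v) for v in candidates)

post = {"tag": ["statistics", "bayes"]}
print(term(post, "tag", "math"))                  # False
print(terms(post, "tag", ["math", "statistics"])) # True
```

Unlike an analyzed search, the comparison is literal: term(post, "tag", "Statistics") would be False because the case differs.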

Multi Match Query Clause

The multi match query clause is a match query that is run across multiple fields instead of just one.

{
  "multi_match": {
    "query": "probability theory",
    "fields": ["title", "body"]
  }
}
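One way to think about multi_match is as shorthand for one match clause per field. The Python sketch below performs that expansion, assuming the default best_fields behavior, where the per-field match clauses are combined with a dis_max clause that scores a document by its best-matching field. The expand_multi_match helper is hypothetical, and the expansion ignores options such as tie_breaker and per-field boosts.

```python
def expand_multi_match(clause):
    """Rewrite a multi_match clause as one match clause per field,
    wrapped in dis_max (approximating the default best_fields type)."""
    body = clause["multi_match"]
    return {
        "dis_max": {
            "queries": [{"match": {field: body["query"]}} for field in body["fields"]]
        }
    }

multi = {"multi_match": {"query": "probability theory", "fields": ["title", "body"]}}
print(expand_multi_match(multi))
```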

Exists and Missing Filters Query Clause

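The exists filter checks that documents have a value at a specified field. The missing filter checks that documents do not have a value at a specified field. They are analogous to SQL’s IS NOT NULL and IS NULL clauses, respectively. Note that the missing query was removed in Elasticsearch 5.0; in newer versions, the same effect is achieved by placing an exists clause inside the must_not of a bool query.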

{
  "exists" : {
    "field" : "title"
  }
}

and

{
  "missing" : {
    "field" : "title"
  }
}
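The two checks are mirror images of each other, which a short Python sketch makes explicit. The exists and missing functions are hypothetical stand-ins that treat a null value or an empty list as "no value" and, as an assumption of this example, treat an empty string as a real value. The last line shows how the missing check can be written as a bool clause in Elasticsearch versions that no longer ship the missing query.

```python
def exists(doc, field):
    """True if the document has a non-null, non-empty-list value at `field`."""
    value = doc.get(field)
    return value is not None and value != []

def missing(doc, field):
    """The exact negation of exists."""
    return not exists(doc, field)

print(exists({"title": "Bayes' theorem"}, "title"))  # True
print(missing({"title": None}, "title"))             # True
print(missing({}, "title"))                          # True

# Equivalent of the missing clause, expressed as "must not exist".
missing_title = {"bool": {"must_not": {"exists": {"field": "title"}}}}
```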

Range Filter Query Clause

The range filter query clause is used to filter number and date fields by range, using the operators gt, gte, lt, and lte, short for greater_than, greater_than_or_equal, less_than, and less_than_or_equal, respectively.

{ "range" : { "age" : { "gt" : 30 } } }

{
  "range": {
    "born" : {
      "gte": "01/01/2012",
      "lte": "2013",
      "format": "dd/MM/yyyy||yyyy"
    }
  }
}
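The four operators map directly onto ordinary comparisons, and a value must satisfy all of the bounds given. Here is a minimal Python sketch of that rule. The in_range helper is hypothetical, and it handles only the four operator keys, not extras such as format.

```python
import operator

# Each range operator corresponds to a standard comparison.
OPS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}

def in_range(value, bounds):
    """True if `value` satisfies every bound, e.g. {"gte": 30, "lt": 40}."""
    return all(OPS[name](value, bound) for name, bound in bounds.items())

print(in_range(31, {"gt": 30}))              # True
print(in_range(30, {"gt": 30}))              # False
print(in_range(30, {"gte": 30, "lt": 40}))   # True
```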

Bool Query Clause

Query clauses that are built from other query clauses are called compound query clauses. Note that compound query clauses can also be composed of other compound query clauses, allowing for multi-layer nesting.

The bool query clause is an example of a compound query clause, as it is used to combine multiple query clauses using boolean operators. The three supported boolean operators are must, must_not, and should, which correspond to AND, NOT, and OR, respectively.

For example, suppose we have an index on the posts of a popular social media site. Here is a query to find all beginner-level posts that have the tag math, do not have the tag probability, and are either unread or have been favorited.

{
  "bool": {
    "must": [
      { "term": { "tag": "math" }},
      { "term": { "level": "beginner" }}
    ],
    "must_not": { "term": { "tag": "probability" }},
    "should": [
      { "term": { "favorite": true }},
      { "term": { "unread": true }}
    ]
  }
}

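One caveat before the SQL analogy: when a bool query has must clauses, its should clauses are optional by default and only raise the relevance score, so the final OR condition is enforced only if minimum_should_match is set to 1. With that setting, the analogous SQL query for the above ES query would look like this: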

SELECT * FROM posts
WHERE posts.tag = 'math'
AND posts.level = 'beginner'
AND posts.tag != 'probability'
AND (posts.favorite IS true OR posts.unread IS true);
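The boolean logic of the bool clause can be mimicked in Python with ordinary predicates. The sketch below is an approximation under stated assumptions: bool_match and has_tag are hypothetical helpers, posts are plain dicts with a tags list, and the minimum_should_match parameter models how many should clauses must hold (0 makes them optional, 1 reproduces the OR in the SQL above).

```python
def bool_match(doc, must=(), must_not=(), should=(), minimum_should_match=0):
    """Each clause is a predicate that takes a document and returns True or False."""
    if not all(pred(doc) for pred in must):      # AND
        return False
    if any(pred(doc) for pred in must_not):      # NOT
        return False
    hits = sum(1 for pred in should if pred(doc))
    return hits >= minimum_should_match          # OR, when set to 1

def has_tag(tag):
    return lambda doc: tag in doc["tags"]

query = dict(
    must=[has_tag("math"), lambda d: d["level"] == "beginner"],
    must_not=[has_tag("probability")],
    should=[lambda d: d["favorite"], lambda d: d["unread"]],
    minimum_should_match=1,
)

post_a = {"tags": ["math"], "level": "beginner", "favorite": True, "unread": False}
post_b = {"tags": ["math", "probability"], "level": "beginner", "favorite": True, "unread": True}
post_c = {"tags": ["math"], "level": "beginner", "favorite": False, "unread": False}

print(bool_match(post_a, **query))  # True
print(bool_match(post_b, **query))  # False: carries the excluded tag
print(bool_match(post_c, **query))  # False: neither favorited nor unread
```

With minimum_should_match left at 0, post_c would match as well, since the should clauses would no longer be required.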

Combining Analyzed Search With Filters

We have been talking about exact field filters and analyzed search in separate contexts, but in real-world applications, we often want to combine the two. We combine analyzed search and exact field filters using the filtered clause. (The filtered clause was deprecated in Elasticsearch 2.0 and removed in 5.0; newer versions express the same thing with a bool query whose must holds the search and whose filter holds the filters.)

For example, suppose we have an index on the posts of a popular web forum on mathematics. Here is a query that performs an analyzed search for “Probability Theory”, but only returns posts that have more than 20 upvotes and do not have the tag “frequentist”.

{
  "filtered": {
    "query": { "match": { "body": "Probability Theory" }},
    "filter": {
      "bool": {
        "must": {
          "range": { "upvotes" : { "gt" : 20 } }
        },
        "must_not": { "term": { "tag": "frequentist" } }
      }
    }
  }
}
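On Elasticsearch versions that no longer accept the filtered clause, the same combination is written as a bool query: the scored search goes under must and the unscored filters go under filter. The Python sketch below performs that rewrite mechanically. The filtered_to_bool helper is hypothetical and assumes the clause has both a query and a filter part.

```python
import json

def filtered_to_bool(clause):
    """Rewrite {"filtered": {"query": Q, "filter": F}}
    as {"bool": {"must": Q, "filter": F}}."""
    body = clause["filtered"]
    return {"bool": {"must": body["query"], "filter": body["filter"]}}

old_style = {
    "filtered": {
        "query": {"match": {"body": "Probability Theory"}},
        "filter": {
            "bool": {
                "must": {"range": {"upvotes": {"gt": 20}}},
                "must_not": {"term": {"tag": "frequentist"}},
            }
        },
    }
}

new_style = filtered_to_bool(old_style)
print(json.dumps(new_style, indent=2))
```

The split matters for more than syntax: clauses under filter decide only whether a document is included, while clauses under must also contribute to its relevance score.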

Using the EXPLAIN API

Often, we will look at query results and wonder why we got the results we did (usually because we expected something else from an analyzed search).

Similar to SQL, Elasticsearch has an EXPLAIN API. But the similarity ends there. While SQL’s EXPLAIN describes how the query was performed, the Elasticsearch EXPLAIN API describes why the results are the way they are.
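As a rough illustration, an explain request names one document and carries the query to be explained in its body. The Python sketch below only assembles such a request without sending it. The build_explain_request helper, the posts index name, and the document id 42 are assumptions for the example, and the path layout shown is the one used by recent Elasticsearch versions (older versions also include a type segment in the path).

```python
import json

def build_explain_request(index, doc_id, query):
    """Return the HTTP method, path, and JSON body for an explain call
    that asks why (or whether) one document matches `query`."""
    path = f"/{index}/_explain/{doc_id}"
    body = json.dumps({"query": query})
    return "GET", path, body

method, path, body = build_explain_request(
    "posts", 42, {"match": {"body": "Probability Theory"}}
)
print(method, path)
print(body)
```

The response to such a request reports whether the document matched and breaks its score down into the contributing parts, which is usually enough to see why an analyzed search ranked a document the way it did.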

Summary

The conceptual backdrop of the Elasticsearch Query DSL is this dichotomy of filtering documents vs searching through analyzed text. Hope y’all found this helpful.

Feel free to leave comments, questions, suggestions, corrections below. — S
