Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Advanced Elasticsearch 7.0

You're reading from   Advanced Elasticsearch 7.0 A practical guide to designing, indexing, and querying advanced distributed search engines

Arrow left icon
Product type Paperback
Published in Aug 2019
Publisher Packt
ISBN-13 9781789957754
Length 560 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Wai Tak Wong Wai Tak Wong
Author Profile Icon Wai Tak Wong
Wai Tak Wong
Arrow right icon
View More author details
Toc

Table of Contents (25) Chapters Close

Preface 1. Section 1: Fundamentals and Core APIs FREE CHAPTER
2. Overview of Elasticsearch 7 3. Index APIs 4. Document APIs 5. Mapping APIs 6. Anatomy of an Analyzer 7. Search APIs 8. Section 2: Data Modeling, Aggregations Framework, Pipeline, and Data Analytics
9. Modeling Your Data in the Real World 10. Aggregation Frameworks 11. Preprocessing Documents in Ingest Pipelines 12. Using Elasticsearch for Exploratory Data Analysis 13. Section 3: Programming with the Elasticsearch Client
14. Elasticsearch from Java Programming 15. Elasticsearch from Python Programming 16. Section 4: Elastic Stack
17. Using Kibana, Logstash, and Beats 18. Working with Elasticsearch SQL 19. Working with Elasticsearch Analysis Plugins 20. Section 5: Advanced Features
21. Machine Learning with Elasticsearch 22. Spark and Elasticsearch for Real-Time Analytics 23. Building Analytics RESTful Services 24. Other Books You May Enjoy

Bucket aggregations

Bucket aggregation is a family that provides mechanisms for segmenting document sets into groups (buckets). Thus, each bucket is associated with a criterion, which determines whether the document in the current context falls into it. The following sections will illustrate each type of aggregation in the bucket family.

histogram

The purpose of the histogram aggregation is to group values into ranges with a fixed interval and count the documents. The starting range is 0 or a given offset. The range distance is specified by the interval value. The rounding formula is bucket_key = Math.floor((value - offset) / interval) * interval + offset. The "extended_bounds" : {"min" : lv,"max" : uv} parameter is used to specify the lower bound and upper bound of the buckets:

  • Example: The purpose is to obtain the histogram of the open price of the ACWF ETF in the cf_etf_hist_price index, with interval=1, min=24, and max=30:
"query": { "match": { "symbol": "ACWF"}},
"aggs": { "acwf_histogram": {"histogram": {"field": "open","interval": 1,"extended_bounds": { "min":24, "max":30}}}}
  • Resulting buckets: key is the open price:
"acwf_histogram": {"buckets": [[{"key": 24,"doc_count": 1},{"key": 25,"doc_count": 5},{"key": 26,"doc_count": 5},{"key": 27,"doc_count": 10},{"key": 28,"doc_count": 22},{"key": 29,"doc_count": 18},{"key": 30,"doc_count": 0}]]}

date_histogram

The purpose of the date histogram aggregation is to group date values into ranges with a fixed time unit interval and count the documents. The range distance is specified by the interval value in time units such as millisecond, 1ms, Xms, second, 1s, Xms, minute, 1m, Xm, hour, 1h, Xh, day, 1d, Xd, week, 1w, month, 1M, quarter, 1q. year, and 1y, where X is a positive integer. We can use the format parameter to specify the custom date format. The starting value is respective to the time unit and we can use the offset parameter to adjust the starting value with ±time_unit to denote a positive or negative offset of the time unit:

  • Example: The purpose is to obtain the monthly trading days of the ACWF ETF in the cf_etf_hist_price index:
"query": { "match": { "symbol": "ACWF"}},
"aggs":{"monthly_trading_days": {"date_histogram":{"field": "date","interval": "month", format": "yyyy-MM-dd"}}}
  • Resulting buckets: key is the timestamp and key_as_string is the date value according to the format:
"monthly_trading_days": {
"buckets": [[{"key_as_string": "2018-12-01","key": 1543622400000,"doc_count": 4}, {"key_as_string": "2019-01-01","key": 1546300800000,"doc_count": 21},{"key_as_string": "2019-02-01","key": 1548979200000,"doc_count": 19},{"key_as_string": "2019-03-01","key": 1551398400000,"doc_count": 17}]]}

auto_date_histogram

This is similar to the date histogram aggregation. The difference is using a given number of intervals (buckets) instead of the time interval:

  • Example: The purpose is to obtain the monthly trading days of the ACWF ETF in the cf_etf_hist_price index. Since we already know that the data involves four months, we use buckets=4:
"query": { "match": { "symbol": "ACWF"}},
"aggs":{"monthly_trading_days": {
"auto_date_histogram":{"field": "date","buckets": "4","format": "yyyy-MM-dd"}}}
  • Resulting buckets: Same as the result in the date histogram aggregation example, but an additional parameter interval value is provided as "interval":"1M" denote monthly.

ranges

The purpose of the ranges aggregation is to group values into ranges and count the documents. Each range can be randomly defined by using the from/to keyword. Note that the from value is inclusive and the to value is exclusive:

  • Example: The purpose is to demonstrate the number trading days of the ACWF ETF within different ranges specified by from/to in the cf_etf_hist_price index:
"query": { "match": { "symbol": "ACWF"}},
"aggs":{"monthly_trading_days": {
"range":{"field": "date","format": "yyyy-MM-dd",
"ranges": [{"from":"2018-12-01"},{"to":"2019-04-01"},{"from":"2019-01-01","to":"2019-02-01"}]}
  • Resulting buckets: The key is the specified date range in the specified format, andfrom/to is the epoch in milliseconds:
"monthly_trading_days": {"buckets": [
[{"key": "*-2019-04-01","to": 1554076800000,"to_as_string": "2019-04-01","doc_count": 61},
{"key": "2018-12-01-*","from": 1543622400000,"from_as_string": "2018-12-01","doc_count": 61},
{"key": "2019-01-01-2019-02-01","from": 1546300800000,"from_as_string": "2019-01-01","to": 1548979200000,"to_as_string": "2019-02-01","doc_count": 21}]]}

date_range

This is similar to the date range aggregation. The difference is using Date Math expression (please refer to the API conventions section of Chapter 1, Overview of Elasticsearch 7) for the from/to range. Note that the from value is inclusive and the to value is exclusive:

  • Example: The purpose is to demonstrate the number trading days of the ACWF ETF within different date ranges specified by from/to in the cf_etf_hist_price index:
"query": { "match": { "symbol": "ACWF"}},
"aggs":{"monthly_trading_days": {"date_range":{"field": "date","format": "yyyy-MM-dd",
"ranges": [{"from":"2018-12-01"},{"to":"2019-04-01"},{"from":"2019-01-01","to":"2019-01-01||+1M"}]}}}
  • Resulting buckets: Same as the result in the ranges aggregation.

ip_range

Similar to the date_range aggregation, this is used for the IP data type field to count the documents based on the IP range:

  • Example: The purpose is to count the documents with an IP between 192.168.0.150 and 192.168.0.255, from the ip_field IP data type field:
"aggs" : {"used_ips" : {"ip_range" : {"field" : "ip_field","ranges" : [{ "to" : "192.168.0.255" },{ "from" : "192.168.0.150" }]}}}

filter

The purpose of the filter aggregation is to screen the documents that fit a specified filter and provide a total count. It generally acts as the pre-screening step to nail down the scope of documents:

  • Example: The purpose of this is to find the average changePercent on the date of 2019-02-1 in the cf_etf_hist_price index:
"aggs":{"avg_changePercent_at_date": {"filter":{"term": {"date":"2019-02-15"}},
"aggs": {"avg_changePercent": {"avg":{"field": "changePercent"}}}}}
  • Resulting buckets: The avg_changePercent_at_date value is 0.60385. There are 314 documents involved in the aggregation:
"avg_changePercent_at_date": {"doc_count": 314,"avg_changePercent": {"value": .6038535040014299}}

filters

The purpose is to provide multiple buckets based on the specified filters. Each filter provides a document count for one bucket:

  • Example: The purpose is to count the number of times an ETF's changePercent is larger than or less than the changePercent average of 0.60385 in the cf_etf_hist_price index:
"query": { "term": {"date":"2019-02-15"}},
"aggs":{"filters_avg_changePercent": {"filters":{
"filters": {"less_than_avg": {"range": {"changePercent":{"lt":0.60385}}},
"greater_than_avg":{"range": {"changePercent":{"gte":0.60385}}}}}}}
  • Resulting buckets: There are 159 documents with changePercent greater than 0.60385 and 155 documents where it is less than that value:
"filters_avg_changePercent": {"buckets": {"greater_than_avg": {"doc_count": 159},
"less_than_avg": {"doc_count": 155}}}

term

The purpose of this is to group results in buckets based on the field values. It provides a document count for each field value:

  • Example: The purpose is to count each rating in the cf_etf index:
"aggs":{"terms_on_rating": {
"terms":{"field":"rating"}}}
  • Resulting buckets: There are five ratings and the counts are as follows. Because the aim of the method is estimation, the document counts are approximate values, so we can see the reported doc_count_error_upper_bound. We may set the show_term_doc_count_error parameter to true to get the error in a worst-case scenario:
"terms_on_rating": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,
"buckets": [{"key": 3,"doc_count": 77},{"key": 4,"doc_count": 69},{"key": 2,"doc_count": 36},
{"key": 5,"doc_count": 34},{"key": 1,"doc_count": 21}]

significant_terms

The purpose of this is to identify the most relevant terms that are related to a particular set of documents. The terms in the result are not simply the most popular terms in the whole document set. Instead, they have a significant change in the popularity measured between the foreground (particular set scope) and background (index scope) sets. Here's a reminder that it does not support floating fields and the document counts are approximate values:

  • Example: The purpose is to find the significant morningstar category (in keyword datatype) of ETFs with rating as 5 in the cf_etf index:
"query":{"match":{"rating":5}},
"aggs":{"rating_vs_morningstar":{"significant_terms":{"field":"morningstar_category"}}}}}
  • Resulting buckets: There are 34 ETFs with rating=5. The significant morningstar category for ETFs in rating 5 are Mid-Cap Value and Small Blend. The doc_count and bg_count document counts ofsignificant_terms are the document counts in the foreground and background respectively:
"rating_vs_morningstar": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,
"buckets": [{"key": 5,"doc_count": 34,
"terms_on_morningstar_category": {"doc_count": 34,"bg_count":314, "buckets": [
{"key": "Mid-Cap Value","doc_count": 3,"score": 0.522923875432526,"bg_count": 4},
{"key": "Small Blend","doc_count": 3,"score": 0.319204152249135,"bg_count": 6}]}}]}

significant_text

This is similar to the significant_term aggregation, but it is for free-text fields:

  • Example: The purpose is to find the significant fund_name (in the text datatype) of ETFs with rating 5 in the cf_etf index. We use the size parameter to limit the return buckets to 3 and the exclude parameter to exclude unwanted terms:
"query":{"match":{"rating":5}},
"aggs":{"rating_vs_fund_name":{"significant_text":{"field":"fund_name", "exclude": ["us", "p", "s", "cap"], "size":3}}}
  • Resulting buckets: The top three most relevant significant texts are low, volatility, and small for the fund_name field in the cf_etf index:
"rating_vs_morningstar": {"doc_count": 34,"bg_count": 314,"buckets": [
{"key": "low","doc_count": 6,"score": 1.220464656450816,"bg_count": 7},
{"key": "volatility","doc_count": 5,"score": 0.9847174163783161,"bg_count": 6},
{"key": "small","doc_count": 4,"score": 0.24452133794694353,"bg_count": 12}]}

sampler

Instead of an unbound number of documents to process, the sampler aggregation can limit its sub-aggregation results by sampling the top scores within a shard by a shard_size parameter (the default value is 100).

  • Example: The purpose is to limit the (significant_terms) sub-aggregation document size to 150 per shard in the cf_etf index:
"aggs": { "sampling": {"sampler": { "shard_size":150},
"aggs": {"significant_terms_category": {"significant_terms": {"field": "category"}}}}}
  • Resulting buckets: We can see thatdoc_count of the significant_terms_category aggregation is 150:
"sampling": {"doc_count": 150,
"significant_terms_category": {"doc_count": 150,"bg_count": 314, "buckets": [
{"key": "International","doc_count": 61,"score": 0.13425648148148153,"bg_count": 96},
{"key": "Sector","doc_count": 10,"score": 0.07288888888888888,"bg_count": 10},
{"key": "Commodity","doc_count": 9,"score": 0.05303999999999999,"bg_count": 10}]}}

diversified_sampler

Similar to using the sampler aggregation to limit the sub-aggregation results, this restricts the number of the samples to have a common value on the specified field. It reduces the bias distribution of the sample pool:

  • Example: Here we are using diversified_sampler on the family field and setting the max_docs_per_value=10 parameter (sampling 10 documents per bucket) with the same significant_terms aggregation as the sampler example for the cf_etf index:
"aggs": { "sampling": {
"diversified_sampler": {"field":"family", "shard_size":150, "max_docs_per_value" : 10},
"aggs": {"significant_terms_category": {"significant_terms": {"field": "category"}}}}}

According to the result of sampler, there are 61 documents with International, 10 with Section, and 9 with Commodity.

  • Resulting buckets: We can see that the result is different from the sampler aggregation. If we increase max_docs_per_value to 61, then the result is the same as the sampler aggregation:
"sampling": {"doc_count": 72,
"significant_terms_category": {"doc_count": 72,"bg_count": 314,"buckets": [
{"key": "Equity","doc_count": 30,"score": 0.03386134067952249,"bg_count": 121},
{"key": "Bond","doc_count": 19,"score": 0.020086780503447175,"bg_count": 77},
{"key": "Commodity","doc_count": 3,"score": 0.01284722222222222,"bg_count": 10}]}}

nested

The purpose of this is to perform aggregation on the field of the nested object pointing by the path parameter, following grouping by the field of the parent document:

  • Example: The purpose is to sum the dividend amounts for each ETF in the cf_etf_dividend_nested index. Recall that the announcement field is in the nested object data type. To make the return result smaller, we only list the first three ETFs in the alphabetical order of the symbol:
"aggs": {"per_symbol":{"terms": {"field":"symbol", "size":3},
"aggs":{"total_dividend": { "nested": {"path":"announcement"},
"aggs": {"total_amount": {"sum": {"field": "announcement.amount"}}}}}}}
  • Resulting buckets: There are 7 dividends for the ACWF ETF and the total dividend is about 1.63133:
"per_symbol": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 7,"buckets": [
{"key": "ACWF","doc_count": 1,"total_dividend": {"doc_count": 7,"total_amount": {
"value": 1.6313300058245659}}},
{"key": "ACWI","doc_count": 1,"total_dividend": {"doc_count": 11,"total_amount": {"value": 6.54360693693161}}},
{"key": "ACWV","doc_count": 1,"total_dividend": {"doc_count": 10,"total_amount": {"value": 8.262333989143372}}}]}

reverse_nested

The purpose of this is to perform aggregation on the field from the parent document after grouping by the field from the nested object:

  • Example: The purpose is to count the total ETFs according to the category based on the grouping by the type of dividend in the cf_etf_dividend_nested index:
"aggs": {"etf_category_distribution_on_dividend_type":{"nested": {"path":"announcement"},
"aggs":{"by_dividend_type": { "terms": {"field":"announcement.type"},
"aggs": {"total_by_etf_category":{"reverse_nested":{},
"aggs": { "category":{"terms": {"field": "category"}}}}}}}}}
  • Resulting buckets: We can see that there are three dividend types: Dividend income, Long term capital gain, and Short term capital gain. In each dividend type, the ETFs are counted by category:
"etf_category_distribution_on_dividend_type": {"doc_count": 287,"by_dividend_type": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0, "buckets": [
{"key": "Dividend income","doc_count": 283,"total_by_etf_category": {"doc_count": 10,
"category": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,
"buckets": [{"key": "Bond","doc_count": 4},{"key": "Equity","doc_count": 3},{"key": "International","doc_count": 3}]}}},
{"key": "Long term capital gain","doc_count": 2,"total_by_etf_category": {doc_count": 2,
"category": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,
"buckets": [{"key": "Bond","doc_count": 2}]}}},
{"key": "Short term capital gain","doc_count": 2,"total_by_etf_category": {
"doc_count": 2,"category": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,
"buckets": [{"key": "Bond","doc_count": 2}]}}}]}}

global

The purpose of this is to perform aggregation with all documents in the index by ignoring the query:

  • Example: The purpose is to find the average rating of the iShares family and the average rating from all ETFs in the cf_etf index:
"query":{"prefix": {"family":"iShares"}},
"aggs": {"ishares_avg_rating": {"avg": {"field":"rating"}},"overall_avg_rating": {"global":{},
"aggs":{"global_avg_rating":{"avg":{"field":"rating"}}}}}
  • Resulting buckets: The average rating of the iShares family is about 3.31 and the overall average rating is about 3.25:
"overall_avg_rating": {"doc_count": 314,"global_avg_rating": {"value": 3.2489451476793247}},
"ishares_avg_rating": {"value": 3.311111111111111}

missing

The purpose of this is to count the documents with a missing field value:

  • Example: The purpose is to count how many ETFs are not rated in the cf_etf index:
"aggs": {"not_rated_count": {"missing": {"field":"rating"}}}
  • Resulting buckets: There are 77 ETFs not rated:
"not_rated_count": {"doc_count": 77}

composite

This allows us to create bucket pairs from different sources and perform paging quickly through large aggregation result sets in order:

  • Example: The purpose is to perform aggregation to count documents for each pair of values from the family and rating fields—with ratings of 4 and 5—in the cf_etf index. We have limited ourselves to 5 results in the response:
"query": {"query_string": {"query": "rating:[4 TO 5]"}},
"aggs": { "multi_sources": {"composite": {
"sources": [{"family": {"terms": {"field":"family"}}},{"rating": {"terms": {"field":"rating"}}}],"size":5}}}
  • Resulting buckets: We can see that there are 5 results of the family and rating pairs. We can use the whole after_key expression inside the composite expression for the next patch retrieval:
"multi_sources": {"after_key": {"family": "Invesco","rating": 5},"buckets": [
{"key": {"family": "AGFiQ","rating": 4},"doc_count": 1},
{"key": {"family": "First Trust","rating": 4},"doc_count": 8},
{"key": {"family": "First Trust","rating": 5},"doc_count": 4},
{"key": {"family": "Invesco","rating": 4},"doc_count": 6},
{"key": {"family": "Invesco","rating": 5},"doc_count": 5}]}

adjacency_matrix

This allows users to define a set of filters and provide bucket results in an adjacency-matrix form among the filters, which means any two filters are applied together:

  • Example: The purpose is to discover the scenario about the two filters. The first filter is to select the rating of 5 and the second filter is to select the ETFs from the iShares family in the cf_etf index:
"aggs": { "adjacency_filters": {"adjacency_matrix": {"filters": {
"rating_gte_5": {"range":{"rating":{"gte":5}}},
"family_prefix_A": {"prefix":{"family":"iShares"}}}}}}

  • Resulting buckets: There are 34 ETFs rated at 5. There are 56 ETFs in the iShares family. Only 7 ETFs from the iShares family are rated at 5:
"adjacency_filters": {"buckets": [
{"key": "family_prefix_A","doc_count": 56},
{"key": "family_prefix_A&rating_gte_5","doc_count": 7},
{"key": "rating_gte_5","doc_count": 34}]}

parent

The purpose of this is to aggregate parent documents with the join data type field:

  • Example: The purpose is to obtain a count of symbols that have dividends for the ETF in the cf_etf_dividend_join index:
"aggs": {"dividend":{ "parent": {"type": "dividend" },
"aggs": {"count":{"value_count":{"field":"symbol"}}}}}
  • Resulting buckets: Since we have only one symbol ACWF, the count value is 1:
 "aggregations": { "dividend": { "doc_count": 1, "count": { "value": 1 } } }

children

The purpose of this is to aggregate the child documents with the join data type field:

  • Example: The purpose is to obtain the total amount of dividend for each symbol in the cf_etf_dividend_join index:
 "aggs": {
"dividends_for_symbol": {
"terms": {"field": "symbol"},
"aggs": {"dividend":{ "children": {"type": "dividend" },
"aggs": {"total_dividend":{"sum":{"field":"amount"}}}}}}
}
  • Resulting buckets: Since we have only one symbol ACWF, the result will show the total dividend of it:
"aggregations": {
"dividends_for_symbol": {
"doc_count_error_upper_bound": 0, "sum_other_doc_count": 0,
"buckets": [ {
"key": "ACWF",
"doc_count": 1,
"dividend": {
"doc_count": 7,
"total_dividend": {
"value": 1.6313300058245659
} } } ] } }

geo_distance

This is similar to the range aggregation, but with the geo_point data type field applied. The purpose is to find the documents within different geo_distance ranges:

  • Example: "aggs" : {"aggregation_name" : {"geo_distance" : {"field" : "field_1","origin" : "x, y","ranges" : [{ "to" : value_1 }, { "from" : value_2, "to" : value_3 },{ "from" : value_4 }]}}}

geohash_grid

This aggregation works with the geo_point data type field. It groups documents into buckets defined by the precision (between 1 to 12; the default is 5), which is the geohash length, a well-defined dimension of grids:

  • Example: "aggs" : {"aggregation_name" : {"geohash_grid" : {"field" : "field_1"}}}

geotile_grid

This aggregation works with the geo_point data type field. It groups documents into buckets defined by the precision (between 0 and 29; the default is 7), which is a zoom level, a well-defined dimension of grids:

  • Example: "aggs" : {"aggregation_name" : {"geotile_grid" : {"field" : "field_1"}}}

In the next section, we'll cover the last family, pipeline aggregation, which is different from the first three. The second aggregation accepts the output of the previous one as the input in the consecutive aggregations.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime