Yahoo recently open-sourced its big data search engine, Vespa. It looks like a good alternative to Elasticsearch (ES) as a search store. Here are notes from our feasibility check of Vespa, comparing it with Elasticsearch.

Vespa is a search engine that supports real-time writes and automatic data distribution, and is focused on serving big data. Queries support aggregation and ranking. Vespa is elastic and supports automatic recovery of nodes.

Setup

The Vespa documentation doesn’t yet have a natural flow. It is vast and detailed, covering much of what Vespa already supports, but finding basic details takes some searching and browsing.

Vespa has the concept of applications. An application gets deployed to the Vespa cluster. Setting up an index is not as intuitive as a REST API call: we need to pass the search definition and the hosts configuration to Vespa through its deploy API. Every index has one search definition (similar to a mapping in ES), and one application can host multiple search definitions.

A typical application package for Vespa includes:

  • services.xml: configures Vespa. Similar to index _settings in Elasticsearch: it covers the replication factor, the data grouping (shards) strategy, and how Vespa services are laid out across the nodes of a cluster.
  • hosts.xml: the nodes that host the cluster for this application, with their aliases.
  • searchdefinitions: the schema for the documents. Similar to _mapping in ES.
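As a rough illustration of that layout, here is a minimal Python sketch that checks a directory for the three parts listed above before deploying. The function name is hypothetical, and the assumption that each search definition is a *.sd file inside searchdefinitions/ is mine; adjust it to your package.

```python
from pathlib import Path

REQUIRED_FILES = ("services.xml", "hosts.xml")
SCHEMA_DIR = "searchdefinitions"


def missing_package_parts(package_dir):
    """Return the expected entries that are absent from package_dir."""
    root = Path(package_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    schema_dir = root / SCHEMA_DIR
    # Assumption: each search definition is a *.sd file in this directory.
    if not schema_dir.is_dir() or not any(schema_dir.glob("*.sd")):
        missing.append(SCHEMA_DIR + "/*.sd")
    return missing
```

An empty list means the package has the basic shape described above; it says nothing about whether the contents are valid.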

External apps can communicate with Vespa over HTTP REST APIs, as with ES. But one can also use the message bus.

Architecture

A Vespa cluster consists of content and container nodes.

  • content: these nodes run the Proton core and host the index that stores the data. Query execution is distributed and can run in parallel across the content nodes of the cluster.
  • container: these nodes host query handlers, searchers and document processors. Every query is handled by a container, which internally routes it to the content nodes. Vespa provides a set of components out of the box; we can also build custom components for application-specific handling of requests.

Indexing

Proton stores data in three different stores:

  • index: Document index, Dictionaries, postings and B-tree structures.
  • attribute: In-memory data structure for query, aggregation and ranking.
  • summary: The document to be returned in the result, like _source in ES. Vespa also supports compressing the summary for documents.

Vespa supports various APIs for CRUD operations on the data.
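As one example of the write path, the sketch below builds (but does not send) an HTTP request that creates a document. It assumes the /document/v1/ REST path layout and the {"fields": ...} JSON body as I understand them from the Vespa docs; the host, port, namespace, document type and field names are all hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_create_request(base_url, namespace, doc_type, doc_key, fields):
    """Build, without sending, a request that writes one document."""
    # Assumed path layout: /document/v1/<namespace>/<doctype>/docid/<key>
    path = "/document/v1/{}/{}/docid/{}".format(
        quote(namespace, safe=""), quote(doc_type, safe=""), quote(doc_key, safe="")
    )
    body = json.dumps({"fields": fields}).encode("utf-8")
    return Request(
        base_url.rstrip("/") + path,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Passing the result to urllib.request.urlopen would send it to a running container node, for example build_create_request("http://localhost:8080", "mynamespace", "music", "doc1", {"title": "Blue"}).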

Distribution

Vespa automatically distributes data in buckets, like ES. So we don’t need to take care of distribution manually, as with Solr. Vespa document ids also support defining a logical bucket to route the document(s) to.
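To make the logical bucket idea concrete, here is a small sketch that composes a document id with an optional location part. It assumes the id scheme id:<namespace>:<document-type>:<key/value pair>:<user-specified part>, where g=<group> or n=<number> selects the location, as I read it in the Vespa docs; the helper name and sample values are hypothetical.

```python
def make_document_id(namespace, doc_type, user_key, group=None, number=None):
    """Compose a document id; documents sharing a group or number are co-located."""
    if group is not None and number is not None:
        raise ValueError("use either group or number, not both")
    if group is not None:
        location = "g={}".format(group)
    elif number is not None:
        location = "n={}".format(int(number))
    else:
        # No location given: the key/value part of the id stays empty.
        location = ""
    return "id:{}:{}:{}:{}".format(namespace, doc_type, location, user_key)
```

For instance, make_document_id("shop", "order", "o-1", group="customer42") gives id:shop:order:g=customer42:o-1, so all orders of that customer would be routed to the same logical bucket.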

It distributes data over a set of nodes, with a certain replication factor, in multiple groups (like shards in ES). Unlike ES, the distribution can change over time, as the Vespa search core, Proton, dynamically redistributes data.
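The effect of redistribution can be pictured with a toy model. The sketch below is not Vespa's actual distribution algorithm; it uses rendezvous (highest-random-weight) hashing as a stand-in to show how buckets can be placed on nodes with a replication factor, and how adding a node moves only some of the buckets. All names are hypothetical.

```python
import hashlib


def _score(bucket, node):
    """Deterministic pseudo-random weight for a (bucket, node) pair."""
    digest = hashlib.sha256("{}/{}".format(bucket, node).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def replicas_for(bucket, nodes, redundancy):
    """Pick `redundancy` nodes for a bucket: the ones with the highest weight."""
    ranked = sorted(nodes, key=lambda node: _score(bucket, node), reverse=True)
    return ranked[:redundancy]


def moved_buckets(buckets, old_nodes, new_nodes, redundancy):
    """Buckets whose replica set differs between the two node lists."""
    return [
        b for b in buckets
        if set(replicas_for(b, old_nodes, redundancy))
        != set(replicas_for(b, new_nodes, redundancy))
    ]
```

In this model, growing from three to four nodes only moves the buckets for which the new node ranks among the top replicas; every other bucket stays where it was.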

Searching

Vespa supports marking fields as attributes with fast-search or fast-access. The container node acts as middleware for the query: it passes the query to one or more content nodes, where the two-phase query execution happens on matched documents. Content nodes also execute ranking, based on the ranking function defined in the search definition. Vespa has a Search API for executing query requests.
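A toy model may help picture this flow. The sketch below is my reading of two-phase execution, not Vespa code: in the first phase each content node matches and ranks its own documents and returns only ids and scores; in the second phase the container fetches the full document (the summary) for the global top hits only. Nodes are modelled as plain dicts, and all names are hypothetical.

```python
import heapq


def search_node(node_docs, matches, rank, hits):
    """Phase 1 on one content node: match, rank, return the top (score, doc_id) pairs."""
    scored = ((rank(doc), doc_id) for doc_id, doc in node_docs.items() if matches(doc))
    return heapq.nlargest(hits, scored)


def two_phase_query(nodes, matches, rank, hits):
    """Container side: scatter the query, merge, then fill in summaries."""
    # Phase 1: gather each node's local top hits and keep the global best.
    partial = [pair for node_docs in nodes
               for pair in search_node(node_docs, matches, rank, hits)]
    top = heapq.nlargest(hits, partial)
    # Phase 2: load the stored document only for the hits that will be returned.
    owner = {doc_id: node_docs for node_docs in nodes for doc_id in node_docs}
    return [{"id": doc_id, "score": score, "summary": owner[doc_id][doc_id]}
            for score, doc_id in top]
```

The point of the split is that summaries, which can be large, are read only for the few documents that make it into the final result.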

It achieves low-latency computation over large data sets by pushing the execution down to the data layer. It supports advanced ranking and matching operations. One can also add a custom Searcher layer for application-specific operations.

Aggregation

The Vespa query language supports aggregation through grouping on data. However, Vespa is not optimised for aggregations, unlike ES. Write and search latency in Vespa is lower than in ES, but aggregations are slower; ES is still the fastest at aggregations.
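For reference, a grouping request might be assembled as in the sketch below. It assumes the YQL grouping syntax all(group(...) max(...) precision(...) each(output(count()))) and the /search/ endpoint as I understand them from the Vespa docs, and it already applies the options discussed in the checklist that follows (limit 0, ranking=unranked, groupingSessionCache). The helper name, the field names and the trailing semicolon on the YQL are assumptions.

```python
from urllib.parse import urlencode


def build_grouping_url(base_url, where, group_field, max_groups, precision):
    """Build a search URL that returns grouping buckets and no hits."""
    grouping = "all(group({}) max({}) precision({}) each(output(count())))".format(
        group_field, int(max_groups), int(precision)
    )
    # limit 0: we only want the buckets, so no hits need to be filled in.
    yql = "select * from sources * where {} limit 0 | {};".format(where, grouping)
    params = {
        "yql": yql,
        "ranking": "unranked",           # constant score, skips ranking work
        "groupingSessionCache": "true",  # reuse grouping state between passes
    }
    return base_url.rstrip("/") + "/search/?" + urlencode(params)
```

A call such as build_grouping_url("http://localhost:8080", 'title contains "phone"', "category", 10, 100) yields a URL that can be fetched with any HTTP client.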

Checklist for Aggregations

  • Always add max(x) in the group to limit the number of buckets to what is needed. When data is distributed across multiple content nodes the result can be inaccurate; add precision(x) as well to tune the accuracy as needed.
  • If you only need aggregation buckets and no hits, pass limit 0 in the YQL; this saves the step of loading summaries to return to the container.
  • The attribute fields we filter or aggregate on should be in fast-search mode; otherwise the attribute has no B-tree-like index and has to be traversed.
  • Ensure a constant score for docs by passing &ranking=unranked in the query.
  • Enable groupingSessionCache: http://docs.vespa.ai/documentation/reference/search-api-reference.html#groupingSessionCache
  • Size the content nodes for the tradeoff of latency vs. number of docs, using max-hits as described: http://docs.vespa.ai/documentation/performance/sizing-search.html
  • If memory is the bottleneck, one can look at the attribute flush strategy configuration: http://docs.vespa.ai/documentation/proton.html#proton-maintenance-jobs
  • If CPU is the bottleneck, increase parallelism. Ensure all cores are used in Searcher: http://docs.vespa.ai/documentation/content/setup-proton-tuning.html#requestthreads-persearch The change for that in services.xml is <persearch>16</persearch>; threads persearch defaults to 1.
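The max(x) and precision(x) item is easier to see with a toy model. The sketch below is not Vespa code; it assumes, as a simplification, that each content node returns only its `precision` largest groups and that the container merges them and keeps the top `max_groups`. All names and data are made up.

```python
from collections import Counter


def node_groups(values, precision):
    """One content node: count local values, keep only its `precision` largest groups."""
    return Counter(values).most_common(precision)


def merged_groups(nodes, max_groups, precision):
    """Container: sum the per-node partial counts, return the top `max_groups`."""
    total = Counter()
    for values in nodes:
        for group, count in node_groups(values, precision):
            total[group] += count
    return total.most_common(max_groups)


# Group "b" is the largest overall (8) but is never the largest on a single node.
nodes = [["a"] * 5 + ["b"] * 4, ["c"] * 6 + ["b"] * 4]
low = merged_groups(nodes, max_groups=1, precision=1)   # [("c", 6)]
high = merged_groups(nodes, max_groups=1, precision=2)  # [("b", 8)]
```

With precision 1 each node drops "b" before the merge, so the container reports the wrong top group; raising precision lets enough partial counts through to get the right answer, at the cost of more data sent per node.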

References:

  1. http://docs.vespa.ai/
  2. https://docs.google.com/presentation/d/1-LWFidciPJ5CQj3Mo3bLNyUvUfzVsor8DQXzf8cmBDY/edit