
How We Built a Vectorized SQL Engine - Cockroach Labs


In this blog post, we use example code to discuss how we built the new engine and why it results in up to a 4x speed improvement on an industry-standard benchmark.

OLTP databases, including CockroachDB, store data in contiguous rows on disk and process queries a row of data at a time. Using vectorized processing in an execution engine makes more efficient use of modern CPUs by changing the data orientation (from rows to columns) to get more out of the CPU cache and deep instruction pipelines by operating on batches of data at a time.

In our research into vectorized execution, we came across MonetDB/X100: Hyper-Pipelining Query Execution, a paper that outlines the performance deficiencies of the row-at-a-time Volcano execution model that CockroachDB’s original execution engine was built on. When executing queries on a large number of rows, a row-oriented execution engine pays a high cost in interpretation and evaluation overhead per tuple and doesn’t take full advantage of the efficiencies of modern CPUs. Given the key-value storage architecture of CockroachDB, we knew we couldn’t store data in columnar format, but we wondered whether converting rows to batches of columnar data after reading them from disk, and then feeding those batches into a vectorized execution engine, would improve performance enough to justify building and maintaining a new execution engine.

To quantify the performance improvements, and to test the ideas laid out in the paper, we built a vectorized execution engine prototype, which yielded some impressive results. Below, we take an example query, analyze its performance in a toy, row-at-a-time execution engine, and then explore and implement improvements inspired by the ideas proposed in the MonetDB/X100 paper.
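To make the Volcano model concrete, here is a minimal sketch (not CockroachDB's actual code) of what a row-at-a-time operator might look like in Go. The names `Row`, `Operator`, `mulOperator`, and `sliceSource` are illustrative; the point is the per-tuple interface call and type assertions, which are exactly the interpretation overhead the MonetDB/X100 paper calls out:

```go
package main

import "fmt"

// Row is one tuple; each column is boxed in an interface{},
// which forces a type assertion on every access.
type Row []interface{}

// Operator is the Volcano-style iterator interface: each call
// to Next returns a single processed row, or nil when exhausted.
type Operator interface {
	Next() Row
}

// mulOperator multiplies two int64 columns of its input, one row
// per Next call.
type mulOperator struct {
	input                    Operator
	arg1Idx, arg2Idx, outIdx int
}

func (m *mulOperator) Next() Row {
	row := m.input.Next()
	if row == nil {
		return nil
	}
	// The virtual call above plus these per-row type assertions
	// are paid for every single tuple processed.
	row[m.outIdx] = row[m.arg1Idx].(int64) * row[m.arg2Idx].(int64)
	return row
}

// sliceSource feeds a fixed set of rows, one per Next call.
type sliceSource struct {
	rows []Row
	i    int
}

func (s *sliceSource) Next() Row {
	if s.i >= len(s.rows) {
		return nil
	}
	r := s.rows[s.i]
	s.i++
	return r
}

func main() {
	src := &sliceSource{rows: []Row{
		{int64(2), int64(3), int64(0)},
		{int64(4), int64(5), int64(0)},
	}}
	op := &mulOperator{input: src, arg1Idx: 0, arg2Idx: 1, outIdx: 2}
	for row := op.Next(); row != nil; row = op.Next() {
		fmt.Println(row[2])
	}
}
```

Multiplying 65,536 rows with a tree of such operators means 65,536 interface calls and type assertions per operator, which is the cost the vectorized rewrite below attacks.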
These tokens’ declarations are wrapped in template comments and removed in the final generated file. For example, the multiplication function (_MULFN) is converted to a method call with the same arguments: MulFn is called when executing the template, and returns the Go code to perform the multiplication according to type-specific information. Now that the code is a little more manageable and extensible, let’s try to improve the performance further.

NOTE: To make the code in the rest of this blog post easier to read, we won’t use code generation for the following operator rewrites.

Repeating our benchmarking process from before shows us some useful next steps. This part of the profile shows that approximately half of the time spent in the function is spent in the call on line 13 above. Full code examples can be found in row_based_typed_batch.go. With this batching change, the benchmarks run nearly 3x faster (and 5.5x faster than the original implementation). But we are still a long way from our “speed of light” performance of 19 microseconds per operation.

For a fuller treatment of pipelining, branch prediction, and CPU caches, see Dan Luu’s branch prediction talk notes, his CPU cache blog post, or Dave Cheney’s notes from his High Performance Go Workshop.

The code below shows how we could make the loop and data orientation changes described above, and also defines a few new types to make the code easier to work with. The reason we introduced the new vector type is so that we could have one struct that can represent a batch of data of any type.
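The loop and data-orientation changes can be sketched as follows. This is a simplified illustration, not the engine's real code: the names `vector`, `batch`, `batchSize`, and `mulInt64` are assumptions, and only the int64 case is shown (the real engine generates a variant per type). The essential change is that a whole column is a contiguous slice, and the multiply is a tight loop over it:

```go
package main

import "fmt"

// batchSize is the number of rows processed per operator call.
const batchSize = 1024

// vector holds a batch of values of a single column. In the full
// design it would carry one concrete slice per supported type
// (produced by code generation); only int64 is shown here.
type vector struct {
	col []int64
}

// batch is a collection of column vectors plus the number of
// rows currently in use.
type batch struct {
	cols []vector
	n    int
}

// mulInt64 multiplies two columns element-wise into a third.
// Operating on a whole batch amortizes per-call interpretation
// overhead across batchSize rows, and the simple loop over
// contiguous slices is friendly to CPU caches and to the deep
// instruction pipelines discussed above.
func mulInt64(b *batch, arg1, arg2, out int) {
	src1 := b.cols[arg1].col[:b.n]
	src2 := b.cols[arg2].col[:b.n]
	dst := b.cols[out].col[:b.n]
	for i := range dst {
		dst[i] = src1[i] * src2[i]
	}
}

func main() {
	b := &batch{
		cols: []vector{
			{col: []int64{2, 4, 6}},
			{col: []int64{3, 5, 7}},
			{col: make([]int64, 3)},
		},
		n: 3,
	}
	mulInt64(b, 0, 1, 2)
	fmt.Println(b.cols[2].col)
}
```

Because the inner loop body has no interface calls, type assertions, or branches, the compiler and CPU can keep the multiply pipeline full, which is where the bulk of the speedup over the row-at-a-time version comes from.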
For the purposes of this post, we will stop our optimization efforts here, but we are always looking for ways to make our real vectorized engine faster.

By analyzing profiles of our toy execution engine’s code and employing the ideas proposed in the MonetDB/X100 paper, we were able to identify performance problems and implement solutions that improved the performance of multiplying 65,536 rows by a factor of 20x.

By Alfonso Subiotto Marques