Databases
Database engines, storage systems, and query optimization
Repositories
(11)apache/cassandra
ClickHouse/ClickHouse
cockroachdb/cockroach
dgraph-io/dgraph
duckdb/duckdb
facebook/rocksdb
mongodb/mongo
postgres/postgres
redis/redis
sqlite/sqlite
yugabyte/yugabyte-db
Papers
(41)A contextual-bandit approach to personalized news article recommendation
Personalized web services strive to adapt their services (advertisements, news articles, etc.) to individual users by making use of both content and user information. Despite a few recent advances, this problem remains challenging for at least two re...
Algorithms for geodesics
Algorithms for the computation of geodesics on an ellipsoid of revolution are given. These provide accurate, robust, and fast solutions to the direct and inverse geodesic problems and they allow differential and integral properties of geodesics to be...
A Study of Replacement Algorithms for Virtual-Storage Computer
One of the basic limitations of a digital computer is the size of its available memory. 1 In most cases, it is neither feasible nor economical for a user to insist that every problem program fit into memory. The number of words of information in a pr...
A Tutorial on Thompson Sampling
Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may ...
Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.
We present a new approach for the approximate K-nearest neighbor search based on navigable small world graphs with controllable hierarchy (Hierarchical NSW, HNSW). The proposed solution is fully graph-based, without any need for additional search str...
Elle: Inferring Isolation Anomalies from Experimental Observations
Users who care about their data store it in databases, which (at least in principle) guarantee some form of transactional isolation. However, experience shows [Kleppmann 2019, Kingsbury and Patella 2019a] that many databases do not provide the isolat...
Less hashing, same performance: Building a better Bloom filter
Abstract A standard technique from the hashing literature is to use two hash functions h 1 ( x ) and h 2 ( x ) to simulate additional hash functions of the form g i ( x ) = h 1 ( x ) + i h 2 ( x ). We demonstrate that this technique can be usefully a...
Low-Latency Sliding-Window Aggregation in Worst-Case Constant Time
Sliding-window aggregation is a widely-used approach for extracting insights from the most recent portion of a data stream. The aggregations of interest can usually be cast as binary operators that are associative, but they are not necessarily commut...
New cardinality estimation algorithms for HyperLogLog sketches
This paper presents new methods to estimate the cardinalities of data sets recorded by HyperLogLog sketches. A theoretically motivated extension to the original estimator is presented that eliminates the bias for small and large cardinalities. Based ...
Novel Table Lookup-Based Algorithms for High-Performance CRC Generation
A framework for designing a family of novel fast cyclic redundancy code (CRC) generation algorithms is presented. Our algorithms can ideally read arbitrarily large amounts of data at a time, while optimizing their memory requirement to meet the const...
On the correct and complete enumeration of the core search space
Reordering more than traditional joins (e.g. outerjoins, antijoins) requires some care, since not all reorderings are valid. To prevent invalid plans, two approaches have been described in the literature. We show that both approaches still produce in...
RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search
Searching for approximate nearest neighbors (ANN) in the high-dimensional Euclidean space is a pivotal problem. Recently, with the help of fast SIMD-based implementations, Product Quantization (PQ) and its variants can often efficiently and accuratel...
R-trees: a dynamic index structure for spatial searching
In order to handle spatial data efficiently, as required in computer aided design and geo-data applications, a database system needs an index mechanism that will help it retrieve data items quickly according to their spatial locations However, tradit...
Serializable isolation for snapshot databases
Many popular database management systems implement a multiversion concurrency control algorithm called snapshot isolation rather than providing full serializability based on locking. There are well-known anomalies permitted by snapshot isolation that...
The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles.
The R-tree, one of the most popular access methods for rectangles, is based on the heuristic optimization of the area of the enclosing rectangle in each inner node. By running numerous experiments in a standardized testbed under highly varying data, ...
The serializability of concurrent database updates
ABSTRACT A sequence of interleaved user transactions in a database system may not be ser:ahzable, t e, equivalent to some sequential execution of the individual transactions Using a simple transaction model, it ~s shown that recognizing the transacti...
Weighted Random Sampling over Data Streams
In this work, we present a comprehensive treatment of weighted random sampling (WRS) over data streams. More precisely, we examine two natural interpretations of the item weights, describe an existing algorithm for each case ([2, 4]), discuss samplin...