Databases
Database engines, storage systems, and query optimization
Repositories
(11)apache/cassandra
ClickHouse/ClickHouse
cockroachdb/cockroach
dgraph-io/dgraph
duckdb/duckdb
facebook/rocksdb
mongodb/mongo
postgres/postgres
redis/redis
sqlite/sqlite
yugabyte/yugabyte-db
Papers
(41)A Deep Probabilistic Model for Customer Lifetime Value Prediction
Accurate predictions of customers' future lifetime value (LTV) given their attributes and past purchase behavior enables a more customer-centric marketing strategy. Marketers can segment customers into various buckets based on the predicted LTV and, ...
A Fast, Minimal Memory, Consistent Hash Algorithm
We present jump consistent hash, a fast, minimal memory, consistent hash algorithm that can be expressed in about 5 lines of code. In comparison to the algorithm of Karger et al., jump consistent hash requires no storage, is faster, and does a better...
A Parallel Space Saving Algorithm For Frequent Items and the Hurwitz zeta distribution
We present a message-passing based parallel version of the Space Saving algorithm designed to solve the $k$--majority problem. The algorithm determines in parallel frequent items, i.e., those whose frequency is greater than a given threshold, and is ...
A simple algorithm for finding frequent elements in streams and bags
We present a simple, exact algorithm for identifying in a multiset the items with frequency more than a threshold θ. The algorithm requires two passes, linear time, and space 1/θ. The first pass is an on-line algorithm, generalizing a well-known algo...
Computing Extremely Accurate Quantiles Using t-Digests
We present on-line algorithms for computing approximations of rank-based statistics that give high accuracy, particularly near the tails of a distribution, with very small sketches. Notably, the method allows a quantile $q$ to be computed with an acc...
Design and Analysis of a Logless Dynamic Reconfiguration Protocol
Distributed replication systems based on the replicated state machine model have become ubiquitous as the foundation of modern database systems. To ensure availability in the presence of faults, these systems must be able to dynamically replace faile...
Dynamic programming strikes back
Two highly efficient algorithms are known for optimally ordering joins while\navoiding cross products:\nDPccp, which is based on dynamic programming, and Top-Down Partition Search, \nbased on memoization.\nBoth have two severe limitations:\nThey hand...
Efficiency of a Good But Not Linear Set Union Algorithm
We questioned whether the respiratory muscles of humans contribute to systemic oxidative stress following inspiratory flow-resistive breathing, whether the amount of oxidative stress is influenced by the level of resistive load, and whether the amoun...
Efficient Evaluation of Arbitrarily-Framed Holistic SQL Aggregates and Window Functions
Window functions became part of the SQL standard in SQL:2003 and are widely used for data analytics: Percentiles, rankings, moving averages, running sums and local maxima are all expressed as window functions in SQL. Yet, the features offered by SQL'...
Faster-Than-Native Alternatives for x86 VP2INTERSECT Instructions
We present faster-than-native alternatives for the full AVX512-VP2INTERSECT instruction subset using basic AVX512F instructions. These alternatives compute only one of the output masks, which is sufficient for the typical case of computing the inters...
Gorilla: A Fast, Scalable, In-Memory Time Series Database
Large-scale internet services aim to remain highly available and responsive in the presence of unexpected failures. Providing this service often requires monitoring and analyzing tens of millions of measurements per second across a large number of sy...
Large-Scale Metric Computation in Online Controlled Experiment Platform
Online controlled experiment (also called A/B test or experiment) is the most important tool for decision-making at a wide range of data-driven companies like Microsoft, Google, Meta, etc. Metric computation is the core procedure for reaching a concl...
Note on a Method for Calculating Corrected Sums of Squares and Products
"Note on a Method for Calculating Corrected Sums of Squares and Products." Technometrics, 4(3), pp. 419–420
One WITH RECURSIVE is Worth Many GOTOs
PL/SQL integrates an imperative statement-by-statement style of programming with the plan-based evaluation of SQL queries. The disparity of both leads to friction at runtime, slowing PL/SQL execution down significantly. This work describes a compiler...
PASE: PostgreSQL Ultra-High-Dimensional Approximate Nearest Neighbor Search Extension
Similarity search has been widely used in various fields, particularly in the Alibaba ecosystem. The open-source solutions to a similarity search of vectors can only support a query with a single vector, whereas real-life scenarios generally require ...
Practical String Dictionary Compression Using String Dictionary Encoding
A string dictionary is a data structure for storing a set of strings that maps them to unique IDs. It can manage string data in compact space by encoding them into integers. However, instances have recently emerged in practice where the size of strin...
Predicate Caching: Query-Driven Secondary Indexing for Cloud Data Warehouses
Cloud data warehouses are today's standard for analytical query processing. Multiple cloud vendors offer state-of-the-art systems, such as Amazon Redshift. We have observed that customer workloads experience highly repetitive query patterns, i.e., us...
Random sampling from hash files
In this paper we discuss simple random sampling from hash files on secondary storage. We consider both iterative and batch sampling algorithms from both static and dynamic hashing methods. The static methods considered are open addressing hash files ...
TinyLFU: A Highly Efficient Cache Admission Policy
This paper proposes to use a frequency based cache admission policy in order to boost the effectiveness of caches subject to skewed access distributions. Given a newly accessed item and an eviction candidate from the cache, our scheme decides, based ...