An efficient algorithm for sequential random sampling
Abstract
We examine several methods for drawing a sequential random sample of n records from a file containing N records. Method D is recommended for general use. The algorithm is on-line (so that CPU time can be overlapped with I/O), has a small constant memory requirement, and is easy to program. An improved implementation is detailed in the Appendix.
Code References
freebsd/freebsd-src
1 file
contrib/jemalloc/doc_internal/PROFILING_INTERNALS.md
1
Compared to our fast paths, even a `coinflip(p)` function can be quite expensive. Having to do a random-number generation and some floating point operations would be a sizeable relative cost. However (as pointed out in [[Vitter, 1987](https://dl.acm.org/doi/10.1145/23002.23003)]), if we can orchestrate our algorithm so that many of our `coinflip` calls share their parameter value, we can do better. We can sample from the geometric distribution, and initialize a counter with the result. When the counter hits 0, the `coinflip` function returns true (and reinitializes its internal counter).
Link copied to clipboard!