Papers
Browse academic papers referenced in production code
Permutation Tests for Studying Classifier Performance
We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assess whether the classifier has found a real class structure in the data; the c...
Printing floating-point numbers quickly and accurately with integers
We present algorithms for accurately converting floating-point numbers to decimal representation. They are fast (up to 4 times faster than commonly used algorithms that use high-precision integers) and correct: any printed number will evaluate to the...
Three layer cake for shared-memory programming
There are many different styles of parallel programming for shared-memory hardware. Each style has strengths, but can conflict with other styles. How can we use a variety of these styles in one program and minimize their conflict and maximize perform...
Feature hashing for large scale multitask learning.
Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random s...
Performance Evaluation of RANSAC Family.
RANSAC (Random Sample Consensus) has been popular in regression problem with samples contaminated with outliers. It has been a milestone of many researches on robust estimators, but there are a few survey and performance analysis on them. This paper ...
Rank-Balanced Trees
Since the invention of AVL trees in 1962, many kinds of binary search trees have been proposed. Notable are red-black trees, in which bottom-up rebalancing after an insertion or deletion takes O(1) amortized time and O(1) rotations worst-case. But th...
Shrinkage Algorithms for MMSE Covariance Estimation
We address covariance estimation in the sense of minimum mean-squared error (MMSE) for Gaussian samples. Specifically, we consider shrinkage methods which are suitable for high dimensional problems with a small number of samples (large p small n). Fi...
Work-first and help-first scheduling policies for async-finish task parallelism
Multiple programming models are emerging to address an increased need for dynamic task parallelism in applications for multicore processors and shared-address-space parallel computing. Examples include OpenMP 3.0, Java Concurrency Utilities, Microsof...
Dynamic programming strikes back
Two highly efficient algorithms are known for optimally ordering joins while\navoiding cross products:\nDPccp, which is based on dynamic programming, and Top-Down Partition Search, \nbased on memoization.\nBoth have two severe limitations:\nThey hand...
Emulation of a FMA and Correctly Rounded Sums: Proved Algorithms Using Rounding to Odd
International audience
Inter-Coder Agreement for Computational Linguistics
This article is a survey of methods for measuring agreement among corpus annotators. It exposes the mathematics and underlying assumptions of agreement coefficients, covering Krippendorff's alpha as well as Scott's pi and Cohen's kappa; discusses the...
MapReduce: simplified data processing on large clusters
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the und...
Packrat parsers can support left recursion.
Packrat parsing offers several advantages over other parsing techniques, such as the guarantee of linear parse times while supporting backtracking and unlimited look-ahead. Unfortunately, the limited support for left recursion in packrat parser imple...
Sparse inverse covariance estimation with the graphical lasso.
Abstract We consider the problem of estimating sparse graphs by a lasso penalty applied to the inverse covariance matrix. Using a coordinate descent procedure for the lasso, we develop a simple algorithm—the graphical lasso—that is remarkably fast: I...
The keyed-hash message authentication code (HMAC)
This Standard describes a keyed-hash message authentication code (HMAC), a mechanism for message authentication using cryptographic hash functions. HMAC can be used with any iterative Approved cryptographic hash function, in combination with a shared...
Visualizing Data using t-SNE
Tese de mestrado integrado. Engenharia Electrotécnica e de Computadores. Faculdade de Engenharia. Universidade do Porto. 2008
Gaussian random number generators
Rapid generation of high quality Gaussian random numbers is a key capability for simulations across a wide range of disciplines. Advances in computing have brought the power to conduct simulations with very large numbers of random numbers and with it...