Benchmarking nearest neighbors


Fast nearest-neighbor search in high-dimensional spaces is an increasingly important problem, but so far there have been few empirical attempts to compare approaches in an objective way.


This project contains tools to benchmark various implementations of approximate nearest neighbor (ANN) search for different metrics. We have pregenerated datasets (in HDF5 format) and Docker containers for each algorithm, along with a test suite that makes sure every algorithm works.


Evaluated

Data sets

We have a number of precomputed data sets. All data sets are pre-split into train/test and come with ground truth data in the form of the top 100 neighbors. We store them in HDF5 format (see the loading sketch after the table):


| Dataset | Dimensions | Train size | Test size | Neighbors | Distance | Download |
| ----------------------------------------------------------------- | ---------: | ---------: | --------: | --------: | --------- | -------------------------------------------------------------------------- |
| DEEP1B | 96 | 9,990,000 | 10,000 | 100 | Angular | HDF5 (3.6GB) |
| Fashion-MNIST | 784 | 60,000 | 10,000 | 100 | Euclidean | HDF5 (217MB) |
| GIST | 960 | 1,000,000 | 1,000 | 100 | Euclidean | HDF5 (3.6GB) |
| GloVe | 25 | 1,183,514 | 10,000 | 100 | Angular | HDF5 (121MB) |
| GloVe | 50 | 1,183,514 | 10,000 | 100 | Angular | HDF5 (235MB) |
| GloVe | 100 | 1,183,514 | 10,000 | 100 | Angular | HDF5 (463MB) |
| GloVe | 200 | 1,183,514 | 10,000 | 100 | Angular | HDF5 (918MB) |
| Kosarak | 27,983 | 74,962 | 500 | 100 | Jaccard | HDF5 (2.0GB) |
| MNIST | 784 | 60,000 | 10,000 | 100 | Euclidean | HDF5 (217MB) |
| NYTimes | 256 | 290,000 | 10,000 | 100 | Angular | HDF5 (301MB) |
| SIFT | 128 | 1,000,000 | 10,000 | 100 | Euclidean | HDF5 (501MB) |
| Last.fm | 65 | 292,385 | 50,000 | 100 | Angular | HDF5 (135MB) |
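For reference, a dataset can be inspected with h5py. The following is a minimal sketch, not part of this repo; the key names ('train', 'test', 'neighbors', 'distances') are assumptions about the file layout, so check an actual file with h5py before relying on them:

```python
# Hypothetical loading sketch for one of the pregenerated HDF5 datasets.
# The key names below are assumptions, not documented above.
import h5py

with h5py.File("fashion-mnist-784-euclidean.hdf5", "r") as f:
    train = f["train"][:]          # vectors to index
    test = f["test"][:]            # query vectors
    neighbors = f["neighbors"][:]  # ground truth: indices of the top 100 neighbors per query
    distances = f["distances"][:]  # distances to those ground-truth neighbors

print(train.shape, test.shape, neighbors.shape)
```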


Results

All results are as of 2020-07-12, with every benchmark run on a c5.4xlarge machine on AWS with --parallelism set to 3:


Plots for glove-100-angular, sift-128-euclidean, fashion-mnist-784-euclidean, lastfm-64-dot, nytimes-256-angular, and glove-25-angular.


Install

The only prerequisites are Python (tested with 3.6) and Docker.



  1. Clone the repo.

  2. Run pip install -r requirements.txt.

  3. Run python install.py to build all the libraries inside Docker containers (this can take a while, like 10-30 minutes).


Running

  1. Run python run.py (this can take an extremely long time, potentially days).

  2. Run python plot.py or python create_website.py to plot results.


You can customize which algorithms and datasets are run, as described below.



Including your algorithm

  1. Add your algorithm into ann_benchmarks/algorithms by providing a small Python wrapper (a sketch follows this list).

  2. Add a Dockerfile in install/ for it.

  3. Add it to algos.yaml.

  4. Add it to .github/workflows/benchmarks.yml.
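For illustration, here is a minimal sketch of what such a wrapper could look like. It assumes a BaseANN base class in ann_benchmarks/algorithms/base.py with fit(X) and query(q, n) methods; the class and method names are assumptions, so check the existing wrappers for the authoritative interface.

```python
# Hypothetical wrapper sketch. BaseANN, fit(X), and query(q, n) mirror the
# interface used by the existing wrappers, but check base.py for the
# authoritative signatures before copying this.
import numpy as np

from ann_benchmarks.algorithms.base import BaseANN


class BruteForceExample(BaseANN):
    """Toy exact-search 'algorithm' showing the wrapper shape."""

    def __init__(self, metric):
        self._metric = metric
        self.name = "BruteForceExample(metric=%s)" % metric

    def fit(self, X):
        # Index-build step; brute force just keeps the data around.
        self._data = np.asarray(X)

    def query(self, q, n):
        # Return indices of the n nearest training vectors to q.
        if self._metric == "angular":
            norms = np.linalg.norm(self._data, axis=1) * np.linalg.norm(q)
            dists = 1.0 - self._data.dot(q) / norms
        else:  # assume euclidean
            dists = np.linalg.norm(self._data - q, axis=1)
        return np.argsort(dists)[:n]
```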


Principles

Authors

Built by Erik Bernhardsson with significant contributions from Martin Aumüller and Alexander Faithfull.


Related Publication

The following publication details design principles behind the benchmarking framework:

M. Aumüller, E. Bernhardsson, A. Faithfull: ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. Information Systems 2019.