ncd-tree: text similarity search using normalized compression distance and VP trees

This is a package candidate release! Here you can preview how this package release will appear once published to the main package index (which can be accomplished via the 'maintain' link below). Please note that once a package has been published to the main package index it cannot be undone! Please consult the package uploading documentation for more information.

[maintain] [Publish]

Warnings:

ncd-tree is a Haskell library that implements a data structure and query layer for efficient similarity search based on the Normalized Compression Distance (NCD) metric and Vantage Point (VP) trees. It allows users to store and query large datasets of text documents, enabling fast retrieval of similar texts based on their compressed representations. This library is particularly useful for applications such as document clustering and recommendation systems. NCD is a parameter-free, universal similarity metric based on information distance, which quantifies how similar two objects are by measuring the length of their compressed concatenation relative to their individual compressed lengths. VP trees are a type of metric tree that organizes data points in a way that allows for efficient nearest neighbor searches in metric spaces.


[Skip to Readme]

Properties

Versions 0.1.0.0
Change log CHANGELOG.md
Dependencies base (>=4.7 && <5), bytestring, heaps, vector, vector-algorithms, zlib [details]
License BSD-3-Clause
Copyright 2025 Marco Zocca
Author Marco Zocca
Maintainer ocramz
Category Data
Home page https://github.com/ocramz/ncd-tree
Source repo head: git clone https://github.com/ocramz/ncd-tree
Uploaded by ocramz at 2025-12-23T13:02:36Z

Modules

Downloads

Maintainer's Corner

Package maintainers

For package maintainers and hackage trustees


Readme for ncd-tree-0.1.0.0

[back to package description]

ncd-tree

Tests

Overview

ncd-tree is a Haskell library for similarity-based nearest neighbor search using the Normalized Compression Distance (NCD) metric combined with Vantage-Point (VP) trees.

What it does

Why NCD?

The Normalized Compression Distance is a universal distance metric that works for any type of data (text, sequences, etc.) without requiring domain-specific features or models. It's based on the intuition that the most efficient way to describe the similarity between two objects is the length of the shortest program that computes one object given the other.

For two documents x and y:

NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))

where C(s) is the compressed size of s using gzip.

This metric has been rediscovered multiple times in the research literature, first for clustering (Cilibrasi and Vitanyi 2005) and more recently in the text classification setting (Jiang et al 2023).

Why VP-trees?

VP-trees are a space-partitioning data structure that significantly accelerates nearest neighbor searches by using the triangle inequality to prune unpromising branches during traversal.

Usage

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy.Char8 as BL
import Data.NCDTree

-- Create documents from bytestrings
let docs = [ Document (BL.pack "hello world")
           , Document (BL.pack "hello universe")
           , Document (BL.pack "goodbye world")
           , Document (BL.pack "hello there")
           ]

-- Build the VP-tree index with a leaf threshold of 4
let tree = mkVPTree 4 docs

-- Search for the 2 nearest neighbors of "hello"
let query = Document (BL.pack "hello")
let results = knnSearch 2 query tree

-- results is a max-heap with the closest documents

Key Functions

Characteristics

Testing

The library includes a comprehensive property-based test suite covering:

Run tests with:

stack test

References