link-canonical: URL canonicalization library for semantic link identity

[ library, mit, network, web ] [ Propose Tags ] [ Report a vulnerability ]

A Haskell library that converts arbitrary URLs into canonical, semantically stable identifiers. Handles redirect resolution, tracking parameter removal, and domain-specific normalization for platforms like YouTube, Amazon, Twitter/X, and more.


[Skip to Readme]

Downloads

Maintainer's Corner

Package maintainers

For package maintainers and hackage trustees

Candidates

  • No Candidates
Versions [RSS] 0.1.0.0
Change log CHANGELOG.md
Dependencies base (>=4.18 && <5), bytestring (>=0.11 && <0.13), case-insensitive (>=1.2 && <1.3), containers (>=0.6 && <0.8), exceptions (>=0.10 && <0.11), generic-lens (>=2.2 && <2.3), http-client (>=0.7 && <0.8), http-client-tls (>=0.3 && <0.4), http-types (>=0.12 && <0.13), lens (>=5.2 && <5.4), modern-uri (>=0.3 && <0.4), text (>=2.0 && <2.2), time (>=1.12 && <1.15) [details]
License MIT
Author Nadeem Bitar
Maintainer Nadeem Bitar
Uploaded by shinzui at 2026-02-22T15:08:50Z
Category Web, Network
Source repo head: git clone https://github.com/shinzui/link-canonical.git
Distributions
Downloads 1 total (1 in the last 30 days)
Rating (no votes yet) [estimated by Bayesian average]
Your Rating
  • λ
  • λ
  • λ
Status Docs uploaded by user
Build status unknown [no reports yet]

Readme for link-canonical-0.1.0.0

[back to package description]

link-canonical

A Haskell library for converting arbitrary URLs into canonical, semantically stable identifiers. Enables URL deduplication and stable link identity in systems that ingest URLs from multiple sources.

Features

  • Tracking parameter removal - Strips UTM parameters, ad click IDs (gclid, fbclid, msclkid), and other marketing trackers
  • Redirect chain resolution - Follows URL shorteners and redirects to find the final destination
  • Domain-specific normalization - Intelligent rules for YouTube, Amazon, Twitter/X, GitHub, Instagram, and Reddit
  • RFC 3986 compliance - Proper dot segment normalization, percent-encoding, and path handling
  • Security features - Private IP blocking (SSRF prevention), HTTPS downgrade protection, redirect loop detection

Installation

Add to your cabal file:

build-depends:
    link-canonical

Or with Nix flakes:

{
  inputs.link-canonical.url = "github:shinzui/link-canonical";
}

Usage

Quick Start

import Link.Canonical
import Text.URI (mkURI)

main :: IO ()
main = do
  let Right uri = mkURI "https://youtu.be/dQw4w9WgXcQ?utm_source=twitter"
  result <- normalizeWithDefaults uri
  case result of
    Left err -> print err
    Right canonical -> print canonical
    -- Output: https://www.youtube.com/watch?v=dQw4w9WgXcQ

API Layers

The library provides three layers of functionality:

-- Pure normalization (no IO, no redirects)
normalizeUri :: NormConfig -> [DomainRule] -> URI -> Either NormError URI

-- With redirect resolution (IO)
normalizeLink :: NormConfig -> URI -> IO (Either NormError NormResult)

-- Convenient defaults
normalizeWithDefaults :: URI -> IO (Either NormError NormResult)

Configuration

import Link.Canonical
import Link.Canonical.Types
import Link.Canonical.Config

-- Customize configuration
customConfig :: NormConfig
customConfig = defaultConfig
  & #redirects . #maxRedirects .~ 5
  & #redirects . #timeout .~ 30
  & #tracking . #allowlist .~ Set.fromList ["ref"]

Configuration Options

Option Default Description
redirects.maxRedirects 10 Maximum redirect hops
redirects.timeout 10s Request timeout
redirects.allowDowngrade False Allow HTTPS to HTTP redirects
redirects.blockPrivateIPs True Block private/local IPs (SSRF protection)
tracking.denyPatterns See below Patterns for tracking parameters
stripFragment True Remove URL fragments
sortParams True Sort query parameters alphabetically

Domain Rules

YouTube

All YouTube URL formats normalize to a consistent watch URL:

Input Output
youtu.be/dQw4w9WgXcQ youtube.com/watch?v=dQw4w9WgXcQ
youtube.com/embed/dQw4w9WgXcQ youtube.com/watch?v=dQw4w9WgXcQ
youtube.com/shorts/dQw4w9WgXcQ youtube.com/watch?v=dQw4w9WgXcQ

Amazon

Amazon product URLs preserve the regional TLD and normalize to the canonical /dp/{ASIN} format:

Input Output
amazon.com/Some-Product/dp/B08N5WRWNW/ref=sr_1_1 amazon.com/dp/B08N5WRWNW
amazon.co.uk/dp/B08N5WRWNW amazon.co.uk/dp/B08N5WRWNW

Twitter/X

Twitter URLs normalize to X.com:

Input Output
twitter.com/user/status/123 x.com/user/status/123
x.com/user/status/123?s=20 x.com/user/status/123

GitHub

GitHub URLs preserve meaningful fragments (line numbers):

Input Output
github.com/owner/repo/blob/main/file.hs#L10-L20 github.com/owner/repo/blob/main/file.hs#L10-L20
github.com/owner/repo?tab=readme github.com/owner/repo

Instagram

Instagram URLs normalize subdomains:

Input Output
instagram.com/p/ABC123 www.instagram.com/p/ABC123

Reddit

Reddit URLs normalize to the main domain:

Input Output
old.reddit.com/r/haskell/comments/abc www.reddit.com/r/haskell/comments/abc

Tracking Parameters

The following tracking parameters are removed by default:

  • Google Analytics: utm_source, utm_medium, utm_campaign, utm_term, utm_content, _ga, _gl, _gid
  • Ad Platforms: gclid, fbclid, msclkid, dclid
  • Marketing: mc_*, oly_*, _hsenc, _hsmi, mkt_tok
  • Social: igshid, si, ref, source
  • Other: zanpid

Development

Prerequisites

  • GHC 9.12+
  • Cabal 3.0+
  • Or Nix (recommended)

Setup with Nix

# Enter development shell
nix develop

# Build
cabal build

# Run tests
cabal test

# Format code
treefmt

Setup without Nix

# Build
cabal build

# Run tests
cabal test

Project Structure

src/Link/Canonical/
├── Canonical.hs      # Main entry point
├── Types.hs          # Core types
├── Config.hs         # Default configuration
├── Normalize.hs      # Generic URL normalization
├── Redirect.hs       # Redirect resolution
├── Tracking.hs       # Tracking parameter stripping
└── Rules/            # Domain-specific rules
    ├── YouTube.hs
    ├── Amazon.hs
    ├── Twitter.hs
    ├── GitHub.hs
    ├── Instagram.hs
    └── Reddit.hs

Testing

cabal test

The test suite includes:

  • Generic normalization tests (scheme, host, port, path, query, fragment)
  • Tracking parameter removal tests
  • Redirect resolution tests (including loop detection, timeout handling)
  • Edge case tests (empty URLs, special characters, encoding)
  • Domain-specific rule tests for each supported platform

License

MIT License - see LICENSE for details.

Copyright 2025 Nadeem Bitar