html-charset: Determine character encoding of HTML documents/fragments

[ lgpl, library, program, web ]

Please see the README.md on GitHub at https://github.com/dahlia/html-charset#readme.

Modules

Text
- Html
  - Encoding
    - Text.Html.Encoding.Detection

Downloads

html-charset-0.1.0.tar.gz [browse] (Cabal source package)
Package description (as included in the package)

Maintainer's Corner

Package maintainers

hongminhee

Versions [RSS]	0.1.0, 0.1.1
Change log	CHANGES.md
Dependencies	attoparsec (>=0.12 && <1), base (>=4.7 && <5), bytestring, charsetdetect-ae (>=1.1 && <2), html-charset, optparse-applicative (>=0.14 && <1) [details]
License	LGPL-2.1-only
Copyright	(c) 2018 Hong Minhee
Author	Hong Minhee <hong.minhee@gmail.com>
Maintainer	Hong Minhee <hong.minhee@gmail.com>
Category	Web
Home page	https://github.com/dahlia/html-charset#readme
Bug tracker	https://github.com/dahlia/html-charset/issues
Source repo	head: git clone https://github.com/dahlia/html-charset
Uploaded	by hongminhee at 2018-07-10T20:14:15Z
Distributions	NixOS:0.1.1
Executables	html-charset
Downloads	1538 total (16 in the last 30 days)
Status	Docs available [build log] Last success reported on 2018-07-10 [all 1 reports]

Readme for html-charset-0.1.0

html-charset: Determine character encoding of HTML bytes

This provides a Haskell library and a CLI executable to determine character encoding (i.e., so-called "charset") from given HTML bytes.

The precedence order for determining the character encoding is:

1. A BOM (byte order mark) before any other data in the HTML document itself.
2. A <meta> declaration with a charset attribute, or with an http-equiv attribute set to Content-Type and a value set for charset. Note that only the first 1024 bytes are examined.
3. The heuristics of Mozilla's Charset Detectors. To be specific, this step is delegated to the charsetdetect-ae package, a Haskell implementation of them.
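
The precedence chain above can be illustrated with a small, self-contained sketch. This is not the library's actual implementation: detectBom, detectMeta, and detectSketch are hypothetical names, the BOM table covers only three marks, the encoding names returned for BOMs are assumptions, and the <meta> step is a crude text scan for "charset=" rather than a real HTML parser. The heuristic step is passed in as a function so the sketch depends only on base and bytestring.

```haskell
import Control.Applicative ((<|>))
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.Char (toLower)
import Data.List (tails)
import Data.Maybe (listToMaybe)

-- | Step 1 (simplified): recognise a byte order mark at the very start.
-- The returned names are assumptions made for this sketch.
detectBom :: BL.ByteString -> Maybe String
detectBom bytes
  | BL.pack [0xEF, 0xBB, 0xBF] `BL.isPrefixOf` bytes = Just "UTF-8"
  | BL.pack [0xFE, 0xFF] `BL.isPrefixOf` bytes = Just "UTF-16BE"
  | BL.pack [0xFF, 0xFE] `BL.isPrefixOf` bytes = Just "UTF-16LE"
  | otherwise = Nothing

-- | Step 2 (crude stand-in): look for "charset=" within the first
-- 1024 bytes and take the token that follows it.
detectMeta :: BL.ByteString -> Maybe String
detectMeta bytes = listToMaybe
  [ name
  | rest <- tails (BLC.unpack (BL.take 1024 bytes))
  , map toLower (take 8 rest) == "charset="
  , let name = takeWhile isNameChar (dropWhile isQuote (drop 8 rest))
  , not (null name)
  ]
  where
    isQuote c = c == '"' || c == '\''
    isNameChar c = c `notElem` "\"'>/; \t\r\n"

-- | Try each step in order; the first 'Just' wins. The third step is
-- supplied by the caller (a statistical detector in a real setting).
detectSketch
  :: (BL.ByteString -> Maybe String) -> BL.ByteString -> Maybe String
detectSketch heuristic bytes =
  detectBom bytes <|> detectMeta bytes <|> heuristic bytes

main :: IO ()
main = do
  let noHeuristic = const Nothing
      withBom = BL.append (BL.pack [0xEF, 0xBB, 0xBF]) (BLC.pack "<html>")
  print (detectSketch noHeuristic withBom)
  -- Just "UTF-8"
  print (detectSketch noHeuristic (BLC.pack "<html><head><meta charset=latin-1>"))
  -- Just "latin-1"
  print (detectSketch noHeuristic (BLC.pack "<html><body>plain</body></html>"))
  -- Nothing
```

Using <|> on Maybe keeps the ordering explicit: a later step runs only when every earlier step returned Nothing, which is why a BOM takes priority over a conflicting <meta> declaration.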

API

The package is available on Hackage: html-charset.

>>> :set -XOverloadedStrings
>>> import Data.ByteString.Lazy
>>> import Text.Html.Encoding.Detection
>>> detect "\xef\xbb\xbf\xe4\xbd\xa0\xe5\xa5\xbd<html><head>..."
Just "UTF-8"
>>> detect "<html><head><meta charset=latin-1>..."
Just "latin-1"
>>> detect "<html><head><title>\xbe\xee\xbc\xad\xbf\xc0\xbc\xbc\xbf\xe4..."
Just "EUC-KR"

Note that the detect function takes a lazy ByteString, not a strict one. A strict ByteString can be converted with Data.ByteString.Lazy.fromStrict.

Read the [API docs] for details.

CLI

We currently don't provide any official binaries. The CLI program can be installed using Cabal or Stack: html-charset.

$ curl https://www.haskell.org/onlinereport/ | html-charset
ASCII
$ curl http://www.bunka.go.jp/kokugo_nihongo/sisaku/joho/joho/ | html-charset
shift_jis

Although it is unlikely, html-charset may fail to determine the character encoding, in which case it prints nothing except a single line feed. You can customize the string printed on failure with the -f/--on-failure option.
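
When using the library rather than the CLI, the same behavior amounts to substituting a fallback string for a Nothing result. The following is a minimal sketch of that idea, not the CLI's actual source: renderResult is a hypothetical helper, and its Maybe String argument stands in for the value returned by detect.

```haskell
import Data.Maybe (fromMaybe)

-- | Turn a detection result into the text to print. The first argument
-- plays the role of the -f/--on-failure string (hypothetical helper).
renderResult :: String -> Maybe String -> String
renderResult onFailure = fromMaybe onFailure

main :: IO ()
main = do
  -- A successful detection prints the encoding name.
  putStrLn (renderResult "" (Just "shift_jis"))
  -- With an empty fallback, a failure prints only a line feed.
  putStrLn (renderResult "" Nothing)
  -- With a custom fallback, a failure prints that string instead.
  putStrLn (renderResult "unknown" Nothing)
```

An explicit fallback such as "unknown" is easier for downstream scripts to test for than an empty line.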

Author and license

Written by Hong Minhee. Licensed under LGPL 2.1 or higher.