html-charset-0.1.1: Determine character encoding of HTML documents/fragments
Safe Haskell: Safe-Inferred
Language: Haskell2010

Text.Html.Encoding.Detection

Documentation

type EncodingName = String

Represents the name of a text encoding (i.e., a charset), e.g. "UTF-8".

detect :: ByteString -> Maybe EncodingName

Detect the character encoding of a given HTML fragment. The precedence order for determining the character encoding is:

  1. A BOM (byte order mark) before any other data in the HTML document itself. (See the detectBom function for details.)
  2. A meta declaration with a charset attribute, or with an http-equiv attribute set to Content-Type and a value set for charset. Note that only the first 1024 bytes are examined. (See detectMetaCharset for details.)
  3. The heuristics of Mozilla's charset detectors. Specifically, it delegates to detectEncodingName from the charsetdetect-ae package, a Haskell implementation of those heuristics.

>>> :set -XOverloadedStrings
>>> detect "\xef\xbb\xbf\xe4\xbd\xa0\xe5\xa5\xbd<html><head>..."
Just "UTF-8"
>>> detect "<html><head><meta charset=latin-1>..."
Just "latin-1"
>>> detect "<html><head><title>\xbe\xee\xbc\xad\xbf\xc0\xbc\xbc\xbf\xe4..."
Just "EUC-KR"

It returns Nothing if it fails to determine the character encoding, although this is unlikely.
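
The precedence order described above amounts to "try each detector in turn and keep the first result". The following is a minimal sketch of that idea, not the package's actual implementation; firstDetected is a hypothetical helper introduced only for illustration, and it relies on the Alternative instance of Maybe.

```haskell
import Control.Applicative ((<|>))

type EncodingName = String

-- | Hypothetical helper (not part of html-charset): run the given
-- detectors in order and return the first 'Just' result, if any.
firstDetected :: [a -> Maybe EncodingName] -> a -> Maybe EncodingName
firstDetected detectors input =
  -- For 'Maybe', (<|>) keeps the left value when it is a 'Just',
  -- so earlier detectors take precedence over later ones.
  foldr (\detector fallback -> detector input <|> fallback) Nothing detectors
```

Under this sketch, a detector list ordered as BOM check, meta declaration check, then heuristic fallback would reproduce the documented precedence.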

detectBom :: ByteString -> Maybe EncodingName

Detect the character encoding of a given HTML fragment by looking at the initial BOM (byte order mark).

>>> :set -XOverloadedStrings
>>> detectBom "\xef\xbb\xbf\xe4\xbd\xa0\xe5\xa5\xbd"
Just "UTF-8"
>>> detectBom "\xfe\xff\x4f\x60\x59\x7d"
Just "UTF-16BE"
>>> detectBom "\xff\xfe\x60\x4f\x7d\x59"
Just "UTF-16LE"
>>> detectBom "\x00\x00\xfe\xff\x00\x00\x4f\x60\x00\x00\x59\x7d"
Just "UTF-32BE"
>>> detectBom "\xff\xfe\x00\x00\x60\x4f\x00\x00\x7d\x59\x00\x00"
Just "UTF-32LE"
>>> detectBom "\x84\x31\x95\x33\xc4\xe3\xba\xc3"
Just "GB-18030"

It returns Nothing if it fails to find a valid BOM sequence.

>>> detectBom "foobar"
Nothing
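
BOM detection of this kind can be sketched as a prefix lookup over a small table. The byte sequences below are the ones shown in the examples above; sketchDetectBom and bomTable are hypothetical names, and this is an illustrative approximation rather than the package's source. The table order matters because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.

```haskell
import qualified Data.ByteString as B
import Data.List (find)

type EncodingName = String

-- | BOM byte sequences paired with encoding names. The four-byte
-- UTF-32 entries come before the two-byte UTF-16 entries so that
-- the longer match wins.
bomTable :: [(B.ByteString, EncodingName)]
bomTable =
  [ (B.pack [0xef, 0xbb, 0xbf],       "UTF-8")
  , (B.pack [0x00, 0x00, 0xfe, 0xff], "UTF-32BE")
  , (B.pack [0xff, 0xfe, 0x00, 0x00], "UTF-32LE")
  , (B.pack [0xfe, 0xff],             "UTF-16BE")
  , (B.pack [0xff, 0xfe],             "UTF-16LE")
  , (B.pack [0x84, 0x31, 0x95, 0x33], "GB-18030")
  ]

-- | Hypothetical stand-in for detectBom: return the name attached to
-- the first table entry whose BOM is a prefix of the input.
sketchDetectBom :: B.ByteString -> Maybe EncodingName
sketchDetectBom input =
  snd <$> find (\(bom, _) -> bom `B.isPrefixOf` input) bomTable
```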

detectMetaCharset :: ByteString -> Maybe EncodingName

Detect the character encoding of a given HTML fragment by looking for a meta declaration with a charset attribute, or with an http-equiv attribute set to Content-Type and a value set for charset.

>>> :set -XOverloadedStrings
>>> detectMetaCharset "<html><head><meta charset=utf-8>"
Just "utf-8"
>>> detectMetaCharset "<html><head><meta charset='EUC-KR'>"
Just "EUC-KR"
>>> detectMetaCharset "<html><head><meta charset=\"latin-1\"/></head></html>"
Just "latin-1"
>>> :{
detectMetaCharset
     "<meta http-equiv=content-type content='text/html; charset=utf-8'>"
:}
Just "utf-8"

It returns Nothing if it fails to find any appropriate meta tag:

>>> detectMetaCharset "<html><body></body></html>"
Nothing
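
To give a rough feel for this kind of scan, the sketch below searches the first 1024 bytes for a case-insensitive "charset=" and reads the token that follows. It is a deliberately naive approximation under stated assumptions: sketchMetaCharset is a hypothetical name, the function does not check that the match sits inside a meta tag, and the real detectMetaCharset may work quite differently.

```haskell
import qualified Data.ByteString.Char8 as BC
import Data.Char (isAlphaNum, toLower)

type EncodingName = String

-- | Hypothetical, naive stand-in for detectMetaCharset.
sketchMetaCharset :: BC.ByteString -> Maybe EncodingName
sketchMetaCharset input
  | BC.null match = Nothing
  | BC.null name  = Nothing
  | otherwise     = Just (BC.unpack name)
  where
    needle  = BC.pack "charset="
    -- Only the first 1024 bytes are considered.
    window  = BC.take 1024 input
    -- Lower-casing is byte-wise, so offsets in 'lowered' and 'window' agree.
    lowered = BC.map toLower window
    (before, match) = BC.breakSubstring needle lowered
    -- Read the value from the original bytes to preserve its casing,
    -- skipping an optional opening quote.
    afterNeedle = BC.drop (BC.length before + BC.length needle) window
    unquoted    = BC.dropWhile (\c -> c == '"' || c == '\'') afterNeedle
    name        = BC.takeWhile (\c -> isAlphaNum c || c `elem` "-_.:") unquoted
```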