| Safe Haskell | Safe-Inferred |
|---|---|
| Language | Haskell2010 |
Text.Html.Encoding.Detection
Synopsis
- type EncodingName = String
- detect :: ByteString -> Maybe EncodingName
- detectBom :: ByteString -> Maybe EncodingName
- detectMetaCharset :: ByteString -> Maybe EncodingName
Documentation
type EncodingName = String
Represents the name of a text encoding (i.e., a charset), e.g., "UTF-8".
detect :: ByteString -> Maybe EncodingName
Detect the character encoding of a given HTML fragment. The precedence order for determining the character encoding is:
- A BOM (byte order mark) before any other data in the HTML document itself. (See also the detectBom function for details.)
- A meta declaration with a charset attribute, or an http-equiv attribute set to Content-Type and a value set for charset. Note that only the first 1024 bytes are examined. (See also detectMetaCharset for details.)
- Mozilla's Charset Detectors heuristics. To be specific, it delegates to detectEncodingName from the charsetdetect-ae package, a Haskell implementation of those heuristics.
>>> :set -XOverloadedStrings
>>> detect "\xef\xbb\xbf\xe4\xbd\xa0\xe5\xa5\xbd<html><head>..."
Just "UTF-8"
>>> detect "<html><head><meta charset=latin-1>..."
Just "latin-1"
>>> detect "<html><head><title>\xbe\xee\xbc\xad\xbf\xc0\xbc\xbc\xbf\xe4..."
Just "EUC-KR"
It may return Nothing if it fails to determine the character encoding,
although this is unlikely.
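The precedence order described for detect can be pictured as a chain of detectors in which the first one to return a Just wins. The following is only a minimal sketch of that idea, not this module's source: the names Detector, firstDetected, and firstBytes are hypothetical, and applying the 1024-byte limit by truncating the input is an assumption about how such a limit could be enforced.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.Foldable (asum)

-- Hypothetical alias: anything that may name an encoding for some bytes.
type Detector = ByteString -> Maybe String

-- Run the detectors in list order and keep the first Just.
-- Maybe's Alternative instance is lazy, so later detectors are
-- not evaluated once an earlier one has succeeded.
firstDetected :: [Detector] -> Detector
firstDetected detectors input = asum [d input | d <- detectors]

-- Restrict a detector to the first n bytes of the input
-- (an assumed way to model the 1024-byte limit of the meta step).
firstBytes :: Int -> Detector -> Detector
firstBytes n detector = detector . B.take n
```

Under these assumptions, a pipeline in the documented order would look like firstDetected [bomStep, firstBytes 1024 metaStep, heuristicStep], where the three steps are stand-ins supplied by the caller.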
detectBom :: ByteString -> Maybe EncodingName
Detect the character encoding of a given HTML fragment by looking at the initial BOM (byte order mark).
>>> :set -XOverloadedStrings
>>> detectBom "\xef\xbb\xbf\xe4\xbd\xa0\xe5\xa5\xbd"
Just "UTF-8"
>>> detectBom "\xfe\xff\x4f\x60\x59\x7d"
Just "UTF-16BE"
>>> detectBom "\xff\xfe\x60\x4f\x7d\x59"
Just "UTF-16LE"
>>> detectBom "\x00\x00\xfe\xff\x00\x00\x4f\x60\x00\x00\x59\x7d"
Just "UTF-32BE"
>>> detectBom "\xff\xfe\x00\x00\x60\x4f\x00\x00\x7d\x59\x00\x00"
Just "UTF-32LE"
>>> detectBom "\x84\x31\x95\x33\xc4\xe3\xba\xc3"
Just "GB-18030"
It returns Nothing if it finds no valid BOM sequence.
>>> detectBom "foobar"
Nothing
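BOM sniffing of this kind amounts to matching a short byte prefix against a fixed table. The sketch below illustrates the technique using only the bytestring package; bomTable and sniffBom are hypothetical names, and the table is reconstructed from the examples above rather than taken from the library's source.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.List (find)

-- Hypothetical table of BOM byte sequences and the names they imply.
-- Four-byte marks come first because the UTF-32LE mark (FF FE 00 00)
-- begins with the UTF-16LE mark (FF FE).
bomTable :: [(ByteString, String)]
bomTable =
  [ (B.pack [0x00, 0x00, 0xFE, 0xFF], "UTF-32BE")
  , (B.pack [0xFF, 0xFE, 0x00, 0x00], "UTF-32LE")
  , (B.pack [0x84, 0x31, 0x95, 0x33], "GB-18030")
  , (B.pack [0xEF, 0xBB, 0xBF], "UTF-8")
  , (B.pack [0xFE, 0xFF], "UTF-16BE")
  , (B.pack [0xFF, 0xFE], "UTF-16LE")
  ]

-- Return the name attached to the first table entry that prefixes the input.
sniffBom :: ByteString -> Maybe String
sniffBom input =
  snd <$> find (\(bom, _) -> bom `B.isPrefixOf` input) bomTable
```

Ordering the table longest-first is the design choice that matters here: checking FF FE before FF FE 00 00 would report every UTF-32LE input as UTF-16LE.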
detectMetaCharset :: ByteString -> Maybe EncodingName
Detect the character encoding of a given HTML fragment by looking for
a meta declaration with a charset attribute, or an http-equiv
attribute set to Content-Type and a value set for charset.
>>> :set -XOverloadedStrings
>>> detectMetaCharset "<html><head><meta charset=utf-8>"
Just "utf-8"
>>> detectMetaCharset "<html><head><meta charset='EUC-KR'>"
Just "EUC-KR"
>>> detectMetaCharset "<html><head><meta charset=\"latin-1\"/></head></html>"
Just "latin-1"
>>> :{
detectMetaCharset "<meta http-equiv=content-type content='text/html; charset=utf-8'>"
:}
Just "utf-8"
It returns Nothing if it fails to find any appropriate meta tag:
>>> detectMetaCharset "<html><body></body></html>"
Nothing
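To give a rough feel for what scanning for a charset declaration involves, here is a deliberately naive sketch. It is not how this module is implemented and it does not parse HTML: it looks for the text charset= (case-insensitively) within the first 1024 bytes and reads the value that follows, so it would also match that text outside a meta tag. The name naiveMetaCharset, the delimiter set, and the placement of the 1024-byte window are all assumptions made for illustration.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BC
import Data.Char (toLower)

-- Naive, hypothetical scan for "charset=" in the first 1024 bytes.
naiveMetaCharset :: ByteString -> Maybe String
naiveMetaCharset input
  | BC.null match = Nothing
  | BC.null name = Nothing
  | otherwise = Just (BC.unpack name)
  where
    needle = BC.pack "charset="
    window = BC.take 1024 input
    -- Lower-case a copy for matching; byte offsets stay aligned with window.
    lowered = BC.map toLower window
    (before, match) = BC.breakSubstring needle lowered
    -- Read the value from the original bytes so its case is preserved.
    afterKey = BC.drop (BC.length before + BC.length needle) window
    unquoted = BC.dropWhile (`elem` "\"'") afterKey
    name = BC.takeWhile (`notElem` "\"' />;") unquoted
```

On the inputs used in the examples above, this sketch yields the same names, but a real detector needs to recognise tag and attribute boundaries, which this one ignores.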