Safe Haskell | Safe-Inferred |
---|---|
Language | Haskell2010 |
Synopsis
- type EncodingName = String
- detect :: ByteString -> Maybe EncodingName
- detectBom :: ByteString -> Maybe EncodingName
- detectMetaCharset :: ByteString -> Maybe EncodingName
Documentation
type EncodingName = String
Represents the name of a text encoding (i.e., charset), e.g., "UTF-8".
detect :: ByteString -> Maybe EncodingName
Detect the character encoding from a given HTML fragment. The precedence order for determining the character encoding is:
- A BOM (byte order mark) before any other data in the HTML document itself. (See also the detectBom function for details.)
- A meta declaration with a charset attribute, or with an http-equiv attribute set to Content-Type and a value set for charset. Note that it looks at only the first 1024 bytes. (See also detectMetaCharset for details.)
- Mozilla's Charset Detectors heuristics. To be specific, it delegates to detectEncodingName from the charsetdetect-ae package, a Haskell implementation of those heuristics.
>>> :set -XOverloadedStrings
>>> detect "\xef\xbb\xbf\xe4\xbd\xa0\xe5\xa5\xbd<html><head>..."
Just "UTF-8"
>>> detect "<html><head><meta charset=latin-1>..."
Just "latin-1"
>>> detect "<html><head><title>\xbe\xee\xbc\xad\xbf\xc0\xbc\xbc\xbf\xe4..."
Just "EUC-KR"
It may return Nothing if it fails to determine the character encoding, although this is unlikely.
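The precedence order described for detect amounts to trying each detector in turn and keeping the first Just result. The following is a minimal sketch of that idea, not the library's implementation: firstDetected is a hypothetical helper name, and strict ByteString input is an assumption.

```haskell
import Control.Applicative ((<|>))
import qualified Data.ByteString as B

type EncodingName = String

-- Hypothetical helper (not part of this module): run the detectors in
-- the given order and return the first result that is a Just.
-- Because (<|>) on Maybe is lazy in its right operand, later detectors
-- are only evaluated when the earlier ones returned Nothing.
firstDetected
  :: [B.ByteString -> Maybe EncodingName]
  -> B.ByteString
  -> Maybe EncodingName
firstDetected detectors input =
  foldr (\detector rest -> detector input <|> rest) Nothing detectors
```

Under these assumptions, passing detectBom and detectMetaCharset in that order would mirror the first two steps of the list above, provided the ByteString flavour matches.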
detectBom :: ByteString -> Maybe EncodingName
Detect the character encoding from a given HTML fragment by looking at the initial BOM (byte order mark).
>>> :set -XOverloadedStrings
>>> detectBom "\xef\xbb\xbf\xe4\xbd\xa0\xe5\xa5\xbd"
Just "UTF-8"
>>> detectBom "\xfe\xff\x4f\x60\x59\x7d"
Just "UTF-16BE"
>>> detectBom "\xff\xfe\x60\x4f\x7d\x59"
Just "UTF-16LE"
>>> detectBom "\x00\x00\xfe\xff\x00\x00\x4f\x60\x00\x00\x59\x7d"
Just "UTF-32BE"
>>> detectBom "\xff\xfe\x00\x00\x60\x4f\x00\x00\x7d\x59\x00\x00"
Just "UTF-32LE"
>>> detectBom "\x84\x31\x95\x33\xc4\xe3\xba\xc3"
Just "GB-18030"
It returns Nothing if it finds no valid BOM sequence.
>>> detectBom "foobar"
Nothing
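BOM detection of this kind can be sketched as a prefix lookup in a small table. The byte sequences below are the ones used in the detectBom examples; the names bomTable and sketchDetectBom are hypothetical, strict ByteString is an assumption, and this is an illustration rather than the library's actual code. Note that the four-byte UTF-32LE mark must be tried before the two-byte UTF-16LE mark, since the former begins with the latter.

```haskell
import qualified Data.ByteString as B
import Data.List (find)

type EncodingName = String

-- Hypothetical table of BOM byte sequences and the names they imply.
-- Longer marks come first so that UTF-32LE (FF FE 00 00) is not
-- mistaken for UTF-16LE (FF FE).
bomTable :: [(B.ByteString, EncodingName)]
bomTable =
  [ (B.pack [0x00, 0x00, 0xFE, 0xFF], "UTF-32BE")
  , (B.pack [0xFF, 0xFE, 0x00, 0x00], "UTF-32LE")
  , (B.pack [0x84, 0x31, 0x95, 0x33], "GB-18030")
  , (B.pack [0xEF, 0xBB, 0xBF],       "UTF-8")
  , (B.pack [0xFE, 0xFF],             "UTF-16BE")
  , (B.pack [0xFF, 0xFE],             "UTF-16LE")
  ]

-- Return the name paired with the first BOM that prefixes the input,
-- or Nothing when no entry matches.
sketchDetectBom :: B.ByteString -> Maybe EncodingName
sketchDetectBom input =
  snd <$> find (\(bom, _) -> bom `B.isPrefixOf` input) bomTable
```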
detectMetaCharset :: ByteString -> Maybe EncodingName
Detect the character encoding from a given HTML fragment by looking for a meta declaration with a charset attribute, or with an http-equiv attribute set to Content-Type and a value set for charset.
>>> :set -XOverloadedStrings
>>> detectMetaCharset "<html><head><meta charset=utf-8>"
Just "utf-8"
>>> detectMetaCharset "<html><head><meta charset='EUC-KR'>"
Just "EUC-KR"
>>> detectMetaCharset "<html><head><meta charset=\"latin-1\"/></head></html>"
Just "latin-1"
>>> :{
detectMetaCharset "<meta http-equiv=content-type content='text/html; charset=utf-8'>"
:}
Just "utf-8"
It returns Nothing if it fails to find any appropriate meta tag:
>>> detectMetaCharset "<html><body></body></html>"
Nothing
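As the detectMetaCharset examples suggest, the name comes back as it is written in the document (for instance "utf-8" rather than "UTF-8"), so callers may want to compare names case-insensitively. A minimal sketch follows; sameEncoding is a hypothetical helper, not part of this module, and it does not resolve aliases such as "latin-1" versus "ISO-8859-1".

```haskell
import Data.Char (toLower)

type EncodingName = String

-- Hypothetical helper: compare two encoding names while ignoring
-- ASCII case differences. Aliases are NOT treated as equal.
sameEncoding :: EncodingName -> EncodingName -> Bool
sameEncoding a b = map toLower a == map toLower b
```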