html-charset: Determine character encoding of HTML bytes
========================================================

[![Hackage](https://img.shields.io/hackage/v/html-charset.svg)][html-charset]

This package provides a [Haskell library][html-charset] and a CLI executable
that determine the character encoding (the so-called "charset") of given
HTML bytes.

The precedence order for determining the character encoding is:

 1. A BOM (byte order mark) before any other data in the HTML document itself.
 2. A `<meta>` declaration with a `charset` attribute, or with an `http-equiv`
    attribute set to `Content-Type` and a value set for `charset`.
    Note that only the first 1024 bytes are examined.
 3. [Mozilla's Charset Detectors][chardet] heuristics. To be specific, it
    delegates to the [charsetdetect-ae] package, a Haskell implementation
    of them.

[html-charset]: https://hackage.haskell.org/package/html-charset
[chardet]: https://www-archive.mozilla.org/projects/intl/chardet.html
[charsetdetect-ae]: https://hackage.haskell.org/package/charsetdetect-ae


API
---

The package is available on Hackage: *[html-charset]*.

~~~~ haskell
>>> :set -XOverloadedStrings
>>> import Data.ByteString.Lazy
>>> import Text.Html.Encoding.Detection
>>> detect "\xef\xbb\xbf\xe4\xbd\xa0\xe5\xa5\xbd..."
Just "UTF-8"
>>> detect "..."
Just "latin-1"
>>> detect "