html-charset: Determine character encoding of HTML bytes
========================================================

[![Hackage](https://img.shields.io/hackage/v/html-charset.svg)][html-charset]

This provides a [Haskell library][html-charset] and a CLI executable to
determine the character encoding (i.e., the so-called "charset") of given
HTML bytes.

The precedence order for determining the character encoding is:

1.  A BOM (byte order mark) before any other data in the HTML document itself.
2.  A `<meta>` declaration with a `charset` attribute, or with an `http-equiv`
    attribute set to `Content-Type` and a value set for `charset`.
    Note that only the first 1024 bytes are examined.
3.  [Mozilla's Charset Detectors][chardet] heuristics.  To be specific,
    it delegates to the [charsetdetect-ae] package, a Haskell implementation
    of those heuristics.

[html-charset]: https://hackage.haskell.org/package/html-charset
[chardet]: https://www-archive.mozilla.org/projects/intl/chardet.html
[charsetdetect-ae]: https://hackage.haskell.org/package/charsetdetect-ae


API
---

The package is available on Hackage: *[html-charset]*.

~~~~ haskell
>>> :set -XOverloadedStrings
>>> import Data.ByteString.Lazy
>>> import Text.Html.Encoding.Detection
>>> detect "\xef\xbb\xbf\xe4\xbd\xa0\xe5\xa5\xbd..."
Just "UTF-8"
>>> detect "..."
Just "latin-1"
>>> detect "\xbe\xee\xbc\xad\xbf\xc0\xbc\xbc\xbf\xe4..."
Just "EUC-KR"
~~~~

Note that the `detect` function takes a lazy bytestring, not a strict one.
Read the [API docs] for details.

[API docs]: https://hackage.haskell.org/package/html-charset/docs/Text-Html-Encoding-Detection.html


CLI
---

We currently don't provide any official binaries.  The CLI program can be
installed using Cabal or Stack: *[html-charset]*.

~~~~
$ curl https://www.haskell.org/onlinereport/ | html-charset
ASCII
$ curl http://www.bunka.go.jp/kokugo_nihongo/sisaku/joho/joho/ | html-charset
shift_jis
~~~~

Although it's unlikely, `html-charset` may fail to determine the character
encoding, and in that case it prints nothing (only a line feed, to be exact).
You can customize the string printed on failure with the `-f`/`--on-failure`
option.


Author and license
------------------

Written by [Hong Minhee].  Licensed under [LGPL 2.1] or higher.

[Hong Minhee]: https://hongminhee.org/
[LGPL 2.1]: https://www.gnu.org/licenses/lgpl-2.1.html