Determine the Encoding of a HTML Byte Stream
This package implements the HTML Standard's encoding sniffing algorithm in all its glory. The most interesting part of this is how it pre-scans the first 1024 bytes in order to search for certain <meta charset>-related patterns.
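To give a feel for what such a pre-scan involves, here is a deliberately simplified sketch. It is not this package's implementation: the real algorithm is a byte-level state machine with many more rules, while this version just runs a regular expression over the first 1024 bytes. The function name prescanForMetaCharset is hypothetical, and it returns the raw label it finds rather than a canonical encoding name.

```javascript
// Simplified illustration only; the HTML Standard's prescan handles comments,
// http-equiv/content pairs, attribute parsing rules, and more.
function prescanForMetaCharset(bytes) {
  // Only the first 1024 bytes are examined.
  const head = bytes.subarray(0, 1024);
  // Map each byte to one character so ASCII patterns can be matched as text.
  const text = Array.from(head, (b) => String.fromCharCode(b)).join("");
  const match = /<meta[^>]*?\bcharset\s*=\s*["']?([^\s"'>\/;]+)/i.exec(text);
  return match ? match[1].toLowerCase() : null;
}

const sample = new TextEncoder().encode(
  '<!doctype html><meta charset="ISO-8859-2"><title>x</title>'
);
console.log(prescanForMetaCharset(sample)); // "iso-8859-2"
```

A declaration that appears after the first 1024 bytes is ignored by this sketch, which mirrors the window size described above.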
The returned value will be a canonical encoding name (not a label). You might then combine this with a decoding package to decode the result.
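As a rough illustration of the decoding step, the sketch below uses the runtime's built-in TextDecoder as a stand-in for a dedicated decoding package. The sniffedEncoding value is a hard-coded placeholder for what htmlEncodingSniffer might return, not real output from the package.

```javascript
// Placeholder for the name the sniffer would return (assumed, not real output).
const sniffedEncoding = "UTF-8";

// The bytes of "café" encoded as UTF-8.
const htmlBytes = new Uint8Array([0x63, 0x61, 0x66, 0xc3, 0xa9]);

// TextDecoder takes an encoding label; a name such as "UTF-8" is accepted as one.
const decoded = new TextDecoder(sniffedEncoding).decode(htmlBytes);
console.log(decoded); // "café"
```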
You can pass two options to htmlEncodingSniffer.
These represent two possible inputs into the encoding sniffing algorithm:
transportLayerEncodingLabel is an encoding label obtained from the "transport layer" (probably an HTTP Content-Type header), which overrides everything but a BOM.
defaultEncoding is the ultimate fallback encoding, used if no valid encoding is supplied by the transport layer and no encoding is sniffed from the bytes. It defaults to "windows-1252", as recommended by the algorithm's table of suggested defaults for "All other locales" (including the en locale).
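The precedence described above (BOM first, then the transport layer label, then whatever is sniffed from the bytes, then the default) can be sketched as follows. This is an illustrative approximation rather than the package's code: sniffSketch, sniffBOM, roughPrescan, and the tiny LABEL_TO_NAME table are all hypothetical names introduced here, and the pre-scan is reduced to a single regular expression.

```javascript
// Hypothetical, deliberately tiny label-to-name table; the real mapping is far larger.
const LABEL_TO_NAME = new Map([
  ["utf-8", "UTF-8"],
  ["utf8", "UTF-8"],
  ["latin1", "windows-1252"],
  ["iso-8859-1", "windows-1252"],
  ["windows-1252", "windows-1252"],
  ["iso-8859-2", "ISO-8859-2"],
]);

// Returns a canonical name, or null when the label is missing or unknown.
function labelToName(label) {
  if (typeof label !== "string") return null;
  return LABEL_TO_NAME.get(label.trim().toLowerCase()) ?? null;
}

// A byte order mark wins over every other signal.
function sniffBOM(bytes) {
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) return "UTF-8";
  if (bytes[0] === 0xfe && bytes[1] === 0xff) return "UTF-16BE";
  if (bytes[0] === 0xff && bytes[1] === 0xfe) return "UTF-16LE";
  return null;
}

// Very rough stand-in for the pre-scan of the first 1024 bytes.
function roughPrescan(bytes) {
  const head = Array.from(bytes.subarray(0, 1024), (b) => String.fromCharCode(b)).join("");
  const match = /<meta[^>]*?\bcharset\s*=\s*["']?([^\s"'>\/;]+)/i.exec(head);
  return match ? labelToName(match[1]) : null;
}

function sniffSketch(
  bytes,
  { transportLayerEncodingLabel, defaultEncoding = "windows-1252" } = {}
) {
  return (
    sniffBOM(bytes) ??
    labelToName(transportLayerEncodingLabel) ??
    roughPrescan(bytes) ??
    defaultEncoding
  );
}

const page = new TextEncoder().encode('<meta charset="iso-8859-2"><p>hi</p>');
console.log(sniffSketch(page)); // "ISO-8859-2"
console.log(sniffSketch(page, { transportLayerEncodingLabel: "utf8" })); // "UTF-8"
```

Note that an unrecognized transport layer label simply falls through to the next step in this sketch, matching the "no valid encoding is supplied" wording above.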
This package was originally based on the excellent work of , . It has since been pulled out into this separate package.