HTML5’s ‘encoding sniffing algorithm’ says that the default encoding of an HTML document is “an implementation-defined or user-specified default character encoding … typically dependent on the user’s locale”.
However, it turns out that Web browsers operate with a second default — namely the parent browsing context-defined default. This default, which overrides the locale default, steps in when the document of a ‘child browsing context’ fails to provide encoding information.
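The two-level fallback described above can be sketched in a few lines of Python. This is only an illustration: the function name `fallback_encoding` and its parameters are hypothetical, and real browsers consult more layers than these three.

```python
from typing import Optional


def fallback_encoding(own_label: Optional[str],
                      parent_encoding: Optional[str],
                      locale_default: str) -> str:
    """Pick an encoding for a document in a child browsing context.

    own_label       -- encoding info the document itself provides, if any
    parent_encoding -- encoding of the parent document, if there is a parent
    locale_default  -- the UA's locale-dependent default
    (all three names are assumptions made for this sketch)
    """
    if own_label:
        # The document declares its encoding, so no default is needed.
        return own_label
    if parent_encoding:
        # The "second default": the parent's encoding beats the locale default.
        return parent_encoding
    return locale_default
```

For example, an unlabelled iframe document whose parent is KOI8-R would come out as KOI8-R here, even if the locale default is windows-1252.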
Another hot spot of HTML encoding is the proposal to make the Byte Order Mark (BOM) override every other encoding setting, that is, both HTTP/MIME and the meta element. The same might even be done for XML files.
A third hot issue is the way some browsers (partly) apply their HTML encoding sniffing algorithm to XML documents.
The second and third issues can be seen together: an XML document in an iframe of an HTML document should be unaffected by the encoding of the HTML document, even if the XML document lacks encoding info, because in that case the XML document defaults to UTF-8. But, alas, the Webkittens currently ignore this, which makes it tempting to solve the problem with a BOM in XML files too.
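To make the BOM proposal concrete, here is a minimal Python sketch of "BOM beats HTTP beats meta". The function names `sniff_bom` and `choose_encoding` are hypothetical, and the sketch only knows the UTF-8 and UTF-16 signatures; it is not how any particular browser implements the check.

```python
import codecs
from typing import Optional

# The BOM byte sequences considered in this sketch, with the label each implies.
# (UTF-32 is deliberately left out; its little-endian BOM starts with the
# UTF-16LE one, so it would need to be tested first.)
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),        # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16le"),  # FF FE
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None


def choose_encoding(data: bytes,
                    http_charset: Optional[str],
                    meta_charset: Optional[str]) -> Optional[str]:
    """Assumed ordering under the proposal: BOM first, then HTTP, then meta."""
    return sniff_bom(data) or http_charset or meta_charset
```

With this ordering, a file that starts with `FF FE` is treated as UTF-16LE even if the server says `iso-8859-1`; without a BOM, the HTTP charset is used.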
To clarify (or possibly blur) my mind, I decided to go through all the browsers I have access to and document their behaviors. Here you have it! It is at least the most complete table I myself have ever made. (But it is still not 100% complete: it lacks info on how the encoding of documents kept within data URIs is determined, for instance.)
| Layer | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| document is XML format | user override | inherit override from parent | explicit MIME charset | BOM signature | native markup label | markup label of the sister language | info on “the likely encoding” | UTF-8 detection etc. | parent browsing context | implicit markup language default | info from OS | locale default | ||
| HTML implementation | HTML5 | opts out | 1. | 2. | 3. bug 15359 | 4. | 5. ask Hixie | 6. | 7. | |||||
| Firefox 10 | opts out | 1. | 2. | 3. | 4. | 5. | ask vendors | 6. some locales | 7. | 8. | ||||
| IE 10 pre-release | opts out | 1. | 2. | 3. | 4. | 5. | 6. check! | 7. check! | 8. | |||||
| IE 9 | opts out | 2. | 3. | 4. | 1. | 5. | check! | 7. check! | 8. | |||||
| IE 6 | 2. | 3. | 1. | 4. | 5. | |||||||||
| Chrome 22 | 2. | 3. | 4. | 1. | 5. | 6. | 8. | 7. | 9. | |||||
| Safari 5 | 2. | 3. | 4. | 1. | 5. | 6. | 7. | 8. | 9. | |||||
| Opera 10 | 1. | 2. | 3. | 4. | 5. | 6. some locales | 7. | 8. | ||||||
| Opera 12 | 1. | 2. | 4. | 3. | 5. | 6. | 7. all locales | 8. | 9. | |||||
| Who wins? Your own tests. PS: In as much as your UA allows it at all(!), the tests require that you yourself override the encoding | XML vs your encoding override xml vs override take 2 | BOM signature vs your encoding override | iframe doc vs your encoding override of the parent doc | HTTP vs meta element | BOM vs meta vs HTTP vs chardet | meta vs xml encoding declaration vs chardet | xml encoding declaration vs locale default | declared encoding vs reload after an encoding override - does it stick? | UTF-8 file vs locale default | locale default vs doc of parent browsing context | Does not apply to HTML | OSX: Via file://, load a TextEdit made UTF-8 file in Safari 5 | check the locale default | |
Because I made the HTML table (or, actually, it was the other way around …), I wanted to document how XML is handled too.
| Layer | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user override | explicit MIME charset | BOM signature | native markup label | markup label of the sister language | info on “the likely encoding” | UTF-8 detection etc. | parent browsing context | implicit markup language default | info from OS | locale default | ||
| XML implementation | HTML5 | HTML5 defers encoding determination of XML to XML 1.0 | ||||||||||
| XML1.0 | 1. | 2. | 3. | 4. | 5. | |||||||
| Firefox 10 | 1. | 2. | 3. | 4. check! | 5. | |||||||
| IE 9 | check! | 2. | 1. | 3. | 4. check! | check! | 5. | |||||
| IE 10 pre-release | check! | 1. | 2. | 3. | 4. check! | check! | 5. | |||||
| Chrome 22 | 2. | 3. | 1. | 4. | 6. check! | 5. | 7. | |||||
| Safari 5 | 2. | 3. | 1. | 4. | 5. check! | 6. | 7. | |||||
| Opera 10 | 1. | 2. | 3. | 4. some locales | 5. | 6. | ||||||
| Opera 12 | 2. | 3. | 1. | 4. | 5. all locales | 6. | 7. | |||||
NB regarding XML! Note that e.g. SVG images inside the <img> element do not count as browsing contexts, whereas an SVG in an iframe does count as a browsing context. Hence, for SVG images in <img>, the parent encoding does not affect the encoding of the SVG image. If the same SVG is served inside an iframe, however, it may inherit the encoding of the parent document rather than falling back to the default encoding of the format (UTF-8).
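The difference between the two embedding methods can be put as a small Python sketch. Everything here is hypothetical naming for illustration: `unlabelled_svg_encoding`, its parameters, and the `ua_inherits_for_xml` flag, which stands for a browser that lets XML documents inherit the parent encoding.

```python
XML_DEFAULT = "utf-8"  # the default encoding of the format, as noted above


def unlabelled_svg_encoding(embedding: str,
                            parent_encoding: str,
                            ua_inherits_for_xml: bool) -> str:
    """Fallback encoding for an SVG file that carries no encoding info.

    embedding           -- "img" or "iframe" (assumed labels for this sketch)
    parent_encoding     -- encoding of the embedding HTML document
    ua_inherits_for_xml -- True for a UA that applies parent inheritance to XML
    """
    # Only an iframe creates a child browsing context; <img> does not.
    creates_browsing_context = (embedding == "iframe")
    if creates_browsing_context and ua_inherits_for_xml:
        return parent_encoding
    return XML_DEFAULT
```

So the same file can end up decoded differently depending on how it is embedded: as UTF-8 in `<img>`, but as the parent's encoding in an iframe in a UA that inherits.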
The consequences I draw from this tiresome research are that HTML5’s encoding sniffing algorithm should be updated as follows:
PROPOSAL: Add a new, 2nd last step, like so:
PROPOSAL: Add a new step after the current first step (about user overriding), like so:
PROPOSAL: Add this step as a step zero:
[Purpose: to avoid that the/an HTML encoding sniffing algorithm (sometimes) is applied to XML.]
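As a rough illustration of such a step zero, the Python sketch below decides from the MIME type whether the HTML sniffing algorithm should run at all. The function names are hypothetical, and treating `text/xml`, `application/xml` and any `+xml` type as XML is an assumption of this sketch, not the wording of the proposal.

```python
def is_xml_mime(mime: str) -> bool:
    """True if the MIME type denotes an XML format (assumed rule, see above)."""
    # Drop parameters such as "; charset=..." and normalise case.
    essence = mime.split(";", 1)[0].strip().lower()
    return essence in ("text/xml", "application/xml") or essence.endswith("+xml")


def html_sniffing_applies(mime: str) -> bool:
    """Step zero: XML documents opt out of the HTML encoding sniffing algorithm."""
    return not is_xml_mime(mime)
```

Under this rule, `image/svg+xml` and `application/xhtml+xml` would skip HTML sniffing entirely, while `text/html` would go through it as before.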
© 2013 Leif Halvard Silli