
How To Read The Encoding Header Without Knowing The Encoding?

If I am reading an XML or HTML file, don't I have to read the tag that tells me the encoding to be able to read the file? Isn't that tag encoded the same way the file is? I am curi

Solution 1:

Here's what W3C has to say about it:

The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use--which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in each entity in normal cases.

http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing
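The autodetection described in that section can be sketched roughly as below. This is a minimal illustration, not a full implementation: the function name `guess_xml_encoding_family` and the returned labels are made up for this example, and the unusual UCS-4 byte orders listed in the spec's table are left out.

```python
def guess_xml_encoding_family(head: bytes) -> str:
    """Guess an encoding family from the first bytes of an XML entity.

    Hypothetical helper: the labels returned here are illustrative only.
    """
    # With a byte order mark. The 4-byte UTF-32/UCS-4 marks are tested
    # before the 2-byte UTF-16 marks, because FF FE 00 00 also starts
    # with FF FE.
    with_bom = [
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xff\xfe\x00\x00", "utf-32-le"),
        (b"\xfe\xff", "utf-16-be"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xef\xbb\xbf", "utf-8"),
    ]
    # Without a byte order mark: rely on the entity starting with "<?xm".
    without_bom = [
        (b"\x00\x00\x00<", "utf-32-be"),
        (b"<\x00\x00\x00", "utf-32-le"),
        (b"\x00<\x00?", "utf-16-be"),
        (b"<\x00?\x00", "utf-16-le"),
        (b"<?xm", "ascii-compatible"),  # UTF-8, ISO 8859-*, ASCII, ...
        (b"\x4c\x6f\xa7\x94", "ebcdic"),
    ]
    for prefix, family in with_bom + without_bom:
        if head.startswith(prefix):
            return family
    # No mark and no recognisable declaration: XML defaults to UTF-8.
    return "utf-8"
```

The result only identifies a family: for "ascii-compatible" the processor still has to read the declaration itself to learn whether the entity is UTF-8, ISO 8859-1, and so on.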

Solution 2:

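The encoding name is limited to [A-Za-z] ([A-Za-z0-9._] | '-')*, so it's written with the same characters in any encoding based on ASCII or ISO-646 (e.g. ISO 8859-*, ISO 10646/Unicode). In the byte-oriented ones, such as ISO 8859-* and UTF-8, the bytes themselves are identical too.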
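Because the declaration uses only ASCII characters, it can be pulled out of the raw bytes before the real encoding is known, at least for ASCII-compatible encodings. A rough sketch, assuming the entity is in such an encoding; `declared_encoding` and its `default` parameter are names invented for this example:

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration in raw bytes.
# The name pattern follows the character set the XML spec allows.
_XML_DECL = re.compile(
    rb"""^<\?xml[^>]*?\sencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1"""
)


def declared_encoding(head: bytes, default: str = "utf-8") -> str:
    """Return the declared encoding name, or `default` if none is declared.

    Hypothetical helper; only meaningful for ASCII-compatible encodings.
    """
    # Skip a UTF-8 byte order mark if one is present.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    match = _XML_DECL.match(head)
    if match is None:
        return default
    # The name is guaranteed to be pure ASCII by the pattern above.
    return match.group(2).decode("ascii")
```

Once the name is known, the whole entity can be decoded again from the start with that encoding.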

Edit: There are still some ambiguities though. For example, you still need to have some idea of whether to attempt to read 8-, 16-, or 32-bit chunks at a time to read it. There's also the detail that a UTF-16 or UTF-32/UCS-4 file usually starts with a BOM. The XML spec accounts for this: entities encoded in UTF-16 must begin with a BOM and UTF-8 entities may, and the BOM is treated as an encoding signature rather than as part of the document.

If, however, you know the file is supposed to contain XML, you have a pretty good idea of how the file needs to start, so an incorrect guess is easy to detect.
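That check can be sketched as follows: decode the first few bytes with the guessed encoding and see whether they spell out the start of an XML declaration. This is only an illustration; `plausible_xml_start` is a made-up name, and a document without an XML declaration would need a looser test (for example, just a leading `<`).

```python
def plausible_xml_start(head: bytes, encoding: str) -> bool:
    """Check whether `head`, decoded as `encoding`, begins with "<?xml".

    Hypothetical helper for rejecting a wrong encoding guess.
    """
    try:
        # 24 bytes cover a BOM plus "<?xml" even at 4 bytes per character.
        # errors="replace" keeps a wrong guess from raising; it just
        # produces text that fails the comparison below.
        text = head[:24].decode(encoding, errors="replace")
    except LookupError:
        # Python does not know a codec by this name.
        return False
    # A decoded byte order mark shows up as U+FEFF; ignore it.
    return text.lstrip("\ufeff").startswith("<?xml")
```

A wrong guess between 8-bit and 16-bit units fails immediately, because the decoded text starts with either unrelated characters or interleaved NULs instead of `<?xml`.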

Solution 3:

For HTML, it is documented in HTML5. (Don't read if you still believe anything is sane on the web, though.)
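As a very rough idea of what that involves, the sketch below scans the first 1024 bytes for a `charset` declaration inside a `<meta>` tag. It is a heavy simplification under stated assumptions, not the HTML5 algorithm itself, which also deals with BOMs, comments, HTTP headers and many malformed cases; `sniff_html_charset` is a hypothetical name.

```python
import re
from typing import Optional

# Finds charset=... inside a <meta> tag, covering both
# <meta charset="utf-8"> and the older http-equiv/content form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def sniff_html_charset(data: bytes) -> Optional[str]:
    """Return the charset named in an early <meta> tag, or None.

    Hypothetical, simplified helper; assumes an ASCII-compatible encoding.
    """
    # Only the start of the document is examined.
    match = _META_CHARSET.search(data[:1024])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

Like the XML case, this works only because the declaration itself is written in ASCII characters, so it reads the same in any ASCII-compatible encoding.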
