Encoding is just one of those things that need to be done right. If done wrong, everything seems to be broken and nothing works. If done right, no one will notice. This makes dealing with encoding so annoying.
Nevertheless, we are quite lucky and most of the things are already really well-prepared. We only need to ensure that our documents are saved (and transmitted) with the right encoding. The right encoding is the one we specify. It could be anything, as long as it contains all characters we need, and as long as we stay consistent.
There are three important text encoding rules for HTML:
- Load the content with the right encoding.
- Transmit the content with the same encoding.
- Ensure that the client reads the content with the specified encoding.
In this article we will look at all three rules, especially the second and the third. At the end we will also look at form encoding, which is not directly related to text encoding but is connected to it indirectly. We will see where that connection comes from.
Choosing the Right Encoding
Either we know directly that our content should be delivered in some exotic encoding or we should just pick UTF-8. There are many reasons why we would want to use UTF-8. It is not a great format for storing characters in memory, but it is just wonderful as a basis for data exchange and content transmission. It is basically a no-brainer. Nevertheless, one of the more common mistakes is to save files without proper encoding. As there is no text without encoding, we should choose the encoding carefully.
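To see why there is no text without encoding, we can read the same bytes under two different encodings. The following is a minimal sketch for Node.js; the sample word and the variable names are only chosen for illustration, and Latin-1 stands in for any single-byte encoding that differs from UTF-8.

```javascript
// "café" encoded as UTF-8: the "é" (U+00E9) becomes the two bytes 0xC3 0xA9.
const bytes = Buffer.from('caf\u00e9', 'utf8');

// Reading the bytes back with the encoding they were written in.
const consistent = bytes.toString('utf8');

// Reading the very same bytes as Latin-1 (ISO-8859-1) instead.
const inconsistent = bytes.toString('latin1');

console.log(bytes.length);  // 5 bytes for 4 characters
console.log(consistent);    // café
console.log(inconsistent);  // cafÃ©
```

The bytes never changed. Only the encoding we assumed while reading them did, and that alone was enough to garble the text.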
Users of Sublime Text and most other text editors have probably never faced a problem with wrong encoding, since these editors save in UTF-8 by default. There are editors, mostly for the Windows platform, which use a different default format, e.g., Windows-1252.
Even in Sublime Text it is one of the more standard operations to change the encoding of the file. In the File menu we select Save with Encoding and select the one we want. That’s it!
In principle every more advanced editor should have such options. Sometimes they are contained in an advanced save menu. For instance, the editor for Microsoft’s Visual Studio triggers a special dialog after clicking Advanced Save Options… in the File menu.
We should make sure to use the right encoding, since it determines which bytes represent our content. UTF-8 has the major advantage of requiring only a single byte for each ASCII character. At most 4 bytes per character are consumed. This variable width makes UTF-8 an ideal format for text storage and transmission. The caveat is, however, that UTF-8 is not the best format for working with strings in memory.
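The variable width is easy to observe. The sketch below assumes a runtime that provides the standard `TextEncoder` (Node.js and all current browsers do); `utf8Length` is a helper name made up for this example.

```javascript
// TextEncoder always produces UTF-8.
const encoder = new TextEncoder();

// Number of bytes a string occupies once encoded as UTF-8.
function utf8Length(text) {
  return encoder.encode(text).length;
}

const samples = [
  'a',          // U+0061, plain ASCII
  '\u00e9',     // U+00E9, "é"
  '\u20ac',     // U+20AC, the euro sign
  '\u{1F600}',  // U+1F600, an emoji outside the Basic Multilingual Plane
];

for (const sample of samples) {
  console.log(sample, utf8Length(sample));
}
```

The four samples take 1, 2, 3, and 4 bytes respectively.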
Controlling the Transmission
The HTTP protocol transmits data as plain text. Even if we decide to encode the transmitted content as GZip or if we use HTTPS, which encrypts the content, the underlying content is still just plain text. We’ve already learned that there is no such thing as just plain text. We always need to associate the content with some encoding to get a text representation.
An HTTP message is split in two parts. The upper part is called the headers. Separated by an empty line is the lower part: the body.
There are always at least two HTTP messages: a request and its associated response. Both types of messages share this structure. The body of a response is the content we want to transmit. The body of a request is only of interest for form submission, which we’ll care about later. If we want to provide some information on the encoding of the content, we have to supply some information in the header.
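The two-part structure can be sketched in a few lines. The response below is written by hand for illustration, and `splitMessage` is a hypothetical helper, not part of any library. It also simplifies matters by treating the whole message as a string, whereas a real body arrives as bytes.

```javascript
// A hand-written response; real messages arrive over a socket.
const rawResponse = [
  'HTTP/1.1 200 OK',
  'Content-Type: text/html; charset=utf-8',
  '',
  '<!DOCTYPE html><p>Hi</p>',
].join('\r\n');

// Split a raw message into start line, headers, and body.
function splitMessage(raw) {
  // The first empty line separates the headers from the body.
  const boundary = raw.indexOf('\r\n\r\n');
  const head = boundary === -1 ? raw : raw.slice(0, boundary);
  const body = boundary === -1 ? '' : raw.slice(boundary + 4);

  const [startLine, ...headerLines] = head.split('\r\n');
  const headers = {};
  for (const line of headerLines) {
    const colon = line.indexOf(':');
    if (colon === -1) continue; // skip malformed lines
    // Header names are case-insensitive, so we normalize them.
    const name = line.slice(0, colon).trim().toLowerCase();
    headers[name] = line.slice(colon + 1).trim();
  }
  return { startLine, headers, body };
}

const message = splitMessage(rawResponse);
console.log(message.headers['content-type']); // text/html; charset=utf-8
console.log(message.body);                    // <!DOCTYPE html><p>Hi</p>
```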
The following header tells the receiving side that the body contains a special text format called HTML, using the UTF-8 character set.
Content-Type: text/html; charset=utf-8
There is also the Content-Encoding header. We can easily confuse the content encoding with the actual text encoding of the content. The former is used to specify the encoding of the whole package, e.g., GZip, while the latter is used as an initial setting for, e.g., parsing the provided content.
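The difference between the two becomes clearer when both are applied one after the other. The sketch below uses Node's built-in `zlib` module; the sample markup and the variable names are assumptions made for this example.

```javascript
const zlib = require('zlib');

const html = '<p>Gr\u00fc\u00dfe</p>'; // "Grüße"

// Step 1, text encoding (charset): characters become bytes.
const utf8Bytes = Buffer.from(html, 'utf8');

// Step 2, content encoding: the bytes are packed for transport.
const wireBytes = zlib.gzipSync(utf8Bytes);

// The receiver undoes both steps in reverse order.
const unpacked = zlib.gunzipSync(wireBytes);            // Content-Encoding: gzip
const text = new TextDecoder('utf-8').decode(unpacked); // charset=utf-8

console.log(text === html); // true
```

Swapping the order on the receiving side would not work: the compressed bytes are not valid text in any character set.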
If we care about the correctness of this step we have to make sure that our web server sends the correct header. Most web frameworks offer such an ability. In PHP we could write:
header('Content-Type: text/html; charset=utf-8');
In Node.js we may want to use the following, where res is the variable representing the response:
res.setHeader('Content-Type', 'text/html; charset=utf-8');
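Put into context, the call could live in a request handler such as the one below. This is only a sketch using Node's built-in `http` module; the handler name, the markup, and the port mentioned afterwards are assumptions for illustration.

```javascript
const http = require('http');

// Announce the charset in the header, then send a body that matches it.
function handler(req, res) {
  res.setHeader('Content-Type', 'text/html; charset=utf-8');
  // When given a string, res.end() encodes it as UTF-8 by default,
  // so the announced and the actual encoding agree.
  res.end('<!DOCTYPE html><meta charset=utf-8><title>Gr\u00fc\u00dfe</title><p>Gr\u00fc\u00dfe');
}

// The server is created here but not started.
const server = http.createServer(handler);
```

Calling `server.listen(8080)` would start serving on port 8080. The important part is that the header and the bytes we actually write stay consistent.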
The transmitted header will set the text scanner of the HTML input to the provided setting. In the case of the previous example we use UTF-8. But wait: this is only an initial setting! There are many ways to override it. If the actual content is not UTF-8, the scanner may recognize this and change the setting. Such a change may be triggered by Byte-Order-Mark (BOM) detection or by finding encoding-specific patterns in the content. The latter inspects the content itself, whereas the former looks for an artificially prepended marker.
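BOM detection itself is simple, because only the first two or three bytes have to be inspected. The function below is a sketch of the idea and not the code any particular browser runs; `sniffBom` is a name made up for this example.

```javascript
// Returns the encoding announced by a byte-order mark, or null if there is none.
function sniffBom(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return 'utf-8';
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    return 'utf-16be';
  }
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return 'utf-16le';
  }
  return null;
}

console.log(sniffBom(Uint8Array.of(0xef, 0xbb, 0xbf, 0x3c))); // utf-8
console.log(sniffBom(Uint8Array.of(0x3c, 0x21, 0x44)));       // null
```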
Finally, the encoding may change due to our HTML code. This can only be changed once.
Fixing the Encoding
Once the DOM constructor hits a meta tag, it will look for a charset declaration. If one is found, the character set will be extracted. If we can extract it successfully and if the encoding is valid, we set the new encoding for scanning further characters. At this point the encoding will be frozen, and no further changes are possible.
There is just one caveat. To check if the previous scanning was alright, we need to compare the characters that have already been scanned with the characters that would have been scanned. Hence we need to see if changing the encoding earlier would have made some difference. If we find a difference, we need to restart the whole parsing procedure. Otherwise the whole DOM structure may be wrong up to this point.
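The comparison can be illustrated by decoding the bytes scanned so far under both encodings. This is a deliberately naive sketch: `needsRestart` is a hypothetical helper, Latin-1 stands in for whatever encoding was assumed first, and real parsers decide this in a more refined way.

```javascript
// True if the bytes scanned so far read differently under the two encodings.
function needsRestart(scannedBytes, tentativeEncoding, declaredEncoding) {
  return scannedBytes.toString(tentativeEncoding) !== scannedBytes.toString(declaredEncoding);
}

// Only ASCII before the declaration: both encodings agree.
const asciiPrefix = Buffer.from('<!DOCTYPE html><title>Hello</title>', 'utf8');

// A non-ASCII title ("Grüße") before the declaration: the readings differ.
const umlautPrefix = Buffer.from('<!DOCTYPE html><title>Gr\u00fc\u00dfe</title>', 'utf8');

console.log(needsRestart(asciiPrefix, 'latin1', 'utf8'));  // false
console.log(needsRestart(umlautPrefix, 'latin1', 'utf8')); // true
```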
As a consequence we’ve already learned two lessons:
- Place the <meta charset=utf-8> (or some other encoding) tag as soon as possible.
- Only use ASCII characters before specifying the charset attribute in HTML.
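Both lessons can be tied together in a small scanner that looks for the declaration near the start of the document. The sketch below is a strong simplification of what browsers do: `findMetaCharset` is a hypothetical helper, the regular expression ignores many edge cases, and the 1024-byte window reflects the HTML specification's requirement that the declaration appear within the first 1024 bytes.

```javascript
// Look for a charset declaration in the first 1024 bytes of a document.
function findMetaCharset(bytes) {
  // Latin-1 maps every byte to exactly one character, so it is a safe way
  // to look at the ASCII markup before we know the real encoding.
  const prefix = Buffer.from(bytes).subarray(0, 1024).toString('latin1');
  const match = /<meta\b[^>]*?charset\s*=\s*["']?([\w.:-]+)/i.exec(prefix);
  return match ? match[1].toLowerCase() : null;
}

const early = Buffer.from('<!DOCTYPE html><meta charset="UTF-8"><title>x</title>');
const late = Buffer.from('<!--' + 'x'.repeat(2000) + '--><meta charset=utf-8>');

console.log(findMetaCharset(early)); // utf-8
console.log(findMetaCharset(late));  // null, the declaration comes too late
```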
Finally, a good starter for a boilerplate looks as follows. As we learned in the previous article, we can omit the body tags. The snippet does two things right: It uses the correct document type, and it selects the character set as soon as possible.
```html <!DOCTYPE html>