HTML5 Mastery: Encoding

Encoding is just one of those things that need to be done right. If done wrong, everything seems to be broken and nothing works. If done right, no one will notice. This makes dealing with encoding so annoying.

Nevertheless, we are quite lucky: most things are already well prepared for us. We only need to ensure that our documents are saved (and transmitted) with the right encoding. The right encoding is the one we specify. It could be anything, as long as it contains all the characters we need, and as long as we stay consistent.

There are three important text encoding rules for HTML:

Load the content with the right encoding.
Transmit the content with the same encoding.
Ensure that the client reads the content with the specified encoding.

In this article we will look at all three rules in detail, especially the second and the third. At the end we will also look at form encoding, which is not directly about text encoding but is connected to it indirectly. We will see where that connection lies.

Choosing the Right Encoding

Either we know directly that our content should be delivered in some exotic encoding or we should just pick UTF-8. There are many reasons why we would want to use UTF-8. It is not a great format for storing characters in memory, but it is just wonderful as a basis for data exchange and content transmission. It is basically a no-brainer. Nevertheless, one of the more common mistakes is to save files without proper encoding. As there is no text without encoding, we should choose the encoding carefully.

Users of Sublime Text and most other text editors have probably never faced a problem with wrong encoding, since these editors save in UTF-8 by default. There are editors, mostly for the Windows platform, which use a different default format, e.g., Windows-1252.

Even in Sublime Text it is one of the more standard operations to change the encoding of the file. In the File menu we select Save with Encoding and select the one we want. That’s it!

In principle every more advanced editor should have such options. Sometimes they are contained in an advanced save menu. For instance, the editor for Microsoft’s Visual Studio triggers a special dialog after clicking Advanced Save Options… in the File menu.

We should make sure to use the right encoding, since it determines which bytes are used to store our content. UTF-8 has the major advantage of requiring only a single byte for each ASCII character. At most 4 bytes per character are consumed. This variable width makes UTF-8 an ideal format for text storage and transmission. The caveat is, however, that UTF-8 is not the best format for working with strings in memory.
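The variable width is easy to observe. The following is a minimal sketch for Node.js; the helper name utf8Length and the sample characters are chosen purely for illustration. It counts the bytes UTF-8 needs for characters from different ranges.

```javascript
// Count how many bytes UTF-8 needs for a few sample characters.
// Escape sequences are used so the example does not depend on
// the encoding this file itself was saved with.
const samples = {
  'latin letter A': 'A',       // U+0041, plain ASCII
  'a with umlaut': '\u00e4',   // U+00E4
  'euro sign': '\u20ac',       // U+20AC
  'emoji': '\u{1F600}',        // U+1F600, outside the Basic Multilingual Plane
};

// Hypothetical helper: number of bytes the text occupies in UTF-8.
function utf8Length(text) {
  return Buffer.byteLength(text, 'utf8');
}

for (const [label, ch] of Object.entries(samples)) {
  console.log(`${label}: ${utf8Length(ch)} byte(s)`);
}
```

The four samples need one, two, three and four bytes respectively, which is why purely ASCII content is as compact in UTF-8 as it is in ASCII itself.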

Controlling the Transmission

The HTTP protocol transmits data as plain text. Even if we decide to encode the transmitted content as GZip or if we use HTTPS, which encrypts the content, the underlying content is still just plain text. We’ve already learned that there is no such thing as just plain text. We always need to associate the content with some encoding to get a text representation.

An HTTP message is split into two parts. The upper part is called the headers. Separated from it by an empty line is the lower part: the body.

There are always at least two HTTP messages: a request and its associated response. Both types of messages share this structure. The body of a response is the content we want to transmit. The body of a request is only of interest for form submission, which we’ll care about later. If we want to provide some information on the encoding of the content, we have to supply some information in the header.

The following header tells the receiving side that the body contains a special text format called HTML, using the UTF-8 character set.

Content-Type: text/html; charset=utf-8

There is also the Content-Encoding header. We can easily confuse the content encoding with the actual text encoding of the content. The former is used to specify encoding of the whole package, e.g. GZip, while the latter is used as an initial setting for, e.g., parsing the provided content.
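To make the distinction concrete: the text encoding travels as the charset parameter of Content-Type, not in Content-Encoding. The following sketch shows how a receiver might pull that parameter out of the header value. The function name charsetFromContentType is hypothetical, and the regular expression is deliberately lenient rather than a full implementation of the header grammar.

```javascript
// Hypothetical helper: extract the charset parameter from a
// Content-Type header value. Returns null if none is present.
function charsetFromContentType(value) {
  // Lenient pattern: optional whitespace and optional double quotes.
  const match = /;\s*charset\s*=\s*"?([^\s;"]+)"?/i.exec(value);
  return match ? match[1].toLowerCase() : null;
}

console.log(charsetFromContentType('text/html; charset=utf-8'));
console.log(charsetFromContentType('text/html'));
```

A value without the parameter yields null, in which case the receiver has to fall back on other means of determining the encoding.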

If we care about the correctness of this step we have to make sure that our web server sends the correct header. Most web frameworks offer such an ability. In PHP we could write:

header('Content-Type: text/html; charset=utf-8');

In Node.js we may want to use the following, where res is the variable representing the response:

res.setHeader('Content-Type', 'text/html; charset=utf-8');

The transmitted header sets the initial encoding used by the text scanner for the HTML input. In the previous example this is UTF-8. But note that this is only an initial setting, and there are several ways to override it. If the actual content is not UTF-8, the scanner may recognize this and change the setting. Such a change may be triggered by Byte-Order-Mark (BOM) detection or by finding encoding-specific patterns in the content. The latter inspects the content itself, whereas the former looks for an artificially prepended byte pattern.
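BOM detection itself is simple, as the following sketch suggests. The function name detectBom is an assumption made for this example, and only the UTF-8 and UTF-16 marks are checked.

```javascript
// Hypothetical helper: return the encoding indicated by a
// Byte-Order-Mark at the start of the data, or null if there is none.
function detectBom(bytes) {
  if (bytes.length >= 3 &&
      bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return 'utf-8';
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    return 'utf-16be';
  }
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return 'utf-16le';
  }
  return null;
}

// A UTF-8 BOM followed by the "<" character.
console.log(detectBom(Buffer.from([0xef, 0xbb, 0xbf, 0x3c])));
// Plain ASCII content without any BOM.
console.log(detectBom(Buffer.from('<html>', 'ascii')));
```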

Finally, the encoding may change due to our HTML code. This can only be changed once.

Fixing the Encoding

Once the DOM constructor hits a meta tag, it will look for a charset declaration. If one is found, the character set will be extracted. If we can extract it successfully and if the encoding is valid, we set the new encoding for scanning further characters. At this point the encoding will be frozen, and no further changes are possible.
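The lookup for a charset declaration can be approximated as follows. This is a rough sketch and not how a browser actually implements it: the function name findMetaCharset is hypothetical, a single regular expression stands in for real tag parsing, and only the first 1024 bytes are inspected, which is where the HTML specification requires the declaration to be placed.

```javascript
// Simplified sketch: look for a charset declaration near the start
// of a document given as raw bytes. Returns null if none is found.
function findMetaCharset(bytes) {
  // Decode as latin1 so that every byte maps to exactly one character;
  // the declaration itself only consists of ASCII characters.
  const head = Buffer.from(bytes).subarray(0, 1024).toString('latin1');
  const match = /<meta[^>]*charset\s*=\s*["']?([a-z0-9_-]+)/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}

const page = Buffer.from(
  '<!DOCTYPE html><html lang="en"><meta charset="utf-8"><title>T</title>',
  'ascii'
);
console.log(findMetaCharset(page));
```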

There is just one caveat. To check if the previous scanning was alright, we need to compare the characters that have already been scanned with the characters that would have been scanned. Hence we need to see if changing the encoding earlier would have made some difference. If we find a difference, we need to restart the whole parsing procedure. Otherwise the whole DOM structure may be wrong up to this point.
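The comparison can be sketched as decoding the bytes read so far under both encodings and checking whether the results differ. The function name decodingDiffers is hypothetical, and Node's latin1 encoding is used here as a stand-in for a locale default such as Windows-1252.

```javascript
// Hypothetical check: would the bytes scanned so far have produced
// different characters under the newly declared encoding?
function decodingDiffers(prefixBytes, initialEncoding, declaredEncoding) {
  const buf = Buffer.from(prefixBytes);
  return buf.toString(initialEncoding) !== buf.toString(declaredEncoding);
}

// Only ASCII before the declaration: both decodings agree.
const asciiPrefix = Buffer.from('<!DOCTYPE html><html lang="en">', 'utf8');
// A non-ASCII character before the declaration: the decodings disagree.
const umlautPrefix = Buffer.from('<!-- H\u00e4llo --><html>', 'utf8');

console.log(decodingDiffers(asciiPrefix, 'latin1', 'utf8'));
console.log(decodingDiffers(umlautPrefix, 'latin1', 'utf8'));
```

Only the second prefix would force a restart, since an ASCII-only prefix decodes identically under both encodings.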

As a consequence we’ve already learned two lessons:

Place the <meta charset="utf-8"> (or some other encoding) tag as soon as possible.
Only use ASCII characters before specifying the charset attribute in HTML.

Finally, a good starter for a boilerplate looks as follows. As we learned in the previous article, we can omit the head and body tags. The snippet does two things right: It uses the correct document type, and it selects the character set as soon as possible.


<!DOCTYPE html>
<html lang="en">
<meta charset="utf-8">
<title>Title here</title>
<!-- ... -->

The only remaining question is: What happens if we forget one of these three steps? Well, the first and third steps are the most important ones. Forgetting the transmission step is actually not that bad. If no initial encoding is given in the HTTP headers, the browser will select the initial encoding based on the user’s locale. With a German locale we get Windows-1252. This is actually the default for most countries. Some countries, like Poland or Hungary, select Latin-2, also known as ISO-8859-2.

In principle we do not have to worry about this initial encoding if we followed the best practices described earlier. ASCII is a subset of Unicode, and most of the listed encodings are actually just ASCII extensions to satisfy the specific needs of one or more countries. If we only use basic ASCII characters until the character set is specified, we should be fine.

Much more severe is a conflict between the encoding of the stored, read or generated data that is delivered to the client and the statement in the meta tag. If something goes wrong, special characters are rendered as garbled sequences of unrelated symbols. This is not a pleasant user experience.
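Such a conflict is easy to reproduce. The following sketch encodes a short text as UTF-8 and then deliberately decodes the bytes with a single-byte encoding. Node's latin1 is used here as an approximation of what a browser would do with Windows-1252, and the sample text is an assumption made for illustration.

```javascript
// Text with two umlauts, written with escapes so the example does not
// depend on the encoding of this file.
const original = 'H\u00e4llo D\u00fc?';

// The bytes as they would be stored or sent in UTF-8.
const utf8Bytes = Buffer.from(original, 'utf8');

// The same bytes read with the wrong (single-byte) encoding.
const misread = utf8Bytes.toString('latin1');

console.log(original.length, misread.length);
console.log(misread);
```

Each umlaut occupies two bytes in UTF-8, so the misread text contains two characters in place of every umlaut and is two characters longer overall.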

Coming back to determining the right encoding, there are many reasons why UTF-8 would be the best choice. Any other encoding should at least be sufficient for the characters we want to display. However, if we provide form input fields, we may be in trouble. At this point we do not control the characters that are used any more. Users are allowed to input anything here. Let’s see how we can control the encoding for form input.

Submitting Forms

A form is submitted with a certain encoding type, which is not the same as the encoding type of a server’s response, e.g. GZip. The form’s encoding type determines how the form is serialized before sending it to the server. It is particularly useful in conjunction with the HTTP verb.

Ordinary form submissions use POST as the HTTP verb, but GET is also common and is in fact the default. PUT and DELETE are not available through the method attribute of a form and require scripting. Only POST and PUT are supposed to use the body for content transmission in the request. The browser will construct the content with respect to the choice of the enctype attribute of the <form> element, specifying the encoding type. The encoding type is transported by setting the Content-Type header in the HTTP request.

There are three well-established encoding types:

URL encoded (default value, explicitly application/x-www-form-urlencoded)
Plain text (text/plain)
Multipart (multipart/form-data)

The first and the second are quite similar, but they have subtle (and very important) differences. The third variant is the most powerful method. It even allows the transporting of arbitrary files as attachments.

The key difference between the first two types is that URL encoded form transmission percent-encodes all names and values, which is not done by plain text. The percent-encoding guarantees that the receiving side can distinguish between names and values. This guarantee does not exist with plain text form submission. The third variant uses a boundary string to separate the entries, which is unique by construction.

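Let’s visualize the differences by submitting a simple form. The form contains code along the following lines:

<form method="post">
<input type="text" name="first" value="With spaces+signs">
<input type="text" name="second" value="Hällo Dü?">
<textarea name="third">Multi
lines
rock</textarea>
<input type="submit">
</form>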

Submitting the form without specifying any encoding type transmits the following body:

first=With+spaces%2Bsigns&second=H%E4llo+D%FC%3F&third=Multi%0D%0Alines%0D%0Arock

The URL encoding transforms the white-space characters to plus signs. Existing plus signs, like all “special” characters, are transformed by the percent-encoding rules. This especially applies to new lines, originally represented by \r\n, which are now displayed as %0D%0A.
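The same serialization can be reproduced outside the browser. The sketch below uses the standard URLSearchParams class available in Node.js; the field names and values are taken from the example above. Note that URLSearchParams always percent-encodes based on UTF-8, so the umlauts come out as two bytes each (%C3%A4 and %C3%BC). The single-byte values %E4 and %FC in the body above indicate that the example page used a single-byte encoding such as Windows-1252.

```javascript
// Field names and values from the example form.
const fields = {
  first: 'With spaces+signs',
  second: 'H\u00e4llo D\u00fc?',
  third: 'Multi\r\nlines\r\nrock',
};

// Serialize as application/x-www-form-urlencoded (always UTF-8 based).
const body = new URLSearchParams(fields).toString();
console.log(body);

// Parsing the body again restores the original values, line breaks included.
const parsed = new URLSearchParams(body);
console.log(parsed.get('third') === fields.third);
```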

Let’s see what the outcome for plain text encoding looks like.


first=With spaces+signs
second=Hällo Dü?
third=Multi
lines
rock

The pairs are split by new lines. This is especially problematic for multi-line content and may lead to incorrect representations.
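The ambiguity shows up as soon as a receiver tries to parse such a body. The following sketch uses two hypothetical functions, serializePlainText and naiveParse, to mimic the plain text format and a straightforward line-based reader.

```javascript
// Mimic text/plain form serialization: one name=value pair per line.
function serializePlainText(fields) {
  return Object.entries(fields)
    .map(([name, value]) => `${name}=${value}`)
    .join('\r\n') + '\r\n';
}

// A straightforward reader that treats every line as one pair.
function naiveParse(body) {
  const result = {};
  for (const line of body.split('\r\n')) {
    const idx = line.indexOf('=');
    if (idx === -1) continue; // lines without "=" cannot be assigned to a name
    result[line.slice(0, idx)] = line.slice(idx + 1);
  }
  return result;
}

const multiLine = naiveParse(serializePlainText({
  third: 'Multi\r\nlines\r\nrock',
}));
console.log(multiLine.third); // the remaining two lines are lost

// A value containing a line break and "=" creates a field that was never in the form.
const injected = naiveParse(serializePlainText({
  comment: 'hi\r\nadmin=true',
}));
console.log(Object.keys(injected));
```

A smarter reader could append lines without an equals sign to the previous value, but it still could not tell the second case apart from a genuine second field.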

In a way the multipart encoding combines the advantages of plain text submission with a defined boundary, which essentially solves the problems of the plain text encoding. The only drawback is the increased content length.


------WebKitFormBoundaryzQRASBvDO1bUB5Lp
Content-Disposition: form-data; name="first"

With spaces+signs
------WebKitFormBoundaryzQRASBvDO1bUB5Lp
Content-Disposition: form-data; name="second"

Hällo Dü?
------WebKitFormBoundaryzQRASBvDO1bUB5Lp
Content-Disposition: form-data; name="third"

Multi
lines
rock
------WebKitFormBoundaryzQRASBvDO1bUB5Lp--
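The structure of such a body can be sketched in a few lines. The function name buildMultipartBody is hypothetical, the boundary is simply reused from the example above, and the sketch only handles text fields whose names contain no quotes. Browsers generate the boundary themselves and additionally support file parts.

```javascript
// Simplified sketch: build a multipart/form-data body for text fields.
function buildMultipartBody(fields, boundary) {
  let body = '';
  for (const [name, value] of Object.entries(fields)) {
    body += `--${boundary}\r\n`;
    body += `Content-Disposition: form-data; name="${name}"\r\n`;
    body += '\r\n'; // an empty line separates part headers from the value
    body += `${value}\r\n`;
  }
  // The closing delimiter carries two extra dashes at the end.
  return body + `--${boundary}--\r\n`;
}

const boundary = '----WebKitFormBoundaryzQRASBvDO1bUB5Lp';
const multipartBody = buildMultipartBody({
  first: 'With spaces+signs',
  second: 'H\u00e4llo D\u00fc?',
  third: 'Multi\r\nlines\r\nrock',
}, boundary);

console.log(multipartBody);
```

The boundary itself is announced in the Content-Type header of the request, so the receiver knows which delimiter to look for.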

The last two form encoding methods also display special characters exactly as we’ve entered them. Which bytes are actually sent depends on the text encoding: form transmission primarily uses the accept-charset attribute of the corresponding <form> element. If no such attribute is given, the encoding of the page is used. Again, setting the correct encoding is important.

A fourth encoding type, called application/json, has been proposed. As the name suggests, it would pack the form content into a JSON string.

Conclusion

Choosing the right encoding can be as easy as just picking UTF-8. Typical problems can be avoided by using the same encoding consistently. Declaring the encoding during transport is certainly useful, although not required, especially if we follow best practices for placing a <meta> element with the charset attribute.

Form submission is a process that relies on the right encoding choice—not only for the text, but for the submission itself. In general we can always choose multipart/form-data as enctype, even though the default encoding type might be better (smaller) in most scenarios. In production we should never use text/plain.