Base What? A Practical Introduction to Base Encoding

At a young age, we learn to count on our fingers: first from 1 to 5, then from 1 to 10, and, if you're particularly enterprising as a toddler, on to 20, 30, and beyond. No one ever tells us that we are making a deeper mathematical assumption along the way: we are counting in base 10.

In this article, we'll start by gaining a more rounded understanding of base 10 and its structure, and then discuss binary (base 2, the building block of computing) and hexadecimal (base 16). Finally, we'll finish up with base 32 and base 64. At each stage, we'll discuss the advantages and uses of each system.

Why Base 10?

Why did we choose base 10? It's not because the digits 0-9 exist; those are a result of choosing base 10. Most likely, it's because of the learning process I mentioned above: we have 10 fingers, which makes the system much easier to understand.

Let's talk a bit about how base 10 is structured. This will be the foundation of understanding that we'll use in the subsequent discussion.

Starting at 0, we count up to 9, filling the ones column. Once the ones column holds 9, it is full, so we move to the next column to the left and start it at 1. For all intents and purposes, we can assume an infinite number of leading zeros before the first significant column; in other words, "000008" is the same as "8". As each column fills up, the column to its left increases by one, and the full column starts again from 0. Specifically, the ones column increases from 0 to 9, and then 1 is added to the tens column. If the tens column and the ones column are both at 9, 1 is added to the hundreds column, and so forth. We all know this piece of the puzzle.

Consider the number 1020. Starting from the right, we can understand this as "0*1 + 2*10 + 0*100 + 1*1000". Now, consider the number 5,378. We can understand this as "8*1 + 7*10 + 3*100 + 5*1000". A generalized formula for the value of each column in base 10, then, is as follows:

(10 raised to the power of (the column's position from the right - 1)) * (the digit found in the column)

Therefore, if there is a 6 in the 5th column from the right, 10^4*6 = 60,000.

Adding up this value for every column gives the number itself, and, as we'll see, the same formula generalizes to every base system.
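As a rough illustration, the column formula can be sketched in PHP. The helper name columnValues() is made up for this example and is not a built-in function; it assumes the input contains only the digits 0 to 9.

```php
<?php

/**
 * Split a string of digits into the value contributed by each column.
 * columnValues() is a hypothetical helper, not part of PHP.
 */
function columnValues(string $digits, int $base = 10): array
{
    $values = [];
    $length = strlen($digits);

    // Walk the columns from right to left; $column is 1 for the rightmost.
    for ($column = 1; $column <= $length; $column++) {
        $digit = (int) $digits[$length - $column];
        $values[] = $digit * ($base ** ($column - 1));
    }

    return $values;
}

$values = columnValues('5378');
echo implode(' + ', $values) . ' = ' . array_sum($values) . "\n";
// Prints: 8 + 70 + 300 + 5000 = 5378
```

The base is a parameter only so that the same sketch can be reused for bases below 10; it defaults to 10 here.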

Base 2 (Binary)

The next system we will talk about is base 2, or binary. Binary consists of two digits, 0 and 1. This lends itself well to computing for many reasons, most fundamentally because computers rely on switches that have two states: on or off. Binary is the most basic system needed for all logical operations (think "true" and "false").

How does binary work? Take the formula above, and instead of using ten, use two. And on that note, this is why these systems are referred to as base N.

(2 raised to the power of (the column's position from the right - 1)) * (the digit found in the column)

So, let's take the arbitrary number 1001101 in binary, and apply this formula, starting from the rightmost column.

(1 * 1) + (0 * 2) + (1 * 4) + (1 * 8) + (0 * 16) + (0 * 32) + (1 * 64) = 77
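If you want to check a conversion like this one, PHP's built-in bindec() and decbin() functions convert between binary strings and decimal integers. A minimal sketch:

```php
<?php

$binary = '1001101';

// bindec() reads a string of binary digits and returns its decimal value.
$decimal = bindec($binary);
echo $decimal . "\n"; // Prints: 77

// decbin() goes the other way, from a decimal integer to a binary string.
$roundTrip = decbin($decimal);
echo $roundTrip . "\n"; // Prints: 1001101
```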

"Wait!", you're thinking. "If binary is all that computers are made of, how would you write letters in binary?" Good question. This actually brings us to our introduction of base 16.

Base 16

Let's, for a moment, imagine that we had 11 fingers. We would naturally be using a system of base 11. Besides being uncomfortably hard to imagine, what other implications would this have? Perhaps the most important is that we would have another increment beyond 9 in the ones column. But it wouldn't be "10", because 10 isn't confined to the ones column. It would instead be a single-digit representation of ten. And, in fact, that is exactly how letters function in base systems beyond base 10 up to base 62, with some caveats (which we'll get to later when we talk about base 32).

Let's imagine using base 11, but substitute a capital A for the single-digit "10" we discussed above. How would we write the number 54?

Since we know the second column from the right is the "11s" column, we would begin by dividing 54 by eleven, which gives us 4 with a remainder of 10. The 4 goes in the 11s column and the remainder goes in the ones column. If "A" represents 10, in base 11 the number 54 would be represented as 4A.

Let's do that in reverse, with the formula we used previously.

(11 raised to the power of (the column's position from the right - 1)) * (the digit found in the column)

In this case, that would mean:

(A * 1) + (4 * 11)

Now, substitute 10 for A:

(10 * 1) + (4 * 11) = 54
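The round trip for 54 can be sketched in a few lines of PHP. This is only an illustration for two-column numbers (anything below 121); the $digits string, with "A" standing in for ten, is an assumption made for this example.

```php
<?php

$number = 54;

// Digits for base 11, with "A" standing in for the single-digit ten.
$digits = '0123456789A';

// Forward: the quotient fills the 11s column, the remainder the ones column.
$elevens = intdiv($number, 11); // 4
$ones    = $number % 11;        // 10

$encoded = $digits[$elevens] . $digits[$ones];
echo $encoded . "\n"; // Prints: 4A

// Reverse: look up each character's value, then weight it by its column.
$decoded = strpos($digits, $encoded[0]) * 11 + strpos($digits, $encoded[1]) * 1;
echo $decoded . "\n"; // Prints: 54
```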

Hexadecimal

How is this useful, you're probably wondering. Base 11 may not necessarily be useful (unless you have some kind of data structure that would benefit from a base 11 system). However, base 16 is used throughout computer systems for multiple purposes. Also known as hexadecimal, base 16 uses the numbers 0 to 9 followed by the letters a to f (not case-sensitive). In particular, you will see hexadecimals used to define RGB colors in CSS (and in most color-picker widgets on desktop software), with two digits for each of the channels red, green, and blue.

So, for instance, #A79104 would produce r = A7, g = 91, b = 04. In decimal, this would be equivalent to r = 167, g = 145, b = 4; the resulting color would be a golden yellow. Two hexadecimal digits put together can represent 256 different numbers, and thus there are 256^3 (16,777,216) possible colors in the RGB hexadecimal system, represented by only 6 characters (or 3 if you use the shortcut method, where each of three digits is implicitly doubled; e.g. #37d == #3377dd).
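Here is one way the channel arithmetic might look in PHP, using the built-in hexdec() function. The helper name hexToRgb() is hypothetical, and the sketch assumes a well-formed three- or six-digit color.

```php
<?php

/**
 * Convert a CSS hex color such as "#A79104" or "#37d" to decimal channels.
 * hexToRgb() is a hypothetical helper used only for this sketch.
 */
function hexToRgb(string $color): array
{
    $hex = ltrim($color, '#');

    // Expand the three-digit shorthand by doubling each digit: 37d -> 3377dd.
    if (strlen($hex) === 3) {
        $hex = $hex[0] . $hex[0] . $hex[1] . $hex[1] . $hex[2] . $hex[2];
    }

    // Each channel is two hexadecimal digits; hexdec() is case-insensitive.
    return [
        'r' => hexdec(substr($hex, 0, 2)),
        'g' => hexdec(substr($hex, 2, 2)),
        'b' => hexdec(substr($hex, 4, 2)),
    ];
}

$rgb = hexToRgb('#A79104');
echo "r = {$rgb['r']}, g = {$rgb['g']}, b = {$rgb['b']}\n";
// Prints: r = 167, g = 145, b = 4
```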

Base 16 is often used in assembly languages, which are among the lowest-level programming languages available. Because hexadecimal is easy to convert to binary (each hexadecimal digit corresponds to exactly four bits), it is a convenient way to write values in assembly code.

The same is generally true of base 32 and base 64: these encodings are popular because their bases are powers of 2, which makes them a natural fit for binary data, and because almost every computer has at least 64 safe characters available (but not 128).

For a hexadecimal example, take the number 1100 in hexadecimal, which is equivalent to 4352 in decimal. The same number in binary is 0001 0001 0000 0000. Converting from hexadecimal to binary is a simple operation of using a conversion table, where 0 in hexadecimal is 0000 in binary and F in hexadecimal is 1111 in binary.

Note that the 0s to the left of the first 1 are simply empty columns. Fundamentally, they are not needed; however, you will encounter binary written this way, in fixed-width groups of bits, almost exclusively. This practice is called padding. It is used because the length of the data is otherwise unknown, which could cause problems when multiple data transmissions occur; by padding the final string, each group is guaranteed to be a fixed size, for instance 4 bits long for one hexadecimal digit. Padding also occurs in other commonly used, specification-based encoding schemes; in particular, base 32 and base 64 both use the equals sign ("=") for padding.
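The hexadecimal-to-binary lookup and the zero padding described above can be sketched together in PHP. Instead of a hard-coded conversion table, this example derives each 4-bit group with the built-in hexdec() and decbin() functions; hexToBinary() is a hypothetical helper name.

```php
<?php

/**
 * Convert a hexadecimal string to binary, one digit (4 bits) at a time.
 * hexToBinary() is a hypothetical helper used only for this sketch.
 */
function hexToBinary(string $hex): string
{
    $groups = [];

    foreach (str_split($hex) as $digit) {
        // decbin() drops leading zeros, so pad each group back to 4 bits.
        $groups[] = str_pad(decbin(hexdec($digit)), 4, '0', STR_PAD_LEFT);
    }

    return implode(' ', $groups);
}

echo hexToBinary('1100') . "\n"; // Prints: 0001 0001 0000 0000
echo hexToBinary('F') . "\n";    // Prints: 1111
```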

Base 32

Remember when we mentioned the caveat above? This is the caveat: the most commonly accepted base 32 definition is actually an encoding that starts with the 26 letters of the alphabet and ends with the numbers 2 to 7. This is defined in the Internet Engineering Task Force's Request for Comments (RFC) 4648, which also defines base 16 and base 64. Note the difference from the systems we have seen so far: the character that encodes the value 0 is A, not 0. To encode a string in base 32, the following steps are performed.

First, the string to be encoded is split into 5-byte blocks (40 bits in binary). Letters are represented by 8-bit blocks in ASCII (the standard for computers), so for every 5 letters, there are 40 bits. (An 8-bit block allows for a total of 256 distinct values; standard ASCII itself defines only the first 128.)

Next, divide these 40 bits into 8 five-bit blocks; so, for every 5 letters, there are 8 blocks to encode in base 32. Map each of these blocks to its character in the base 32 alphabet. For instance, if the five-bit block is 00010 (or decimal 2), the mapped character is the letter C. If the five-bit block is 01010 (decimal 10), the mapped character is the letter K.

Let's apply these steps to the string "yessir".

Character	ASCII Decimal	8-bit ASCII Binary
y	121	01111001
e	101	01100101
s	115	01110011
s	115	01110011
i	105	01101001
r	114	01110010

Let's take the binary representations and concatenate them now, splitting them into 5-bit groups:

01111 00101 10010 10111 00110 11100 11011 01001
01110 010(00) null null null null null null

Because the specification requires the encoding to be done in groups of eight 5-bit pieces, we have to pad with 0 bits if the number of bits isn't divisible by 5 (hence the 010(00) on the second line) and with = if the number of 5-bit pieces isn't divisible by 8. The "null" values will be replaced by the padding character, "=".

Each of these 5-bit binary numbers maps to a character in the 32-character alphabet; specifically, the output for yessir would be PFSXG43JOI======.
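The steps above can be sketched as a small PHP function. PHP has no built-in base 32 encoder, so base32Encode() is a hypothetical helper written for illustration; it builds a string of bits for clarity rather than speed.

```php
<?php

/**
 * Encode a byte string with the RFC 4648 base 32 alphabet.
 * base32Encode() is a hypothetical helper, not part of PHP.
 */
function base32Encode(string $data): string
{
    if ($data === '') {
        return '';
    }

    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567';

    // Step 1: write every byte as 8 bits and concatenate them.
    $bits = '';
    foreach (str_split($data) as $byte) {
        $bits .= str_pad(decbin(ord($byte)), 8, '0', STR_PAD_LEFT);
    }

    // Step 2: split into 5-bit groups, zero-padding the last one on the right,
    // and map each group to its character in the alphabet.
    $encoded = '';
    foreach (str_split($bits, 5) as $group) {
        $encoded .= $alphabet[bindec(str_pad($group, 5, '0', STR_PAD_RIGHT))];
    }

    // Step 3: pad with "=" until the output length is a multiple of 8.
    $remainder = strlen($encoded) % 8;
    if ($remainder !== 0) {
        $encoded .= str_repeat('=', 8 - $remainder);
    }

    return $encoded;
}

echo base32Encode('yessir') . "\n"; // Prints: PFSXG43JOI======
```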

Base 64

A similar process is followed for base 64, with a few fundamental differences. The base 64 alphabet consists of the letters A to Z and a to z, the numbers 0 to 9, and the symbols + and /. Compared with base 32, letters are case-sensitive, all ten digits are used (instead of the subset 2 to 7), and the symbols + and / are added. As mentioned previously, the "=" symbol is used for padding.

The base 64 encoding process takes 24-bit strings (3 letters) and breaks them into four 6-bit chunks, mapping the resulting binary number to the base 64 alphabet. So let's take a look at our previous example, the string "yessir".

8-bit binary: 01111001 01100101 01110011 01110011 01101001 01110010
6-bit chunks: 011110 010110 010101 110011 011100 110110 100101 110010
base 64: eWVzc2ly

There are a few important things to note. First, base 64 is case-sensitive. Second, because the number of bits (48) was divisible by 6, no bit-padding was necessary. The number of 6-bit chunks was divisible by four as well (which also means that the number of input characters was divisible by 3), so no null ("=") padding was necessary either.
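To follow the walkthrough in code, the sketch below reproduces the 6-bit chunks for "yessir" and compares the result with PHP's built-in base64_encode(). It deliberately skips padding, so it only applies to inputs whose length is divisible by 3, as in this example.

```php
<?php

$alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
$string = 'yessir';

// Write each character as 8 bits and concatenate the results.
$bits = '';
foreach (str_split($string) as $char) {
    $bits .= sprintf('%08b', ord($char));
}

// Regroup the same bits into 6-bit chunks.
$chunks = str_split($bits, 6);
echo implode(' ', $chunks) . "\n";
// Prints: 011110 010110 010101 110011 011100 110110 100101 110010

// Map each chunk's value to its character in the base 64 alphabet.
$encoded = '';
foreach ($chunks as $chunk) {
    $encoded .= $alphabet[bindec($chunk)];
}
echo $encoded . "\n"; // Prints: eWVzc2ly

// The built-in function produces the same string.
var_dump($encoded === base64_encode($string)); // Prints: bool(true)
```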

A Summary of Base 16, Base 32, and Base 64

These binary-friendly bases are used throughout programming. Binary data is encoded in these bases to ensure the fidelity of the transfer and guard against errors that might arise from accidentally transferring unencoded binary data. They rely on standards-based tables of characters, and are only guaranteed to work if both the encoder and decoder use the same table; for instance, there are widely accepted modified versions of base 32, including one by Douglas Crockford that excludes some characters, including the letter "U", in order to avoid unintentional obscenity.

Encoding in Practice

In addition to hexadecimal numbers, which are used regularly for CSS colors, base 32 and base 64 are used consistently on the web. Though the official encoding process for base 32 and base 64 bloats the size of a string, writing numbers in base 32 or base 64 can be very beneficial for things like URL shortening, where a URL might point to /foo/id. Consider the following decimal numbers and their base 32 and base 64 equivalents.

Decimal	Base 32	Base 64
20	U	U
50	BS	y
967	6H	PH
745619	WYET	C2CT
7241930	G5AGK	boDK
798312345192	XHPR7LTI	LnfH65o

As you can see, there are significant advantages to using base 64 or base 32 for number shortening. When every character counts, using these base encodings allows you to save characters. In many cases, the encoded number is about half the length of the non-encoded number.
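Numbers like these can be produced by repeated division, using each remainder as an index into the alphabet. The following is a minimal sketch for non-negative integers; encodeNumber() and the two constants are hypothetical names, and the alphabets are the RFC 4648 ones in uppercase-first order.

```php
<?php

const BASE32_ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567';
const BASE64_ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

/**
 * Write a non-negative integer using the characters of $alphabet as digits.
 * encodeNumber() is a hypothetical helper; the base is the alphabet length.
 */
function encodeNumber(int $number, string $alphabet): string
{
    $base = strlen($alphabet);
    $encoded = '';

    do {
        // The remainder is the rightmost digit that has not been written yet.
        $encoded = $alphabet[$number % $base] . $encoded;
        $number = intdiv($number, $base);
    } while ($number > 0);

    return $encoded;
}

foreach ([20, 745619, 7241930] as $n) {
    echo $n . ' => ' . encodeNumber($n, BASE32_ALPHABET)
        . ' / ' . encodeNumber($n, BASE64_ALPHABET) . "\n";
}
// Prints:
// 20 => U / U
// 745619 => WYET / C2CT
// 7241930 => G5AGK / boDK
```

Note that this treats the input as a number, not as a string of bytes, so it is a different operation from the base 32 and base 64 string encodings described earlier.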

Base Encoding and Decoding Functions in PHP

There are some built-in functions and third-party utilities in PHP that you can use to do base conversions or base encoding or decoding of binary data. They can prove useful every now and then.

The base_convert() Function

Let's say you want to convert a number between arbitrary bases. You can use the base_convert() function to do so. The following code converts the largest ten-digit prime number to other bases.

<?php

$number = '9999999967';

for ($i = 1; $i <= 5; $i++) {
    $base = pow(2, $i);
    echo "Base ".$base." : ".base_convert($number, 10, $base)."\n";
}
/* Outputs:
Base 2 : 1001010100000010111110001111011111
Base 4 : 21110002332033133
Base 8 : 112402761737
Base 16 : 2540be3df
Base 32 : 9a0nouv
*/

?>

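It's important to remember that this function does not handle encoding or decoding of binary data. It only converts a number from one base to another. Note also that its base 32 output uses the digits 0-9 followed by the letters a-v, not the RFC 4648 alphabet described earlier.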

The base64_encode() and base64_decode() Functions

You can use the base64_encode() function to encode your binary data as a string. The string will have MIME base64 encoding. The base64_decode() function will decode the encoded string back to the original data. Here is an example:

<?php

$string = 'Hasta La Vista';

$encoded = base64_encode($string);
echo $encoded;
// Outputs: SGFzdGEgTGEgVmlzdGE=

$decoded = base64_decode($encoded);
echo $decoded;
// Outputs: Hasta La Vista

?>

Using a Third-Party Utility

You can also consider using the Base2n class for handling all your binary-to-text encoding with ease. Let's say you want to encode the previous string using base 32 encoding, following its most common definition in the Internet Engineering Task Force's Request for Comments (RFC) 4648.

This encoding starts with all 26 letters of the alphabet and then contains the digits 2, 3, 4, 5, 6, and 7. There is no 0 or 1 in this encoding. These characters are supplied as the second parameter of the constructor. Here's an example that will encode our string to base 32.

<?php

include("Base2n.php");

$base32 = new Base2n(5, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567', false, true, true);
$encoded = $base32->encode('Hasta La Vista');

echo($encoded);
// Outputs: JBQXG5DBEBGGCICWNFZXIYI=

?>

A Note on Base 62 and URL-Modified Base 64

If you write the number 959 in base 64, the result is O/. Of course, this isn't a URL-safe value because of the "/": a URL pointing to O/ would not be interpreted as O/, but as O (which is the decimal value 14). It would also defeat the purpose to percent-encode the "/" as %2F (its ASCII code, 47, written in hexadecimal), as that lengthens the URL significantly.

Two main solutions have arisen to combat this issue. One is a URL-safe variant of base 64 that replaces the + and / with - and _, respectively. It also typically omits the = characters used for padding. The other option is to go to a base 62 encoding, which retains almost all of the benefits of base 64 and removes the + and /. However, because 62 is not a power of 2, base 62 encoding is not as easily applicable as a binary transmission substitute, and therefore is far less popular.
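PHP has no dedicated function for the URL-safe variant, but it can be sketched on top of base64_encode() and base64_decode(). The two function names below are hypothetical helpers, and the example assumes that padding is dropped on encode and restored on decode.

```php
<?php

/**
 * Encode binary data with the URL-safe base 64 alphabet, without "=" padding.
 * base64UrlEncode() is a hypothetical helper built on base64_encode().
 */
function base64UrlEncode(string $data): string
{
    // Swap "+" and "/" for "-" and "_", then drop the trailing padding.
    return rtrim(strtr(base64_encode($data), '+/', '-_'), '=');
}

/**
 * Decode a string produced by base64UrlEncode().
 * base64UrlDecode() is a hypothetical helper built on base64_decode().
 */
function base64UrlDecode(string $encoded): string
{
    // Undo the character swap and restore padding to a multiple of 4.
    $standard = strtr($encoded, '-_', '+/');
    $padding = (4 - strlen($standard) % 4) % 4;

    return base64_decode($standard . str_repeat('=', $padding));
}

// Bytes chosen so that the standard encoding contains both "+" and "/".
$raw = "\xfb\xff\xfe";

echo base64_encode($raw) . "\n";   // Prints: +//+
echo base64UrlEncode($raw) . "\n"; // Prints: -__-
```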

Conclusion

That wraps it up! Now, you have a fundamental knowledge of base systems, particularly as they apply to the encoding of binary data. In what other types of web applications would you find uses for these encodings?

This post has been updated with contributions from Monty Shokeen. Monty is a full-stack developer who also loves to write tutorials and learn about new JavaScript libraries.