Docker From the Ground Up: Understanding Images

Docker containers are on the rise as a best practice for deploying and managing cloud-native distributed systems. Containers are instances of Docker images. It turns out that there is a lot to know and understand about images.

In this two-part tutorial I'll cover Docker images in depth. In this part, I'll start with the basic principles, and then I'll move on to design considerations and inspecting image internals. In part two, I'll cover building your own images, troubleshooting, and working with image repositories.

When you come out on the other side, you'll have a solid understanding of what Docker images are exactly and how to utilize them effectively in your own applications and systems.

Understanding Layers

Docker manages images using a back-end storage driver. There are several supported drivers such as AUFS, Btrfs, and OverlayFS. Images are made of ordered layers. You can think of a layer as a set of file system changes. When you take all the layers and stack them together, you get a new image that contains all the accumulated changes.

The ordered part is important. If you add a file in one layer and remove it in another layer, you'd better do it in the right order. Docker keeps track of each layer. An image can be composed of dozens of layers (the current limit is 127). Each layer is very lightweight. The benefit of layers is that images can share layers.
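To make the ordering point concrete, here is a minimal sketch that models a layer as a list of file system changes and stacks layers from bottom to top. The function name, the tuple format, and the file path are my own assumptions for illustration; this is not how Docker's storage drivers are implemented.

```python
# Simplified model: a layer is an ordered list of (operation, path, content) changes.
# Illustration only; real storage drivers work on directories and tar archives.

def apply_layers(layers):
    """Stack layers bottom-to-top and return the resulting file system view."""
    files = {}
    for layer in layers:
        for op, path, content in layer:
            if op == "add":
                files[path] = content
            elif op == "remove":
                # Removing a path that does not exist yet is a no-op in this model.
                files.pop(path, None)
    return files


# Two hypothetical layers that touch the same file.
add_config = [("add", "/etc/app.conf", "debug=true")]
remove_config = [("remove", "/etc/app.conf", None)]

# Add first, then remove: the file is gone from the final view.
print(apply_layers([add_config, remove_config]))

# Remove first, then add: the file is present in the final view.
print(apply_layers([remove_config, add_config]))
```

The same two layers produce different results depending on their order, which is why Docker has to track the order and not just the set of layers.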

If you have lots of images based on similar layers, like base OS or common packages, then all these common layers will be stored just once, and the overhead per image will be just the unique layers of that image.
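The storage saving can be sketched with a small content-addressed store. The LayerStore class and the byte strings standing in for layer contents are hypothetical; Docker's real layer identifiers are computed differently, but the idea of storing identical content once is the same.

```python
import hashlib


def layer_id(content: bytes) -> str:
    """Derive an identifier from the layer content (a stand-in for Docker's hashes)."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


class LayerStore:
    """Toy store that keeps each distinct layer exactly once."""

    def __init__(self):
        self.blobs = {}  # layer id -> content

    def add_image(self, layer_contents):
        """Register an image given its layers; return the ordered list of layer ids."""
        ids = []
        for content in layer_contents:
            lid = layer_id(content)
            self.blobs.setdefault(lid, content)  # already stored? keep the old copy
            ids.append(lid)
        return ids

    def stored_bytes(self):
        return sum(len(content) for content in self.blobs.values())


# Hypothetical layer contents: two shared layers and one unique layer per image.
base_os = b"base-os" * 1000
common_packages = b"common-packages" * 100

store = LayerStore()
app_a = store.add_image([base_os, common_packages, b"app-a code"])
app_b = store.add_image([base_os, common_packages, b"app-b code"])

print(len(store.blobs))      # distinct layers stored for both images
print(store.stored_bytes())  # total bytes on disk in this model
```

Two images with three layers each end up as four stored layers, because the base OS and common package layers are shared.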

Copy on Write

When a new container is created from an image, all image layers are read-only and a thin read-write layer is added on top. All the changes made to the specific container are stored in that layer.

Now, it doesn't mean that the container can't modify files from its image layer. It definitely can. But it will create a copy in its top layer, and from that point on, anyone who tries to access the file will get the top layer copy. When files or directories are removed from lower layers, they become hidden. The original image layers are identified by a cryptographic content-based hash. The container's read-write layer is identified by a UUID.

This allows a copy-on-write strategy for both images and containers. Docker reuses the same items as much as possible. Only when an item is modified will Docker create a new copy.
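Here is a minimal sketch of the copy-on-write behavior described above, assuming each layer can be modeled as a dictionary from path to content. The Container class, the WHITEOUT marker, and the file paths are illustrative names of my own, not Docker internals.

```python
import uuid
from types import MappingProxyType

WHITEOUT = object()  # marker meaning "this path was removed in the top layer"


class Container:
    """Toy container: shared read-only image layers plus a private read-write layer."""

    def __init__(self, image_layers):
        # Image layers are ordered bottom to top and wrapped as read-only views.
        self.image_layers = [MappingProxyType(layer) for layer in image_layers]
        self.top = {}              # thin read-write layer, private to this container
        self.id = uuid.uuid4().hex

    def read(self, path):
        # Search from the top layer down; the first match wins.
        for layer in [self.top, *reversed(self.image_layers)]:
            if path in layer:
                value = layer[path]
                if value is WHITEOUT:
                    raise FileNotFoundError(path)
                return value
        raise FileNotFoundError(path)

    def write(self, path, content):
        # The modified copy lives in the top layer; image layers are never touched.
        self.top[path] = content

    def remove(self, path):
        # The file is hidden, not deleted from the image layer.
        self.top[path] = WHITEOUT


# Hypothetical two-layer image shared by two containers.
image = [{"/etc/motd": "hello"}, {"/app/run.sh": "echo hi"}]
c1 = Container(image)
c2 = Container(image)

c1.write("/etc/motd", "changed")
c1.remove("/app/run.sh")

print(c1.read("/etc/motd"))  # the copy in c1's top layer
print(c2.read("/etc/motd"))  # c2 still sees the image layer's version
```

The second container is unaffected by the first one's changes, and the image layers themselves stay exactly as they were.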

Design Considerations for Docker Images

The unique organization in layers and the copy-on-write strategy promote some best practices for creating and composing Docker images.

Minimal Images: Less Is More

The smaller a Docker image is, the more it gains in stability, security, and loading time. You can create really tiny images for production purposes. If you need to troubleshoot, you can always install tools in a container.

If you write your data, logs, and everything else only to mounted volumes, then you can use your entire arsenal of debugging and troubleshooting tools on the host. We'll see soon how to control very carefully what files go into your Docker image.

Combine Layers

Layers are great, but there is a limit, and there is overhead associated with layers. Too many layers could hurt file system access inside the container (because every layer may have added or removed a file or directory), and they just clutter your own file system.

For example, if you install a bunch of packages, you can have a layer for each package, by installing each package in a separate RUN command in your Dockerfile:

RUN apt-get update
RUN apt-get -y install package_1
RUN apt-get -y install package_2
RUN apt-get -y install package_3

Or you can combine them into one layer with a single RUN command.

RUN apt-get update && \
    apt-get -y install package_1 && \
    apt-get -y install package_2 && \
    apt-get -y install package_3
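As a rough way to compare the two variants, the sketch below counts the instructions in a Dockerfile that typically add a file system layer. The function name and the choice of RUN, COPY, and ADD as the layer-creating keywords are my assumptions; it is a simplified line scanner, not a real Dockerfile parser.

```python
def count_layer_instructions(dockerfile_text):
    """Count instructions that typically add a file system layer (RUN, COPY, ADD)."""
    count = 0
    continuing = False  # True while inside a backslash-continued instruction
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments do not end a continuation
        if not continuing:
            keyword = line.split()[0].upper()
            if keyword in {"RUN", "COPY", "ADD"}:
                count += 1
        continuing = line.endswith("\\")
    return count


# The two variants from the text above.
separate = r"""
RUN apt-get update
RUN apt-get -y install package_1
RUN apt-get -y install package_2
RUN apt-get -y install package_3
"""

combined = r"""
RUN apt-get update && \
    apt-get -y install package_1 && \
    apt-get -y install package_2 && \
    apt-get -y install package_3
"""

print(count_layer_instructions(separate))  # four RUN instructions
print(count_layer_instructions(combined))  # one RUN instruction
```

The separate form yields four layer-creating instructions and the combined form yields one, for the same installed packages.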

Choosing a Base Image

Your base image (practically nobody builds images from scratch) is often a major decision. It may contain many layers and add a lot of capabilities, but also a lot of weight. The quality of the image and the author are also critical. You don't want to base your images on some flaky image where you're not sure exactly what's in there and whether you can trust the author.

There are official images for many distributions, programming languages, databases, and runtime environments. Sometimes the options are overwhelming. Take your time and make a wise choice.

Inspecting Images

Let's look at some images. Here is a listing of the images currently available on my machine:

REPOSITORY      TAG    IMAGE ID     CREATED      SIZE
python          latest 775dae9b960e 12 days ago  687 MB
d4w/nsenter     latest 9e4f13a0901e 4 months ago 83.8 kB
ubuntu-with-ssh latest 87391dca396d 4 months ago 221 MB
ubuntu          latest bd3d4369aebc 5 months ago 127 MB
hello-world     latest c54a2cc56cbb 7 months ago 1.85 kB
alpine          latest 4e38e38c8ce0 7 months ago 4.8 MB
nsqio/nsq       latest 2a82c70fe5e3 8 months ago 70.7 MB

The repository and the tag identify the image for humans. If you just try to run or pull using a repository name without specifying the tag, then the "latest" tag is used by default. The image ID is a unique identifier.
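The default-tag rule can be sketched in a few lines. The function below is a simplified illustration with a name of my own choosing; it only covers the cases discussed here (a plain name, a name with a tag, and a name with a digest) and is not Docker's actual reference parser.

```python
def normalize_reference(name):
    """Append ':latest' when an image reference carries no tag or digest."""
    if "@" in name:
        return name  # pinned by digest, for example name@sha256:...
    # Only look at the last path component, so a registry port such as
    # localhost:5000/alpine is not mistaken for a tag.
    last_component = name.rsplit("/", 1)[-1]
    if ":" in last_component:
        return name  # a tag was given explicitly
    return name + ":latest"


for ref in ["python", "nsqio/nsq", "alpine:latest", "localhost:5000/alpine"]:
    print(ref, "->", normalize_reference(ref))
```

So running or pulling python and running or pulling python:latest refer to the same image.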

Let's dive in and inspect the hello-world image:

> docker inspect hello-world
[
    {
        "Id": "sha256:c54a2cc56cbb2f...e7e2720f70976c4b75237dc",
        "RepoTags": [
            "hello-world:latest"
        ],
        "RepoDigests": [
            "hello-world@sha256:0256e8a3...411de4cdcf9431a1feb60fd9"
        ],
        "Parent": "",
        "Comment": "",
        "Created": "2016-07-01T19:39:27.532838486Z",
        "Container": "562cadb4d17bbf30b58a...bf637f1d2d7f8afbef666",
        "ContainerConfig": {
            "Hostname": "c65bc554a4b7",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/bin:/usr/sbin:/usr/bin:/sbin:/bin"
            ],
            "Cmd": [
                "/bin/sh",
                "-c",
                "#(nop) CMD [\"/hello\"]"
            ],
            "Image": "sha256:0f9bb7da10de694...5ab0fe537ce1cd831e",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": null,
            "OnBuild": null,
            "Labels": {}
        },
        "DockerVersion": "1.10.3",
        "Author": "",
        "Config": {
            "Hostname": "c65bc554a4b7",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/sbin:/usr/bin:/sbin:/bin"
            ],
            "Cmd": [
                "/hello"
            ],
            "Image": "sha256:0f9bb7da10de694b...b0fe537ce1cd831e",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": null,
            "OnBuild": null,
            "Labels": {}
        },
        "Architecture": "amd64",
        "Os": "linux",
        "Size": 1848,
        "VirtualSize": 1848,
        "GraphDriver": {
            "Name": "aufs",
            "Data": null
        },
        "RootFS": {
            "Type": "layers",
            "Layers": [
                "sha256:a02596fdd012f22b03a...079c3e8cebceb4262d7"
            ]
        }
    }
]

It's interesting to see how much information is associated with each image. I will not go over each item. I'll just mention an interesting tidbit: the "Container" and "ContainerConfig" entries are for a temporary container that Docker creates when it builds the image. Here, I want to focus on the last section, "RootFS". You can get just this part using the Go templating support of the inspect command:

> docker inspect -f '{{.RootFS}}' hello-world
{layers [sha256:a02596fdd012f22b03af6a...8357b079c3e8cebceb4262d7] }

It works, but we lost the nice formatting. I prefer to use jq:

> docker inspect hello-world | jq .[0].RootFS
{
  "Type": "layers",
  "Layers": [
    "sha256:a02596fdd012f22b03af6a...7507558357b079c3e8cebceb4262d7"
  ]
}


You can see that the type is "layers", and there is just one layer.
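If you would rather do this from a script than with jq, the same extraction is a few lines of Python. The sketch below works on a trimmed copy of the hello-world output shown earlier, with the layer hash elided exactly as in the listing; the root_fs helper name is my own.

```python
import json

# Trimmed copy of the `docker inspect hello-world` output shown earlier.
inspect_output = """
[
    {
        "RepoTags": ["hello-world:latest"],
        "Architecture": "amd64",
        "Os": "linux",
        "Size": 1848,
        "RootFS": {
            "Type": "layers",
            "Layers": [
                "sha256:a02596fdd012f22b03af6a...7507558357b079c3e8cebceb4262d7"
            ]
        }
    }
]
"""


def root_fs(inspect_json):
    """Return the RootFS section of the first image in `docker inspect` output."""
    return json.loads(inspect_json)[0]["RootFS"]


rootfs = root_fs(inspect_output)
print(json.dumps(rootfs, indent=2))  # same shape as the jq output
print(len(rootfs["Layers"]))         # number of layers
```

On a machine with Docker installed, you could feed it live output instead, for example by capturing the standard output of docker inspect with subprocess.run.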

Let's inspect the layers of the Python image:

> docker inspect python | jq .[0].RootFS
{
  "Type": "layers",
  "Layers": [
    "sha256:a2ae92ffcd29f7ede...e681696874932db7aee2c",
    "sha256:0eb22bfb707db44a8...8f04be511aba313bdc090",
    "sha256:30339f20ced009fc3...6b2a44b0d39098e2e2c40",
    "sha256:f55f65539fab084d4...52932c7d4924c9bfa6c9e",
    "sha256:311f330fa783aef35...e8283e06fc1975a47002d",
    "sha256:f250d46b2c81bf76c...365f67bdb014e98698823",
    "sha256:1d3d54954c0941a8f...8992c3363197536aa291a"
  ]
}

Wow. Seven layers. But what are those layers? We can use the history command to figure that out:

> docker history python
IMAGE        CREATED     CREATED BY                             SIZE
775dae9b960e 12 days ago /bin/sh -c #(nop)  CMD ["python3"]     0 B
<missing>    12 days ago /bin/sh -c cd /usr/local/bin  && { ... 48 B
<missing>    12 days ago /bin/sh -c set -ex  && buildDeps=' ... 66.9 MB
<missing>    12 days ago /bin/sh -c #(nop)  ENV PYTHON_PIP_V... 0 B
<missing>    12 days ago /bin/sh -c #(nop)  ENV PYTHON_VERSI... 0 B
<missing>    12 days ago /bin/sh -c #(nop)  ENV GPG_KEY=0D96... 0 B
<missing>    12 days ago /bin/sh -c apt-get update && apt-ge... 7.75 MB
<missing>    12 days ago /bin/sh -c #(nop)  ENV LANG=C.UTF-8    0 B
<missing>    12 days ago /bin/sh -c #(nop)  ENV PATH=/usr/lo... 0 B
<missing>    13 days ago /bin/sh -c apt-get update && apt-ge... 323 MB
<missing>    13 days ago /bin/sh -c apt-get update && apt-ge... 123 MB
<missing>    13 days ago /bin/sh -c apt-get update && apt-ge... 44.3 MB
<missing>    13 days ago /bin/sh -c #(nop)  CMD ["/bin/bash"... 0 B
<missing>    13 days ago /bin/sh -c #(nop) ADD file:89ecb642... 123 MB

OK. Don't be alarmed. Nothing is missing. This is just a terrible user interface. The layers used to have an image ID before Docker 1.10, but not anymore. The ID of the top layer is not really the ID of that layer. It is the ID of the Python image. The "CREATED BY" column is truncated, but you can see the full command if you pass --no-trunc. I'll spare you the output here because page-width limitations would require extreme line wrapping.
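The history has 14 entries, while the image has only seven layers. A rough way to reconcile the two is to count only the entries with a non-zero size, on the assumption that steps such as ENV and CMD change only image metadata and add no file system layer. The sketch below applies that heuristic to the rows from the listing above; the adds_files helper is my own name, and the size check is an approximation rather than an exact rule.

```python
# (CREATED BY, SIZE) pairs copied from the history listing, top to bottom.
history = [
    ('/bin/sh -c #(nop)  CMD ["python3"]', "0 B"),
    ("/bin/sh -c cd /usr/local/bin  && { ...", "48 B"),
    ("/bin/sh -c set -ex  && buildDeps=' ...", "66.9 MB"),
    ("/bin/sh -c #(nop)  ENV PYTHON_PIP_V...", "0 B"),
    ("/bin/sh -c #(nop)  ENV PYTHON_VERSI...", "0 B"),
    ("/bin/sh -c #(nop)  ENV GPG_KEY=0D96...", "0 B"),
    ("/bin/sh -c apt-get update && apt-ge...", "7.75 MB"),
    ("/bin/sh -c #(nop)  ENV LANG=C.UTF-8", "0 B"),
    ("/bin/sh -c #(nop)  ENV PATH=/usr/lo...", "0 B"),
    ("/bin/sh -c apt-get update && apt-ge...", "323 MB"),
    ("/bin/sh -c apt-get update && apt-ge...", "123 MB"),
    ("/bin/sh -c apt-get update && apt-ge...", "44.3 MB"),
    ('/bin/sh -c #(nop)  CMD ["/bin/bash"...', "0 B"),
    ("/bin/sh -c #(nop) ADD file:89ecb642...", "123 MB"),
]


def adds_files(size):
    """Heuristic: a history entry with a non-zero size changed the file system."""
    return float(size.split()[0]) > 0


content_steps = [row for row in history if adds_files(row[1])]

print(len(history))        # all history entries
print(len(content_steps))  # entries that added or changed files
```

Under this heuristic, seven entries changed the file system, which matches the seven layers reported in RootFS.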

How do you get images? There are three ways:

Pull/Run
Load
Build

When you run a container, you specify its image. If the image doesn't exist on your system, it is pulled from a Docker registry (by default, Docker Hub). Alternatively, you can pull an image directly without running a container.

You can also load an image that someone sent you as a tar file. Docker supports this natively through the docker load command.

Finally, and most interestingly, you can build your own images, which is the topic of part two.

Conclusion

Docker images are based on a layered file system that suits the use cases containers are designed for: images are lightweight, and they share common parts, so many containers can be deployed and run economically on the same machine.

But there are some gotchas, and you need to understand the principles and mechanisms to utilize Docker images effectively. Docker provides several commands to get a sense of what images are available and how they are structured.