Relational Databases for Dummies

Web applications can be split into two main parts:

the front-end: displays and collects information
the back-end: stores and provides data for use by the front-end

Since a web application back-end is concerned with storing and retrieving data, often for many thousands or even millions of users, it is natural that a database is one of the major back-end components.

In this article, I'll explain what a relational database is and how to properly design your database to store your app's information. I'll also explain what database normalization entails, with practical examples.

A database stores data in an organised way so that it can be searched and retrieved later. The database is made up of one or more tables. A table is much like a spreadsheet, in that it's made up of rows and columns. All rows have the same columns, and each column contains the data itself. If it helps, think of a table as being like a spreadsheet in Excel or Google Sheets.

Data can be inserted, retrieved, updated, and deleted from a table. The word created is generally used instead of inserted, so, collectively, these four functions are affectionately abbreviated as CRUD.

What Is a Relational Database?

A relational database is a type of database that organises data into tables and creates links between these tables, based on defined relationships. These relationships enable you to retrieve and combine data from one or more tables with a single query.

But to truly understand a relational database, you need to make one yourself! Let's get started by getting some real data with which we can work.

1. Get Some Data

Let's start with some Twitter data. I searched Twitter for #databases and took the following sample of ten tweets:

Table 1

full_name	username	text	created_at	following_username
"Boris Hadjur"	"_DreamLead"	"What do you think about #emailing #campaigns #traffic in #USA? Is it a good market nowadays? do you have #databases?"	"Tue, 12 Feb 2013 08:43:09 +0000"	"Scootmedia", "MetiersInternet"
"Gunnar Svalander"	"GunnarSvalander"	"Bill Gates Talks Databases, Free Software on Reddit https://t.co/ShX4hZlA #billgates #databases"	"Tue, 12 Feb 2013 07:31:06 +0000"	"klout", "zillow"
"GE Software"	"GEsoftware"	"RT @KirkDBorne: Readings in #Databases: excellent reading list, many categories: https://t.co/S6RBUNxq via @rxin Fascinating."	"Tue, 12 Feb 2013 07:30:24 +0000"	"DayJobDoc", "byosko"
"Adrian Burch"	"adrianburch"	"RT @tisakovich: @NimbusData at the @Barclays Big Data conference in San Francisco today, talking #virtualization, #databases, and #flash memory."	"Tue, 12 Feb 2013 06:58:22 +0000"	"CindyCrawford", "Arjantim"
"Andy Ryder"	"AndyRyder5"	"http://t.co/D3KOJIvF article about Madden 2013 using AI to prodict the super bowl #databases #bus311"	"Tue, 12 Feb 2013 05:29:41 +0000"	"MichaelDell", "Yahoo"
"Andy Ryder"	"AndyRyder5"	"http://t.co/rBhBXjma an article about privacy settings and facebook #databases #bus311"	"Tue, 12 Feb 2013 05:24:17 +0000"	"MichaelDell", "Yahoo"
"Brett Englebert"	"Brett_Englebert"	"#BUS311 University of Minnesota's NCFPD is creating #databases to prevent "food fraud." http://t.co/0LsAbKqJ"	"Tue, 12 Feb 2013 01:49:19 +0000"	"RealSkipBayless", "stephenasmith"
Brett Englebert	"Brett_Englebert"	"#BUS311 companies might be protecting their production #databases, but what about their backup files? http://t.co/okJjV3Bm"	"Tue, 12 Feb 2013 01:31:52 +0000"	"RealSkipBayless", "stephenasmith"
"Nimbus Data Systems"	"NimbusData"	"@NimbusData CEO @tisakovich @BarclaysOnline Big Data conference in San Francisco today, talking #virtualization, #databases,& #flash memory"	"Mon, 11 Feb 2013 23:15:05 +0000"	"dellock6", "rohitkilam"
"SSWUG.ORG"	"SSWUGorg"	"Don't forget to sign up for our FREE expo this Friday: #Databases, #BI, and #Sharepoint: What You Need to Know! http://t.co/Ijrqrz29"	"Mon, 11 Feb 2013 22:15:37 +0000"	"drsql", "steam_games"

Here's what each column name means:

full_name: the user's full name
username: the user's Twitter handle
text: the tweet itself
created_at: the timestamp of the tweet
following_username: a list of people this user follows, separated by commas. For brevity, I limited the list length to two.

This is all real data; you can search Twitter and actually find these tweets.

This is good. The data is all in one place, so it's easy to find, right? Not exactly. There are a couple of problems with this table.

First, there is repetitive data across columns. The username and following_username columns are repetitive, because both contain the same type of data: Twitter handles.
Second, there is another form of repetition within the following_username column. The fields should only contain one value, but each following_username field contains two.
Third, there is repetitive data across rows. @AndyRyder5 and @Brett_Englebert each tweeted twice, so the rest of their information has been duplicated.

Duplicates are problematic because they make the CRUD operations more challenging. For example, it would take longer to retrieve data because time would be wasted going through duplicate rows. Also, updating data would be an issue; if a user changes their Twitter handle, we would need to find every duplicate and update it.

Repetitive data is a problem. We can fix this problem by splitting Table 1 into separate tables. Let's proceed by first resolving the issue of repetition across columns.

2. Remove Repetitive Data Across Columns

As noted above, the username and following_username columns in Table 1 are repetitive. This repetition occurred because I was trying to express the follow relationship between users. Let's improve on Table 1's design by splitting it up into two tables: one just for the following relationships and one for the rest of the information.

Splitting repetitive data across columns

Because @Brett_Englebert follows @RealSkipBayless, the following table will express that relationship by storing @Brett_Englebert as the from_user and @RealSkipBayless as the to_user. Let's go ahead and split table 1 into these two tables:

Table 2: The following Table

from_user	to_user
_DreamLead	Scootmedia
_DreamLead	MetiersInternet
GunnarSvalander	klout
GunnarSvalander	zillow
GEsoftware	DayJobDoc
GEsoftware	byosko
adrianburch	CindyCrawford
adrianburch	Arjantim
AndyRyder	MichaelDell
AndyRyder	Yahoo
Brett_Englebert	RealSkipBayless
Brett_Englebert	stephenasmith
NimbusData	dellock6
NimbusData	rohitkilam
SSWUGorg	drsql
SSWUGorg	steam_games

Table 3: The users Table

full_name	username	text	created_at
"Boris Hadjur"	"_DreamLead"	"What do you think about #emailing #campaigns #traffic in #USA? Is it a good market nowadays? do you have #databases?"	"Tue, 12 Feb 2013 08:43:09 +0000"
"Gunnar Svalander"	"GunnarSvalander"	"Bill Gates Talks Databases, Free Software on Reddit http://t.co/ShX4hZlA #billgates #databases"	"Tue, 12 Feb 2013 07:31:06 +0000"
"GE Software"	"GEsoftware"	"RT @KirkDBorne: Readings in #Databases: excellent reading list, many categories: http://t.co/S6RBUNxq via @rxin Fascinating."	"Tue, 12 Feb 2013 07:30:24 +0000"
"Adrian Burch"	"adrianburch"	"RT @tisakovich: @NimbusData at the @Barclays Big Data conference in San Francisco today, talking #virtualization, #databases, and #flash memory."	"Tue, 12 Feb 2013 06:58:22 +0000"
"Andy Ryder"	"AndyRyder5"	"http://t.co/D3KOJIvF article about Madden 2013 using AI to prodict the super bowl #databases #bus311"	"Tue, 12 Feb 2013 05:29:41 +0000"
"Andy Ryder"	"AndyRyder5"	"http://t.co/rBhBXjma an article about privacy settings and facebook #databases #bus311"	"Tue, 12 Feb 2013 05:24:17 +0000"
"Brett Englebert"	"Brett_Englebert"	"#BUS311 University of Minnesota's NCFPD is creating #databases to prevent "food fraud." http://t.co/0LsAbKqJ"	"Tue, 12 Feb 2013 01:49:19 +0000"
Brett Englebert	"Brett_Englebert"	"#BUS311 companies might be protecting their production #databases, but what about their backup files? http://t.co/okJjV3Bm"	"Tue, 12 Feb 2013 01:31:52 +0000"
"Nimbus Data Systems"	"NimbusData"	"@NimbusData CEO @tisakovich @BarclaysOnline Big Data conference in San Francisco today, talking #virtualization, #databases,& #flash memory"	"Mon, 11 Feb 2013 23:15:05 +0000"
"SSWUG.ORG"	"SSWUGorg"	"Don't forget to sign up for our FREE expo this Friday: #Databases, #BI, and #Sharepoint: What You Need to Know! http://t.co/Ijrqrz29"	"Mon, 11 Feb 2013 22:15:37 +0000"

This is looking better. Now, in the users table, there is only one column with Twitter handles. In the following table, the is only one Twitter handle per field in the to_user column.

Edgar F. Codd, the computer scientist who laid down the theoretical basis of relational databases, called this step of removing repetitive data across columns the first normal form (1NF) in database normalization.

What Is Database Normalization?

Database normalization is a database design method that avoids data duplication and gets rid of undesired aspects like insertion, update, and deletion anomalies. Using relationships, database normalization rules break up larger tables into smaller ones. Normalization serves the dual purposes of removing unnecessary (repetitive) data and ensuring logical data storage.

In the first step of removing repetitive data across columns, we employed the first normal form (1NF).

When every attribute in a relation is a single valued attribute and has a unique name for each attribute or column, then we can say that relation is in its first normal form.

3. Remove Repetitive Data Across Rows

Now that we have fixed repetitions across columns, we need to fix repetitions across rows. Since the users @AndyRyder5 and @Brett_Englebert each tweeted twice, their information is duplicated in the users table. This indicates that we need to pull out the tweets and place them in their own table.

As before, "text" stores the tweet itself. Since the "created_at" column stores the timestamp of the tweet, it makes sense to pull it into this table as well. I also include a reference to the "username" column so we know who published the tweet. Here is the result of placing the tweets in their own table:

Table 4: The tweets Table

text	created_at	username
"What do you think about #emailing #campaigns #traffic in #USA? Is it a good market nowadays? do you have #databases?"	"Tue, 12 Feb 2013 08:43:09 +0000"	"_DreamLead"
"Bill Gates Talks Databases, Free Software on Reddit http://t.co/ShX4hZlA #billgates #databases"	"Tue, 12 Feb 2013 07:31:06 +0000"	"GunnarSvalander"
"RT @KirkDBorne: Readings in #Databases: excellent reading list, many categories: http://t.co/S6RBUNxq via @rxin Fascinating."	"Tue, 12 Feb 2013 07:30:24 +0000"	"GEsoftware"
"RT @tisakovich: @NimbusData at the @Barclays Big Data conference in San Francisco today, talking #virtualization, #databases, and #flash memory."	"Tue, 12 Feb 2013 06:58:22 +0000"	"adrianburch"
"http://t.co/D3KOJIvF article about Madden 2013 using AI to prodict the super bowl #databases #bus311"	"Tue, 12 Feb 2013 05:29:41 +0000"	"AndyRyder5"
"http://t.co/rBhBXjma an article about privacy settings and facebook #databases #bus311"	"Tue, 12 Feb 2013 05:24:17 +0000"	"AndyRyder5"
"#BUS311 University of Minnesota's NCFPD is creating #databases to prevent "food fraud." http://t.co/0LsAbKqJ"	"Tue, 12 Feb 2013 01:49:19 +0000"	"Brett_Englebert"
"#BUS311 companies might be protecting their production #databases, but what about their backup files? http://t.co/okJjV3Bm"	"Tue, 12 Feb 2013 01:31:52 +0000"	"Brett_Englebert"
"@NimbusData CEO @tisakovich @BarclaysOnline Big Data conference in San Francisco today, talking #virtualization, #databases,& #flash memory"	"Mon, 11 Feb 2013 23:15:05 +0000"	"NimbusData"
"Don't forget to sign up for our FREE expo this Friday: #Databases, #BI, and #Sharepoint: What You Need to Know! http://t.co/Ijrqrz29"	"Mon, 11 Feb 2013 22:15:37 +0000"	"SSWUGorg"

Table 5: The users Table

full_name	username
"Boris Hadjur"	"_DreamLead"
"Gunnar Svalander"	"GunnarSvalander"
"GE Software"	"GEsoftware"
"Adrian Burch"	"adrianburch"
"Andy Ryder"	"AndyRyder5"
"Brett Englebert"	"Brett_Englebert"
"Nimbus Data Systems"	"NimbusData"
"SSWUG.ORG"	"SSWUGorg"

After the split, the users table has unique rows for users and their Twitter handles.

Edgar F. Codd called this step of removing repetitive data across rows the second normal form (2NF).

The next stage of normalizing a database is called the second normal form (2NF). 2NF advances the first normal form (1NF). The 2NF makes sure that every piece of information with a many-to-many link is sorted and stored in a separate table.

A table must be in the first normal form and also free of partial dependencies in order to be in the second normal form.

When an attribute in a table depends solely on a portion of a composite primary key rather than the entire primary key, this is known as partial dependence.

We can separate the table, remove the property that is generating partial dependency (text), and relocate it to another table (tweets) where it will fit in nicely to eliminate partial dependency.

4. Linking Tables With Keys

Data can be inserted, retrieved, updated, and deleted from a table.

So far, table 1 has been split into three new tables: following, tweets, and users. But how is this useful? Repetitive data has been removed, but now the data is spread out across three independent tables. In order to retrieve the data, we need to draw meaningful links between the tables. This way, we can express queries like "what a user has tweeted and who a user is following".

The way to draw links between tables is to first give each row in a table a unique identifier, termed a primary key, and then reference that primary key in the other table to which you want to link.

We've already done this in users and tweets. In the users table, the primary key is the username column because no two users will have the same Twitter handle. In tweets, we reference this key in the username column so we know who tweeted what. Since it is a reference, the username column in tweets is called a foreign key. In this way, the username key links the users and tweets tables together.

But is the username column the best idea for a primary key for the users table?

On one hand, it's a natural key—it makes sense to search using a Twitter handle instead of assigning each user a numerical ID and searching on that. On the other hand, what if a user wants to change their Twitter handle? That could cause errors if the primary key and all referencing foreign keys aren't updated accurately, errors that could be avoided if a constant numerical ID was used. Ultimately, the choice depends on your system. If you want to give your users the ability to change their username, it's better to add a numerical auto-incrementing id column to users and use that as the primary key. Otherwise, username should do just fine. I'll continue to use username as the primary key for users.

Let's move on to tweets. A primary key should uniquely identify each row, so what should the primary key be here? The created_at field won't work because if two users tweet at the same time, their tweets would have an identical timestamp. The text has the same problem in that if two users both tweet "Hello world," we couldn't distinguish between the rows. The username column is the foreign key that defines the link with the users so let's not mess with that. Since the other columns are not good candidates, it makes sense here to add a numerical auto-incrementing id column and use that as the primary key.

Table 6: The tweets Table With an id Column

id	text	created_at	username
1	"What do you think about #emailing #campaigns #traffic in #USA? Is it a good market nowadays? do you have #databases?"	"Tue, 12 Feb 2013 08:43:09 +0000"	"_DreamLead"
2	"Bill Gates Talks Databases, Free Software on Reddit http://t.co/ShX4hZlA #billgates #databases"	"Tue, 12 Feb 2013 07:31:06 +0000"	"GunnarSvalander"
3	"RT @KirkDBorne: Readings in #Databases: excellent reading list, many categories: http://t.co/S6RBUNxq via @rxin Fascinating."	"Tue, 12 Feb 2013 07:30:24 +0000"	"GEsoftware"
4	"RT @tisakovich: @NimbusData at the @Barclays Big Data conference in San Francisco today, talking #virtualization, #databases, and #flash memory."	"Tue, 12 Feb 2013 06:58:22 +0000"	"adrianburch"
5	"http://t.co/D3KOJIvF article about Madden 2013 using AI to prodict the super bowl #databases #bus311"	"Tue, 12 Feb 2013 05:29:41 +0000"	"AndyRyder5"
6	"http://t.co/rBhBXjma an article about privacy settings and facebook #databases #bus311"	"Tue, 12 Feb 2013 05:24:17 +0000"	"AndyRyder5"
7	"#BUS311 University of Minnesota's NCFPD is creating #databases to prevent "food fraud." http://t.co/0LsAbKqJ"	"Tue, 12 Feb 2013 01:49:19 +0000"	"Brett_Englebert"
8	"#BUS311 companies might be protecting their production #databases, but what about their backup files? http://t.co/okJjV3Bm"	"Tue, 12 Feb 2013 01:31:52 +0000"	"Brett_Englebert"
9	"@NimbusData CEO @tisakovich @BarclaysOnline Big Data conference in San Francisco today, talking #virtualization, #databases,& #flash memory"	"Mon, 11 Feb 2013 23:15:05 +0000"	"NimbusData"
10	"Don't forget to sign up for our FREE expo this Friday: #Databases, #BI, and #Sharepoint: What You Need to Know! http://t.co/Ijrqrz29"	"Mon, 11 Feb 2013 22:15:37 +0000"	"SSWUGorg"

Finally, let's add a primary key to the following table. In this table, neither the from_user column nor the to_user column uniquely identifies each row on its own. However, from_user and to_user together do, since they represent a single relationship. A primary key can be defined on more than one column, so we'll use both these columns as the primary key for the following table.

As for the foreign key, from_user and to_user are each foreign keys since they can be used to define a link with the users table. If we query for a Twitter handle on the from_user column, we'll get all the users they follow. Correspondingly, if we query for a Twitter handle on the to_user column, we'll get all the users following them.

We've accomplished a lot so far. We removed repetitions across columns and rows by separating data into three different tables, and then we chose meaningful primary keys to link the tables together. This process is called normalization, and its output consists of data that is clean and organized according to the relational model. The consequence of this organization is that rows will appear in the database only once moving forward, which in turn will make the CRUD operations easier.

The figure below shows the finalized database schema. The three tables are linked, and the primary keys are highlighted.

Relational Database Management Systems

Now that we know how to design a relational database, how do we actually implement one? Relational database management systems (RDBMS) are software that let you create and use relational databases. There are several commercial and open-source vendors to choose from. On the commercial side, Oracle Database, IBM DB2, and Microsoft SQL Server are three well-known solutions. On the free and open-source side, MySQL, SQLite, and PostgreSQL are three widely used solutions.

MySQL is used at just about every Internet company you have heard of. In the context of this article, Twitter uses MySQL to store its users' tweets.

SQLite is common in embedded systems. iOS and Android let developers use SQLite to manage their app's private database. Google Chrome uses SQLite to store your browsing history, cookies, and your thumbnails on the "Most visited" page.

PostgreSQL is also a widely used RDBMS. Its PostGIS extension supplements PostgreSQL with geospatial functions that make it useful for mapping applications. A notable user of PostgreSQL is OpenStreetMap.

Structured Query Language (SQL)

Once you've downloaded and set up an RDBMS on your system, the next step is to create a database and tables inside it in order to insert and manage your relational data. The way you do this is with Structured Query Language (SQL), which is the standard language for working with RDBMSs.

There are small variations in SQL between each RDBMS vendor, termed SQL dialects.

Here's a brief overview of common SQL statements that are relevant to the example Twitter data above.

Create a Database Named `development`

1	CREATE DATABASE development;

Create a Table Named `users`

1	CREATE TABLE users (
2	full_name VARCHAR(100),
3	username VARCHAR(100)
4	);

RDBMSs require that each column in a table is given a data type. Here I have assigned the "full_name" and "username" columns the data type VARCHAR, which is a string that can vary in width. I've arbitrarily set a max length of 100. A full list of the various data types can be found on Wikipedia.

Insert a Record

1	INSERT INTO users (full_name, username)
2	VALUES ("Boris Hadjur", "_DreamLead");

Retrieve All Tweets Belonging to a User

1	SELECT text, created_at
2	FROM tweets
3	WHERE username="_DreamLead";

Update a User's Name

1	UPDATE users
2	SET full_name="Boris H"
3	WHERE username="_DreamLead";

Delete a User

1	DELETE FROM users
2	WHERE username="_DreamLead";

SQL is pretty similar to regular English sentences. There are small variations in SQL between each RDBMS vendor, termed SQL dialects, but the differences are not so dramatic that you can't easily transfer your SQL knowledge from one to the other.

Conclusion

In this article, we learned how to design a relational database. We took a collection of data and organised it into related tables. We learned about normalizing the database with the first and second normal forms.

We also briefly looked at RDBMS solutions and SQL. So get started by downloading an RDBMS and normalizing some of your data into a relational database today.

Preview Image Source: FindIcons.com/Barry Mieny