How to design database for Relational Database Management System (RDBMS)

How to Design Relational Database

The first thing comes to my mind when I think about database design is normalization. It's easy to talk about it in theory but very confusing for newbies when applying to reality. My personal experience about learning database design is that you can't bypass the theory part of it but you really don't care or consider about which normalization form (First Normal Form, Second Normal Form, and Third Normal Form) you are at when you have gathered some experience and really get down to relational database design in details.

So let's start the database design theories. As I'm not an academic type of person, I will want to talk about things in plain English.

Components in a relational database model

A relational database has these four basic components:

Tables that store data
Columns in tables so that we can store different types of data.
Primary key in a table to uniquely identify each row in a table.
Relationships between tables. As we can't store all data in one table, we break them down and put into different tables and link them by common columns (known as foreign keys) so that we can have some sort of relationships to be used to pull out related data later from these tables.

Database Normalization

Normalization is the process of breaking down our data and store them in different tables consistently.

There are mainly three levels of normalization (first, second, and third normal form). Further normal forms are available but not commonly used. You can ignore them completely.

Before we move onto the theories about normal forms, let's assume you are the company owner of Northwind Trader and you buy and sell goods. How would you store your company sales data in the old days?

By intuition, you would want to store data separately into different draws or folders. Products data in one draw and suppliers data in another draw. Naturally you want to relate products to suppliers to find out who supplied what. This type of thinking is the basis of relational database design.

First Normal Form (1st NF)

In theory, this is the first thing you need to do to break down your data. In this step, we want to achieve the following three goals:

Each row in a table must be uniquely identifiable.
Each column must only contain one piece of information.
No repeating columns - no two columns should contain the same data.

Goal #1: Each row in a table must be uniquely identifiable

What are the consequences of unable to identify each row? We could either store the same information twice in the table, or update or delete wrong records as we are not able to distinguish between rows.

To uniquely identify a row, we define a single column or a group of columns as primary key. Below is customers table in northwind database. Each customer is assigned a unique value for identification.

Customers table in Northwind

Goal #2: Each column must only contain one piece of information.

It's easier to explain this requirement by illustration. We have four address data as below:

Obere Str. 57
Berlin, 12209
Germany
120 Hanover Sq.
London, WA1 1DP
UK
23 Tsawassen Blvd.
Tsawassen, T2F 8M4
Canada
54, rue Royale
Nantes, 44000
France
Via Monte Bianco 34
Torino, 10100
Italy

As we see above, if data is grouped together like this in a single column, it would be impossible to search postcode by country efficiently.

To solve this problem, the First Normal Form requires that each of these items must be broken down and stored in its own column. This is known as storing atomic data. One column for street address, one column for City, one column for postcode, and one column for country.

Atomic data for First Normal Form

Goal #3: No repeating columns - no two columns should contain the same data.

In other words, every column in a row should store data that is different from all other columns.

To illustrate this, think of the order_details table that lists products purchased. We could structure a table shown as below:

Repeating columns for First Normal Form

For the same order (OrderID 10255), the table above stores two products (ProductID = 2 and ProductID2 = 26) in the first row and another two products in the second row. The repeated columns are ProductID vs ProductID2, UnitPrice vs UniPrice2, and Quantity vs Quantity2. This structure presents two problems. First, it's difficult to search for information. Second, it creates redundant columns for the same information.

To solve the problem, we split the columns and replace them with multiple rows.

No repeating columns for First Normal Form

What is functional dependency?

We introduce functional dependency here because it's going to be used in the next two normal forms.

Functional dependency describes the concept that all other columns in a table must depend completely on the primary key column.

For example, the values in the CompanyName column and ContactName column depend completely on the value in the CustomerID column. In other words, if we know CustomerID, then we can tell the customer's company name and contact name. This reads like CustomerID determines CompanyName and ContactName.

Customers table in Northwind

The effect of achieving functional dependency is that each related set of data is put into its own table. In other words, each table only represents one subject.

It should be noted that functional dependencies are not limited to depending on a single column. We can have a combination of several columns determine other columns in the table.

Second Normal Form (2nd NF)

For a table to be in the Second Normal Form, it must already be in the First Normal Form.

After 1st NF, every table has got a primary key. Sometimes the primary key consists of multiple columns (also known as composite primary key). When this happens, we should pay attention to other columns (non-key columns) in this table to make sure all these other non-key columns must fully depend on the whole primary key. If not, we will have to action on the following three steps:

Remove partial dependencies on the composite primary key.
Then build separate tables for each set of removed data.
Then build relationships between these tables.

If we can't find partial dependencies of non-key columns on the composite primary key or the table uses a single column as primary key, we are already at Second Normal Form.

The pre-requisite for 2nd Normal Form is that the primary key in the table must consist of multiple columns.

2nd Normal Form (partial dependency) example

Table below has a composite primary key that consists of OrderID and ProductID column. The combination of OrderID and ProductID uniquely identifies each row in this table. Note that UnitPrice and Quantity are non-key column and are fully dependent on this composite primary key.

Order_details table:
Partial dependency in Second Normal Form

Now, look at OrderDate and RequiredDate column. They are non-key column as well. We notice that:

OrderDate 1996-07-02 was repeated three times and 1996-07-05 was repeated two times. The same is for RequiredDate.
OrderDate and RequiredDate only depend on OrderID column and has nothing to do with ProductID column.

OrderDate and RequiredDate column are not fully dependent on the combination of OrderID and ProductID, which form the composite primary key. We just discovered a partial dependency. Therefore, this table is not in 2nd NF. Let's remove the partial dependency by splitting this table into the following two tables.

Orders table:
Split table 1 in Partial dependency in Second Normal Form

Order_details table:
Split table 2 in Partial dependency in Second Normal Form

Our new Orders table only stores data related to orders and we created a new table Order_details.

If we want to find out order date for order 10248 in Order_details table, we use the relationship between the two tables to work out that information. This relationship is established by OrderID in both tables. This relationship is called foreign key relationship.

One-to-Many relationship:
One OrderID 10248 in Orders table is mapped to three OrderIDs in Order_details table.
One OrderID 10249 in Orders table is mapped to two OrderIDs in Order_details table.

Foreign key relationship (one-to-many relationship):
Foreign key relationship diagram

Third Normal Form (3rd NF)

The 3rd NF is also related to solving a dependency problem. But unlike 2nd NF, here the dependency is on non-primary key column(s). Third Normal Form requires that if any column that is not DIRECTLY dependent on primary key (either single column primary key or composite primary key), it should be removed and placed into a different table.

Table below shows that ProductID is the primary key of the table. SupplierName is dependent on SupplierID column, which in turn depends on the primary key ProductID. This is called transitive dependency.

The rule for the 3rd NF requires that each column be dependent on the primary key only, and not any other columns. Here SupplierName clearly violated this rule.

Third Normal Form problem

To resolve the problem, we take out SupplierName and put it into a separate suppliers table. See below.

Suppliers table:
Third Normal Form problem

Now products table only contains SupplierID column which can be linked to suppliers table to get SupplierName.

Products table:
Third Normal Form problem

One-to-Many relationship:
One SupplierID 1 in Suppliers table is mapped to three SupplierIDs in Products table.
One SupplierID 4 in Suppliers table is mapped to three SupplierIDs in Products table.

One-to-Many relationship:
Third Normal Form problem fixed

The benefits of normalization

So far we have covered the three normal forms but we haven't explicitly listed why we need to normalize. Here are the two main benefits you get from normalization.

Benefit #1: Reduce data redundancy.

Unless absolutely necessary, storing redundant data is a waste of system resources. Nowadays hard disk space is cheap but is not free. More importantly, maintaining very large size of databases requires more work on database administrators and network engineers.

Benefit #2: Reduce data inconsistency.

The data redundancy issue can further cause date inconsistency issues. As we keep the same piece of data across different locations, we have to make sure that they are kept exactly the same all the time. For example, when one piece of data is updated, the same data in other locations has to be updated as well.

The issues of data inconsistency are collectively called Data Anomaly. There are four types of possible errors they could cause. These four types of errors are the causes of data anomaly:

Select anomaly (also known as join anomalies):
This happens when we select the same piece of data from different tables but they could produce different results.
Update anomalies
This occurs when we update the same piece of data in more than one place. If care not taken, we could end up with updating the data in one place but forgot to update it in another place.
Insertion anomalies
This can happen when we are adding a new record for the data. If we forgot to insert it to other places for the same data, data inconsistency occurs.
Deletion anomalies
This happens when we delete data. If the deletion does not remove the same data from all places, data anomalies occur.

Summary

Database normalization is clearly going to make our data management more efficiently. When we look back at what we have discussed so far, we can summarize the key points for the three normal forms. Remembering the following three short sentences can help you quickly recall what each normalization is all about. This was what I did when attending the exam for System Analysis course.

1st NF: Unique identifier, atomic data, no repeating columns.
2nd NF: Remove partial dependencies on composite primary key.
3rd NF: Remove transitive dependencies on non-primary key column(s).

We have now concluded data normalization theory. Next logical step in the learning is understanding what data integrity is and how to use various database features (we use MySQL for illustration) to enforce data integrity - Enforce data integrity by databases constraints.

Other tutorials in this category

1. Enforce Data Integrity by Database Constraints

2. How to Enforce Data Type Constraint

3. How to Enforce Default Constraint and Nullability Constraint

4. Primary Key Constraint and Unique Constraint

5. Foreign Key Relationships and Considerations

Back to Tutorial Index Page

	Custom Search