
Chapter 4





Database Design

Designing databases is a tricky job. Several rules can be applied to make it simpler. If we apply them correctly, data can be stored, updated and retrieved most efficiently.

Chapter Highlights

  • Database Design Principles
  • Functional and Transitive Dependencies
  • Normalisation Principles
  • Various Normal Forms
  • Entity/Relationship (E/R) Modelling

4.1 DESIGN CONSIDERATIONS

Good database design is critical for ensuring that the data that we want to store is consistent and can be retrieved or updated in the most efficient manner. The problem of database design deals with fitting of a given piece of data into an appropriate logical organisation. Several factors need to be considered during the process, some of which are as follows.

  • How many tables should we have?
  • How should they be inter-related?
  • How many columns should each table have?
  • Which columns should be key columns?
  • What sort of constraints should we have?

As before, we shall focus our attention on relational databases. Moreover, when we speak of database design here, we are talking about the logical aspects of the design and not its physical aspects.

4.2 FUNCTIONAL DEPENDENCY

Functional dependency is critical in formulating database design. For two columns, A and B, of a table, we can define the functional dependency between them as follows.

Column B is said to be functionally dependent on column A if, given A we can precisely determine B.

For example, consider the following Employee table, shown in Table 4.1.

Table 4.1 Employee table

Emp_ID   Emp_Name        Age
A101     Anand Joshi     29
A102     Shilpa Kamat    23
A103     Reshama Joshi   34

Do we see any functional dependence between any two columns of the table? Let us consider a combination of the Emp_ID and the Emp_Name columns. Are any of the following statements true?

  • Given an Emp_ID, we can certainly determine a precise Emp_Name
  • Given an Emp_Name, we can certainly determine a precise Emp_ID

We will observe that the first statement is true, but the second one is not. This is because Emp_ID is the primary key of the table and is, therefore, always unique. However, Emp_Name can repeat and can, therefore, be duplicated in the table. Thus, given her employee ID, we can determine an employee's name with confidence. However, the reverse is not true.

Thus, Emp_Name is functionally dependent on Emp_ID. In symbolic terms, this is written as follows.

Emp_ID → Emp_Name

(This should be read as Emp_ID functionally determines Emp_Name).

In general, if column A determines the value of column B, then we write it as follows:

A → B

(This should be read as A functionally determines B).

Of course, there can be several functional dependencies between various columns of a table. For example, in the same table, we can certainly have the following functional dependency:

Emp_ID → Age

Let us expand the Employee table to add a few more columns and try and deduce some more functional dependencies. The modified Employee table is shown in Table 4.2.

Table 4.2 Modified Employee table

Now let us list down all the functional dependencies in this table.

  • Emp_ID → Emp_Name
  • Emp_ID → Age
  • Emp_ID → Salary
  • Emp_ID → Project_ID
  • Project_ID → Completion_Date

To further clarify, let us also list what cannot be termed as functional dependencies.

  • Age NOT → Emp_ID (because more than one employee can have the same age)
  • Salary NOT → Emp_ID (because more than one employee can have the same salary)

Moreover, the functional dependency need not always be between two columns. It can be between:

  • Two columns
  • A column and a set of columns
  • Two sets of columns
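
Whether a suspected functional dependency actually holds in the stored data can be checked with a simple query. The following is a minimal SQL sketch that tests Emp_ID → Emp_Name on the Employee table of Table 4.1; any row returned identifies an Emp_ID associated with more than one Emp_Name, that is, a violation of the dependency.

    -- Check whether Emp_ID -> Emp_Name holds in the stored data.
    -- A row in the result means one Emp_ID maps to more than one
    -- Emp_Name, i.e. the dependency is violated.
    SELECT Emp_ID
    FROM   Employee
    GROUP  BY Emp_ID
    HAVING COUNT(DISTINCT Emp_Name) > 1;

Of course, such a query can only confirm that the current data does not violate the dependency; whether the dependency must always hold is a design decision.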

Physical design of databases refers to the organisation of data on the disk. It is related to aspects such as the cluster size, index creation and cleanup. Logical design refers to the design of the tables, their interrelationships, key definitions, and so on.

4.3 NORMALISATION AND NORMAL FORMS

4.3.1 Decomposition

Decomposition refers to the breaking down of one table into multiple tables.

Any database design process involves decomposition. Interestingly, decomposition is not greatly different from the RDBMS projection operation. As we know, during a projection operation, we choose only the needed columns of a table, discarding the rest. We do the same in decomposition. Of course, by discarding, we do not mean that the columns are lost forever. It is just that they are placed in another table, where they should logically belong.

To understand the importance of decomposition, we should first revisit the idea of redundancy. We know that redundancy means unwanted and uncontrolled duplication of data. Redundancy not only leads to duplication of data, it also has other side effects, such as loss of data integrity and data consistency.

Let us consider a simple example.

Suppose we have a table that holds information regarding students and the marks they have obtained in an examination. The table has the following columns:

Roll_number

Student_name

Subject_code

Subject_name

Marks

Let us consider some hypothetical data for this table, as shown in Table 4.3.

Table 4.3 Examination table

We can certainly see some redundancy here. For example, the student names appear three times in the table. Similar is the case with the subject. In other words, every time we have more than one row for a student or for a subject, the student name and the subject name would repeat respectively. Thus, is it not a better idea to store the student name and the subject name somewhere else? But where do we store them?

This is precisely the problem of decomposition. What we are essentially trying to do is to minimise data redundancy by way of following a simple rule:

Keep one fact in one place

How can we decompose the Examination table? Let us consider that we now break the original table into two tables, as shown in Fig. 4.1.

Fig. 4.1 Possible decomposition of the Examination table

Let us now study what we have achieved by performing this decomposition. We now have information about students in the Student table and information about various subjects in the Subject table. Let us assume that Roll_number and Subject_code are the primary keys of the Student table and the Subject table, respectively. This certainly seems to have taken care of the problem of duplication. Now, for a given student, there will be only one entry (in the Student table) and, therefore, the student name cannot repeat. Similarly, for a given subject, there will be exactly one entry (in the Subject table), so that the subject name cannot repeat. This takes care of the problem of redundancy.

However, if we observe the modified table design carefully, we will note that although the problem of redundancy has been solved, we have also introduced another very undesirable problem. Where do we now have the information as to which student has obtained how many marks in which subject? We have lost it! We now simply know that some students exist and some subjects can be chosen. However, we do not know how these two are interlinked.

Loss of information due to decomposition is called
lossy decomposition.

Clearly, decomposition cannot be used for reducing redundancy if it leads to loss of information. In other words, lossy decomposition is not acceptable in any case. Therefore, we need a form of decomposition in which we lose neither the data in our tables nor the interrelationships between the various data items.

How do we figure out if our decomposition is lossy? It is actually quite simple. We need to check for the possibility (or otherwise) of reversing the decomposition. In other words, we need to check if, after decomposition, we can reassemble the data in the original form — a process called recomposition. If we are not able to take the data back to its original form, it means that we have lost some of the information from the original database design. In other words, we have performed a lossy decomposition. On the other hand, if we are able to successfully reassemble the split tables so that we can recreate the original data, then the decomposition is desirable and successful.

When all information found in the original database is preserved after decomposition, we call it lossless decomposition or non-lossy decomposition.

Let us now try to preserve information while we reduce redundancy. Let us observe the modified table design shown in Fig. 4.2, which illustrates a way of achieving lossless decomposition.

Fig. 4.2 Another possible decomposition of the Examination table

We see that three tables are created as a result of the decomposition process, instead of two. The Student and the Subject tables continue to exist as before, but the structure of the Student table is changed — the Marks column no longer exists. The newly created table is the Result table. This table shows the marks obtained by a student in a particular subject.

Let us now try to recreate the original information as it was available in the Examination table. From that table, we could obtain a student's roll number, name, the code and name of the subject opted for, and the marks obtained therein. Can we derive the same information from the three tables? If we observe carefully, we will realise that there are referential integrity relationships as follows:

  • Between Student and Result tables, based on the Roll_number
  • Between the Subject and the Result tables, based on the Subject_code

Fig. 4.3 shows these relationships diagrammatically.

Fig. 4.3 Referential integrity relationships between the three tables

If we join these three tables to perform a recomposition, we would get the following columns (after eliminating duplicate columns):

Roll_number

Student_name

Subject_code

Subject_name

Marks

This is precisely what was contained in the original Examination table. Thus, we have been able to preserve the data and the interrelationships between the data elements even after decomposition. Needless to say, this is an example of lossless decomposition.
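
In SQL terms, the recomposition is simply a join. The following is a minimal sketch, assuming the Result table carries the Roll_number, Subject_code and Marks columns, as Figs. 4.2 and 4.3 suggest.

    -- Recompose the original Examination information from the three
    -- decomposed tables (Student, Subject and Result).
    SELECT st.Roll_number,
           st.Student_name,
           su.Subject_code,
           su.Subject_name,
           r.Marks
    FROM   Result  r
    JOIN   Student st ON st.Roll_number  = r.Roll_number
    JOIN   Subject su ON su.Subject_code = r.Subject_code;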

Another interesting point regarding lossy versus lossless decomposition is its close relation with the concept of functional dependency. Let us list down the functional dependencies in the original table (Examination Table) and after we perform decomposition (both lossy and lossless). This is shown in Fig. 4.4.

We will notice that the original Examination table has three functional dependencies. However, when we decompose it into the two-table structure, we lose one of them. Thus, another indicator for identifying whether a decomposition is lossy or lossless is the loss of functional dependencies. Interestingly, when we create the three-table structure (which is a lossless decomposition), we regain the lost functional dependency. Thus, one way to judge the nature of a decomposition (lossy versus lossless) is to search for any missing functional dependencies, as compared to the original list of functional dependencies. We can summarise this as follows:

In lossy decomposition, we lose some of the functional dependency relationships while in lossless decomposition, all functional dependency relationships are preserved.

Fig. 4.4 List of functional dependencies in the original, lossy and lossless relationships

4.3.2 What is Normalisation?

After this overview of functional dependencies, redundancy and decomposition, it is time to take a look at normal forms. Whenever there is talk of good database design, the discussion invariably focuses on the concept of normal forms. Indeed, good database design involves a considerable amount of effort in putting a table through a series of normal forms. Before we discuss normal forms, though, we need to formally define the normalisation process itself.

Normalisation is the process of successive reduction of a given set of relations to a better form.

In other words, normalisation is not very different from decomposition. The only difference between the two is that decomposition does not abide by any formal rules, whereas normalisation does. When we apply a normalisation rule, the database design takes the next logical form – called the normal form.

We will consider the example of an Order table to illustrate the concept of normal forms. The structure of the table is as follows: it contains the columns Order_number, Order_date, Customer_number, Item_number, Item_name, Quantity, Unit_price and Bill_amount.

As we can see, the table contains information about the orders received from various customers. The table identifies an order based on the Order_number column. Similarly, Customer_number can identify a customer and Item_number can identify an item uniquely.

Notice that for a given order several items repeat. Therefore, many columns associated with an item, such as the item number, item name, quantity, unit price and the bill amount, also repeat.

4.3.3 First Normal Form (1NF)

A table is in the first normal form (1NF) if it does not contain any repeating columns or repeating groups of columns.

We have noted that in the Order table, one order can contain several items, so the item-related columns form a repeating group. Clearly, this violates the principle of the first normal form. Therefore, we can conclude that the Order table does not conform to the first normal form in its present state.

As we know, in order to bring the table into the first normal form, we have to take aid of the decomposition technique. More importantly, we need to ensure that the decomposition is lossless.

We shall state a simple process with which a table not initially conforming to the rules of the first normal form can be altered so as to follow the first normal form:

Move all the repeating columns to another table.

What are the repeating columns in the Order table? Clearly, they are:

Item_number

Item_name

Quantity

Unit_price

Bill_amount

These columns can repeat many times in the same order. Therefore, we have to move these columns to another table in order to bring the Order table in the first normal form. As a result, we now have two tables: (a) the modified Order table, and (b) a new table, called Order_item, which contains the repeating columns from the original table.

At this stage, let us find out if this decomposition is lossless. As we know, the simplest way to judge this is to check if we are able to produce the original information after joining the new tables. We now have the following two tables:

Order               Order_item
Order_number        Item_number
Order_date          Item_name
Customer_number     Quantity
                    Unit_price
                    Bill_amount

Let us join these two tables. But what should the join be based on? There has to be some commonality between the two tables if we are to join them. Quite clearly, there is no such commonality here. Thus, this is a case of lossy decomposition.

The way to achieve this commonality is to add another column to the Order_item table, so that an item sold is linked to a particular order. Therefore, we need to add the Order_number column to the Order_item table. This will cause a referential integrity relationship, based on this column, to be established between the two tables. Order_number will be the primary key in the Order table and a foreign key in the Order_item table. Therefore, the modified table structures will be as shown in Fig. 4.5, in which the referential integrity relationship is also illustrated.

Fig. 4.5 Illustration of first normal form

We can see that it is now possible to join the two tables, based on the Order_number such that all information about an order is preserved as well. Thus, this is a lossless decomposition. Incidentally, what should be the primary key of the Order_item table? It cannot be Order_number, because there can be multiple rows for the same Order_number in this table. Therefore, here the primary key should be a combination of the Order_number and the Item_number. Thus, the primary key is a composite or concatenated primary key.
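
Since Fig. 4.5 only sketches the structures, the following is one possible SQL rendering of this 1NF design. The data types are assumptions, and the table is named Orders here because ORDER is a reserved word in SQL.

    -- A possible rendering of the 1NF design of Fig. 4.5.
    CREATE TABLE Orders (
        Order_number    INTEGER PRIMARY KEY,
        Order_date      DATE,
        Customer_number INTEGER
    );

    CREATE TABLE Order_item (
        Order_number INTEGER,
        Item_number  INTEGER,
        Item_name    VARCHAR(30),
        Quantity     INTEGER,
        Unit_price   DECIMAL(10, 2),
        Bill_amount  DECIMAL(10, 2),
        -- Composite (concatenated) primary key
        PRIMARY KEY (Order_number, Item_number),
        -- Referential integrity relationship with the Orders table
        FOREIGN KEY (Order_number) REFERENCES Orders (Order_number)
    );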

Now, are there any repeating values in these two tables? Observing the tables carefully will show that the repeating columns have been removed. Therefore, we can say that our tables are in the first normal form.

4.3.4 Second Normal Form (2NF)

A table is in the second normal form (2NF) if it is in the first normal form and if all non-key columns in the table depend on the entire primary key.

Based on the rule stated above, let us examine if the tables illustrated in the previous section are in the second normal form. The prerequisites for a database to be in the second normal form are:

  • All the tables should be in the first normal form, and
  • All the non-key columns in all the tables should depend on the entire primary key.

We know that the first condition is already satisfied. Let us study the second condition for both the tables.

(a) Let us start with the Order table. This table now contains three columns, namely Order_number, Order_date and Customer_number. The primary key of the table is Order_number. From this column (i.e. Order_number), we can derive the other non-key columns, namely Order_date and Customer_number. Moreover, these non-key columns do not depend on each other at all. Therefore, we can state that this table is in the second normal form.

(b) Now, let us study the Order_item table. This table contains the columns Order_number, Item_number, Item_name, Quantity, Unit_price and Bill_amount. We know that the primary key here is a composite key, namely Order_number + Item_number. Can we determine the values of the other (i.e. non-key) columns of the table only by using this composite primary key? Not quite! We can determine Item_name based on the Item_number alone; we do not need the Order_number for this purpose. Thus, not all non-key columns depend on the entire primary key; some of them depend only on a part of it. In other words, the principle of the second normal form is violated. The same is the case with the Unit_price column — it, too, can be determined from the Item_number alone.

Therefore, we conclude that our database does not follow the second normal form. We need to take some action to rectify this situation. We can follow a simple strategy as follows:

Move columns that do not depend on the entire primary key to another table.

However, before we perform decomposition on the table to bring it into the second normal form, let us think about why we need to do so. The technical name for the problems associated with such a table structure is update anomalies. It refers to the problems that we are likely to face if we retain the existing table structure. These problems are classified according to three SQL operations — INSERT, UPDATE and DELETE.

  • INSERT problems: Imagine that we have a new item, which does not yet appear in the Order_item table. Can we add information regarding this item to the Order_item table? We cannot. This is because until an order is placed for that item, we cannot insert a row for that item in the Order_item table.
  • UPDATE problems: Suppose we have a product with Item_number = 100 and Item_name = Soap. Let us now assume that we wish to assign another product name — Toothbrush — to this Item_number, instead of ‘Soap’. We will need to make this change in many rows of the Order_item table. This is because this item could be a part of many orders. We will have to locate all of them and make this amendment.
  • DELETE problems: Let us assume that we currently have just one order for an item with Item_number 105. Further imagine that, for some reason, this order gets cancelled and we therefore delete its rows from the Order_item table. As a side effect, we will also end up losing all information about the item itself (such as Item_number, Item_name and Unit_price). This is because there is no other order that contains information about this item. The only order that we had for this item no longer exists.

The columns that do not depend on the entire primary key are Item_name and Unit_price. They depend only on the Item_number. Therefore, we need to move these two columns to a new table, say Item. By now, you would have realised that we also need Item_number as one column in this table, and it should also act as the primary key of that table. Consequently, the tables would now look as shown in Fig. 4.6.

Fig. 4.6 Illustration of second normal form

There is a referential integrity relationship between the Order_item and Item tables based on the column Item_number.
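
A possible SQL rendering of this 2NF design is sketched below. The data types are assumptions; this Order_item definition replaces the earlier one, with Item_name and Unit_price moved to the new Item table.

    -- A possible rendering of the 2NF design of Fig. 4.6.
    CREATE TABLE Item (
        Item_number INTEGER PRIMARY KEY,
        Item_name   VARCHAR(30),
        Unit_price  DECIMAL(10, 2)
    );

    CREATE TABLE Order_item (
        Order_number INTEGER,
        Item_number  INTEGER,
        Quantity     INTEGER,
        Bill_amount  DECIMAL(10, 2),
        PRIMARY KEY (Order_number, Item_number),
        FOREIGN KEY (Order_number) REFERENCES Orders (Order_number),
        FOREIGN KEY (Item_number)  REFERENCES Item (Item_number)
    );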

We will not elaborate on the proof that these tables are now in the second normal form; it is easy to check this with recompositions, or joins, as before. However, let us ensure that the update anomalies discussed earlier are taken care of by the modified design.

  • INSERT problems: We can now insert a new item in the Item table without needing to have even a single order for this item.
  • UPDATE problems: We can now easily update the item name and unit price for an item in the Item table. We need to make this change only in this one place (i.e. in exactly one row) and need not touch the Order or Order_item tables at all.
  • DELETE problems: Even if we delete an order from the Order table, the information about that item is not completely lost. This is because the information on items is now held in a separate table (Item table).

4.3.5 Third Normal Form (3NF)

A table is in the third normal form (3NF) if it is in the second normal form and if all non-key columns in the table depend non-transitively on the entire primary key.

A simpler way to state the same thing is:

The third normal form requires a table to be in the second normal form. Additionally, every non-key column in the table must be independent of all other non-key columns.

The third normal form definition may sound complex. However, it is very easy to understand. We shall try and demystify it.

So far, we have discussed the concept of functional dependency. To reiterate, a column (B) in a table is said to be functionally dependent on another column (A) if, given A, we can determine precisely one value of B. In other words, there is a direct relationship between A and B.

Another kind of relationship also exists in relational databases. This type of relationship is called transitive dependency. It is an indirect relationship between two columns. Let us formally define it.

Given the following conditions:

1. There is a functional dependency between two columns, A and B, such that B is functionally dependent on A.

2. There is a functional dependency between two columns, B and C, such that C is functionally dependent on B.

We say that C transitively depends on A.

We can represent this symbolically as:

A → B → C

This should be read as: C transitively depends on A. Of course, in the larger picture, B functionally depends on A and C functionally depends on B.

The third normal form states that we should identify all such transitive dependencies and get rid of them.

Let us look for any possible transitive dependencies in our current table structure.

  • Order table: There is no such dependency in the Order table. The Order_date and Customer_number columns are functionally dependent on the Order_number column.
  • Order_item table: In this table, the Quantity column is functionally dependent on the entire primary key (Order_number + Item_number). However, we cannot say the same thing about the Bill_amount column. After all, the bill amount would be calculated as:

Bill amount = Quantity × Unit price

The quantity is available in the Order_item table, and the unit price is available in the Item table. Thus, the Bill_amount column does not depend directly on the primary key of the Order_item table. Therefore, this is a case of transitive dependency. Hence, we need to get rid of the Bill_amount column.

  • Item table: Here, the non-key columns Item_name and Unit_price functionally depend on the primary key of the table, that is Item_number. Therefore, there is no transitive dependency in this table.

In summary, we need to get rid of the Bill_amount column from the Order_item table in order to bring the table into the third normal form. Do we lose any information because of this? Not really. We can always find out the unit price for a given item from the Item table. We can multiply that by the Quantity column in the Order_item table, to find out the bill amount.
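
In SQL terms, this derivation could look as follows. This is a minimal sketch using the Order_item and Item tables discussed above, once the Bill_amount column has been dropped.

    -- Derive the bill amount per order instead of storing it.
    SELECT oi.Order_number,
           SUM(oi.Quantity * i.Unit_price) AS Bill_amount
    FROM   Order_item oi
    JOIN   Item i ON i.Item_number = oi.Item_number
    GROUP  BY oi.Order_number;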

Let us be methodical and try to find out the update anomalies for this database design.

  • INSERT problems: Imagine that we have a new order which contains four items. We would not know the total bill amount beforehand, as it would depend on the items ordered (i.e. the unit price of each) and the quantities supplied. Therefore, we cannot fill the column Bill_amount. Thus, we have a problem in insert.
  • UPDATE problems: Let us assume that after an order is placed, there are some alterations in terms of the quantity or unit price of some of the items. In such a case, we would need to recalculate the total bill amount and update it for all the rows of the order even if one row of the order is changed.
  • DELETE problems: If we delete one order row for an order consisting of multiple rows (i.e. multiple items), we need to recalculate the total bill amount and update it for all the remaining rows of this order.

To avoid such problems, we can bring a table into the third normal form. The rule for this is:

Get rid of any transitive dependencies from the table.

Stated another way:

Remove those non-key columns that depend on other non-key columns.

Thus, our table structure in the third normal form would look as shown in Fig. 4.7.

Let us revisit the update anomalies to see if they are now taken care of.

  • INSERT problems: Because there is no such column as Bill_amount in the Order_item table, there is no question of calculating and storing the total bill amount. So, the INSERT anomaly no longer exists.

Database design is a specialised activity. Here, we determine what data needs to go into which tables, and how tables are related with each other. The idea is to minimise data duplication and also to ensure that there are no problems in updating data in any of the tables.

Fig. 4.7 Illustration of third normal form

  • UPDATE problems: Even if we make changes to any Order row, we need not recalculate or store the total bill amount. Therefore, this UPDATE anomaly is also taken care of.
  • DELETE problems: Even if we delete one Order row for an order consisting of multiple rows (i.e. multiple items), we need not recalculate the total bill amount or update it. Thus, we have got rid of the DELETE anomaly as well.

4.3.6 Boyce-Codd Normal Form (BCNF)

The first three normal forms provide a means to ensure that a table is reasonably normalised. However, not all design anomalies are taken care of yet. The next normal form in the sequence is not called the fourth normal form, as we might have expected. Instead, it is called the Boyce-Codd normal form (BCNF), named after its creators.

Before we study BCNF, we need to familiarise ourselves with a simple term, that is, determinant. We have studied the concept of functional dependencies in great detail. We know that if B is functionally dependent on A, that is, A functionally determines B, then the following notation holds good:

A → B

In such a case, we call A the determinant.

Based on this definition, let us define BCNF.

A table is in Boyce-Codd normal form (BCNF) if the only determinants in the table are the candidate keys.

A more complete but complicated definition is as follows.

A table is in Boyce-Codd normal form (BCNF) if every column on which some other column is fully functionally dependent is also a candidate key of the table.

Let us now consider an example to understand BCNF. We have a School table consisting of three columns: Student, Subject, and Teacher. One student can study zero or more subjects. For a given student-subject pair, there is always exactly one teacher. However, there can be many teachers teaching the same subject (to different students). Finally, one teacher can teach only one subject. Table 4.4 shows a sample School table.

Table 4.4 Sample School table

Student   Subject   Teacher
Amol      English   Meena
Amol      Hindi     Shrikant
Mahesh    English   Prasad
Naren     Science   Mona
Naren     English   Meena

What are the functional dependencies in this table?

  1. Given a student and a subject, we can find out the teacher who is teaching that subject. Therefore, we have:
    {Student, Subject} → Teacher
  2. Given a student and a teacher, we can find out the subject that is being taught by the teacher to the student. Therefore, we have:
    {Student, Teacher} → Subject
  3. Given a teacher, we can find out the subject that the teacher teaches.

Therefore, we have:

          Teacher → Subject

Now, let us find out the primary key candidates (i.e. candidate keys) for this table.

  1. The Student and Subject columns together can constitute a composite candidate key. This is because, by using these two columns in combination we can determine the remaining column, that is, the Teacher column. Thus, one candidate key is {Student, Subject}.
  2. The Student and Teacher columns together can constitute a composite candidate key. This is because by using these two columns in combination, we can determine the remaining column, i.e. the Subject. Thus, one candidate key is {Student, Teacher}.
  3. Is Teacher a candidate key? It is not, because given a teacher name, we can only determine the subject, but not the student name.

We summarise our observations as shown in Table 4.5.

Table 4.5 Functional dependencies and candidate keys for the School table

Functional dependency           Candidate key
{Student, Subject} → Teacher    {Student, Subject}
{Student, Teacher} → Subject    {Student, Teacher}
Teacher → Subject               None

It can be seen that we have three functional dependencies, but only two of their determinants are candidate keys. Thus, we have a situation that can be described as follows:

There is a determinant (Teacher) that is not a candidate key.

This violates the principle of BCNF. Thus, we conclude that the table is not in BCNF. Let us examine the theoretical and practical aspects (i.e. update anomalies) of this statement.

  • Theory: Whenever we draw a functional dependency diagram, the determinant (the item on the left side of the arrow) has to be a candidate key. There will be arrows from the candidate keys to the non-key columns. BCNF reaffirms this statement and states that there cannot be arrows from any other column to the non-key columns. If this is not the case, then the table is not in BCNF.
  • Practice (i.e. update anomalies):
    • INSERT problem: With reference to the School table, information about a new teacher cannot be added until we have a student assigned. Similar is the case with subjects.
    • UPDATE problem: If the name of a teacher is misspelt, it has to be corrected at many places.
    • DELETE problem: With reference to the School table, imagine that student Mahesh leaves the school for some reason. Now we need to delete the row for student Mahesh. This has a very undesirable side effect: we also lose the information that teacher Prasad teaches English.

Let us now try to convert the data to BCNF. For this purpose, we perform a decomposition operation on the School table to create the following two tables: Student_Subject, and Subject_Teacher. Their definitions are shown in Fig. 4.8.

Fig. 4.8 Database in BCNF
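
Since Fig. 4.8 only names the tables, the following is one possible SQL rendering. The column lengths are assumptions; Teacher can act as the primary key of Subject_Teacher because of the dependency Teacher → Subject.

    -- A possible rendering of the BCNF design of Fig. 4.8.
    CREATE TABLE Student_Subject (
        Student VARCHAR(25),
        Subject VARCHAR(15),
        PRIMARY KEY (Student, Subject)
    );

    CREATE TABLE Subject_Teacher (
        Teacher VARCHAR(25) PRIMARY KEY,  -- one teacher teaches only one subject
        Subject VARCHAR(15)
    );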

What are the functional dependencies in this new database organisation? As will soon be clear, there is just one functional dependency now. Given a teacher name, we can identify the subject that she teaches. The other two functional dependencies from the original School table no longer exist. What is the impact of the loss of these functional dependencies? Analysing the tables will show that we can determine the following:

Functional dependency is a very strong form of dependency. Given something, we can determine another thing with near certainty. For instance, given a married man's name, we can determine his wife's name (i.e. the wife's name is functionally dependent on the husband's name). Of course, this would not hold if the husband has more than one wife!

  • Which student is learning which subjects?
  • Which teacher is teaching which subject?

However, we cannot determine the following:

  • Which teacher is teaching which subject to which student?

This proves that the decomposition we performed is lossy. We should be careful while performing BCNF decompositions, as we may end up losing information in the process. In a nutshell, we may not be able to implement BCNF all the time.

It is worth pointing out that BCNF is actually simpler to understand and implement than 3NF. Unlike 3NF, BCNF does not build on 2NF, and its definition does not involve the concept of transitive dependencies.

4.3.7 Fourth Normal Form (4NF)

A table is in the fourth normal form (4NF) if it is in BCNF and does not have any independent multi-valued parts of the primary key.

The fourth normal form is related to the concept of a multi-valued dependency (MVD). In simple terms, if there are two columns — A and B — and if for a given A, there can be multiple values of B, then we say that an MVD exists between A and B. If there are two or more such unrelated MVD relationships in a table, then it violates the fourth normal form.

Examples showing MVD (and, therefore, 4NF) tend to be contrived. However, we need to use them in order to illustrate the fourth normal form.

The fourth normal form is theoretical in nature. In practice, normalisation up to and including the third normal form are generally adequate. In certain situations, the designers may also have to look at the BCNF. However, rarely do we see the 4NF being employed for any real-life use.

Let us consider a table in which a student name, the subjects she learns and the languages she knows are stored. One student can learn zero or more subjects and can simultaneously know zero or more languages. Quite clearly, this is a strange relationship but we will overlook this fact in order to understand the theory behind 4NF. We can see that there are two independent MVD facts about this relationship:

(a) A student can study many subjects.

(b) A student can learn many languages.

Table 4.6 shows some sample data from this Student table.

Table 4.6 Student table

Student Subject Language
Geeta Mythology English
Geeta Psychology English
Geeta Mythology Hindi
Geeta Psychology Hindi
Shekhar Gardening English

We can see that Geeta is studying two subjects and knows two languages. Because these two facts about Geeta (namely, the subjects that she is studying and the languages that she knows) are independent of each other, we need four rows in the table to capture this information. If Geeta starts studying a third subject (say, gardening), we would need to add two more rows to the Student table of Table 4.6 to record this fact (one for Language = ‘English’ and another for Language = ‘Hindi’). This is certainly cumbersome.

The process of bringing this table into the fourth normal form is:

Split the independent multi-valued components of the primary key into two tables.

The primary key for the Student table is currently a composite key made up of all the three columns in the table — Student, Subject and Language. In other words, the primary key of the table is Student + Subject + Language. Clearly, this primary key contains two independent multi-valued dependencies. Hence, the table violates conditions of 4NF.

Consequently, we need to split these two independent multi-valued dependencies into two separate tables. Let us call them the Student_Subject and Student_Language tables, respectively. The resulting tables would be as shown in Table 4.7.

Table 4.7 (a) Student_Subject table

Student Subject
Geeta Mythology
Geeta Psychology
Shekhar Gardening

Table 4.7 (b) Student_Language table

Student Language
Geeta English
Geeta Hindi
Shekhar English

We can see that this decomposition reduces redundancy with respect to both the independent MVD relationships, that is, subjects and languages. Also, if we now need to add a new subject for Geeta, we have to add just one row to the Student_Subject table. Contrast this with the earlier case, where we had to add two rows to the Student table for this purpose. Thus, the update anomaly is taken care of.
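
As a quick check that this decomposition is lossless, joining the two tables on the Student column reproduces exactly the rows of the original Student table (Table 4.6). A minimal SQL sketch:

    -- Reconstruct the original Student table (Table 4.6) by joining
    -- the two 4NF tables on the Student column.
    SELECT ss.Student, ss.Subject, sl.Language
    FROM   Student_Subject  ss
    JOIN   Student_Language sl ON sl.Student = ss.Student;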

4.3.8 Fifth Normal Form (5NF)

A table is in the fifth normal form (5NF), also called the project-join normal form, if it is in the fourth normal form and every join dependency in the table is implied by the candidate keys.

The fifth normal form is quite theoretical in nature. It is almost never required in practice.

All normal forms up to 5NF perform normalisation with the help of lossless decomposition. This decomposition usually splits one table into two by using a projection operation. However, in 5NF, the decomposition splits the original table into at least three tables.

Consider a SPL table shown in Table 4.8. It shows suppliers (S), the parts they supply (P) and the locations to which they supply these parts (L).

Table 4.8 SPL table

S P L
S1 P1 L2
S1 P2 L1
S2 P1 L1
S1 P1 L1

Note that this table does not contain any independent MVD relationships: the parts (P) are supplied to specific locations (L), so these multi-valued facts are not independent of the suppliers (S). Therefore, the table is in 4NF. All the same, the table does have some amount of redundancy. For instance, the pair S1-P1 occurs twice, but for two different locations (L2 and L1). However, we cannot decompose the table into just two tables to remove this redundancy; this would lead to loss of information and, therefore, would be a lossy decomposition. Let us illustrate this with an actual decomposition of the SPL table into the SP and PL tables, as shown in Table 4.9.

Table 4.9 (a) SP

S P
S1 P1
S1 P2
S2 P1

Table 4.9 (b) PL

P L
P1 L2
P2 L1
P1 L1

Why did we say that this decomposition is lossy? The reason is quite simple. We can still determine:

  • Which suppliers supply which parts?
  • Which suppliers supply to which locations?

However, we cannot determine:

  • Which suppliers supply which parts to which locations?

We can prove this by carrying out a natural join on the two tables (SP and PL). Quite clearly, this join will be based on the common column P. The result of the join operation is shown in Table 4.10. Let us call it SPL2.

Table 4.10 SPL2 table after natural join of the SP and PL tables

S P L
S1 P1 L2
S1 P2 L1
S2 P1 L1
S2 P1 L2
S1 P1 L1

Notice that the fourth row (namely S2-P1-L2) is bogus. Compare the data in this table with the data in the original SPL table. The new table contains this additional row, which was not present in the original SPL table. Thus, the natural join operation has produced one unwanted, spurious row. This proves that our decomposition is lossy.

Interestingly, if we had decomposed the original SPL table into three tables, namely SP, PL, and LS, then the decomposition would have been lossless. Let us prove this. We have already seen the SP and PL tables. Let us also view the LS table, as shown in Table 4.11.

Table 4.11 LS table

L S
L2 S1
L1 S1
L1 S2

Now, let us join together the SP, PL, and LS tables. This process is shown in Fig. 4.9.

Fig. 4.9 SPL table after joining the SP, PL and LS tables

We are able to obtain the original SPL table through this process, and the bogus row disappears too. This proves our earlier point that, in order to bring a table into 5NF, we may need to decompose the original table into at least three tables.
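
Expressed in SQL, the three-way recomposition might look as follows. This is a sketch that assumes the SP, PL and LS tables exist with columns named S, P and L.

    -- Three-way join of SP, PL and LS. The spurious row S2-P1-L2
    -- produced by joining only SP and PL is eliminated, because the
    -- pair (L2, S2) does not appear in LS.
    SELECT DISTINCT sp.S, sp.P, pl.L
    FROM   SP sp
    JOIN   PL pl ON pl.P = sp.P
    JOIN   LS ls ON ls.L = pl.L
               AND ls.S = sp.S;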

In this case, what we are saying is this:

The relation SPL is equivalent to the three projections SP, PL and LS under the following criterion, for a given S, a given P and a given L.

If

S and P appear in SP          and

P and L appear in PL          and

L and S appear in LS

Then the triplet

S, P and L appears in SPL

4.3.9 Normalisation Summary

Let us summarise what we have studied in normalisation.

(a) Take projections of the database in the first normal form to eliminate functional dependencies that are not irreducible (i.e. dependencies on only a part of a candidate key). This gives us the database in the second normal form.

(b) Eliminate the transitive dependencies from the database in the second normal form. This will give us the database in the third normal form.

(c) Eliminate the functional dependencies where the determinant is not a candidate key. This will give us the database in the Boyce-Codd normal form.

(d) Eliminate any multi-valued dependencies that are not functional dependencies. This will give us the database in the fourth normal form.

(e) Eliminate any join dependencies that are not implied by candidate keys. This will give us the database in the fifth normal form.

However, at the cost of repetition, it is worth pointing out that for most practical applications, it is sufficient to carry out only steps (a) and (b) above.

Let us also list down the objectives of normalisation.

(a) Normalisation helps us eliminate (or at least control) certain types of redundancies. We have studied what the normalisation process can control, and what it cannot.

(b) It helps us avoid certain kinds of update anomalies.

(c) It produces a good design or representation of the real world.

(d) It helps us enforce integrity constraints.

The first three normal forms are really useful even in practical situations. But when it comes to the Boyce-Codd Normal Form and the fourth and fifth normal forms, the situation changes. These are more theoretical in nature and can be ignored in most practical situations.

4.3.10 Denormalisation

Interestingly, while we have discussed the benefits of normalisation and glorified it endlessly, a completely opposite concept exists. It is called denormalisation. We can define denormalisation informally as follows:

Denormalisation is a process in which we retain or introduce some amount of redundancy for faster data access.

The argument in favour of denormalisation is:

  1. Full normalisation is a decomposition process. This causes a relation (i.e. table) to be broken down into many separate or distinct relations.
  2. In general, one table at the logical level would internally map to one file at the physical level. Thus, whereas earlier we would have needed to access only one physical file for data retrieval/manipulation operations, we would now need to access several files.
  3. As a result of the above, the number of I/O operations would also increase by a sizeable proportion.
  4. Thus, through normalisation we would get a cleaner database design at the cost of performance.

Let us consider an example of denormalisation. We had considered a database, as shown in Fig. 4.10, as a part of the normalisation process.

Fig. 4.10 Order database organisation revisited

Let us assume that we have to find the bill amount for each order. For this we would need to:

(a) Read all the records from the Order_item table and find out the item number and quantity ordered.

(b) Based on the item number, we find its unit price from the Item table.

(c) Multiply the quantity and unit price and arrive at the bill amount for that item.

(d) Add the amount for all items in the order.

Instead, if we add the Unit_price column to the Order_item table (making it slightly redundant), we can calculate the bill amount for each item directly and skip step (b). This would certainly be faster, since we would not need to go to the Item table, thus saving one I/O operation.
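
To make the trade-off concrete, the following sketch contrasts the two approaches. It assumes the Order_item and Item tables discussed earlier; the second query further assumes that Unit_price has been copied into Order_item as part of the denormalisation.

    -- Normalised design: the unit price must be fetched from Item.
    SELECT oi.Order_number,
           SUM(oi.Quantity * i.Unit_price) AS Bill_amount
    FROM   Order_item oi
    JOIN   Item i ON i.Item_number = oi.Item_number
    GROUP  BY oi.Order_number;

    -- Denormalised design: Unit_price is duplicated in Order_item,
    -- so the join (and the extra I/O on Item) disappears.
    SELECT Order_number,
           SUM(Quantity * Unit_price) AS Bill_amount
    FROM   Order_item
    GROUP  BY Order_number;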

We can think of many such examples, where denormalisation can help speed up the execution but cause redundancy.

However, experts argue against denormalisation in general and this theory in particular. They state that the introduction of redundancy has no theoretical basis. Normalisation is based on strict and proven rules. There is no such concrete basis for denormalisation. Also, the big question is, how much denormalisation would we want? When do we decide to stop denormalisation? There are no criteria to decide this. We can go on and on to any level – even to the first normal form.

Consequently, it is not advisable to perform denormalisation unless we have some real-life data to back arguments in favour of denormalisation.

4.4 ENTITY/RELATIONSHIP (E/R) MODELLING

4.4.1 Aspects of E/R Modelling

In the context of database design, there is always reference to the technique of Entity/Relationship (E/R) modelling. It is a technique that helps us model real-world systems in the form of software constructs. Several terms need to be discussed in the context of E/R modelling, as follows.

  • Entity: An entity is an object that has its own unique identity. Examples of entity are my car, a book, a road, a window, a dream and so on. The reason we have mentioned so many diverse entities is that an entity can be concrete or abstract. This is the E portion of E/R modelling.
  • Entity set: An entity set is a set of entities of the same type. For example, all account holders in a bank can form an entity set called customer.
  • Attribute: An attribute is a characteristic of an entity. For example, the colour, size and shape of a car are its attributes. Similarly, the subject, number of pages and author name are the attributes of a book entity.
  • Relationship: The association between several entities forms a relationship. This is the R portion of E/R modelling. For example, customer A (an entity) could be associated with account 101 (another entity) to form a relationship between these two entities.
  • Relationship set: A relationship set is a collection or set of relationships of a similar type.
  • Binary relationship: The relationship between two entities is called a binary relationship.

Normalisation is the breaking down of one table, or of multiple tables, into smaller tables to ensure that there is minimal or no anomaly in the design. The idea is to store data in such a manner that there is minimum redundancy of information and that we are able to retrieve or update all the data quickly.

Let us understand E/R modelling with a simple example. Consider that we want to represent the details of the customers of a bank and its branches. Fig. 4.11 shows the E/R diagram for this relationship.

Note that we have two entities in this relationship, namely Customer and Branch. Both of these entities have their own set of attributes. For example, a customer has name, address and phone number as its attributes. On the other hand, a branch has name, city and manager as its attributes. The two entities themselves are linked to each other via a relationship set. The relationship set here is the customer's account.

Fig. 4.11 E/R diagram showing the relationship between a bank's branch and its customers

In other words, the branch of a bank has many customers. Each one of them has an account with the branch.
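
One possible mapping of the E/R diagram of Fig. 4.11 to relational tables is sketched below. The table names, column names and data types are illustrative assumptions; the Account relationship set becomes a table that references both entity sets.

    -- A possible relational rendering of the E/R diagram of Fig. 4.11.
    CREATE TABLE Branch (
        Branch_name VARCHAR(30) PRIMARY KEY,
        City        VARCHAR(30),
        Manager     VARCHAR(30)
    );

    CREATE TABLE Customer (
        Customer_name VARCHAR(30) PRIMARY KEY,
        Address       VARCHAR(60),
        Phone_number  VARCHAR(15)
    );

    -- The Account relationship set links customers to branches.
    CREATE TABLE Account (
        Account_number INTEGER PRIMARY KEY,
        Customer_name  VARCHAR(30) REFERENCES Customer (Customer_name),
        Branch_name    VARCHAR(30) REFERENCES Branch (Branch_name)
    );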

The following points regarding symbols in E/R diagrams should be noted:

  • A rectangle represents an entity set. An entity set roughly maps to a table in the RDBMS world.
  • An ellipse represents an attribute. It roughly maps to a column of a table in RDBMS.
  • A diamond represents a relationship set. It roughly maps to the relation between two tables in RDBMS.
  • Finally, a line links an attribute to an entity set or an entity set to a relationship set.

This information is tabulated in Table 4.12.

Table 4.12 Symbols in E/R diagrams

Symbol Purpose in E/R diagram Equivalent in RDBMS
Rectangle Entity set Table
Ellipse Attribute Column
Diamond Relationship set Relation
Line Linking (a) attributes to entity sets or (b) entity sets to relationship sets Relation

Technically, if an entity set contains a primary key, it is called a strong entity set. Otherwise, it is called a weak entity set.

4.4.2 Types of Relationships

Any form of relationship, especially when we use the E/R modelling technique, can be classified into one of three types, as shown in Fig. 4.12.

Fig. 4.12 Types of relationships

Let us define these types of relationships.

  • One-to-one (1:1) – The one-to-one relationship, also denoted as 1:1, is simple. Here, for a given occurrence of an entity (i.e. its value), there is exactly one occurrence of another entity. For example, for a given account number, we can have only one customer. Therefore, there is a 1:1 relationship from an account number to a customer. However, the converse is not true; one customer can hold multiple accounts, so the relationship from a customer to an account is not 1:1.
  • One-to-many (1:m) – In a one-to-many relationship, also denoted as 1:m, for a given occurrence of an entity (i.e. its value), there can be one or more occurrences of another entity. For example, a branch of a bank can have one or more customers. Therefore, there is a 1:m relationship between a branch and its customers.
  • Many-to-many (m:n) – In a many-to-many relationship, also denoted as m:n, there can be one or more values on both sides of the relationship. That is, there can be multiple instances of both the entities that are being related. For example, consider a school in which many students learn many subjects. Clearly, there is an m:n relationship here (see the sketch after this list).
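
As an illustration of how these types translate into table structures, consider the following sketch. The table and column names are illustrative assumptions: a 1:m relationship is usually implemented with a foreign key on the "many" side, whereas an m:n relationship needs a separate linking table whose primary key combines both sides.

    -- 1:m example: one branch, many accounts; the foreign key sits on
    -- the "many" side (assumes the Branch table sketched earlier).
    CREATE TABLE Branch_account (
        Account_number INTEGER PRIMARY KEY,
        Branch_name    VARCHAR(30) REFERENCES Branch (Branch_name)
    );

    -- m:n example: many students learn many subjects; a separate
    -- linking table combines both sides in its primary key.
    CREATE TABLE Student_Subject_link (
        Student VARCHAR(25),
        Subject VARCHAR(15),
        PRIMARY KEY (Student, Subject)
    );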

The type of relationship has a great bearing on database modelling. In other words, when we create E/R relationships to depict real-world situations, or translate them into database designs, we must know the types of relationships we are trying to model. If we do not, we will not be able to decide whether or not to create separate tables when transforming the database models into actual table structures.

Attribute

Boyce-Codd Normal Form (BCNF)

Decomposition

Entity

Entity/Relationship diagram (E/R diagram)

First Normal Form (1NF)

Second Normal Form (2NF)

Third Normal Form (3NF)

Functional dependency

Lossy decomposition

Multi valued Dependency (MVD)

One-to-many relationship (1:m)

Project-join normal form

Weak entity set

Normal forms

Binary relationship

Database modeling

Denormalisation

Entity set

Fifth Normal Form (5NF)

Fourth Normal Form (4NF)

Strong entity set

Transitive dependency

Lossless decomposition

Many-to-many relationship (m:n)

Non-loss decomposition

Normalisation

One-to-one relationship (1:1)

Relationship

  • Database design is related to the logical view of a database.
  • Functional dependency is an important concept in database design. If column A of a table can determine the value of another column — B, then B is functionally dependent on A.
  • Decomposition means breaking one table into multiple tables. If we lose information in the process, it is lossy decomposition. Otherwise, it is lossless or non-lossy decomposition.
  • Normalisation is the process of successive reduction of a given set of relations to a better form. It helps create a very good database design.
  • A normal form specifies the actions needed to remove the drawbacks in the current design of a database. Overall, there are six normal forms, of which the first three are the most widely used.
  • The First Normal Form (1NF) specifies that there must not be any repeating columns or groups of columns in a table.
  • The Second Normal Form (2NF) specifies that all the non-key columns in a table should depend on the entire primary key.
  • A transitive dependency is an indirect dependency relationship. If column C functionally depends on column B, and column B functionally depends on column A, then column C transitively depends on column A.
  • The Third Normal Form (3NF) specifies that all the non-key columns in the table should non-transitively depend on the entire primary key.
  • A table is in the Boyce-Codd Normal Form (BCNF) if only candidate keys are the determinants.
  • If a table has no independent multi-valued parts of the primary key, then it is in the Fourth Normal Form (4NF).
  • If the candidate keys imply every join dependency in a table, then it is in the Fifth Normal Form (5NF).
  • Denormalisation is a process of adding some redundancy to a database purposefully so as to improve performance. It has no theoretical basis, and should be discouraged.
  • Entity/Relationship modeling (E/R modeling) is a technique for modeling real-life systems in the form of software constructs.
  • Relationships between two entities can be one-to-one (1:1), one-to-many (1:m) or many-to-many (m:n).

  1. Column B is said to be functionally independent of column A, if given A, we can precisely determine B.
  2. Decomposition means breaking one table down into multiple tables.
  3. When we lose information due to decomposition, we call it lossless decomposition.
  4. In lossy decomposition, all functional dependency relationships are preserved.
  5. In lossless decomposition, we lose some of the functional dependency relationships.
  6. Normalisation is the process of successive reduction of a given set of relations to a better form.
  7. When we apply a normalisation rule, the database design takes the next logical form called the normal form.
  8. A table is in the second normal form if it does not contain any repeating columns or repeating groups of columns.
  9. A table is in the third normal form if it is in the second normal form and if all non-key columns in the table depend non-transitively on the entire primary key.
  10. A table is in the Boyce-Codd normal form if the only determinants in the table are the candidate keys.

1. _________ is critical in formulating database design.
  (a) Row-column order
(b) Number of tables
(c) Functional dependency
(d) None of the choices offered
2. If column A determines the value of column B, then we write it as ________.
  (a) A -> B
(b) B ->A
(c) A <- B
(d) B <- A
3. Breaking one table into multiple tables is called _________.
  (a) normalisation
(b) composition
(c) denormalisation
(d) decomposition
4. When we preserve information after decomposition, we call it as ________.
  (a) lossy decomposition
(b) lossless decomposition
(c) non-lossy decomposition
(d) both b & c
5. ________ is the process of successive reduction of a given set of relations to a better form.
  (a) Normalisation
(b) Denormalisation
(c) Composition
(d) Decomposition
6. A table is in the ________ normal form if it is in the ________ normal form and if all non-key columns in the table depend on the entire primary key.
  (a) first, second
(b) second, third
(c) third, fourth
(d) second, first
7. The __________ normal form requires a table to be in the second normal form.
  (a) first
(b) third
(c) fourth
(d) fifth
8. The ________ normal form is also called as project-join normal form.
  (a) Boyce-Codd normal form
(b) first normal form
(c) second normal form
(d) none of the above
9. _________ is a process in which we retain or introduce some amount of redundancy for faster data access.
  (a) Normalisation
(b) Composition
(c) Denormalisation
(d) Decomposition
10. _________ technique helps us model real-world systems in the form of software constructs.
  (a) Relationship design
(b) Attribute design
(c) Database design
(d) Entity/Relationship modeling.

1. Explain the concept of functional dependency.
2. What is decomposition?
3. Define the term normalisation.
4. Explain the first normal form.
5. Write a note on the second normal form.
6. Describe the third normal form.
7. Explain the concept of determinant with the help of Boyce-Codd normal form.
8. What is the difference between fourth and fifth normal forms?
9. What is denormalisation?
10. Explain the concept of E/R modelling.

1. In the following column combinations, identify the functional and transitive dependencies:
  (a) Item ID, Item name, Item price, Quantity sold, Bill amount
(b) Student ID, Name, Address, Subject, Marks
2. What are the candidate keys in the following relationships?
  (a) Employee code, Employee name, Driving license number, Passport number, Address
(b) Part ID, Part name, Unit price, Colour, Make
3. What are the primary keys in the following relationships?
  (a) Movie ID, Name, Actor name, Actress name, Director name
(b) Player name, Runs made, Match number, Ground, Date
4. Consider the following Student table. Is it in the second normal form? Provide reasons for your answer. If you think that the database is not in the second normal form, bring it in the second normal form by modifying the design suitably.
  Roll number Number Maximum length 5
  Student name Character Maximum length 30
  Subject Character Maximum length 20
  Marks Number Maximum length 3
  Total marks Number Maximum length 3
5. Is the following modified Student table in the third normal form? Why?
  Roll number Number Maximum length 5
  Subject Character Maximum length 20
  Marks Number Maximum length 3
  Teacher ID Character Maximum length 5
  Teacher Name Character Maximum length 20
6. Consider the following table.
  Player Character Maximum length 25
  Sport Character Maximum length 15
  Coach Character Maximum length 25
  The primary key is Player + Sport. This table is not in BCNF. Bring it into BCNF. Would it lead to any problems?
7. Consider the following table:
  Child Character Maximum length 25
  Hobby Character Maximum length 25
  Teacher Character Maximum length 25
  One child can have many hobbies. Many teachers can teach the child. There is no dependency between the hobby and the teacher columns. Is this table in 5NF? If not, how would you bring it into 5NF?
8. Draw an E/R diagram for modelling the relationship between students who learn many subjects taught by many teachers. We need to maintain information such as the students’ names, addresses and divisions; and the teachers’ names and addresses.
9. Would you call a 1:1 relationship a functional dependency or a transitive dependency? Justify your answer.
10. Is a m:1 relationship the same as a 1:m relationship? Why?