Chapter 8

Distributed Databases

Distributed databases allow sharing of data across cities, states, countries, and continents. It is truly a magnificent technology, which was unthinkable even a few years ago.

Chapter Highlights

  • Data Distribution
  • Principles of Distributed Databases
  • Issues in Distributed Databases
  • Client/Server Computing
  • Date's 12 Rules for Distributed Databases

8.1 DISTRIBUTED DATABASE CONCEPTS

The technology behind computer networks and data communications has had profound implications on the way computers work. Computer networks make it possible to connect multiple computers to each other. These computers can be located in the same/different buildings, cities, states, countries and even continents. This provides great flexibility to applications that make use of the computers and the networks that connect them. This flexibility is in terms of the distribution of processing logic and data across these networks, unlike earlier, when everything was centralised. Many modern applications and databases are distributed. In simple terms, this means that the processing logic and the data used by it need not be physically located on a single computer. Instead, it can be spread across multiple computers, which are connected to each other in some form.

The great idea of distributed databases was born when the technology of databases was married to the concept of networking.

The idea is illustrated in Fig. 8.1.

Fig. 8.1 Distributed databases = Databases + Networking

As we can see, the computers connected by a network have their own databases. At this point, however, we will not describe the intricacies, and instead, focus on the broad level concepts. We shall examine the internal details of such schemes and all the issues therein later in this chapter.

Data distribution is no longer a feature that is good to have. It is almost mandatory now, with the globalisation of data, widening of boundaries, and the tremendous impact of the Internet on our life.

The concept of distributed databases took off with the advent of wired and wireless data communication facilities. It assumed greater significance when the Internet was born. Now that the Internet has become an extremely important medium for conducting business worldwide, distributed databases have become the most effective technology for sharing data across distances.

At this stage, it is important to distinguish between two terms: distributed computing and distributed databases. Although related, these terms mean different things.

8.1.1 Distributed Computing

In the case of distributed computing, also called distributed processing, there are a number of processing elements connected by a computer network. These processing elements cooperate with each other in order to perform a task (i.e. the overall application logic). The idea is to divide a big problem into small chunks, and execute these chunks on different computers, thus solving it in an efficient manner.

The main benefits of this scheme are as follows:

  1. The power of many computers connected together by a network is used. Thus, the overall computing resources in use are far greater than in a centralised system.
  2. Tasks can be distributed, managed and executed fairly independently of one another.

The conceptual distinction between centralised computing and distributed computing is shown in Fig. 8.2.

8.1.2 Distributed Databases

The term distributed databases has nothing to do with the processing logic as such. A distributed database is a collection of multiple logically-interrelated databases, connected to one another over a computer network. This also leads us to the definition of a Distributed Database Management System (DDBMS).

A DDBMS is a set of programs that manages a distributed database.

The fundamental concept behind a distributed database is: the fact that a database is physically distributed across many computers should be hidden from the users. The users perceive a distributed database as a single database which is physically located on a single computer. The users have no idea – in fact, they should not have an idea – that the database is scattered across many computers. This concept is shown in Fig. 8.3.

Fig. 8.2 Centralised versus distributed processing

Fig. 8.3 The illusion of centralised database in a distributed database environment

8.2 DISTRIBUTED DATABASE ARCHITECTURES

Three possible architectures emerge when we think of distributed databases, as shown in Fig. 8.4.

Fig. 8.4 Distributed database architectures

We shall now discuss these possible architectures one-by-one.

  • Shared nothing

    In this type of architecture, different computers are connected to one another to form a distributed environment. However, as the name (shared nothing) suggests, every computer has its own database. No computer shares its database with any other computer. Thus, although this is a distributed environment containing computers connected together by a network, the data itself is not shared at all. This idea is shown in Fig. 8.5.

    Fig. 8.5 Shared nothing architecture

  • Networked architecture with one centralised database

    In the networked architecture with one centralised database, there is one common database which is shared by all the computers in the distributed environment. Thus, all the applications on the distributed computers share a single database. This architecture is shown in Fig. 8.6.

    Fig. 8.6 Networked architecture with one centralised database

  • Truly distributed database

    The truly distributed database architecture is shown in Fig. 8.7. Here, every computer in the distributed environment has its own database. However, all these databases are shared, unlike in shared nothing architecture.

    Fig. 8.7 Truly distributed database

8.3 ADVANTAGES OF DISTRIBUTED DATABASES

Let us now take a look at the major advantages of distributed databases.

  1. Management of distributed data with different transparency levels

    A distributed database provides various kinds of transparencies, or abstraction levels. In other words, the internal complexities of certain things are hidden from the user. There are three kinds of transparencies, as shown in Fig. 8.8.

    Fig. 8.8 Types of transparencies

    (a) Distribution/Network transparency: The principle of distribution transparency, also called network transparency, hides the details of where the actual tables are physically stored. The users need not be aware of these physical locations of the databases. From the user's point of view, there is a single, unified database to work with. This kind of transparency can be sub-divided into two categories, as shown in Fig. 8.9.

    Fig. 8.9 Distribution/Network transparency types

    • Location transparency: Here, the user is not aware that the command entry and execution happen on different computers. In other words, the user may enter a command (such as an SQL SELECT query) on computer A. Internally, because of the distributed nature, the query may actually be routed to another computer, B, where it would be executed. The result of the query would be returned to computer A, and presented to the user. However, the user would not be aware of this at all and may think that the query was actually executed on computer A. Thus, the location of the execution is transparent to the user. Hence, it is called location transparency. The idea is shown in Fig. 8.10.

      The portion shown inside the box (where computer A forwards the user's query to computer B, and computer B returns the result to computer A) is hidden from the user. As far as the user is concerned, she simply enters a query and gets an answer.

    • Naming transparency: Apart from the fact that the user is not aware of the location of the actual database, there is another important point. The user must not need to add any extra information to help the DDBMS identify that the Employee table is located on computer B, and not on computer A. For example, the user should not have to qualify the table name as something like Employee@B. The DDBMS should take care of this, thus providing naming transparency to the user.

      Fig. 8.10 Location transparency

      (b) Replication transparency: In a distributed database environment, it is quite possible that the same set of data elements exists at multiple locations/sites. This is done to provide better availability, performance and reliability. As before, the user need not be aware of this replication; the DDBMS provides replication transparency. We shall study the significance of replication later.

      (c) Fragmentation transparency: In replication, the same sets of data elements are present on multiple computers. In fragmentation, a table is divided either horizontally (i.e. sliced into rows) or vertically (i.e. certain columns are selected) to create fragments of the original table. These fragments are then distributed across the sites of the DDBMS. The two methods are respectively called horizontal fragmentation (as shown in Fig. 8.11) and vertical fragmentation (as shown in Fig. 8.12). Regardless of the type of fragmentation, the user should not be aware of its presence.

      Fig. 8.11 Horizontal fragmentation

      Fig. 8.12 Vertical fragmentation

      We can see that the rows of a table are split and distributed across the sites in horizontal fragmentation.

      We can see that the columns of a table are split and distributed across the sites in vertical fragmentation.

  2. Better reliability and availability

    Reliability refers to the probability that the system is up and running at a given point in time. Availability is the probability that the system is continuously available during a given time interval.

    Because multiple computers share the responsibility and burden of database processing, even if one of them fails, the others can continue working. This is in stark contrast to a centralised system, where if the main computer that holds the database fails, the whole system is brought to its knees. Apart from reliability and availability, distributed database systems have one more advantageous feature. Because data can be replicated, even some of the data belonging to a failed site may be available at other sites that are still functioning.

  3. High performance

    A simple trick is employed to ensure high performance in distributed database systems: keep data closest to the place where it is most often required. This is referred to as data localisation. Because of this, the CPU, I/O and communication requirements (and hence the delays) are reduced. Moreover, because of fragmentation, every site holds a smaller database rather than the complete database. This also helps performance and throughput.

  4. Easy to expand/change

    Because of their localised nature, distributed databases are easier to expand and change. The scope of expansion and changes is local, and does not impact other portions of the system too much. This is quite unlike a centralised database system, where changes can be very complex and time-consuming.

Data distribution is in the form of either maintaining copies of the same data at various locations (replication) or splitting it in such a manner that we can always join these broken portions to reconstruct the data in the original form (fragmentation).

8.4 DISTRIBUTED DATABASE REQUIREMENTS

Listed below are the requirements of distributed database systems.

  1. Tracking data – As compared to a non-distributed (i.e. centralised) DBMS, a DDBMS has to perform several additional tasks. These tasks arise because of the nature of distributed databases. As we know, distributed databases can contain data elements that are distributed, fragmented and/or replicated. As such, the DDBMS must keep track of exactly where the fragments or replicated pieces of data elements exist. Accordingly, it needs to maintain additional information about these aspects. This is useful when one needs to retrieve data from the database.
  2. Distributed query processing – Apart from managing the distribution, fragmentation and replication of data, the DDBMS also needs to perform distributed query processing carefully. Data is distributed and, therefore, so is the processing of a query. This is because the query needs to go to the various sites or locations of the database and perform the query operations on the data stored there. In addition, the DDBMS must merge the results of these operations in order to create the final output. As we can imagine, this involves not only query parsing and execution logic, but also other aspects such as optimisation. The idea of distributed query execution is shown in Fig. 8.13.

    Fig. 8.13 Distributed query processing

  3. Distributed transaction management – Distributed query processing is the first challenge for DDBMS in terms of coordinating the activities of the various sites participating in a distributed environment. The tougher challenge is to coordinate all of them in order to perform distributed transactions.

    A distributed transaction is one in which multiple sites participate to perform a logical unit of work.

    We have studied the two-phase commit protocol earlier. This protocol enables the sites participating in a distributed database environment to carry out transaction processing. We shall not discuss it here again.

  4. Management of replicated data – One of the options for implementing distributed databases is to replicate data. This poses its own challenges. For example, suppose we have three sites, A, B and C, containing replicated data. What should be done if a user modifies some of this data at site A? Should we perform the same modifications at sites B and C? If not, the synchronisation of data would be lost. If we do perform the same modifications at sites B and C, it may lead to poor performance, because every such update and its replication means overheads in terms of database locking, communication and transaction processing. Clearly, a DDBMS needs to evaluate all the possible options with their pros and cons, and take a decision accordingly.
  5. Distributed data recovery – Data recovery in non-distributed databases is challenging enough; when data is distributed, it becomes even harder. We now need to know what happens to the various data elements that are replicated or distributed across multiple sites, because each transaction may span many sites. For example, in a distributed Funds Transfer transaction, the debit may happen on site A, but the credit may need to be effected on a different site, say B. This means that the database logging mechanisms are more complex and have to gear up for distributed data recovery as well. The undo and redo processing discussed earlier now applies to the databases stored on different sites. Therefore, there must be coordination between these sites. Perhaps one master log can be made responsible for the overall recovery operation and can drive the logs on the other sites. This is shown in Fig. 8.14.

    Fig. 8.14 Distributed data recovery through logging

  6. Security – Maintaining security of data in a distributed environment is not easy. Now users are not located at a single place and therefore, uniform procedures, practices and access control mechanisms are not simple to implement. Determining who can access which data and perform which actions needs to be carefully thought out and implemented. Even worse, due to replication and fragmentation of data, sites can have data that does not belong to them. Therefore, it becomes even more important to safeguard data and other database objects from possible attacks.

8.5 DISTRIBUTED DATABASE TECHNIQUES

We have studied that there are two primary ways in which a database can be distributed. Fig. 8.15 shows these approaches.

Fig. 8.15 Distributed database techniques

We shall recap these concepts in brief.

8.5.1 Data Fragmentation

In data fragmentation, the tables of a database are physically fragmented and distributed across a network. No data items are replicated (i.e. duplicated). Data fragmentation can be further classified into two categories, as shown in Fig. 8.16.

Fig. 8.16 Data fragmentation classification

Let us discuss these.

  • Horizontal fragmentation: In the horizontal fragmentation approach, we take a subset of the rows in a table. A simple WHERE condition can do this job. In addition, a projection operation using the SELECT clause can retrieve either the required columns or all of them; in many practical situations, only a few columns are retrieved. For example, a horizontal fragment of the Employee table could contain the employees working in the departments named Admin, Projects or Finance. Note that when we retrieve only the employee name from all the columns in the Employee table, it is a projection operation. On the other hand, the retrieval of only the employees working in the three named departments is a selection operation (i.e. horizontal fragmentation).

    A special case of horizontal fragmentation is derived horizontal fragmentation. In this case, the fragmentation of a table also causes other related tables to be fragmented. For example, when we perform horizontal fragmentation on the Employee table as discussed above, we may also bring the matching rows (i.e. the rows for those three departments) from the Department table to the same site. That way, all the information regarding those departments and their employees will be available at one place. The idea is shown in Fig. 8.17.

    Fig. 8.17 Derived horizontal fragmentation

  • Vertical fragmentation: In the vertical fragmentation approach, we take a subset of the columns in a table. This is a projection operation. For example, we could select only the employee name and salary from the Employee table and maintain this information at a site. Importantly, vertical fragmentation should be done in such a way that it is possible to join or merge the fragments back to reconstruct the original table. For example, suppose the Employee table contains the following columns:

    Employee_number, Name, Salary, Age, Sex, Department

    We may then fragment this table at two sites, containing the following columns respectively:

    Site 1 – Employee_number, Name, Salary

    Site 2 – Employee_number, Age, Sex, Department

    Note that we have maintained the primary key column (i.e. Employee_number) at both the sites. This is done in order to be able to recreate the original Employee table (by joining the above two tables based on the Employee_number column) as and when needed.

    Thus, we can state the following:

    In vertical fragmentation, the primary key is a part of all the fragmented tables at various sites.

    Another approach to fragmenting data is called hybrid fragmentation, which is essentially a combination of horizontal fragmentation and vertical fragmentation, as shown in Fig. 8.18. In fact, we have already encountered this idea while studying horizontal fragmentation. (A short sketch after Fig. 8.18 illustrates horizontal and vertical fragmentation on sample data.)

    Fig. 8.18 Hybrid fragmentation
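The short Python sketch below illustrates the two forms of fragmentation on a hypothetical in-memory Employee table (the sample rows and site assignments are invented for illustration). It also shows how the vertical fragments, which both retain the primary key, can be joined back to reconstruct the original table.

# A minimal sketch of horizontal and vertical fragmentation on a
# hypothetical Employee table, represented as a list of dictionaries.
employee = [
    {"Employee_number": 1, "Name": "Asha", "Salary": 50000, "Age": 30, "Sex": "F", "Department": "Admin"},
    {"Employee_number": 2, "Name": "Ravi", "Salary": 60000, "Age": 42, "Sex": "M", "Department": "Projects"},
    {"Employee_number": 3, "Name": "Meena", "Salary": 55000, "Age": 35, "Sex": "F", "Department": "Sales"},
]

# Horizontal fragmentation: a subset of ROWS (a selection), e.g. only the
# employees of the Admin, Projects or Finance departments.
horizontal_fragment = [row for row in employee
                       if row["Department"] in ("Admin", "Projects", "Finance")]

# Vertical fragmentation: a subset of COLUMNS (a projection). The primary
# key (Employee_number) is retained in every fragment.
site1_fragment = [{k: row[k] for k in ("Employee_number", "Name", "Salary")} for row in employee]
site2_fragment = [{k: row[k] for k in ("Employee_number", "Age", "Sex", "Department")} for row in employee]

# The original table can be reconstructed by joining the vertical fragments
# on the common primary key.
def join_on_key(frag_a, frag_b, key="Employee_number"):
    index = {row[key]: row for row in frag_b}
    return [{**row, **index[row[key]]} for row in frag_a if row[key] in index]

reconstructed = join_on_key(site1_fragment, site2_fragment)
assert reconstructed == employee          # the original rows are recovered
print(len(horizontal_fragment), "rows in the horizontal fragment")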

8.5.2 Data Replication

In data replication, a copy of the same data is available at more than one site.

In other words, we create some amount of redundancy. The two most important reasons for employing this strategy are as follows:

(a) The data can be made available at the site where the information is requested, or at least at a site close to it. This minimises the data transmission requirements.

(b) Overall data availability is increased. If one site is down, we can retrieve data from another site (i.e. a copy of the same data).

Data replication (or no replication at all) can be performed in three ways, as shown in Fig. 8.19.

Fig. 8.19 Data replication options

Let us examine the different types of replication now.

In a fully replicated database environment, all the sites contain the full database.

  • Fully replicated databases: Every data element is available at every site in this type of database. This increases data availability and fault tolerance manifold. Almost all queries can be executed locally, without needing to retrieve any data from another site. Also, the performance of query execution is quite high, because of this local execution.

    However, there are certain drawbacks of this scheme as well. Most notably, the performance of update operations is poor. This is because when a user updates the database at one site, the changes made must be replicated to all the other sites in order to maintain data integrity and consistency. Moreover, concurrency control and data recovery are not easy to achieve.

    In a partially replicated database environment, some of the databases/tables are replicated at some of the sites.

  • Partially replicated databases: In this form of replication environment, not every data element is present at every site. Which databases to replicate, how, and where is a decision specific to an environment. In general, databases that are queried more often than they are updated should be replicated. This provides better performance without too many overheads.

    In an environment that has no replication, a data element is stored at exactly one site.

  • No replication: In this environment, only one copy of each data element exists; no other copy is stored at any other site. This is the other extreme, as compared to full replication.

    There are no replication overheads for update operations in this case. However, because only one copy of data exists overall, data availability is not very high. If the site containing a specific data element is down and a user requests it, the user must wait. Also, data transmission requirements are higher, since queries cannot always be executed locally. The DDBMS must first bring in data from the relevant site before it can execute such queries. The sketch below illustrates these trade-offs.
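The trade-off between these options can be made concrete with a small, purely illustrative Python sketch: with more copies, reads are more likely to be local, but an update must be propagated to every remote copy. The sites, copy placements and message counts below are assumptions, not figures from the text.

# Illustrative message counts for one read and one update of a data item,
# depending on how many sites hold a copy (ignores locking, logging, failures).
def messages_for_read(copy_sites, reader_site):
    # A read is local if the reader holds a copy; otherwise one round trip.
    return 0 if reader_site in copy_sites else 2

def messages_for_update(copy_sites, writer_site):
    # Every remote copy must receive the update (one message per copy).
    return len([site for site in copy_sites if site != writer_site])

all_sites = {"site1", "site2", "site3", "site4"}
schemes = {
    "full replication": all_sites,              # a copy at every site
    "partial replication": {"site1", "site2"},  # copies at only some sites
    "no replication": {"site3"},                # exactly one copy
}
for name, copies in schemes.items():
    print(f"{name:20s} read at site1: {messages_for_read(copies, 'site1')} messages, "
          f"update at site1: {messages_for_update(copies, 'site1')} messages")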

Client/server computing has gained popularity in the last few years. Earlier, it was expected that the powerful server computer (e.g. a mainframe) would do almost everything related to the application, and that the user's computer would be quite dumb in nature. This is no longer the case. Users now want their computers to be true participants in the execution of the application.

8.6 DISTRIBUTED QUERY PROCESSING

A DDBMS has to take care of executing queries that are distributed, that is, queries that refer to tables at multiple sites. This involves certain critical decisions, which we shall now discuss.

8.6.1 Costs

The most important cost in a DDBMS environment is that of data transmission. This is quite logical. Even in a centralised (non-distributed) environment, there are many types of costs, such as CPU time, memory requirements, hard disk space and so on. However, because the query is entered and executed on a single site, there is no question of data transmission costs.

This is not true in the case of distributed queries. Here, we incur all the costs that are associated with a centralised environment and, in addition, we also incur data transmission costs. The idea is illustrated in Fig. 8.20.

Because the data is not available at a centralised location, we must first transmit it from one site to another. We may also need to transmit intermediate results. The strategy should always be to try and minimise these data transmission requirements.

Fig. 8.20 Centralised and DDBMS query processing requirements

Let us consider an example to illustrate this. Suppose we have just three sites numbered 1, 2 and 3. Of these sites, sites 1 and 2 contain one table each, as shown in Fig. 8.21.

Fig. 8.21 Data distribution example

A user at site 3 (called the result site) executes the following query:

SELECT Emp_Name, Salary, Department_Name

FROM Employee, Department

WHERE Employee.Department_Number = Department.Department_Number

Let us assume that the Employee table at site 1 contains 15,000 rows, each 200 bytes long, and that the Department table at site 2 contains 150 rows, each 50 bytes long. Every employee definitely has a department number assigned, so this query would produce 15,000 rows. Also, let us assume that every output line of the query contains 50 bytes.

In order to process this query, the DDBMS can follow one of the following approaches (conceptually shown in Fig. 8.22).

Fig. 8.22 Distributed query processing – possible approaches

Let us now discuss these approaches.

(a) Transfer the Employee and Department tables to site 3 and do a join. This will lead to the following steps:

(i) Transfer Employee table from site 1 to site 3. This means transfer of 200 × 15,000 bytes, or 30,00,000 bytes.

(ii) Transfer Department table from site 2 to site 3. This means transfer of 50 × 150 bytes, or 7,500 bytes.

(iii) At site 3, perform the join and show the result to the user.

Thus, the total data transmission requirement would be worth 30,07,500 (30,00,000 + 7,500) bytes.

(b) Transfer the Employee table to site 2 and do a join. Transfer the result of this operation to site 3. This will lead to the following steps.

(i) Transfer Employee table from site 1 to site 2. This means transfer of 200 × 15,000 bytes, or 30,00,000 bytes.

(ii) Perform the join operation at site 2 and send the results to site 3. This will cause 50 × 15,000, or 7,50,000 bytes of data to be transferred from site 2 to site 3.

(iii) At site 3, show the result to the user.

Thus, the total data transmission requirement would be worth 37,50,000 (30,00,000 + 7,50,000) bytes.

(c) Transfer the Department table to site 1 and do a join. Transfer the result of this operation to site 3. This will lead to the following steps:

(i) Transfer Department table from site 2 to site 1. This means transfer of 50 × 150 bytes, or 7,500 bytes.

(ii) Perform the join operation at site 1 and send the results to site 3. This will cause 50 × 15,000, or 7,50,000 bytes of data to be transferred from site 1 to site 3.

(iii) At site 3, show the result to the user.

Thus, the total data transmission requirement would be worth 7,57,500 (7,500 + 7,50,000) bytes.

Clearly, approach (c) is the most desirable, as it requires the minimum amount of data transfer.
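The arithmetic behind this comparison can be captured in a few lines of Python. The row counts and record sizes are the ones assumed in the example above.

# Data transfer (in bytes) for the three join strategies in the example.
EMP_ROWS, EMP_SIZE = 15_000, 200   # Employee table at site 1
DEP_ROWS, DEP_SIZE = 150, 50       # Department table at site 2
RES_ROWS, RES_SIZE = 15_000, 50    # join result, displayed at site 3

strategies = {
    "(a) ship both tables to site 3": EMP_ROWS * EMP_SIZE + DEP_ROWS * DEP_SIZE,
    "(b) ship Employee to site 2, join, ship result to site 3": EMP_ROWS * EMP_SIZE + RES_ROWS * RES_SIZE,
    "(c) ship Department to site 1, join, ship result to site 3": DEP_ROWS * DEP_SIZE + RES_ROWS * RES_SIZE,
}
for name, cost in sorted(strategies.items(), key=lambda item: item[1]):
    print(f"{name}: {cost:,} bytes")
# Strategy (c) transfers the least data: 757,500 bytes in total.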

8.6.2 Semi-join

We will note that although several approaches exist for distributed query processing, all of them involve the transfer of a great amount of data from one site to another. Better techniques exist, which minimise the amount of data transmission between sites.

Semi-join is a technique that attempts to minimise the amount of data that needs to be transferred between sites during distributed query execution.

Semi-join reduces unwanted data transmission, thus reducing the load on the communication links. As a result, the overall performance of distributed query processing is a lot better than otherwise. The technique is simple—minimise the number of rows and columns before sending the data to another site. How can this be done? Let us understand with a simple example.

Suppose we need to join two tables, A and B, located at different sites. Table A is located at site 1, and table B is located at site 2. So far, we have studied that we need to:

1. Send table A from site 1 to site 2 and do a join at site 2, or

2. Send table B from site 2 to site 1 and do a join at site 1.

The approach to be chosen depends on the size of the two tables (whichever is smaller should be selected). Semi-join improves upon this. Rather than sending the entire table, it sends just the column(s) on which the join is to be achieved. This dramatically reduces the amount of data that needs to be transmitted from one site to another. Conceptually, this is shown in Fig. 8.23.

The Internet is also a special case of client/server computing. The Web browser is the client, and the Web server is the server here.

Fig. 8.23 Distributed query processing with and without semi-join

Let us understand this better with an extension of the same example discussed earlier, in which we had concluded that approach 3 was the best, as it involved minimum amount of data transfer.

Let us assume that the join is to happen on the Department_Number columns of the Employee and Department tables, and that the size of this column is 5 bytes. Therefore, the transfer of only this column from site 2 to site 1 would now involve 5 × 150 = 750 bytes, instead of 7,500 bytes as in the earlier case (refer to the previous section for more details). The saving achieved by semi-join is small here, but it can be quite significant in other cases.
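A minimal Python sketch of the semi-join idea follows, using tiny hypothetical tables. Only the join column travels from the Department site to the Employee site, and only the matching rows (restricted to the columns needed for the result) travel back; one common arrangement then completes the join at the Department site or at the result site.

# Semi-join sketch on hypothetical data; shipping data between sites is
# modelled simply by passing Python values around.
employee_site1 = [   # Employee table held at site 1
    {"Emp_Name": "Asha", "Salary": 50000, "Department_Number": 10},
    {"Emp_Name": "Ravi", "Salary": 60000, "Department_Number": 20},
    {"Emp_Name": "Meena", "Salary": 55000, "Department_Number": 30},
]
department_site2 = [  # Department table held at site 2
    {"Department_Number": 10, "Department_Name": "Admin"},
    {"Department_Number": 20, "Department_Name": "Projects"},
]

# Step 1: site 2 ships ONLY the join column to site 1 (a few bytes per row).
join_values = {row["Department_Number"] for row in department_site2}

# Step 2: site 1 keeps only the matching rows, and ships this reduced table back.
reduced_employees = [e for e in employee_site1
                     if e["Department_Number"] in join_values]

# Step 3: the join is completed where the Department table resides (or at the
# result site), and the answer is produced.
dept_names = {d["Department_Number"]: d["Department_Name"] for d in department_site2}
result = [{"Emp_Name": e["Emp_Name"], "Salary": e["Salary"],
           "Department_Name": dept_names[e["Department_Number"]]}
          for e in reduced_employees]
print(result)   # Asha/Admin and Ravi/Projects; Meena's department has no match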

8.6.3 Distributed Query Decomposition

When working with distributed databases, two main options are available for the processing of queries, as shown in Fig. 8.24.

Fig. 8.24 Distributed query execution

Let us now discuss these approaches.

  • No distribution of query: This approach is quite simple. The user is responsible for working with data fragments and/or replicated data. For instance, if the user is interested in finding out the names and salaries of employees working in department 5, he is completely responsible for locating this information. The user must specify which copy of the replicated database should be chosen (i.e. from which site the table should be taken). Moreover, the user is responsible for ensuring data integrity and consistency among all the participating sites.
  • Full distribution of query: In this approach, the query is issued as though the user is working on a centralised database. There is no notion of distribution of data from the user's point of view. It is the DDBMS which takes all decisions regarding the retrieval of appropriate tables from various sites. The DDBMS performs a task called query decomposition for this purpose. The technique is simple: the DDBMS breaks the user's query into smaller queries. Each smaller query is called a sub-query and executes at the respective site chosen by the DDBMS. The results of all the sub-queries are finally combined to produce the output. The DDBMS maintains and uses information regarding the fragmentation, replication and distribution of data for this purpose. This is shown in Fig. 8.25, and a small sketch after the figure illustrates the idea.

    Also, the user does not handle tasks such as maintaining data integrity and consistency between the various sites. The DDBMS performs these tasks for the user.

    Fig. 8.25 Query decomposition
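The sketch below gives a rough, simplified picture of query decomposition. It assumes the DDBMS keeps a catalogue mapping each fragment to a site, ships one sub-query to every relevant site and merges the partial results; the fragment names, sites and data are hypothetical.

# A simplified sketch of query decomposition: the DDBMS (not the user)
# consults its catalogue, runs one sub-query per site and merges the results.
catalogue = {                      # hypothetical fragment locations
    "Employee_frag1": "site1",
    "Employee_frag2": "site2",
}
site_data = {                      # hypothetical rows stored at each site
    "site1": {"Employee_frag1": [{"Name": "Asha", "Salary": 50000, "Dept": 5}]},
    "site2": {"Employee_frag2": [{"Name": "Ravi", "Salary": 60000, "Dept": 5},
                                 {"Name": "Meena", "Salary": 55000, "Dept": 7}]},
}

def run_subquery(site, fragment, predicate):
    # Stand-in for shipping a sub-query to a site and executing it there.
    return [row for row in site_data[site][fragment] if predicate(row)]

def distributed_select(predicate):
    # The user sees one logical Employee table; the DDBMS does the rest.
    partial_results = []
    for fragment, site in catalogue.items():
        partial_results.extend(run_subquery(site, fragment, predicate))
    return partial_results         # merge step: a union of the fragment results

# "Names and salaries of employees in department 5", issued as if centralised.
print(distributed_select(lambda row: row["Dept"] == 5))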

8.7 DISTRIBUTED CONCURRENCY CONTROL AND RECOVERY

8.7.1 Concurrency and Recovery Problems

Several problems arise while dealing with concurrency control and recovery issues in distributed database systems. These problems are specific to distributed databases, and are not observed in centralised database systems. Some of the main problems are listed below.

  • Site failure: There are situations when one or more sites in a DDBMS fail. In such situations, it is important that the DDBMS continues functioning. When the sites resume functioning, they must be brought back to the state of the other sites. In other words, the consistency and integrity of the database must be restored.
  • Network problems: Many times, the communication network fails, causing one or more sites to be cut off from the rest of the sites in a DDBMS environment. An extreme case is when the sites in a DDBMS environment are split into two portions. This is called network partitioning. In this case, a site can communicate with other sites that belong to the same side of the partition, but not with any site that belongs to the other side of the partition.
  • Data duplication: Because of data replication, multiple copies of the same data may exist. Hence, the DDBMS needs to monitor these copies and make sure that the database is consistent and is in a recoverable state, if and when problems arise.
  • Distributed transactions: There may be problems when a transaction spans across multiple sites in terms of some sites being successful in committing/rolling back their changes, but others not being able to do so. We know that the two-phase commit protocol is used in such situations.
  • Distributed deadlocks: In a centralised DBMS, a deadlock occurs at a single site. It is quite easy to detect and deal with such a deadlock. However, in a DDBMS, a deadlock may involve one or more of the multiple sites. This situation is called a distributed deadlock. The DDBMS must be able to handle such situations.

Here we shall discuss some of the steps taken in order to deal with such problems in a DDBMS environment.

8.7.2 Distinguished Copy

Data replication helps achieve better performance in a DDBMS environment. However, it also poses a major challenge due to the possibility of data inconsistency. In order to deal with such situations, the concept of distinguished copy may be used. The idea is simple–if a data item (say a table) exists at three sites, one of the copies is designated as the distinguished copy (or the main copy) and the other two copies depend on it. Any locking and unlocking that needs to be done because of transactions is applied to the distinguished copy. The idea is illustrated in Fig. 8.26. Here, the same table–A–occurs at sites 1, 2 and 3. The table at site 2 is the distinguished copy. The other two tables (at sites 1 and 3) depend on the table at site 2.

Fig. 8.26 Distinguished copy concept

The concept of distinguished copy can be used to deal with concurrency and recovery issues to a great extent. Based on how we choose to model it, there are three possible techniques, as shown in Fig. 8.27.

Fig. 8.27 Distinguished copy techniques

The term coordinator site needs to be discussed before we study these three techniques. A coordinator site is the one that contains the distinguished copy of the data item under consideration. In other words, the coordinator site is the main site for that data item.
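As a rough illustration of this idea, the sketch below assumes a catalogue that records which site holds the distinguished copy of each table; every lock request for a table is routed to that coordinator site. The tables, sites and lock bookkeeping are invented for the example.

# Routing lock requests to the site holding the distinguished copy.
distinguished_copy = {"A": "site2", "B": "site1"}   # table -> coordinator site
locks_held = {}                                     # table -> holding transaction

def request_lock(table, transaction):
    coordinator = distinguished_copy[table]         # locking happens only here
    if locks_held.get(table) in (None, transaction):
        locks_held[table] = transaction
        return f"{coordinator}: lock on table {table} granted to {transaction}"
    return f"{coordinator}: lock on table {table} denied to {transaction}"

def release_lock(table, transaction):
    if locks_held.get(table) == transaction:
        del locks_held[table]

print(request_lock("A", "T1"))   # granted by site2, the distinguished copy's site
print(request_lock("A", "T2"))   # denied: T1 still holds the lock at site2
release_lock("A", "T1")
print(request_lock("A", "T2"))   # now granted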

8.7.2.1   Primary site technique   In this technique, one primary site becomes the coordinator site for all data items, as shown in Fig. 8.28. Thus, all locking and unlocking happens at the primary site.

Fig. 8.28 Primary site technique

Simply put, this method is quite similar to a centralised DBMS, where all locks are at the same site. This technique is quite straightforward and easy to understand and implement. Of course, there are drawbacks of this approach too. The primary site can easily get overloaded with so many locking and unlocking requests coming in. Moreover, if the primary site fails, the whole system comes to its knees quite quickly. This is because the primary site is the only one that controls the transaction management aspects of the DDBMS. In other words, system availability and reliability can be at stake.

8.7.2.2   Primary site with backup site technique   This approach deals with the problem of primary site failure. A site designated as the backup site takes over from the primary site if and when the latter fails. The idea is illustrated in Fig. 8.29.

Fig. 8.29 Primary site with backup site technique

Locking and unlocking happens at both the sites, that is, the primary site and the backup site. If the primary site fails, the backup site becomes the primary site and a new site is chosen as the new backup site, as shown in Fig. 8.30. The only problem that occurs in case of primary site failure is a small delay for these changes to take place, after which the DDBMS can resume its operation.

Fig. 8.30 Primary site failure in Primary site with backup site technique

The main disadvantage of this scheme is that all locking and unlocking operations must happen at two sites–the primary site and the backup site. Consequently, the overall system performance is not very high. Moreover, the primary site and the backup site can be overloaded with too many locking/unlocking requests.

8.7.2.3   Primary copy technique   We have noticed that the two techniques described earlier are mere extensions of the centralised DBMS approach. However, these techniques overload the primary and backup sites. In order to distribute the load of operations more evenly, the primary copy technique is used. In this technique, the distinguished copies of the various data items are stored at different sites and are called primary copies. Thus, if a site fails, only the primary copies at that site are affected. The idea is depicted in Fig. 8.31.

Fig. 8.31 Primary copy technique

8.7.3 Dealing with Coordinator Failures

In the first two techniques discussed earlier, whenever the coordinator site fails, a new one must be chosen. The process depends on the technique, as discussed below.

  • Primary site technique: In this technique, when the primary site fails, all the transactions that were executing at the time of the failure must be aborted. After a new site is designated as the primary site, the aborted transactions must be restarted. This process is quite tedious.
  • Primary site with backup site technique: In this technique, we need not abort all the running transactions – instead, we can suspend them. Thus, we put all the executing transactions on hold. These transactions are resumed once the new primary site is chosen.

    Another issue is how to choose a new backup site in the case of failures in the primary site with backup site technique. The site that was originally the backup site, and which has now become the primary site, can make this selection. There are situations when no backup site exists, or even if it exists, it is down. In such situations, a mechanism called election is used. In this process, a site that needs to communicate with the coordinator site realises that the coordinator is down, based on its repeated unsuccessful attempts to get any response from it. This site then sends a message to all the other sites in the DDBMS, proposing to take over the role of the coordinator. The other sites either choose to select it, or disallow it. In the process, a few more sites may also declare their candidature. The site that gets the highest number of votes then becomes the new coordinator site. (A small sketch after Fig. 8.32 illustrates the vote count.)

Fig. 8.32 shows an example of election.

Fig. 8.32 Example of election
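The toy sketch below shows one simple way the final vote count could work: each surviving site casts a ballot for one of the candidate sites, and the candidate with the most votes becomes the new coordinator. The ballots and site names are made up; the text does not prescribe a particular voting rule.

# A toy election after the coordinator site fails. Sites 2 and 5 have both
# proposed themselves; the candidate with the most votes takes over.
from collections import Counter

def elect_new_coordinator(ballots):
    # ballots maps each voting site to the candidate site it supports.
    tally = Counter(ballots.values())
    winner, votes = tally.most_common(1)[0]
    return winner, votes

ballots = {              # hypothetical votes from the sites that are still up
    "site1": "site5",
    "site2": "site2",    # a candidate may vote for itself
    "site4": "site5",
    "site5": "site5",
}
winner, votes = elect_new_coordinator(ballots)
print(f"{winner} becomes the new coordinator with {votes} of {len(ballots)} votes")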

8.7.4 Voting Method

Another approach to achieving concurrency control in a DDBMS environment is to use the voting method. So far, we have discussed methods of achieving concurrency control based on the concept of a distinguished copy. That is, we designate one of the sites as the holder of the distinguished copy of the replicated data item. There is no such concept of a distinguished copy in the voting method.

Whenever a data item is to be updated, a locking request is sent to all the sites that hold a copy of that data item. A lock is associated with every copy of the data item, and the site holding a copy can decide whether to grant or deny the lock.

Whenever a transaction needs to update a data item, it sends a locking request to all the copies (i.e. sites) of that data item. Every site responds with either a Yes or a No. This response is called a vote. Associated with every data item is a minimum threshold of Yes votes that must be met for the transaction to proceed with the update. There are two possibilities now.

  1. If the sum of all the Yes votes is equal to or greater than the threshold, the transaction acquires a lock on the data item and informs all the sites about it. It goes ahead with the update.
  2. If the sum of all the Yes votes is less than the threshold, or if there is no response in a specified time period, the transaction cancels its locking request and informs all the sites about it.

An example of the voting method is shown in Fig. 8.33.

Fig. 8.33 Example of voting

Voting is more distributed in nature, as there is no dependence on a single distinguished copy. However, because of the number of locking requests/responses and voting messages that need to travel between the sites, it is slower. The sketch below illustrates the vote-counting step.
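Here is a minimal sketch of that vote count. The individual site responses and the threshold are invented values; a majority of the copies is a typical choice of threshold.

# Vote counting for a lock request on a replicated data item.
def request_distributed_lock(responses, threshold):
    # responses maps each site holding a copy to its vote (True means Yes).
    yes_votes = sum(1 for vote in responses.values() if vote)
    if yes_votes >= threshold:
        return f"lock granted ({yes_votes} Yes votes, threshold {threshold})"
    return f"lock request cancelled ({yes_votes} Yes votes, threshold {threshold})"

# Data item X is replicated at four sites; at least 3 must agree.
print(request_distributed_lock(
    {"site1": True, "site2": True, "site3": False, "site4": True}, threshold=3))
print(request_distributed_lock(
    {"site1": True, "site2": False, "site3": False, "site4": True}, threshold=3))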

8.7.5 Distributed Recovery

Logging in centralised DBMSs ensures a swift recovery from failures. In distributed systems, however, things are not so simple. Imagine that site A has sent a message to site B. If B does not respond, what should A do? There are three possible reasons for B not responding, as follows:

  1. The message from A to B did not reach B at all, due to communication problems.
  2. B was down when A sent the message. So although the message was delivered to B, B could not respond.
  3. B did receive the message from A and sent a response. This response was lost due to communication problems.

These possibilities are shown in Fig. 8.34.

Fig. 8.34 Possible problems in communication between two sites

Clearly, if such problems occur during communication between sites in a DDBMS environment, carrying out transactions or performing recovery operations would be extremely tricky. When a distributed transaction needs to take place, the COMMIT operation cannot happen unless all the participating sites are ready to commit their changes. The two-phase commit protocol is used to ensure this.
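As a reminder of how the protocol achieves this, here is a deliberately simplified coordinator sketch: in phase 1 every participating site is asked to prepare and votes Yes or No, and only if all vote Yes does phase 2 broadcast a commit; otherwise everything is rolled back. Site behaviour is simulated by plain functions, and real implementations also log every step so that in-doubt transactions can be resolved after a failure.

# A highly simplified two-phase commit coordinator (no logging, no timeouts).
def two_phase_commit(participants):
    # Phase 1 (voting): ask every participating site to prepare.
    votes = {site: prepare() for site, prepare in participants.items()}
    decision = "COMMIT" if all(votes.values()) else "ABORT"
    # Phase 2 (decision): send the same global decision to every site.
    for site in participants:
        print(f"{site}: {decision}")
    return decision

# Hypothetical participants in a distributed Funds Transfer transaction.
print(two_phase_commit({"site_A_debit": lambda: True,
                        "site_B_credit": lambda: True}))    # all Yes -> COMMIT
print(two_phase_commit({"site_A_debit": lambda: True,
                        "site_B_credit": lambda: False}))   # one No  -> ABORT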

8.8 DISTRIBUTED DEADLOCKS

Like almost everything else in distributed systems, deadlocks offer a great design challenge. Dealing with deadlocks even in a non-distributed environment is not easy; coupled with the challenge of multiple sites, the task becomes even tougher.

We shall discuss three strategies related to deadlocks in a distributed environment: Prevent a deadlock, Avoid a deadlock, Detect a deadlock.

8.8.1 Prevent a Deadlock

In this approach, the technique of timestamps is used. Each transaction carries its own timestamp. Consider two transactions, T1 and T2, and one resource, R, in which both T1 and T2 are interested. Let us imagine that at a given point of time, T1 owns R, and T2 wants to own R soon thereafter. Quite clearly, we must quickly decide how to deal with this situation; otherwise, it can ultimately lead to a deadlock. To solve such problems, two possible methods exist: the wait-die method and the wound-wait method.

Let us discuss these two methods.

  1. Wait-die method

    As previously mentioned, the situation is that T1 owns R, and T2 is making an attempt to own R. Let the timestamp of T1 be TS1 and that of T2 be TS2. In this approach, the algorithm for preventing a likely deadlock is as shown in Fig. 8.35.

    Fig. 8.35 Wait-die method

    As we can see, if transaction T2 has started prior to transaction T1 (in terms of their timestamps), then T2 waits for T1 to finish. Otherwise, the DDBMS simply kills T2 and restarts it after a while (with the same timestamp, TS2), hoping that T1 would have released the resource R by then.

    To summarise, an older transaction gets a higher priority. Also, we can see that a killed transaction is restarted with its original timestamp value. This allows it to retain its old priority (which would be higher than that of most other transactions when it restarts).

  2. Wound-wait method

    This technique takes a different approach as compared to the earlier method. Here, we start off with a similar comparison. We compare the timestamps of T1 and T2 (i.e. TS1 and TS2, respectively). If TS2 is less than TS1 (which means that T2 had started prior to T1), we kill T1. (Note that in the earlier case, we had made T2 wait.) If, however, T1 has started earlier, we make T2 wait. This process is shown in Fig. 8.36.

    Fig. 8.36 Wound-wait method

    To summarise, in this approach, we grant immediate access to the request of an older transaction by killing a newer transaction. This means that an older transaction never has to wait for a younger transaction (unlike in the earlier approach). The sketch below contrasts the two rules.
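A small sketch of the two rules for the situation described above follows (T1 holds R, T2 requests R); a smaller timestamp means an older transaction. The timestamp values are arbitrary.

# Wait-die versus wound-wait for the scenario in the text:
# T1 currently holds resource R, T2 requests it; smaller timestamp = older.
def wait_die(ts_holder, ts_requester):
    if ts_requester < ts_holder:   # requester is older: it is allowed to wait
        return "requester waits"
    return "requester dies (restarted later with its original timestamp)"

def wound_wait(ts_holder, ts_requester):
    if ts_requester < ts_holder:   # requester is older: the holder is wounded
        return "holder is killed; requester gets the resource"
    return "requester waits"

TS1, TS2 = 100, 90                 # T2 started before T1 (T2 is older)
print("wait-die:  ", wait_die(TS1, TS2))     # T2 waits for T1
print("wound-wait:", wound_wait(TS1, TS2))   # T1 is killed, T2 proceeds
TS1, TS2 = 100, 110                # T2 is younger than T1
print("wait-die:  ", wait_die(TS1, TS2))     # T2 dies and is restarted later
print("wound-wait:", wound_wait(TS1, TS2))   # T2 waits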

8.8.2 Avoid a Deadlock

After a lot of research, DDBMS designers have concluded that deadlocks cannot, in practice, be avoided in a distributed environment. There are two main reasons for this:

  1. In order to avoid deadlocks completely, every site in the distributed system needs to be aware of the global state (i.e. information about all the other sites). This is clearly impossible, because the states of the sites would keep on changing constantly, and even if they communicate these changes to each other, there would be inevitable delays, making the information obsolete. As such, this requirement of global state will never be met.
  2. Maintaining the global state, if at all possible, would entail a tremendous amount of overhead in terms of network transmission/communication, processing and storage.

8.8.3 Detect a Deadlock

In the process of detecting deadlocks, transactions are allowed to obtain access to shared resources. If a deadlock occurs because of this, it is detected. In such a situation, one of the transactions must release its share of resources.

The trouble in implementing this scheme in the case of a DDBMS is that every site has knowledge of its local transactions only. It cannot know about other sites and their transactions (i.e. the global state). Therefore, it cannot detect a distributed deadlock on its own. Consequently, some sort of arbitration is needed to detect deadlocks in a DDBMS. Three solutions are proposed to handle this situation, as follows:

  • Centralised control: In this approach, one site is designated as the control site, and it decides how to detect and come out of deadlocks. All other sites provide information to this site, and abide by its rules. This approach has the advantage of simplicity, but also has drawbacks: large overheads in terms of communication and storage, and the danger of control site failure. The last point is the most serious, since in such a case the entire system could come to a halt.
  • Hierarchical control: In this approach, the sites are organised in a tree-like structure. A site designated as a parent detects deadlocks among its subordinate sites and takes an appropriate decision.
  • Distributed control: This is a democratic approach, wherein all the sites are treated as equal and need to cooperate with each other. There is a great deal of information exchange, resulting in substantial overheads.

As we can see, all the approaches have their advantages and disadvantages. Accordingly, the choice among them is left to the actual problem specification, which differs from one situation to another.

8.9 CLIENT/SERVER COMPUTING AND DDBMS

8.9.1 Client/server Computing

The term client/server architecture (also called client/server computing or client/server model) is very common in computer literature. It is also very important from the context of DDBMS.

Although it sounds complex, it is actually very simple to understand. In fact, it is used in our daily lives as well. For instance, when I use a tea vending machine to get some tea, the machine is serving me. Thus, the tea vending machine is the server, which serves tea to me, the client. When I go to a dentist for an operation on my teeth, the dentist serves me, the customer, as shown in Fig. 8.37. Therefore, the dentist is the server in this case and I am the client. Note that in both the examples, the servers (the tea vending machine and the dentist) wait for the client (me) to start the action or dialogue.

In computer terms, client and server are both computer programs. These two programs reside on different computers. The idea behind this is quite simple. If two or more programs have to interact with each other for performing a task, one must wait for requests from the other. In this context, the server program is a passive program that waits for requests from clients. Therefore, a server program endlessly waits for requests from one or more clients. A client, on the other hand, is an active program that initiates a request for the services of that server.

C.J. Date has provided rules for distributed databases. The more rules a DDBMS complies with, the better it is likely to be.

Fig. 8.37 Client and server

Note that a client and a server are both programs, although the ‘computers’ that run these programs are commonly, and somewhat erroneously, termed client and server, respectively. This terminology has become quite common these days, more so because hardware vendors add to the confusion by advertising their powerful computers as servers, whereas what they actually mean are computers that are capable of running server programs due to their high memory, disk and processor capabilities. We will, therefore, use these terms in either fashion.

Thus, when one computer requests the services of another computer, it is called client/server computing. The services could be anything. For example, a server computer could store files in which a client computer may be interested. Or, in a LAN environment with multiple clients and a server sharing an expensive printer, the server could have the printing hardware and software installed, which all client computers might use for printing. Thus, the term client/server can apply to a variety of applications. Normally, for a computer to act as a server, relatively larger disk and processing capacity are needed. Also, the server must have an operating system that supports multitasking (e.g. UNIX, Windows 2000). The reason for this is simple. There could be multiple requests from different clients to the same server at the same time. For each such request, the server operating system creates a task. The operating system queues these tasks and executes them according to its scheduling algorithm.
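To make the active/passive distinction concrete, here is a minimal sketch of a server program that passively waits for a request and a client program that actively initiates one, using Python's standard socket module (Python 3.8 or later) on an arbitrary local port chosen for the example.

# Minimal client/server sketch: the server passively waits for connections,
# the client actively initiates a request (requires Python 3.8+).
import socket
import threading

HOST, PORT = "127.0.0.1", 5050     # illustrative local address and port
ready = threading.Event()

def server():
    with socket.create_server((HOST, PORT)) as srv:   # passive: listen and wait
        ready.set()                                   # tell the client we are up
        conn, _addr = srv.accept()                    # blocks until a client connects
        with conn:
            request = conn.recv(1024).decode()
            conn.sendall(f"served: {request}".encode())

def client():
    ready.wait()                                      # active: initiate the dialogue
    with socket.create_connection((HOST, PORT)) as conn:
        conn.sendall(b"one cup of tea, please")
        print(conn.recv(1024).decode())

threading.Thread(target=server, daemon=True).start()
client()                           # prints: served: one cup of tea, please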

The common way to depict client/server computing is shown in Fig. 8.38.

8.9.2 Client/server Computing and DDBMS

In the context of DDBMS, client/server computing can be called a special case of distributed computing, which has the following properties:

(i) There may be one or more clients and servers.

(ii) Servers hold all the data.

(iii) Applications that require data execute at the clients.

(iv) There may not be complete location independence.

Fig. 8.38 Client/server concept

Effectively, we are saying that servers hold the databases and clients use them. Based on this understanding, we can think of two possible scenarios.

  1. Multiple clients access a single server, as shown in Fig. 8.39.
  2. One client accesses multiple servers, as shown in Fig. 8.40.

Fig. 8.39 Many clients, one server

Fig. 8.40 One client, many servers

The second architecture (one client, many servers) can be classified further into two categories, as follows.

(a) The client accesses multiple servers. However, at any given point in time, the client accesses only one server. That is, the client's individual database request is directed to exactly one server, not to multiple servers. Thus, a single request cannot fetch data from two or more servers.

(b) The client can access multiple servers at the same time. A single database request may fetch data from multiple servers.

We will realise that in case (a), the user must know which server holds which data elements. This information would be used to request data from a specific server. In case (b), the user need not possess this information. The user would send a request for data as if the set of servers were actually a single server.

8.10 DATE'S 12 RULES

C.J. Date has specified 12 rules for, or expectations from, a DDBMS. They are listed below.

  1. Local autonomy: Every site in a DDBMS should be autonomous. That is, it should own and manage local data and operations.
  2. No dependence on a single site: The DDBMS should not depend on one site without which it cannot function. Thus, operations such as transaction management, query optimisation and execution, deadlock handling and so on should not depend on a single server.
  3. Continuous operation: The system must not be shut down for actions such as adding/removing a site, or dynamic creation/deletion of fragments at sites.
  4. Location independence: The user need not know which data is stored at which site and should be able to access all data items as if they were locally stored on the site.
  5. Fragmentation independence: Regardless of how the data is fragmented, the user should be able to access it.
  6. Replication independence: The user should not be aware of data replication. Thus, the user must not be able to request for a specific copy of data. Also, the user must not be required to ensure consistency among all the copies of data during an update operation.
  7. Distributed query processing: The DDBMS should be able to process queries that refer to data from more than one site.
  8. Distributed transaction processing: The DDBMS should ensure that both local and global transactions follow the ACID principles.
  9. Hardware independence: The DDBMS should run on different hardware platforms.
  10. Operating system independence: The DDBMS should run on a number of operating systems.
  11. Network independence: The DDBMS should run by using a variety of network types and networking protocols.
  12. Database independence: It should be possible to construct a DDBMS out of many DBMSs that differ in nature, architecture and data model.

Key Terms

  • Active program
  • Availability
  • Avoid a deadlock
  • Backup site
  • Centralised applications
  • Centralised computing
  • Centralised control
  • Client
  • Client/server computing
  • Coordinator site
  • Data replication
  • Derived horizontal fragmentation
  • Detect a deadlock
  • Distinguished copy
  • Distributed applications
  • Distributed computing
  • Distributed control
  • Distributed Database Management System (DDBMS)
  • Distributed databases
  • Distributed deadlock
  • Distributed processing
  • Distributed recovery
  • Distributed transaction
  • Distribution transparency
  • Election
  • Fragmentation
  • Fragmentation transparency
  • Full replication
  • Fully replicated database
  • Global state
  • Hierarchical control
  • Horizontal fragmentation
  • Hybrid fragmentation
  • Localisation
  • Location transparency
  • Naming transparency
  • Network partitioning
  • Network transparency
  • Networked architecture with one centralised database
  • No replication
  • Partially replicated database
  • Passive program
  • Prevent a deadlock
  • Primary copy
  • Primary site
  • Query decomposition
  • Reliability
  • Replication transparency
  • Semi-join
  • Server
  • Shared nothing
  • Sub-query
  • Truly distributed database
  • Two-phase commit
  • Vertical fragmentation
  • Voting method
  • Wait-die method
  • Wound-wait method

Summary

  • Databases can be centralised or distributed.
  • Distributed DBMS (DDBMS) manages distributed databases.
  • Three possible DDBMS architectures exist: Shared nothing, Networked architecture with one centralised database, and Truly distributed database.
  • In the shared nothing architecture, each computer has its own database.
  • In the networked architecture with one centralised database, there is one common database that is shared by all the computers.
  • In a truly distributed database, every computer in the distributed environment has its own database. However, all these databases are shared.
  • Location transparency is a concept in which a user need not be aware of the physical location of a database in a distributed environment.
  • In naming transparency, the user need not provide any additional information regarding the name of a database.
  • A database can be divided horizontally or vertically. This is called fragmentation.
  • In horizontal fragmentation, a database is divided on the basis of rows.
  • In vertical fragmentation, a database is divided on the basis of columns.
  • Distributed databases offer better reliability and availability as compared to centralised databases. They offer higher performance, and are easy to expand/change.
  • Many issues exist in managing distributed databases, such as tracking data, distributed query processing, distributed transaction management, managing replicated data, distributed data recovery, and security.
  • Data distribution can happen in the form of replication or fragmentation.
  • Data replication can be in the form of Full replication, Partial replication, or No replication.
  • Distributed query processing involves determining the cost of the query. The cost depends on the way it is executed internally.
  • Because distributed query processing can be quite expensive, semi-join is used to minimise the data traffic. Here, an attempt is made to pass the minimum amount of data between sites during distributed query execution.
  • A query in the distributed environment can be distributed completely or not at all.
  • A number of problems can occur in the area of distributed concurrency control and recovery. Problems such as site failure, network failure, data duplication, distributed transactions and distributed deadlocks can make this quite a challenge to deal with.
  • One of the solutions to deal with replicated distributed database problems is to identify a distinguished copy. This is treated as the main copy among the replicated ones.
  • Three techniques are used in conjunction with distinguished copy concept: Primary site technique, Primary site with backup site technique, and Primary copy technique.
  • The concepts of election and voting method are used to deal with the failure of the coordinator site.
  • Two methods are available to deal with distributed deadlocks: Wait-die method and Wound-wait method.
  • Client/server computing is closely related to DDBMS.
  • Two client/server architectures are mainly possible: one server with multiple clients, and one client with multiple servers.
  • C.J. Date has specified 12 rules for a DDBMS to qualify as really trustworthy.

  Mark as true or false

  1. A centralised database is the same as a distributed database.
  2. The task of a DDBMS is quite complex.
  3. In the shared nothing architecture, each computer has its own database.
  4. Naming transparency is related only to the location of a database.
  5. If a database is divided on the basis of rows, it is called as vertical fragmentation.
  6. Data replication can never happen.
  7. Semi-join is the same as outer join.
  8. A query can be distributed across multiple sites.
  9. We can attempt to avoid a distributed deadlock.
  10. Election method is used if a participant site fails.

1.  In the networked architecture with one centralised database, there is/are ________ database(s) overall.

(a) 0

(b) 1

(c) 2

(d) more than 2

2.  In ___________, the user need not be aware of the physical place of the database.

(a) location transparency

(b) name transparency

(c) site transparency

(d) area transparency

3.  In _______, all the sites contain all the data elements.

(a) full replication

(b) partial replication

(c) no replication

(d) All of the above

4.  The ________ protocol is used to govern distributed transactions.

(a) granular locking

(b) normalisation

(c) concurrency

(d) two phase commit

5.  We need to determine the _______ of a distributed query in order to judge its performance.

(a) value

(b) amount

(c) time

(d) cost

6.  One approach to deal with distributed concurrency is the use of __________.

(a) unique records

(b) locking

(c) distinguished copy

(d) unique copy

7.  The idea behind semi-join is to ___________.

(a) maximise traffic between sites

(b) minimise traffic between sites

(c) put more load on the coordinator site

(d) put more load on the participating sites

8.  The voting method is used in case the ________ fails.

(a) main site

(b) coordinator site

(c) participant site

(d) follower site

9.  A client is _________.

(a) active program

(b) passive program

(c) active and passive program

(d) None of the above

10.  A server is _________.

  (a) active program

  (b) passive program

  (c) active and passive program

  (d) None of the above

  1. Distinguish between centralised and distributed systems.
  2. Discuss the approaches to distributing data.
  3. What are the advantages of distributed databases?
  4. List the challenges faced by distributed databases.
  5. What are fragmentation and replication?
  6. What are the issues in distributed transaction management?
  7. What is semi-join?
  8. Discuss the concepts of election and voting.
  9. What is client/server computing?
  10. List at least five of the twelve rules specified by C.J. Date regarding DDBMS.

  1. What would one look for while designing the architecture of a DDBMS?
  2. Why do you think the two-phase commit protocol is required?
  3. Are DDBMS closely related to distributed operating systems? If yes, in what sense?
  4. Read about the following terms: Distributed query, Distributed transaction, Distributed request.
  5. Many Websites use the mirroring technique. Is it related to DDBMS?
  6. Is parallel computing related to DDBMS in any way?
  7. Why is there a need for TP monitors such as EJB and MTS when DBMS products themselves support transaction management features?
  8. Investigate the problems in distributed file systems.
  9. Study more about remote computing protocols such as Remote Procedure Call (RPC), COM/CORBA, and IIOP. How are they related to DDBMS?
  10. Why do you think there is a need for location transparency and naming transparency?