6. Validating the Measurement System – Six Sigma for Business Excellence: Approach, Tools and Applications


Validating the Measurement System

Importance of Measurement

Measurement ‘converts’ existing information of unknown magnitude into a usable value. It helps us make decisions. For example, when doctors measure the blood pressure of a patient, they often want to decide on the medication or the line of treatment based on that. During a cricket match, when the umpire decides whether the batsman is out or not out, his decision affects the outcome of the match. In a manufacturing unit, when a manufactured machine part is inspected, a decision is taken about whether to accept or reject that machine part. Such decisions involve risks of error in measurement. When a good part is rejected, it is called the producer's risk or α-risk. On the other hand, when a bad part is accepted, it is called the consumer's risk or β-risk.

In all the above cases, the impact of wrong measurement on the decision is quite evident. We often do not realize how our decisions depend on the results of a measurement system. We also do not realize the consequences of incorrect results of a measurement system. Many of us may have more than one clock in our houses. For the domestic application of time measurement, the criticality is quite low. But imagine its criticality in a project like satellite launching at the Indian Space Research Organization (ISRO).

The lowest unit that our wrist watches can record is normally one second. This is adequate for day-to-day activity. The clocks used in the Olympics require much higher accuracy and precision, and a wrist watch is not an acceptable measurement system in that case. For example, the difference between the fastest and the slowest runners in the Olympic 100 meters final would be of the order of one-tenth of a second. This difference in a 100 meters race for a corporate company or a small locality would be around two to three seconds. Thus, a stop-watch may be adequate to decide the winner of a colony race, but in the Olympics it may declare all the participants as winners. Today, digital frame capture technology has taken over from conventional time measurement instruments. It is, therefore, important that we use a measurement system that is appropriate for the objective and is able to discern the ‘process variation’ so that correct decisions can be taken.

Where Do We Use Measurement Systems?

Measurement systems are used in numerous areas. Some of these are listed here.

  • Product classification and/or acceptance
  • Process adjustment
  • Process control
  • Analysis of process variation and assessment of process capability
  • Decision making

Measurement systems are not limited to manufacturing and shop-floor environment. Table 6.1 should be useful in understanding the wide variety of measurement systems:


Table 6.1 Examples of measurement systems

Type of measuring system | Type of data | Consequences of measurement variation
Watches to keep time | Variable | May miss appointments, train, bus, etc.
Stop watches of various types to decide the winner of a race, or for process evaluation | Variable | A gold medal may be given to a non-deserving participant.
Measuring tape of a tailor | Variable | Clothes will not fit well.
Engineering graduation examination, MBA entrance examination | Variable | Students’ ability will not be assessed properly. Deserving participants may not get selected for the MBA course.
Interview | Discrete binary (selected or not selected) | A suitable candidate may get rejected and a non-deserving candidate may get selected.
Cricket umpire | Discrete binary (out or not out) | A batsman may be declared out when he is actually not, or vice versa.
Blood pressure (BP) measurement | Variable | Medicines can be wrongly prescribed.
Stress test for heart fitness | Discrete binary (test positive or negative) | A person having a heart problem may be declared fit, and vice versa.
Counting parts for inventory | Discrete | Incorrect purchase order, production disruption, incorrect profit (or loss), etc.
Fuel gauge in a car | Variable | The car may run out of fuel without warning and be held up on the road.

In Six Sigma projects, it is essential that the belts assess the measurement system before assessing the process capability and sigma level. The procedures used for variable and attribute data are different. However, measurement system analysis (MSA) should be performed for both the types. Before we study these procedures, let us look at some basic terms used in MSA.

Measurement Terminology

The following definitions of the terms used in MSA are as provided in the Measurement System Analysis Manual published by AIAG (AIAG 2002). Accuracy is the degree of conformity of a measured quantity to its actual or true value. For example, a watch showing time equal to the standard time can be considered accurate.

Precision is the degree to which further measurements show the same or similar results. Figure 6.1 may be useful in clarifying the difference between accuracy and precision.

Bias is the difference between measurement and the master value. To estimate bias, we need to measure the same part a number of times. The difference between the average of these measurements and the true value is called bias. Calibration helps in minimizing bias during usage. For example, a person who wants to check his own weight takes a few readings. He finds that the average of five such readings is 55 kg. He then goes to a lab where the weighing scales are regularly calibrated. He finds his weight to be 56.5 kg in the lab. The bias of the weighing scale is 55–56.5 = –1.5 kg.



Figure 6.1 Accuracy and precision


Repeatability is the inherent variation due to the measuring equipment. If the same appraiser measures the same part a number of times, the closeness of the readings is a measure of repeatability. Traditionally, this is referred to as ‘within appraiser variation’. It is also known as equipment variation (EV).

Linearity is the change in bias over an operating range. It is a systematic error component of the measurement system. In many measuring systems, error tends to increase with larger measurements. Examples are pressure gauge, dial gauge, weighing scales, etc.

Stability is the measurement variation over time. Measure of stability could be considered as drift. Periodic calibration of measuring equipment is performed to assess and assure stability.

Reproducibility is variation in the average of measurements carried out by different appraisers using the same measuring instrument while measuring identical characteristics on the same part. Reproducibility has been traditionally referred to as ‘between appraiser’ or appraiser variation (AV). Training and operating instructions help reduce reproducibility variation. Sometimes reproducibility also depends on the part measured. In such cases, we say that there is an interaction between operators and the parts.

Process and Measurement Variation

Our objective in many situations is to measure process variation (PV). But what we measure in reality is a total of process as well as measurement variation. It is desirable that measurement variation remains a very small portion of the observed variation, of which it is a component. This is shown in Figure 6.2.


Figure 6.2 Partitioning of variation



(Courtesy Institute of Quality and Reliability)


In a typical measurement system analysis (MSA), we use statistical methods to estimate how much of the total observed variation (TV) is due to the measurement system. An ideal measurement system should not have any variation. However, this is impossible and we have to be satisfied with a measurement system that has variation of less than 10 percent of the total variation. As the portion of variation due to the measurement system increases, the value or utility of the system goes on reducing. If this proportion is more than 30 percent, the measurement system is unacceptable.

Repeatability and Reproducibility (R&R) Study

Let σm be the standard deviation of measurement variation and σObserved the standard deviation of total observed variation. The ratio of measurement variation to total observed variation, i.e., σm/σObserved, is called gauge repeatability and reproducibility (GRR). It is customary to quantify the R&R value as this ratio of the measurement standard deviation to the total standard deviation, expressed as a percentage. This is known as the ‘percentage study variation method’. The MSA Manual published by the Automotive Industry Action Group (AIAG) specifies the following norms (AIAG 2002):

  • If GRR is less than 10 percent, the measurement system is acceptable.
  • If GRR is between 10 and 30 percent, equipment may be accepted based upon the importance of application, cost of measurement device, cost of repair, etc.
  • Measurement system is not acceptable for GRR beyond 30 percent.

In a typical measurement system study, GRR is estimated. The stability, bias, and linearity errors are to be addressed during the selection and calibration of the instrument.

Please note that in the percentage study variation method, we are using ratios of standard deviations. The alternative is the ‘percentage contribution method’, in which we compare ratios of variances. It is easy to convert the norms of the study variation method. For example, when GRR is 10 percent, the ratio σm/σObserved is 0.1. Thus, the ratio of variances is σm²/σObserved² = (0.1)² = 0.01, or 1 percent. Norms for both the methods are summarized in Table 6.2.


Table 6.2 Acceptance criteria for measurement systems

Acceptance norms for R&R with % study variation method | Acceptance norms for R&R with % variance contribution method | Decision guidelines
<10% | <1% | Acceptable measurement system.
Between 10 and 30% | Between 1 and 9% | May be acceptable based upon the importance of application, cost of measurement device, cost of repair, etc.
>30% | >9% | Unacceptable measurement system. Every effort should be made to improve the system.

Source: AIAG 2002.
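The conversion between the two sets of norms can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function names are hypothetical and are not part of any MSA software.

```python
def study_var_to_contribution(pct_study_var: float) -> float:
    """Convert % study variation (ratio of standard deviations)
    into % contribution (ratio of variances)."""
    ratio = pct_study_var / 100.0
    return ratio ** 2 * 100.0


def classify_grr(pct_study_var: float) -> str:
    """Apply the 10 percent and 30 percent norms quoted in the text."""
    if pct_study_var < 10.0:
        return "acceptable"
    if pct_study_var <= 30.0:
        return "conditionally acceptable"
    return "unacceptable"


for grr in (5.0, 10.0, 30.0, 40.0):
    contribution = study_var_to_contribution(grr)
    print(f"GRR {grr:.1f}% -> contribution {contribution:.2f}% -> {classify_grr(grr)}")
```

Because the contribution is the square of the study variation ratio, a measurement system at the 30 percent boundary accounts for 9 percent of the observed variance.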


Often, we may like to compare measurement variation with the tolerance of the component being measured. This comparison can be expressed as the precision to tolerance ratio or P/T ratio. Precision is the 6σm band that includes 99.73 percent of measurements. This is compared with the tolerance. Thus, P/T ratio = 6σm/Tolerance, where the tolerance is the difference between the upper and lower specification limits. A multiplier of 5.15 is sometimes used instead of 6, to cover 99 percent of measurements instead of 99.73 percent. The P/T ratio may be an alternative when the process variation is very small. The AIAG acceptance norms for the P/T ratio are the same as those for the % study variation method shown in Table 6.2.
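As a rough illustration, the P/T ratio is the width of the measurement spread (6σm, or 5.15σm) divided by the tolerance. The sketch below uses invented specification limits and an invented σm; the function name is hypothetical.

```python
def pt_ratio(sigma_m: float, usl: float, lsl: float, multiplier: float = 6.0) -> float:
    """Precision to tolerance ratio, in percent.

    sigma_m    : standard deviation of the measurement system
    usl, lsl   : upper and lower specification limits
    multiplier : 6.0 for a 99.73 percent band, 5.15 for a 99 percent band
    """
    tolerance = usl - lsl
    if tolerance <= 0:
        raise ValueError("USL must be greater than LSL")
    return multiplier * sigma_m / tolerance * 100.0


# Invented values for illustration only.
print(pt_ratio(sigma_m=0.03, usl=10.5, lsl=9.5))                   # 6-sigma band
print(pt_ratio(sigma_m=0.03, usl=10.5, lsl=9.5, multiplier=5.15))  # 5.15-sigma band
```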

The Number of Distinct Categories (NDC)

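Another norm used in variable measurement systems is the number of distinct categories (NDC). This number represents the number of groups within our process data that the measurement system can discern or discriminate. NDC = 1.41 × (σp/σm), where σp is the standard deviation of the part-to-part (process) variation and σm is the standard deviation of the measurement variation.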


Table 6.3 R&R study illustration



(Courtesy Institute of Quality and Reliability)


Imagine that you measured 10 different parts and the NDC = 4. This means that some of those 10 parts are not different enough to be considered as being different by the measurement system. If we want to distinguish a higher number of distinct categories, we would need a more precise gauge. The AIAG MSA Manual suggests that when the number of categories is less than two, the measurement system is of no value for controlling the process, since one part cannot be distinguished from another. When the number of categories is two, the data can be divided into two groups, say high and low. When the number of categories is three, the data can be divided into three groups, say low, middle and high. As per AIAG recommendations, NDC must be five or more for an acceptable measurement system.
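The NDC arithmetic can be sketched as follows, using the commonly quoted AIAG form NDC = 1.41 × (part variation / measurement variation). Truncating the result to a whole number is treated here as an assumption about the convention, and the function name and input values are purely illustrative.

```python
import math


def number_of_distinct_categories(sigma_p: float, sigma_m: float) -> int:
    """NDC = 1.41 * (sigma_p / sigma_m), truncated to a whole number
    (the truncation rule is an assumption in this sketch)."""
    if sigma_m <= 0:
        raise ValueError("sigma_m must be positive")
    return math.floor(1.41 * sigma_p / sigma_m)


# Invented standard deviations, expressed as fractions of total variation.
ndc = number_of_distinct_categories(sigma_p=0.9615, sigma_m=0.2748)
print(ndc, "acceptable" if ndc >= 5 else "fewer than five categories")
```

In this invented case the measurement system can separate the parts into only four groups, which falls short of the recommended minimum of five.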


Procedure for Gauge R&R Study for Measurement Systems:

  1. Decide part and measuring equipment

  2. Establish and document the method of measurement

  3. Train appraisers (operators) for measurement

  4. Select two or three ‘Appraisers’ (preferably three)

  5. Select 5 to 10 parts (preferably 10). Parts selected should represent process variation

  6. Each part must be measured by each appraiser at least twice, preferably thrice

  7. Measurements must be taken in random order and without seeing each other's readings so that each measurement can be considered as ‘independent’

  8. Record, analyze and interpret the results.
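Step 7 of the procedure above asks for measurements in random order. One possible way to prepare a run sheet is sketched below; the function name, the part and appraiser labels, and the choice to reshuffle the parts for every appraiser and trial are all assumptions made for illustration.

```python
import random


def randomized_run_sheet(parts, appraisers, trials, seed=None):
    """Return a list of (trial, appraiser, part) tuples in which the order
    of the parts is reshuffled for every appraiser and every trial."""
    rng = random.Random(seed)
    sheet = []
    for trial in range(1, trials + 1):
        for appraiser in appraisers:
            order = list(parts)
            rng.shuffle(order)  # a fresh random part order for this pass
            for part in order:
                sheet.append((trial, appraiser, part))
    return sheet


parts = [f"P{i}" for i in range(1, 11)]   # P1 ... P10
appraisers = ["A", "B", "C"]
sheet = randomized_run_sheet(parts, appraisers, trials=3, seed=42)
print(len(sheet))   # 10 parts x 3 appraisers x 3 trials
print(sheet[:5])
```

Hiding the part labels from the appraisers and recording the readings against the run sheet afterwards helps keep each measurement independent.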

Collecting Data

Typically, in a gauge R&R study, 10 parts are measured by 3 appraisers. Each appraiser measures each part thrice. Thus, we have 9 measurements of each part with a total of 90 measurements. The data structure will be similar to that shown in Figure 6.3. A, B and C are appraisers and P1, P2,…P10 are the parts. Aij represents measurement of ith part by appraiser A for the jth repetition. When we compare the measurement of the same part by the same appraiser, we get an estimate of repeatability. We can call this as measurement variation within appraiser. If we compare the average measurements by each appraiser of the same part, we can estimate the variation between appraisers, i.e., reproducibility.

Analysis of Data

After collecting the data, we can analyze it using one of the two methods:

  1. Control Chart Method
  2. ANOVA Method

We will briefly discuss the control chart method. Calculations for the ANOVA method are more complicated and beyond the scope of this book. However, SigmaXL and Minitab software programs can perform the analysis using both the methods. The procedure is illustrated using an example from the AIAG MSA Manual (AIAG 2002). In this procedure, the averages and ranges are calculated for each appraiser for each part. The magnitude of the range values relates to repeatability or equipment variation (EV). If we calculate the average of all these range values, we get ‘R-double-bar’. We can convert this into the standard deviation σrepeatability using the statistical constant K1. The value of K1 depends upon the number of trials by each appraiser for each part. For three trials, K1 = 0.5908.


Figure 6.3 Data collection for R&R study


(Courtesy Institute of Quality and Reliability)


Similarly, we can calculate the range of averages of 9 measurements for each of the 10 parts. This is the range of parts variation Rpart. We can use constant K3 to convert this range value to standard deviation σp. For 10 parts, the value of K3 is 0.3146. Based on these two values, we can also calculate the standard deviation due to reproducibility, i.e., variation between appraisers. An illustration of this procedure is shown in Table 6.3. For complete details of the procedure, refer to the AIAG MSA Manual. Percentage GRR in this example is 27.48. This is between 10 and 30 percent and, therefore, is conditionally acceptable. The P/T ratio is 22.4 percent, which is also conditionally acceptable. We can accept the equipment with improvement actions planned and documented. The number of distinct categories can also be calculated. A soft copy of the template is available in the CD provided with the book.
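The average and range calculation can be sketched in Python as below. This is a minimal illustration, not the AIAG template: K1 for three trials and K3 for ten parts are the values quoted in the text, while the other constants and the reproducibility formula AV = √((X̄diff × K2)² − EV²/(n × r)) are recalled from the AIAG tables and should be checked against the manual. The readings are invented.

```python
import math

# K1 (3 trials) and K3 (10 parts) are quoted in the text; the remaining
# constants are assumptions to be verified against the AIAG manual.
K1 = {2: 0.8862, 3: 0.5908}   # keyed by number of trials
K2 = {2: 0.7071, 3: 0.5231}   # keyed by number of appraisers
K3 = {2: 0.7071, 3: 0.5231, 4: 0.4467, 5: 0.4030, 6: 0.3742,
      7: 0.3534, 8: 0.3375, 9: 0.3249, 10: 0.3146}  # keyed by number of parts


def grr_average_range(data):
    """data[a][p] is the list of repeat readings by appraiser a on part p."""
    n_appraisers = len(data)
    n_parts = len(data[0])
    n_trials = len(data[0][0])

    # Repeatability (EV): R-double-bar times K1.
    ranges = [max(cell) - min(cell) for appraiser in data for cell in appraiser]
    r_double_bar = sum(ranges) / len(ranges)
    ev = r_double_bar * K1[n_trials]

    # Reproducibility (AV): range of appraiser averages times K2, with the
    # repeatability contained in those averages removed.
    appraiser_means = [
        sum(sum(cell) for cell in appraiser) / (n_parts * n_trials)
        for appraiser in data
    ]
    x_diff = max(appraiser_means) - min(appraiser_means)
    av_squared = (x_diff * K2[n_appraisers]) ** 2 - ev ** 2 / (n_parts * n_trials)
    av = math.sqrt(max(av_squared, 0.0))  # a negative value is treated as zero

    # Part variation (PV): range of part averages times K3.
    part_means = [
        sum(sum(appraiser[p]) for appraiser in data) / (n_appraisers * n_trials)
        for p in range(n_parts)
    ]
    pv = (max(part_means) - min(part_means)) * K3[n_parts]

    grr = math.sqrt(ev ** 2 + av ** 2)
    tv = math.sqrt(grr ** 2 + pv ** 2)
    return {"EV": ev, "AV": av, "GRR": grr, "PV": pv, "TV": tv,
            "pct_GRR": 100.0 * grr / tv}


# Invented readings: 2 appraisers x 3 parts x 2 trials (not the AIAG data).
example = [
    [[10.0, 10.2], [12.0, 12.1], [14.1, 14.0]],   # appraiser A
    [[10.3, 10.1], [12.2, 12.4], [14.2, 14.4]],   # appraiser B
]
result = grr_average_range(example)
print({name: round(value, 4) for name, value in result.items()})
```

A real study would use the 3 appraisers × 10 parts × 3 trials layout described earlier; the small data set here only keeps the arithmetic easy to follow by hand.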

Readers can try the above MSA example on software using the following guidelines. In the above example, each appraiser measures each part three times. This is known as the ‘crossed’ method.

Minitab and SigmaXL commands for GRR are provided below.

Minitab: Create a Gage R&R study worksheet using commands > Stat > Quality Tools > Gage Study > Create Gage R&R Study Worksheet, input the above data and then analyze the data using commands > Stat > Quality Tools > Gage Study > Gage R&R Study (Crossed)

SigmaXL: Use the MSA template or use commands > SigmaXL > Measurement Systems Analysis > Create Gage R&R (Crossed) Worksheet, input the above data and then analyze the data using commands > Measurement Systems Analysis > Analyze Gage R&R (Crossed).

Bias and Linearity

Bias Study

There are two methods for estimating bias: independent sample method and control chart method.

Independent Sample Method

Obtain a sample and establish its reference value relative to a traceable standard. If it is not available, select a production part that falls in mid-range and designate it as a master sample for bias analysis. Steps for independent sample method are outlined below:

  1. Measure the part at least 10 times as precisely as possible in a measurement laboratory and compute the average. Use this average as the ‘reference value’.
  2. Have a single appraiser measure the sample at least 10 times in a normal manner.
  3. Plot the data as a histogram relative to the reference value. Review the histogram for bell shape.
  4. Analyze the data. Formulae for the bias test are complex and beyond the scope of this text. For a more detailed discussion, refer to the MSA Manual by AIAG.

If bias is not zero for measuring equipment, it should be adjusted by performing a calibration. In case adjustment is not possible on the measuring equipment (such as slip gauges), it can be subtracted or added in the measured value. Adequate precautions must be taken in documenting procedures and training technicians. A template for bias study with 15 measurements is available in the CD-ROM provided with this book.
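Although the full AIAG bias test is not reproduced here, its flavour can be conveyed with an ordinary one-sample t statistic on the repeat readings. The sketch below is a simplification that may differ in detail from the AIAG procedure; the readings are invented, the function names are hypothetical, and the critical value of 2.262 (two-sided, 95 percent confidence, 9 degrees of freedom) is taken from standard t tables.

```python
import math
import statistics


def bias_t_statistic(readings, reference):
    """Return (bias, t) where bias = mean(readings) - reference and
    t = bias / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(readings)
    bias = statistics.mean(readings) - reference
    std_error = statistics.stdev(readings) / math.sqrt(n)
    return bias, bias / std_error


def bias_is_significant(t_value, t_critical):
    """Bias differs from zero if |t| exceeds the tabulated critical value."""
    return abs(t_value) > t_critical


# Ten invented readings of a master sample whose reference value is 56.5.
readings = [55.2, 54.8, 55.1, 54.9, 55.0, 55.2, 54.8, 55.1, 54.9, 55.0]
bias, t_value = bias_t_statistic(readings, reference=56.5)
print(round(bias, 3), round(t_value, 2), bias_is_significant(t_value, 2.262))
```

When the bias is significant, the equipment would be adjusted by calibration or the readings corrected, as described above.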

Control Chart Method for Determining Bias

To assess bias and stability, we can use the X bar and R chart. (For a more detailed discussion on control charts, refer to Chapter 17 on statistical process control). Control chart analysis should indicate that the measurement system is stable before the bias is evaluated. The method for determining bias is the same except that we have to obtain the (X-double bar) from the control chart. For complete details of the procedure, refer to the MSA Manual by AIAG.

Linearity Study

Linearity is the change in bias over the operating range. Linearity is evaluated using the following procedure:

  1. Select g ≥ 5 parts whose measurements, due to process variation, cover the operating range of the gauge.
  2. Have each part measured precisely in the measurement laboratory to determine its reference value and to confirm that the operating range of the subject gauge is encompassed.
  3. Have each part measured m ≥ 10 times on the subject gauge by one of the operators who normally uses the gauge. Select the parts at random to minimize the appraiser ‘recall’ bias in measurements.

Calculate the bias for each measurement and the average bias for each part using the formulae:

biasij = xij – (reference value for ith part) for ith part and jth measurement,

average biasi = (biasi1 + biasi2 + … + biasim)/m for ith part.

Plot the individual biases and bias averages with respect to the reference value on a linear graph. Also calculate the confidence intervals. For details of calculations, refer to the AIAG MSA Manual. These calculations are tedious and beyond the scope of this book. We will see the results of an example given in the software.

If the bias = 0 line lies entirely within the confidence interval of the fitted line over the operating range, linearity is acceptable.

Linearity Study Example A quality engineer wants to introduce a new measurement system. To conduct linearity study, she selects five parts (g = 5) which cover the operating range of the measurement system based upon process variation. Each part is measured by the first principles to determine its reference value. Each part is then measured 12 times (m = 12) by the lead operator. The parts were selected at random during the study. The data is shown in Table 6.4.

We can easily calculate the difference between each measurement and the reference value. The differences and average bias for each part are also shown in the table. We can now perform a regression of bias for each measurement with part reference value as the predictor. Considering that the calculations are complex, we will review the output from Minitab software. Perform regression in Minitab using the following commands:> Stat > Regression > Fitted Line Plot, choose linear model and in options check ‘Display confidence interval’.
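The bias calculation and the straight-line fit of bias against reference value can be sketched with ordinary least squares. The data below are invented and are not those of Table 6.4, the function name is hypothetical, and the confidence bands that the full procedure requires are deliberately left out.

```python
def linearity_fit(readings_by_reference):
    """readings_by_reference maps each part's reference value to the list
    of readings taken on that part. Returns (average bias per part,
    slope, intercept) of the least-squares line bias = intercept + slope * reference."""
    xs, ys, averages = [], [], {}
    for reference, readings in readings_by_reference.items():
        biases = [x - reference for x in readings]   # bias of each measurement
        averages[reference] = sum(biases) / len(biases)
        xs.extend([reference] * len(biases))
        ys.extend(biases)

    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return averages, slope, intercept


# Invented data: three parts, two readings each (a real study needs more).
data = {2.0: [2.5, 2.3], 4.0: [4.1, 3.9], 6.0: [5.5, 5.7]}
averages, slope, intercept = linearity_fit(data)
print(averages)
print(round(slope, 3), round(intercept, 3))
```

A slope close to zero together with an intercept close to zero suggests that the bias does not change over the operating range; the formal decision still rests on the confidence interval described above.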


Table 6.4 Data for linearity study


Regression and confidence interval are discussed in detail in Chapter 12. The Minitab output in Figure 6.4 shows the regression or best-fit line. The dotted lines show the confidence intervals. The Bias = 0 line can be added by clicking anywhere on the graph, right-clicking the mouse and using the option > Add Reference line. Since bias is plotted on the vertical axis, choose Y and enter 0 (zero).



Figure 6.4 Minitab output for linearity study example


Observe that Bias = 0 line is NOT included in the confidence interval. Linearity is, therefore, NOT acceptable.

MSA for Attribute Data

Quite often, we have to use the measurement system for attribute data. The simplest example is that of an umpire's decision during a cricket match to decide whether a batsman is out or not out. In the industry, typical examples can be

  • crack testing result, i.e., crack or no crack;
  • leak test;
  • visual inspection;
  • selection of a candidate in interview, or
  • grading tea or coffee.

Attribute agreement analysis is used to assess such measurement systems. In this procedure, at least 10 good parts and 10 bad parts should be inspected by three appraisers. Some experts recommend at least 50 parts (Mawby 2006). We should also know about the evaluation of each part as done by an ‘expert’ or ‘master’. The collected data may look like Table 6.5.


Table 6.5 Attribute MSA for binary data


(Reproduced with permission from SigmaXL)


Once the evaluation is completed, we can analyze the data for

  • agreement within appraisers,
  • agreement between appraisers, and
  • agreement between appraisers and ‘true standard’.

A measure called Cohen's Kappa is frequently used. Its value can range from –1 to 1. Value 1 shows perfect agreement, value 0 indicates that the agreement is no better than chance, and a negative value indicates agreement worse than chance. Table 6.6 shows an example of Cohen's Kappa calculation.


Table 6.6 Cohen's Kappa calculation illustration



Cohen's Kappa value of >0.75 is considered acceptable and the Kappa value of <0.4 is considered as poor (AIAG 2002). Kappa value can be calculated within repeated trials of a single appraiser, between single trials of any two appraisers, and between any appraiser and the ‘reference’ or ‘standard’. In other cases, ‘Fleiss Kappa’ value is calculated. Most of the software applications, such as Minitab or SigmaXL, are capable of analyzing data for attribute agreement analysis.
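For two sets of ratings of the same parts, Cohen's Kappa compares the observed proportion of agreement with the agreement expected by chance: Kappa = (Pobserved – Pchance)/(1 – Pchance). A minimal sketch follows; the ratings are invented and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two equally long lists of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)

    # Observed agreement: share of parts rated identically.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of the marginal proportions, summed over categories.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(count_a) | set(count_b)
    p_chance = sum(count_a[c] * count_b[c] for c in categories) / n ** 2

    if p_chance == 1.0:
        raise ValueError("Kappa is undefined when only one category is used")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Invented pass (P) / fail (F) ratings of ten parts by two appraisers.
appraiser_1 = ["P", "P", "P", "P", "F", "F", "F", "F", "P", "F"]
appraiser_2 = ["P", "P", "P", "F", "F", "F", "F", "P", "P", "F"]
print(round(cohens_kappa(appraiser_1, appraiser_2), 2))
```

The same function can be applied to two trials of one appraiser (within-appraiser agreement) or to an appraiser's ratings against the standard.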

The within-appraiser Kappa values and the percentage of ratings matched with the standard for each appraiser are shown in Table 6.7.

As we can see from the results, appraiser B is very inconsistent as his/her within appraiser Cohen's Kappa value is 0.23 which is very low. He/she should be trained perhaps by appraiser C. Please note that appraiser C is very consistent within himself/herself (Kappa value 0.9) as well as with the ‘standard’ as 19 out of 20 are matched with the standard. As the appraisers have performed repeated measurements, Kappa values between them cannot be calculated. In such cases, most software programs calculate other statistics such as ‘Fleiss Kappa’. Value of Fleiss Kappa close to 1 shows good agreement.


Table 6.7 Cohen's Kappa values within appraisers


(Reproduced with permission from SigmaXL)



Minitab: > Stat > Quality Tools > Attribute Agreement Analysis.

SigmaXL: Use the Attribute Agreement Analysis template or commands > Measurement System Analysis > Attribute MSA (Binary).

If we have ordinal attribute data with more than two levels, Minitab calculates Kendall's coefficient of concordance.

The Analytic Method for Attribute MSA

In the analytic method, the concept of the gauge performance curve is used to assess repeatability and bias of the measurement system. In this method, we need to have at least 8 parts selected at as nearly equidistant intervals as practical, covering the complete process range. Sizes of the parts should be accurately known before the study for reference. Reference parts should have sufficient overlap with tolerance. Measure each part 20 times with the gauge under study and record the result as accept or reject. At least 6 parts must have the accepted number between 1 and 19. If required, reference parts can be added between some of the ranges to get a minimum of six parts. The analytic method is supported in Minitab. The commands are:

> Stat > Quality Tools > Gage Study > Attribute Gage Study (Analytic Method)

For details of the method, refer to the MSA Manual by AIAG (AIAG 2002).

Measurement Systems in Complex Situations

Measurement systems analysis can become more complicated for situations as described below:

  • When each part can be measured by only one appraiser. This is the case in destructive testing. Examples are tensile testing, crash testing, etc.
  • When the measuring equipment is damaged during measurement. An example is temperature measurement of molten metals in a foundry.
  • When product behavior can change and cannot be assumed as stable. Examples are testing of engines on dynamometer, calibration stands for fuel pumps, etc.

Use Minitab commands > Stat > Quality Tools > Gage Study > Gage R&R Study (Nested) when each part is measured by only one operator, as is the case in destructive testing. SigmaXL does not support nested R&R study.

Procedures for such situations are difficult to generalize. Good guidance is provided in some books (Mawby 2006). Procedures may be developed in consultation with experts in MSA.

Measurement Systems in the Enterprise

While measurement systems analysis is most frequently done in manufacturing operations, it is also applicable to other processes. For example, MSA can be carried out for professors evaluating answer papers, cricket match umpires deciding about LBW, etc. Some more examples are listed here:

  • HR: selection, promotion, attrition
  • Marketing: for customer satisfaction measurement
  • Sales: receivables
  • Engineering: drawing errors, time to process change, errors in bill of materials (BOM)
  • Research and development (R&D): Instrumentation for data-logging
  • Supply-chain management: stock accuracy, transaction errors, ERP database accuracy
  • Purchase: purchase order approvals.

It is necessary to validate measurement system(s) for the CTQ of the project. There are many statistical procedures to analyze and quantify measurement system uncertainty.

  • Repeatability relates to the measuring system.
  • Reproducibility is a measurement variation due to different appraisers.
  • Linearity is the change in bias over the operating range of the measuring equipment.
  • Stability is measurement variation over time.
  • Measurement system is considered acceptable if R&R is less than 10 percent of the total observed variation.
  • In case of attribute data, agreement analysis can be performed.
  • Cohen's Kappa value indicates the extent of agreement within and between appraisers.

Automotive Industry Action Group (AIAG) (2002). Measurement Systems Analysis.

Breyfogle III, Forrest W. (1999). Implementing Six Sigma: Smarter Solutions Using Statistical Methods. New York, NY: John Wiley & Sons.

Mawby, William D. (2006). Make Your Destructive, Dynamic and Attribute Measurement System Work for You. Milwaukee: American Society for Quality.