10. Inferential Statistics Part 2: Confidence Intervals, Power, Sample Size and Goodness of Fit – Six Sigma for Business Excellence: Approach, Tools and Applications

10

Inferential Statistics Part 2: Confidence Intervals, Power, Sample Size and Goodness of Fit

Confidence Intervals

Point Estimation and Efficiency of Estimators

A point estimate is a single number calculated from sample data. When we get data of, say, 10 readings from a process, we can calculate the mean of the sample data. If we take five more readings, we can again calculate the mean based on the data of all 15 readings. Which of these is the real population mean? We actually get two values of the sample mean, x̄1 and x̄2. Neither of these may actually be the real population mean. We very rarely know what precisely the population mean is. As the sample size increases, the sample mean x̄ will approach the population mean µ. However, x̄ is not µ. The sample mean x̄ is called a point estimator of the population mean. We expect that x̄ is ‘reasonably’ close to the population mean µ. Another point estimator is the median. We should use the estimator which is more likely to be closer to the population mean.

Components of Closeness

There are two components of ‘closeness’: bias and standard deviation.

  1. Bias is the ‘expected value’ of the estimator minus the parameter. For example, the mean of a large number of sample averages x̄ will be equal to the population mean. A statistic θ̂ is said to be an unbiased estimator if and only if the mean of the sampling distribution of the estimator equals the population parameter θ. For a symmetric population such as the normal, both the sample mean and the sample median are unbiased estimators of the population mean µ.
  2. Standard deviation or variance of the estimator: In case of sample averages (or point estimation of mean), this will be σ/√n. This is known as the standard error of mean.

For large samples, means tend to be distributed normally. For large sample sizes, medians are also distributed normally. However, statistical theory shows that, for a normal population, the standard deviation of sample medians is approximately 1.25 σ/√n (Johnson 2005). This is more than the standard error of mean. Thus, the sample mean is a better estimator of the population mean as it is more likely to be closer to it. However, this does not imply that a particular sample mean will be closer to the population mean than the sample median. Statistical theory also shows that the standard deviation of no other unbiased estimator is less than the standard error of mean σ/√n. Thus the mean is the most efficient estimator of central tendency.
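The comparison between the two estimators can be checked with a small simulation. The sketch below is illustrative only: the population (normal, mean 100, standard deviation 5), the sample size of 25, the number of trials and the seed are all assumptions chosen for the demonstration, not values from the text.

```python
import random
import statistics

def spread_of_estimators(n=25, trials=4000, mu=100.0, sigma=5.0, seed=1):
    """Return (sd of sample means, sd of sample medians) over many samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means, medians = [], []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.stdev(means), statistics.stdev(medians)

sd_mean, sd_median = spread_of_estimators()
ratio = sd_median / sd_mean
print(f"sd of sample means   = {sd_mean:.3f} (sigma/sqrt(n) = {5.0 / 25 ** 0.5:.3f})")
print(f"sd of sample medians = {sd_median:.3f}")
print(f"ratio median/mean    = {ratio:.2f}")
```

For a moderate sample size the ratio should come out somewhat above 1, in the neighbourhood of the factor 1.25 quoted for large samples.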

Interval Estimates

Confidence Interval for Mean

We need to draw conclusions about populations based on samples using statistics. We are, therefore, never 100 percent sure about the value of population parameters. Thus we estimate population parameters like mean µ and standard deviation σ with ‘confidence intervals’. Such estimates are called ‘interval estimates’. Confidence Intervals are based on a certain confidence level, typically 95 percent.

 

 

Figure 10.1 Distribution of sample means

 

Let us assume that we take a large number of samples of size n from a population and measure these for a characteristic. The central limit theorem helps us in understanding that the means are likely to be normally distributed for sufficiently large values of n, say, more than 5. If the population standard deviation is σ, the standard deviation of means will be given by σ/√n. This is called the ‘standard error of mean’. Figure 10.1 shows the distribution of sample means around population mean µ. We can say that 100(1 – α) percent of the sample means will lie between µ – Zα/2 σ/√n and µ + Zα/2 σ/√n. The area beyond these values is the α risk that we are taking.

However, in real life, we rarely know the population mean µ. We take a single sample and calculate the sample mean x̄. Based on the value of x̄, we want to estimate the confidence interval of population mean µ. (Refer to Figure 10.2.) Let us assume that the confidence level is 95 percent. Then the probability that this value of x̄ will lie within µ ± 1.96 σ/√n is 0.95. In general, the probability will be percent confidence level/100. Consider a possibility that the value of x̄ is at the extreme left of the confidence zone. Then the value of µ will be x̄ + Zα/2 σ/√n. Similarly, if x̄ is at the other extreme, i.e., at the extreme right, the value of µ will be x̄ – Zα/2 σ/√n. Thus, the confidence interval for population mean µ is x̄ ± Zα/2 σ/√n.

 

 

Figure 10.2 Confidence interval for population mean

 

The practical meaning of a 95 percent confidence interval can be stated thus: “The probability that an interval constructed in this way will contain the population mean µ is 0.95.” For example, if we take 1000 samples from a population, we will have 1000 sample means and, therefore, 1000 confidence intervals. We can expect that 95 percent or approximately 950 intervals will contain population mean µ. See Figure 10.3 for illustration with a few samples. Note that confidence intervals for all samples except sample 5 include population mean µ.
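The ‘1000 samples’ interpretation can be tried out numerically. The following is a minimal sketch, assuming a normal population with µ = 100 and known σ = 5 and samples of size 10; these values, the seed and the function name `coverage` are illustrative assumptions.

```python
import random
from statistics import NormalDist, fmean

def coverage(mu=100.0, sigma=5.0, n=10, samples=1000, conf=0.95, seed=7):
    """Fraction of known-sigma confidence intervals that contain mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # 1.96 for 95 percent
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(samples):
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / samples

frac = coverage()
print(f"{frac:.1%} of the 1000 intervals contain the population mean")
```

The fraction printed should be close to, but not exactly, 95 percent, since it is itself subject to sampling variation.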

 

Figure 10.3 Practical meaning of confidence interval


 

(Courtesy Institute of Quality and Reliability)

 

The confidence interval for mean µ when the population standard deviation σ is known is given by

$$\bar{x} \pm Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$

where the factor Zα/2 depends upon the confidence level:

For 95%, α = 0.05, Zα/2 = Z0.025 = 1.96

For 90%, α = 0.10, Zα/2 = Z0.05 = 1.645

For 99%, α = 0.01, Zα/2 = Z0.005 = 2.576

Application Example

A process is known to produce parts with a mean flow of 100 pounds/minute and a standard deviation of 5. The machine undergoes preventive maintenance and is reset by the operator. After starting the machine again, the first five components measure 97, 99, 101, 96 and 100. Find the confidence interval for population mean for alpha risk of 5 percent.

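µ0 = 100, σ = 5, n = 5. Two tails with α = 0.05 distributed on both tails, so Zα/2 = 1.96. The sample mean is x̄ = 98.6.

$$\bar{x} \pm Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = 98.6 \pm 1.96 \times \frac{5}{\sqrt{5}} = 98.6 \pm 4.383$$

Thus, the confidence interval for the population mean is 94.217 to 102.983.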
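The calculation for this example can be sketched in a few lines. The function name `z_interval` is a hypothetical helper introduced here; the Z value is obtained from Python's `statistics.NormalDist` rather than from a table.

```python
from statistics import NormalDist, fmean

def z_interval(data, sigma, conf=0.95):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    xbar = fmean(data)
    half_width = z * sigma / len(data) ** 0.5
    return xbar - half_width, xbar + half_width

low, high = z_interval([97, 99, 101, 96, 100], sigma=5)
print(f"95 percent CI for the mean: {low:.2f} to {high:.2f}")
```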

 

 

When the population standard deviation is not known, we use the sample standard deviation s and Student's t-distribution. The confidence interval is then given by the following formula:

Confidence Interval for µ when σ is not known

$$\bar{x} \pm t_{\alpha/2,\,v}\,\frac{s}{\sqrt{n}}$$

where v = n – 1 is the degrees of freedom.

In the above equation, the factor tα/2,v depends upon the confidence level and degrees of freedom v.

Application Example

A process is known to produce parts with mean flow of 100 pounds/minute. The machine undergoes preventive maintenance and is reset by the operator. After starting the machine again, the first five components measure: 97, 99, 101, 96 and 100. Calculate confidence interval for mean if alpha risk is 5 percent.

  • H0: µ0 = 100, σ = not known, n = 5
  • Two tails with α = 0.05 distributed on both tails
  • Calculated sample standard deviation s is 2.074

The value of the t-distribution depends on the degrees of freedom, here v = n – 1 = 4. From Table 2 (in the CD), the value of t0.05/2,4 = 2.7764. Thus, the half-width of the confidence interval can be calculated as

$$t_{\alpha/2,\,v}\,\frac{s}{\sqrt{n}} = 2.7764 \times \frac{2.074}{\sqrt{5}} = 2.575$$

Since the sample mean is 98.6, the CI will be 98.6 ± 2.575, i.e., 96.03 to 101.17. A Microsoft Excel based template for calculating confidence intervals is provided in the CD. Try using the template for the two previous application examples.
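The same arithmetic can be reproduced as a quick check. In this sketch the t value is not computed; it is passed in as the table value 2.7764 quoted above. The helper name `t_interval` is hypothetical.

```python
from statistics import mean, stdev

def t_interval(data, t_crit):
    """Return (sample mean, sample sd, half-width) for a t-based interval.

    t_crit is the table value of t for alpha/2 and n - 1 degrees of freedom.
    """
    n = len(data)
    xbar = mean(data)
    s = stdev(data)  # sample standard deviation with n - 1 in the denominator
    half_width = t_crit * s / n ** 0.5
    return xbar, s, half_width

xbar, s, half = t_interval([97, 99, 101, 96, 100], t_crit=2.7764)
print(f"mean = {xbar:.1f}, s = {s:.3f}, half-width = {half:.3f}")
print(f"CI: {xbar - half:.2f} to {xbar + half:.2f}")
```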

Confidence Interval for Standard Deviation

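We can calculate the confidence interval for standard deviation using the χ² (Chi-Square) distribution with (n – 1) degrees of freedom using the following formula:

$$\sqrt{\frac{(n-1)\,s^2}{\chi^2_{\alpha/2,\,n-1}}} \;\le\; \sigma \;\le\; \sqrt{\frac{(n-1)\,s^2}{\chi^2_{1-\alpha/2,\,n-1}}}$$

The Chi-Square distribution is not symmetric and, therefore, the confidence interval for standard deviation is also not symmetric.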

Application Example

A pizza company delivers pizzas with a mean of 33 minutes and a standard deviation of 4 minutes. Their Six Sigma team claims that an improved process was piloted for 26 pizza delivery trips and they could achieve a standard deviation of 3 minutes. What is the confidence interval for the new standard deviation? Assume alpha risk of 10 percent.

Solution: α = 0.1, α/2 = 0.05. From the table of chi-square distribution (Table T4 in the Statistical Tables in the CD), χ²0.95,25 = 14.611 and χ²0.05,25 = 37.652. Using the formula for confidence interval, we get the confidence interval as 2.44 to 3.92. Observe that the historical σ = 4 is not included between 2.44 and 3.92; hence, we must reject the null hypothesis that the standard deviation is unchanged. The standard deviation has reduced significantly, and the team should be recognized by the management.
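As a check on the pizza example, the interval can be computed directly from the two table values. This is a minimal sketch; `sd_interval` is a hypothetical helper and the chi-square values are supplied from the table rather than calculated.

```python
from math import sqrt

def sd_interval(s, n, chi2_upper, chi2_lower):
    """Confidence interval for sigma.

    chi2_upper: chi-square value with area alpha/2 to its right
    chi2_lower: chi-square value with area 1 - alpha/2 to its right
    """
    sum_sq = (n - 1) * s ** 2
    return sqrt(sum_sq / chi2_upper), sqrt(sum_sq / chi2_lower)

low, high = sd_interval(s=3, n=26, chi2_upper=37.652, chi2_lower=14.611)
print(f"90 percent CI for sigma: {low:.2f} to {high:.2f}")
print("historical sigma of 4 inside interval:", low <= 4 <= high)
```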

Note: A template for confidence intervals is provided in the CD. Try using the template to calculate confidence interval for this example.

Confidence Interval and Proportion

We can find an approximate confidence interval for the proportion using the normal distribution if the sample size is large. If n is large and p ≥ 0.1, the confidence interval for the proportion using the normal approximation to the binomial is given by

$$\hat{p} \pm Z_{\alpha/2}\,\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}$$

where p̂ is x/n, the number of occurrences x divided by the sample size n.

Note that this formula is based on the binomial standard deviation √(p̂(1 – p̂)/n). For a one-sided confidence interval, use Zα instead of Zα/2.

Application Example

A process is run for 3 days. Of the 3500 parts produced, 450 were reworked. What is the worst and best expected proportion from the process? Assume 95 percent confidence level.

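Solution: p̂ = 450/3500 = 0.1286 and Zα/2 = 1.96.

$$\hat{p} \pm Z_{\alpha/2}\,\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}} = 0.1286 \pm 1.96 \times \sqrt{\frac{0.1286 \times 0.8714}{3500}} = 0.1286 \pm 0.0111$$

Thus, the worst expected proportion is 0.140 and the best expected proportion is 0.117 at 95 percent confidence level. Try using the template for this example.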
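The rework example can be verified with a short calculation. This is a sketch of the normal-approximation interval only; `proportion_interval` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(x, n, conf=0.95):
    """Two-sided normal-approximation interval for a proportion."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

best, worst = proportion_interval(x=450, n=3500)
print(f"best = {best:.3f}, worst = {worst:.3f}")
```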

Exercise

A call center monitors calls that are not answered within 20 seconds. Such calls are called ‘bad calls’. Data of the past 30 days shows that out of 4000 calls received, 360 were bad calls. What is the maximum percentage of bad calls that can be expected at 99 percent confidence level? (Answer: 10.05 percent)
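Because the exercise asks only for a maximum, it calls for a one-sided bound, which uses Zα rather than Zα/2. A minimal sketch of that calculation, with the hypothetical helper `upper_bound`, is:

```python
from math import sqrt
from statistics import NormalDist

def upper_bound(x, n, conf):
    """One-sided upper confidence bound for a proportion (normal approximation)."""
    p_hat = x / n
    z = NormalDist().inv_cdf(conf)  # one-sided, so the full alpha is in one tail
    return p_hat + z * sqrt(p_hat * (1 - p_hat) / n)

bound = upper_bound(x=360, n=4000, conf=0.99)
print(f"maximum expected bad calls: {100 * bound:.2f} percent")
```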

Power and Sample Size

In hypothesis testing, we make an assumption called the null hypothesis, which is about a population parameter. A population parameter could be the mean µ or the standard deviation σ of a population of size N. Due to physical constraints, we can rarely measure or inspect the whole population. Therefore, we collect data for a sample of size n. This data would have mean x̄ and standard deviation s. Based on the values obtained from the sample data, we calculate some “statistic”. We use this statistic to decide to what extent our assumption about the population parameter is correct. When we draw any conclusion based on a sample, we are taking some risk in terms of decision making. One of the important questions we need to ask ourselves while taking decisions is: What should be the size of the sample?

Beta risk, Sample Size and Power

The sample size depends on

  • the α-risk we are prepared to take,
  • the power of the test which is simply (1 – β) where β is consumer's risk and power is the probability of correctly rejecting the null hypothesis, and
  • the ratio of δ/σ where δ is the smallest true difference we want to detect and σ is the standard deviation.

 

Figure 10.4 Concept of beta risk and power


 

(Courtesy Institute of Quality and Reliability)

 

The concept of α-risk was discussed earlier. β-risk is the probability of not rejecting H0 when it should be rejected. Refer to Figure 10.4 for the concept of beta risk and power for the 1-sample Z-test. δ is the extent of shift from the hypothesized mean µ0. Observe that the larger the value of δ, the smaller will be the β-risk. If the sample size is increased, the standard error of mean will reduce and the distribution will be narrower. β-risk will, therefore, reduce when the sample size is increased.
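The effect of δ and n on β-risk can be illustrated numerically for a two-sided 1-sample Z-test. The sketch below assumes the true mean has shifted by δ and computes the probability that the Z-statistic still falls between the two critical values; the function name `beta_risk` and the trial values δ = 4, σ = 20 are illustrative assumptions.

```python
from statistics import NormalDist

def beta_risk(delta, sigma, n, alpha=0.05):
    """Probability of not rejecting H0 when the true mean is shifted by delta."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = delta * n ** 0.5 / sigma  # shift measured in standard errors
    return nd.cdf(z_crit - shift) - nd.cdf(-z_crit - shift)

for n in (25, 50, 100, 200):
    print(f"n = {n:3d}: beta = {beta_risk(4, 20, n):.3f}, "
          f"power = {1 - beta_risk(4, 20, n):.3f}")
```

Doubling δ at a fixed n, or increasing n at a fixed δ, lowers the β value printed.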

Sample Size for Z-test

While applying the Z-test, we assume that the population has a normal distribution and that the standard deviation σ is known. The Z-statistic is given by the following formula. We can calculate the required sample size if we know the values of Z, the standard deviation and the difference δ between the sample average and the population mean that we wish to detect. The value of Z is known if we know the confidence level.

$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{\delta}{\sigma/\sqrt{n}} \quad\Rightarrow\quad n = \left(\frac{Z_{\alpha/2}\,\sigma}{\delta}\right)^2$$

Observe that consumers’ risk β does not appear in the above equation. The equation actually assumes that β = 0.5. Figure 10.5 illustrates the relation between δ/σ ratio, sample size and β-risk.

 

 

Figure 10.5 An illustration of how a larger sample size reduces β-risk and can detect smaller δ/σ ratio

Application Example of Sample Size Calculation

What is the sample size which would confirm the significance of mean shift of greater than 4 days in a cycle time if the population standard deviation is 20 days? What if the standard deviation is 16 days? (Assume α = 0.05)

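δ = 4, σ = 20, Zα/2 = 1.96

For a standard deviation of 20, the sample size is:

$$n = \left(\frac{Z_{\alpha/2}\,\sigma}{\delta}\right)^2 = \left(\frac{1.96 \times 20}{4}\right)^2 = 96.04$$

which is rounded up to 97. Similarly, for a standard deviation of 16 days, the sample size can be calculated as 62.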
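Both sample sizes in this example follow from the same one-line formula, rounded up to the next whole number. A minimal sketch, with the hypothetical helper `z_sample_size`:

```python
from math import ceil
from statistics import NormalDist

def z_sample_size(delta, sigma, alpha=0.05):
    """Sample size n = (Z * sigma / delta)^2, rounded up (implies beta = 0.5)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z * sigma / delta) ** 2)

n_20 = z_sample_size(delta=4, sigma=20)
n_16 = z_sample_size(delta=4, sigma=16)
print(n_20, n_16)
```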

Using Software for Sample Size for Z-test

Consider the previous example for standard deviation of 20.

Minitab commands are > Stat > Power and Sample Size > 1-Sample Z. Specify difference as 4. (See Figure 10.6)

 

 

Figure 10.6 Minitab dialogue box for power and sample size for 1-sample Z-test

 

Note: SigmaXL supports power and sample size calculations in most cases but not for Z test.

 

Power = 1 – β. For β-risk values of 0.05, 0.1, 0.2 and 0.5 power will be 0.95, 0.9, 0.8 and 0.5 respectively. Enter these values with ‘space’ between them. Minitab output is shown in Figure 10.7. Observe that sample size for β = 0.5 matches the sample size calculated using the formula. As sample size is discrete and, therefore, rounded off to the next integer, the actual power using this sample size is slightly more than requested.
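The dependence of sample size on the requested power can be approximated by adding a Zβ term to the earlier formula, n = ((Zα/2 + Zβ) σ/δ)². This is a commonly used textbook approximation for the two-sided Z-test and is offered here as a sketch; it is not claimed to be the exact procedure used by any particular software package. With power 0.5, Zβ is zero and the earlier formula is recovered.

```python
from math import ceil
from statistics import NormalDist

def z_sample_size_for_power(delta, sigma, power, alpha=0.05):
    """Approximate n for a two-sided 1-sample Z-test at a requested power."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(power)  # zero when power = 0.5
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

sizes = {p: z_sample_size_for_power(4, 20, p) for p in (0.95, 0.9, 0.8, 0.5)}
for power, n in sizes.items():
    print(f"power = {power}: n = {n}")
```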

 

 

Figure 10.7 Output in Minitab for power calculations

Sample Size for 1-Sample Student's T-test

For the t-test, we must use the value of Student's t-distribution instead of Z. By analogy with the Z-test, the formula for sample size can be worked out as:

$$n = \left(\frac{t_{\alpha/2,\,n-1}\; s}{\delta}\right)^2$$

As we do not know the sample size, the value of the t-distribution is also not known. Thus, this is an iterative calculation. The power and sample size can be calculated using the software.

Application Example

A battery manufacturer wants to estimate the mean battery life. What is the sample size which would confirm the significance of mean shift of greater than 5 days of battery life? The initial mean is 30 days and the standard deviation 10 days. Assume α = 0.05 and β = 0.20

Minitab commands are: > Stat > Power and Sample Size > 1-Sample t.

In SigmaXL use power and sample size calculator. Use > SigmaXL > Statistical Tools > Power and Sample Size Calculators. Input difference as 5 and power as 0.8. Sample size given by the software is 34 with actual power of 0.8077.

Power of 2-Sample T-test

Application Example

A call center wants to compare the mean waiting time of its two offices: A and B. Office A has a mean waiting time of 6 minutes based on 20 calls while office B has a mean waiting time of 4.8 minutes based on the same number of calls. Pooled standard deviation of waiting time is 1.1 minutes. With what power would you test the hypothesis that office B is better (i.e., it has less waiting time)? Assume normal distribution and α risk of 0.05.

H0: µA = µB, H1: µA > µB, sp = 1.1, x̄A = 6, x̄B = 4.8, n1 = n2 = 20

In SigmaXL, choose Power and Sample Size Calculators like in the last example and select 2-sample t-test calculator. Input the above data but remember to choose ‘Greater Than’ for alternate hypothesis. Software output is shown in Table 10.1 for reference. Thus, the power is 0.96.
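A rough cross-check of this power value is possible without specialised software by replacing the t-distribution with the normal distribution. This is a swapped-in approximation, not the calculation performed by the software, and it tends to be slightly optimistic for small samples. The helper name `two_sample_power_approx` is hypothetical.

```python
from statistics import NormalDist

def two_sample_power_approx(diff, sp, n_per_group, alpha=0.05):
    """Normal approximation to the power of a one-sided 2-sample test."""
    nd = NormalDist()
    std_error = sp * (2 / n_per_group) ** 0.5  # standard error of the difference
    return nd.cdf(diff / std_error - nd.inv_cdf(1 - alpha))

power = two_sample_power_approx(diff=6 - 4.8, sp=1.1, n_per_group=20)
print(f"approximate power = {power:.2f}")
```

The result lands close to the 0.96 reported by the software.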

 

Table 10.1 Power of 2-sample t-test

Sample Size for Proportions

A green belt wants to improve the quality level of a process from 95 to 98 percent. What sample sizes are required to verify the improvement for β-risks of 0.2, 0.1 and 0.05? Assume α-risk of 0.05.

We will calculate sample sizes for power values of 0.8, 0.9 and 0.95.

 

Table 10.2 Power of proportion test

 

In SigmaXL, use commands > SigmaXL > Statistical Tools > Power and Sample Size Calculators > 1 Proportion Test Calculator. Input power value of 0.8, hypothesized proportion as 0.95, alternate proportion as 0.98. Choose ‘Greater Than’ in the alternate hypothesis. The sample sizes are 253, 322, and 386 for the power values of 0.8, 0.9 and 0.95 respectively.

Minitab commands are > Stat > Power and Sample Size > 1 Proportion.
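The sample sizes quoted above can be reproduced with the usual normal-approximation formula for a one-sided test of one proportion, n = ((Zα √(p0(1 – p0)) + Zβ √(p1(1 – p1))) / (p1 – p0))². Whether the software uses exactly this formula is an assumption; the sketch simply shows that the approximation gives the same numbers for this example.

```python
from math import ceil, sqrt
from statistics import NormalDist

def proportion_sample_size(p0, p1, power, alpha=0.05):
    """Approximate n for a one-sided 1-proportion test (normal approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha)
    z_beta = nd.inv_cdf(power)
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
    return ceil((numerator / abs(p1 - p0)) ** 2)

sizes = [proportion_sample_size(0.95, 0.98, pw) for pw in (0.8, 0.9, 0.95)]
print(sizes)
```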

Power and Sample Size for ANOVA and DOE

In addition to hypothesis tests, Minitab, SigmaXL, and most other statistical software programs support power and sample size calculations for one-way analysis of variance (ANOVA) and 2-level factorial designs. ANOVA is discussed in Chapter 11.

Goodness-of-Fit Tests

Goodness-of-fit tests are used to assess how well the observed data fits into a particular distribution. We want to compare the observed frequency distribution with corresponding values of expected frequencies. Most of these tests use Chi-square distribution for this purpose. The underlying principle can be understood with an example.

Application Example: Poisson Distribution

Suppose we want to assess whether data on the number of calls received in a call center belongs to Poisson distribution with a mean number of 4.6 calls per minute. We take a sample of total 400 minutes’ time interval and collect data of actual calls received during this period.

Null and alternate hypotheses are given by:

H0: Number of calls per minute belongs to Poisson distribution with mean 4.6 calls per minute

H1: Number of calls per minute does not belong to Poisson distribution with mean 4.6 calls per minute

The expected frequencies can be calculated using Poisson distribution tables. Table 10.3 shows the observed and expected frequencies and calculations of χ2.

At a significance level of 5 percent (α = 0.05), the calculated χ² value of 9.782 is less than the critical value χ²0.05,12 = 21.026. Therefore, the null hypothesis cannot be rejected, and we conclude that the Poisson distribution with mean 4.6 is a reasonable fit. Please note that the degrees of freedom are 1 less than the number of intervals or classes, because the mean of 4.6 was specified in the hypothesis rather than estimated from the data.
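The expected frequencies and the χ² statistic can also be computed directly instead of reading Poisson tables. The sketch below assumes 13 classes (0 to 11 calls, plus ‘12 or more’), which is consistent with the 12 degrees of freedom quoted above but is not confirmed by the text. The observed frequencies are not reproduced here, so only the expected side is evaluated.

```python
from math import exp, factorial

def poisson_expected(mean, total, max_class):
    """Expected frequencies for classes 0, 1, ..., max_class - 1, 'max_class or more'."""
    probs = [exp(-mean) * mean ** k / factorial(k) for k in range(max_class)]
    probs.append(1 - sum(probs))  # open-ended last class
    return [total * p for p in probs]

def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = poisson_expected(mean=4.6, total=400, max_class=12)
for k, e in enumerate(expected):
    label = f"{k}" if k < 12 else "12+"
    print(f"{label:>3} calls: expected {e:6.2f} minutes")
```

Passing the observed frequencies together with `expected` to `chi_square` gives the statistic to compare with the critical value.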

 

Table 10.3 Goodness-of-fit test for Poisson distribution

 

Goodness-of-fit tests can be conducted for other distributions as well. The calculation of expected frequencies must be made using the distribution for which we are testing goodness-of- fit.

Application Example: Normal Distribution

The procedure for normal distribution is similar to the earlier example. Let us consider an example. The scores of 50 students in an examination are listed in Table 10.4. Use goodness-of-fit test to conclude whether the scores are normally distributed. Assume the confidence level as 95 percent.

From data, mean = 68.42 and standard deviation is 10.414.

The null and alternate hypotheses are:

H0: population scores are normally distributed with mean 68.42 and SD = 10.414

H1: population scores are not normally distributed with mean 68.42 and SD = 10.414

It is recommended that the overall range be divided into a sufficient number of classes with equal probability. While doing this, it is also recommended that about 5 expected values exist in each class. To follow these recommendations, the data is divided into 10 classes with equal probability of 0.1. The class interval limits are calculated using the Excel function NORMINV(cum probability, mean, standard deviation). Expected frequency for each class interval will be 0.1 × 50 = 5. Calculations are shown in Table 10.5.
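The class limits can equally be obtained outside Excel. The sketch below uses `statistics.NormalDist.inv_cdf`, which plays the same role as NORMINV, and adds a hypothetical helper for counting observations per class. The marks themselves are not reproduced here, so only the limits are printed.

```python
from statistics import NormalDist

def equal_probability_limits(mean, sd, classes):
    """Interior class limits that split a normal distribution into equal areas."""
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf(i / classes) for i in range(1, classes)]

def observed_counts(data, limits):
    """Count how many observations fall in each class defined by the limits."""
    counts = [0] * (len(limits) + 1)
    for x in data:
        index = sum(1 for limit in limits if x > limit)
        counts[index] += 1
    return counts

limits = equal_probability_limits(mean=68.42, sd=10.414, classes=10)
print([round(limit, 2) for limit in limits])
```

Applying `observed_counts` to the 50 scores gives the observed frequencies to compare with the expected frequency of 5 per class.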


 

Table 10.4 Marks of 50 students

 

Table 10.5 Calculations of expected frequency

 

The total number of class intervals (categories) k is 10. The degrees of freedom is (k – p – 1) where k is the number of categories and p the number of parameters estimated using the same data. As the same data is used to calculate mean and standard deviation, degrees of freedom is (10 – 2 – 1) = 7.

The table value of χ²0.05,7 is 14.067, which is larger than the calculated value of χ². We, therefore, conclude that the distribution is a reasonable fit.

Other Tests and Considerations

Goodness-of-fit tests are useful to assess how well observed data fit in with a given distribution. We have discussed examples of Poisson and Normal distributions. However, the procedure can be adapted to suit other distributions as well.

The data can be grouped into intervals of equal probability or equal width. Each bin should contain at least five data points, so certain adjacent bins sometimes need to be joined together for this condition to be satisfied.

Many statistical tests and procedures are based on specific distributional assumptions. The assumption of normality is quite common in classical statistical tests. Very often, reliability modeling is based on the assumption that the distribution of data follows a Weibull distribution.

There are many nonparametric and robust techniques that are not based on strong distributional assumptions. By nonparametric we mean a technique, such as the sign test, that is not based on a specific distributional assumption. However, techniques based on specific distributional assumptions are in general more powerful than these nonparametric and robust techniques. Power, in this context, implies the ability to detect a difference when that difference actually exists.

If one is using a technique that makes a normality (or some other type of distributional) assumption, it is important to confirm that this assumption is in fact justified. If it is, more powerful parametric techniques can be used. If the distributional assumption is not justified, a nonparametric or robust technique may be required. We have used Chi-square tests for assessing goodness-of-fit. Other tests are also available. Two other important tests are:

  1. Kolmogorov-Smirnov Test, which is based on the empirical cumulative distribution function (ECDF). It can be used only in case of continuous distributions.
  2. The Anderson-Darling Test, which is a general test to compare the fit of an observed cumulative distribution function to an expected cumulative distribution function. This test gives more weight to the tails than the Kolmogorov-Smirnov test. The Anderson-Darling Test is frequently used for testing normality. Critical values depend on the specific distribution being tested.

Summary

Confidence intervals are based on a certain confidence level, typically 95 percent.

We usually do not know the population parameters in real life.

We, therefore, draw conclusions about populations based on samples using statistics.

Confidence intervals for population parameters such as mean µ, standard deviation σ, proportion, etc. can be calculated using statistical methods.

The power of a test is the probability of rejecting the null hypothesis correctly.

For a given sample size, we should verify whether we have adequate power in the test, ANOVA or designed experiment. Similarly, for a given power, we should verify whether we have adequate sample size.

The Sample Size depends on difference δ and standard deviation σ.

Larger ratios of δ/σ require smaller sample sizes and smaller δ/σ ratios require larger sample sizes.

A software application can perform calculation in case of Z, T, proportion, and ANOVA. It can also perform sample size calculation in some more situations.

Goodness-of-fit tests can be used to assess whether data belongs to a particular distribution such as normal, Poisson, etc.

References

Anderson, David R., Dennis J. Sweeney and Thomas A. Williams (2007). Statistics for Business and Economics. New Delhi: Thomson South-Western Division of Thomson Learning Inc.

Breyfogle III, Forrest W. (1999). Implementing Six Sigma—Smarter Solutions Using Statistical Methods. New York, NY: John Wiley & Sons.

Johnson, Richard (2005). Miller and Freund's Probability and Statistics for Engineers. Upper Saddle River, NJ: Prentice Hall Inc.

Montgomery, Douglas C. (2001). Statistical Quality Control. New York, NY: John Wiley & Sons.