Chapter 12: Theory of Sampling – Biostatistics

Chapter 12

Theory of Sampling

Objectives

After completing this chapter, you will understand the following:

  • The definition, meaning and significance of sampling and its distribution.
  • The concept related to different methods of sampling with examples.
  • The concept of large and small samples.
  • The need for sampling in biological decision making situations.
  • The standard error concept and its importance.
  • The estimation of population parameters with the help of sample statistic.
12.1 INTRODUCTION

In this chapter we discuss the concepts of sampling and sampling distributions, which form the basis of statistical estimation and hypothesis testing. The main purpose of sampling is to allow us to use the information gathered from the sample to draw inferences about the entire population. One can define a population as a collection of objects having a certain well-defined set of attributes. A sample is any subset of a given population. It is possible to estimate the population parameters from the sample statistics with the help of statistical methods and concepts. This falls under the category of statistical inference [inductive statistics]. The inferential process is not error free, because the estimation or inference is based on the limited data obtained from samples.

We should evaluate such errors in order to have a measure of confidence in our inferences. If we take random samples, these errors occur randomly and thus the same can be computed probabilistically.

In this chapter, we develop the concepts of sampling to describe sampling distributions for various sample statistics such as the sample mean and the sample proportion, and introduce well-known sampling distributions such as the chi-square, F, t and standard normal distributions. These distributions fit certain sample statistics very well and play a major role in estimation and hypothesis testing.

12.2 WHY SAMPLE?

In many situations, even though we are very much interested in some specific characteristic of a specific population, we cannot physically examine the entire population due to cost, time or other limitations. In such instances, we examine a part of the population by means of a sample, with the expectation that the sample will be representative of the population under study.

12.3 HOW TO CHOOSE IT?

One way is to use simple random sampling, which gives all samples of the specified size an equal chance of being selected. Based on the given random sample, one can find a sample statistic such as the mean or variance; the same can be used to estimate the corresponding population parameter. Every statistic is a random variable having its own probability distribution. The probability distribution of a sample statistic is known as its sampling distribution. It has well-defined properties like any probability model. Based on these properties one can evaluate the chance errors involved in drawing inferences from a sample.

12.4 SAMPLE DESIGN

It is a procedure or plan for obtaining a sample from a prescribed population prior to collecting any data.

12.5 KEY WORDS AND NOTATIONS

Population: Collection of objects having certain well-defined set of attributes.

Example:

  • The population of affiliated colleges in Tamil Nadu.
  • The population of government hospitals in Tamil Nadu.

Sample: It is a portion of the population.

Example:

  • Collection of affiliated colleges in Tamil Nadu with minority status.
  • Collection of government hospitals only in Chennai.

Parameter: It refers to a characteristic of the population.

Example:

  • Population mean, population SD etc.

Statistic: It refers to the characteristics of the sample.

Example:

  • Sample mean, sample SD etc.

Degrees of freedom: It is the number of values that can be chosen freely out of ‘n’ items when their total is fixed. It is [n – 1] and is denoted by df.

Example:

Select three integers in such a way that their sum is 100.

 

40 + 10 + 50 = 100

 

One can freely choose only two of the items; the third value cannot be chosen freely. If you select 40 and 10, the third value must be 50.

Degrees of freedom = df = 3 – 1 = 2.

Census: It refers to the complete enumeration of the population.

Notations:

N - population size

μ - population mean

σ - population SD

P - population proportion

n - sample size

x̄ - sample mean

s - sample SD

p - sample proportion

R - population correlation coefficient

r - sample correlation coefficient

Sample survey: The process of partial enumeration is called a sample survey.

12.6 ADVANTAGES AND DISADVANTAGES OF SAMPLING

Advantages

  • Less time is needed to study the sample than the population.
  • The cost of the analysis is lower, and in most situations sampling gives adequate information.
  • The data collected from a sample can be more reliable than the data collected from the entire population.

Disadvantages

  • There is always a possibility of error, since only a part of the population is examined.
  • High degree of expertise is required while selecting the sample.
12.7 NON RANDOM ERRORS/NON SAMPLING ERRORS

This type of error can occur in two different situations:

  1. Sample is not selected from the corresponding population.
  2. Sample is taken from the pre-defined population, but there is response bias, that is, respondents do not give proper information.
12.8 RANDOM ERRORS/SAMPLING ERRORS

At times even a well-designed sample may not provide an actual representation of the population under study, because a sample is only a portion of the population. Inferences about the parent population based on such a sample can be incorrect.

Such errors are referred to as random errors or sampling errors.

12.9 TYPES OF SAMPLE

A sample can be classified into two major categories.

  1. Probability sample and
  2. Non-probability sample.

12.9.1 Probability Sample

If the probability of selection of each member into a sample is non-zero, then the resulting sample is said to be a probability sample.

12.9.2 Non-probability Sample

If a sample is not a probability sample, then it is said to be a non-probability sample.

Normally the sampling is based on two specific principles.

Principle 1: Law of statistical regularity

This law implies that a reasonably large number of items selected at random from a population will, on average, have the characteristics of the population.

Principle 2: Law of inertia of large numbers

This law states that the larger the sample, the closer the inference will be to the actual value.

Different methods of sampling

12.10 RANDOM SAMPLING

According to N.M. Harper, ‘it is a sample selected in such a way that every item in the population has an equal chance of being included’. In general, it is the process of selecting a sample from a population in such a way that every item of the population has an equal chance of being included in the sample.

Example:

  • Selection of any five members out of a group containing 20 members will constitute a random sample.
  • Selection of any 4 cards out of a well-shuffled pack of 52 cards will constitute a random sample.

Notations:

Population size: N
Sample size: n [n < N]
Number of possible samples: m = NCn
Different samples: S1, S2, …, Sm
P [Selecting a sample]: 1/m

 

In other words, simple random sampling refers to a process which ensures that each possible sample of size n [S1, S2, …, Sm] has an equal probability of being the chosen sample.

The simple random sampling method can be adopted with or without replacement of the items selected. In practice, sampling is usually done without replacement. While selecting a simple random sample, we must use some specific method to ensure true randomness. One such method involves the use of random numbers. Usage of random numbers ensures that every element in the population has an equal and independent chance of being selected.

Example: 1

Let us consider the production record on a particular day of the employees of a Firm Bhavana Sree Ltd. along with the employee numbers.

E. No. – Employee Number; Prod. – Production

We can use the random number table for selecting a simple random sample of size 5, without replacement from the population of 50 employees.

Step 1:

Select 5 two-digit random numbers between 01 and 50 using the random number table.

Step 2:

Select the employees by considering the random number selected as their employee numbers.

If we proceed in the same way, we can create different samples of size 5.

Note:

Since we are sampling without replacement, we do not want to use the same random number twice.
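The two steps above can be sketched in Python, with a pseudo-random number generator standing in for the random number table. This is only an illustration: the function name `simple_random_sample` and the seed value are assumptions, and the production figures are not needed to draw the sample.

```python
import random

# Employee numbers 1..50, as in Example 1.
employee_numbers = list(range(1, 51))

def simple_random_sample(population, n, seed=None):
    """Draw n distinct items without replacement."""
    rng = random.Random(seed)          # seed only makes the run repeatable
    return rng.sample(population, n)   # sample() never picks an item twice

sample = simple_random_sample(employee_numbers, 5, seed=12)
print(sorted(sample))
```

Calling the function again with a different seed gives another of the possible samples of size 5.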

12.10.1 Systematic Sampling

It is a procedure that starts with a random starting point in the population and then includes in the sample every kth element encountered thereafter.

Example: 2

Population size [N]:        100 students

Sample size [n]:               10 students

Sampling ratio = n/N = 10/100 = 1/10

Form 10 different groups according to roll numbers as follows:

Select any one number in G1

[1      2      3      4      5      6      7      8      9      10]

Suppose the selected item is 8. Then in each group select the 8th item.

That is 8, 18, 28, …, 98. The collection of all these elements leads to a sample of size 10. This sample is referred to as a systematic sample.

It is different from simple random sampling, because only the first element is selected randomly. There is a chance of bias if the ordering of the population follows a periodic pattern. This method of selecting a sample is commonly used among the probability sampling designs.
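A minimal Python sketch of the procedure in Example 2 follows. The function name `systematic_sample` and its `start` and `seed` parameters are illustrative assumptions; the sampling interval is taken as k = N/n.

```python
import random

def systematic_sample(N, n, start=None, seed=None):
    """Return the roll numbers start, start + k, start + 2k, ... with k = N // n."""
    k = N // n                               # sampling interval (10 in Example 2)
    if start is None:
        start = random.Random(seed).randint(1, k)   # random start in the first group
    return [start + i * k for i in range(n)]

print(systematic_sample(100, 10, start=8))
# [8, 18, 28, 38, 48, 58, 68, 78, 88, 98]
```

Only the starting point is random; every later element is fixed once the start is known.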

12.10.2 Stratified Sampling

P: Population [Size N ]

P1, P2, P3: Sub-Population [Size N1, N2, N3 and N = N1 + N2 + N3]

S1, S2, S3: Samples from each sub-population of size n1, n2 and n3, respectively.

Divide the single population into several sub-populations called strata. Select a random sample from each stratum. The stratified sample is then the combination of the samples selected from all the strata into one sample. This sampling technique needs prior knowledge about the population, which helps to partition the population into different strata based on some homogeneous characteristics.

In order to get the maximum information using stratified sampling, the strata must be different from each other but homogeneous within each stratum.

Example: 3

Problem: Determining the faculty preferences for a union in a college.

Population: 100

Specifically, the preferences will differ according to the grades of the teachers. If we take a sample directly from this population, we may not get fruitful results. Instead, split the single population of college teachers into sub-populations based on their grades, select a sample from each stratum and form one big sample by merging all the sub-samples collected from the different strata. There is then a better chance of obtaining fruitful results.

Population: 100

Stratified sample = [S1] ∪ [S2] ∪ [S3] ∪ [S4] ∪ [S5] ∪ [S6]

In stratified sampling, the number of items selected from each stratum is in proportion to its size. This ensures that each stratum is weighted in the sample by the number of elements it contains. It is widely used in managerial applications, because it allows conclusions to be drawn for each stratum separately.
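Proportional allocation can be sketched as below. The six strata sizes are hypothetical (the grade-wise breakdown of the 100 teachers is not given here), and the names `proportional_allocation` and `stratified_sample` are assumptions made for the illustration.

```python
import random

# Hypothetical strata sizes: N = 100 teachers in six grades.
strata_sizes = {"G1": 10, "G2": 15, "G3": 20, "G4": 25, "G5": 20, "G6": 10}

def proportional_allocation(strata_sizes, n):
    """n_i = n * N_i / N, rounded to the nearest integer."""
    N = sum(strata_sizes.values())
    return {name: round(n * size / N) for name, size in strata_sizes.items()}

def stratified_sample(strata_sizes, n, seed=None):
    """Draw a simple random sample from every stratum and keep them by stratum."""
    rng = random.Random(seed)
    allocation = proportional_allocation(strata_sizes, n)
    sample = {}
    for name, size in strata_sizes.items():
        members = [f"{name}-{i}" for i in range(1, size + 1)]   # member labels
        sample[name] = rng.sample(members, allocation[name])
    return sample

allocation = proportional_allocation(strata_sizes, 20)
print(allocation)
# {'G1': 2, 'G2': 3, 'G3': 4, 'G4': 5, 'G5': 4, 'G6': 2}
```

With other strata sizes the rounded allocations may not add up exactly to n, so a small adjustment of one stratum can be needed.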

12.10.3 Multi-stage Sampling

As the name indicates the selection process of this type of sample contains different stages.

Stage 1:

Population is divided into different groups called first stage units.

Stage 2:

The first stage units are then divided into smaller groups, called second stage units.

Stage 3:

The second stage units are divided into smaller groups, called third stage units.

This staging process will go on until a sample of required number is attained.

Example: 4

Population: Group of institutions

I1
I2
I3
I4
I5
I6

I: Each institution contains different department.

D: Each department contains different courses.

First stage units: [I1, I2, ..., I6]

Second stage units: [I1[D1, D2, ...D6], ...]

Third stage units: [[I1, D1][C1, C2, C3], ...]

Select a sample from the first stage units using a proper method. Then select a sample of second stage units from within the selected first stage units, and repeat the same procedure from stage to stage until the required sample size is reached. This method of selecting a sample is very useful in the case of a very large population.
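The staging in Example 4 can be sketched as follows. The numbers of units drawn at each stage (2 institutions, 2 departments per institution, 1 course per department) and the seed are assumptions chosen only for illustration.

```python
import random

rng = random.Random(7)   # fixed seed so that the run is repeatable

institutions = [f"I{i}" for i in range(1, 7)]   # first stage units I1..I6
departments = [f"D{j}" for j in range(1, 7)]    # D1..D6 within each institution
courses = [f"C{k}" for k in range(1, 4)]        # C1..C3 within each department

# Stage 1: sample institutions.
stage1 = rng.sample(institutions, 2)
# Stage 2: sample departments only within the selected institutions.
stage2 = [(inst, dept) for inst in stage1 for dept in rng.sample(departments, 2)]
# Stage 3: sample one course within each selected department.
stage3 = [(inst, dept, rng.choice(courses)) for inst, dept in stage2]

print(stage3)
```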

12.11 NON-RANDOM SAMPLING METHODS

Probability sampling needs a list of all sampling units, which is not available in all cases. To overcome this situation, we seek the help of non-random sampling techniques.

12.11.1 Convenience Sampling

In this type of sampling, the selection of the sample is left entirely to the convenience of the researcher. The cost of selecting a convenience sample is very low in comparison with probability sampling. On the other hand, it suffers from excessive bias, which in turn leads to errors that cannot be quantified. It is commonly used in public opinion surveys, demand analysis, shopping centre surveys etc.

Convenience sampling is generally used in exploratory studies or when representing the population is not a critical factor.

12.11.2 Purposive Sampling

If we select elements from the population based on certain characteristics, then the resulting sample is known as a purposive or judgment sample.

Population of students

Among the 100 students of a class, the sample is selected only from the students who are members of an extracurricular group.

12.11.3 Quota Sampling

If a defined proportion of elements is to be selected from the population based on certain characteristics, the method is referred to as quota sampling.

Example: 5

Population: 1000 customers

Top income group [TIG] 20%
Middle income group [MIG] 30%
Low income group [LIG] 50%

Out of this population select a sample of size 100, in such a way that

Sample: 100 customers

Top income group [TIG] 30%
Middle income group [MIG] 30%
Low income group [LIG] 40%

This type of sampling is often used in conducting public opinion polls, such as predicting consumer preferences in market research studies and public opinions regarding political issues and candidates. There is a chance of reducing the bias in this case. It is very easy to adopt and costs less.

12.11.4 Cluster Sampling

It requires the prior knowledge about the population. The population is to be partitioned into different groups called clusters; the formation of clusters is based on some characteristics.

Step 1:

Form the clusters.

Step 2:

Select few clusters at random.

Step 3:

Select the elements at random from the randomly selected clusters.

The resulting sample is referred to as a cluster sample.

Example: 6

Population: 1000 students

Clusters formed based on discipline.

Department of Mathematics: 50
Department of Computer Science: 100
Department of Management: 500
Department of Fashion: 150
Department of Bio-Tech.: 50
Department of Interior Design: 150

Among the clusters randomly select any two clusters.

Department of Fashion: 150
Department of Computer Science: 100

Select few elements randomly out of these two randomly selected clusters.

Department of Fashion: 5
Department of Computer Science: 15

The above-mentioned sample is said to be a cluster sample of size 20.
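The three steps of cluster sampling can be sketched in Python using the department sizes of Example 6. The function name `cluster_sample`, the sampling fraction of 10% within each chosen cluster and the seed are assumptions; the example itself draws 5 and 15 students.

```python
import random

# Cluster sizes from Example 6 (1000 students in total).
cluster_sizes = {
    "Mathematics": 50, "Computer Science": 100, "Management": 500,
    "Fashion": 150, "Bio-Tech.": 50, "Interior Design": 150,
}

def cluster_sample(cluster_sizes, n_clusters, fraction, seed=None):
    """Pick clusters at random, then pick students at random inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(cluster_sizes), n_clusters)    # Step 2
    sample = {}
    for name in chosen:                                       # Step 3
        students = range(1, cluster_sizes[name] + 1)          # student serial numbers
        k = max(1, round(fraction * cluster_sizes[name]))
        sample[name] = rng.sample(students, k)
    return sample

picked = cluster_sample(cluster_sizes, n_clusters=2, fraction=0.1, seed=5)
print({name: len(members) for name, members in picked.items()})
```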

12.11.5 Sequential Sampling

Samples are selected one after another based on the outcome of the previous samples.

This type of sampling method is used in the statistical quality control department very often.

12.12 SAMPLING DISTRIBUTIONS

We can define a sampling distribution as follows. The distribution of all possible values that can be assumed by some statistic, evaluated from samples of the same size randomly drawn from some population, is called the sampling distribution of that statistic.

Population: N

From the population of size N, draw different samples of size n [n < N] randomly. Let the samples be [s1, n], [s2, n], …, [sk, n].

With the sample data it is possible to evaluate the sample statistics such as sample mean, sample SD etc.

Sampling distribution based on the sample means:

Consider all the sample means x̄1, x̄2, …, x̄k.

Construct a frequency distribution based on the means of the samples.

Means of sample Frequency
   
   
   
   
   

The resulting distribution based on the means of the samples is referred to as the sampling distribution of the sample mean. For the constructed distribution, it is possible to evaluate measures such as the mean, SD etc.

The mean is said to be the mean of the sample means. The standard deviation of this sampling distribution based on mean is known as the standard error [SE] of the distribution.

In the same way, one can construct a sampling distribution based on the SD of the samples.

SDs of sample Frequency
   
   
   
   
   

Likewise for every statistic of the sample it is possible to construct different sampling distribution.

Example: 7

Population: Weekly expense of five families

Collect all possible samples of size exactly 2. Also evaluate the sample means and SDs as well as the mean and SD of the population. Since N = 5 and n = 2, we can have 5C2 = 10 samples of size 2.

Sample no.   Sample data   Sample mean
01           45, 40        42.5
02           45, 47        46.0
03           45, 35        40.0
04           45, 33        39.0
05           40, 47        43.5
06           40, 35        37.5
07           40, 33        36.5
08           47, 35        41.0
09           47, 33        40.0
10           35, 33        34.0
Total                      400

Construction of a sampling distribution

Mean of the population = 40

SD of the population = 5.44

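Consider all the sample means and the associated sampling distribution of x̄: each of the 10 samples is equally likely, so every sample mean listed in the table occurs with probability 1/10.

We now evaluate E[x̄] and Var[x̄]:

$E[\bar{x}] = \dfrac{400}{10} = 40 = \mu$

$Var[\bar{x}] = \dfrac{1}{10}\sum_{i=1}^{10} (\bar{x}_i - 40)^2 = \dfrac{111}{10} = 11.1$

The standard error is therefore $SE[\bar{x}] = \sqrt{11.1} = 3.33$.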
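The enumeration in Example 7 can be checked with a short Python sketch. The five population values are read off the listed samples (the population table itself is not reproduced here), and the variable names are illustrative.

```python
from itertools import combinations
from statistics import mean, pvariance

population = [45, 40, 47, 35, 33]      # weekly expenses of the five families
N, n = len(population), 2

# Means of all 5C2 = 10 samples of size 2 drawn without replacement.
sample_means = [mean(s) for s in combinations(population, n)]

mu = mean(population)                    # population mean
sigma2 = pvariance(population)           # population variance (divisor N)
mean_of_means = mean(sample_means)       # E[x-bar]
var_of_means = pvariance(sample_means)   # Var[x-bar]

# Finite-population form: sigma^2 / n * (N - n) / (N - 1)
formula = sigma2 / n * (N - n) / (N - 1)

print(mu, mean_of_means, round(var_of_means, 4), round(formula, 4))
```

The mean of the sample means equals the population mean, and the variance of the sample means agrees with σ²/n multiplied by the correction factor (N – n)/(N – 1).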

12.13 NEED FOR SAMPLING DISTRIBUTION

We can draw the inferences about the population parameters based on the sample statistics only. In addition to the sample statistic, if we know the probability distributions with respect to the sample statistic, it is possible for us to calculate the probability when the sample statistic assumes any specific value. This characteristic is very much needed in all statistical inferences.

Note:

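The variance of the sampling distribution of the sample mean is equal to the variance of the population divided by the size of the sample, with a correction factor when the population is finite and sampling is done without replacement.

Case: 1 ; when the population size is infinite: $Var[\bar{x}] = \dfrac{\sigma^2}{n}$

Case: 2 ; when the population size is finite: $Var[\bar{x}] = \dfrac{\sigma^2}{n} \cdot \dfrac{N - n}{N - 1}$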

Central limit theorem

For a population P: [μ, σ, N] and a sufficiently large sample size n [n ≥ 30], the sampling distribution of the sample mean [x̄] is approximately a normal distribution with mean μ and standard deviation σ/√n.
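A small simulation can illustrate the theorem. The exponential population with mean 1 and SD 1, the number of repetitions and the seed are all assumptions chosen for the sketch; any population with finite variance could be used instead.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)       # fixed seed so that the run is repeatable
n, repeats = 30, 5000        # sample size and number of samples drawn

# Hypothetical skewed population: exponential with mu = 1 and sigma = 1.
sample_means = [
    mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(repeats)
]

print(round(mean(sample_means), 3))     # close to mu = 1
print(round(pstdev(sample_means), 3))   # close to sigma / sqrt(n), about 0.183
```

Although the population is far from normal, a histogram of `sample_means` would be roughly bell-shaped around 1.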

Note:

The same holds good for the sample proportion also.

Relationship between the sample statistics with the population parameter

  • The mean of all possible sample means will be exactly equal to the universe mean.
  • The mean of all possible sample SDs will be approximately equals to ; where n is the sample size.

Note:

While evaluating the sample variance, we use the relation

$s^2 = \dfrac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

Here we use [n – 1] in the division instead of [n].

This is done for a technical reason, in order to have E[s²] = σ².

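Show that the sample variance s² is an unbiased estimator of the population variance σ².

Case: 1

For a sample from an infinite population having a normal distribution, we know that the statistic $(n - 1)s^2/\sigma^2$ follows a chi-square distribution with [n – 1] degrees of freedom, whose expected value is [n – 1]:

$E\left[\dfrac{(n - 1)s^2}{\sigma^2}\right] = n - 1$

This implies that E[s²] = σ².

The sample variance s² is an unbiased estimator of σ² for infinite populations having normal distributions.

Case: 2

For samples from infinite populations in general,

$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (x_i - \mu)^2 - n(\bar{x} - \mu)^2$   [1]

Taking expectation on both sides of [1], we have

$E\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = n\sigma^2 - n \cdot \dfrac{\sigma^2}{n} = (n - 1)\sigma^2$

Hence it is obvious that

$E[s^2] = \dfrac{1}{n - 1} E\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \sigma^2$

and the sample variance is thus an unbiased estimator of σ² for an infinite population in general.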
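The role of the divisor [n – 1] can be checked numerically. The sketch below reuses the five weekly expenses of Example 7 and, as an assumption, draws the samples with replacement so that the observations are independent and identically distributed, as the argument above requires.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [45, 40, 47, 35, 33]    # weekly expenses from Example 7
sigma2 = pvariance(population)       # population variance, divisor N

# All 25 ordered samples of size 2 drawn with replacement.
samples = list(product(population, repeat=2))

mean_s2_unbiased = mean(variance(s) for s in samples)   # divisor n - 1
mean_s2_biased = mean(pvariance(s) for s in samples)    # divisor n

print(round(sigma2, 4), round(mean_s2_unbiased, 4), round(mean_s2_biased, 4))
```

Averaged over all samples, the version with divisor [n – 1] reproduces σ², while the version with divisor n gives only [(n – 1)/n]σ².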

12.14 STANDARD ERROR FOR DIFFERENT SITUATIONS

12.14.1 When the Population Size Infinite

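  1. Standard error [SE] of the specified sample mean [x̄]: $SE[\bar{x}] = \dfrac{\sigma}{\sqrt{n}}$
  2. Standard error [SE] of the difference of two sample means [x̄1 – x̄2]: $SE[\bar{x}_1 - \bar{x}_2] = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
  3. Standard error [SE] of the specified sample SD [s]: $SE[s] = \dfrac{\sigma}{\sqrt{2n}}$
  4. Standard error [SE] of the difference of two sample SDs [s1 – s2]: $SE[s_1 - s_2] = \sqrt{\dfrac{\sigma_1^2}{2n_1} + \dfrac{\sigma_2^2}{2n_2}}$
  5. Standard error [SE] of the specified sample proportion [p]: $SE[p] = \sqrt{\dfrac{PQ}{n}}$, where Q = 1 – P
  6. Standard error [SE] of the difference of two sample proportions [p1 – p2]: $SE[p_1 - p_2] = \sqrt{\dfrac{P_1 Q_1}{n_1} + \dfrac{P_2 Q_2}{n_2}}$
  7. Standard error [SE] of the sample correlation coefficient [r]: $SE[r] = \dfrac{1 - R^2}{\sqrt{n}}$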

12.14.2 When the Population Size is Finite

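Sample is drawn with replacement

  1. Standard error of the specified sample mean [x̄]: refer formula [1] of Section 12.14.1.
  2. Standard error of the specified sample proportion [p]: refer formula [5] of Section 12.14.1.

Sample is drawn without replacement

  1. Standard error [SE] of the specified sample mean [x̄]: $SE[\bar{x}] = \dfrac{\sigma}{\sqrt{n}} \sqrt{\dfrac{N - n}{N - 1}}$
  2. Standard error of the specified sample proportion [p]: $SE[p] = \sqrt{\dfrac{PQ}{n}} \sqrt{\dfrac{N - n}{N - 1}}$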
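The standard errors of the mean and the proportion can be sketched as two small Python helpers. They assume the usual textbook forms σ/√n and √(PQ/n), multiplied by the finite population correction √((N – n)/(N – 1)) when a population size N is supplied; the function names `se_mean` and `se_proportion` are illustrative.

```python
from math import sqrt

def se_mean(sigma, n, N=None):
    """SE of the sample mean; pass N only for sampling without replacement
    from a finite population."""
    se = sigma / sqrt(n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))   # finite population correction
    return se

def se_proportion(P, n, N=None):
    """SE of the sample proportion, with the same optional correction."""
    se = sqrt(P * (1 - P) / n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))
    return se

# Check against Example 14: n = 10, N = 30, P = 0.2.
print(round(se_proportion(0.2, 10, N=30), 3))   # 0.105
```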

12.14.3 Sampling Distribution Based on Sample Means

Consider a random sample of size n, x1, x2, …, xn, drawn from a population with mean μ and variance σ². We know that the sample observations are independent and identically distributed random variables. Then the sample mean is

$\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$

Clearly x̄ is also a random variable, with expected value

$E[\bar{x}] = \dfrac{1}{n}\sum_{i=1}^{n} E[x_i] = \mu$

The variance of x̄ can be given as

$Var[\bar{x}] = \dfrac{1}{n^2}\sum_{i=1}^{n} Var[x_i] = \dfrac{\sigma^2}{n}$

Note: 1

It indicates that the expected value of the sample mean and the actual population mean are one and the same.

Note: 2

This shows that the variability in sample means is less than the population variance, that is, $\dfrac{\sigma^2}{n} < \sigma^2$ for n > 1.

Whenever the sample size is large, the fluctuation will be less from one sample to the other.

Population parameters are estimated from sample data because it is usually not practical to examine the entire population in order to make a perfect evaluation.

Statistical estimation procedures provide the process by which estimates of the population parameters can be evaluated with the degree of confidence needed. This degree of confidence is controllable with respect to the size of the sample and by the type of estimate made.

12.15 POINT AND INTERVAL ESTIMATION
Type of organization      Estimation of interest
Manufacturing industry    Quality of raw materials used for production
Bank                      Mean number of arrivals of customers at the teller’s window

The estimate can be of two types, they are

  1. Point estimates and
  2. Interval estimates.

12.15.1 Point Estimate

It refers to a specific value which is used to estimate the value of the unknown population parameter.

Example:

  • The mean salary of a sample of top-level executives in many firms may be used as a point estimate of the corresponding population mean for top-level executives in all firms.
  • The percentage of employed women who prefer Cinthol brand soap over all other brands may be used as an estimate of the corresponding population percentage of all employed women.

Similarly, when we use the sample mean to estimate the population mean, the sample SD to estimate the population SD and so on, in each case we use a point estimate of the parameter.

Estimate and estimator

An estimator is a random variable, and its numerical value is an estimate.

Population parameter   Estimator [sample statistic]   Estimate [value of estimator]
Mean – μ               x̄                              x̄ = 100
Variance – σ²          s²                             s² = 50

12.15.2 Properties of Good Point Estimators

The criteria for good point estimators are

  1. Unbiasedness
  2. Relative efficiency
  3. Consistency and
  4. Sufficiency

Unbiasedness

An estimator is unbiased, if its expected value is equal to the population parameter being estimated.

Relative efficiency

It refers to the sampling variability of an estimator.

If two estimators of a given population parameter are both unbiased, the one with the smaller variance for a given sample size is defined as being relatively more efficient. If e1 and e2 are two unbiased estimators of the parameter e, then the relative efficiency of e1, with respect to e2 is defined as [assume that Var[e1] < Var[e2]].

Consistency

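An estimator is said to be consistent if the probability that it lies arbitrarily close to the parameter being estimated approaches 1 as n approaches infinity, that is, $P[|e_1 - e| < \varepsilon] \to 1$ as $n \to \infty$ for every ε > 0.

e1 – Sample estimator

e – Population parameter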

Sufficiency

An estimator e1 is said to be a sufficient estimator, if it uses all the information contained in the sample, to estimate the population parameter.

12.16 INTERVAL ESTIMATE

An interval estimate of a population parameter is the specification of two values between which, with a certain degree of confidence, the actual population parameter lies. It is otherwise called confidence interval estimation. To evaluate it, we require the value of the confidence level or the level of significance.

Population parameter: μ

Sample statistics: x̄, s, n

Level of significance: 5%

Test statistic: Ζ

Table value of the test statistic: Zt

 

Z0.05 = 1.96 [2-tailed test]

 

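Then the interval estimation of the population parameter μ can be defined as $\bar{x} \pm Z_t \cdot SE[\bar{x}]$, where $SE[\bar{x}] = \dfrac{\sigma}{\sqrt{n}}$ if σ is known; if not, $SE[\bar{x}] = \dfrac{s}{\sqrt{n}}$.

Then $\mu: \bar{x} \pm 1.96 \dfrac{s}{\sqrt{n}}$ ; [since σ is not known]

There is a 95% confidence level for the population parameter μ to lie in the interval $\left[\bar{x} - 1.96 \dfrac{s}{\sqrt{n}},\ \bar{x} + 1.96 \dfrac{s}{\sqrt{n}}\right]$.

This clearly indicates that there is a 5% chance for the population mean μ not to lie in the defined interval estimate.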

12.17 CONFIDENCE INTERVAL ESTIMATION FOR LARGE SAMPLES

For business applications it is not sufficient merely to consider a single point estimate of the population parameter. Instead we require an estimation procedure that permits some error in the estimate with a given level of accuracy. In classical inference such a method is known as confidence interval estimation. We discuss it with respect to the population mean as the parameter of interest.

Consider the sampling distribution of x̄ [mean] of random samples of size n from a normal population with mean μ and known variance σ², that is, N[μ, σ²]. The same can be expressed in the standard form with respect to the Z-statistic as

$Z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

If we permit the error percentage as α, we say the level of significance is α.

We can assert with probability [1 – α] that the standard normal random variable Z will lie between –Zα and +Zα.

The same can be written symbolically,

$P\left[\bar{x} - Z_\alpha \dfrac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_\alpha \dfrac{\sigma}{\sqrt{n}}\right] = 1 - \alpha$   [1]

Equation [1] reveals that μ is contained in the interval between $\bar{x} - Z_\alpha \sigma/\sqrt{n}$ and $\bar{x} + Z_\alpha \sigma/\sqrt{n}$ with probability equal to [1 – α]. The interval is referred to as the confidence interval for μ, and [1 – α] is called the degree of confidence.

Hence, the probability of the value of μ to lie in the interval $\bar{x} \pm Z_\alpha \sigma/\sqrt{n}$ is [1 – α].

Note:

If the sample size is large enough, say n ≥ 30, then the sample is said to be a large sample. If not, it is referred to as a small sample [n < 30].

Example: 8

As a part of the National Health and Nutrition Examination Survey [NHANES], haemoglobin levels were checked for a sample of 1139 men aged 70 and over. The sample mean was 145.3 g/L and the standard deviation was 12.87 g/L. Use these data to construct a 95% confidence interval for μ.

Step 1:

Given α = 0.05                       [since 1 – 0.95 = 0.05]

s = 12.87 g/L;     n = 1139;     x̄ = 145.3 g/L

Since, n = 1139 > 30; it refers a large sample.

According to the standard normal table when α = 0.05, the value of Zα = Z0.05 = 1.96.

Step 2:

The interval estimation can be given as x̄ ± Zt * SE[x̄].

Step 3:

$SE[\bar{x}] = \dfrac{s}{\sqrt{n}} = \dfrac{12.87}{\sqrt{1139}} = 0.3813$

Step 4:

Using the values of x̄, Zα and SE[x̄], we have

$\mu: 145.3 \pm 1.96 \times 0.3813 = 145.3 \pm 0.7474$

The required confidence interval of estimation with 95% confidence level for the average haemoglobin level is

μ: [144.5526, 146.0474].

Note:

There is a very close association between the length of the interval in which μ lies and the level of significance α. Whenever α decreases, the length of the interval in which μ lies increases.

If we want to increase the chance of the value of μ lying in the estimated interval, we choose a smaller α.

Suppose for the above problem we take α = 0.0027.

We have Zα = Z0.0027 = 3.

Hence the interval estimation becomes,

$\mu: 145.3 \pm 3 \times 0.3813 = 145.3 \pm 1.1440$

Since α = 0.0027, there is a 99.73% chance for the population mean μ to lie in the interval [144.1560, 146.4440].

Note: 1

It is obvious that in the above problem the interval estimation when α = 0.05 lies well within the interval estimation when α = 0.0027.

That is [144.1560, [144.5526, 146.0474], 146.4440].

Note: 2

When σ is not known, we can make use of the sample SD [s]. Then the interval estimation formula reduces to $\bar{x} \pm Z_\alpha \dfrac{s}{\sqrt{n}}$.

Confidence limits for μ, [μ1 – μ2], Ρ and [Ρ1 – Ρ2] for large random sample

SE, Standard Error; CL, Confidence Limits; α = 10%; Z0.1 = 1.645.

Example: 9

Researchers measured the bone mineral density of the spines of 94 women who had taken the drug CEE. The mean was 1.016 g/cm² and the standard deviation was 0.155 g/cm². A 95% confidence interval for the mean is [0.984, 1.048]. True or false?

Step 1:

Given α = 0.05

s = 0.155;      n = 94;      x̄ = 1.016

Since, n = 94 > 30; it refers a large sample.

According to the standard normal table when α = 0.05, the value of Zα = Z0.05 = 1.96.

Step 2:

The interval estimation can be given as x̄ ± Zα * SE[x̄].

Step 3:

$SE[\bar{x}] = \dfrac{s}{\sqrt{n}} = \dfrac{0.155}{\sqrt{94}} = 0.01599$

Step 4:

Using the values of x̄, Zα and SE[x̄], we have

 

μ: 1.016 ± 1.96 × 0.01599

Step 5:

The required confidence interval of estimation with 95% confidence level is μ: [0.9847, 1.0473]

The given interval agrees with the evaluated one up to rounding, so the statement is true. There is a 95% chance for the population mean to lie in the interval [0.9847, 1.0473].
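The calculation in Example 9 can be reproduced with a few lines of Python. The function name `z_interval_mean` is an illustrative assumption, and the table value z = 1.96 is passed in rather than computed.

```python
from math import sqrt

def z_interval_mean(xbar, s, n, z):
    """Large-sample confidence interval: xbar +/- z * s / sqrt(n)."""
    se = s / sqrt(n)
    return xbar - z * se, xbar + z * se

lower, upper = z_interval_mean(xbar=1.016, s=0.155, n=94, z=1.96)
print(round(lower, 4), round(upper, 4))   # 0.9847 1.0473
```

Passing a larger table value, for example z = 2.58 for a 99% level, widens the interval around the same sample mean.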

12.18 CONFIDENCE INTERVALS FOR DIFFERENCE BETWEEN MEANS

Example: 10

The following table summarizes the sucrose consumption [mg in 30 minutes] of black blowflies injected with Pargyline or saline [control].

      Saline   Pargyline
n     900      905
x̄     14.9     46.5
s     5.4      11.7

Construct [a] a 95% confidence interval and [b] a 90% confidence interval for the difference in population means.

Step 1:

Given α = 0.05,

Since, both the samples are large, the table value of Z0.05 = 1.96

 

Sample-1 [blowflies injected with saline]: n1 = 900; x̄1 = 14.9; s1 = 5.4
Sample-2 [blowflies injected with Pargyline]: n2 = 905; x̄2 = 46.5; s2 = 11.7

Population – 1: Mean = μ1
Population – 2: Mean = μ2

 

Step 2:

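The interval estimation can be given as $[\bar{x}_2 - \bar{x}_1] \pm Z_\alpha \cdot SE[\bar{x}_2 - \bar{x}_1]$, where

$SE[\bar{x}_2 - \bar{x}_1] = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}} = \sqrt{\dfrac{5.4^2}{900} + \dfrac{11.7^2}{905}} = 0.4286$

Step 3:

Using the values of x̄1, x̄2, Zα and SE, we have

$[\mu_2 - \mu_1]: [46.5 - 14.9] \pm 1.96 \times 0.4286 = 31.6 \pm 0.84$

Step 4:

Thus, 30.76 and 32.44 are the lower and upper bounds, respectively, of the 95% confidence interval for [μ2 – μ1].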
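Both parts of Example 10 can be sketched in Python as below. The function name `z_interval_diff_means` is an assumption, the difference is taken as Pargyline minus saline, and the table values 1.96 and 1.645 are passed in directly.

```python
from math import sqrt

def z_interval_diff_means(x1, s1, n1, x2, s2, n2, z):
    """Large-sample interval for mu2 - mu1 using SE = sqrt(s1^2/n1 + s2^2/n2)."""
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    diff = x2 - x1
    return diff - z * se, diff + z * se

ci95 = z_interval_diff_means(14.9, 5.4, 900, 46.5, 11.7, 905, z=1.96)
ci90 = z_interval_diff_means(14.9, 5.4, 900, 46.5, 11.7, 905, z=1.645)

print(round(ci95[0], 2), round(ci95[1], 2))   # 30.76 32.44
print(round(ci90[0], 2), round(ci90[1], 2))
```

The 90% interval is narrower than the 95% interval, since a lower confidence level needs a smaller table value.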

12.19 ESTIMATING A POPULATION PROPORTION

Example: 11

In a sample of 400 people from a village, 230 are found to be vegetarians and the rest non-vegetarians. Estimate the population proportion at the 5% level of significance.

Step 1:

Given α = 0.05

Since the sample is large, the table value of Z0.05 = 1.96

Sample proportion p = 230/400 = 0.575; q = 1 – p = 0.425; n = 400

Step 2:

The interval estimation for the population proportion can be given as

 

p ± Zα* SE[p]

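Step 3:

$SE[p] = \sqrt{\dfrac{pq}{n}} = \sqrt{\dfrac{0.575 \times 0.425}{400}} = 0.0247$

Step 4:

Using the values of p, Zα and SE[p], we have

$P: 0.575 \pm 1.96 \times 0.0247 = 0.575 \pm 0.048$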

Step 5:

There is a 95% chance for the population proportion to lie in the interval [0.527, 0.623].
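The steps of Example 11 can be checked with a short Python sketch. The function name `z_interval_proportion` is an illustrative assumption, and the table value z = 1.96 is supplied by hand.

```python
from math import sqrt

def z_interval_proportion(x, n, z):
    """Large-sample interval for a proportion: p +/- z * sqrt(p * q / n)."""
    p = x / n                       # sample proportion
    se = sqrt(p * (1 - p) / n)      # SE[p] with q = 1 - p
    return p - z * se, p + z * se

lower, upper = z_interval_proportion(230, 400, z=1.96)
print(round(lower, 3), round(upper, 3))   # 0.527 0.623
```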

Example: 12

A banana cultivator claims that a random sample of 700 bananas contained 45 defective bananas. Estimate the population proportion at the 1% level of significance.

Step 1:

Given α = 0.01

Since the sample is large, the table value of Z0.01 = 2.58.

Step 2:

The interval estimation for the population proportion can be given as

 

p ± Zα* SE[p]

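Step 3:

Sample proportion p = 45/700 = 0.0643; q = 1 – p = 0.9357; n = 700

$SE[p] = \sqrt{\dfrac{pq}{n}} = \sqrt{\dfrac{0.0643 \times 0.9357}{700}} = 0.00927$

Step 4:

Using the values of p, Zα and SE[p], we have

$P: 0.0643 \pm 2.58 \times 0.00927 = 0.0643 \pm 0.0239$

Step 5:

There is a 99% chance for the population proportion to lie in the interval [0.0404, 0.0882].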

Finite population

Example: 13

The central government is interested in evaluating the number of Fortune 500 manufacturing firms that plan to ‘fight inflation’ by following certain voluntary wage-price guidelines. A sample of 100 of the firms is taken, and 20 said they do not follow any of these guidelines.

Determine a 90% confidence interval for the percentage of Fortune 500 firms that do not follow the guidelines.

Step 1:

Given α = 0.1

Since the sample is large and finite, the table value of Z 0.1 = 1.645

Sample proportion p = 20/100 = 0.2; q = 1 – p = 0.8; n = 100; N = 500

Step 2:

The interval estimation for the population proportion can be given as p ± Zα* SE[p]

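Step 3:

$SE[p] = \sqrt{\dfrac{pq}{n}} \sqrt{\dfrac{N - n}{N - 1}} = \sqrt{\dfrac{0.2 \times 0.8}{100}} \sqrt{\dfrac{500 - 100}{500 - 1}} = 0.0358$

Step 4:

Using the values of p, Zα and SE[p], we have

$P: 0.2 \pm 1.645 \times 0.0358 = 0.2 \pm 0.0589$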

Step 5:

Thus, 14.11% and 25.89% are the lower and upper bounds, respectively, of the confidence interval.

Example: 14

A random sample of size 10 is drawn without replacement from a finite population of 30 units. If the number of defective units in the population be 6, find the SE[p].

Step 1:

Given: n = 10

      N = 30 [finite population]

      P = 6/30 = 1/5 = 0.2

      Q = 1 – P = 0.8

Step 2:

SE[p] = √[PQ/n] × √[[N – n]/[N – 1]]

Step 3:

SE[p] = √[0.2 × 0.8/10] × √[20/29] = 0.1265 × 0.8305 = 0.105

The value of SE[p] is 0.105.
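
The finite-population results of Examples 13 and 14 can be checked with a short Python sketch. It assumes the usual finite population correction factor √[[N – n]/[N – 1]] applied to √[pq/n]; the function name `se_proportion_finite` is illustrative only.

```python
import math

def se_proportion_finite(p, n, N):
    """Illustrative helper: SE[p] when sampling without replacement from N units."""
    q = 1 - p
    # sqrt(p*q/n) is the usual SE; the second factor is the finite population correction
    return math.sqrt(p * q / n) * math.sqrt((N - n) / (N - 1))

se14 = se_proportion_finite(0.2, 10, 30)      # Example 14
se13 = se_proportion_finite(0.2, 100, 500)    # Example 13
low, high = 0.2 - 1.645 * se13, 0.2 + 1.645 * se13

print(round(se14, 3))                 # 0.105
print(round(low, 4), round(high, 4))  # 0.1411 0.2589
```

When n is small relative to N the correction factor is close to 1, so the finite and infinite population results nearly coincide.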

12.20 ESTIMATING THE INTERVAL BASED ON DIFFERENCE BETWEEN TWO PROPORTIONS

Example: 15

A sample survey of citizens in Village-A shows that, out of 1000 members interviewed, 420 were vegetarians. In another survey, conducted in Village-B, 370 out of 1000 members were vegetarians. Construct a 99% confidence interval for the true difference in the proportion of vegetarians in the two villages.

Step 1:

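Given, α = 0.01; the table value of Z0.01 = 2.58

Sample-1: n1 = 1000; p1 = 420/1000 = 0.42; q1 = 1 – p1 = 0.58

Sample-2: n2 = 1000; p2 = 370/1000 = 0.37; q2 = 1 – p2 = 0.63

Step 2:

The interval estimation for the difference between two proportions can be given as

[p1 – p2] ± Zα * SE[p1 – p2]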

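Step 3:

SE[p1 – p2] = √[p1q1/n1 + p2q2/n2] = √[0.42 × 0.58/1000 + 0.37 × 0.63/1000] = √0.0004767 = 0.0218

Step 4:

Using the values of p1, p2, Zα and SE[p1 – p2], we have

[0.42 – 0.37] ± 2.58 × 0.0218 = 0.05 ± 0.0562

Hence, [p1 – p2]: [–0.0062, 0.1062]. The lower bound is negative because a difference between two proportions, unlike a single proportion, can take negative values, so it is not discarded.

Step 5:

Thus, –0.0062 and 0.1062 are the lower and upper bounds, respectively, of the 99% confidence interval for [p1 – p2].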
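
The arithmetic of Example 15 can be sketched in Python. This is an illustration only, assuming the unpooled form SE[p1 – p2] = √[p1q1/n1 + p2q2/n2]; the function name `two_proportion_ci` is hypothetical. The function returns both computed bounds exactly as they come out of the formula, and because it does not round the standard error, its bounds can differ from a hand calculation in the fourth decimal place.

```python
import math

def two_proportion_ci(x1, n1, x2, n2, z):
    """Illustrative helper: [p1 - p2] ± z * SE[p1 - p2] for two large samples."""
    p1, p2 = x1 / n1, x2 / n2
    # Unpooled standard error of the difference between two sample proportions
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Example 15: 420 of 1000 in Village-A, 370 of 1000 in Village-B, Z = 2.58
low, high = two_proportion_ci(420, 1000, 370, 1000, 2.58)
print(round(low, 4), round(high, 4))
```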

12.21 CONFIDENCE INTERVAL ESTIMATION FOR SMALL SAMPLE

Example: 16

To study the conversion of nitrite to nitrate in the blood, researchers injected four rabbits with a solution of radioactively labeled nitrite molecules. Ten minutes after injection, they measured for each rabbit the percentage of the nitrite that had been converted to nitrate. The results were as follows.

  1. For these data, calculate the mean, the standard deviation and the standard error of the mean.
  2. Construct a 95% confidence interval for the population mean percentage.

Step 1:

Based on the given data evaluate the sample mean and the SD.

[Refer to Sec. 4.3 and Sec. 5.6]

Mean = x̄ = 51

SD = s = 3.1948

n = 4

∵ n = 4 [< 30], it is a small sample. α = 0.05, df = ν = n – 1 = 4 – 1 = 3. The table value of t[0.05, 3 df] = 3.1825.

Note:

When the t-table is tabulated for a one-tail test, the value for a two-tail interval is read at [α/2]. Here α = 0.05, so the table is read at 0.025.

Step 2:

The interval estimation can be given as,

 

x̄ ± tα[ν] * SE[x̄]

Step 3:

Find SE[x̄]

SE[x̄] = s/√[n – 1] = 3.1948/√3 = 1.8445

Step 4:

Using the values of x̄, tα[ν] and SE[x̄], we have

51 ± 3.1825 × 1.8445 = 51 ± 5.87

Step 5:

The required confidence interval of estimation with 95% confidence level is μ: [45.13, 56.87]

Example: 17

A sample of 20 fruit fly [Drosophila melanogaster] larvae was incubated at 37°C for 30 minutes. It is theorized that such exposure to heat causes polytene chromosomes located in the salivary glands of the fly to unwind, creating puffs on the chromosome arms that are visible under a microscope. A normal probability plot of the data supports the use of a normal curve to model the distribution of puffs. The average number of puffs for the 20 observations was 4.30, with a standard deviation of 2.03. Construct a 95% confidence interval for μ.

Step 1:

Given the data

Sample mean = x̄ = average number of puffs = 4.3

SD = s = 2.03; n = 20

Since n < 30, it is a small sample. α = 0.05, df = 20 – 1 = 19. The table value of t[0.05, 19 df] = 2.093.

Step 2:

The interval estimation can be given as

 

x̄ ± tα[ν] * SE[x̄]

Step 3:

Find SE[x̄]

SE[x̄] = s/√[n – 1] = 2.03/√19 = 0.4657

Step 4:

Using the values of x̄, tα[ν] and SE[x̄], we have

4.3 ± 2.093 × 0.4657 = 4.3 ± 0.9747

Step 5:

The required confidence interval of estimation with 95% confidence level is μ: [3.3253, 5.2747].
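
The small-sample interval can be sketched in Python. The sketch assumes the convention SE = s/√[n – 1], which is the form that reproduces the interval quoted in Example 17; the function name `small_sample_ci` is hypothetical, and the t value is taken from the table rather than computed.

```python
import math

def small_sample_ci(mean, sd, n, t_value):
    """Illustrative helper: mean ± t * SE, assuming SE = sd / sqrt(n - 1)."""
    se = sd / math.sqrt(n - 1)  # assumed small-sample standard error of the mean
    return mean - t_value * se, mean + t_value * se

# Example 17: mean 4.3, SD 2.03, n = 20, table value t[0.05, 19 df] = 2.093
low, high = small_sample_ci(4.3, 2.03, 20, 2.093)
print(round(low, 4), round(high, 4))  # 3.3253 5.2747
```

If the SD has instead been computed with the divisor [n – 1], the equivalent standard error is SD/√n, so it is worth checking which SD a data source reports before using this helper.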

Example: 18

Experimenters test two types of fertilizer for possible use in the cultivation of cabbages. They grow cabbages in two different fields. One of the two fertilizers is applied in each field. At harvest time, they select a random sample of 25 cabbages from the crop grown with fertilizer-1 and a random sample of 12 cabbages from the crop grown with fertilizer-2. The sample mean and variance of the weights of cabbages grown with fertilizer-1 are 44.1 g and 36 g². The mean weight computed from the second sample is 31.7 g and the variance is 44 g². The experimenters assume that the two population weights are normally distributed. They also assume that the two population variances are equal. Compute a 95% confidence interval for [μ1 – μ2].

Step 1:

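Given,

              Sample-1      Sample-2
  Mean        x̄1 = 44.1     x̄2 = 31.7
  Variance    s1² = 36      s2² = 44
  Size        n1 = 25       n2 = 12

Sample-1 and Sample-2 are small samples. α = 0.05, df = ν = n1 + n2 – 2 = 35. The table value of t[0.05, 35 df] = 2.030.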

Step 2:

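The interval estimation can be given as

[x̄1 – x̄2] ± tα[ν] * SE[x̄1 – x̄2], where SE[x̄1 – x̄2] = √[S²[1/n1 + 1/n2]] and S² = [n1s1² + n2s2²]/[n1 + n2 – 2]

S² = [25 × 36 + 12 × 44]/35 = 40.8; SE[x̄1 – x̄2] = √[40.8 × [1/25 + 1/12]] = 2.2432

Step 3:

Using the values of x̄1, x̄2, tα and SE, we have

[44.1 – 31.7] ± 2.030 × 2.2432 ≈ 12.4 ± 4.554

Hence, the required confidence interval of estimation with 95% confidence level based on the difference of two means can be given as [7.8459, 16.9541].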

Example: 19

Ferulic acid is a compound that may play a role in disease resistance in corn. A botanist measured the concentration of soluble ferulic acid in corn seedlings grown in the dark or in a light/dark photoperiod. The results [nmol acid per g tissue] were as shown in the table.

          Dark    Photoperiod
  n       4       4
  x̄       92      115
  S       13      13

Construct a 90% confidence interval for the difference in ferulic acid concentration under the two lighting conditions. [Assume that the two populations from which the data came are normally distributed.]

Step 1:

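Given,

          Sample-1     Sample-2
  Mean    x̄1 = 92      x̄2 = 115
  SD      s1 = 13      s2 = 13
  Size    n1 = 4       n2 = 4

Sample-1 and Sample-2 are small samples. α = 0.10, df = ν = n1 + n2 – 2 = 6. The table value of t[0.10, 6 df] = 1.943.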

Step 2:

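The interval estimation can be given as

[x̄1 – x̄2] ± tα[ν] * SE[x̄1 – x̄2], where SE[x̄1 – x̄2] = √[S²[1/n1 + 1/n2]] and S² = [n1s1² + n2s2²]/[n1 + n2 – 2]

S² = [4 × 13² + 4 × 13²]/6 = 225.33; SE[x̄1 – x̄2] = √[225.33 × [1/4 + 1/4]] = 10.6145

Step 3:

Using the values of x̄1, x̄2, tα and SE, and taking the difference as photoperiod minus dark so that it is positive, we have

[115 – 92] ± 1.943 × 10.6145 = 23 ± 20.624

Hence, the required confidence interval of estimation with 90% confidence level based on the difference of two means can be given as [2.376, 43.624].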
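
The two-sample intervals of Examples 18 and 19 can be reproduced with the following Python sketch. It assumes the pooled variance S² = [n1s1² + n2s2²]/[n1 + n2 – 2], which is the form that matches the quoted answers; the function name `pooled_two_sample_ci` is hypothetical and the t values are table values supplied by hand.

```python
import math

def pooled_two_sample_ci(m1, var1, n1, m2, var2, n2, t_value):
    """Illustrative helper: [m1 - m2] ± t * SE with an assumed pooled variance."""
    # Assumed pooling convention: sample variances weighted by n, divided by n1 + n2 - 2
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = m1 - m2
    return diff - t_value * se, diff + t_value * se

# Example 18 (second mean 31.7 as tabulated in Step 1), t[0.05, 35 df] taken as 2.030
low18, high18 = pooled_two_sample_ci(44.1, 36, 25, 31.7, 44, 12, 2.030)
# Example 19 (photoperiod minus dark), variances 13**2, t[0.10, 6 df] = 1.943
low19, high19 = pooled_two_sample_ci(115, 13**2, 4, 92, 13**2, 4, 1.943)

print(round(low18, 3), round(high18, 3))  # 7.846 16.954
print(round(low19, 3), round(high19, 3))  # 2.376 43.624
```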

Example: 20

A simple random sample of 10 electronics firms is asked in a questionnaire to state the amount of money spent on employee training programme during the year just ended and during a year a decade ago.

Construct a 95% confidence interval for the mean difference in expenditures for employee training programme by the 10 firms.

Step 1:

Based on the given data, find the difference d = x – y for each firm; then find the mean and SD of the values of d.

Note:

We can choose either [x – y] or [y – x] as d, provided that the sum of the d values is positive.

Step 2:

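The interval estimation can be given as

d̄ ± tα[ν] * SE[d̄]

Find SE[d̄] = sd/√[n – 1], where sd is the SD of the values of d and ν = n – 1 = 9.

Step 3:

Substituting the values of d̄, tα and SE[d̄] gives the limits of the interval.

Step 4:

The required confidence interval of estimation with 95% confidence level and 9 df is μd: [–0.0861, 3.0861]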
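
The paired-difference procedure can be sketched in Python as below. The expenditure figures are hypothetical values chosen only to make the sketch runnable; they are not the survey data of Example 20, so the printed interval is not expected to match the one quoted above. The function name `paired_ci` is also an illustrative choice, and the standard error is assumed to be SD[d]/√[n – 1] with the SD computed using the divisor n.

```python
import math
import statistics

def paired_ci(x, y, t_value):
    """Illustrative helper: d_bar ± t * SE[d_bar] for paired observations."""
    d = [a - b for a, b in zip(x, y)]  # one difference per pair
    n = len(d)
    d_bar = statistics.mean(d)
    # pstdev uses the divisor n; dividing it by sqrt(n - 1) equals
    # the divisor-(n - 1) SD divided by sqrt(n).
    se = statistics.pstdev(d) / math.sqrt(n - 1)
    return d_bar - t_value * se, d_bar + t_value * se

# Hypothetical data for 10 firms (assumed values, NOT taken from Example 20)
this_year = [12, 15, 9, 14, 11, 16, 10, 13, 12, 15]
decade_ago = [10, 14, 10, 11, 9, 15, 11, 10, 12, 12]

low, high = paired_ci(this_year, decade_ago, 2.262)  # t[0.05, 9 df] = 2.262
print(round(low, 3), round(high, 3))
```

Working with the differences reduces the two-sample problem to a one-sample interval, which is why the degrees of freedom are n – 1 rather than 2n – 2.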

12.22 DETERMINING THE SAMPLE SIZE

Deciding the proper sample size is an integral part of any sampling study where inferences need to be made.

Error

It is defined as the absolute difference between the parameter being estimated and the point estimate obtained from sample.

Evaluation of sample size for a mean

Known elements: σ², x̄

To be estimated: μ, where the population is N[μ, σ²]

The error can be defined as,

E = |x̄ – μ|   ... [1]

By definition

Zα = |x̄ – μ|/[σ/√n]   ... [2]

Equations [1] and [2] imply that,

E = Zα σ/√n, that is, √n = Zα σ/E   ... [3]

Squaring both sides of [3], we have

n = Zα² σ²/E²   ... [4]

Thus, [4] gives the sample size required to attain the tolerable error with the required degree of confidence.

Note 1:

When σ² is not known, we can make use of the sample variance s², and the sample size n is defined as

n = tα² s²/E²

The value of tα can be read from the t-table at level of significance α and [n – 1] degrees of freedom.

Note 2:

The sample size for a proportion can be defined as

n = Zα² PQ/E²

When P is not known, it can be assumed that P = 0.5.

Note 3:

For a two-sample case [n1 = n2 = n], the size of each sample can be defined as

n = Zα² [σ1² + σ2²]/d²

where d is equal to one half the width of the desired confidence interval.

Note 4:

For two sample proportions, the size of each sample can be defined as

n = Zα² [p1q1 + p2q2]/d²

where d is equal to one half the width of the desired confidence interval and n1 = n2 = n.
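
The sample size rules of this section can be collected into a short Python sketch. It assumes the standard forms n = Zα²σ²/E², n = Zα²PQ/E², n = Zα²[σ1² + σ2²]/d² and n = Zα²[p1q1 + p2q2]/d², which reproduce the answers quoted in Examples 21 to 25; the function names are hypothetical. Each result is rounded up, because a fractional sample size cannot be drawn.

```python
import math

def n_for_mean(z, sigma, error):
    """Illustrative helper: n = z^2 * sigma^2 / E^2, rounded up."""
    return math.ceil((z * sigma / error) ** 2)

def n_for_proportion(z, error, p=0.5):
    """Illustrative helper: n = z^2 * P * Q / E^2; P defaults to 0.5 when unknown."""
    return math.ceil(z ** 2 * p * (1 - p) / error ** 2)

def n_for_two_means(z, var1, var2, d):
    """Illustrative helper: n per group, d = half the desired interval width."""
    return math.ceil(z ** 2 * (var1 + var2) / d ** 2)

def n_for_two_proportions(z, p1, p2, d):
    """Illustrative helper: n per group for a difference between two proportions."""
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / d ** 2)

print(n_for_mean(1.645, 35, 25))                      # Example 21: 6
print(n_for_two_means(1.96, 5, 7, 0.5))               # Example 22: 185
print(n_for_proportion(2.58, 0.02))                   # Example 23: 4161
print(n_for_mean(1.645, 0.2, 0.05))                   # Example 24: 44
print(n_for_two_proportions(1.96, 0.20, 0.25, 0.05))  # Example 25: 534
```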

Example: 21

Evaluate the sample size n needed to find a 90% confidence interval for the purchase price of TVS in various retail stores in a given area such that the sample mean will differ from the population mean by no more than 25. Assume that σ is known and equal to 35/-.

Step 1:

Given: α = 0.10; Zα = 1.645; σ = 35; E = 25

Step 2:

n = Zα² σ²/E² = [1.645² × 35²]/25² = 5.30

The sample size should be at least 6 in order to attain the error of 25 with the required 90% confidence level.

Example: 22

A researcher wishes to know whether the mean length of employment with the current firm at the time of retirement is different for men and women. The researcher would like to have a confidence interval estimate of the difference between the population means. The specifications are a confidence interval width of 1 year and 95% confidence. Pilot samples yielded variances of 5 and 7. The researcher wants samples of equal size. What size of sample should be drawn from each population?

Step 1:

Given α = 5% = 0.05; Zα = 1.96; σ1² = 5; σ2² = 7; d = 1/2 = 0.5

Step 2:

n = Zα² [σ1² + σ2²]/d² = [1.96² × [5 + 7]]/0.5² = 184.4

Step 3:

A sample of at least 185 men and an independent sample of at least 185 women are needed.

Example: 23

A cigarette manufacturer wishes to conduct a survey using a random sample to estimate the proportion of smokers who would switch to the company’s newly developed low-tar brand. The sampling error should not be more than 0.02 above or below the actual proportion, with a 99% degree of confidence. Find the required sample size.

Step 1:

Given α = 0.01; Zα = 2.58; E = 0.02; P is not known, so P = Q = 0.5

Step 2:

n = Zα² PQ/E² = [2.58² × 0.5 × 0.5]/0.02² = 4160.25

Hence, the sample should contain at least 4161 members in order to attain the error of 0.02 with the required 99% confidence level.

Example: 24

The weight of cement bags follows a normal distribution with SD 0.2 kg. Find how large the value of n should be taken so that error can be plus or minus 0.05 of the actual value with a confidence level of 90%.

Step 1:

Error = 0.05; σ = 0.2; α = 0.10; Zα = 1.645

Step 2:

Then the value of n can be given as

n = Zα² σ²/E² = [1.645² × 0.2²]/0.05² = 43.3

Step 3:

The sample size should be at least 44, so that the mean weight of cement bags can be estimated within ± 0.05 kg of the actual value with a 90% confidence level.

Example: 25

For two populations of consumers, a researcher wants to estimate the difference between the proportions, who have used a particular brand of coffee. A confidence co-efficient of 0.95 and an interval width of 0.10 are desired. Estimates of p1 & p2 are 0.20 and 0.25, respectively. How large should the sample size be [n1 = n2]?

Step 1:

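Given that α = 0.05; Zα = 1.96; p1 = 0.20; q1 = 0.80; p2 = 0.25; q2 = 0.75; d = 0.10/2 = 0.05

Step 2:

n = Zα² [p1q1 + p2q2]/d² = [1.96² × [0.20 × 0.80 + 0.25 × 0.75]]/0.05² = 533.98

The researcher should draw a sample of at least 534 from each population.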

Example: 26

A medical researcher proposes to estimate the mean serum cholesterol level of a certain population of middle-aged men, based on a random sample of the population. He asks a statistician for advice. The ensuing discussion reveals that the researcher wants to estimate the population mean to within ± 6 mg/dL or less, with 95% confidence. Thus, the standard error of the mean should be 3 mg/dL or less. Also, the researcher believes that the standard deviation of serum cholesterol in the population is probably about 40 mg/dL. How large a sample does the researcher need to take?

Step 1:

Given that α = 0.05; σ = 40; SE = 3

 

Zα = Z0.05 = 1.96

Step 2:

We know that

SE = σ/√n

That is,

√n = σ/SE = 40/3 = 13.33

n = 177.78, that is, n = 178 approximately.

The researcher should take a sample of size 178.
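
Example 26 works from a target standard error rather than from an error bound and a table value. A minimal Python sketch of that step, assuming SE = σ/√n and using the hypothetical name `n_for_target_se`, is:

```python
import math

def n_for_target_se(sigma, target_se):
    """Illustrative helper: smallest n with sigma / sqrt(n) <= target_se."""
    return math.ceil((sigma / target_se) ** 2)

# Example 26: sigma about 40 mg/dL, standard error of 3 mg/dL or less
n = n_for_target_se(40, 3)
print(n)                             # 178
print(round(40 / math.sqrt(n), 3))   # achieved SE, slightly below 3
```

Because the standard error falls with √n, halving the target standard error roughly quadruples the required sample size.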

EXERCISES
  1. A zoologist measured tail length in 86 individuals, all in the one-year age group, of the deer mouse Peromyscus. The mean length was 60.43 mm and the standard deviation was 3.06 mm. Verify that the 95% confidence interval for the mean is [59.77, 61.09].
  2. There is an old folk belief that the sex of a baby can be guessed before birth on the basis of its heart rate. In an investigation to test this theory, foetal heart rates were observed for mothers admitted to a maternity ward. The results [in beats per minute] are summarized in the table.

    Construct a 95% confidence interval for the difference in population means.

  3. As part of a large study of serum chemistry in healthy people, the following data were obtained for the serum concentration of uric acid in men and women aged 18–55 years.
     Serum Uric Acid [mmol/L]
            Men     Women
    n       530     420
    x̄       .354    .263
    S       .058    .051

    Construct a 95% confidence interval for the difference in population means.

  4. An agronomist measured the heights of n corn plants. The mean height was 220 cm and the standard deviation was 15 cm. Calculate the standard error of the mean if
    1. n = 25
    2. n = 100
  5. As part of a study of the treatment of anemia in cattle, researchers measured the concentration of selenium in the blood of 36 cows that had been given a dietary supplement of selenium [2 mg/day] for one year. The cows were all of the same breed [Santa Gertrudis] and had borne their first calf during the year. The mean selenium concentration was 6.21 μg/dL and the standard deviation was 1.84 μg/dL. Construct a 95% confidence interval for the population mean.
  6. In a study of larval development in the tufted apple budmoth [Platynota idaeusalis] an entomologist measured the head widths of 50 larvae. All 50 larvae had been reared under identical conditions and had moulted six times. The mean head width was 1.20 mm and the standard deviation was .14 mm. Construct a 90% confidence interval for the population mean.
  7. A group of 101 patients with end-stage renal disease were given the drug epoetin. The mean hemoglobin level of the patients was 10.3 [g/dL], with an SD of 0.9. Construct a 95% confidence interval for the population mean.
  8. A pharmacologist measured the concentration of dopamine in the brains of several rats. The mean concentration was 1,269 ng/g and the standard deviation was 145 ng/g. What was the standard error of the mean if
    1. 8 rats were measured?
    2. 30 rats were measured?
  9. The diameter of the stem of a wheat plant is an important trait because of its relationship to breakage of the stem, which interferes with harvesting the crop. An agronomist measured stem diameter in eight plants of the Tetrastichon cultivar of soft red winter wheat. All observations were made three weeks after flowering of the plant. The stem diameters [mm] were as follows:

    The mean of these data is 2.275 and the standard deviation is .238.

    1. Calculate the standard error of the mean.
    2. Construct a 95% confidence interval for the population mean.
  10. For the 28 lamb birth weights, the mean is 5.1679 kg, the SD is .6544 kg and the SE is .1237 kg. Construct [a] a 95% confidence interval for the population mean [b] a 99% confidence interval for the population mean.
  11. Ferulic acid is a compound that may play a role in disease resistance in corn. A botanist measured the concentration of soluble ferulic acid in corn seedlings grown in the dark or in a light/dark photoperiod. The results [nmol acid per g tissue] were as shown in the table.
            Dark    Photoperiod
    n       4       4
    x̄       92      115
    S       13      13

    Construct the 95% confidence interval for the difference in ferulic acid concentration under the two lighting conditions.

  12. Prothrombin time is a measure of the clotting ability of blood. For 10 rats treated with an antibiotic and 10 control rats, the prothrombin times [in seconds] were reported as follows:
            Antibiotic    Control
    n       10            10
    x̄       25            23
    S       10            8

    Construct a 90% confidence interval for the difference in population means [Assume that the two populations from which the data came are normally distributed].

  13. A dendritic tree is a branched structure that emanates from the body of a nerve cell. In a study of brain development, researchers examined brain tissue from seven adult guinea pigs. The investigators randomly selected nerve cells from a certain region of the brain and counted the number of dendritic branch segments emanating from each selected cell. A total of 36 cells were selected, and the resulting counts were as follows:

    Construct a 95% confidence interval for the population mean.

  14. In evaluating a forage crop, it is important to measure the concentration of various constituents in the plant tissue. In a study of the reliability of such measurements, a batch of alfalfa was dried, ground and passed through a fine screen. Five small [.3 g] aliquots of the alfalfa were then analyzed for their content of insoluble ash. The results [g/kg] were as follows:

    For these data, calculate the mean, the standard deviation and the standard error of the mean.

  15. Six healthy three-year-old female Suffolk sheep were injected with the antibiotic gentamicin, at a dosage of 10 mg/kg body weight. Their blood serum concentrations [μg/mL] of gentamicin 1.5 hours after injection were as follows.

    For these data, the mean is 28.7 and the standard deviation is 4.6; construct a 95% confidence interval for the population mean.

  16. Human beta-endorphin [HBE] is a hormone secreted by the pituitary gland under conditions of stress. A researcher conducted a study to investigate whether a program of regular exercise might affect the resting [unstressed] concentration of HBE in the blood. He measured blood HBE levels [pg/mL], in January and again in May, on ten participants in a physical fitness program. The results were as shown in the table.

    Construct a 95% confidence interval for the population mean difference in HBE levels between January and May.

  17. Given N = 2696, n = 100 and 5 defectives in the sample, evaluate the 99% confidence interval for the proportion of defective articles in the whole batch.
  18. Doctors who have developed a new drug for the treatment of a certain disease treat a group of 400 patients suffering from the disease with the new drug. They treat another group of 400 patients with an alternative drug. At the end of two weeks, 320 of the patients receiving the new drug recover, whereas 240 of those taking the alternative drug recover. Construct the 95% confidence interval for the difference between the true proportions of patients who might be expected to respond to the two drugs.
  19. What are type I and type II errors in testing of hypothesis?
  20. Explain the following:
    1. Simple random sampling
    2. Stratified random sampling
    3. Systematic sampling
  21. Sampling is a necessity under certain conditions – illustrate by a suitable example.
  22. What are the types of hypothesis? Compare and contrast them.
  23. Explain in detail the steps involved in the testing of hypothesis.
  24. Distinguish between complete enumeration and sample survey.
  25. How far is the latter more advantageous than the former and why?
  26. Briefly explain the principal steps involved in sample survey.
  27. Explain the concepts of sampling distribution and standard error.
  28. Discuss the role of standard errors in large sample survey.
  29. Explain briefly the reasons for the increasing popularity of sampling methods. Explain briefly any two methods of sampling which help us to obtain a representative sample.
  30. What do you mean by sampling? What are the types of sampling?
  31. A researcher is planning to compare the effects of two different types of light on the growth of bean plants. She expects that the means of the two groups will differ by about 1 inch and that in each group the standard deviation of plant growth will be around 1.5 inches. Consider the guideline that the anticipated SE for each experimental group should be no more than one-fourth of the anticipated difference between the two group means. How large should the sample be [for each group] in order to meet this guideline?
  32. Data from two samples gave the following results:
            Sample 1    Sample 2
    n       6           12
    x̄       40          50
    S       4.3         5.7

    Compute the standard error of [x̄1 – x̄2] and the range for the population mean with 5% level of significance.

  33. Compute the standard error of [x̄1 – x̄2] for the following data.
            Sample 1    Sample 2
    n       10          10
    x̄       125         217
    S       44.2        28.7
  34. Compute the standard error of [x̄1 – x̄2] and the range for the population mean with 5% level of significance.
            Sample 1    Sample 2
    n       5           7
    x̄       44          47
    S       6.5         8.4
  35. Suppose the sample sizes were doubled, but the means and SDs stayed the same, as follows. Compute the standard error of [x̄1 – x̄2] and the range for the population mean with 5% level of significance.
            Sample 1    Sample 2
    n       10          14
    x̄       44          47
    S       6.5         8.4
ANSWER THE QUESTIONS
  1. Write short notes on sampling.
  2. The probability distribution referred by the sample statistic is known as_______________.
  3. Procedure for obtaining a sample from a prescribed population prior to collecting any data is referred to as_______________.
  4. Parameter refers to the_______________of the population.
  5. Parameter is otherwise known as_______________
  6. State any two advantages of sampling.
  7. State any two disadvantages of sampling.
  8. Define the term non-sampling errors.
  9. A sample can be classified into_______________major types.
    1. 2
    2. 3
    3. 4
    4. None
  10. State any two random sampling methods.
  11. State any two non-random sampling methods.
  12. Define the term sampling distribution.
  13. State the relationships between the sample statistics and the population parameter.
  14. Highlight the term ‘standard error’.
  15. The population is said to be finite, if it is_______________.
    1. countable
    2. uncountable
    3. None
  16. What do you mean by confidence interval?
  17. What do you mean by level of significance?
  18. Define the term table value for the test statistic.
  19. ‘When the sample statistics are known, it is possible for us to evaluate the range for the population mean’ – Comment on this_______________.
  20. Deciding the proper_______________is an integral part of any sampling study_______________.
ANSWERS
  1. A sample is any subset of a given population. It is possible to estimate the population parameters from the limited sample statistics with the help of statistical methods and concepts. This falls under the category of statistical inference [Inductive statistics]. The inferential process is not error free, because the estimation or inference is based on the limited data obtained from samples. The main purpose of sampling is to allow us to make use of the information gathered from the sample to draw inferences about the entire population.
  2. Sampling distribution
  3. Sample design
  4. Characteristics
  5. Statistic
  6. Refer Section 12.6
  7. Refer Section 12.6
  8. Refer Section 12.7
  9. (a)
  10. Refer Section 12.9
  11. Refer Section 12.9.2
  12. Refer Section 12.12
  13. Refer Section 12.13
  14. The standard deviation of a sampling distribution is referred to as the standard error
  15. (a)
  16. Refer Section 12.17
  17. The permitted error percentage is known as the level of significance
  18. The value read from the statistical table of the relevant distribution at the α level
  19. True
  20. Sample size