10. Measurement Design – Management Research Methodology: Integration of Principles, Methods and Techniques


Measurement Design


Upon completion of this chapter, you will be able to:

  • Define measurement and scaling
  • Study the different types of scales
  • Learn about the errors of measurement
  • Understand how to validate measurements
  • Understand the concept of reliability of measurement
  • Learn different ways of scaling research questions
  • Become familiar with unidimensional scale construction
  • Appreciate multidimensional scaling


In any research, hypotheses and theories are tested using empirical data that are either already available or specially collected. When data are collected specially, the researcher exerts good control over the process of data collection to ensure good quality. A research study is only as good as the data used in it, and data, when used for quantitative analysis such as hypothesis testing, are only as good as the measurements made on them. It would not be an exaggeration to say, therefore, that a research study is only as good as its measurements. Measurement is thus the most vital part of any research study.

Measuring the physical entities used in physical sciences is comparatively easier and less prone to errors and approximations than measuring the conceptual entities so much used in management theories. Measuring length, weight, inventory, the number of rejections, and so on is easier than measuring concepts like attitude, morale, job satisfaction, or perceived product quality. The second aspect of measurement in management research is that many of the above-mentioned constructs and concepts relating to products, brands, and consumers are multidimensional in nature, whereas physical entities are unidimensional. Thus, in management research, measurement becomes more involved and complex. Measurement is inseparably bound to scaling, which can be thought of as the continuum on which measurements are made and the measured entity is located. There are three major ways of obtaining measured data: (i) administering a standard instrument already developed, tested, and validated by others; (ii) administering an instrument specially developed by the researcher (to be tested and validated); and (iii) recording already measured data (such as inventory balances, absenteeism, numbers sold, and so on). The development of measurements and scales requires scientific skill, considerable time, and effort. In this chapter, we will discuss some of the more important requirements. Definitions, the types of data required, scale construction, measurement errors, and the validity and reliability of measurement will be discussed, in that order.

measured data
Measured data for management research are obtained in three ways: administering a standard instrument already developed, administering a specially designed instrument, and extracting already measured data from records.

Primary types of measurement scales

Measurement has been defined as “the matching of an aspect of one domain to an aspect of another” (Stevens, 1968). Kerlinger (1973) has defined measurement as “the assignment of numbers to objects to represent amounts or degrees of a property possessed by all of the objects”. In a simple way, measurement can be defined as the assignment of numbers (symbols) indicative of quantity to properties, characteristics, or behaviour of persons, objects, events, or states. The numbers (symbols) have the same relevant relationship to each other as do the things represented. Three important characteristics of numbers (symbols) are: (i) order—numbers are ordered; (ii) distance—differences between numbers are ordered; and (iii) origin—the series has a unique origin, which is indicated by the number zero. The result of this is a scale of measurement.

In social sciences the term ‘scaling’ is applied to procedures attempting to determine quantitative measures of subjective abstract concepts. Scaling is defined as a “procedure for the assignment of numbers (or other symbols) to a property of objects in order to impart some of the characteristics of numbers to the properties in question” (Edwards, 1957).

Scaling is a procedure for attempting to determine quantitative measures of subjective abstract concepts.

Nominal Scales

The lowest level of measurement is classification measurement, which consists simply of classifying objects, events, and individuals into categories. Each category is given a name or assigned a number; the numbers are used only as labels or type numbers, with no relation of order, distance, or origin between the numbered categories. This classification scheme is referred to as a nominal scale. Nominal scales are the least restrictive and are widely used in social sciences and business research. Examples are telephone numbers or departmental accounting codes. There is a one-to-one relation between each number and what it represents. Most statistical calculations are not meaningful; only frequency counts and the mode can be used. Many researchers feel that this is not a scale at all (nominal means “in name only”).

nominal scale
This is a measurement procedure to classify objects, events and individuals into categories.



Example
(i) Where do you live?   City ______   Town ______   Village ______
(ii) Do you own a car?   Yes/No
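Since only classification is meaningful on a nominal scale, the only sensible summaries are frequency counts and the mode. A minimal Python sketch, using hypothetical responses to a residence question like the one above:

```python
from collections import Counter

# Hypothetical nominal responses to "Where do you live?"
responses = ["City", "Town", "City", "Village", "City", "Town"]

# Frequency counts are the only meaningful summary of nominal data
freq = Counter(responses)
mode = freq.most_common(1)[0][0]

print(freq["City"])  # 3
print(mode)          # City
```

Any arithmetic on the category labels themselves (averaging "City" and "Town", say) would be meaningless, which is exactly the restriction the nominal scale imposes.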

Ordinal Scales

These scales are used for measuring characteristics of data having the transitivity property (that is, if x > y and y > z, then x > z). They include the characteristics of the nominal scale plus an indicator of order. The task of ordering, or ranking, results in an ordinal scale, which defines the relative position of objects or individuals according to some single attribute or property. There is no determination of the distance between positions on the scale. The investigator is therefore limited to determinations of ‘greater than’, ‘equal to’, or ‘less than’, without being able to say how much greater or less (the difference). Examples of ordinal scales are the costs of brands of a product and the ordering of objectives according to their importance. Positional statistical measures such as the median and quartiles, and ranking indexes, can be obtained.

ordinal scale
This scale is used to measure data having transitivity property. It includes the characteristic of nominal scale in addition to indicating order.


  • Please rank the following objectives of the manufacturing department of your organisation according to their importance.
    Objectives Rank
    Quality ______
    Cost ______
    Flexibility ______
    Dependability ______
  • Please indicate your preference in the following pairs of objectives of R&D management.
    Option 1        Option 2       Preference (Sample Answer)
    New product     New process    1
    New product     Quality        1
    New product     Cost           1
    New process     Quality        2
    New process     Cost           1
    Quality         Cost           1


    Derived Ranks
    Objective       No. of times ranked first    Derived rank
    New product     3                            1
    Quality         2                            2
    New process     1                            3
    Cost            0                            4
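The derivation of ranks above can be reproduced by counting how often each objective is preferred. A minimal Python sketch, with the pairs and sample answers taken from the example:

```python
# Each tuple is (option 1, option 2, preferred option), following the
# sample answers 1, 1, 1, 2, 1, 1 given in the example above.
answers = [
    ("New product", "New process", "New product"),
    ("New product", "Quality",     "New product"),
    ("New product", "Cost",        "New product"),
    ("New process", "Quality",     "Quality"),
    ("New process", "Cost",        "New process"),
    ("Quality",     "Cost",        "Quality"),
]

objectives = ["New product", "New process", "Quality", "Cost"]
wins = {obj: 0 for obj in objectives}
for first, second, preferred in answers:
    wins[preferred] += 1

# Rank by number of times preferred (ties would need a tie-breaking rule)
ranked = sorted(objectives, key=lambda obj: wins[obj], reverse=True)
derived_rank = {obj: i + 1 for i, obj in enumerate(ranked)}
print(derived_rank)
# {'New product': 1, 'Quality': 2, 'New process': 3, 'Cost': 4}
```

The result matches the derived-ranks table: the objective preferred most often receives rank 1.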

Interval Scales

The interval scale has all the characteristics of the nominal and ordinal scales and, in addition, the units of measure (or intervals between successive positions) are equal. This type of scale is truly ‘quantitative’ in the ordinary and usual meaning of the word. Almost all the usual statistical measures are applicable to interval measurement, unless a measure implies knowing the true zero point. A simple example is a scale of temperature. One interval scale can be changed into another by a linear transformation (for example, Celsius to Fahrenheit degrees in temperature measurement).

interval scales
Interval scales possess all the characteristics of nominal and ordinal scales. In addition, the units of measure (intervals between successive positions) are equal.

Example The link between the R&D and marketing departments in your organisation is:
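The linear-transformation property mentioned above can be illustrated in Python; the temperature values are arbitrary:

```python
def celsius_to_fahrenheit(c):
    # F = 1.8*C + 32: a linear transformation, so order and the
    # equality of intervals are preserved, but there is no true zero
    return 1.8 * c + 32

temps_c = [10.0, 20.0, 40.0]
temps_f = [celsius_to_fahrenheit(c) for c in temps_c]  # [50.0, 68.0, 104.0]

# Ratios of intervals survive the transformation ...
ratio_c = (temps_c[2] - temps_c[1]) / (temps_c[1] - temps_c[0])  # 2.0
ratio_f = (temps_f[2] - temps_f[1]) / (temps_f[1] - temps_f[0])  # 2.0

# ... but ratios of the scale values themselves do not: 40 °C is not
# "twice as hot" as 20 °C, and the Fahrenheit ratio disagrees
print(temps_c[2] / temps_c[1], temps_f[2] / temps_f[1])  # 2.0 vs about 1.53
```

This is why statements about ratios of values are meaningless on an interval scale, while statements about differences are not.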

Ratio Scales

In essence, a ratio scale is an interval scale with a natural origin (that is, a ‘true’ zero point). Thus, the ratio scale is the only type possessing all characteristics of the number system. Such a scale is possible only when empirical operations exist for determining all four relations: equality, rank-order, equality of intervals, and equality of ratios. Once a ratio scale has been established, its values can be transformed only by multiplying each value by a constant. Ratio scales are found more commonly in the physical sciences than in the social sciences. Measures of weight, length, time intervals, area, velocity, and so on, all conform to ratio scales. In the social sciences, we do find properties of concern that can be ratio scaled: money, age, years of education, and so forth. However, successful ratio scaling of behavioural attributes is rare. All types of statistical analyses can be used with ratio scaled variables.

ratio scale
A ratio scale is an interval scale with a natural origin (a true zero point), possessing all the characteristics of the number system.

Example What percentage of R&D expenditures is directed to new product development?

Errors in measurement

Precise and unambiguous measurement of variables is the ideal condition for research. In practice, however, errors creep into the measurement process in various ways and at various stages of measurement. The researcher must be aware of these potential error sources and make a conscious effort to eliminate or at least minimise them. This is of primary importance when measurements use instruments specially designed by the researcher.

Variation of measurement consists of variations among different scales and errors of measurement. Sources of variation in measurement scales are presented in Table 10.1 (Lehmann and Hulbert, 1975).

measurement errors
The major ones are errors due to the interviewer, errors due to the instrument, and respondent errors.


Table 10.1 A Classification of Errors

Origin                  Type of Error
1. Researcher           Wrong question
                        Inappropriate analysis
                        Experimenter expectation
2. Sample               Wrong target
                        Wrong method
                        Wrong people
3. Interviewer          Interviewer bias
4. Instrument
   (a) Scale            Rounding off
                        Cutting off
   (b) Questionnaire    Evoked set
                        Construct-question incongruence
5. Respondent           Consistency/Inconsistency
                        Lack of commitment

The major errors of concern are:

  1. Errors due to interviewer bias Bias on the part of the interviewer may distort responses. Rewording or abridging responses may introduce errors, as may encouraging or discouraging certain viewpoints of the respondent, incorrect wording, or faulty calculation during the preparation of data.
  2. Errors due to the instrument An improperly designed instrument (questionnaire) may introduce errors through ambiguity, the use of words and language beyond the understanding of the respondent, or non-coverage of essential aspects of the problem or variable. Poor sampling will introduce errors into the measurement, and whether the measurement is made at home or on site may also affect the measure.
  3. Respondent errors These may arise from the health problems, fatigue, hunger, or undesirable emotional state of the respondent. The respondent may not be committed to the study and may become tentative and careless. There may be genuine errors due to lack of attention or care while replying, that is, ticking ‘yes’ when ‘no’ was meant. Further, errors may occur during coding, punching, tabulating, and interpreting the measures.

Validity and reliability in measurement

Knowing that errors will creep into measurements in practice, it is necessary to evaluate the accuracy and dependability of the measuring instrument. The criteria for such evaluation are validity, reliability, and practicality. The basic discussion of validity and reliability was presented in the section on the validity and reliability of experiments and quasi-experiments in Chapter 6. In the following sections, they are discussed with reference to measurements and measuring instruments (Judd and McClelland, 1998).

Validity refers to the extent to which a test/instrument measures what we actually wish to measure. Reliability has to do with the accuracy and precision of a measurement procedure. Practicality is concerned with a wide range of factors of economy, convenience, and interpretability.

In any research, there are always measurement errors and non-sampling errors. There has been no accepted body of theory that may be used to predict either the direction or magnitude of these errors. One may ignore, estimate, or attempt to measure them. The measurement accuracy and measurement error can be defined as:


    Measurement accuracy = r/t, and    Measurement error = (1 – r/t)

where r is the recorded sample value, and
      t is the true value


The attempt to measure t and compare it to r is concerned with the validity of the measurement. The measure of the variability in r is the reliability of the measurement. The works of Emory (1976), Tull and Albaum (1973), Nunnally (1967), and Sekaran (2000) discuss validity and reliability of measurement scales extensively.
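The accuracy and error expressions above translate directly into code. A minimal sketch with a hypothetical recorded value and true value:

```python
def measurement_accuracy(r, t):
    # accuracy = r / t, where r is the recorded value and t the true value
    return r / t

def measurement_error(r, t):
    # error = 1 - (r / t)
    return 1 - r / t

# Hypothetical reading: the true inventory is 400 units, but 380 are recorded
recorded, true = 380.0, 400.0
print(measurement_accuracy(recorded, true))  # 0.95
print(measurement_error(recorded, true))     # about 0.05 (5% error)
```

In practice, of course, t is rarely known; validity assessment is precisely the attempt to approximate this comparison.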

Validity of Measurement

After a model has been chosen for the construction of a measuring instrument and the instrument has been constructed, the next step is to find out whether or not the instrument is useful. This step is usually spoken of as determining the validity of the instrument (Nunnally, 1967). A scale or a measuring instrument is said to possess validity to the extent to which differences in measured values reflect true differences in the characteristic or property being measured. Two forms of validity are mentioned in research literature. The external validity of research findings is their generalisability to “… populations, settings, treatment variables, and measurement variables …” (Tull and Hawkins, 1987). The internal validity of a research instrument is its ability to measure what it aims to measure.

Internal validity This is the extent to which differences found with a measuring tool reflect true differences among those being tested. The widely accepted classification of validity consists of three major forms: content, criterion-related, and construct.

Content validity Content validity of a measuring instrument is the extent to which it provides adequate coverage of the topic under study. Content validity has been defined as the representativeness of the content of a measuring instrument: if the instrument contains a representative sample of the universe of the subject matter of interest, content validity is good. To evaluate the content validity of an instrument, one must first agree on what elements constitute adequate coverage of the problem, and then determine which items represent the relevant positions on these topics. If the questionnaire adequately covers the topics that have been defined as the relevant dimensions, it can be concluded that the instrument has good content validity.

content validity
Content validity is the extent to which the instrument provides adequate coverage of the topic under study. This is judgmental in nature and requires, generally, a panel of judges and accurate definitions of the topic. Face validity is a minimum index of content validity.

The determination of content validity is judgmental and can be approached in several ways. First, the designer may himself determine the validity through a careful definition of the topic of concern, the items to be scaled, and the scales to be used. This logical process is somewhat intuitive and is unique to each research designer. A second way to determine content validity is to use a panel of persons to judge how well the instrument meets the standards. An example is provided in the Annexure 10.1.

Face validity This is a basic and minimum index of content validity. Face validity indicates that the items that are supposed to measure a concept do, on the face of it, look like they measure the concept. Other commonly used names for content validity are intrinsic validity, circular validity, relevance, and representativeness.

Criterion-related validity This form of validity reflects the success of measures used for some empirical estimating purpose. One may want to predict some outcome or estimate the existence of some current behaviour or condition. These cases involve predictive and concurrent validity, respectively. They differ only in a time perspective. An opinionaire or opinion questionnaire that correctly forecasts the outcome of a union election has predictive validity. An observational method that correctly categorises families by current income class has concurrent validity. While these examples appear to have rather simple and unambiguous validity criteria, there are difficulties in estimating validity.

criterion related validity
This is an external form of validity which reflects the success of measures used for some empirical estimating purpose.

It is essential to ensure that the validity criterion used is itself ‘valid’. Thorndike and Hagen (1969) suggest that any criterion must be judged in terms of four qualities: relevance, freedom from bias, reliability, and availability.

Other authors have referred to predictive validity as empirical validity or statistical validity. An example of this is provided in Annexure 10.2.

Construct validity Construct validity testifies to how well the results obtained from the use of the measure fit the theory around which the test is designed. It is concerned with knowing more than just that a measuring instrument works; it is concerned with the factors that lie behind the measurement scores obtained, that is, with what factors or characteristics (constructs) account for or explain the variance in measurement scores. One may also wish to measure or infer the presence of abstract characteristics for which no empirical validation seems possible. Attitude scales and aptitude and personality tests generally concern concepts that fall in this category. Even though this validation situation is much more difficult, it is necessary to have some assurance that the measurement has an acceptable degree of validity. In attempting to determine construct validity, one associates a set of other propositions with the results derived from using the measurement tool. If measurements on the devised scale correlate in a predicted way with these other propositions, one may conclude that there is some construct validity.

Construct validity is assessed through convergent validity and discriminant validity. Convergent validity is established when the scores obtained by two different instruments measuring the same concept are highly correlated. Discriminant validity is established when, based on theory, two variables are predicted to be uncorrelated, and the scores obtained by measuring them are indeed empirically found to be so.

One of the widely used methods of simultaneously establishing convergent and discriminant validity (construct validity) is the multi-trait multi-method matrix (MTMM) (Campbell and Fiske, 1959). The basic idea of the multi-trait multi-method matrix is that correlations among scores of the same trait should be the largest correlations in the matrix (refer Table 10.2).

multi-trait multi-method
This method uses a matrix of correlations among the scores on the same and different traits obtained by different measurement methods. Construct validity is established when the same-trait correlations are the largest in the matrix.


Table 10.2 Multi-trait Multi-method Matrix

The correlation between a measure of one trait (construct) and a measure of another construct should be smaller than the correlations between two measures of the same trait (construct). This would establish both convergent and discriminant validities.

Because of the common measurement effect, the correlation between M2 and M3 will be higher than those between M1 and M3, M1 and M4, and M2 and M4, but lower than those between M1 and M2 and between M3 and M4.
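The correlation pattern just described can be checked mechanically. The sketch below uses hypothetical correlation values; the labels follow the text, with M1 and M2 measuring one trait, M3 and M4 another, and M2 and M3 sharing a common method:

```python
# Hypothetical correlations illustrating the MTMM pattern.
corr = {
    ("M1", "M2"): 0.80,  # same trait, different methods (convergent)
    ("M3", "M4"): 0.75,  # same trait, different methods (convergent)
    ("M2", "M3"): 0.40,  # different traits, common method
    ("M1", "M3"): 0.15,  # different traits, different methods
    ("M1", "M4"): 0.10,
    ("M2", "M4"): 0.20,
}

same_trait = [corr[("M1", "M2")], corr[("M3", "M4")]]
common_method = corr[("M2", "M3")]
hetero = [corr[("M1", "M3")], corr[("M1", "M4")], corr[("M2", "M4")]]

# Convergent/discriminant validity: same-trait correlations dominate
convergent = min(same_trait) > max([common_method] + hetero)
# Common measurement effect: the shared-method correlation still exceeds
# the heterotrait-heteromethod correlations
pattern_holds = common_method > max(hetero)
print(convergent, pattern_holds)  # True True
```

With real data, the correlations would come from administering the four measures to the same sample; the inequalities checked here are exactly those stated in the text.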

A single study does not establish construct validity. Construct validation is an extended process of investigation; even tentative acceptance demands some aggregation of the results of a series of reliability and validity studies. In any case, it is not a certification of a measure (Peter, 1981). In the literature, construct validity has been referred to by different names, such as trait validity and factorial validity. An example is provided in Annexure 10.3.

Reliability in Measurement

The reliability of a measure indicates the stability and consistency with which the instrument measures the concept and helps to assess the ‘goodness’ of a measure. A measure is reliable to the degree that it supplies consistent results. Reliability is a partial contributor to validity. A reliable instrument need not be valid, but a valid instrument is reliable. Reliability is not as valuable as validity, but is easier to assess. Reliable instruments can at least be used with confidence that a number of transient and situational factors are not interfering. They are robust instruments in that they work well under different conditions and at different times. This distinction of time and condition is the basis for identification of two aspects of reliability—stability and equivalence.

Stability The ability of a measure to maintain stability over time, despite uncontrollable testing conditions and the state of the respondents themselves, is indicative of its stability and low vulnerability to changes in the situation. With a stable measure we can secure consistent results with repeated measurements of the same person with the same instrument. It is often a simple matter to secure a number of repeat readings in observational studies but not so with questionnaires. Two tests of stability are test-retest reliability and parallel-form reliability.

Test-retest reliability The reliability coefficient obtained by repeating an identical measure on a second occasion is called test-retest reliability. Test-retest interviews are often the only practical way to repeat measurements. When a questionnaire containing items that are supposed to measure a concept is administered to a set of respondents now, and again to the same respondents, say, several weeks to a few months later, the correlation between the scores obtained on the two occasions is called the test-retest coefficient. The higher it is, the better the test-retest reliability, and hence the stability of the measure across time. This test, however, has three measurement problems. If the time between measurements is short, the respondent may merely recall and repeat his earlier answers. Transient factors may also differ between the two occasions (for example, the respondent may be less interested on the second exposure to a question). Further, the first measurement itself may cause him to revise his opinions by the time of retesting. These test-retest problems can be reduced in two ways. One is to extend the time interval between measurements so that the respondent is less likely to remember his first answers. The second is to divide respondents into two groups on a random basis and measure group A on the first occasion and group B on the second. The degree of stability can be determined by comparing the results of repeated measurements. Along with repeated observations, statistical measures of dispersion, such as the standard deviation or ranges of percentages, are used.

test-retest reliability
This is a reliability coefficient obtained by repeating an identical measure on a second occasion. It is the correlation coefficient between the measures obtained in the test and the retest.
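A test-retest coefficient is simply the correlation between the two administrations. A minimal Python sketch, using hypothetical scores for six respondents:

```python
def pearson(x, y):
    # Plain Pearson correlation coefficient, written out for clarity
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores from the same six respondents, several weeks apart
test_scores   = [12, 18, 15, 22, 9, 17]
retest_scores = [13, 17, 16, 21, 10, 18]

# The test-retest reliability coefficient is the correlation between
# the two administrations; values near 1 indicate a stable measure
r = pearson(test_scores, retest_scores)
print(round(r, 3))  # close to 1 for these nearly-identical scores
```

Because the retest scores here track the test scores closely, the coefficient is high; large memory or maturation effects would show up as a much lower value.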

Parallel-form reliability When responses on two comparable sets of measures tapping the same construct are highly correlated, there will be parallel-form reliability. Both forms have similar items and the same response format with only the wordings and the ordering of questions changed. This helps in establishing the error variability resulting from wording and ordering of the questions. If two such comparable forms are highly correlated, it is easy to conclude that the measures are reasonably reliable, with minimal error variance caused by wording, ordering, or other factors.

parallel-form reliability
When the responses on two comparable sets of measures of the same construct are highly correlated, parallel-form reliability exists.

Equivalence A second aspect of reliability considers how much error may be introduced by different investigators or different samples of the items being studied. Thus, while stability is concerned with personal and situational fluctuations from one time to another, equivalence is concerned with variations at one point in time among investigators and sample of items or with the internal consistency. A good way to test for the equivalence of measurements by two investigators is to compare their observations of the same events. We can give a similar type of test to interviewers by setting up an experimental comparison between them using test-retest situations.

equivalence
This is a kind of reliability which considers how much error is incurred when different investigators measure the same attribute under different conditions.

We test for item sample equivalence by using alternative sets of questions with the same person at the same time. The results of the two tests are then correlated. A second method, the Split-Half Reliability technique, can be used when the measuring tool has a number of similar questions or statements in response to which the subject can express himself. The instrument is administered to the subject, and then the results are separated item wise into two randomly selected halves. The Spearman-Brown formula is applied:


R = nr/[1 + (n – 1)r]

where
   R = estimated reliability of the entire instrument
   r = correlation coefficient between the measurements (the two halves)
   n = ratio of the number of items in the changed instrument to the number in the original instrument

When n = 2, the split-half reliability is R = 2r/(1 + r)

These are then compared; if the results are similar, the instrument is said to have reliability in an equivalence sense. An example is provided in the Annexure 10.4 to highlight the reliability in measurement.
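The Spearman-Brown formula above can be sketched directly; the half-test correlation r = 0.70 is illustrative:

```python
def spearman_brown(r, n=2.0):
    # R = n*r / (1 + (n - 1)*r); n is the ratio of the number of items in
    # the changed instrument to the number in the original instrument
    return n * r / (1 + (n - 1) * r)

# If the two randomly selected halves correlate at r = 0.70, the
# estimated split-half reliability of the whole instrument (n = 2) is:
R = spearman_brown(0.70)
print(round(R, 3))  # 0.824
```

Note that R exceeds r: lengthening an instrument with comparable items raises its estimated reliability, which is what the formula quantifies.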

Inter-item consistency reliability is a test of the consistency of respondents’ responses to all the items in a measure. To the degree that items are independent measures of the same concept, they will be correlated with one another. The most popular test of inter-item consistency reliability is Cronbach’s coefficient alpha (Cronbach, 1951), which is used for multipoint scaled items, and the Kuder-Richardson formulas (Kuder and Richardson, 1937), used for dichotomous items (Sekaran, 2000).

A series of random splits of the items into two subsets is chosen. For each split, the Spearman-Brown coefficient is calculated, and the average of the resulting series of coefficients gives Cronbach’s alpha, which is a better indicator of reliability than a single Spearman-Brown coefficient. Statistical packages such as SPSS calculate Cronbach’s alpha for a set of scaled homogeneous items.
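In practice, Cronbach's alpha is usually computed from the item variances rather than by averaging split-halves. A minimal sketch of the standard item-variance formula, with hypothetical 5-point responses:

```python
def cronbach_alpha(items):
    # items: one list of scores per item, all of equal length
    # (one score per respondent)
    k = len(items)
    n = len(items[0])

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    # Total score of each respondent across all items
    totals = [sum(item[j] for item in items) for j in range(n)]

    # alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    return (k / (k - 1)) * (1 - sum(variance(i) for i in items) / variance(totals))

# Hypothetical responses: 3 items, 5 respondents
items = [
    [4, 3, 5, 2, 4],
    [4, 2, 5, 3, 4],
    [5, 3, 4, 2, 3],
]
print(round(cronbach_alpha(items), 3))  # 0.871
```

A value this high indicates the three items move together across respondents, that is, good inter-item consistency.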

Sometimes, we can combine our efforts to secure stability and equivalence in a single set of operations, for instance by giving alternative parallel tests at different times. Correlation of the results provides a combined index of reliability, which will be lower than either the separate stability or equivalence measures because it takes more sources of variation into account. A comparison of reliability techniques is discussed by Peter (1979).

Practicality The scientific requirements of a project call for the measurement process to be reliable and valid, while the operational requirements call for it to be practical. Thorndike and Hagen (1969) define practicality in terms of economy, convenience, and interpretability.

Types of scaling (scale classification)

There is no fixed way of classifying scales. From the point of view of management research, scales can be classified based on six different aspects of scaling.

  • Subject orientation In this type of scaling, variation across respondents is examined: the stimulus is held constant and differences in responses across respondents are studied. In contrast, the stimulus-centred approach studies variation across different stimuli and their effect on the same individual.
  • Response form Here the variation across both stimulus and subject is investigated. This is the most generally used type of scaling in data collection methods for research.
  • Degree of subjectivity This reflects the fact that judgment and opinions play an important part in responses.
  • Scale properties The scale could be nominal, ordinal, interval or ratio type.
  • Number of dimensions This reflects whether different attributes or dimensions of the subject area are being scaled.
  • Scale construction technique This indicates the technique of deriving scales—ad hoc, group consensus, single item, or a group of items, and whether statistical methods were employed or not.

Some of the more generally used scaling methods are discussed below (McIver and Carmines, 1981).

Response Methods

Response methods/variability methods include rating techniques, ranking methods, paired comparison, and rank order scaling approach.

Rating scales A rating scale is a measuring instrument that requires the person doing the rating or observing to assign the person or object being rated directly to some point along a continuum, or to one of an ordered set of categories. Rating scales are used to judge properties of objects without reference to other similar objects. The ratings may be in such forms as ‘like-dislike’, ‘approve-indifferent-disapprove’, or other classifications using even more categories. Many types of rating scales are used in practice, differing in the fineness of the distinctions they allow and in the procedures involved in the actual process of rating. The classification of rating scales, as given by Selltiz et al (1959), is graphic, itemised, and comparative.

A graphic rating scale is one in which lines or bars are used in conjunction with descriptive phrases; it is a common and simple form of scale. There are many varieties of graphic rating scales, namely, vertical segmented lines, continuous lines, unmarked lines, horizontal lines marked to create equal intervals, and so on. Other variants are three-point, four-point, five-point and longer scales.

The second form, the itemised scale, presents a series of statements from which a respondent selects the one best reflecting his evaluation. These judgments are ordered progressively in terms of more or less of some property. It is typical to use 5 to 7 categories, each defined or illustrated in words. Itemised rating scales are also referred to as category scales or numerical scales.

The third type is the comparative scale, where the ratings clearly imply a relative judgment between positions. The positions on the scale are specially defined in terms of a given population or in terms of people of known characteristics.

rating scales
A rating scale is a measuring instrument that requires the person doing the rating to assign the person or object being rated to some point along the continuum or in one of the ordered set of categories.

Ranking scales In ranking scales, the respondent directly compares two or more objects and makes choices among them. Widely encountered is the situation where the respondent is asked to select one as the ‘best’ or the ‘preferred’. When dealing with only two choices, this approach is satisfactory, but it may often result in the ‘vote splitting’ phenomenon when the respondent is asked to select the most preferred among three or more choices. This ambiguity can be avoided through the use of paired comparisons or rank ordering techniques.

ranking scales
A respondent directly compares two or more objects and makes choices among them. When a large number of objects is involved, the choice becomes ambiguous. The ambiguity can be avoided by using paired comparisons.

Method of paired comparisons With this technique, the respondent can express his attitude in an unambiguous manner by making a choice between two objects. Typical of such a situation would be a product testing study where a new flavour of soft drink is tested against an established brand. In the general situation there are often many more than two stimuli to judge, resulting in a potentially tedious task for the respondents (there will be n(n–1)/2 judgments, where n is the number of stimuli). With these n(n–1)/2 judgments, simultaneous evaluation of all the stimulus objects is not possible in a simple manner. Thurstone developed an analytical procedure called the ‘law of comparative judgment’ to derive quasi-interval scale values from paired comparison data (see Torgerson, W.S. [1958] for details of this procedure).
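The n(n–1)/2 judgment burden and the tallying of choices can be sketched directly. In this minimal example, the stimuli (four soft-drink flavours) and the respondents’ choices are invented for illustration:

```python
from itertools import combinations

# Hypothetical stimuli: four soft-drink flavours to be judged pairwise.
flavours = ["A", "B", "C", "D"]
pairs = list(combinations(flavours, 2))

# Each respondent must make n(n-1)/2 judgments.
n = len(flavours)
assert len(pairs) == n * (n - 1) // 2  # 6 pairs for 4 stimuli

# Invented choices from two respondents: each picks one member of every pair.
choices = [
    {("A", "B"): "A", ("A", "C"): "A", ("A", "D"): "D",
     ("B", "C"): "B", ("B", "D"): "D", ("C", "D"): "D"},
    {("A", "B"): "B", ("A", "C"): "A", ("A", "D"): "A",
     ("B", "C"): "C", ("B", "D"): "B", ("C", "D"): "C"},
]

# Tally how often each stimulus is preferred; the win proportions are the
# raw material for Thurstone's law of comparative judgment.
wins = {f: 0 for f in flavours}
for resp in choices:
    for pair, winner in resp.items():
        wins[winner] += 1

print(wins)  # {'A': 4, 'B': 3, 'C': 2, 'D': 3}
```

With ten stimuli the same respondents would face 45 judgments each, which is why the method becomes tedious as n grows.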

Method of rank order Another comparative scaling approach is to ask respondents to rank their choices (if 5 items are chosen out of 10, only 5 need be ranked). This method is faster than paired comparisons and is usually easier and more motivating to the respondent. On the negative side, there is some doubt regarding how many stimuli may be handled by this method. Less than five objects can usually be ranked easily, but respondents may grow quite careless in ranking, say, ten or more items. In addition, the rank ordering is still an ordinal scale with all of its limitations.


Example Following are the R&D objectives a firm generally meets with. Please indicate the objectives relevant to your R&D by ticking and rank them.

Objectives Rank
1. New product development (NPTD) ——-
2. New process development (NPSD) ——-
3. Modification of existing product (MEPT) ——-
4. Modification of existing process (MEPS) ——-

A total of 200 respondents rank the four objectives, giving the frequency with which each objective receives each rank (table (a)).

The preference score for an objective is derived by weighting these frequencies with the rank numbers assigned and summing; for example,

NPTD: (1 × 96 + 2 × 52 + 3 × 20 + 4 × 32) = 388. A lower score indicates a stronger preference. From the preference scores, the following table (b) is formed.

Objective Preference Score Preference Rank order
NPTD 388 1
MEPT 482 2
NPSD 528 3
MEPS 582 4

New product development followed by modifying existing products are the top two preferences for the R&D objective.
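The preference-score computation can be sketched as a short function. Only the NPTD rank frequencies are given in the text, so the function is demonstrated on those:

```python
# Preference score = sum over ranks of (rank number x number of respondents
# assigning that rank). Since rank 1 is most preferred, a LOWER score
# indicates a STRONGER preference.
def preference_score(rank_counts):
    """rank_counts[i] = number of respondents giving rank i+1."""
    return sum(rank * count for rank, count in enumerate(rank_counts, start=1))

# NPTD frequencies from the example: 96 first ranks, 52 second, 20 third,
# 32 fourth (96 + 52 + 20 + 32 = 200 respondents).
nptd = preference_score([96, 52, 20, 32])
print(nptd)  # 388, matching table (b)
```

The same function applied to the other objectives’ frequency rows would reproduce the remaining scores in table (b).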

Method of successive intervals Neither the paired-comparison nor the rank order method is particularly attractive when there are a large number of items to choose. Under these circumstances, the method of successive intervals is sometimes used. In this method, the subject is asked to sort the items into piles or groups representing a succession of values. An interval scale can be developed from these sortings.

Quantitative Judgment Methods

These include direct judgment methods, fractionation, and constant sum methods.

Direct judgment method The respondent is required to give a numerical rating to each stimulus with respect to some designated characteristic. In the unlimited type, the respondent may choose his or her own number or, in the graphical variant, position the rating anywhere along a line.

In the limited type, the respondent chooses from a limited set of ratings given by the researcher.

Fractionation methods In fractionation, a number denoting the ratio between two stimuli, with respect to some characteristic, is estimated.

Constant sum methods Constant sum methods are used to standardise the scale across respondents by requiring them to allocate a given total score (say 100) among the stimuli (total adds upto a constant and, hence, the name).
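Constant-sum data invite two routine operations: checking that each respondent’s allocations add up to the constant, and averaging allocations across respondents. A minimal sketch, with invented brands and point allocations:

```python
# Hypothetical constant-sum data: each respondent allocates 100 points
# across three brands in proportion to preference.
responses = [
    {"Brand X": 50, "Brand Y": 30, "Brand Z": 20},
    {"Brand X": 40, "Brand Y": 40, "Brand Z": 20},
    {"Brand X": 60, "Brand Y": 25, "Brand Z": 15},
]

TOTAL = 100
# Validity check: every respondent's allocation must sum to the constant,
# which is what standardises the scale across respondents.
for r in responses:
    assert sum(r.values()) == TOTAL

# Mean allocation per brand gives a score directly comparable across brands.
brands = responses[0].keys()
means = {b: sum(r[b] for r in responses) / len(responses) for b in brands}
print(means)  # Brand X averages 50.0 of the 100 points
```

Allocations that fail the sum check are usually returned to the respondent or discarded, since an off-total response breaks the standardisation the method relies on.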

Scale construction techniques

Scale construction techniques refer to the construction of sets of rating scales (questionnaires) that are carefully designed to measure one or more aspects of a person’s attitude or belief towards some object. An individual’s rating of a particular item in the questionnaire is not of much concern in itself; the individual’s responses to the various item scales are summed to get a single score for the individual.

scale construction techniques
These refer to the construction of sets of rating scales (questionnaires) that are carefully designed to measure one or more aspects of a person’s beliefs or attitudes. Scale construction methods include judgment scaling, factor scales, and multidimensional scaling.

In any data collection using variability methods, raw responses are obtained. These data are ordinal-scaled, and a model or technique of scaling is required to transform them into interval-scaled values; Thurstone’s Case V scaling is relevant in this case. Raw data obtained from judgment methods can be scaled using the semantic differential method of Osgood et al (1957).

Judgment Methods

Arbitrary scales It is possible to design arbitrary scales by collecting a number of items that are unambiguous and appropriate to a given topic. Arbitrary scales are easy to develop and can be designed to be highly specific to the particular case and content. They can provide much useful information, and can be quite adequate if developed by one who is skilled in scale design.

Consensus scaling In this approach, the selection of items is made by a panel of judges who evaluate proposed scale items and determine: (i) whether the item belongs in the topic area, (ii) its ambiguity, and (iii) the level of attitude that the scale item represents. The most widely known form of this approach is the Thurstone differential scale. The differential scales approach, known as the Method of Equal Appearing Intervals, was an effort to develop an interval scaling method for attitude measurement (Judd and McClelland, 1998).

Thurstone differential scale
This scale is constructed using the consensus of a panel of judges, with equal-appearing intervals on the scale. It is widely used for attitude measurement.

Item analysis In this procedure, a particular item is evaluated on the basis of how well it discriminates between persons whose total score is high and those whose total score is low. The most popular type of scale using this approach is the summated scale.

Summated scale This consists of a number of statements that express either a favourable or unfavourable attitude towards the object of interest. The respondent is asked to agree or disagree with each statement. Each response category is assigned a numerical score to reflect its degree of attitude favourability, and the scores are summed algebraically to measure the respondent’s attitude. The most frequently used form of scale is patterned after the one devised by Likert (Kerlinger, 1973).


Example of Likert Scale:

The objectives of the R&D department of your organisation are clearly set.

Likert scale
This widely used scale is based on the judgment method, using an agree-disagree format. Each category is assigned a numerical score. The scores on the scale are summed (summated scaling) to get the total score for an individual.
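Summated (Likert) scoring can be sketched in a few lines. The item responses below are invented, and the third statement is assumed to be negatively worded, so it is reverse-scored before summation:

```python
# Hypothetical five-point Likert responses (1 = strongly disagree ...
# 5 = strongly agree) to four statements about R&D objectives.
POINTS = 5
responses = [4, 5, 2, 3]                    # one respondent's raw answers
unfavourable = [False, False, True, False]  # third statement is negatively worded

# Reverse-score unfavourable items so that a high score always means a
# favourable attitude, then sum algebraically.
scored = [
    (POINTS + 1 - r) if neg else r
    for r, neg in zip(responses, unfavourable)
]
total = sum(scored)   # the respondent's summated attitude score
print(scored, total)  # [4, 5, 4, 3] 16
```

The single total (here 16 out of a possible 20) is the quantity compared across individuals; as the text notes, the rating of any one item is not of much concern by itself.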

Cumulative scales These scales are constructed as a set of items with which the respondent indicates agreement or disagreement. Equal total scores on cumulative scales have the same meaning: given a person’s total score, it is possible to estimate which items were answered positively and which negatively. The major scale in this category is the Guttman scalogram. Scalogram analysis is a procedure for determining whether a set of items forms a unidimensional scale, as defined by Guttman (1950). A scale is said to be unidimensional if the responses fall into a pattern in which endorsement of the item reflecting the extreme position results also in endorsing all items that are less extreme.
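The Guttman pattern can be checked mechanically: with items ordered from least to most extreme, a consistent respondent endorses a run of the mildest items and nothing beyond. A minimal sketch, with invented 0/1 responses:

```python
def is_guttman_pattern(resp):
    # resp: 0/1 endorsements with items ordered least -> most extreme.
    # A perfect cumulative pattern is a run of 1s followed by 0s,
    # i.e. the vector is non-increasing.
    return all(a >= b for a, b in zip(resp, resp[1:]))

# Invented responses of three subjects to five ordered items.
data = [
    [1, 1, 1, 0, 0],   # consistent: endorses the three mildest items
    [1, 1, 0, 0, 0],   # consistent
    [1, 0, 1, 0, 0],   # error: endorses item 3 but not the milder item 2
]
errors = sum(not is_guttman_pattern(r) for r in data)
print(errors)  # 1 inconsistent respondent
```

In scalogram analysis, the proportion of such errors feeds a coefficient of reproducibility; a set of items with few errors is taken to form a unidimensional scale.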

Case-V scaling Here the scaling uses a model of comparative judgment. If all of a group of subjects prefer A to B and only 60 percent prefer B to C, then Thurstone’s model can help develop an interval scale from these stimulus-comparison proportions (Figure 10.1).

If Rj and Rk are the mean values of the two judgment processes, then the interval Rj – Rk = zjk, where zjk is the unit normal deviate corresponding to the proportion of judges preferring stimulus j to stimulus k (Edwards, 1957).
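A minimal sketch of Case V scaling under its simplest assumptions (equal discriminal dispersions): each preference proportion is converted to a unit normal deviate, and each stimulus’s scale value is the average of its deviates against all stimuli. The proportions below are invented:

```python
from statistics import NormalDist

# Hypothetical proportions p[(j, k)] = share of judges preferring j over k.
p = {
    ("A", "B"): 0.70, ("A", "C"): 0.90, ("B", "C"): 0.75,
}
stimuli = ["A", "B", "C"]
z = NormalDist().inv_cdf  # unit normal deviate for a proportion

def zjk(j, k):
    # z is antisymmetric: the deviate for k-over-j is minus that for j-over-k.
    if j == k:
        return 0.0
    return z(p[(j, k)]) if (j, k) in p else -z(p[(k, j)])

# Case V scale value: average of a stimulus's deviates against all stimuli.
scale = {j: sum(zjk(j, k) for k in stimuli) / len(stimuli) for j in stimuli}
print(scale)  # A highest, C lowest on the derived interval scale
```

The resulting values are interval-scaled, so only their differences are meaningful; by construction they average to zero.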

Factor Scales

The term factor scales is used here to identify a variety of techniques that have been developed to deal with two problems that have been glossed over so far. They are (i) how to deal more adequately with the universe of content that is multidimensional, and (ii) how to uncover underlying dimensions that have not been identified. The different techniques used are latent structure analysis, factor analysis, cluster analysis, and metric and non-metric multidimensional scaling.

The Q-Sort technique This is similar to the summated scale. The task required of the respondent is to sort a number of statements into a predetermined number of categories.

Example (seven items on the scale):

* Most agreed—1, 2 (two items); Neutral—3, 4, 5 (three items); Least agreed—6, 7 (two items)


Responses of the subjects A, B, C, and D are given in the table above. It can be noted that pair A and B and pair C and D seem ‘most alike’. A subject’s scores are correlated with those of every other subject, and factor or cluster analysis is performed to obtain the groupings (Kerlinger, 1973).

Semantic differential (SD) This is a type of quantitative judgment method that results in assumed interval scales. The scale is obtained by factor analysis of these assumed scale values and can be used rather easily and usefully in decisional survey research employing multivariate statistics. This scaling method, developed by Osgood and his associates (Osgood et al, 1957), is an attempt to measure the psychological meaning of an object to an individual. It is based on the proposition that an object can have several dimensions of connotative meaning, which can be located in a multidimensional property space, in this case called semantic space, having, for example, both direction and intensity. One way this is done is by requiring the respondent to rate the object on a set of bipolar adjectives (like extremely clear - extremely ambiguous; extremely strong - extremely weak, and so forth), assuming equal intervals, and then assigning integer values to these intervals. Averages of these scores for two groups can be compared to get a semantic differential profile. Some of the major uses have been to compare company ‘images’ and brands, determine attitudinal characteristics of consumers, and analyse the effectiveness of promotional activities.

semantic differential (SD)
This is a quantitative judgement method that results in assumed interval scales. Uses bipolar adjectives like ‘extremely clear’ and ‘extremely ambiguous’. Measures images, effectiveness, attitudes, etc.


Fig. 10.1 Thurstone Case V scaling

Example Do you like the taste of apples?

Stapel scale This is a modified version of the semantic differential. It is an even-numbered, non-verbal rating scale using single adjectives instead of bipolar opposites. Direction and intensity are measured concurrently. Neither the equality of intervals nor the additivity of a respondent’s ratings is assumed (Crespi, 1961). A format is shown below.

Multidimensional scaling Multidimensional scaling (MDS) is a powerful mathematical procedure that can systematise data by representing the similarities of objects spatially as in a map (Schiffman, 1981). In multidimensional scaling models, the existence of an underlying multidimensional space is assumed. This term describes a set of techniques that deal with property space in a more general way than does the semantic differential. With multidimensional scaling it is possible to scale objects, individuals, or both, with a minimum of information. A detailed discussion on MDS is presented in Section on MDS in Chapter 18.

Standardised instruments As in many instances of research, an available, standardised instrument may be selected to collect data instead of developing one. These standard instruments are developed and tested by experts, and the results of using them may be compared with the results of other studies. A large number of standardised instruments are available for testing a variety of topics like personality, achievement, performance, intelligence, general aptitudes, and projective tests (sub-topics are also covered by them). Proper selection of an instrument is important for its use in a specific research. For this purpose, specifications; conditions for use; details regarding validity, reliability, and objectivity; directions for administration; scaling; and interpretation should be considered carefully (for greater detail refer to Conoley and Kramer 1989, Keyser and Sweetland 1986, Mitchell 1983).


Measurement is the assignment of numbers to objects or to the amounts of a property possessed by objects. Since highly abstract and imprecisely defined constructs and properties are common in social science and management, measurement problems are especially difficult in such research. The primary types of measurement scales used are nominal scales, which simply categorise objects; ordinal scales, which rank objects or attributes; interval scales, which are truly quantitative, enabling the use of statistical methods on the data; and ratio scales, which are interval scales with a natural origin.

In management and social sciences, the variables and constructs are highly abstract and their measurements tend to be indirect. Therefore, validity and reliability assume great importance. Validity means freedom from systematic errors or bias in measurement. It indicates the appropriateness of a measuring instrument. Reliability means freedom from variable errors or errors in repetitive use of the measure. It is the accuracy or precision of a measuring instrument.

A scale or instrument is said to possess validity when the extent of differences in measured values reflects true differences in the characteristic measured. Two basic forms of validity are content validity, which is the extent of coverage of the topic being measured, and criterion-related validity, which reflects the success of measures used for an empirical estimating purpose. By far the most important form of validity is construct validity, which testifies to how well the result obtained from the use of the measure fits the theory it is supposed to support. A valid instrument is reliable, but a reliable instrument is not necessarily valid.

Reliability in measurement means stability of measurement under different conditions and at different times. Two tests of stability are: (i) test-retest reliability and (ii) parallel-form reliability. When different sets of items or two investigators measure a construct, how closely they agree gives the equivalence. A measure of equivalence is obtained by applying the Spearman-Brown formula to the two measures. Cronbach’s alpha, based on the same logic, is a more popularly used measure of reliability.
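As a sketch of how such a coefficient is computed, the following applies the standard Cronbach’s alpha formula, alpha = k/(k−1) × (1 − Σ item variances / variance of totals), to an invented three-item data set:

```python
from statistics import variance

# Hypothetical scores of five respondents on each of three scale items.
items = [
    [4, 3, 5, 2, 4],   # item 1 scores across the five respondents
    [5, 3, 4, 2, 3],   # item 2
    [4, 4, 5, 1, 3],   # item 3
]
k = len(items)

# Total (summated) score of each respondent across the k items.
totals = [sum(scores) for scores in zip(*items)]

# Cronbach's alpha: high when the items co-vary strongly, i.e. when the
# variance of the totals is large relative to the summed item variances.
item_var = sum(variance(it) for it in items)
alpha = k / (k - 1) * (1 - item_var / variance(totals))
print(round(alpha, 3))  # 0.902 for this invented data
```

Values above roughly 0.7 are conventionally read as acceptable internal-consistency reliability, though the threshold depends on the research purpose.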

There are several types of scaling methods. Response methods include direct rating, comparison, and ranking methods. Scale construction techniques refer to the construction of sets of rating scales (as in questionnaires). These are carefully designed and most often standardised for general use in research. They consist of: (i) consensus scales, which a panel of experts devise; (ii) summated scales, where item analysis is used (and of which the Likert scale is an example); and (iii) factor scales, which are used to develop measures of dimensions of complex constructs using multivariate statistical techniques. Q-sort, semantic differential, and Stapel scales are examples of factor scales. Multidimensional scaling techniques are used when measuring multidimensional complex constructs and they use proximity or similarity data. They reveal the dimensions relevant to the subjects.

ANNEXURE 10.1 Illustrative Example—Content Validity

An instrument for classification of decision-making styles of manufacturing executives was designed and tested (Vijaya Kumar, et al, 1990). The instrument was intended primarily to classify executives as analytic or heuristic. These styles were defined based on the research of several authors on what constitutes analytic and heuristic, and an integrated definition was evolved. A total of 54 questions embodying the indicators were offered to over 100 executives, who judged each as analytic or heuristic in their opinion. The results were statistically analysed and 18 questions per style were extracted. In a second voting, these were condensed to 20 in all (10 for each style). Three situational factors were included to test these questions. The content validity was thus established on the basis of the judgment of the executives and the earlier theories and constructs related to decision-making style. The statistical procedure was used only to refine the judgments of the executives. The following simple example indicates the analysis followed in establishing content validity.

A statement of the questionnaire gets the following response.

Number responding = 100
Analytical = 89
Heuristic = 10
Undecided = 1
Analytical = 89/99 = 89.9 %
Heuristic = 10/99 = 10.1 %
  Standard error = √((0.899 × 0.101)/99) × 100 = 3.03 %

For a 99 per cent confidence level, the true population proportion will fall within 89.9 ± 2.58 × 3.03 per cent (maximum 97.71 per cent, minimum 82.09 per cent) for analytical, and within 10.1 ± 2.58 × 3.03 per cent (maximum 17.91 per cent, minimum 2.29 per cent) for heuristic. That is, at least 82.09 per cent of the population would have rated the statement analytical, with a probability of 99 per cent. From the results obtained, only those items which scored more than 70 per cent were taken as the initial set of elements.
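The annexure’s interval can be reproduced directly from the figures above; a short sketch:

```python
from math import sqrt

# Confidence interval for the proportion of judges rating an item
# 'analytical', using the annexure's figures.
n = 99                      # decided responses (one 'undecided' excluded)
p = 89 / 99                 # proportion rating the item analytical
se = sqrt(p * (1 - p) / n)  # standard error of the proportion (about 3.03%)

z = 2.58                    # normal deviate for 99 per cent confidence
low, high = p - z * se, p + z * se
print(f"{p:.1%} +/- {z * se:.2%}")  # 89.9% +/- 7.81%
print(f"[{low:.2%}, {high:.2%}]")   # roughly [82.09%, 97.71%]
```

The lower bound is what supports the annexure’s claim: even in the worst case consistent with the sample, at least about 82 per cent of the population would have rated the item analytical.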

ANNEXURE 10.2 Illustrative Example—Concurrent and External Validity

For the instrument discussed as an example under content validity, the concurrent and external validities were established as follows. A ‘decision profile form’ was designed with a view to obtaining a self rating score, and also the superior’s rating score of the respondent.

  • The definitions of two styles were offered as ‘P’ type decision-making and not as analytical and heuristic.
  • Using these definitions, ratings were obtained in the following format on a ten-point scale.

                Type ‘P’

                Never 1 2 3 4 5 6 7 8 9 10 Always

                Type ‘Q’

                Never 1 2 3 4 5 6 7 8 9 10 Always


These ratings were obtained both by the respondent for himself (which is a self rating score), and by his superior for the respondent (which is the superior’s rating score). The questionnaire was also filled in by the respondents. They were scored simply as the total number of items generally agreed (GY), separately for analytic items and heuristic items. These are termed individual score for decisions for one’s self and individual score for decisions for superior. The concurrent validity of the instrument was determined by finding the correlations between the individual’s scores for decisions for self and self rating scores; and the correlation between the individual scores for decision for the superior and the superior’s rating scores of the individual. Significant high correlations for the above two sets of scores indicate the desired validity.

Further, total scores of ‘analytic’ items, and total scores of ‘heuristic’ items were checked for correlation. If the two types of items are orthogonal and independent, we expect an insignificant correlation coefficient close to zero, and if they are significantly negatively correlated, then our contention that they belong to the same continuum would be borne out empirically. The decision profile form and the questionnaire were administered to a random sample of respondents from a group of production executives. Their responses were collected on a ten-point scale, as explained above.

The ‘analytic’ and ‘heuristic’ scores obtained for decisions made for one’s self were correlated with the self rating scores of the individual respondents for the purpose of validation. Significant high correlation would indicate the validity of this questionnaire.

Simultaneously, yet another ‘decision profile form’ was given to the immediate superior of the above mentioned production executives who responded to the questionnaire. The decision profile form, on completion by the immediate superior, would indicate the level of his subordinates’ decision-making capability, again on a ten-point scale. Such rating by the superior concerned was obtained without the knowledge of the subordinates. As part of the validation process, the questionnaire was administered to 100 executives, and their superiors were administered the rating forms. Similarly, the questionnaire, and later, a self rating form were administered to 100 other executives. Only 29 of the former, and 49 of the latter responded.

Results For a sample of 49, the correlation coefficient between the self scores and the self rating scores for the analytic style was 0.60682, significant at the 0.01 level (the critical correlation coefficients are 0.288 at the 0.05 level and 0.372 at the 0.01 level, respectively). This validates the portion concerning analytical decisions. For the same sample, the correlation coefficient between scores of decisions for superiors and scores of the superior’s rating for the heuristic style is 0.27153, which marginally fails to be significant at the 0.05 level. However, the reliability of an instrument is sometimes taken as a measure of its validity and, in view of the high reliability of the instrument, this failure is not considered serious.

ANNEXURE 10.3 Illustrative Example—Construct Validity

John and Reve (1982) developed an instrument (questionnaire with scaling) for assessing dyadic relationships in marketing channels. Multiple items were developed for each variable. The scales and the response formats were finalised using a pilot study. The core of the study was the assessment of construct validity of the measures developed for the inter-organisational dyadic structure and sentiment variables. Three dimensions of construct validity were considered and they are: (i) internal consistency as validity, (ii) convergent validity, and (iii) discriminant validity.

Reliability and internal consistency were assessed in two separate procedures. To assess internal consistency, the researcher analysed the pool of scale items developed for each variable by using item-total correlations and common factor analysis. Cronbach’s alpha was estimated for the unidimensional scales that were extracted. The convergent validity was assessed by analysing multi-trait multi-method matrices of correlations where different structural and sentimental variables constituted the traits and the instruments from the wholesalers and retailers constituted the methods. The criterion of maximally different method was met. The discriminant validity was simultaneously assessed with convergent validity by using the analysis of covariance. The results showed that the scale items having higher scale reliability were distinctly different for wholesalers than for retailers. The convergent and discriminant validity results showed that validity was accomplished. However, the MTMM matrix of sentimental variables showed very low correlations, indicating a lack of ability to discriminate in cross-construct correlations. The same was tested in the structural equation analysis used in the study.

ANNEXURE 10.4 Illustrative Example—Reliability in Measurement

The equivalence reliability for the instrument discussed earlier was established under validation as follows:

To establish reliability, the split-half technique was chosen in order to check the internal consistency of the measuring instrument. A correlation analysis was made between the split halves on a sample of 150 executives. It was found that all the reliability coefficients using the split-half method were significant at the 0.01 level. Thus, we can conclude that the instrument is valid and reliable.

The formula used to arrive at the reliability coefficient is the ‘Spearman Brown’ formula (Anastasi, 1966):


r = 2 r’/(1 + r’)


     r = reliability coefficient of the whole test and

     r’ = reliability of the half-test, found experimentally.

When the reliability coefficient of the half-test (r′) is 0.50, for example, the reliability coefficient of the whole test, r, by the above formula is (2 × 0.5)/(1 + 0.5) = 0.667.
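The correction can be wrapped in a one-line function; a minimal sketch reproducing the worked figure:

```python
# Spearman-Brown correction: whole-test reliability r from the half-test
# correlation r' obtained by the split-half method: r = 2r' / (1 + r').
def spearman_brown(r_half):
    return 2 * r_half / (1 + r_half)

print(round(spearman_brown(0.50), 3))  # 0.667, as in the text
```

Note that the corrected coefficient always exceeds the half-test correlation (for positive r′), reflecting the fact that a longer test of equivalent items is more reliable.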

The whole test reliability coefficients are shown in Exhibit 10.1 (these are significant at 0.01 level). Thus, the instrument is considered to measure the true style reliably.


Exhibit 10.1 Reliability coefficient for test

Note: # All values are significant at 0.01 level

Suggested Readings

  • Cox, T. F. and M. A. Cox (2001). Multidimensional Scaling, 2d ed. London: Chapman and Hall.
  • Judd, C. M. and G. H. McClelland (1998). “Measurement”, in D. T. Gilbert, S. T. Fiske and G. Lindzey (eds). Handbook of Social Psychology, 4th ed., pp. 180–232. McGraw Hill.
  • McIver, J. P. and E. G. Carmines (1981). Unidimensional Scaling. New Delhi: Sage Publications.
  • De Vellis, Robert F. (2002). Scale Development: Theory and Applications. New Delhi: Sage Publications.

1. For what type of problem and under what circumstances would you find the following data gathering techniques most appropriate?

  1. Likert scale
  2. Questionnaire
  3. Interview
  4. Observation
  5. Q-Sort

2. Construct a questionnaire that could be administered to a group of supervisors for the following topics.

  1. Reasons for workers leaving the organisation.
  2. What could be done to improve the quality on the shop floor?
  3. Assessment of non-work interests of the workers.
  4. Manager’s interests in the welfare of workers.

3. Construct suitable scales for the following measurement situations.

  1. Evaluation of worker performance and supervisor performance.
  2. Assessing the organisational climate in an organisation.
  3. The level of technology in production operations.
  4. The degree of integration of production, R & D, and marketing.

4. When collecting facts, which one would you prefer to use—free answer, answer to a multiple choice question, or answer to a dichotomous question?

5. Evaluate the following questions.

  1. What was the total sales of all products in your organisation last year?
  2. Do you think your organisation encourages innovative behaviour? Evaluate on a scale ranging from ‘very strongly’ to ‘not at all’.
  3. When did you last disagree with your boss?
  4. As a manager of production, what decisions would you take with respect to inventories?
  5. How many meetings, on an average, do you attend every month?
  6. How do you rate your supervisor X? Evaluate on a scale of outstanding – satisfactory – poor.
  7. Have you changed your jobs often? YES/NO
  8. Please indicate the performance of the water pump supplied to you, on a scale ranging from excellent to poor.
  9. What improvements do you think should be made in the pumps?

6. Explain the concepts: (a) Scaling, (b) Operational definition, (c) Stimulus, (d) Proximity, (e) Configuration, (f) Latent structure, and (g) Trait validity.

7. Which of the definitions is relevant for measurement purposes? Why?

  1. Ostensive definition.
  2. Verbal definition.
  3. Descriptive definition.
  4. Operational definition.
  5. Conceptual definition.

8. What are the situational characteristics of a measurement? Give examples.

9. Give a conceptual and two or more operational definitions for each of the following concepts.

  1. Satisfied customer
  2. Not meeting customer demands
  3. Direct overheads
  4. Financial implications
  5. Worker participation
  6. Manufacturing excellence

10. Develop scales for the operational definitions in Question 9.

11. Develop four different scales for measuring brand loyalty—one nominal, one ordinal, one interval, and one ratio.

12. Examine management literature in the area of your interest where the measurement problems are tricky. Search and note down the definitions used by the authors and the scales used. Can you find an alternate operational definition and an alternate scaling procedure?

13. When is multi-dimensional scaling useful in measurement?

14. Give an example to illustrate the use of the following 10 scales. Develop the scale.

  1. Non-comparative graphic rating scales
  2. Non-comparative itemised rating scale
  3. Comparative graphic rating scales
  4. Comparative itemised rating scale
  5. Paired comparisons
  6. Rank order scales
  7. Constant sum scales
  8. Semantic differential scales
  9. Stapel scales
  10. Likert scales

15. Choose a hypothesis with the help of your guide or one that is available in literature, suitable for verification using a field survey with a questionnaire. Develop a questionnaire indicating scales and the variables. Discuss your questionnaire.

16. Construct a short rating scale to be used for the evaluation of the teaching performance of a new professor.

17. Construct a Likert type questionnaire for the following problems:

  1. The members of the faculty of an institution of advanced learning taking up consultancy work.
  2. Religious activities in an industrial organisation.
  3. R&D to be directed to new product development.
  4. Inadequate top management support to self development of employee in an organisation.

18. What statistical techniques can be used with a nominal scale?

19. Give examples of situational characteristics as components of measurement.

20. What kind of scale is appropriate for the following? Discuss how you arrived at your answer.

  1. Determining the percentage of men in a town who require vitamin supplements.
  2. Determining kind of top management support desired by employees for creative contribution to the organisation
  3. Determining the average proportion of spare time scientists in R&D of an organisation require
  4. Determining whether the company should adopt flexible working hours for its employees

21. Develop different scales for measuring corporate dependency on universities for R&D: one nominal, one ordinal, one interval, and one ratio.

22. Search for published papers in good journals that give you at least two examples each of nominal, ordinal, interval, and ratio scales, and describe them.

23. Develop sound measurements for the following.

  1. Product leadership
  2. Customer satisfaction
  3. Top management support
  4. Linkage between two departments
  5. Effectiveness of models in decision making
  6. Technology upgradation

24. How is internal comparison reliability different from split-half reliability ?

25. Can you group the following eight validity terms into two major categories? If not, explain why. If yes, categorise them, giving reasons.

  1. Content validity
  2. Face validity
  3. Criterion validity
  4. Concurrent validity
  5. Predictive validity
  6. Construct validity
  7. Convergent validity
  8. Discriminant validity

26. What are the indicators of accuracy in measurement?

27. What are the problems of external validation? Discuss with examples.

28. How is construct validity achieved by achieving convergent and discriminant validities?

29. Which is more important in measurement related to decisional research, reliability or validity? Give reasons for your answer.

30. Select two research papers dealing with multi-trait multi-method matrix used in establishing validity. Compare them. Present them to a group, giving differences and similarities.

31. Select two recent research papers describing the development of a measuring instrument. Compare their validation procedures. Are there differences? If yes, explain why.