SW 132 Research and Evaluation in Social Work Practice
Schram

Fall 2001
Key Points for 10/2/01
 

Constructing Measurement Instruments

1. Constructing measurement instruments is a question of operationalization, i.e., converting concepts into variables, as suggested by the following schema:

Theory
Concept A -- Concept B
|

Hypothesis
variable a -- variable b

2. Variables are called variables because they vary across the people we study. For instance, self-esteem can be measured as a variable because some people can be shown to have high levels of self-esteem while others have low levels of self-esteem. It varies across the population.

3. To operationalize a concept is to make it measurable by converting it into a variable. Some concepts are easy to operationalize, such as age, earnings, years of formal schooling, etc. Others might be more controversial such as race. We might operationalize self-esteem by measuring it according to responses people give to survey questions about how they perceive themselves. Once we operationalize concepts into variables, we can collect data on them and see how they vary across the population we are studying.

4. When operationalizing concepts into variables, we try to achieve the highest level of precision in measurement that the concept allows. We do not want false precision or misplaced concreteness, which occurs when we create measures that are more precise than we can legitimately be about the concept. We want as much precision as the concept can legitimately allow, and no more, so that we can later analyze the data as precisely as possible. Some things cannot be measured very precisely; others can.

6. The levels of precision in measurement are nominal, ordinal, interval, and ratio.

7. Nominal is where the numbers simply name differences--such as the variable religion, which might be coded 1=Catholic, 2=Islamic, 3=Greek Orthodox, 4=Jewish, 5=Lutheran, 6=Episcopalian, etc. The numbers are merely substitutes for names. You can convert many nominal variables into dichotomous or dummy variables that measure the presence or absence of something, such as Catholic, conventionally coded 1=Catholic and 0=not Catholic.
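
The conversion of a nominal variable into a dummy variable can be sketched in a few lines of Python. This is only an illustration: the respondent data are made up, the helper name to_dummy is hypothetical, and the sketch assumes the usual convention of coding presence as 1 and absence as 0.

```python
# Nominal codes for the variable "religion" (a subset of the codes in point 7).
RELIGION_CODES = {1: "Catholic", 2: "Islamic", 3: "Greek Orthodox", 4: "Jewish"}


def to_dummy(codes, target_code):
    """Return 1 where a nominal code equals target_code, otherwise 0."""
    return [1 if code == target_code else 0 for code in codes]


# Made-up religion codes for five respondents.
respondents = [1, 3, 1, 4, 2]

# Dummy variable "Catholic": 1 = Catholic, 0 = not Catholic.
catholic = to_dummy(respondents, 1)
print(catholic)
```

Note that the original codes 1 through 4 carry no order or quantity; only the dummy's 1 and 0 can be counted, here to give the number of Catholic respondents.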

8. Ordinal is where the variable’s numbers imply a rank order, as in Self Esteem, 1=low self-esteem, 2=moderate self-esteem, 3=high self-esteem. Ordinal scores can only be ranked.

9. Interval is where the variable’s numbers imply a rank order with equal intervals between each number, making for a more precise ranking. Some attitudes can be measured through surveys to produce scores of equal intervals. Interval scores can be subject to addition and subtraction, so we can measure the net difference in scores, such as when we compare someone with a score of 12 to someone with a score of 10 and note that the former’s score is 2 points higher than the latter’s score.

10. Ratio is where the equal interval rank order is based on an absolute zero, as in income, age, length of unemployment, number of times returned for treatment, etc. Ratio scores can be multiplied and divided. We can say when something, such as someone’s income, is two times as high as someone else’s. Being able to do more mathematical operations enables us to analyze the data more precisely.
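
One way to keep the four levels straight is to list which comparisons are meaningful at each level. The sketch below is an illustrative summary, not a standard library or a formal rule set: the operation labels, the allows helper, and the income figures are assumptions made for the example.

```python
# Comparisons that are meaningful at each level of measurement.
# Each level permits everything the level before it permits, plus one more.
OPERATIONS = {
    "nominal": {"equal"},
    "ordinal": {"equal", "rank"},
    "interval": {"equal", "rank", "difference"},
    "ratio": {"equal", "rank", "difference", "ratio"},
}


def allows(level, operation):
    """Return True if the operation is meaningful at this level."""
    return operation in OPERATIONS[level]


# Interval example from point 9: a score of 12 is 2 points above a score of 10.
score_gap = 12 - 10

# Ratio example with made-up incomes: 40,000 is twice 20,000
# because income has an absolute zero.
income_ratio = 40000 / 20000

print(allows("interval", "ratio"))  # dividing interval scores is not meaningful
print(score_gap, income_ratio)
```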

11. When operationalizing concepts into variables, we want to create measures that have reliability and validity.

12. Reliability is when a measure can be counted on to do the same thing in every instance. A reliable measure is like a good thermometer that measures body temperature the same way in each and every person, no matter who they are. It is consistent in its measurement. It may produce different results in different instances, but it will do so in each instance in the same way. For instance, a reliable measure of the concept of "learned helplessness" might be counting up the number of times in a month each of your long-term unemployed clients on welfare attended a class on job-finding skills. While each client might be recorded as having attended a different number of times, some of them many times and some of them not at all, attendance was measured the same way for all clients.

13. Checking for reliability can involve examining whether: (1) the same measure measures the same thing for different subpopulations, e.g., whether an SAT question is interpreted the same way by different ethnic groups (group bias effect); (2) a different version of the same measure works the same way for the same population (parallel forms reliability); (3) the same measure, repeated on the same population at two different points in time, is stable over time (test-retest reliability); or (4) having different people do the measuring with the same measure affects how the measurement is done in different instances (interobserver reliability).
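
Test-retest reliability is commonly summarized with a correlation between the two sets of scores. The following is a minimal sketch, assuming interval-level scores and using Pearson's r; the scores are made up, and pearson_r is a helper written for this example rather than a library function.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up self-esteem scale scores for the same five clients,
# measured at two points in time.
time1 = [10, 12, 15, 18, 20]
time2 = [11, 12, 14, 19, 20]

r = pearson_r(time1, time2)
print(round(r, 2))
```

A value of r close to 1 suggests the measure is stable over time. The same calculation could be applied to two parallel forms or to two observers' scores.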

14. Validity is where a measure measures what it is supposed to measure. A valid measure is a good measure of the concept. A valid measure corresponds closely to what the concept is about.

15. Checking for validity can involve face validity, where the measure is determined to appear on its face to have the right elements to tap what is being measured. For instance, a battery of agree/disagree items on a survey questionnaire designed to create a scale score for the level of self-esteem of each person interviewed can be checked to see if each item is logically related to self-esteem.

16. Content validity is where a group of informed expert researchers examine the scale to assess whether it is designed to measure what it is supposed to measure, especially whether it covers all the different things it needs to cover. For instance, a group of experts can look to see whether a measure of IQ tests for all the relevant forms of intelligence.

17. Checking for validity can also involve empirical tests. Empirical tests of validity can take the form of either criterion-related or construct validity.

18. Criterion-related forms of validity involve using external criteria to see if your measure is valid. One form of criterion-related validity is predictive validity, where your measure is determined to be valid because scores on your measure predict other phenomena that are expected to occur in the future. For instance, a scale of self-esteem in a survey questionnaire may be determined to have predictive validity because scores on the scale were correlated with whether the welfare clients interviewed subsequently went on job interviews.

19. Another form of criterion-related validity is concurrent validity. Here your measure is determined to be valid because it is correlated with something that occurs at the same time and is assumed to be related to the concept. The measure can be predicted to vary with other factors that have already been established to be associated with the concept being measured. For instance, concurrent validity would be established when your newly devised measure of self-esteem can be shown to vary with contemporaneous reports of, say, people’s life satisfaction scores, because life satisfaction has previously been theorized to be related to self-esteem.

20. Construct validity rests on a summary analysis of whether a measure relates to other variables in expected ways. One form of construct validity is convergent validity, where the measure is shown to vary with other measures of the same concept. For instance, a depression scale on a survey questionnaire of clients in a treatment facility should vary with scores from a clinician’s diagnoses. Factor analysis that shows these variables all go together would be an example of convergent validation.

21. Another form of construct validity is discriminant validity, where a measure relates more closely to other measures of the same concept than to measures of other concepts. For instance, a measure of self-esteem based on a scale constructed from a battery of agree/disagree items in a survey questionnaire given to clients in an agency should correlate more closely with their clinicians’ diagnoses of whether each client had high or low levels of self-esteem than with the scores each client got on a marital satisfaction scale. A discriminant analysis procedure that shows your measure is associated with some variables but not others in expected ways would be an example of discriminant validation.
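
The logic of convergent and discriminant validity can be illustrated by comparing two correlations. This is a rough sketch under stated assumptions: all scores are made up, the clinician ratings are a hypothetical 1-to-5 rating treated as interval for the sake of the example, and pearson_r is a helper written here rather than a library function.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up scores for the same five clients.
self_esteem_scale = [10, 14, 18, 22, 26]    # new survey scale
clinician_rating = [1, 2, 2, 3, 4]          # another measure of the same concept
marital_satisfaction = [30, 22, 28, 21, 27] # a measure of a different concept

r_convergent = pearson_r(self_esteem_scale, clinician_rating)
r_discriminant = pearson_r(self_esteem_scale, marital_satisfaction)

# Evidence of validity: the scale tracks the clinician ratings much more
# closely than it tracks marital satisfaction.
print(round(r_convergent, 2), round(r_discriminant, 2))
```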

22. There is often a trade-off, or as the text says "a tension," between reliability and validity. Trying to emphasize one can lead to the neglect of the other. Trying to get a reliable measure can mean sacrificing the validity of the measure. This tension often reflects the different emphases of quantitative and qualitative research. It also suggests the need for triangulation.

23. Checking for reliability and validity is best done initially through pilot-testing, in other words pre-testing your measures on a small sample population.

24. We can never be perfectly sure that measures are completely reliable and valid. Therefore, we need to assume that some error will be built into our measures. Systematic error occurs when measures are biased in ways that systematically mismeasure some subpopulations, and we can try to eliminate it. Random error is much harder to eliminate because it occurs at random, in other words by chance. We need to be prepared to account for the level of error in our measurements, but this can never be done with anything approximating perfect certainty. The goal is to eliminate systematic error as best we can and to design our measures so that there is likely to be only a small, if unknowable, amount of random error.
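
The difference between systematic and random error can be shown with a small made-up example. The sketch assumes we somehow know each person's true score, which in practice we never do; it is meant only to show that systematic error shifts every measurement in the same direction, while random error scatters around zero.

```python
# Made-up true scores and the scores our instrument actually recorded.
true_scores = [50, 60, 70, 80]
observed_scores = [53, 62, 74, 83]

# Total error for each person: observed minus true.
errors = [obs - true for obs, true in zip(observed_scores, true_scores)]

# Systematic error (bias): the average amount by which the measure is off.
bias = sum(errors) / len(errors)

# Random error: what is left for each person after removing the bias.
random_errors = [e - bias for e in errors]

print(bias)
print(random_errors)
```

Here the instrument overstates every score by about 3 points (systematic error), and the leftover errors are small and cancel out on average (random error).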

25. One way to start to compensate for the inevitability of some level of error in our measures is to engage in triangulation. Triangulation is where we rely on more than one measure or more than one method of data collection to study the same thing. If multiple measures or multiple data sources reinforce each other, then we can be more confident about our findings.
------------------------------------
Group Work for 10/2/01
Operationalization

Break into groups of 4.
Take no more than 15 minutes.

Create measures of "welfare dependency" and "learned helplessness."

Provide tests to show how your measures are reliable and valid.

State the level of measurement for each.