|
Psychometrics is the science of measuring "psychological" aspects of
a person such as knowledge, skills, abilities, or personality. Because these phenomena cannot be observed directly, measuring
them is difficult, and much of the research and accumulated art of the discipline is devoted to defining them reliably
and then quantifying them. Critics, including "hard science" practitioners and social activists, have argued that such definition
and quantification are impossibly difficult and that such measurements are very often misused (although users of psychometric
techniques can reply that their critics often misuse data by not assessing them with psychometric criteria). Significant
psychometricians include Karl Pearson, L. L. Thurstone, and Arthur Jensen.
Significant critics include the late Stephen Jay Gould.
Much of the early work in psychometrics was developed in order to measure intelligence. More recently, psychometric theory has
been applied to the measurement of personality, attitudes and beliefs, academic
achievement, and, in health-related fields, quality of life.
There are two branches of psychometric theory: classical
test theory (CTT) and the more recent item response
theory (IRT).
The key concepts of classical test theory are reliability and validity. A reliable measure measures something consistently, while a valid measure
measures what it is supposed to measure. A measure may be reliable without being valid. For example, a
measurement instrument like a broken ruler may under-measure a quantity by the same amount each time (consistently), but
the resulting quantity is still wrong, that is, invalid.
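The broken-ruler example can be sketched numerically. The lengths and the size of the bias below are made-up values chosen only for illustration, and the function name broken_ruler is hypothetical.

```python
import numpy as np

# Made-up true lengths in centimetres (illustrative values only).
true_lengths = np.array([10.0, 15.0, 20.0, 25.0, 30.0])

# Assumption: the ruler always reads exactly 2 cm short.
BIAS = -2.0

def broken_ruler(lengths, bias=BIAS):
    """Return readings from a ruler with a constant measurement error."""
    return lengths + bias

# Measure the same objects twice.
first = broken_ruler(true_lengths)
second = broken_ruler(true_lengths)

# Consistency: the two sets of readings agree perfectly.
consistency = np.corrcoef(first, second)[0, 1]

# Accuracy: every reading is still off by the same amount.
mean_error = float(np.mean(first - true_lengths))

print("correlation between repeated readings:", consistency)
print("mean error of the readings:", mean_error)
```

The repeated readings agree with each other, so the ruler is consistent, yet each reading misses the true length by the same 2 cm.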
Both reliability and validity may be assessed mathematically. Internal consistency may be assessed by correlating performance
on two halves of a test (split-half reliability); the value of the Pearson product-moment correlation coefficient is adjusted with the
Spearman-Brown prediction formula
to correspond to the correlation between two full-length tests. A commonly used measure is Cronbach's α, which is equivalent to the mean of all
possible split-half coefficients. Stability over repeated measures (test-retest reliability) is assessed with the Pearson coefficient, as is the
equivalence of different versions of the same measure (parallel-forms reliability; different forms of an intelligence test, for example). Other measures are
also used.
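As a minimal sketch of these calculations, the functions below compute an odd-even split-half coefficient stepped up with the Spearman-Brown formula, and Cronbach's α from item and total-score variances. The function names and the small score matrix are illustrative assumptions, not part of any standard library, and the odd-even split is only one of many possible ways to halve a test.

```python
import numpy as np

def spearman_brown(r_half, factor=2.0):
    """Predicted reliability of a test lengthened by `factor`,
    given the correlation r_half between the shorter parts."""
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)

def split_half_reliability(scores):
    """Odd-even split-half reliability, adjusted to full test length.
    `scores` is a respondents x items array."""
    odd_total = scores[:, 0::2].sum(axis=1)
    even_total = scores[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd_total, even_total)[0, 1]
    return spearman_brown(r_half)

def cronbach_alpha(scores):
    """Cronbach's alpha from item variances and total-score variance."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_variances.sum() / total_variance)

# Made-up responses: 6 respondents, 4 items, each scored 1 to 5.
scores = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
    [2, 1, 2, 1],
], dtype=float)

print("split-half (Spearman-Brown):", split_half_reliability(scores))
print("Cronbach's alpha:", cronbach_alpha(scores))
```

With real data the split-half value depends on how the items are divided, which is one reason α is often reported instead.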
Validity may be assessed by correlating measures with a criterion measure known to be valid. When the criterion measure is
collected at the same time as the measure being validated the goal is to establish concurrent validity; when the criterion
is collected later the goal is to establish predictive validity. A measure has construct validity if it is related
to other variables as required by theory. Content validity, or face validity, is simply a demonstration that the items of
a test are drawn from the domain being measured; it does not guarantee that the test actually measures phenomena in that
domain.
Predictive or concurrent validity cannot exceed the square root of the reliability of the measure, that is, the square root of the correlation between two versions of the same measure.
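The limit that reliability places on validity can be written out under the usual classical test theory assumptions (errors uncorrelated with true scores and with each other). The symbols here are introduced only for this sketch: r_{XY} is the observed correlation between the measure X and the criterion Y, r_{T_X T_Y} is the correlation between their true scores, and r_{XX'} and r_{YY'} are the reliabilities of the two measures.

```latex
r_{XY} \;=\; r_{T_X T_Y}\,\sqrt{r_{XX'}\,r_{YY'}}
\;\le\; \sqrt{r_{XX'}\,r_{YY'}}
\;\le\; \sqrt{r_{XX'}}
```

For instance, a measure with a reliability of 0.81 cannot correlate more than 0.90 with any criterion, since the square root of 0.81 is 0.90.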
Item response theory models the relationship between latent traits and responses to test items. Among other advantages, it can provide an
estimate of a test-taker's position on the latent trait that does not depend on which particular items were administered. For example, a university student's knowledge of
history can be estimated from his or her responses on a university test and then compared with a high school student's
knowledge estimated from a less difficult test, provided the items of both tests have been calibrated on a common scale. Scores derived by classical test theory do not have this characteristic, and
actual ability (rather than ability relative to other test-takers) must be assessed by comparing scores to those of
a norm group randomly selected from the population. In fact, all measures derived from
classical test theory are dependent on the sample tested, while those derived from item response theory are, in principle, not, as long as the model fits the data.
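The history-test comparison can be illustrated with a small sketch. It assumes a two-parameter logistic (2PL) model, which is one common item response model and is not named in the text above. It also assumes the item parameters are already known and expressed on a common scale. All parameter values and the function names p_correct and estimate_theta are hypothetical.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response function: probability of a correct answer for
    ability theta, item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_theta(responses, a, b, grid=np.linspace(-4.0, 4.0, 801)):
    """Maximum-likelihood ability estimate found by a simple grid search.
    `responses` holds 1 (correct) or 0 (incorrect) for each item."""
    p = p_correct(grid[:, None], a[None, :], b[None, :])
    log_lik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    return float(grid[np.argmax(log_lik)])

# Hypothetical item parameters, assumed to be calibrated on one scale.
a = np.ones(3)                              # equal discriminations
b_high_school = np.array([-2.0, -1.0, 0.0])  # easier items
b_university = np.array([0.5, 1.5, 2.5])     # harder items

# Both students answer two of their three items correctly.
responses = np.array([1, 1, 0])

theta_high_school = estimate_theta(responses, a, b_high_school)
theta_university = estimate_theta(responses, a, b_university)

print("high school student ability estimate:", theta_high_school)
print("university student ability estimate:", theta_university)
```

Although both students have the same raw score, the estimate for the university student is higher because the items answered were more difficult. A raw or percentage score alone would not show this difference.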
See also standardized test.
|