Reliability & Validity
The importance of a test achieving a reasonable level of reliability and validity cannot be overemphasized. To the extent that a test lacks reliability, the meaning of individual scores is ambiguous. A score of 80, say, may be no different from a score of 70 or 90 in terms of what a student knows, as measured by the test. If a test is not reliable, it is not valid.
Reliability of a Test
Despite differences in the format and construction of various tests, there are two standards by which tests (as compared to items) are assessed. These two standards are reliability and validity.
Reliability refers to the consistency of test scores, that is, how consistent a particular student’s scores are from one testing to another. In theory, if test A is administered to Class X and is administered again to the same class one week later, individual scores should be about the same both times (assuming unchanging conditions for both sessions, including familiarity with the test). If the students received radically different scores the second time, the test would have low reliability.
Seldom, however, does a teacher administer a test to the same students more than once, so the reliability coefficient must be calculated a different way. Conceptually, this is done by dividing a homogeneous test into two parts (usually the even-numbered and odd-numbered items) and treating them as two tests administered at one sitting. The calculation of the reliability coefficient, in effect, compares all possible halves of the test to all other possible halves.
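The odd/even split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: items are scored 1 (correct) or 0 (incorrect), the function names are hypothetical, and the Spearman-Brown step that adjusts the half-test correlation to full test length is a standard addition that the text above does not mention.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("half-test scores must vary across students")
    return sxy / sqrt(sxx * syy)

def split_half_reliability(responses):
    """responses: one row per student, each row a list of 0/1 item scores.

    Returns (half-test correlation, Spearman-Brown full-length estimate).
    """
    odd_scores = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_scores = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson(odd_scores, even_scores)
    r_full = 2 * r_half / (1 + r_half)  # assumed Spearman-Brown correction
    return r_half, r_full

# Made-up scores for five students on a four-item test.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
r_half, r_full = split_half_reliability(scores)
print(round(r_half, 3), round(r_full, 2))  # prints 0.786 0.88
```

A single odd/even split is only one of many possible splits, which is why a formula that in effect averages over all of them is preferred in practice.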
One of the best estimates of the reliability of test scores from a single administration of a test is provided by the Kuder-Richardson Formula 20 (KR20). On the attached “Standard Item Analysis Report,” it is found in the top center area. For example, in this report the reliability coefficient is .87. For good classroom tests, the reliability coefficient should be .70 or higher.
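For readers who want to see how a KR20 value is produced, here is a minimal sketch of the standard formula, KR20 = (k / (k - 1)) × (1 - Σpq / variance of total scores), where k is the number of items, p is the proportion of students answering an item correctly, and q = 1 - p. It assumes items scored 1 or 0 and uses the population variance of total scores; the function name and sample data are hypothetical, and the report software may use a slightly different variance convention.

```python
from statistics import pvariance

def kr20(responses):
    """Kuder-Richardson Formula 20 for dichotomously scored (0/1) items.

    responses: one row per student, each row a list of 0/1 item scores.
    """
    n_students = len(responses)
    k = len(responses[0])                      # number of items
    totals = [sum(row) for row in responses]   # each student's total score
    var_total = pvariance(totals)              # population variance (assumption)
    if k < 2 or var_total == 0:
        raise ValueError("need at least two items and varying total scores")
    pq_sum = 0.0
    for item in range(k):
        p = sum(row[item] for row in responses) / n_students  # proportion correct
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_total)

# Made-up scores for five students on a four-item test.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(scores), 2))  # prints 0.8
```

By the .70 guideline above, this small made-up test would count as acceptably reliable, although a real classroom test would have far more items and students.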
To increase the likelihood of obtaining higher reliability, a teacher can:
- increase the length of the test;
- include questions that measure higher, more complex levels of learning;
- include questions that span a range of difficulty, with most questions in the middle range; and
- if one or more essay questions are included on the test, grade them as objectively as possible.
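The effect of test length in the first suggestion can be made concrete with the Spearman-Brown formula, a standard result that is not part of the report itself. Here r is the current reliability, k is the factor by which the test is lengthened, and the added questions are assumed to be comparable in quality to the existing ones.

```latex
r_{\text{new}} = \frac{k\,r}{1 + (k - 1)\,r}
\qquad \text{e.g. } r = 0.60,\; k = 2: \quad
r_{\text{new}} = \frac{2(0.60)}{1 + 0.60} = 0.75
```

Under these assumptions, doubling a test whose reliability is .60 would be expected to raise the coefficient to about .75.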
Validity of a Test
Content or curricular validity is generally used to assess whether a classroom test is measuring what it is supposed to measure. For example, a test is said to have content validity if it closely parallels the material that has been taught and the thinking skills that have been important in the course. Whereas reliability is expressed as a quantitative measure (e.g., .87 reliability), content validity is obtained through a rational or logical analysis of the test. That is, one logically compares the test content with the course content and determines how well the former represents the latter.
A quantitative method of assessing test validity is to examine each test item. This is accomplished by reviewing the discrimination (IDis) of each item. If an item has a discrimination measure of 25 percent or higher, it is said to have validity; that is, it is doing what it is supposed to be doing: discriminating between those who are knowledgeable and those who are not.
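The text does not say how the report computes IDis. One common approach, assumed here purely for illustration, compares the proportion of high scorers and low scorers who answer an item correctly, using the top and bottom 27 percent of students by total score. The function name, the `fraction` parameter, and the sample data are all hypothetical.

```python
def discrimination_index(responses, item, fraction=0.27):
    """Upper-lower discrimination index for one item (assumed method).

    responses: one row per student, each row a list of 0/1 item scores.
    item: zero-based position of the item to examine.
    fraction: share of students placed in each of the upper and lower groups.
    """
    ranked = sorted(responses, key=sum, reverse=True)  # highest totals first
    group_size = max(1, round(len(ranked) * fraction))
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    p_upper = sum(row[item] for row in upper) / group_size
    p_lower = sum(row[item] for row in lower) / group_size
    return p_upper - p_lower

# Made-up scores for five students on a four-item test.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# The class is tiny, so a larger fraction is used to get two students per group.
d = discrimination_index(scores, item=3, fraction=0.4)
print(d)          # prints 0.5
print(d >= 0.25)  # prints True, meeting the 25 percent guideline
```

A value near zero, or a negative value, would suggest that the item does not separate stronger from weaker students and deserves review.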