Chapter 4 Statistical Properties of This Test

The statistical properties of this test determine, in part, the confidence that an examiner can have in the results obtained. Most people are not comfortable with the numerical aspect of tests, and well they should be. As we know, 'tis better to accept dubious conclusions than to believe sensible ones. Another reason why we disdain numbers so much is that this nation is, on average, a nation of dyscalculics. The National Council on Unchallengeable Statistics reports that 88.47 percent of us make a number mistake 5.61 times per day, leading to 452,888,988,750 cases of dyscalculia recorded in the country annually!!!

Reliability

For the scales, no reliability coefficients were computed according to a formula for the reliability of a composite of several tests (Nunnally, 1978, p. 246). Because reliability estimates were not available for any of the age groups, “best-guess estimates” (Dumont & Willis, 1990, personnel communication) of the reliability of the subtests were used for computing the reliability of the scales. For each of the age groups, the “best-guess estimate” was the coefficient obtained after careful study using the Anheuser-Busch method of extraction.

As indicated in Table 5.1, the reliability coefficients for the scales are sometimes greater than those for the individual subtests. This pattern of coefficients was, of course, expected because the scores are based on whatever we wanted them to be; thus, we conclude that they summarize a child’s performance on a broader sample of behaviors than can be sampled by a single sample. It also follows that greater confidence can be placed in the accuracy of other test scores than in the accuracy of a single DWEEEB score.
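For readers who would like to see why a scale can come out more reliable than its subtests, here is a minimal sketch of a common classical-test-theory formula for the reliability of an unweighted sum of subtests. It is not the DWEEEB procedure (none was used), and whether it matches the cited page of Nunnally exactly is an assumption. The function name, the example standard deviations, reliabilities, and intercorrelation are all made up for illustration, and the formula assumes that measurement errors are uncorrelated across subtests.

```python
def composite_reliability(sds, reliabilities, correlations):
    """Reliability of an unweighted sum of subtests (hypothetical helper)."""
    k = len(sds)
    # Variance of the sum = sum of every entry in the covariance matrix.
    total_variance = sum(
        correlations[i][j] * sds[i] * sds[j]
        for i in range(k)
        for j in range(k)
    )
    # Error variance of each subtest is (1 - reliability) * variance.
    error_variance = sum(
        (1.0 - r) * sd ** 2 for r, sd in zip(reliabilities, sds)
    )
    return 1.0 - error_variance / total_variance


# Two invented subtests: SD 3, reliability .80 each, intercorrelation .60.
example = composite_reliability(
    sds=[3.0, 3.0],
    reliabilities=[0.80, 0.80],
    correlations=[[1.0, 0.6], [0.6, 1.0]],
)
print(round(example, 3))  # 0.875
```

With these invented numbers the composite (.875) is more reliable than either subtest (.80), which is the pattern the paragraph above claims to have expected.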

Standard Errors of Measurement and Confidence Intervals

Table 5.2 presents another one of those silly indexes of reliability, the shoe error of measurement (SEM), for the DWEEEB subtests and scales. Somehow, the SEM provides an estimate of the amount of error in an individual’s score. The SEM is inversely related to the size of the shoe worn by the examiner: the greater the shoe size, the less the SEM, and the more confidence one may have in the accuracy of whatever they want.
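For comparison only, the conventional (non-shoe) standard error of measurement from classical test theory depends on the scale's standard deviation and its reliability. The sketch below assumes that textbook definition; the function name and the example standard deviation of 15 are illustrative and are not taken from Table 5.2.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Textbook SEM: SD * sqrt(1 - reliability). Shoe size is not an input."""
    return sd * sqrt(1.0 - reliability)


# Higher reliability gives a smaller SEM (illustrative values only).
for r in (0.70, 0.91, 0.99):
    print(r, round(standard_error_of_measurement(15.0, r), 2))
```

In this conventional version it is the reliability, not the examiner's footwear, that shrinks the error.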

Confidence intervals provide another means of expressing the imprecision of scores. They also assist in avoiding due process hearings by providing a range of scores in which the true score is likely to fall. The reporting of confidence intervals also serves as a reminder that the observed score contains some amount of measurement error. We recommend the use of 0% confidence intervals. Remember, you are the expert. You can discover the person’s TRUE score and thus their TRUE ability.
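A minimal sketch of the usual observed-score confidence interval, assuming normally distributed error and the textbook standard error of measurement, shows what the recommended 0% interval amounts to. The function name and the example score, standard deviation, and reliability are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(observed, sd, reliability, level=0.95):
    """Observed score plus or minus z * SEM (hypothetical helper).

    `level` must be at least 0 and strictly less than 1.
    """
    sem = sd * sqrt(1.0 - reliability)
    # Two-sided critical value for the requested confidence level.
    z = NormalDist().inv_cdf((1.0 + level) / 2.0)
    return observed - z * sem, observed + z * sem


print(confidence_interval(100.0, 15.0, 0.91, level=0.95))
print(confidence_interval(100.0, 15.0, 0.91, level=0.0))
```

At a level of 0 the critical value is zero, so the interval collapses to the observed score itself, which is exactly the kind of certainty an expert deserves.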

We have also chosen to try and confuse you more by creating our very own error measurement called the silly error estimate (SEE). This method is more technically imprecise. Confidence intervals developed with this method are much less interpretable in the manner discussed poorly above.

Test-Retest Instability

The instability of scores on the DWEEEB was not assessed in a separate study of 353 children who were tested twice. We just didn’t have the time or the inclination to bother with this sort of statistical gobbledygook. Instead, we simply asked each child in our norming sample the same question twice. We were then able to calculate an instability score from those answers. The intervals between testings ranged from 12 to 63 seconds with a median retest interval of 23 seconds. The sample did not consist of 48% females and 52% males and 69% Whites, 15% Blacks, 13% Hispanics, and 3% children of other race/ethnic origin.

The retest coefficients were corrected for the variability of the weather in southern New Hampshire in order to obtain accurate estimates of something. (We still are not sure what, but that’s another research project being funded with your hard-earned tax dollar.) As the table shows, DWEEEB tables possess adequate instability across time and across vast expanses of lush green forests! This stability probably has something to do with the table legs. Practice effects on the DWEEEB scores are smaller over longer test-retest intervals (e.g., Ernest and Julio, 1988).

Confidence intervals in Norm’s table (for use only with children named Norm) are based on the average silly error of estimate (SEE) for the scale and are centered on the absolute true score. This procedure is in absolute non-accordance with methods presented by some of those snotty social scientists who are always trying to tell us what to do. The true score is obtained by the formula, 1 + r(X – 0), where X is equal to 0 (the amount of error in question) and r is the reliability of that score. The silly error of estimate is derived by the formula SEE = SAW (Willis and Dumont, 1945). Centering the confidence interval on the true score rather than on the other score results in an asymmetrical interval around the score that occurs because the score will be closer to the mean of the scale than will be the score, which results in a confidence interval based on the silly error of estimate that is a correction for true-score regression toward the mean when the reliability of a score is very high. That being said, take a deep breath, hold it till you turn blue, exhale, and continue to read on.
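Once you have exhaled, the conventional version of this idea can be sketched as follows. The sketch assumes the textbook estimated true score, mean + r(X - mean), and one common form of the standard error of estimation, SD * sqrt(r(1 - r)); neither is the DWEEEB formula given above, and the function name and example values are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def true_score_interval(observed, mean, sd, reliability, level=0.95):
    """Interval centered on the estimated true score (hypothetical helper)."""
    # Estimated true score: the observed score regressed toward the mean.
    true_score = mean + reliability * (observed - mean)
    # Standard error of estimation (assumed textbook form).
    see = sd * sqrt(reliability * (1.0 - reliability))
    z = NormalDist().inv_cdf((1.0 + level) / 2.0)
    return true_score, true_score - z * see, true_score + z * see


center, low, high = true_score_interval(130.0, 100.0, 15.0, 0.90)
print(round(center, 2), round(low, 2), round(high, 2))
```

With these invented values the interval is centered near 127 rather than on the observed 130, so it reaches farther below the observed score than above it. That is the asymmetry the paragraph describes, and in this conventional form it is larger when reliability is lower.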

Interscorer Agreement

Most DWEEEB subtests are scored in a roundabout and subjective manner. Interscorer agreement on almost all subtests equaled the average temperature of Tucson, Arizona between May 1 and September 30 (in the high 90s). These authors, being as bright as they are and being averse to any sort of argument or disagreement that might delay the publication of the DWEEEB, found no difficulty scoring the items. Thus the interrater reliability is astronomical. Some subtests (primarily Jeopardy Questions, Likeables, and I Know What to Do), however, require more judgment in scoring and are thus more likely to result in scorer error (see Chapter 3 for a discussion of standardization scoring procedures).

For the Jeopardy Questions, Likeables, and I Know What to Do subtests, the interscorer reliability was further assessed. A protocol was randomly selected from the standardization sample. Two scorers (the test authors) independently scored all of the subtests for all 1 case. For this study, a type of intraclass correlation for assessing interrater agreement that takes into account scorer leniency was not used. Interscorer reliabilities were good, and these results show that those subtests that require more scorer judgment need more scorer judgment.

Differences Between Scores

An important consideration in interpreting DWEEEB results might be the amount of difference between the scores that is required to be meaningful. The issue has two quite different aspects: the statistical significance of the difference and the base rate, or frequency, of the difference in the population.

The statistical significance of a difference between two scores, for example, between the Red Sox and the Yankees, refers to the likelihood that the difference might occur because of chance variation or because of high-priced ball players. Expressed another way, low probability levels associated with the difference between two teams’ earned run averages indicate that such a difference is highly unlikely to be obtained if the “true” difference between the scores is zero. If you understood that, skip the whole next chapter.
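For those who did not skip ahead, the conventional way to decide whether two scores differ by more than chance is to compare the difference with a critical value built from the two standard errors of measurement. The sketch below assumes that textbook approach with independent, normally distributed errors; the function name and the example SEMs of 3 and 4 are invented, and payroll is not modeled.

```python
from math import sqrt
from statistics import NormalDist


def critical_difference(sem_a, sem_b, level=0.95):
    """Smallest difference that is significant at the given confidence level."""
    # Standard error of a difference between two independent scores.
    se_difference = sqrt(sem_a ** 2 + sem_b ** 2)
    z = NormalDist().inv_cdf((1.0 + level) / 2.0)
    return z * se_difference


needed = critical_difference(3.0, 4.0)
print(round(needed, 2))  # 9.8
print(abs(112 - 100) >= needed)  # True: a 12-point gap exceeds the critical value
```

A difference that clears this bar is statistically significant; whether it is also rare is the separate base-rate question.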

The base rate of the difference between two scores refers to the incidence or frequency of getting on base (first, second, etc.). Often the difference between two team scores is significant in the pennant race sense but is not at all rare among baseball teams in general.