At a higher level, using Iteman output involves two steps: first, identifying which items perform poorly, and second, diagnosing the problems present in those items. This drives the ultimate goal of improving reliability and validity. The following are some definitions of, and considerations for, item statistics.
Item Difficulty
The P value (Multiple Choice)
The P value is the proportion of examinees that answered an item correctly (or in the keyed direction). It ranges from 0.0 to 1.0. A high value means that the item is easy, and a low value means that the item is difficult.
The minimum P value bound represents what you consider the cut point for an item being too difficult. For a relatively easy test, you might specify 0.50 as a minimum, meaning that at least 50% of the examinees must answer the item correctly. For a test where we expect examinees to do poorly, the minimum might be lowered to 0.4 or even 0.3. The minimum should take into account the possibility of guessing: if the item is multiple-choice with four options, there is a 25% chance of randomly guessing the answer, so the minimum should probably not be set as low as 0.20, which is below the chance level.
The maximum P value represents the cut point for what you consider to be an item that is too easy. The primary consideration here is that if an item is so easy that nearly everyone gets it correct, it is not providing much information about the examinees. In fact, items with a P of 0.95 or higher typically have very poor point-biserial correlations.
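As a minimal sketch of how these bounds might be applied, the snippet below computes a P value from raw responses and flags it against a minimum and maximum; the function names and the example data are illustrative, not Iteman's internals, and the 0.25/0.95 defaults follow the four-option guessing note and the very-easy-item note above.

```python
def p_value(responses, key):
    """Proportion of examinees answering in the keyed direction."""
    return sum(1 for r in responses if r == key) / len(responses)

def flag_p(p, p_min=0.25, p_max=0.95):
    """Flag an item whose difficulty falls outside the chosen bounds."""
    if p < p_min:
        return "too difficult"
    if p > p_max:
        return "too easy"
    return None

# Hypothetical data: 10 examinees, keyed answer "B"
responses = list("BBABBCBBDB")
p = p_value(responses, "B")   # 7 of 10 correct -> 0.7
print(p, flag_p(p))           # 0.7, not flagged
```

In practice the bounds would be chosen per test, as described above, rather than hard-coded.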
The Item Mean (Polytomous)
The item mean is the average of the item responses converted to numeric values across all examinees. The range of the item mean is dependent on the number of categories and whether the item responses begin at 0. The interpretation of the item mean depends on the type of item (rating scale or partial credit). A good rating scale item will have an item mean close to ½ of the maximum, as this means that on average, examinees are not endorsing categories near the extremes of the continuum.
The minimum item mean bound represents what you consider the cut point for the item mean being too low.
The maximum item mean bound represents what you consider the cut point for the item mean being too high. The number of categories for the items must be considered when setting the minimum/maximum bounds; otherwise, all items of a certain type (e.g., all 3-category items) might be flagged.
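A short illustration of the point above, with hypothetical data: a 5-category rating-scale item scored 1-5 has an ideal mean near 3 (half the maximum), and the flagging bounds shown (1.5 and 4.5) are example choices for that scoring, not fixed values.

```python
def item_mean(responses):
    """Average numeric response across examinees."""
    return sum(responses) / len(responses)

# 5-category rating-scale item scored 1-5; ideal mean is near 3
ratings = [3, 4, 2, 5, 3, 3, 4, 2]
m = item_mean(ratings)  # 26 / 8 = 3.25
print(m)

# Bounds must reflect the category count: with 1-5 scoring, cut points
# of, say, 1.5 and 4.5 would catch items pushed toward either extreme.
too_low, too_high = 1.5, 4.5
print(m < too_low or m > too_high)  # not flagged
```

Note that the same 1.5/4.5 bounds applied to items scored 0-2 would flag every such item, which is exactly the pitfall described above.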
Item Discrimination
Multiple Choice Items
The item point-biserial (r-pbis) correlation. The Pearson point-biserial correlation (r-pbis) is a measure of the discrimination, or differentiating strength, of the item. It ranges from -1.0 to 1.0. A good item is able to differentiate between examinees of high and low ability and will have a high point-biserial, though rarely above 0.50. A negative point-biserial is indicative of a very poor item, because it means the high-ability examinees are answering the item incorrectly while the low-ability examinees are answering it correctly. A point-biserial of 0.0 provides no differentiation between low-scoring and high-scoring examinees, essentially random “noise.”
The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. This is typically a small positive number, like 0.10 or 0.20. If your sample size is small, it could possibly be reduced.
The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the r-pbis be as high as possible.
The item biserial (r-bis) correlation. The biserial correlation is also a measure of the discrimination, or differentiating strength, of the item. It ranges from -1.0 to 1.0. The biserial correlation is computed between the item and total score as if the item was a continuous measure of the trait. Since the biserial is an estimate of Pearson’s r it will be larger in absolute magnitude than the corresponding point-biserial. The biserial makes the stricter assumption that the score distribution is normal. The biserial correlation is not recommended for traits where the score distribution is known to be non-normal (e.g., pathology).
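The two indices above can be sketched as follows, with hypothetical item and total-score vectors. The r-pbis is simply the Pearson correlation between the 0/1 item scores and total scores; the r-bis shown uses the standard rescaling r-bis = r-pbis · sqrt(p·q) / φ(z), which assumes a normal latent trait underlies the dichotomous response, matching the normality assumption noted above.

```python
from math import sqrt
from statistics import NormalDist

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def point_biserial(item, total):
    """r-pbis: Pearson correlation between 0/1 item scores and total scores."""
    return pearson(item, total)

def biserial(item, total):
    """r-bis: rescales r-pbis assuming a normal latent trait under the item."""
    p = sum(item) / len(item)
    z = NormalDist().inv_cdf(p)   # cut point on the assumed latent scale
    phi = NormalDist().pdf(z)     # normal density at that cut point
    return point_biserial(item, total) * sqrt(p * (1 - p)) / phi

item  = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
total = [9, 5, 4, 7, 6, 8, 3, 5, 4, 9]
print(point_biserial(item, total), biserial(item, total))
```

As the text notes, the biserial is larger in absolute magnitude than the corresponding point-biserial, which the example data reproduce.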
Polytomous Items
Pearson’s r correlation. The Pearson’s r correlation is the product-moment correlation between the item responses (as numeric values) and total score. It ranges from -1.0 to 1.0. The r correlation indexes the linear relationship between item score and total score and assumes that the item responses for an item form a continuous variable. The r correlation and the r-pbis are equivalent for a 2-category item.
The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. Since the typical r correlation (0.5) will be larger than the typical r-pbis correlation (0.3), you may wish to set the lower bound higher for a test with polytomous items (0.2 to 0.3). If your sample size is small, it could possibly be reduced.
The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the item-total correlation be as high as possible.
Eta coefficient. The eta coefficient is computed using an analysis of variance with the item response as the independent variable and total score as the dependent variable. The eta coefficient is the square root of the ratio of the between-groups sum of squares to the total sum of squares and has a range of 0 to 1. The eta coefficient does not assume that the item responses are continuous and also does not assume a linear relationship between the item response and total score. As a result, the eta coefficient will always be equal to or greater than Pearson's r. Note that the biserial correlation will be reported if the item has only 2 categories.
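The ANOVA-based computation described above can be sketched as below, with hypothetical data. The example deliberately uses a nonlinear response/total relationship so that eta exceeds the absolute value of Pearson's r, as the text states; the function names are illustrative, not Iteman's.

```python
from math import sqrt
from collections import defaultdict

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

def eta(item, total):
    """Square root of (between-groups SS / total SS), grouping total
    scores by item response category; no linearity is assumed."""
    grand = sum(total) / len(total)
    groups = defaultdict(list)
    for resp, t in zip(item, total):
        groups[resp].append(t)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups.values())
    ss_total = sum((t - grand) ** 2 for t in total)
    return sqrt(ss_between / ss_total)

# 3-category item; the response/total relationship is deliberately nonlinear
item  = [0, 0, 1, 1, 2, 2]
total = [2, 3, 8, 9, 9, 10]
print(eta(item, total), pearson(item, total))  # eta exceeds |r|
```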
DIF Statistics
Differential item functioning (DIF) occurs when the performance of an item differs across groups of examinees. These groups are typically called the reference (usually majority) and focal (usually minority) groups. The goal of this analysis is to flag items that are potentially biased against one group.
There are a number of ways to evaluate DIF. The current version of Iteman utilizes the Mantel-Haenszel statistic, where each group is split into several ability levels, and the probability of a correct response is compared between the focal and reference groups at each level. Results of this analysis are added into both the CSV and RTF output files.
Mantel-Haenszel
The Mantel-Haenszel (M-H) coefficient is reported for each item as an odds ratio. The coefficient is a weighted average of the odds ratios for each θ level. If the odds ratio is less than 1.0, then the item is more likely to be correctly endorsed by the reference group than the focal group. Conversely, odds ratios greater than 1.0 indicate that the focal group was more likely to correctly endorse the item than the reference group. The RTF file contains the overall M-H coefficient for an item; the CSV output file also includes the odds ratios for each θ level. These ratios can be used to determine if the DIF present was constant for all abilities (uniform DIF) or varied conditional on θ (crossing DIF). The M-H coefficient is not sensitive to crossing DIF, so null results should be checked to confirm that there wasn't crossing DIF present.
z-test Statistic
The negative of the natural logarithm of the M-H odds ratio is divided by its standard error to obtain the z-test statistic, which tests the significance of the M-H coefficient against a null of zero DIF (an odds ratio of 1.0). This test statistic is provided in the CSV output file.
p
The two-tailed p value associated with the z test for DIF. Items with p values less than .05 will be flagged as having significant DIF.
Bias Against
The group the item is biased against when the p value is less than .05. In the context of the M-H test for DIF, the group that the item is biased against has a lower probability of a correct response than the other group, controlling for ability level.
M-H Formulas
The Mantel-Haenszel odds ratio for score group $k$ is defined as

$$\alpha_k = \frac{N_{Ck}^{F}\, N_{Ik}^{R}}{N_{Ik}^{F}\, N_{Ck}^{R}}$$

where

$C$ and $I$ denote correct and incorrect responses to the item, respectively,

$R$ denotes the reference group,

$F$ denotes the focal group,

and $N_{Xk}^{G}$ is the number of examinees in group $G$ giving response $X$ within score group $k$.
The Mantel-Haenszel DIF coefficient is a weighted average of the score group odds ratios and is defined as

$$\alpha_{MH} = \frac{\sum_{k} N_{Ck}^{F}\, N_{Ik}^{R} \,/\, N_k}{\sum_{k} N_{Ik}^{F}\, N_{Ck}^{R} \,/\, N_k}$$

where $N_k$ is the number of examinees in score group $k$.
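A minimal sketch of the M-H computation, using hypothetical per-score-group counts (this is an illustration, not Iteman's implementation). Each score group contributes a 2x2 table of correct/incorrect counts for the reference and focal groups; with the focal group's odds in the numerator, a result below 1.0 indicates the reference group was more likely to answer correctly, matching the interpretation above.

```python
def mantel_haenszel(tables):
    """Overall M-H odds ratio from per-score-group 2x2 tables.

    Each table is (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    Focal odds in the numerator: values < 1.0 favor the reference group.
    """
    num = den = 0.0
    for rc, ri, fc, fi in tables:
        n_k = rc + ri + fc + fi          # examinees in this score group
        num += fc * ri / n_k
        den += fi * rc / n_k
    return num / den

# Hypothetical item, three score groups; the reference group is favored
tables = [(30, 20, 20, 30),
          (40, 10, 30, 20),
          (45, 5, 40, 10)]
print(mantel_haenszel(tables))  # below 1.0
```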
Option statistics
Each option has a P value and an r-pbis. The values for the keyed response serve as the statistics for the item as a whole, but it is the values for the incorrect options (the distractors) that provide the opportunity to diagnose issues with the item. A high P for a distractor means that many examinees are choosing that distractor; a high positive r-pbis means that many high-ability examinees are choosing that distractor. Such a situation identifies a distractor that is too attractive, and could possibly be argued as correct.
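The distractor analysis described above can be sketched as follows, with hypothetical responses and total scores; the function and field names are illustrative. Each option is scored 1 if chosen, 0 otherwise, and correlated with total score, so a good distractor shows a negative r-pbis (chosen mainly by low scorers).

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

def option_stats(responses, totals, key):
    """P value and r-pbis for each option; the keyed row doubles as
    the statistics for the item as a whole."""
    out = {}
    for opt in sorted(set(responses)):
        chosen = [1 if r == opt else 0 for r in responses]
        out[opt] = {"P": sum(chosen) / len(chosen),
                    "r_pbis": pearson(chosen, totals),
                    "keyed": opt == key}
    return out

responses = list("BBABBCBBAB")          # keyed answer is "B"
totals    = [9, 8, 4, 7, 5, 3, 8, 6, 2, 9]
for opt, s in option_stats(responses, totals, "B").items():
    print(opt, s)
```

In this hypothetical data the distractors A and C are chosen only by low scorers and so carry negative r-pbis values; a distractor with a high P and a positive r-pbis would be the "too attractive" case described above.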
Conditional Standard Error of Measurement
Both classical test theory and, later, item response theory (IRT) address the concept of the conditional standard error of measurement (CSEM). IRT's approach is more conceptually appealing, as it quantifies the error of the estimated score around the true score on the latent scale. The CSEM from classical test theory cannot address the latent scale, because no latent scale exists in classical test theory; it can only describe error on the test-dependent number-correct (true score) metric. There are two approaches suggested by Frederic Lord, labeled III and IV.
Approach III is the binomial error model:

$$SEM_{III}(x) = \sqrt{\frac{x(n-x)}{n-1}}$$

where

$x$ = number-correct score,

$n$ = number of items.

Approach IV rescales the binomial error using the test's reliability:

$$SEM_{IV}(x) = \sqrt{\frac{x(n-x)}{n-1}\cdot\frac{1-r_{20}}{1-r_{21}}}$$

where

$r_{20}$ = the KR-20 reliability, computed using $\sigma_p^2$, the variance of the proportion correct,

$r_{21}$ = the KR-21 reliability, computed using $\mu$, the mean of the number-correct scores, and $\sigma^2$, the variance of the number-correct scores.
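The binomial error model (approach III) can be sketched as below; the function name is illustrative. It shows the characteristic property of this CSEM: error is largest for middle scores and shrinks to zero at the extremes of the number-correct scale.

```python
from math import sqrt

def csem_binomial(x, n):
    """Binomial-error CSEM for number-correct score x on an n-item test
    (Lord's approach III): sqrt(x(n - x) / (n - 1))."""
    return sqrt(x * (n - x) / (n - 1))

# CSEM peaks at the middle of the score range and is zero at the extremes
n = 50
for x in (0, 10, 25, 40, 50):
    print(x, round(csem_binomial(x, n), 2))
```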