At a high level, using Iteman output involves two steps: first, identify which items perform poorly; second, diagnose the problems present in those items. The following are some definitions of, and considerations for, item statistics.
The P value (Multiple Choice)
The P value is the proportion of examinees that answered an item correctly (or in the keyed direction). It ranges from 0.0 to 1.0; a high value means that the item is easy, and a low value means that the item is difficult.
The minimum P value bound represents what you consider the cut point for an item being too difficult. For a relatively easy test, you might specify 0.50 as a minimum, which means that 50% of the examinees have answered the item correctly. For a test where we expect examinees to do poorly, the minimum might be lowered to 0.4 or even 0.3. The minimum should take into account the possibility of guessing; if the item is multiple-choice with four options, there is a 25% chance of randomly guessing the answer, so the minimum should probably not be 0.20, for example.
The maximum P value represents the cut point for what you consider to be an item that is too easy. The primary consideration here is that if an item is so easy that nearly everyone gets it correct, it is not providing much information about the examinees. In fact, items with a P of 0.95 or higher typically have very poor point-biserial correlations.
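As a minimal sketch of the P value and its bounds, the following Python computes the proportion correct for each item and compares it against a minimum and maximum. The function names, the response data, and the bounds of 0.50 and 0.95 are hypothetical values chosen for illustration; they are not Iteman defaults.

```python
def p_values(responses, key):
    """Return the proportion of examinees who chose the keyed option for each item."""
    n_examinees = len(responses)
    return [
        sum(1 for row in responses if row[i] == key[i]) / n_examinees
        for i in range(len(key))
    ]


def flag_p(p, p_min=0.50, p_max=0.95):
    """Compare one P value against the minimum and maximum bounds."""
    if p < p_min:
        return "too difficult"
    if p > p_max:
        return "too easy"
    return "ok"


# Hypothetical data: four examinees, four items, one response string per examinee.
responses = ["ABCD", "ABDA", "ABDA", "ACDA"]
key = "ABCA"

p = p_values(responses, key)
flags = [flag_p(value) for value in p]
print(list(zip(p, flags)))
```

In this example the first item is answered correctly by everyone and is flagged as too easy, while the third item falls below the minimum bound.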
The Item Mean (Polytomous)
The item mean is the average of the item responses converted to numeric values across all examinees. The range of the item mean is dependent on the number of categories and whether the item responses begin at 0. The interpretation of the item mean depends on the type of item (rating scale or partial credit). A good rating scale item will have an item mean close to ½ of the maximum, as this means that on average, examinees are not endorsing categories near the extremes of the continuum.
The minimum item mean bound represents what you consider the cut point for the item mean being too low.
The maximum item mean bound represents what you consider the cut point for the item mean being too high. The number of categories for the items must be considered when setting the bounds of the minimum/maximum values. This is important as all items of a certain type (e.g., 3-category) might be flagged.
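The effect of the number of categories on the bounds can be illustrated with a short Python sketch. The scores, the scoring ranges, and the bounds of 2.5 and 3.5 are hypothetical; the point is only that a single pair of bounds suited to a 5-category item can flag every item with fewer categories.

```python
def item_mean(scores):
    """Average of the numeric item responses across all examinees."""
    return sum(scores) / len(scores)


def flag_item_mean(mean, mean_min, mean_max):
    """Compare an item mean against the minimum and maximum bounds."""
    if mean < mean_min:
        return "too low"
    if mean > mean_max:
        return "too high"
    return "ok"


# Hypothetical responses from eight examinees.
five_category = [1, 2, 3, 3, 4, 5, 3, 3]   # scored 1 to 5
three_category = [0, 1, 1, 2, 0, 0, 0, 0]  # scored 0 to 2

# Bounds chosen with the 5-category item in mind (assumed values).
mean_min, mean_max = 2.5, 3.5

means = {
    "five_category": item_mean(five_category),
    "three_category": item_mean(three_category),
}
flags = {name: flag_item_mean(m, mean_min, mean_max) for name, m in means.items()}
print(means, flags)
```

Because the 3-category item here can never exceed a mean of 2.0, it would be flagged as too low under these bounds regardless of how well it performs.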
Multiple Choice Items
The item point-biserial (r-pbis) correlation. The Pearson point-biserial correlation (r-pbis) is a measure of the discrimination, or differentiating strength, of the item. It ranges from -1.0 to 1.0. A good item is able to differentiate between examinees of high and low ability, and will have a higher point-biserial, though rarely above 0.50. A negative point-biserial is indicative of a very poor item because the high-ability examinees are answering incorrectly, while the low-ability examinees are answering correctly. A point-biserial of 0.0 provides no differentiation between low-scoring and high-scoring examinees, essentially random “noise.”
The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. This is typically a small positive number, like 0.10 or 0.20. If your sample size is small, it could possibly be reduced.
The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the r-pbis be as high as possible.
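The point-biserial can be sketched in Python as an ordinary Pearson correlation between the 0/1 item score and the total score. The data and the minimum bound of 0.10 are hypothetical, and this sketch assumes the total score includes the item itself; whether Iteman removes the item from the total is not stated here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either variable has no variance
    return sxy / sqrt(sxx * syy)


# Hypothetical data: six examinees, item scored 1 (correct) or 0 (incorrect).
total = [9, 8, 7, 5, 4, 3]
item_good = [1, 1, 1, 0, 0, 0]  # high scorers answer correctly
item_bad = [0, 0, 0, 1, 1, 1]   # high scorers answer incorrectly

r_min = 0.10  # assumed minimum bound
r_good = pearson_r(item_good, total)
r_bad = pearson_r(item_bad, total)
flagged = [name for name, r in [("good", r_good), ("bad", r_bad)] if r < r_min]
print(round(r_good, 4), round(r_bad, 4), flagged)
```

The second item has a negative point-biserial and is the only one that falls below the minimum bound.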
The item biserial (r-bis) correlation. The biserial correlation is also a measure of the discrimination, or differentiating strength, of the item. It ranges from -1.0 to 1.0. The biserial correlation is computed between the item and total score as if the item were a continuous measure of the trait. Because the biserial is an estimate of Pearson’s r, it will be larger in absolute magnitude than the corresponding point-biserial. The biserial makes the stricter assumption that the score distribution is normal. The biserial correlation is not recommended for traits where the score distribution is known to be non-normal (e.g., pathology).
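To illustrate why the biserial exceeds the point-biserial in absolute magnitude, the sketch below uses the standard textbook conversion r-bis = r-pbis × sqrt(p(1 − p)) / y, where y is the height of the standard normal density at the point that cuts off proportion p. This is an assumption about the computation; the exact formula Iteman uses is not given in this section. The function name and input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def biserial_from_point_biserial(r_pbis, p):
    """Convert a point-biserial to a biserial using the normal ordinate.

    p is the item's proportion correct and must lie strictly between 0 and 1;
    NormalDist.inv_cdf raises StatisticsError otherwise.
    """
    standard_normal = NormalDist()
    z = standard_normal.inv_cdf(p)      # cut point on the latent normal scale
    ordinate = standard_normal.pdf(z)   # density height at that cut point
    return r_pbis * sqrt(p * (1 - p)) / ordinate


# Same point-biserial, two different item difficulties (hypothetical values).
r_bis_mid = biserial_from_point_biserial(0.40, 0.50)
r_bis_easy = biserial_from_point_biserial(0.40, 0.90)
print(round(r_bis_mid, 4), round(r_bis_easy, 4))
```

Under this conversion the gap between the two statistics widens as the P value moves away from 0.50.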
Pearson’s r correlation. The Pearson’s r correlation is the product-moment correlation between the item responses (as numeric values) and total score. It ranges from -1.0 to 1.0. The r correlation indexes the linear relationship between item score and total score and assumes that the item responses for an item form a continuous variable. The r correlation and the r-pbis are equivalent for a 2-category item.
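The equivalence of r and r-pbis for a 2-category item can be checked numerically. The sketch below compares a general Pearson correlation with the classic point-biserial formula, (M1 − M0) / s × sqrt(p(1 − p)), where M1 and M0 are the mean total scores of examinees scoring 1 and 0 on the item and s is the population standard deviation of the total score. All data and function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def point_biserial(item, total):
    """Classic point-biserial formula for an item scored 0 or 1."""
    p = mean(item)
    m1 = mean([t for i, t in zip(item, total) if i == 1])
    m0 = mean([t for i, t in zip(item, total) if i == 0])
    return (m1 - m0) / pstdev(total) * sqrt(p * (1 - p))


# Hypothetical 2-category item and total scores for eight examinees.
item = [1, 1, 0, 1, 0, 0, 1, 0]
total = [9, 8, 7, 6, 5, 4, 3, 2]

r_pearson = pearson_r(item, total)
r_formula = point_biserial(item, total)
print(round(r_pearson, 4), round(r_formula, 4))
```

The same `pearson_r` function applies unchanged to a polytomous item; only the item scores take more than two values.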
The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. Because the typical r correlation (0.5) will be larger than the typical r-pbis correlation (0.3), you may wish to set the lower bound higher for a test with polytomous items (0.2 to 0.3). If your sample size is small, it could possibly be reduced.
The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the r correlation be as high as possible.
Differential item functioning (DIF) occurs when the performance of an item differs across groups of examinees. These groups are typically called the reference (usually majority) and focal (usually minority) groups. The goal of this analysis is to flag items that are potentially biased against one group.
There are a number of ways to evaluate DIF. The current version of Xcalibre utilizes the Mantel-Haenszel statistic, where each group is split into several ability levels with the probability of a correct response compared between the focal and reference groups for each level. See Appendix C for the equations. Results of this analysis are added into both the CSV and RTF output files.
The Mantel-Haenszel (M-H) coefficient is reported for each item as an odds ratio. The coefficient is a weighted average of the odds ratios for each θ level. If the odds ratio is less than 1.0, then the item is more likely to be correctly endorsed by the reference group than the focal group. Likewise, odds ratios greater than 1.0 indicate that the focal group was more likely to correctly endorse the item than the reference group. The RTF file contains the overall M-H coefficient for an item; the CSV output file also includes the odds ratios for each θ level. These ratios can be used to determine whether the DIF present was constant for all abilities (uniform DIF) or varied conditional on θ (crossing DIF). The M-H coefficient is not sensitive to crossing DIF, so null results should be checked to confirm that no crossing DIF was present.
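The weighted-average structure of the M-H coefficient can be sketched in Python using the standard Mantel-Haenszel common odds ratio, oriented so that values below 1.0 favor the reference group, as described above. This is an illustration under that assumption, not a reproduction of the Appendix C equations; the counts and function names are hypothetical.

```python
def level_odds_ratio(ref_correct, ref_incorrect, focal_correct, focal_incorrect):
    """Odds of a correct response in the focal group relative to the reference group."""
    return (focal_correct / focal_incorrect) / (ref_correct / ref_incorrect)


def mantel_haenszel_odds_ratio(levels):
    """Common odds ratio pooled across ability levels.

    Each level is (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    The result equals a weighted average of the per-level odds ratios with
    weights focal_incorrect * ref_correct / N for each level.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_c, ref_i, foc_c, foc_i in levels:
        n = ref_c + ref_i + foc_c + foc_i
        numerator += foc_c * ref_i / n
        denominator += foc_i * ref_c / n
    return numerator / denominator


# Hypothetical counts for three ability levels (low, middle, high).
levels = [
    (30, 20, 20, 30),
    (40, 10, 30, 20),
    (45, 5, 40, 10),
]

per_level = [level_odds_ratio(*level) for level in levels]
overall = mantel_haenszel_odds_ratio(levels)
print([round(v, 4) for v in per_level], round(overall, 4))
```

Here every per-level odds ratio is below 1.0, the pattern expected for uniform DIF against the focal group; per-level ratios falling on both sides of 1.0 would instead suggest crossing DIF.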
The negative of the natural logarithm of the M-H odds ratio is divided by its standard error to obtain the z-test statistic used to test the significance of the M-H against a null of zero DIF (odds ratio of 1.0). This test statistic is provided in the CSV output file.
The two-tailed p value is associated with the z-test for DIF. Items with p values less than .05 will be flagged as having significant DIF.
When the p value is less than .05, the output also identifies the group the item is biased against. In the context of the M-H test for DIF, the group that the item is biased against has a lower probability of a correct response than the other group, controlling for ability level.
Each option has a p value and an r-pbis. The values for the keyed response serve as the statistics for the item as a whole, but it is the values for the incorrect options (the distractors) that provide the opportunity to diagnose issues with the item. A high p for a distractor means that many examinees are choosing that distractor; a high positive r-pbis means that many high-ability examinees are choosing that distractor. Such a situation identifies a distractor that is too attractive and could possibly be argued as correct.
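A distractor analysis of this kind can be sketched in Python by computing, for each option, the proportion of examinees choosing it and the point-biserial between choosing it and the total score. The data, function names, and the rule of flagging any distractor with a positive r-pbis are hypothetical choices for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when an option is chosen by all or none
    return sxy / sqrt(sxx * syy)


def option_stats(choices, totals, options):
    """Return {option: (proportion choosing it, r-pbis with total score)}."""
    n = len(choices)
    stats = {}
    for option in options:
        chose = [1 if c == option else 0 for c in choices]
        stats[option] = (sum(chose) / n, pearson_r(chose, totals))
    return stats


# Hypothetical item with key "A": one choice and one total score per examinee.
choices = ["A", "A", "B", "A", "B", "C", "C", "D"]
totals = [10, 9, 9, 8, 7, 4, 3, 2]
key = "A"

stats = option_stats(choices, totals, "ABCD")
# Assumed rule: a distractor with a positive r-pbis deserves a closer look.
suspicious = [opt for opt, (p, r) in stats.items() if opt != key and r > 0]
print(stats, suspicious)
```

In this example option B attracts a quarter of the examinees, including high scorers, so it is the distractor to review for a possible second correct answer.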