Iteman provides three default output files:
- A Microsoft Word DOCX report;
- A CSV file of item statistics;
- A CSV file of examinee scores.
The CSV file of test and item statistics contains the same statistics as the DOCX report, but in CSV form so you can manipulate the data in a spreadsheet or easily upload it into item banking software (e.g., Assess.ai). Because the two are the same, we will only discuss the Word output file here.
Introduction And Specifications Sections
The primary output, the DOCX report, is presented as a formal report that is designed for two purposes. First, it can be provided to test developers or subject matter experts (SMEs) as part of the test development cycle. Second, it serves as validity documentation that can be submitted for accreditation or similar endeavors.
The report begins with a title page followed by summary information of the input specifications (e.g., flag parameters). This is important for historical purposes; if the report is read a few years from now, it will be evident how Iteman was set up to produce the report.
Test-Level Output: Summary Statistics
Next, the report provides test-level summary statistics based on raw number-correct scores. This is done for the total score (scored items only) and all domains or content areas. There are three tables of test-level statistics:
This table provides descriptive statistics, such as the score mean, score SD, minimum, maximum, and mean P and Rpbis. In Table 2: Summary Statistics, we have a 50-item test spread across 4 domains. The average total score was 32.06, ranging from 8 to 50. The average Rpbis was a decent 0.25. Descriptions of these columns are below.
| Label | Explanation |
| --- | --- |
| Score | Portion of the test that the row is describing. |
| Items | Number of items in that portion of the test. |
| Mean | Average number-correct score for that portion of the test. |
| SD | Standard deviation, a measure of dispersion (a range of ± two SDs from the mean includes approximately 95% of the examinees, if their number-correct scores are normally distributed). |
| Min score | Minimum number of items an examinee answered correctly. |
| Max score | Maximum number of items an examinee answered correctly. |
| Mean P | Average item difficulty statistic for that portion; also the average proportion-correct score if there are no omitted responses (not reported if there are no multiple-choice items). |
| Item Mean | Average of the item means for polytomous items (not reported if there are no polytomous items). |
| Mean R | Average item-total correlation for that portion of the test. |
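The test-level statistics above can be sketched in a few lines. This is an illustrative computation on simulated data, not Iteman's actual code; all names (`scored`, `total`, etc.) are assumptions.

```python
import numpy as np

# Simulated scored-response matrix: rows = examinees, columns = items,
# 1 = correct, 0 = incorrect. All names here are illustrative.
rng = np.random.default_rng(0)
scored = (rng.random((200, 10)) < 0.6).astype(int)

total = scored.sum(axis=1)           # number-correct score per examinee

mean_score = total.mean()            # "Mean"
sd_score = total.std(ddof=1)         # "SD"
min_score, max_score = total.min(), total.max()
mean_p = scored.mean(axis=0).mean()  # "Mean P": average proportion correct

# "Mean R": average item-total point-biserial correlation
rpbis = np.array([np.corrcoef(scored[:, j], total)[0, 1]
                  for j in range(scored.shape[1])])
mean_rpbis = rpbis.mean()
```

In practice the same computations are repeated for each domain by summing over only that domain's columns.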
This table provides the average score for examinee groups, overall, and for each domain. Here, examinees are divided into male and female, with the males scoring slightly higher overall.
This table provides the coefficient alpha estimate of reliability and the classical standard error of measurement (SEM) based on alpha. This example has an alpha of 0.81, indicating moderate reliability, though arguably good for only 50 items.
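Coefficient alpha and the classical SEM follow directly from the scored-response matrix. This is a minimal sketch on simulated data using the standard classical formulas; the names and the data-generating model are assumptions, not Iteman's internals.

```python
import numpy as np

# Simulate a 300-examinee, 50-item test with a simple latent-ability model
rng = np.random.default_rng(1)
ability = rng.normal(size=(300, 1))
difficulty = rng.normal(scale=0.5, size=(1, 50))
p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
scored = (rng.random((300, 50)) < p_correct).astype(int)

k = scored.shape[1]
item_var = scored.var(axis=0, ddof=1)
total = scored.sum(axis=1)
total_var = total.var(ddof=1)

# Coefficient alpha: k/(k-1) * (1 - sum of item variances / total-score variance)
alpha = (k / (k - 1)) * (1 - item_var.sum() / total_var)

# Classical SEM based on alpha: SD * sqrt(1 - reliability)
sem = np.sqrt(total_var) * np.sqrt(1 - alpha)
```

A conditional SEM would vary across the score range; this classical version is a single value for the whole test.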
Three forms of split-half reliability are computed. First, the test is randomly divided into two halves and the Pearson product-moment correlation is computed between the total scores on the two halves. Also provided are the split-half correlations between the first and second halves of the test, and between the odd- and even-numbered items. Because these correlations are based on only half the total number of items, the Spearman-Brown corrected correlations are also provided.
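The three splits and the Spearman-Brown correction can be sketched as follows, again on simulated data with illustrative names.

```python
import numpy as np

# Simulate a 300-examinee, 20-item scored matrix
rng = np.random.default_rng(2)
ability = rng.normal(size=(300, 1))
difficulty = rng.normal(scale=0.5, size=(1, 20))
p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
scored = (rng.random((300, 20)) < p_correct).astype(int)
k = scored.shape[1]

def split_half(cols_a, cols_b):
    """Correlate half-test scores; return the raw and Spearman-Brown corrected r."""
    r = np.corrcoef(scored[:, cols_a].sum(axis=1),
                    scored[:, cols_b].sum(axis=1))[0, 1]
    return r, 2 * r / (1 + r)   # corrected r projects to the full-length test

idx = rng.permutation(k)
random_halves = split_half(idx[:k // 2], idx[k // 2:])
first_last = split_half(np.arange(k // 2), np.arange(k // 2, k))
odd_even = split_half(np.arange(0, k, 2), np.arange(1, k, 2))
```

The first-half/second-half split is usually the weakest of the three when item difficulty or content drifts across the form, which is one reason the odd/even split is commonly preferred.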
Table 5 shows an inter-correlation matrix of the domain scores, providing simple insight into the factor structure of the assessment. If alpha is low and one domain does not correlate with the others, it might be loading on an unrelated factor.
Table 6 lists a frequency distribution of scores as well as cumulative frequency.
Item-Level Statistics
Arguably the most useful piece of Iteman output is the quantile plot, a graph that describes the performance of an item. The sample is first divided into 3-7 groups based on overall score; the proportion of each group that selected each response is then calculated, and the points for each response are plotted and connected with a line.
This example displays the pattern typically found for a high-quality item: the line for the correct answer has a strong positive slope and sits above 0.5, while the lines for the incorrect answers sit below 0.5 and/or have negative slopes. The stronger the discrimination of the item (Rpbis), the more positive the slope will be.
This graph tells us that examinees in the lowest 20% of ability are almost equally likely to select the key (C) or the distractor A. A good number, about 24%, also select the distractor B. Hardly anyone selects D. Moving to the right, which indicates higher ability, the proportion selecting the key continues to increase. In the highest group, 79% select the correct answer, with the remainder selecting A or B.
You can see how this plot does a great job of visually depicting how all the answers perform as a function of examinee ability. It is often just as informative for bad items, where we can dissect the graph to determine likely causes, such as a distractor being too attractive or two answers being arguably correct.
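The grouping-and-proportion computation behind the quantile plot can be sketched like this. The simulated data and names (`responses`, `options`, five groups) are illustrative assumptions.

```python
import numpy as np

# Simulate overall scores and one item's selected options for 500 examinees
rng = np.random.default_rng(3)
n = 500
total = rng.integers(10, 51, size=n)        # overall number-correct scores
options = np.array(["A", "B", "C", "D"])
responses = rng.choice(options, size=n)     # choices on a single item

n_groups = 5
# Interior quantile cutoffs split the sample into equal-sized score groups
cutoffs = np.quantile(total, np.linspace(0, 1, n_groups + 1)[1:-1])
group = np.searchsorted(cutoffs, total, side="right")   # 0 = lowest scorers

# props[g, o] = proportion of group g that chose option o; each column of
# points becomes one line on the quantile plot
props = np.array([[(responses[group == g] == opt).mean() for opt in options]
                  for g in range(n_groups)])
```

For a real item the column for the key would rise across groups while the distractor columns fall; here the responses are random, so all four lines hover near 0.25.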
Several tables of statistics are provided for each item. The first two provide summary statistics for the item as a whole, the third provides option statistics, and the fourth presents the quantile plot in numeric form.
For a deeper review about interpreting classical item statistics, see this blog post.
Definitions for the fields in the first two tables are provided below for dichotomous and polytomous items.
Multiple-Choice Items
| Label | Explanation |
| --- | --- |
| N | Number of examinees that responded to the item. |
| P | Proportion correct. |
| Domain Rpbis* | Point-biserial correlation of the keyed response with the domain score. |
| Domain Rbis* | Biserial correlation of the keyed response with the domain score. |
| Total Rpbis | Point-biserial correlation of the keyed response with the total score. |
| Total Rbis | Biserial correlation of the keyed response with the total score. |
| Alpha w/o | Coefficient alpha of the test if the item were removed. |
| Flags | Any flags, given the bounds provided: LP = Low P, HP = High P, LR = Low Rpbis, HR = High Rpbis, K = Key error (Rpbis for a distractor is higher than Rpbis for the key), DIF = significant DIF test result. |
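The core dichotomous statistics in this table can be sketched as below. This is a textbook computation on simulated data, not Iteman's implementation; the names are illustrative, and note that some programs correlate the item with the rest score (total minus the item) rather than the total.

```python
import numpy as np
from statistics import NormalDist

# Simulate a 400-examinee, 25-item scored matrix
rng = np.random.default_rng(4)
ability = rng.normal(size=(400, 1))
difficulty = rng.normal(scale=0.5, size=(1, 25))
scored = (rng.random((400, 25))
          < 1.0 / (1.0 + np.exp(-(ability - difficulty)))).astype(int)
total = scored.sum(axis=1)

item = scored[:, 0]
p = item.mean()                       # P: proportion correct

# Total Rpbis: point-biserial correlation of the item with the total score
rpbis = np.corrcoef(item, total)[0, 1]

# Total Rbis: biserial correlation, rescaling Rpbis by the normal ordinate at P
nd = NormalDist()
rbis = rpbis * np.sqrt(p * (1 - p)) / nd.pdf(nd.inv_cdf(p))

def coef_alpha(mat):
    k = mat.shape[1]
    return (k / (k - 1)) * (1 - mat.var(axis=0, ddof=1).sum()
                            / mat.sum(axis=1).var(ddof=1))

alpha_without = coef_alpha(np.delete(scored, 0, axis=1))   # "Alpha w/o" item 1
```

The biserial is always at least as large in magnitude as the point-biserial, which is why the two columns can tell slightly different stories for very easy or very hard items.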
Polytomous Items
| Label | Explanation |
| --- | --- |
| N | Number of examinees that responded to the item. |
| Mean | Average score for the item. |
| Domain r* | Correlation of the item (Pearson's r) with the domain score. |
| Domain Eta*⁺ | Coefficient eta from an ANOVA using item and domain scores. |
| Total r | Correlation of the item (Pearson's r) with the total score. |
| Total Eta⁺ | Coefficient eta from an ANOVA using item and total scores. |
| Alpha w/o | Coefficient alpha of the test if the item were removed. |
| Flags | Any flags, given the bounds provided; same as for dichotomous items, except that flags are based on the item mean instead of P. |
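Coefficient eta comes from a one-way ANOVA of total scores grouped by the polytomous item score. The sketch below uses simulated data and illustrative names; the formula shown (the square root of the between-groups share of the total sum of squares) is the standard one.

```python
import numpy as np

# Simulate a 0-4 rubric-scored item and a correlated total test score
rng = np.random.default_rng(5)
ability = rng.normal(size=400)
item = np.clip(np.round(ability + rng.normal(scale=0.8, size=400)) + 2, 0, 4)
total = 50 + 10 * ability + rng.normal(scale=3, size=400)

# One-way ANOVA: groups are the item score levels, outcome is the total score
grand_mean = total.mean()
ss_total = ((total - grand_mean) ** 2).sum()
ss_between = sum(
    (item == s).sum() * (total[item == s].mean() - grand_mean) ** 2
    for s in np.unique(item)
)
eta = np.sqrt(ss_between / ss_total)
```

Eta captures any (possibly nonlinear) relationship between item and total score, so it is always at least as large as the absolute Pearson r; a large gap between the two columns hints at a nonlinear item-total relationship.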