KAI Evaluation Framework
A prototype for scaling rubric-based scoring of student reading comprehension as KAI expands from one randomized controlled trial (RCT) to three pilot contexts. Grounded in the Lemons/Balasubramanian research group's work on reading intervention for students with intellectual and developmental disabilities (IDD).
Grounded in the science-of-reading framework
Stress-testing across response quality levels
Disagreement flagged for human review
Evaluator correctly scored a non-standard-grammar response high on comprehension
The Rubric
The rubric is grounded in the science-of-reading framework and the published work of the Lemons and Balasubramanian research group on multicomponent reading intervention for students with IDD. I leaned specifically on Heidlage et al. (2024), the parent-implemented intervention paper, because its framing of how children with IDD demonstrate literacy gains directly informs how the rubric should and should not score responses. The most important design decision is the Response Validity dimension: students with IDD often communicate in ways that look non-standard on the surface, and a naive evaluator would conflate communication style with comprehension. The rubric explicitly instructs the judge to separate the two and to honor the presumed competence principle that anchors the team's broader research.
Literal Comprehension
0-3: Did the student accurately identify information directly stated in the passage? Measures whether the student can locate and reproduce factual content from the text.
Main Idea Identification
0-3: Did the student capture the central point of the passage, not just surface details? This is the primary outcome measure in the KAI RCT.
Inferential Reasoning
0-3: Did the student draw appropriate inferences beyond what is directly stated? Tests comprehension beyond sight-word recognition.
Response Validity
0-3: Does the response demonstrate genuine comprehension, accounting for non-standard communication patterns?
Critical: non-standard grammar, spelling, or sentence structure does NOT lower this score. Only comprehension content matters.
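One way to make the four dimensions concrete is as a small data structure with a range check. This is a minimal sketch, not the project's actual implementation: the names `Dimension`, `RUBRIC`, and `validate_scores` are assumptions made for illustration. Only the dimension names, the guiding questions, and the 0-3 range come from the rubric above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One rubric dimension (hypothetical structure, for illustration)."""
    name: str
    question: str
    min_score: int = 0
    max_score: int = 3


RUBRIC = (
    Dimension(
        "Literal Comprehension",
        "Did the student accurately identify information directly stated in the passage?",
    ),
    Dimension(
        "Main Idea Identification",
        "Did the student capture the central point of the passage, not just surface details?",
    ),
    Dimension(
        "Inferential Reasoning",
        "Did the student draw appropriate inferences beyond what is directly stated?",
    ),
    Dimension(
        "Response Validity",
        # Non-standard grammar, spelling, or sentence structure must not lower this score.
        "Does the response demonstrate genuine comprehension, "
        "accounting for non-standard communication patterns?",
    ),
)


def validate_scores(scores: dict[str, int]) -> dict[str, int]:
    """Check that every rubric dimension has an integer score within its range."""
    expected = {d.name: d for d in RUBRIC}
    missing = expected.keys() - scores.keys()
    extra = scores.keys() - expected.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} unexpected={sorted(extra)}")
    for name, value in scores.items():
        dim = expected[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name}: score must be an integer, got {value!r}")
        if not dim.min_score <= value <= dim.max_score:
            raise ValueError(
                f"{name}: score {value} outside {dim.min_score}-{dim.max_score}"
            )
    return scores
```

Validating the judge's output against a fixed structure like this would catch a missing dimension or an out-of-range score before it reaches any downstream analysis.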
Scoring Results Across Synthetic Examples
Each of the six examples below was scored using the evaluator. The chart shows how scores on the four rubric dimensions varied across response quality levels, from strong comprehension to off-topic responses to the critical test case probing the presumed competence principle.
[Figure: Rubric Dimension Scores by Test Case. Scores across all 4 rubric dimensions for each test case.]
The Critical Test Case
This is the most important example in the set. It tests whether the evaluator honors the presumed competence principle: the idea that students with IDD can demonstrate genuine comprehension even when their responses look non-standard on the surface.
Student Response
"plant make food from sun they need water too leaves help them grow"
This response demonstrates correct comprehension of photosynthesis. The student identifies that plants make food, that they use sunlight and water, and that leaves are involved. Every key concept from the passage is present. If the evaluator scores this low on the comprehension dimensions, it has failed its most important test, because it is then measuring writing ability rather than comprehension.
The evaluator correctly scored this response high on all comprehension dimensions despite non-standard grammar.
Dimension Scores
| Dimension | Score |
|---|---|
| Literal Comprehension | 3 |
| Main Idea Identification | 3 |
| Inferential Reasoning | 2 |
| Response Validity | 3 |
Literal Comprehension
The student accurately identifies multiple factual details directly stated in the passage: that plants make food, that sunlight is involved, that water is needed, and that leaves play a role. All stated facts are accurate.
Main Idea Identification
The student clearly captures the central point of the passage: plants make their own food through a process involving sunlight and water. The phrase 'plant make food from sun' directly addresses the main idea.
Inferential Reasoning
The student demonstrates inferential reasoning by connecting the process to growth ('leaves help them grow'), which requires understanding that food production serves a purpose beyond just making food.
Response Validity
The response demonstrates clear and genuine comprehension of the passage content. The non-standard grammar is completely irrelevant to the comprehension demonstrated. The content shows the student understood what they read.
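The critical test case lends itself to a regression check that can be rerun whenever the judge prompt or model changes. The sketch below is illustrative only: the function names are hypothetical, and the threshold for "high" (at least 2 on the 0-3 scale) is an assumption, since the rubric does not define one. The scores are the ones reported in the table above.

```python
# Scores reported for the critical test case in the table above.
CRITICAL_CASE_SCORES = {
    "Literal Comprehension": 3,
    "Main Idea Identification": 3,
    "Inferential Reasoning": 2,
    "Response Validity": 3,
}

# Assumption: "high" means at least 2 on the 0-3 scale.
HIGH_THRESHOLD = 2


def low_scoring_dimensions(scores, threshold=HIGH_THRESHOLD):
    """Return the dimensions scored below the threshold, in input order."""
    return [name for name, score in scores.items() if score < threshold]


def passes_presumed_competence_check(scores, threshold=HIGH_THRESHOLD):
    """True when no dimension falls below the threshold."""
    return not low_scoring_dimensions(scores, threshold)
```

If a later version of the evaluator scored the same response below the threshold on any dimension, `low_scoring_dimensions` would name the dimension that regressed.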
Where This Goes Next
As KAI moves from one RCT to three concurrent pilots, evaluation methodology has to scale without losing rigor.
Multilingual Rubric Calibration
Handle Hindi, Odia, and code-switched responses without treating non-English as a deficit.
Log-Only Evaluation
Work purely from interaction traces with explicit uncertainty quantification.
Human-AI Scoring Agreement
Calibrate against existing rubric scorers from the RCT.
Cross-Pilot Analysis
Compare efficacy across the US and India, across IDD profiles, and across settings.
Multimodal Evaluation
Incorporate spoken responses for students more comfortable speaking than writing.
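For the Human-AI Scoring Agreement item above, a first pass at calibration could compare the evaluator's scores with a human scorer's, dimension by dimension, and flag disagreements for review. The following is a rough sketch under stated assumptions: the function names, the dictionary input format, and the `tolerance` parameter are all hypothetical, and a real calibration study would likely use a chance-corrected statistic rather than raw agreement.

```python
def flag_disagreements(ai_scores, human_scores, tolerance=0):
    """Return {dimension: (ai, human)} where the two scores differ by more than tolerance."""
    if ai_scores.keys() != human_scores.keys():
        raise ValueError("both scorers must rate the same dimensions")
    return {
        name: (ai_scores[name], human_scores[name])
        for name in ai_scores
        if abs(ai_scores[name] - human_scores[name]) > tolerance
    }


def exact_agreement_rate(pairs):
    """Fraction of (ai, human) score pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one score pair")
    return sum(ai == human for ai, human in pairs) / len(pairs)
```

Setting `tolerance=1` would route only differences of two or more points to human review, which trades reviewer time against sensitivity to smaller disagreements.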