KAI Evaluation Framework


A prototype for scaling rubric-based scoring of student reading comprehension as KAI expands from one RCT to three pilot contexts. Grounded in the Lemons/Balasubramanian research group's work on reading intervention for students with IDD.

4 rubric dimensions

Grounded in the science-of-reading framework

6 synthetic test cases

Stress-testing across response quality levels

Multi-pass scoring

Disagreement flagged for human review

Critical test passed

Evaluator correctly scored a non-standard-grammar response high on comprehension

The Rubric

The rubric is grounded in the science-of-reading framework and the published work of the Lemons and Balasubramanian research group on multicomponent reading intervention for students with IDD. I leaned specifically on Heidlage et al. (2024), the parent-implemented intervention paper, because its framing of how children with IDD demonstrate literacy gains directly informs how the rubric should and should not score responses. The most important design decision is the Response Validity dimension: students with IDD often communicate in ways that look non-standard on the surface, and a naive evaluator would conflate communication style with comprehension. The rubric explicitly instructs the judge to separate the two and to honor the presumed competence principle that anchors the team's broader research.

Literal Comprehension

0-3

Did the student accurately identify information directly stated in the passage? Measures whether the student can locate and reproduce factual content from the text.

Main Idea Identification

0-3

Did the student capture the central point of the passage, not just surface details? This is the primary outcome measure in the KAI RCT.

Inferential Reasoning

0-3

Did the student draw appropriate inferences beyond what is directly stated? Tests comprehension beyond sight-word recognition.

Response Validity

0-3

Does the response demonstrate genuine comprehension, accounting for non-standard communication patterns?

Critical: non-standard grammar, spelling, or sentence structure does NOT lower this score. Only comprehension content matters.
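One way to make the rubric machine-checkable is to encode it as data and validate whatever the judge returns against it. The sketch below is a minimal illustration, not the notebook's actual implementation: the dimension names and guiding questions come from the rubric above, while `Dimension`, `RUBRIC`, `STYLE_RULE`, `build_judge_prompt`, `validate_scores`, and the prompt layout are hypothetical names and choices introduced here.

```python
from dataclasses import dataclass

MIN_SCORE, MAX_SCORE = 0, 3  # every dimension uses the same 0-3 scale


@dataclass(frozen=True)
class Dimension:
    name: str
    question: str


# Dimension names and guiding questions are taken from the rubric text.
RUBRIC = (
    Dimension("Literal Comprehension",
              "Did the student accurately identify information directly stated in the passage?"),
    Dimension("Main Idea Identification",
              "Did the student capture the central point of the passage, not just surface details?"),
    Dimension("Inferential Reasoning",
              "Did the student draw appropriate inferences beyond what is directly stated?"),
    Dimension("Response Validity",
              "Does the response demonstrate genuine comprehension, accounting for "
              "non-standard communication patterns?"),
)

# Hypothetical wording of the instruction that keeps communication style out of the score.
STYLE_RULE = ("Non-standard grammar, spelling, or sentence structure must NOT lower "
              "the Response Validity score. Only comprehension content matters.")


def build_judge_prompt(passage: str, response: str) -> str:
    """Assemble a plain-text scoring prompt (assumed format, for illustration only)."""
    lines = [STYLE_RULE, "", f"Passage: {passage}", f"Student response: {response}", ""]
    for dim in RUBRIC:
        lines.append(f"{dim.name} ({MIN_SCORE}-{MAX_SCORE}): {dim.question}")
    return "\n".join(lines)


def validate_scores(scores: dict) -> dict:
    """Reject judge output with missing or extra dimensions or out-of-range scores."""
    expected = {dim.name for dim in RUBRIC}
    if set(scores) != expected:
        raise ValueError(f"expected dimensions {sorted(expected)}, got {sorted(scores)}")
    for name, value in scores.items():
        if not isinstance(value, int) or not MIN_SCORE <= value <= MAX_SCORE:
            raise ValueError(f"{name}: score {value!r} is outside {MIN_SCORE}-{MAX_SCORE}")
    return scores
```

Keeping the style rule as the first line of the prompt is a design choice in this sketch: it puts the separation of communication style from comprehension ahead of any dimension-specific question.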

Scoring Results Across Synthetic Examples

Each of the six synthetic examples was scored using the evaluator. The chart shows how scores on the four rubric dimensions vary across response quality levels, from strong comprehension to off-topic responses to the critical test case probing the presumed competence principle.

[Figure: Rubric Dimension Scores by Test Case. Scores across all 4 rubric dimensions for each test case; the critical test case is marked as the one that tests the presumed competence principle.]
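The summary at the top of the page notes that disagreement between scoring passes is flagged for human review. The sketch below shows one simple way that could work, assuming each pass returns a dictionary of dimension scores. The function name `flag_disagreements`, the rule "any difference triggers review", and the second set of scores are assumptions for illustration, not the evaluator's actual logic or results.

```python
def flag_disagreements(passes):
    """Compare scoring passes dimension by dimension.

    passes: list of dicts mapping dimension name -> integer score, one dict per pass.
    Returns (consensus, flagged): consensus holds the agreed score per dimension,
    or None where the passes differ; flagged lists the dimensions needing review.
    """
    if len(passes) < 2:
        raise ValueError("need at least two scoring passes to compare")
    if any(set(p) != set(passes[0]) for p in passes):
        raise ValueError("all passes must score the same dimensions")
    consensus, flagged = {}, []
    for dim in passes[0]:
        values = [p[dim] for p in passes]
        if len(set(values)) == 1:
            consensus[dim] = values[0]   # every pass agrees
        else:
            consensus[dim] = None        # leave the final score to a human reviewer
            flagged.append(dim)
    return consensus, flagged


# Illustrative scores only; pass_b is made up to show a disagreement.
pass_a = {"Literal Comprehension": 3, "Main Idea Identification": 3,
          "Inferential Reasoning": 2, "Response Validity": 3}
pass_b = {"Literal Comprehension": 3, "Main Idea Identification": 3,
          "Inferential Reasoning": 1, "Response Validity": 3}

consensus, flagged = flag_disagreements([pass_a, pass_b])
# flagged == ["Inferential Reasoning"]
```

Returning `None` for a contested dimension, instead of averaging, keeps an uncertain score from silently entering downstream analysis.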

The Critical Test Case

This is the most important example in the set. It tests whether the evaluator honors the presumed competence principle: the idea that students with IDD can demonstrate genuine comprehension even when their responses look non-standard on the surface.

Critical Test Case: Presumed Competence

Student Response

"plant make food from sun they need water too leaves help them grow"

This response demonstrates correct comprehension of photosynthesis. The student identifies that plants make food, that they use sunlight and water, and that leaves are involved. Every key concept from the passage is present. If the evaluator scores this response low on the comprehension dimensions, it has failed its most important test, because it is then measuring writing ability rather than comprehension, and those are different things.

PASSED

The evaluator scored this response 2 or 3 out of 3 on every dimension despite its non-standard grammar.

Dimension Scores

Dimension	Score
Literal Comprehension	3
Main Idea Identification	3
Inferential Reasoning	2
Response Validity	3

Literal Comprehension

The student accurately identifies multiple factual details directly stated in the passage: that plants make food, that sunlight is involved, that water is needed, and that leaves play a role. All stated facts are accurate.

Main Idea Identification

The student clearly captures the central point of the passage: plants make their own food through a process involving sunlight and water. The phrase 'plant make food from sun' directly addresses the main idea.

Inferential Reasoning

The student demonstrates inferential reasoning by connecting the process to growth ('leaves help them grow'), which requires understanding that food production serves a purpose beyond just making food.

Response Validity

The response demonstrates clear and genuine comprehension of the passage content. The non-standard grammar is completely irrelevant to the comprehension demonstrated. The content shows the student understood what they read.
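The pass condition for this case can be written down as an explicit check, which makes it easy to rerun whenever the judge prompt changes. This is a sketch under stated assumptions: the scores are the ones reported in the table above, but the threshold of 2 for "high" and the name `passes_presumed_competence` are my own choices, picked to be consistent with the reported PASSED result rather than taken from the notebook.

```python
# Scores reported for the critical test case in the table above.
CRITICAL_CASE_SCORES = {
    "Literal Comprehension": 3,
    "Main Idea Identification": 3,
    "Inferential Reasoning": 2,
    "Response Validity": 3,
}

# Assumption: "high" means at least 2 on the 0-3 scale.
HIGH_SCORE_THRESHOLD = 2


def passes_presumed_competence(scores, threshold=HIGH_SCORE_THRESHOLD):
    """True when every rubric dimension reaches the threshold."""
    return all(value >= threshold for value in scores.values())


# Hypothetical output of a naive judge that penalizes grammar instead of content.
naive_judge_scores = {
    "Literal Comprehension": 1,
    "Main Idea Identification": 1,
    "Inferential Reasoning": 0,
    "Response Validity": 1,
}

assert passes_presumed_competence(CRITICAL_CASE_SCORES)
assert not passes_presumed_competence(naive_judge_scores)
```

Run as a regression test, a check like this fails loudly if a later prompt revision starts conflating communication style with comprehension.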

Where This Goes Next

As KAI moves from one RCT to three concurrent pilots, evaluation methodology has to scale without losing rigor.

Multilingual Rubric Calibration

Handle Hindi, Odia, and code-switched responses without treating non-English as a deficit.

Log-Only Evaluation

Work purely from interaction traces with explicit uncertainty quantification.

Human-AI Scoring Agreement

Calibrate against existing rubric scorers from the RCT.

Cross-Pilot Analysis

Compare efficacy across the US and India, across IDD profiles, and across settings.

Multimodal Evaluation

Incorporate spoken responses for students more comfortable speaking than writing.
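For the Human-AI Scoring Agreement item above, one common statistic for ordinal scales such as 0-3 is quadratic weighted Cohen's kappa, which penalizes a 0-versus-3 disagreement more than a 2-versus-3 one. The sketch below is a plain-Python illustration of that statistic; it is an assumption that this metric suits the calibration, not a description of how the RCT scorers are or will be compared, and the function name is hypothetical.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score=0, max_score=3):
    """Quadratic weighted Cohen's kappa for two equal-length lists of integer scores."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    n = len(human)
    span = max_score - min_score
    observed = Counter(zip(human, model))   # joint counts of (human, model) pairs
    human_marginal = Counter(human)
    model_marginal = Counter(model)

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = ((i - j) / span) ** 2  # 0 on the diagonal, 1 at the extremes
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * human_marginal[i] * model_marginal[j] / (n * n)

    if expected_disagreement == 0:
        # Both raters gave the same single score everywhere; kappa is undefined,
        # and this sketch reports it as perfect agreement by convention.
        return 1.0
    return 1.0 - observed_disagreement / expected_disagreement
```

A value of 1 means the two scorers agree exactly, 0 means agreement no better than chance given each scorer's score distribution, and negative values mean systematic disagreement. Computing it per rubric dimension, rather than pooled, would show whether the judge drifts from human scorers on one dimension in particular.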