
Leveraging Large Language Models to Assess Short Text Responses

Educational practitioners and researchers often score short, unstructured text for the presence or strength of domain-specific constructs. Manual scoring, however, is time- and labor-intensive. Large language models (LLMs) offer an automated alternative, yet questions remain about how to implement LLM scoring and how well it performs when scoring requires domain-specific knowledge. Drawing on two assessments of aspiring principals’ teacher-hiring capacities, this study demonstrates a four-stage workflow for implementing LLM-generated scoring of open-ended text while evaluating six LLMs across three prompting methods. Models with higher performance on language comprehension benchmarks and more detailed prompting methods reduced scoring variability and demonstrated strong alignment with trained human scorers. Further, we highlight key design considerations, including how many LLM scoring iterations are necessary, how many entries must be scored manually to obtain precise estimates of consistency, and how to check for algorithmic bias.
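To make the kind of workflow the abstract describes concrete, the sketch below scores each open-ended response several times with an LLM, averages the scores to reduce run-to-run variability, and measures agreement with trained human scorers using quadratic-weighted kappa. This is an illustrative sketch only, not the paper's implementation; the rubric prompt, the `call_llm` helper, the 0–4 score scale, and the five-iteration default are assumptions made for the example.

```python
# Illustrative sketch of an LLM scoring loop with repeated iterations and a
# human-agreement check. Not the authors' code; prompt, scale, and helper
# names are hypothetical.
import re
import statistics
from sklearn.metrics import cohen_kappa_score

RUBRIC_PROMPT = (
    "You are scoring an aspiring principal's written response about teacher hiring.\n"
    "Using the rubric below, return only an integer score from 0 to 4.\n"
    "Rubric: ...\n\n"
    "Response: {response}"
)

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whichever LLM API is being evaluated."""
    raise NotImplementedError("Replace with a call to your chosen model.")

def score_response(response_text: str, n_iterations: int = 5) -> float:
    """Score one response several times and average to reduce run-to-run variability."""
    scores = []
    for _ in range(n_iterations):
        reply = call_llm(RUBRIC_PROMPT.format(response=response_text))
        match = re.search(r"\d", reply)  # pull the first digit the model returns
        if match:
            scores.append(int(match.group()))
    return statistics.mean(scores) if scores else float("nan")

def agreement_with_humans(llm_scores: list[float], human_scores: list[int]) -> float:
    """Quadratic-weighted kappa between rounded LLM scores and trained human scorers."""
    rounded = [round(s) for s in llm_scores]
    return cohen_kappa_score(rounded, human_scores, weights="quadratic")
```

In a study like this one, a loop of this shape would be run for each candidate model and prompting method, with the agreement statistic computed on the subset of entries that were also scored manually.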

Keywords: AI, large language models, assessment, school leadership
DOI: 10.26300/07w3-by46
EdWorkingPaper suggested citation:
Rubin, Jacob M., and Jason A. Grissom. Leveraging Large Language Models to Assess Short Text Responses. (EdWorkingPaper: -1385). Retrieved from Annenberg Institute at Brown University: https://doi.org/10.26300/07w3-by46
