
Leveraging Large Language Models to Assess Short Text Responses

Educational practitioners and researchers often score short, unstructured text for the presence or strength of domain-specific constructs. Manual scoring, however, is time- and labor-intensive. Large language models (LLMs) offer an automated alternative, yet questions remain about how to implement LLM scoring and how well it performs when scoring requires domain-specific knowledge. Drawing on two assessments of aspiring principals’ teacher-hiring capacities, this study demonstrates a four-stage workflow for implementing LLM-generated scoring of open-ended text while evaluating six LLMs across three prompting methods. Models with higher performance on language comprehension benchmarks and more detailed prompting methods reduced scoring variability and demonstrated strong alignment with trained human scorers. Further, we highlight key design considerations, including how many LLM scoring iterations are necessary, how many entries must be scored manually to obtain precise estimates of consistency, and how to check for algorithmic bias.
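To make the kind of workflow the abstract describes concrete, the sketch below scores each open-ended response several times with an LLM, averages the scores to reduce run-to-run variability, and measures agreement with trained human scorers using quadratic-weighted kappa. This is an illustrative sketch only, not the paper's implementation; the rubric prompt, the `call_llm` helper, the 0–4 score scale, and the five-iteration default are assumptions made for the example.

```python
# Illustrative sketch of an LLM scoring loop with repeated iterations and a
# human-agreement check. Not the authors' code; prompt, scale, and helper
# names are hypothetical.
import re
import statistics
from sklearn.metrics import cohen_kappa_score

RUBRIC_PROMPT = (
    "You are scoring an aspiring principal's written response about teacher hiring.\n"
    "Using the rubric below, return only an integer score from 0 to 4.\n"
    "Rubric: ...\n\n"
    "Response: {response}"
)

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whichever LLM API is being evaluated."""
    raise NotImplementedError("Replace with a call to your chosen model.")

def score_response(response_text: str, n_iterations: int = 5) -> float:
    """Score one response several times and average to reduce run-to-run variability."""
    scores = []
    for _ in range(n_iterations):
        reply = call_llm(RUBRIC_PROMPT.format(response=response_text))
        match = re.search(r"\d", reply)  # pull the first digit the model returns
        if match:
            scores.append(int(match.group()))
    return statistics.mean(scores) if scores else float("nan")

def agreement_with_humans(llm_scores: list[float], human_scores: list[int]) -> float:
    """Quadratic-weighted kappa between rounded LLM scores and trained human scorers."""
    rounded = [round(s) for s in llm_scores]
    return cohen_kappa_score(rounded, human_scores, weights="quadratic")
```

In a study like this one, a loop of this shape would be run for each candidate model and prompting method, with the agreement statistic computed on the subset of entries that were also scored manually.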

Keywords: AI, large language models, assessment, school leadership
DOI: 10.26300/07w3-by46
EdWorkingPaper suggested citation:
Rubin, Jacob M., and Jason A. Grissom. Leveraging Large Language Models to Assess Short Text Responses. (EdWorkingPaper: -1385). Retrieved from Annenberg Institute at Brown University: https://doi.org/10.26300/07w3-by46
