Julie Cohen
The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in Education
Assessing instruction quality is a fundamental component of any improvement effort in the education system. However, traditional manual assessments are expensive, subjective, and heavily dependent on observers’ expertise and idiosyncratic factors, preventing teachers from getting timely and frequent feedback. Unlike prior research that focuses on low-inference instructional practices, this paper presents the first study that leverages Natural Language Processing (NLP) techniques to assess multiple high-inference instructional practices in two distinct educational settings: in-person K-12 classrooms and simulated performance tasks for pre-service teachers. It is also the first study to apply NLP to measure a teaching practice that has been shown to be particularly effective for students with special needs. We confront two challenges inherent in NLP-based instructional analysis: noisy, lengthy input data and highly skewed distributions of human ratings. Our results suggest that pretrained Language Models (PLMs) achieve performance comparable to the agreement level of human raters for variables that are more discrete and require lower inference, but their efficacy diminishes for more complex teaching practices. Interestingly, using only teachers’ utterances as input yields strong results for student-centered variables, alleviating common concerns about the difficulty of collecting and transcribing high-quality student speech in in-person teaching settings. Our findings highlight both the potential and the limitations of current NLP techniques in the education domain, opening avenues for further exploration.
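As a rough illustration of the kind of pipeline this abstract describes, the sketch below scores a single classroom-transcript segment with a pretrained language model carrying a one-dimensional regression head. The model name, token limit, and rubric are placeholder assumptions, not details reported in the paper, and the regression head would first need fine-tuning on human-rated segments.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical configuration: "roberta-base" and the 512-token limit are
# placeholders, not details from the paper. In practice the regression head
# would be fine-tuned on human-rated transcript segments before use.
MODEL_NAME = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)

segment = "Teacher: What do you notice about this pattern? Student: It doubles each time."
# Classroom transcripts are long and noisy, so segments are truncated (or chunked)
# to fit the model's context window before scoring.
inputs = tokenizer(segment, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    predicted_score = model(**inputs).logits.squeeze().item()
print(predicted_score)
```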
Different methods for assessing pre-service teachers’ instruction: Why measures matter
Teacher preparation programs are increasingly expected to use data on pre-service teacher (PST) skills to drive program improvement and provide targeted supports. Observational ratings are especially vital, but also prone to measurement issues. Scores may be influenced by factors unrelated to PSTs’ instructional skills, including rater standards and mentor teachers’ skills. Yet we know little about how these measurement challenges play out in the PST context. Here we investigate the reliability and sensitivity of two observational measures. We find measures collected during student teaching are especially prone to measurement issues; only 3-4% of variation in scores reflects consistent differences between PSTs, while 9-17% of variation can be attributed to the mentors with whom they work. When high scores stem not from strong instructional skills, but instead from external circumstances, we cannot use them to make consequential decisions about PSTs’ individual needs or readiness for independent teaching.
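A minimal sketch of the variance decomposition those percentages imply, written as a crossed random-effects model; the exact specification (including any additional rater or occasion components) is an assumption here, not the paper's stated model.

```latex
\[
  \text{score}_{pm} \;=\; \mu \;+\; \tau_{p} \;+\; \gamma_{m} \;+\; \varepsilon_{pm},
  \qquad
  \tau_{p}\sim N(0,\sigma^{2}_{\tau}),\;\;
  \gamma_{m}\sim N(0,\sigma^{2}_{\gamma}),\;\;
  \varepsilon_{pm}\sim N(0,\sigma^{2}_{\varepsilon}),
\]
\[
  \frac{\sigma^{2}_{\tau}}{\sigma^{2}_{\tau}+\sigma^{2}_{\gamma}+\sigma^{2}_{\varepsilon}} \approx 0.03\text{--}0.04,
  \qquad
  \frac{\sigma^{2}_{\gamma}}{\sigma^{2}_{\tau}+\sigma^{2}_{\gamma}+\sigma^{2}_{\varepsilon}} \approx 0.09\text{--}0.17,
\]
```

Here $\tau_{p}$ captures the stable skill of pre-service teacher $p$ and $\gamma_{m}$ the influence of mentor teacher $m$ on observed scores, so the first ratio corresponds to the 3-4% of variation attributable to PSTs and the second to the 9-17% attributable to mentors.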
Experimental Evidence on the Robustness of Coaching Supports in Teacher Education
Many novice teachers learn to teach “on the job,” leading to burnout and attrition among teachers and negative outcomes for students in the long term. Pre-service teacher education is tasked with optimizing teacher readiness, but there is little causal evidence about effective ways to prepare new teachers. In this paper, we use a mixed-reality simulation platform to evaluate the causal effects and robustness of an individualized, directive coaching model for candidates enrolled in a university-based teacher education program, as well as for undergraduates considering teaching as a profession. Across five conceptual replication studies, we find that targeted, directive coaching significantly improves candidates’ instructional performance during simulated classroom sessions, and that coaching effects are robust across different teaching tasks, study timing, and modes of delivery. However, coaching effects are smaller for a sub-population of participants not formally enrolled in a teacher preparation program. These participants differed from teacher candidates in multiple ways, including demographic characteristics as well as prior experiences learning about instructional methods. We highlight implications for research and practice.
Measuring Conversational Uptake: A Case Study on Student-Teacher Interactions
In conversation, uptake happens when a speaker builds on the contribution of their interlocutor by, for example, acknowledging, repeating, or reformulating what they have said. In education, teachers' uptake of student contributions has been linked to higher student achievement. Yet measuring and improving teachers' uptake at scale is challenging, as existing methods require expensive annotation by experts. We propose a framework for computationally measuring uptake by (1) releasing a dataset of student-teacher exchanges extracted from US math classroom transcripts and annotated for uptake by experts; (2) formalizing uptake as pointwise Jensen-Shannon Divergence (pJSD), estimated via next utterance classification; (3) conducting a linguistically motivated comparison of different unsupervised measures; and (4) correlating these measures with educational outcomes. We find that although repetition captures a significant part of uptake, pJSD outperforms repetition-based baselines, as it is capable of identifying a wider range of uptake phenomena, such as question answering and reformulation. We apply our uptake measure to three different educational datasets with outcome indicators. Unlike baseline measures, pJSD correlates significantly with instruction quality in all three, providing evidence for its generalizability and for its potential to serve as an automated professional development tool for teachers.
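The abstract notes that pJSD is estimated via next utterance classification. The sketch below shows one standard way to construct training pairs for such a classifier; the function and variable names are illustrative and do not come from the authors' released code.

```python
import random
from typing import List, Tuple

def make_next_utterance_pairs(
    exchanges: List[Tuple[str, str]], seed: int = 0
) -> List[Tuple[str, str, int]]:
    """Build (student_utterance, teacher_reply, label) examples.

    Positive examples (label 1) pair a student utterance with the teacher
    reply that actually followed it; negatives (label 0) pair it with a
    teacher reply drawn at random from elsewhere in the corpus (in practice
    the true reply would be excluded from the negative pool). A classifier
    trained on such pairs yields the probability estimates from which an
    uptake score like pJSD can be computed.
    """
    rng = random.Random(seed)
    all_replies = [reply for _, reply in exchanges]
    examples = []
    for student_utt, teacher_reply in exchanges:
        examples.append((student_utt, teacher_reply, 1))            # true next utterance
        examples.append((student_utt, rng.choice(all_replies), 0))  # mismatched reply
    return examples

# Toy usage with two exchanges:
pairs = make_next_utterance_pairs([
    ("I think the answer is twelve.", "You think it's twelve -- walk us through how you got that."),
    ("Because both sides are equal.", "Okay, open your books to page forty."),
])
```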
Measuring Teaching Practices at Scale: A Novel Application of Text-as-Data Methods
Valid and reliable measurements of teaching quality facilitate school-level decision-making and policies pertaining to teachers. Using nearly 1,000 word-for-word transcriptions of 4th- and 5th-grade English language arts classes, we apply novel text-as-data methods to develop automated measures of teaching that complement classroom observations traditionally done by human raters. This approach is free of rater bias and enables the detection of three instructional factors that are well aligned with commonly used observation protocols: classroom management, interactive instruction, and teacher-centered instruction. The teacher-centered instruction factor is a consistent negative predictor of value-added scores, even after controlling for teachers’ average classroom observation scores. The interactive instruction factor predicts positive value-added scores. Our results suggest that the text-as-data approach has the potential to enhance existing classroom observation systems by collecting far more data on teaching at lower cost and higher speed while detecting multifaceted classroom practices.
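As a hedged sketch of what a text-as-data pipeline of this kind might look like, the snippet below turns transcripts into term features and extracts three latent factors. The specific vectorizer, factor model, and toy transcripts are assumptions for illustration, not the paper's actual method or data.

```python
from sklearn.decomposition import FactorAnalysis
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pipeline: transcripts -> term features -> three latent
# instructional factors (e.g., classroom management, interactive instruction,
# teacher-centered instruction). A real analysis would use the ~1,000 full
# transcripts rather than these toy snippets.
transcripts = [
    "Please take out your books and turn to page twelve.",
    "What do you think the author means here? Turn and talk with a partner.",
    "Copy the definition from the board into your notebooks.",
    "Who can build on what Maria just said about the character?",
    "Eyes on me. We do not call out while someone else is speaking.",
    "Today I will explain the three parts of a persuasive essay.",
]
features = TfidfVectorizer(stop_words="english").fit_transform(transcripts)
factor_scores = FactorAnalysis(n_components=3, random_state=0).fit_transform(features.toarray())
print(factor_scores.shape)  # (6, 3): one row of factor scores per transcript
```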
Policy Implementation, Principal Agency, and Strategic Action: Improving Teaching Effectiveness in New York City Middle Schools
Ten years ago, the reform of teacher evaluation was touted as a mechanism to improve teacher effectiveness. In response, virtually every state redesigned its teacher evaluation system. Recently, a growing narrative suggests these reforms failed and should be abandoned. This response may be overly simplistic. We explore the variability of New York City principals’ implementation of policies intended to promote teaching effectiveness. Drawing on survey, interview, and administrative data, we analyze whether principals believe they can use teacher evaluation and tenure policies to improve teaching effectiveness, and how such perceptions influence policy implementation. We find that principals with greater perceived agency are more likely to strategically employ tenure and evaluation policies. Results have important implications for principal training and policy implementation.