This section showcases my scholarly output, including peer-reviewed journal articles, conference and workshop proceedings, technical reports, and book chapters. Publications are listed in reverse chronological order and can be filtered by research topics for easier exploration. For the most up-to-date list and citation data, please check out my Google Scholar page.
All journal articles are peer-reviewed and listed in reverse chronological order. Publications marked with KCI are indexed in the Korea Citation Index, managed by the National Research Foundation of Korea. KCI-Excellent denotes a journal recognized by KCI as Excellent Accredited (top 10% in the Social Sciences field).
Despite the globalization of educational content, language remains a significant barrier. Multilingual translation has become crucial to meeting this challenge, with an emphasis on incorporating the cultural context of the target country and the educational context of the learners. However, existing machine translation systems often fail to adequately account for these contextual factors. This study explores the potential of large language models (LLMs) to improve the translation of assessment items through in-context learning. Two prompt engineering strategies are compared: the ‘assessment-aware prompt’, which includes only the specifications of the assessment, and the ‘curriculum-aware prompt’, which additionally includes the educational and cultural context of the target country. Based on a comparison of linguistic features and expert reviews, we found that the curriculum-aware translation produced more valid and feasible results, highlighting the effectiveness of LLM-based automatic translation methods that integrate curriculum context.
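To make the two strategies concrete, the sketch below contrasts a minimal assessment-aware prompt with a curriculum-aware prompt. The template wording, field names, and example values are hypothetical illustrations, not the prompts used in the study.

```python
# Illustrative sketch of the two prompting strategies compared in the paper.
# Template wording and field names are hypothetical, not the authors' prompts.

ASSESSMENT_AWARE = (
    "Translate the following assessment item into {target_language}.\n"
    "Assessment specification: construct = {construct}, "
    "item format = {item_format}, target grade = {grade}.\n\n"
    "Item:\n{item_text}"
)

CURRICULUM_AWARE = (
    ASSESSMENT_AWARE
    + "\n\nAdditional context from the target country's national curriculum:\n"
      "{curriculum_excerpt}\n"
      "Adapt terminology and examples to this educational and cultural context."
)

def build_prompt(template: str, **fields) -> str:
    """Fill a prompt template with item and context information."""
    return template.format(**fields)

prompt = build_prompt(
    CURRICULUM_AWARE,
    target_language="Korean",
    construct="critical thinking",
    item_format="multiple choice",
    grade="9",
    item_text="Which statement best evaluates the author's evidence?",
    curriculum_excerpt="(relevant excerpt from the national curriculum)",
)
print(prompt)
```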
The advent of artificial intelligence (AI) has marked a transformative era in educational measurement and evaluation, particularly in the development of assessment items. Large language models (LLMs) have emerged as promising tools for scalable automatic item generation (AIG), yet concerns remain about the validity of AI-generated items in various domains. To address this issue, we propose STAIR-AIG (Systematic Tool for Assessment Item Review in Automatic Item Generation), a human-in-the-loop framework that integrates expert judgment to optimize the quality of AIG items. To explore the functionality of the tool, AIG items were generated in the domain of critical thinking. Subsequently, the human expert and four OpenAI LLMs conducted a review of the AIG items. The results show that while the LLMs demonstrated high consistency in their rating of the AIG items, they exhibited a tendency towards leniency. In contrast, the human expert provided more variable and strict evaluations, identifying issues such as the irrelevance of the construct and cultural insensitivity. These findings highlight the viability of STAIR-AIG as a structured human-AI collaboration approach that facilitates rigorous item review, thus optimizing the quality of AIG items. Furthermore, STAIR-AIG enables iterative review processes and accumulates human feedback, facilitating the refinement of models and prompts. This, in turn, would establish a more reliable and comprehensive pipeline to improve AIG practices.
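As a rough illustration of how such a human-in-the-loop review pass might be wired together, the following sketch aggregates ratings from several LLM reviewers and flags items where the human expert diverges from the LLM consensus. The rater names, the three-point scale coding, the ratings, and the flagging rule are illustrative assumptions rather than the published STAIR-AIG specification.

```python
# A minimal sketch of a human-in-the-loop item review pass in the spirit of
# STAIR-AIG. Reviewer names, scale coding, ratings, and the flagging rule
# are illustrative assumptions, not the tool's actual specification.
from statistics import mean

# Assumed three-point quality scale: 1 = reject, 2 = revise, 3 = accept
llm_ratings = {
    "item_001": {"llm_a": 3, "llm_b": 3, "llm_c": 3, "llm_d": 2},
    "item_002": {"llm_a": 3, "llm_b": 3, "llm_c": 3, "llm_d": 3},
}
human_ratings = {"item_001": 1, "item_002": 3}  # expert judgments

def needs_follow_up(item_id: str, threshold: float = 0.5) -> bool:
    """Flag an item when the expert diverges from the LLM consensus."""
    llm_mean = mean(llm_ratings[item_id].values())
    return abs(llm_mean - human_ratings[item_id]) > threshold

for item_id in llm_ratings:
    status = "send back for revision" if needs_follow_up(item_id) else "accept"
    print(f"{item_id}: LLM mean={mean(llm_ratings[item_id].values()):.2f}, "
          f"human={human_ratings[item_id]} -> {status}")
```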
As a core 21st-century skill, critical thinking (CT) has garnered increasing attention in today’s information society. Although growing interest has led to the development of various CT assessments and frameworks, research on leveraging large language models (LLMs) for the automatic generation and validation of CT items remains limited. To address this gap, this study examines AI-generated CT items developed based on MACAT’s PACIER framework. We employed a human-in-the-loop evaluation approach, in which a human expert and four LLMs independently rated each item on a three-point quality scale and conducted qualitative reviews to identify item-level issues. The results demonstrate marked differences between the human and LLM evaluations. The human reviewer delivered more discerning and variable evaluations, whereas the LLMs exhibited greater uniformity and consistency, but tended to be permissive and generous in their judgments. Notably, the human expert identified subtle flaws that the LLMs failed to detect, such as imprecise terminology, overly suggestive answer choices, and culturally biased content, all of which pose threats to the validity of the assessment. These insights affirm the essential role of human engagement in validating and optimizing the automatic item generation (AIG) process for complex latent constructs such as CT.
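The contrast between the human expert's variable judgments and the LLMs' uniform ones can be quantified with simple dispersion statistics, as in the minimal sketch below. The ratings are invented numbers on an assumed three-point scale, not the study's data.

```python
# Illustrative comparison of rater variability between the human expert and
# the LLM reviewers. Ratings are made-up values on an assumed 1-3 scale.
from statistics import mean, pstdev

ratings_by_rater = {
    "human_expert": [1, 3, 2, 1, 3, 2, 3, 1],
    "llm_1":        [3, 3, 3, 2, 3, 3, 3, 3],
    "llm_2":        [3, 3, 3, 3, 3, 3, 2, 3],
}

for rater, scores in ratings_by_rater.items():
    # A larger spread reflects the more discerning, variable judgments the
    # abstract attributes to the human reviewer; a smaller spread with a high
    # mean reflects the LLMs' uniform and lenient ratings.
    print(f"{rater}: mean={mean(scores):.2f}, sd={pstdev(scores):.2f}")
```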
This study investigates the feasibility of Large Language Model (LLM)-based machine translation for educational content, using critical thinking assessment items and an in-context learning prompting strategy. While global access to education has expanded, language barriers remain a challenge, requiring effective translation solutions. Traditional machine translation often fails to capture pedagogical and cultural nuances, while human translation struggles with efficiency and scalability. LLMs offer a promising alternative by generating high-quality translations with improved contextual understanding. However, to fully realize their potential, effective prompt design is critical. This study explores prompt engineering strategies tailored for educational purposes by developing and comparing two strategies: an assessment-aware prompt, which integrates assessment specifications, and a curriculum-aware prompt, which additionally incorporates educational and cultural contexts from national curriculum documents. Translations were analyzed both qualitatively and quantitatively based on pedagogical validity and linguistic features. Results indicate that curriculum-aware prompting significantly improves translation by increasing syntactic complexity, aligning with learners' cognitive development, and reducing translation artifacts. Expert evaluations also show a strong preference for curriculum-aware translations due to their superior syntactic structure and use of culturally appropriate terminology. These findings support the effectiveness of in-context learning approaches that integrate national curriculum data in optimizing LLM-driven translation.
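A toy version of the quantitative side of this comparison is sketched below, using two simple surface metrics (mean sentence length and clause-marker density) as stand-ins for the syntactic-complexity features a full analysis would use. The metrics and the English example sentences are illustrative assumptions, not the paper's feature set or data.

```python
# Rough sketch of comparing translations on simple surface features.
# These metrics and example sentences are illustrative stand-ins only.
import re

def mean_sentence_length(text: str) -> float:
    """Average number of tokens per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / max(len(sentences), 1)

def clause_marker_density(text: str,
                          markers=("which", "that", "because", "although")) -> float:
    """Share of tokens that are subordinating/relative clause markers."""
    tokens = text.lower().split()
    return sum(tokens.count(m) for m in markers) / max(len(tokens), 1)

assessment_aware_tr = "The author gives one reason. It is not strong."
curriculum_aware_tr = ("The author gives one reason, which is not strong "
                       "because it relies on a single example.")

for label, tr in [("assessment-aware", assessment_aware_tr),
                  ("curriculum-aware", curriculum_aware_tr)]:
    print(f"{label}: mean sentence length = {mean_sentence_length(tr):.1f}, "
          f"clause-marker density = {clause_marker_density(tr):.3f}")
```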
This project report examines the feasibility of automatic scoring systems for text responses from the 2016 ePIRLS assessment. We show that the multilingual automatic scoring approach used in this study can be applied to different languages and countries despite their linguistic variance. To measure linguistic variance, we used a variant of the conventional type-token ratio, which we refer to as STTR. We utilized two systems for automatic scoring: fuzzy lexical matching (FLM) and supervised classifiers based on semantics. FLM prioritizes accuracy but requires significant manual scoring work by human raters. The supervised classifiers were trained using a pre-trained deep neural network (XLM-R) for multilingual texts and support vector machines. Results showed that automatic scoring models can score accurately (κ = .755 on average using XLM-R) and efficiently (26.1% reduction of manual scoring on average) across languages and countries, even in the presence of linguistic variance. However, performance varied widely across items, highlighting the importance of investigating the determinants of automatic scoring performance. We found that higher levels of linguistic variance were associated with lower automatic scoring performance. In addition, linguistic variance and automatic scoring model performance were significantly related to several item- and student-level characteristics. The report concludes with a discussion of the implications of operationalizing automatic scoring.
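For readers unfamiliar with the measure, the sketch below computes a standardized type-token ratio in the spirit of the STTR used here: the type-token ratio is averaged over fixed-size token windows so that texts of different lengths remain comparable. The window size and example text are assumptions for illustration, not the report's exact operationalization.

```python
# Minimal sketch of a standardized type-token ratio (STTR) as a proxy for
# linguistic variance. Window size and example text are assumed for
# illustration, not taken from the report.
def sttr(tokens: list[str], window: int = 100) -> float:
    """Average type-token ratio over consecutive fixed-size windows."""
    segments = [tokens[i:i + window]
                for i in range(0, len(tokens) - window + 1, window)]
    if not segments:
        # Very short texts fall back to the plain type-token ratio.
        return len(set(tokens)) / max(len(tokens), 1)
    return sum(len(set(seg)) / len(seg) for seg in segments) / len(segments)

responses = [
    "the story shows that the girl was brave because she helped her friend",
    "she helped her friend even though she was afraid of the dark forest",
]
tokens = " ".join(responses).lower().split()
print(f"STTR = {sttr(tokens):.3f}")
```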