This section showcases my scholarly output, including peer-reviewed journal articles, conference and workshop proceedings, technical reports, and book chapters. Publications are listed in reverse chronological order and can be filtered by research topics for easier exploration. For the most up-to-date list and citation data, please check out my Google Scholar page.
All journal articles are peer-reviewed and listed in reverse chronological order. Publications marked with KCI are indexed in the Korea Citation Index, managed by the National Research Foundation of Korea. KCI-Excellent denotes a journal recognized by KCI as Excellent Accredited (top 10% in the Social Sciences field).
Despite the globalization of educational content, language remains a significant barrier. Multilingual translation has become crucial to meeting this challenge, with an emphasis on incorporating the cultural context of the target country and the educational context of the learners. However, existing machine translation systems often fail to adequately account for these contextual factors. This study explores the potential of large language models (LLMs) to improve the translation of assessment items through in-context learning. Two prompt engineering strategies are compared: the ‘assessment-aware prompt’, which includes only the specifications of the assessment, and the ‘curriculum-aware prompt’, which additionally includes the educational and cultural context of the target country. Based on a comparison of linguistic features and expert reviews, we found that the curriculum-aware translation produced more valid and feasible results, highlighting the effectiveness of LLM-based automatic translation methods that integrate curriculum context.
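To make the two strategies concrete, the sketch below contrasts a minimal assessment-aware prompt with a curriculum-aware prompt. The template wording, field names, and example values are hypothetical illustrations, not the prompts used in the study.

```python
# Illustrative sketch of the two prompting strategies compared in the paper.
# Template wording and field names are hypothetical, not the authors' prompts.

ASSESSMENT_AWARE = (
    "Translate the following assessment item into {target_language}.\n"
    "Assessment specification: construct = {construct}, "
    "item format = {item_format}, target grade = {grade}.\n\n"
    "Item:\n{item_text}"
)

CURRICULUM_AWARE = (
    ASSESSMENT_AWARE
    + "\n\nAdditional context from the target country's national curriculum:\n"
      "{curriculum_excerpt}\n"
      "Adapt terminology and examples to this educational and cultural context."
)

def build_prompt(template: str, **fields) -> str:
    """Fill a prompt template with item and context information."""
    return template.format(**fields)

prompt = build_prompt(
    CURRICULUM_AWARE,
    target_language="Korean",
    construct="critical thinking",
    item_format="multiple choice",
    grade="9",
    item_text="Which statement best evaluates the author's evidence?",
    curriculum_excerpt="(relevant excerpt from the national curriculum)",
)
print(prompt)
```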
The advent of artificial intelligence (AI) has marked a transformative era in educational measurement and evaluation, particularly in the development of assessment items. Large language models (LLMs) have emerged as promising tools for scalable automatic item generation (AIG), yet concerns remain about the validity of AI-generated items in various domains. To address this issue, we propose STAIR-AIG (Systematic Tool for Assessment Item Review in Automatic Item Generation), a human-in-the-loop framework that integrates expert judgment to optimize the quality of AIG items. To explore the functionality of the tool, AIG items were generated in the domain of critical thinking. Subsequently, the human expert and four OpenAI LLMs conducted a review of the AIG items. The results show that while the LLMs demonstrated high consistency in their rating of the AIG items, they exhibited a tendency towards leniency. In contrast, the human expert provided more variable and strict evaluations, identifying issues such as the irrelevance of the construct and cultural insensitivity. These findings highlight the viability of STAIR-AIG as a structured human-AI collaboration approach that facilitates rigorous item review, thus optimizing the quality of AIG items. Furthermore, STAIR-AIG enables iterative review processes and accumulates human feedback, facilitating the refinement of models and prompts. This, in turn, would establish a more reliable and comprehensive pipeline to improve AIG practices.
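As a rough illustration of how such a human-in-the-loop review pass might be wired together, the following sketch aggregates ratings from several LLM reviewers and flags items where the human expert diverges from the LLM consensus. The rater names, the three-point scale coding, the ratings, and the flagging rule are illustrative assumptions rather than the published STAIR-AIG specification.

```python
# A minimal sketch of a human-in-the-loop item review pass in the spirit of
# STAIR-AIG. Reviewer names, scale coding, ratings, and the flagging rule
# are illustrative assumptions, not the tool's actual specification.
from statistics import mean

# Assumed three-point quality scale: 1 = reject, 2 = revise, 3 = accept
llm_ratings = {
    "item_001": {"llm_a": 3, "llm_b": 3, "llm_c": 3, "llm_d": 2},
    "item_002": {"llm_a": 3, "llm_b": 3, "llm_c": 3, "llm_d": 3},
}
human_ratings = {"item_001": 1, "item_002": 3}  # expert judgments

def needs_follow_up(item_id: str, threshold: float = 0.5) -> bool:
    """Flag an item when the expert diverges from the LLM consensus."""
    llm_mean = mean(llm_ratings[item_id].values())
    return abs(llm_mean - human_ratings[item_id]) > threshold

for item_id in llm_ratings:
    status = "send back for revision" if needs_follow_up(item_id) else "accept"
    print(f"{item_id}: LLM mean={mean(llm_ratings[item_id].values()):.2f}, "
          f"human={human_ratings[item_id]} -> {status}")
```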
As a core 21st-century skill, critical thinking (CT) has garnered increasing attention in today’s information society. Although growing interest has led to the development of various CT assessments and frameworks, research on leveraging large language models (LLMs) for the automatic generation and validation of CT items remains limited. To address this gap, this study examines AI-generated CT items developed based on MACAT’s PACIER framework. We employed a human-in-the-loop evaluation approach, in which a human expert and four LLMs independently rated each item on a three-point quality scale and conducted qualitative reviews to identify item-level issues. The results demonstrate marked differences between the human and LLM evaluations. The human reviewer delivered more discerning and variable evaluations, whereas the LLMs exhibited greater uniformity and consistency, but tended to be permissive and generous in their judgments. Notably, the human expert identified subtle flaws that the LLMs failed to detect, such as imprecise terminology, overly suggestive answer choices, and culturally biased content, all of which pose threats to the validity of the assessment. These insights affirm the essential role of human engagement in validating and optimizing the automatic item generation (AIG) process for complex latent constructs such as CT.
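The contrast between the human expert's variable judgments and the LLMs' uniform ones can be quantified with simple dispersion statistics, as in the minimal sketch below. The ratings are invented numbers on an assumed three-point scale, not the study's data.

```python
# Illustrative comparison of rater variability between the human expert and
# the LLM reviewers. Ratings are made-up values on an assumed 1-3 scale.
from statistics import mean, pstdev

ratings_by_rater = {
    "human_expert": [1, 3, 2, 1, 3, 2, 3, 1],
    "llm_1":        [3, 3, 3, 2, 3, 3, 3, 3],
    "llm_2":        [3, 3, 3, 3, 3, 3, 2, 3],
}

for rater, scores in ratings_by_rater.items():
    # A larger spread reflects the more discerning, variable judgments the
    # abstract attributes to the human reviewer; a smaller spread with a high
    # mean reflects the LLMs' uniform and lenient ratings.
    print(f"{rater}: mean={mean(scores):.2f}, sd={pstdev(scores):.2f}")
```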
This study investigates the feasibility of Large Language Model (LLM)-based machine translation for educational content, using critical thinking assessment items and an in-context learning prompting strategy. While global access to education has expanded, language barriers remain a challenge, requiring effective translation solutions. Traditional machine translation often fails to capture pedagogical and cultural nuances, while human translation struggles with efficiency and scalability. LLMs offer a promising alternative by generating high-quality translations with improved contextual understanding. However, to fully realize their potential, effective prompt design is critical. This study explores prompt engineering strategies tailored for educational purposes by developing and comparing two strategies: an assessment-aware prompt, which integrates assessment specifications, and a curriculum-aware prompt, which additionally incorporates educational and cultural contexts from national curriculum documents. Translations were analyzed both qualitatively and quantitatively based on pedagogical validity and linguistic features. Results indicate that curriculum-aware prompting significantly improves translation by increasing syntactic complexity, aligning with learners' cognitive development, and reducing translation artifacts. Expert evaluations also show a strong preference for curriculum-aware translations due to their superior syntactic structure and use of culturally appropriate terminology. These findings support the effectiveness of in-context learning approaches that integrate national curriculum data in optimizing LLM-driven translation.
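A toy version of the quantitative side of this comparison is sketched below, using two simple surface metrics (mean sentence length and clause-marker density) as stand-ins for the syntactic-complexity features a full analysis would use. The metrics and the English example sentences are illustrative assumptions, not the paper's feature set or data.

```python
# Rough sketch of comparing translations on simple surface features.
# These metrics and example sentences are illustrative stand-ins only.
import re

def mean_sentence_length(text: str) -> float:
    """Average number of tokens per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / max(len(sentences), 1)

def clause_marker_density(text: str,
                          markers=("which", "that", "because", "although")) -> float:
    """Share of tokens that are subordinating/relative clause markers."""
    tokens = text.lower().split()
    return sum(tokens.count(m) for m in markers) / max(len(tokens), 1)

assessment_aware_tr = "The author gives one reason. It is not strong."
curriculum_aware_tr = ("The author gives one reason, which is not strong "
                       "because it relies on a single example.")

for label, tr in [("assessment-aware", assessment_aware_tr),
                  ("curriculum-aware", curriculum_aware_tr)]:
    print(f"{label}: mean sentence length = {mean_sentence_length(tr):.1f}, "
          f"clause-marker density = {clause_marker_density(tr):.3f}")
```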
This project report examines the feasibility of automatic scoring systems for text responses from the 2016 ePIRLS assessment. We show that the multilingual automatic scoring approach used in this study can be applied to different languages and countries despite their linguistic variance. To measure linguistic variance, we used a variant of the conventional type-token ratio, which we refer to as STTR. We utilized two systems for automatic scoring: fuzzy lexical matching (FLM) and supervised classifiers based on semantics. FLM prioritizes accuracy but requires significant manual scoring work by human raters. The supervised classifiers were trained using a pre-trained deep neural network (XLM-R) for multilingual texts and support vector machines. Results showed that automatic scoring models can score accurately (κ = .755 on average using XLM-R) and efficiently (26.1% reduction of manual scoring on average) across languages and countries, even in the presence of linguistic variance. However, performance varied widely across items, highlighting the importance of investigating the determinants of automatic scoring performance. We found that higher levels of linguistic variance were associated with lower automatic scoring performance. In addition, linguistic variance and automatic scoring model performance were significantly related to several item- and student-level characteristics. The report concludes with a discussion of the implications of operationalizing automatic scoring.
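For readers unfamiliar with the measure, the sketch below computes a standardized type-token ratio in the spirit of the STTR used here: the type-token ratio is averaged over fixed-size token windows so that texts of different lengths remain comparable. The window size and example text are assumptions for illustration, not the report's exact operationalization.

```python
# Minimal sketch of a standardized type-token ratio (STTR) as a proxy for
# linguistic variance. Window size and example text are assumed for
# illustration, not taken from the report.
def sttr(tokens: list[str], window: int = 100) -> float:
    """Average type-token ratio over consecutive fixed-size windows."""
    segments = [tokens[i:i + window]
                for i in range(0, len(tokens) - window + 1, window)]
    if not segments:
        # Very short texts fall back to the plain type-token ratio.
        return len(set(tokens)) / max(len(tokens), 1)
    return sum(len(set(seg)) / len(seg) for seg in segments) / len(segments)

responses = [
    "the story shows that the girl was brave because she helped her friend",
    "she helped her friend even though she was afraid of the dark forest",
]
tokens = " ".join(responses).lower().split()
print(f"STTR = {sttr(tokens):.3f}")
```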