AAAI 2025

DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models

Jinxiang Xie1,2, Yilin Li1, Xunjian Yin1, Xiaojun Wan1
1Peking University  |  2Beijing Jiaotong University
DSGram Architecture
Figure 1: Architecture of the DSGram method. It begins with the input of sentence pairs (original and corrected). LLMs are employed to generate dynamic weights, which are refined through a judgment matrix and a consistency check. These dynamic weights are normalized and used to score sub-metrics, producing an overall evaluation score.
Abstract
Evaluating the performance of Grammatical Error Correction (GEC) models has become increasingly challenging, as large language model (LLM)-based GEC systems often produce corrections that diverge from provided gold references. This discrepancy undermines the reliability of traditional reference-based evaluation metrics.

In this study, we propose a novel evaluation framework for GEC models, DSGram, integrating Semantic Coherence, Edit Level, and Fluency, and utilizing a dynamic weighting mechanism. Our framework employs the Analytic Hierarchy Process (AHP) in conjunction with large language models to ascertain the relative importance of various evaluation criteria. Additionally, we develop a dataset incorporating human annotations and LLM-simulated sentences to validate our algorithms and fine-tune more cost-effective models.

Experimental results indicate that our proposed approach enhances the effectiveness of GEC model evaluations.

Motivation

Traditional GEC evaluation metrics have significant limitations when applied to LLM-based systems: BLEU fails to differentiate between over- and under-correction, while the reference-free SOME cannot capture over-correction. DSGram addresses these gaps with a comprehensive evaluation framework.

Motivation Examples
Figure 2: Running examples and evaluation results of existing metrics. BLEU fails to differentiate between over- and under-correction, whereas SOME cannot capture over-correction. Blue highlights over-correction; red indicates poor fluency.

Key Contributions

  • New Sub-Metrics: We introduce redesigned sub-metrics for GEC evaluation — Semantic Coherence, Edit Level, and Fluency — that address the over-editing problem in LLM-based GEC models.
  • Dynamic Weighting with AHP: We propose a novel dynamic weighting method integrating the Analytic Hierarchy Process with LLMs to ascertain the context-dependent importance of evaluation criteria.
  • Evaluation Datasets: We present DSGram-Eval (human-annotated) and DSGram-LLMs (GPT-4 simulated), both built on CoNLL-2014 and BEA-2019 test sets for rigorous evaluation.
  • Superior Correlation: DSGram achieves higher correlation with human judgments than all conventional reference-based and reference-free metrics on the SEEDA benchmark.

Method

DSGram comprises two main components: score generation and weight generation. By applying context-specific weights to the generated scores, an overall evaluation score is obtained.

Three Sub-Metrics

  • Semantic Coherence: the degree to which the original meaning is preserved after correction.
  • Edit Level: whether the corrections made are necessary and appropriate.
  • Fluency: grammatical correctness and natural flow of the corrected sentence.
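Once the three sub-metric scores and the per-sentence weights are available, the overall DSGram score is their weighted combination. A minimal sketch, with made-up scores on a 1–10 scale and hypothetical weights (the paper's actual scoring prompts and scales may differ):

```python
# Hypothetical sub-metric scores (1-10 scale) and dynamic weights
# for a single corrected sentence; all values are illustrative.
scores = {"semantic_coherence": 8.5, "edit_level": 7.0, "fluency": 9.0}
weights = {"semantic_coherence": 0.3, "edit_level": 0.2, "fluency": 0.5}

def dsgram_score(scores, weights):
    """Overall score: weight-normalized sum of sub-metric scores."""
    total_w = sum(weights.values())
    return sum(weights[k] * scores[k] for k in scores) / total_w

overall = dsgram_score(scores, weights)  # 0.3*8.5 + 0.2*7.0 + 0.5*9.0 = 8.45
```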

Dynamic Weighting via AHP

DSGram prompts LLMs to construct a pairwise comparison matrix over the three sub-metrics for each sentence. The matrix is subjected to a consistency check, and the weights are obtained by eigenvector normalization. Formal texts tend to emphasize Edit Level, while casual texts prioritize Fluency.
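The AHP step above can be sketched as follows. The pairwise comparison matrix below is a made-up example (in DSGram it would be produced by an LLM for a specific sentence); the weight extraction via the principal eigenvector and Saaty's consistency ratio is standard AHP:

```python
import numpy as np

# Hypothetical LLM-generated pairwise comparison matrix over
# (Semantic Coherence, Edit Level, Fluency). A[i, j] > 1 means
# criterion i is judged more important than criterion j,
# and A[j, i] = 1 / A[i, j] (reciprocal matrix).
A = np.array([
    [1.0,     3.0, 1.0 / 2.0],
    [1.0 / 3.0, 1.0, 1.0 / 5.0],
    [2.0,     5.0, 1.0],
])

def ahp_weights(A, random_index=0.58):
    """Weights from the principal eigenvector, plus Saaty's
    consistency ratio (random_index = 0.58 for a 3x3 matrix)."""
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()                        # normalize weights to sum to 1
    n = A.shape[0]
    ci = (eigvals[k].real - n) / (n - 1)   # consistency index
    cr = ci / random_index                 # accept the matrix if cr < 0.1
    return w, cr

w, cr = ahp_weights(A)
```

For this matrix the judgments are nearly transitive, so the consistency ratio is well under the usual 0.1 acceptance threshold; an inconsistent LLM judgment matrix would be regenerated or rejected at this check.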

Score Computation
Figure 3: DSGram score computation for two different sentences. Sentence (a) is a casual dialogue where Fluency is emphasized. Sentence (b) is a formal expression where Edit Level receives greater weight.

Sub-Metrics Analysis

We redesigned the sub-metrics to reduce redundancy and improve coverage: the original SOME metrics show a high correlation (0.89) between their Grammaticality and Fluency scores, whereas our sub-metrics exhibit a more balanced correlation structure.

SOME Heatmap
SOME sub-metrics: High correlation (0.89) between Grammaticality and Fluency
Our Heatmap
DSGram sub-metrics: More evenly distributed correlation

Results

  • 0.8764 — Pearson correlation with human scores (AHP dynamic weighting)
  • 0.8544 — Pearson correlation with human scores (average-weighting baseline)
  • 12 — GEC systems evaluated on the SEEDA benchmark

DSGram's correlation with human feedback surpasses conventional reference-based metrics (M², ERRANT, BLEU, GLEU) and reference-free metrics (SOME, Scribendi Score). LLaMA3-8B and LLaMA2-13B models fine-tuned on the DSGram-LLMs dataset also outperform their few-shot counterparts, demonstrating the framework's practicality with cost-effective models.
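The correlations above are computed at the system level: each GEC system receives one metric score and one human score, and Pearson's r is taken across systems. A minimal sketch with illustrative (made-up) numbers:

```python
import statistics

# Illustrative system-level scores: one metric score and one human
# score per GEC system (values are invented for demonstration).
metric_scores = [8.2, 7.5, 6.9, 9.1, 5.4]
human_scores = [8.0, 7.2, 7.1, 9.3, 5.0]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

r = pearson(metric_scores, human_scores)
```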

Citation

@misc{xie2025dsgramdynamicweightingsubmetrics,
  title={DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models},
  author={Jinxiang Xie and Yilin Li and Xunjian Yin and Xiaojun Wan},
  year={2025},
  eprint={2412.12832},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  doi={10.1609/aaai.v39i24.34746},
  url={https://arxiv.org/abs/2412.12832},
}

Acknowledgments: This work was done during the author's research internship at Peking University. We thank Prof. Xiaojun Wan and all colleagues from the Wangxuan Institute of Computer Technology for their guidance and support.