ACL 2026 Findings

CAST: Achieving Stable LLM-based Text Analysis for Data Analytics

Consistency via Algorithmic Prompting and Stable Thinking
Jinxiang Xie1, Zihao Li2, Wei He3, Rui Ding4, Shi Han4, Dongmei Zhang4
1Nanjing University  |  2Tsinghua University  |  3Peking University  |  4Microsoft Research
Figure 1: Overview of the CAST framework. Traditional methods (left) operate with uncontrolled reasoning paths, resulting in wide, high-entropy output distributions. CAST (right) mitigates instability via Algorithmic Prompting and Thinking-before-Speaking, collapsing the generation process into a sharply concentrated output distribution.
Abstract
Text analysis of tabular data relies on two core operations: summarization for corpus-level theme extraction and tagging for row-level labeling. A critical limitation of employing large language models (LLMs) for these tasks is their inability to meet the high standards of output stability demanded by data analytics.

To address this challenge, we introduce CAST (Consistency via Algorithmic Prompting and Stable Thinking), a framework that enhances output stability by constraining the model's latent reasoning path. CAST combines (i) Algorithmic Prompting to impose a procedural scaffold over valid reasoning transitions and (ii) Thinking-before-Speaking to enforce explicit intermediate commitments before final generation.

To measure progress, we introduce CAST-S and CAST-T, stability metrics for bulleted summarization and tagging, and validate their alignment with human judgments. Experiments across publicly available benchmarks on multiple LLM backbones show that CAST consistently achieves the best stability among all baselines, improving Stability Score by up to 16.2%, while maintaining or improving output quality.

Key Contributions

  • Formalization of TADA: We formalize Text Analysis for Data Analytics (TADA) as a tabular-centric paradigm, highlighting stability as a functional necessity for integrating probabilistic LLM outputs into deterministic OLAP workflows.
  • CAST Framework: A novel approach that constrains generation via Algorithmic Prompting and intermediate commitments, reducing the entropy of latent paths without expensive search-based methods.
  • Stability Metrics: We introduce CAST-S and CAST-T, stability-focused evaluation metrics combining semantic matching with order sensitivity (Kendall's Tau) to capture human-perceived consistency.
  • Strong Empirical Results: Up to 16.2% improvement in Stability Score across multiple LLM backbones, with no regression in accuracy.
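The paper does not specify the exact formulation of CAST-S/CAST-T, but the described ingredients (semantic matching of items across runs plus order sensitivity via Kendall's Tau) can be sketched as follows. Token-level Jaccard overlap stands in for the semantic matcher, and the 50/50 blend of match rate and order agreement is an illustrative choice, not the paper's:

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for the paper's semantic matcher."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def kendall_tau(order: list[int]) -> float:
    """Kendall's tau of a permutation against the identity ordering."""
    n = len(order)
    if n < 2:
        return 1.0
    pairs = n * (n - 1) // 2
    concordant = sum(1 for i, j in combinations(range(n), 2) if order[i] < order[j])
    return (concordant - (pairs - concordant)) / pairs

def stability_score(run_a: list[str], run_b: list[str], thresh: float = 0.5) -> float:
    """Greedily match bullets across two runs, then blend the match rate
    with order agreement (Kendall's tau) over the matched positions."""
    matched, used = [], set()
    for bullet in run_a:
        best, best_sim = None, thresh
        for j, cand in enumerate(run_b):
            sim = jaccard(bullet, cand)
            if j not in used and sim >= best_sim:
                best, best_sim = j, sim
        if best is not None:
            used.add(best)
            matched.append(best)
    match_rate = len(matched) / max(len(run_a), len(run_b))
    order_agreement = (kendall_tau(matched) + 1) / 2  # map [-1, 1] -> [0, 1]
    return 0.5 * match_rate + 0.5 * order_agreement
```

Two identical runs score 1.0, while identical bullets in reversed order keep a perfect match rate but lose all order agreement, capturing the order sensitivity the metrics are designed around.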

Method

CAST addresses the instability problem by constraining the LLM's latent reasoning trajectory through two complementary mechanisms:

Algorithmic Prompting

Specifies an algorithmic scaffold for the task, translating classic deterministic workflows into a structured prompt sequence. This scaffold acts as a strong prior over valid reasoning transitions, effectively pruning high-entropy paths.
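A minimal sketch of what such a scaffold might look like in practice. The step wording and the cluster-then-summarize workflow below are illustrative assumptions, not the paper's actual prompt:

```python
# Illustrative Algorithmic Prompting scaffold: a classic deterministic
# summarization workflow is spelled out as numbered steps in the prompt,
# so the model follows one fixed procedure instead of free-form reasoning.
STEPS = [
    "Identify the domain of the input rows.",
    "Derive a topic schema covering all rows.",
    "Assign every row to exactly one topic (cluster).",
    "Summarize each cluster as one bullet, ordered by cluster size.",
]

def build_algorithmic_prompt(rows: list[str]) -> str:
    """Render the fixed workflow plus the data as a single prompt."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(STEPS, 1))
    data = "\n".join(f"- {r}" for r in rows)
    return (
        "Follow this algorithm exactly, in order, before answering:\n"
        f"{numbered}\n\nInput rows:\n{data}"
    )
```

Because every run sees the same ordered procedure, the set of plausible reasoning trajectories shrinks, which is the pruning effect described above.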

Thinking-before-Speaking

Enforces the scaffold by requiring the model to produce well-defined intermediate states (domain, topic schema, clusters) before emitting the final output. By committing to these states, the model follows a more stable reasoning path.
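One way to enforce such commitments is to require the intermediate states as labeled sections preceding the final output and reject any response that skips or reorders them. The section names below are hypothetical, chosen to mirror the states listed above:

```python
# Sketch of Thinking-before-Speaking enforcement: the model must emit its
# intermediate commitments (domain, topic schema, clusters) in labeled
# sections before the final bullets; a validator rejects responses that
# omit a commitment or place it after the final output.
REQUIRED_SECTIONS = ["DOMAIN", "TOPIC SCHEMA", "CLUSTERS", "FINAL SUMMARY"]

def validate_commitments(response: str) -> bool:
    """True iff every intermediate state is present and appears in order."""
    positions = [response.find(f"## {name}") for name in REQUIRED_SECTIONS]
    return all(p >= 0 for p in positions) and positions == sorted(positions)
```

A rejected response can simply be regenerated, so the final output is only ever produced after the model has committed to its intermediate states.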

Figure 2: Illustration of the summarization and tagging operations for TADA. These atomic operations can be composed and reused in complex TADA tasks.

Empirical Observation

We empirically show that requiring relevant intermediate states sharpens the model's output distribution. As Figure 3 shows, CAST produces the sharpest and most concentrated distribution, indicating substantially improved run-to-run stability.

Figure 3: Output-length stability under different prompting strategies. KDE-smoothed distributions of summary length compare (i) direct prompting, (ii) irrelevant intermediate states, (iii) relevant intermediate states, and (iv) the full CAST prompt. CAST produces the sharpest distribution with outputs tightly clustered around a central value.
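The kind of comparison in Figure 3 can be reproduced with a plain Gaussian KDE over output lengths collected across repeated runs. This is a toy reimplementation for illustration, not the paper's exact analysis:

```python
import math
import statistics

def gaussian_kde(samples, x, bandwidth=None):
    """Kernel-density estimate of output length at point x. A taller,
    sharper peak around the mode indicates more stable run-to-run output
    length, mirroring the comparison in Figure 3."""
    if bandwidth is None:
        # Silverman's rule of thumb; fall back to 1.0 for degenerate samples
        s = statistics.stdev(samples)
        bandwidth = 1.06 * s * len(samples) ** -0.2 if s > 0 else 1.0
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in samples) / norm
```

Evaluating the density at the central length for a tightly clustered set of runs versus a widely scattered one makes the "sharp vs. wide" contrast quantitative.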

Results

  • 16.2%: maximum Stability Score improvement
  • 32: dataset-query pairs evaluated
  • 5,100+: items across 4 diverse domains
Figure 4: The CAST framework for tagging, illustrating a pipeline that begins with query decomposition and domain identification to guide the core algorithmic prompting stage, and concludes with output validation.

Citation

@misc{xie2026castachievingstablellmbased,
  title={CAST: Achieving Stable LLM-based Text Analysis for Data Analytics},
  author={Jinxiang Xie and Zihao Li and Wei He and Rui Ding and Shi Han and Dongmei Zhang},
  year={2026},
  eprint={2602.15861},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.15861},
}

Acknowledgments: This work was done during the author's internship at Microsoft Research. We thank all colleagues and mentors from the DKI group for their support and valuable feedback.