Publication Details
LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?
data-efficient training, data augmentation, analysis
Generative large language models (LLMs) are increasingly used for data augmentation tasks, where text samples are paraphrased by an LLM and then used for classifier fine-tuning. Previous studies have compared LLM-based augmentations with established augmentation techniques, but the results are contradictory: some report the superiority of LLM-based augmentations, while others report only marginal increases (and sometimes even decreases) in downstream classifier performance. Research confirming a clear cost-benefit advantage of LLMs over more established augmentation methods is largely missing. To study if (and when) LLM-based augmentation is advantageous, we compared the effects of recent LLM augmentation methods with those of established ones on 6 datasets, 3 classifiers, and 2 fine-tuning methods. We also varied the number of seeds and collected samples to better explore the downstream model accuracy space. Finally, we performed a cost-benefit analysis and show that LLM-based methods are worth deploying only when a very small number of seeds is used. Moreover, in many cases, established methods lead to similar or better model accuracies.
@INPROCEEDINGS{FITPUB13329,
  author    = "J\'{a}n \v{C}egi\v{n} and Jakub \v{S}imko",
  title     = "LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?",
  pages     = "10476--10496",
  booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
  year      = 2025,
  location  = "Albuquerque, New Mexico, US",
  publisher = "Association for Computational Linguistics",
  ISBN      = "979-8-8917-6189-6",
  doi       = "10.18653/v1/2025.naacl-long.526",
  language  = "english",
  url       = "https://www.fit.vut.cz/research/publication/13329"
}