Publication Details
LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?
data-efficient training, data augmentation, analysis
Generative large language models (LLMs) are increasingly used for data augmentation tasks, where text samples are paraphrased by an LLM and then used for classifier fine-tuning. Previous studies have compared LLM-based augmentations with established augmentation techniques, but the results are contradictory: some report the superiority of LLM-based augmentations, while others report only marginal increases (and sometimes even decreases) in downstream classifier performance. Research confirming a clear cost-benefit advantage of LLMs over more established augmentation methods is largely missing. To study if (and when) LLM-based augmentation is advantageous, we compared the effects of recent LLM augmentation methods with those of established ones on 6 datasets, 3 classifiers, and 2 fine-tuning methods. We also varied the number of seeds and collected samples to better explore the downstream model accuracy space. Finally, we performed a cost-benefit analysis and show that LLM-based methods are worth deploying only when a very small number of seeds is used. Moreover, in many cases, established methods lead to similar or better model accuracies.
@INPROCEEDINGS{FITPUB13329,
  author    = "J\'{a}n \v{C}egi\v{n} and Jakub \v{S}imko",
  title     = "LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?",
  pages     = "10476--10496",
  booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
  year      = 2025,
  location  = "Albuquerque, New Mexico, US",
  publisher = "Association for Computational Linguistics",
  ISBN      = "979-8-8917-6189-6",
  doi       = "10.18653/v1/2025.naacl-long.526",
  language  = "english",
  url       = "https://www.fit.vut.cz/research/publication/13329"
}