Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective
Published in ICLR, 2025
This paper examines how synthetic data improves the post-training performance of large language models (LLMs), using a novel reverse-bottleneck perspective.
Recommended citation: Zeyu Gan, Yong Liu. Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective. In The Thirteenth International Conference on Learning Representations, 2025. https://arxiv.org/abs/2410.01720