Elaris Computing Nexus

Personalized Text to Speech Synthesis through Few Shot Speaker Adaptation with Contrastive Learning

Received On : 23 July 2025

Revised On : 12 September 2025

Accepted On : 26 October 2025

Published On : 02 November 2025

Volume 01, 2025

Pages : 206-218


Abstract

Personalized text-to-speech (TTS) synthesis aims to produce natural, expressive speech that emulates the voice of a target speaker from minimal data. Traditional neural TTS models, including Tacotron 2 and FastSpeech 2, require large amounts of speaker-specific training data and therefore cannot be personalized quickly. To address this problem, we propose CL-FS-TTS (Contrastive Learning based Few-Shot Text-to-Speech), a new framework that uses contrastive speaker representation learning to adapt to a new speaker from only 10–30 seconds of reference audio. The CL-FS-TTS architecture comprises two encoders: a content encoder that extracts linguistic features from the text, and a speaker encoder trained with supervised contrastive learning so that embeddings of the same speaker are pulled together while those of different speakers are pushed apart. During adaptation, a contrastive consistency loss aligns the speaker embeddings with the generated mel-spectrograms, improving voice identity and prosodic consistency. We compare CL-FS-TTS with Tacotron 2, FastSpeech 2, AdaSpeech, YourTTS, and Meta-TTS in terms of Mean Opinion Score (MOS), Speaker Similarity Score (SSS), Mel Cepstral Distortion (MCD), and Word Error Rate (WER). The experimental results show that CL-FS-TTS achieves higher naturalness and speaker similarity while reducing adaptation time by 40% relative to the baselines. The proposed model thus offers an efficient and robust approach to high-quality personalized TTS synthesis under data scarcity.
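The supervised contrastive objective outlined above can be illustrated with a short sketch. The snippet below is a minimal, illustrative PyTorch implementation, not the paper's code: it computes a supervised contrastive loss over speaker-encoder outputs, treating utterances from the same speaker as positives and all other utterances in the batch as negatives. The function name, temperature value, embedding dimension, and batch layout are assumptions made for illustration only.

```python
# Illustrative sketch (not the paper's implementation) of a supervised
# contrastive loss over speaker embeddings: same-speaker pairs are positives,
# all other utterances in the batch act as negatives.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor,
                                speaker_ids: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """embeddings: (N, D) speaker-encoder outputs; speaker_ids: (N,) integer labels."""
    z = F.normalize(embeddings, dim=1)                      # unit-norm embeddings
    sim = z @ z.t() / temperature                           # (N, N) scaled cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = speaker_ids.unsqueeze(0).eq(speaker_ids.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))         # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                                   # anchors with at least one positive
    mean_log_prob_pos = (log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)[valid]
                         / pos_counts[valid])
    return -mean_log_prob_pos.mean()

# Hypothetical usage: 8 utterances, 4 speakers (two utterances each), 192-dim embeddings
emb = torch.randn(8, 192)
spk = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
loss = supervised_contrastive_loss(emb, spk)
```

In a setup like CL-FS-TTS, each training batch would be expected to contain several utterances per speaker so that every anchor has at least one positive; the contrastive consistency loss used during adaptation could be formed analogously, treating embeddings extracted from the generated mel-spectrograms and the reference speaker embedding as positive pairs.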

Keywords

Few-Shot Speaker Adaptation, Contrastive Learning, Personalized Text-To-Speech (TTS), Speaker Embedding, Neural Speech Synthesis.

CRediT Author Statement

The author reviewed the results and approved the final version of the manuscript.

Acknowledgements

We would like to thank the reviewers for the time and effort they devoted to reviewing the manuscript. We sincerely appreciate all their valuable comments and suggestions, which helped us improve the quality of the manuscript.

Funding

No funding was received to assist with the preparation of this manuscript.

Ethics Declarations

Conflict of interest

The author has no conflicts of interest to declare that are relevant to the content of this article.

Availability of Data and Materials

Data sharing is not applicable to this article as no new data were created or analysed in this study.

Author Information

Contributions

The author contributed to all aspects of the paper and has read and agreed to the published version of the manuscript.

Corresponding Author

Correspondence to: Natarajan K.

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0), which permits copying and redistribution of the material in any medium or format for non-commercial purposes only, provided the original work is credited and no changes or derivatives of the original work are made. To view a copy of this license, visit: https://creativecommons.org/licenses/by-nc-nd/4.0/

Cite this Article

Natarajan K, “Personalized Text to Speech Synthesis through Few Shot Speaker Adaptation with Contrastive Learning”, Elaris Computing Nexus, pp. 206-218, 2025, doi: 10.65148/ECN/2025019.

Copyright

© 2025 Natarajan K. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.