MegaTTS3
MegaTTS3 is an open-source, lightweight, and efficient text-to-speech (TTS) synthesis system developed by ByteDance. Its main features include:
- Lightweight and Efficient: The TTS Diffusion Transformer backbone has only 0.45B parameters.
- High-Quality Voice Cloning: Generates speech in a voice similar to the speaker in a provided audio sample.
- Bilingual Support: Supports Chinese, English, and text that mixes the two languages.
- Controllability: Supports accent intensity control; finer-grained pronunciation and duration adjustment is planned.
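To put the 0.45B parameter count in perspective, the following is a rough back-of-the-envelope sketch of the memory needed to hold the backbone weights alone at a few common numeric precisions. The helper name `weight_memory_gib` and the choice of precisions are illustrative assumptions, not part of MegaTTS3, and the estimate ignores activations, the other sub-modules, and framework overhead.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights only, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


# 0.45B parameters, as stated for the Diffusion Transformer backbone.
NUM_PARAMS = 0.45e9

# Bytes per parameter for a few common storage precisions (illustrative).
PRECISIONS = [("float32", 4), ("float16/bfloat16", 2), ("int8", 1)]

for name, nbytes in PRECISIONS:
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, nbytes):.2f} GiB")
# float32: 1.68 GiB
# float16/bfloat16: 0.84 GiB
# int8: 0.42 GiB
```

Actual memory use at inference time will be higher than these figures, since they cover weight storage only.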
The project also includes other useful sub-modules, such as:
- Aligner: A speech-text alignment model that can be used for tasks like dataset filtering, speech segmentation, and phoneme recognition.
- Grapheme-to-Phoneme Model: Converts written text (graphemes) into phoneme sequences.
- WaveVAE: A waveform VAE used to compress speech into low-dimensional acoustic latent variables, which can serve as training targets for speech synthesis models or be used for speech conversion.
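As an illustration of the dataset-filtering use mentioned for the Aligner, here is a minimal sketch of dropping utterances whose transcript aligns poorly with the audio. Everything in it is hypothetical: the `Utterance` record, the `align_score` field (imagined here as a mean alignment confidence in [0, 1]), and the 0.8 threshold are assumptions made for the example and do not reflect the Aligner's actual interface or output format.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    audio_path: str
    transcript: str
    # Hypothetical: mean alignment confidence in [0, 1] produced by some aligner.
    align_score: float


def filter_by_alignment(utterances, min_score=0.8):
    """Keep only utterances whose alignment score meets the threshold."""
    return [u for u in utterances if u.align_score >= min_score]


# Made-up example records; the scores are not real model outputs.
samples = [
    Utterance("a.wav", "hello world", 0.95),
    Utterance("b.wav", "mismatched transcript", 0.41),
    Utterance("c.wav", "你好", 0.88),
]

kept = filter_by_alignment(samples)
print([u.audio_path for u in kept])  # ['a.wav', 'c.wav']
```

In practice the threshold would be tuned on a held-out sample, since a cutoff that is too strict discards usable data and one that is too loose keeps mislabeled audio.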
In summary, MegaTTS3 combines a compact TTS model that offers voice cloning and Chinese-English bilingual support with a set of reusable tools (the Aligner, the Grapheme-to-Phoneme Model, and WaveVAE) for related speech processing tasks.