DISCLAIMER: This guide is somewhat technical. It probably does not cover every case or error you may run into, so you may spend time and effort without success. This guide is for learning and exchange only; do not use the product for any illegal activities. Please comply with local laws.
0. Introduction
This guide provides steps for using GPT-SoVITS to clone someone's voice on your own PC. The project is written in Chinese. I am not sure whether computers set up for English can display Chinese characters. If you see squares instead of text, your system probably lacks the Chinese character set.
1. Requirements
This project needs a GPU. My GPU is an RTX 4070, but you can run it with any GPU that supports CUDA. The project can also run on Linux and macOS, but you will need to solve some environment problems yourself. This guide only covers Windows.
Python 3.9 or Python 3.10
CUDA
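The two requirements above can be checked roughly before installing. The sketch below is only an illustration: the helper names are made up for this guide, and finding the nvidia-smi tool on PATH is used as a stand-in for "an NVIDIA driver is installed", which does not by itself prove that CUDA works.

```python
import shutil
import sys

SUPPORTED = {(3, 9), (3, 10)}  # Python versions listed in this guide


def python_ok(version=None):
    """Return True if the (major, minor) version is one this guide lists."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED


def nvidia_driver_found():
    """Assumption: nvidia-smi on PATH suggests an NVIDIA driver is installed."""
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("nvidia-smi found:", nvidia_driver_found())
```

If either line prints False, fix that first; the pre-packaged version may bundle its own Python, so the first check matters mainly for the manual installation.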
2. Installation
2.1 Installation with pre-packaged version (highly recommended)
Download the pre-packaged zip, unzip it, enter the GPT-SoVITS-<version> folder and double-click go-webui.bat.
2.2 Manual installation
1. Install dependencies
pip install -r requirements.txt
2. Install FFmpeg
3. Prepare pretrained models
Download the pretrained models and place them in GPT_SoVITS/pretrained_models.
For UVR5 (vocal/accompaniment separation and reverberation removal), additionally download its models and place them in tools/uvr5/uvr5_weights.
Users in the China region can download these two sets of models by opening the links below and clicking "Download a copy".
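After the manual steps above, a short script can confirm that FFmpeg is reachable and that the two model folders are not empty. This is a minimal sketch: the folder names come from the steps above, the function names are made up, and it only checks that some file exists, not that the models are the right ones. The UVR5 folder is only needed if you plan to use UVR5.

```python
import shutil
from pathlib import Path

# Folder names taken from the steps above, relative to the project root.
MODEL_DIRS = ["GPT_SoVITS/pretrained_models", "tools/uvr5/uvr5_weights"]


def missing_model_dirs(project_root):
    """Return the model folders that are absent or contain no files at all."""
    root = Path(project_root)
    missing = []
    for rel in MODEL_DIRS:
        folder = root / rel
        has_files = folder.is_dir() and any(p.is_file() for p in folder.rglob("*"))
        if not has_files:
            missing.append(rel)
    return missing


def ffmpeg_on_path():
    """True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    print("FFmpeg on PATH:", ffmpeg_on_path())
    print("Missing model folders:", missing_model_dirs("."))
```

Run it from the project root; an empty list means both folders contain at least one file.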
2.3 Installation with Docker
See the Docker instructions in the project's documentation.
After installation, double-click go-webui.bat. The script opens your browser automatically; if it does not, open the WebUI address manually.
---------------------------------------------------------------------------------------------------
If the GPT-SoVITS WebUI page appears in your browser, the installation was successful.
Remember: do not close the terminal opened by go-webui.bat.
3. Preparation
Source Voice
You need an audio recording of the voice you want to clone. About 20 to 30 minutes is sufficient, and 10 minutes or less is also acceptable. What matters most is that the audio is clean: no other people's voices and no background music. If your source is an online video, you can use yt-dlp to download its audio. After you obtain the audio, place the file in a folder and remember the path.
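For the yt-dlp route, the command line can be assembled as in the sketch below. The URL and the raw_audio folder are hypothetical placeholders, and the function name is made up; the flags used are yt-dlp's audio extraction options, which rely on FFmpeg being installed.

```python
import subprocess


def build_ytdlp_command(url, out_dir):
    """Build a yt-dlp command line that keeps only the audio track as WAV."""
    return [
        "yt-dlp",
        "-x",                      # extract audio only
        "--audio-format", "wav",   # convert the audio to WAV (needs FFmpeg)
        "-o", f"{out_dir}/%(title)s.%(ext)s",  # output filename template
        url,
    ]


if __name__ == "__main__":
    # Hypothetical placeholder URL; replace it with the video you want.
    cmd = build_ytdlp_command("https://example.com/watch?v=VIDEO_ID", "raw_audio")
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually download
```

The download call is left commented out so that running the script only prints the command for review.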
If you cannot find audio without background music, you can use UVR5 to remove the background music (BGM) from the audio.
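Before moving on, it can help to check how much audio you have collected against the 10 to 30 minute guidance above. The sketch below sums the length of the WAV files in a folder using Python's standard wave module. It assumes uncompressed PCM WAV files; the folder name raw_audio and the function name are hypothetical.

```python
import wave
from pathlib import Path


def total_wav_minutes(folder):
    """Sum the duration of every .wav file directly inside folder, in minutes."""
    seconds = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            seconds += wav.getnframes() / wav.getframerate()
    return seconds / 60


if __name__ == "__main__":
    # "raw_audio" is a hypothetical folder name; use your own path.
    print(f"{total_wav_minutes('raw_audio'):.1f} minutes of audio")
```

Other formats such as MP3 are not read by the wave module and would need converting first, for example with FFmpeg.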
4. Start
1. Audio Slicer
First, enter your audio file or folder path in "Audio slicer input (file or folder)"; you can also change the output path. After setting the input and output paths, click the "Start audio slicer" button. The log appears in "Audio Slicer output log". Wait about 10 seconds; when you see "切割结束" (slicing finished), the slicing is done.
2. Speech to text
Second, run speech recognition (with a model similar to Whisper) on the sliced audio clips to generate the audio/text annotation file. Click "Start batch ASR"; the first run may take a long time because the model has to be downloaded.
While it runs, progress output appears in the Terminal/PowerShell window.
When the completion message appears, it is done.
Again, remember the path of slicer_opt.list. You can also open the slicer_opt.list file to inspect it; each line pairs one audio slice with its recognized text.
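slicer_opt.list is a plain-text file, so it can also be inspected with a script. The sketch below assumes each line has four fields separated by "|" in the order audio path, speaker, language, text; check this against your own file before relying on it. The class and function names are made up for illustration.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    audio_path: str
    speaker: str
    language: str
    text: str


def parse_list_line(line):
    """Split one annotation line; the text field may itself contain '|'."""
    parts = line.rstrip("\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    return Annotation(*parts)


def empty_text_entries(lines):
    """Return annotations whose recognized text is blank (likely bad slices)."""
    entries = [parse_list_line(l) for l in lines if l.strip()]
    return [e for e in entries if not e.text.strip()]
```

For example, opening the file with open(path, encoding="utf-8") and passing the file object to empty_text_entries lists the slices that received no text, which are good candidates for removal in the labelling step.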
3. Label (optional)
Put your slicer_opt.list file path in ".list annotation file path", then click "Open labelling WebUI". It opens automatically, or you can open its address manually. In this step you check whether each audio slice matches its text, correct the text where needed, and remove bad slices (click 'Yes', then click 'Delete Audio'). Click "Submit Text" and then "Next Index" to check the next page. After checking all slices, click "Save File" and close the tab.
4. Formatting
1. Click 1-GPT-SOVITS-TTS
2. Change the model name
3. Enter the labelling file path
4. Enter the audio slice folder
5. Click "Start one-click formatting"
When you see "一键三连进程结束" (one-click formatting process finished), it is done.
5. Fine-tuned training
Click 1B-Fine-tuned training.
1. Batch size per GPU: my RTX 4070 can handle 6; adjust it for your own GPU.
2. Total epochs and save frequency: how many epochs to train for and how often to save. More epochs do not necessarily give a better result.
3. Click "Start SoVITS training"
When you see "SoVITS训练完成" (SoVITS training finished), it is done.
Then click "Start GPT training"
When you see "GPT训练完成" (GPT training finished), it is done.
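To make the relationship between total epochs and save frequency concrete, here is a small worked sketch. It assumes that a save frequency of N means a checkpoint is written after every N-th epoch; this is an assumption about the trainer's behavior, not something confirmed here, and the function name is made up.

```python
def checkpoint_epochs(total_epochs, save_every):
    """Epochs at which a checkpoint would be written, under the stated assumption."""
    if total_epochs < 1 or save_every < 1:
        raise ValueError("both values must be positive integers")
    return list(range(save_every, total_epochs + 1, save_every))


if __name__ == "__main__":
    print(checkpoint_epochs(8, 4))   # [4, 8]
    print(checkpoint_epochs(10, 4))  # [4, 8]
```

Keeping several checkpoints lets you compare them at inference time, which fits the point that more epochs are not necessarily better.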
6. Inference
1. Click 1C-inference
2. Click refreshing model paths
3. Choose the GPT model and SoVITS model that you just trained.
4. Click "Open TTS inference WEBUI"
The inference page opens in your browser.
The important parts are Step 2 and Step 3 (the reference audio and its text). Choose a voice slice from output/slicer_opt and take the text of that slice from output/asr_opt. This slice matters because the generated voice is heavily based on it. For example, if you want the voice to sound excited, choose a slice that also sounds excited.
You can listen to the generated audio in the browser or download it.
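Finding the text that belongs to a chosen reference slice can be scripted instead of searching the annotation file by hand. The sketch below assumes the annotation file uses "|"-separated lines in the order audio path, speaker, language, text; the function name and the example file name are hypothetical.

```python
def reference_text(list_lines, slice_name):
    """Return the transcript for the slice whose file name is slice_name."""
    for line in list_lines:
        parts = line.rstrip("\n").split("|", 3)
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        audio_path, _speaker, _language, text = parts
        # Compare file names only, so Windows and POSIX separators both work.
        if audio_path.replace("\\", "/").rsplit("/", 1)[-1] == slice_name:
            return text
    return None  # no line matched the requested slice
```

Given the lines of the annotation file and a slice file name such as "clip_0001.wav" (a made-up name), the function returns the text to paste into the reference text field, or None if the slice is not listed.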