Guide Fake GPT-SoVITS Guide

Henri IV · Mar 6, 2024

DISCLAIMER: This guide is somewhat technical. It probably does not cover every case or error you may run into, so you may spend time and effort without success. This guide is for learning and exchange only; do not use the product for any illegal activities. Please comply with local laws.

0. Introduction

This guide walks through the steps for using GPT-SoVITS to clone someone's voice on your own PC.

This project is written in Chinese. I am not sure whether English-language computers can display Chinese characters. If you see squares instead of text, your system may lack a Chinese character set.

1. Requirements

This project needs a GPU. My GPU is an RTX 4070, but any GPU that supports CUDA should work.
The project can also run on Linux and macOS, but you will need to sort out some environment issues yourself. This guide covers only Windows.
Python 3.9 or Python 3.10
CUDA
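
Before installing anything, it can help to confirm the two requirements that are easy to check from a script. The sketch below is only a rough pre-flight check: the helper names are my own, and finding nvidia-smi on PATH is just a hint that an NVIDIA driver is present, not proof that CUDA will work with this project.

```python
import shutil
import sys

# Python versions named in the requirements list above.
SUPPORTED_PYTHON = {(3, 9), (3, 10)}


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.9 or 3.10."""
    return (version_info[0], version_info[1]) in SUPPORTED_PYTHON


def nvidia_smi_found():
    """Return True if nvidia-smi is on PATH (a hint, not proof, of a usable GPU)."""
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    print("Python version OK:", python_is_supported())
    print("nvidia-smi found:", nvidia_smi_found())
```

If the first line prints False, install Python 3.9 or 3.10 before continuing.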

2. Installation

2.1 Installation with pre-packaged version (highly recommended)

Download the pre-packaged zip, unzip it, enter the GPT-SoVITS-<version> folder, and double-click go-webui.bat.

2.2 Installation manually

1. Install dependencies
pip install -r requirements.txt
2. Install FFmpeg

3. Prepare pretrained models
Download the pretrained models and place them in GPT_SoVITS/pretrained_models.
For UVR5 (vocals/accompaniment separation and reverberation removal, additionally), download its models and place them in tools/uvr5/uvr5_weights.
Users in the China region can download these two sets of models by opening the corresponding links and clicking "Download a copy".

For Chinese ASR (additionally), download the three ASR models and place them in tools/damo_asr/models.
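
A common mistake in the manual install is unzipping the models into the wrong folder. Here is a minimal sketch that reports which of the three folders mentioned above are missing or empty. The function name is hypothetical, and it only checks that some file exists in each folder, not that the files are the right ones.

```python
from pathlib import Path

# Folders named in the manual-installation steps, relative to the project root.
MODEL_DIRS = [
    "GPT_SoVITS/pretrained_models",  # pretrained models
    "tools/uvr5/uvr5_weights",       # only needed for UVR5
    "tools/damo_asr/models",         # only needed for Chinese ASR
]


def find_empty_model_dirs(project_root):
    """Return the model folders that are missing or contain no files."""
    root = Path(project_root)
    empty = []
    for rel in MODEL_DIRS:
        folder = root / rel
        has_files = folder.is_dir() and any(p.is_file() for p in folder.rglob("*"))
        if not has_files:
            empty.append(rel)
    return empty
```

Call find_empty_model_dirs with the path of your GPT-SoVITS folder; an empty list means every folder has at least one file in it.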

2.3 Installation with Docker

See the project's Docker instructions.

After installation, double-click go-webui.bat. The script opens your browser automatically, or you can open the WebUI address manually.

---------------------------------------------------------------------------------------------------

If you see the WebUI page in your browser, the installation succeeded.

Remember: Don't close the terminal opened by go-webui.bat.

3. Preparation

Source Voice

You need an audio clip of the voice you want to clone. About 20 to 30 minutes is sufficient, and 10 minutes or less is also acceptable. What matters most is that the audio is clean, with no other voices and no background music. For example, you could use yt-dlp to download the audio track of a video.
After you obtain the audio, place the file in a folder and remember the path.
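
To see how much audio you actually have, you can measure the clip's length. This is a minimal sketch that assumes the file is an uncompressed WAV (the standard-library wave module cannot read MP3 or M4A); the function name is my own.

```python
import wave


def wav_duration_minutes(path):
    """Return the length of a WAV file in minutes."""
    with wave.open(str(path), "rb") as wav:
        # Number of frames divided by frames per second gives seconds.
        seconds = wav.getnframes() / wav.getframerate()
    return seconds / 60.0
```

For other formats, convert to WAV with FFmpeg first, then run the same check.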

If you cannot find audio without background music, you can use UVR5 to remove the BGM.

4. Start

1. Audio Slicer

First, enter your audio file or folder path in "Audio slicer input (file or folder)". You can also change the output path.

After entering the input and output paths, click the "Start audio slicer" button. Progress is shown in the Audio Slicer output log. Wait about 10 seconds; when you see "切割结束" (slicing finished), slicing is complete.

2. Speech to text

Second, run speech recognition on the sliced audio clips to generate voice/text annotation files, similar to what Whisper does.

Click "Start batch ASR"; the first run may take a long time because the model has to be downloaded.
Output scrolling in the Terminal/PowerShell window means it is running.

When the completion message appears, it's done.

Again, remember the path of slicer_opt.list. You can also open the slicer_opt.list file to inspect the generated annotations.
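
If you want to inspect the annotation file from a script, here is a minimal sketch of a parser. It assumes each line has four fields separated by "|" in the order audio path, speaker name, language, text; that layout is my assumption about slicer_opt.list, so check it against your own file first. The Annotation type and function names are hypothetical.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    audio_path: str
    speaker: str
    language: str
    text: str


def parse_list_line(line):
    """Split one annotation line into its four assumed fields."""
    # maxsplit=3 keeps any '|' inside the transcript intact.
    parts = line.rstrip("\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}")
    return Annotation(*parts)


def load_annotations(list_path):
    """Read every non-blank line of a .list file."""
    with open(list_path, encoding="utf-8") as handle:
        return [parse_list_line(line) for line in handle if line.strip()]
```

Opening the file as UTF-8 matters here, because the transcripts may contain Chinese text.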

3. Label (optional)

Put your slicer_opt.list file path in ".list annotation file path", then click "Open labelling WebUI". It opens automatically, or you can enter its address in your browser manually.

In this step, check that each audio slice matches its text. You can remove bad audio slices (click "Yes", then click "Delete Audio") and correct any wrong text. Click "Submit Text", then click "Next Index" to check the next page. After checking all slices, click "Save File" and close the tab.

4. Formatting

1. Click 1-GPT-SOVITS-TTS
2. Change the model name
3. Enter the labelling file path
4. Enter the audio slice folder
5. Click "Start one-click formatting"

When you see "一键三连进程结束" (one-click formatting process finished), it's done.

5. Fine-tuned training

Click 1B-Fine-tuned training

1. Batch size per GPU: My RTX 4070 can handle 6; adjust it to suit your own GPU.
2. Total epochs and save frequency: how many epochs to train for and how often to save. More epochs does not necessarily mean a better result.
3. Click "Start SoVITS training"
When you see "SoVITS训练完成" (SoVITS training finished), it's done.
Then click "Start GPT training"
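
To get a feel for how batch size and epochs interact, you can estimate the number of training steps. This is only back-of-the-envelope arithmetic under my own assumptions (one GPU, every slice used once per epoch, a smaller final batch kept); the real trainer may group or drop samples differently.

```python
import math


def steps_per_epoch(num_slices, batch_size):
    """Batches needed to see every slice once (last batch may be smaller)."""
    return math.ceil(num_slices / batch_size)


def total_steps(num_slices, batch_size, epochs):
    """Rough total number of training steps over all epochs."""
    return steps_per_epoch(num_slices, batch_size) * epochs
```

For example, 100 slices with a batch size of 6 gives 17 steps per epoch, so 8 epochs is about 136 steps.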

When you see "GPT训练完成" (GPT training finished), it's done.

6. Inference

1. Click 1C-inference
2. Click "refreshing model paths"
3. Choose the GPT model and SoVITS model that you just trained.
4. Click "Open TTS inference WEBUI"

The TTS inference WebUI will open.

The important parts are Step 2 and Step 3. Choose a voice slice from output/slicer_opt and the text of that slice from output/asr_opt. This slice matters because the generated voice is heavily based on it. For example, if you want the voice to sound excited, choose a slice that also sounds excited.
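
Finding the text that belongs to a chosen slice by hand can be tedious, so here is a minimal sketch that looks it up in the annotation file. It assumes the same four-field "audio path|speaker|language|text" line layout, which is my assumption about the .list file, and the function name is hypothetical.

```python
def find_reference_text(list_path, slice_filename):
    """Return the transcript recorded for one audio slice, or None if absent."""
    with open(list_path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("|", 3)
            if len(parts) != 4:
                continue  # skip blank or malformed lines
            audio_path, _speaker, _language, text = parts
            # Compare file names only; accept Windows and POSIX separators.
            name = audio_path.replace("\\", "/").rsplit("/", 1)[-1]
            if name == slice_filename:
                return text
    return None
```

Pass the path of the .list file and the file name of the slice you picked, then paste the returned text into the reference text box.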

You can listen to the generated audio in the browser or download it.
