Over the last 15 years (!) I have used Text-to-Speech (TTS) systems for various projects including a fairly elaborate home automation system,1 a distributed public announcement system and a monitoring system.2
I’ve wanted a specific voice for my projects for some time, as well as some kind of excuse to get me interested in AI theory and local AI experiments.
Recently I had an idea to train my favourite TTS engine, Piper TTS, with a single phrase from a demo of a commercial TTS engine. Here’s how I did it, with all the details and code so you can do it too!
Why might you want to do this specifically? Perhaps you’d like a custom voice for Home Assistant with little manual effort – Home Assistant can use Piper TTS voices directly.
Generally, up until about 2020 the voice synthesis engines available to the hobbyist lagged behind the commercial offerings. We had the likes of espeak-ng (a technically capable, but somewhat synthetic-sounding engine), plus picotts3 and festival, which both sound better than espeak but not fantastic, especially when compared to the commercial offerings.
In 2023, along came Piper TTS which really broke the mold. Piper is AI based, but critically is incredibly fast and has low requirements so it can run on weak hardware – or strong hardware with low overhead. The voices it produces are quite realistic – not perfect but certainly far better than the engines of yore. Piper is part of the Open Home Foundation and widely used, so its future seems secure.
Alongside Piper there have been a number of other AI-based TTS engines – some of them are incredibly realistic, to the point where it’s not really possible to tell the result isn’t a human voice. For instance, CosyVoice and Chatterbox TTS.
Chatterbox TTS is a particular achievement in that it is “zero shot.” This means it can clone a voice using a single phrase – without any additional training! There is a demo available on Hugging Face which allows you to try it out.
These state-of-the-art models are in varying stages of maturity. They may require expensive hardware to be as performant as Piper, or are otherwise fairly experimental. As such I wanted to use Piper.
Used as the default TTS engine on Android for a long time. There’s also a fork with improvements called nanotts. ↩︎
Piper has some documentation on how to train a new voice. Plus, there are lots of YouTube tutorials available.
Generally training involves collecting a dataset of audio samples together with the corresponding text, and feeding them into the training scripts which require a recent GPU.
If you have a large dataset, you can train from scratch. However, it is possible to “fine-tune” an existing model with a smaller dataset – in this context it means grabbing a checkpoint from an existing model and training it further with your own data.
Typically, around 2000 epochs are required to train a new voice from scratch, while fine-tuning requires an additional 1000 epochs. Presumably, epochs will take longer with a larger dataset.
About 1300 phrases seem to be recommended for fine-tuning, and 13,000 for training from scratch.4
There are tools available to help you collect the dataset, such as piper-recording-studio.
Based on looking into the training done here: https://brycebeattie.com/files/tts/ ↩︎
As you’ve likely guessed, I wanted to use Chatterbox TTS to generate a dataset from a single phrase, and then use that to fine-tune Piper TTS. Given the layers of transformation, I didn’t expect the result to be an exact copy.
So I picked a sample phrase generated by the legacy TTS engine.
That same phrase regenerated with Chatterbox TTS:
They sound similar! Chatterbox has made the voice sound less robotic. This is OK, but it does make it sound slightly less sinister…5
Next I had to write scripts to generate a dataset from this. That meant running Chatterbox TTS locally.6
While my main desktop PC is a powerful machine, it has a rather ancient GPU: an Nvidia GTX 980Ti7 which has only 6GB of VRAM and the Maxwell architecture, which is no longer supported by most PyTorch projects.
As it happens, I ended up with a friend’s Tesla P4 GPU that we had previously used for transcoding. It has a Pascal architecture and 8GB of VRAM, which is marginally good enough.8 Apparently it is approximately equivalent to an under-clocked 1080ti. I bet newer cards would be far faster though as they have been designed with AI explicitly in mind.
Given this card is a (circa 2016) datacenter GPU, it has no video output so I could not put it in my main desktop which has only one PCIe slot and no integrated graphics. I have a lot of spare hardware lying around so I threw together a case-less system.9
The system has an ancient first generation i7 and only 8GB of RAM, but given the training happens entirely on the GPU, this shouldn’t be a problem.
We were going through a heat wave in the UK, so I set up the system outside under my pergola to keep it cool. …I’m hoping there won’t be any stormy weather before training is done!
I used NixOS as a base, given I can set up a new environment in minutes on a new machine.10 Docker was used together with a wrapper script to make executing the various scripts I developed for each stage easier – I suppose I could have used a nix-shell, but something told me I didn’t want to effectively package and port the whole thing to Nix given the training scripts are experimental and likely to change.
Suffice to say, given the experimental nature of the Piper TTS training scripts, I hit a good chunk of dependency hell. This is understandable given training is a one-shot thing – the training scripts are not packaged or maintained quite to the same level as the main project.
Anyway, if you want to play along I’m including all the scripts I wrote in this post. Beware: while they worked for me when I wrote this post, they probably won’t later – the code is littered with unpinned dependencies and deprecation warnings.
“I run NixOS, BTW.” I had to mention it somewhere, right? ↩︎
I used TRAINING.md as a guide, so I recommend you read it first too, as it covers some things that I don’t.
Assuming you have Docker and GPU drivers installed, I’ll start with the Dockerfile, which includes a few hacks to pin dependencies where APIs have otherwise changed. I based it on the recommendation in the training guide.
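What follows is a rough sketch of such a Dockerfile rather than the exact one used here – the base image and package list are assumptions, and the dependency-pinning hacks mentioned above are omitted:

```dockerfile
# A rough sketch only, not the exact Dockerfile used here: the base image and
# package list are assumptions, and the dependency-pinning hacks are omitted.
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        git build-essential python3 python3-pip python3-dev \
        espeak-ng sox ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# Piper's training code comes from its repository, as per TRAINING.md.
RUN git clone https://github.com/rhasspy/piper /opt/piper \
    && pip3 install --no-cache-dir -e /opt/piper/src/python \
    && cd /opt/piper/src/python && bash build_monotonic_align.sh

# Tools used for dataset generation, validation and export.
RUN pip3 install --no-cache-dir chatterbox-tts openai-whisper onnxsim

# Keep Hugging Face downloads under a path the wrapper script mounts as a
# volume, so model weights are only fetched once.
ENV HF_HOME=/cache/huggingface
WORKDIR /work
```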
The Dockerfile above sets up the environment with all the required dependencies for all the tools used here.
Crucially it sets up some caching configuration so you don’t have to hammer the Hugging Face servers to download weights every time.11 To actually run the scripts, I created a handy wrapper – a mainstay in many of my docker-powered projects.
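A sketch of what such a wrapper can look like – the image tag piper-train and the cache paths are assumptions, not the exact script:

```bash
#!/usr/bin/env bash
# docker-wrapper.sh -- a sketch of the wrapper described here, not the exact
# script. The image tag "piper-train" and the cache paths are assumptions.
set -euo pipefail

# Rebuild the image; this is a near no-op once the layer cache is warm.
docker build -t piper-train .

# --gpus all : GPU passthrough (needs nvidia-container-toolkit on the host)
# -u ...     : run as the invoking user, so output files stay owned by you
# -v $PWD    : expose the current directory (and its scripts) inside the container
# -v ~/.cache/huggingface : persist Hugging Face downloads between runs
exec docker run --rm -it \
    --gpus all \
    -u "$(id -u):$(id -g)" \
    -e HOME=/work \
    -v "$PWD:/work" \
    -v "$HOME/.cache/huggingface:/cache/huggingface" \
    -w /work \
    piper-train "$@"
```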
The wrapper script above allows you to execute scripts in the current directory, but in the environment from the Dockerfile. File ownership is even preserved.
To use it, run ./docker-wrapper.sh <script>. All of the scripts included in this page assume you run them this way.
Note the GPU passthrough – in NixOS I had to set hardware.nvidia-container-toolkit.enable = true; for this to work.
Only when combined with the wrapper script. ↩︎
I combined the corpus from the aforementioned piper-recording-studio with the phrases from the EdgeTX firmware and some quotes I’ve kept over the years. This gave me a corpus of about 1300 phrases and a reasonable coverage of the English language sounds.
The (trivial) script is available above. I included it as it might save a few minutes parsing things manually if adapted.
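As a hedged sketch, combining sources like these can be as simple as concatenating and de-duplicating one-phrase-per-line text files – the input file names below are just placeholders for wherever your phrases live:

```bash
#!/usr/bin/env bash
# 1-build-corpus.sh -- a sketch only: the three input files are placeholders
# for wherever your phrases live, one phrase per line.
set -euo pipefail

cat piper-recording-studio-prompts.txt edgetx-phrases.txt quotes.txt \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | grep -v '^$' \
    | sort -u > corpus.txt

echo "$(wc -l < corpus.txt) phrases written to corpus.txt"
```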
Here’s the tricky part: in theory all I had to do was run the Chatterbox TTS engine over the corpus text and save each generated audio sample to a file.
Using my Ryzen 5950X resulted in a generation speed of about 8 seconds per phrase which was fast enough – I did most of the generation on my desktop before setting up the training system.
However, while this seemed to work OK I quickly realised some of the produced files were nonsensical or corrupted. Given the stochastic nature of AI, and the fact that Chatterbox is brand new, this is not surprising.
For instance, here’s Chatterbox TTS trying to say “zero”:
…to me that sounds like “Win, beach and Prius.” Madness! For some reason zero trips it up nearly every time. Certain other numbers do too! I speculate this could be due to a small input phrase length triggering a bug, or simply bad training data.
If this corruption made it to my training set, it could throw off the resulting model entirely. I had to find a way to automatically fix this problem without manually inspecting every file.
Fortunately, re-attempting the generation of a phrase with Chatterbox TTS can sometimes result in the correct output – as the engine is stochastic, it is non-deterministic unlike traditional TTS engines such as Festival or espeak.
Luckily for me, in addition to the advancements in Text-to-Speech, there have been parallel leaps in transcription engines too. Whisper, from the ironically named OpenAI,12 is now the gold standard.
Like Piper, it’s lightweight enough to run on a low-power system, and it’s fast.
The plan was to use Whisper to transcribe the generated audio samples, and then compare the transcription to the original text. If they match, the sample is good! If not, my script retries a few more times.
I foresaw this requiring a bit of normalisation, as false negatives could occur due to differences in punctuation, capitalisation and spelling between the corpus text and Whisper’s transcription.
At first I used regexes to replace non-word characters with spaces. I realised I’d need something better if I wanted more than the resulting 88% coverage though.
I considered using some kind of fuzzy matching with the Levenshtein distance or a lexicon to make up for the problems in this particular corpus, but that felt like a bad solution.
Revisiting the problem, I thought about what my kids had been learning at school – phonetics! Word sounds are broken down into phonemes, which are not sensitive to spelling variations, punctuation or Americanisms. If I could convert the original text, I could compare the phonemes instead and bypass the issues above.
Earlier I had noticed that Piper TTS has a dependency on espeak-ng. Curiosity had got the better of me and I had looked into it. It turns out espeak-ng is used to outsource the phoneme conversion. Luckily, this is exposed in the Piper python library.
Here’s a quick demonstration of its effectiveness:
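(The snippet below uses the espeak-ng command line tool, which Piper delegates to under the hood, rather than the Piper library call itself.)

```bash
# Piper hands phonemisation off to espeak-ng, so the same conversion can be
# reproduced with the espeak-ng CLI (-q: no audio, --ipa: print IPA phonemes).
# Spelling and punctuation variants should phonemise to the same string:
espeak-ng -q --ipa -v en-gb "Timer 2 elapsed"
espeak-ng -q --ipa -v en-gb "timer two elapsed."
```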
What you can see is a collection of unicode characters representing the International Phonetic Alphabet.
Employing this method, coverage rose to 98% (1644/1677) – I was happy with that! Here’s the script:
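What follows is a minimal sketch of that generate/transcribe/compare loop, not the original script: it assumes the Chatterbox and openai-whisper Python APIs as documented in their READMEs, shells out to espeak-ng for phonemisation rather than using the Piper library, and uses placeholder file names throughout.

```python
#!/usr/bin/env python3
# 2-generate-dataset.py -- a minimal sketch of the generate/transcribe/compare
# loop, NOT the original script. File names and the retry count are assumptions.
import csv
import subprocess
from pathlib import Path

import torchaudio as ta
import whisper
from chatterbox.tts import ChatterboxTTS

REFERENCE = "reference.wav"   # the single cloned phrase (placeholder name)
CORPUS = "corpus.txt"         # one phrase per line (placeholder name)
OUT_DIR = Path("LJSpeech-1.1")
MAX_ATTEMPTS = 3


def phonemes(text: str) -> str:
    """Collapse text to espeak-ng IPA phonemes so spelling, punctuation and
    capitalisation differences don't cause false mismatches."""
    result = subprocess.run(["espeak-ng", "-q", "--ipa", text],
                            capture_output=True, text=True, check=True)
    return "".join(result.stdout.split())


tts = ChatterboxTTS.from_pretrained(device="cuda")
stt = whisper.load_model("base")
(OUT_DIR / "wavs").mkdir(parents=True, exist_ok=True)

phrases = [l.strip() for l in Path(CORPUS).read_text().splitlines() if l.strip()]
with open(OUT_DIR / "metadata.csv", "w", newline="") as meta:
    writer = csv.writer(meta, delimiter="|")
    for idx, phrase in enumerate(phrases, start=1):
        wav_path = OUT_DIR / "wavs" / f"{idx}.wav"
        for _ in range(MAX_ATTEMPTS):
            # Synthesise with the cloned voice, then check it with Whisper.
            wav = tts.generate(phrase, audio_prompt_path=REFERENCE)
            ta.save(str(wav_path), wav, tts.sr)
            heard = stt.transcribe(str(wav_path))["text"]
            if phonemes(heard) == phonemes(phrase):
                writer.writerow([idx, phrase, phrase])  # LJSpeech: id|text|text
                break
            print(f"... {wav_path}: expected '{phrase}', got '{heard}'")
        else:
            wav_path.unlink()  # drop samples that never validated
```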
You may have noticed that the script above writes the data in a specific, “LJSpeech”-compatible format. This is a common format for TTS datasets, and the Piper TTS training scripts support it.
LJSpeech is a public domain dataset of English with about 13,000 phrases. It has a simple CSV file with wav files indexed by a numeric ID. It is established as the de-facto standard for TTS training sets.
Ok, in this case they live up to their namesake (and published papers) ↩︎
Piper TTS training scripts expect the dataset to be in a specific format – 22050Hz mono. Chatterbox produces the files at 24000Hz so I needed to first convert the sample rate, else I suppose the model would learn from files effectively slowed down.
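One way to do that conversion is with sox’s rate and channels effects – a sketch, assuming the LJSpeech-style layout used above:

```bash
# Resample Chatterbox's 24kHz output to the 22050Hz mono that the Piper
# training scripts expect, in place (adjust the path to your layout).
for f in LJSpeech-1.1/wavs/*.wav; do
    sox "$f" "${f%.wav}.tmp.wav" rate 22050 channels 1
    mv "${f%.wav}.tmp.wav" "$f"
done
```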
After this, the LJSpeech-compatible dataset needs to be converted to another, intermediate format – presumably something more convenient for PyTorch. There are scripts to do this in the Piper repository; 3-gen-training-data.sh does this:
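A sketch of such a script, based on the preprocessing command in Piper’s TRAINING.md – the language and directory names are placeholders to adjust for your own dataset:

```bash
#!/usr/bin/env bash
# 3-gen-training-data.sh -- a sketch based on the preprocessing command in
# Piper's TRAINING.md; the language and directory names are placeholders.
set -euo pipefail

python3 -m piper_train.preprocess \
    --language en-us \
    --input-dir ./LJSpeech-1.1 \
    --output-dir ./piper_training_dir \
    --dataset-format ljspeech \
    --single-speaker \
    --sample-rate 22050
```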
Now for the main bit! There was a lot of wiring and converting to get to this point but the rest is easy provided the GPU is set up correctly. Here’s the script:
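A sketch of what that script looks like, following the fine-tuning example in Piper’s TRAINING.md – the checkpoint filename and directory names are placeholders, and the flag values reflect the tweaks discussed below:

```bash
#!/usr/bin/env bash
# 4-train.sh -- a sketch following the fine-tuning example in Piper's
# TRAINING.md; the checkpoint filename is a placeholder for whichever voice
# you are fine-tuning, and the tweaked values are explained below.
set -euo pipefail

python3 -m piper_train \
    --dataset-dir ./piper_training_dir \
    --accelerator gpu \
    --devices 1 \
    --batch-size 12 \
    --validation-split 0.0 \
    --num-test-examples 0 \
    --quality high \
    --max_epochs 3000 \
    --resume_from_checkpoint ./base-voice.ckpt \
    --checkpoint-epochs 1 \
    --precision 32
```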
Versus the example in the Piper documentation:

- I set --batch-size to 12 (found empirically), as the Tesla P4 has 8GB of VRAM and the example batch size of 32 apparently requires 24GB of VRAM. From a bit of searching about the implications, only speed and the measurement of progress seem to be at stake, rather than the quality of the resulting model. Note that the maximum phoneme length of each file is also a factor here.
- I set --quality high too.
- I set --epochs to 3000, as I was fine-tuning an existing model already at 2000 epochs and wanted to add 1000, as explained earlier.

It’s possible to observe the training progress in real time using TensorBoard – a tool that generates graphs based on the logs stored within the training directory.
For speed, I just used nix-shell: nix-shell -p python3Packages.tensorboard --run "tensorboard --logdir ./piper_training_dir". I then made an SSH tunnel to open http://localhost:6006 in my browser. Here’s what it looked like from the first few minutes:
The aim is to see convergence of the loss function, in this case the one known as loss_disc_all. I presume disc stands for discriminator, which is where this loss function gets its name.
The training process will save the weights every so often, in structures known as checkpoints. In theory this happens every 1 epoch as configured above, but empirically I discovered that it happens every 100 epochs. Perhaps that’s the minimum interval.
Assuming training was successful, the final step is to export the model to the standard format: weights in an onnx file plus some metadata in a JSON file.
The resulting model can be simplified with the onnxsim tool.
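A sketch of the export step, again based on the commands in Piper’s TRAINING.md – the checkpoint and output paths are placeholders:

```bash
#!/usr/bin/env bash
# 5-export.sh -- a sketch based on the export commands in Piper's TRAINING.md;
# the checkpoint and output paths are placeholders.
set -euo pipefail

# Adjust to the checkpoint file actually produced by your training run.
CKPT=./piper_training_dir/lightning_logs/version_11/checkpoints/final.ckpt

# Export the checkpoint to ONNX, alongside the JSON config Piper needs.
python3 -m piper_train.export_onnx "$CKPT" ./my-voice.onnx
cp ./piper_training_dir/config.json ./my-voice.onnx.json

# Optionally shrink the graph a little with onnxsim.
onnxsim ./my-voice.onnx ./my-voice.simplified.onnx
```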
When running the data generation script, it was interesting to see the mistakes caught. They range from bizarre to amusing, and some are just plain misleading. Here are a few examples:
... LJSpeech-1.1/wavs/4.wav: expected '4', got ' he now is recording'
... LJSpeech-1.1/wavs/187.wav: expected 'timer 2 elapsed', got ' Timmer 2 elapsed.'
... LJSpeech-1.1/wavs/187.wav: expected 'timer 2 elapsed', got ' Timmer 2 elapsed.'
... LJSpeech-1.1/wavs/6.wav: expected '6', got '26'
... LJSpeech-1.1/wavs/471.wav: expected 'Dig mode', got 'Dick mode.'
As mentioned above, I ended up with a dataset of 1644 phrases.
Starting the training resulted in a mountain of deprecation warnings. Things change fast in the AI world! See the terminal screenshot to see what I mean.
Training was interrupted on purpose to relocate the system outside (under cover!) due to heat. I expected the system to resume seamlessly, but unfortunately I had lost quite a number of epochs as the checkpoint was older than expected.
This can be seen in the loss function graph – grey is version 10, with the blue (version 11) starting before version 10 ended.
Over 1000 epochs, the graph shows convergence! The noise is quite high though, probably due to the low batch size.
The process took around 5 days. Exporting was successful, though simplifying the model only took about 0.2 MB from the 100MB model.
Here’s the test phrase generated by the new model!
It works! It sounds similar to what Chatterbox TTS generated. At that point I realised I should have used a better example phrase throughout this post. Here’s a longer clip of the same voice, generated by the new model:
A rainbow is a meteorological phenomenon that is caused by reflection, refraction and dispersion of light in water droplets resulting in a spectrum of light appearing in the sky.
This is from the Piper project, used as a test phrase.
Given the unknown copyright status of the generated weights, I will not release them. The process here was used to generate a custom voice based on an example from a commercial TTS engine, but it could apply to any other voice.
I successfully fine-tuned a Piper TTS voice based on a single phrase. The process was quite tedious due to the number of moving parts, dependency issues, waiting for processing and the manual steps; however, it was definitely worth it and I hope this article makes it easier for anyone else to do the same.
I think the main application for this is to create a custom voice for Home Assistant.
In the generated chatterbox sample from earlier there is a significant period of low level noise at the end. You can see this on the spectrogram. I wonder how much of the training set has this problem, and how much it affects the model. I will use sox to clip this from the training set and try again to see what difference it makes.
Perhaps it’s something to do with the audio watermarking employed by Chatterbox TTS?
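For reference, trailing noise like that can be trimmed with sox’s silence effect – a sketch, with thresholds that would need tuning against the real files:

```bash
# Trim trailing low-level noise by reversing each file, stripping "silence"
# from the (now leading) end, and reversing back. Thresholds need tuning.
for f in LJSpeech-1.1/wavs/*.wav; do
    sox "$f" "${f%.wav}.trimmed.wav" reverse silence 1 0.1 0.5% reverse
    mv "${f%.wav}.trimmed.wav" "$f"
done
```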
It would be trivial to use the same procedure to produce a much larger training set, perhaps based on the LJSpeech corpus text. This would give me the best possible representation – though depending on the quality of Chatterbox versus the checkpoint I used for this article, it might not sound better!
I suspect I’d also need some much beefier hardware and much more time. That can be arranged.
Given I think I’d need about 13,000 phrases to train from scratch, it could take 200x longer to train assuming a linear relationship between the number of phrases and the training time, with 2000 epochs.
This could take years on my current hardware at that rate, so I’d probably have to get access to something more powerful. Because of this, I think the fine-tuning approach is the right compromise.
How about making some Piper TTS voices based on some evil AIs from various films or games? That would be fun.
There are some specific words that a lot of TTS engines struggle with. I could pre-translate the text to ensure these are pronounced correctly in the training set.
Thanks for reading! If you have comments or like this article, post or upvote it on Hacker news, Twitter, Hackaday, Lobste.rs, Reddit and/or LinkedIn.
Please email me with any corrections or feedback.