Data preparation is the crucial first step in developing high‑quality text‑to‑speech (TTS) systems. This article briefly summarizes the process of converting raw audio into a final training dataset, along with some open‑source speech corpora suitable for training speech synthesis models.

Data Preparation Overview

Training a TTS system requires a large amount of high‑quality, structured speech data. To obtain such a dataset, we need a complete data‑processing pipeline that includes audio normalization, speaker diarization, segmentation, and transcription steps.

Emilia-Pipe Processing Pipeline

Emilia-Pipe is a processing pipeline designed specifically for TTS data preparation, comprising the following key steps:

| Step | Description |
| --- | --- |
| Normalization | Normalize audio to ensure consistent volume and quality |
| Source Separation | Process long audio into pure speech without background music (BGM) |
| Speaker Diarization | Extract medium‑length single‑speaker speech data |
| Fine Segmentation Based on VAD | Split speech into 3‑30 second single‑speaker segments |
| ASR | Obtain text transcriptions for speech segments |
| Filtering | Quality control to obtain the final processed dataset |

The source code of the Emilia preprocessing tool is available on GitHub: Amphion/preprocessors/Emilia
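
To make the flow concrete, here is a minimal, hypothetical sketch of the pipeline’s control flow in Python. The helper functions (normalize, separate_vocals, diarize, vad_split, transcribe, passes_quality_filter) are illustrative stand‑ins for the corresponding Emilia‑Pipe stages, not the tool’s actual API:

# Hypothetical sketch of the Emilia-Pipe stages; every helper function
# below is an illustrative stand-in, not the real Emilia-Pipe API.
def process(raw_audio_path: str) -> list[dict]:
    audio = normalize(raw_audio_path)        # volume/quality normalization
    speech = separate_vocals(audio)          # remove BGM, keep pure speech
    turns = diarize(speech)                  # single-speaker turns
    segments = [
        seg
        for turn in turns
        for seg in vad_split(turn, min_sec=3, max_sec=30)  # 3-30 s pieces
    ]
    samples = [{"audio": seg, "text": transcribe(seg)} for seg in segments]  # ASR
    return [s for s in samples if passes_quality_filter(s)]  # final filtering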

Speaker Diarization

Speaker diarization is a key step in TTS data preparation, used to identify “who spoke when”. This technology is essential for extracting single‑speaker speech segments from multi‑speaker conversations, podcasts, and other audio sources.

More detailed information about speaker diarization technology can be found at Speaker Diarization 3.1
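
As a concrete illustration, the pyannote pipeline referenced above can be run in a few lines. This is a sketch; the Hugging Face access token and the input file audio.wav are placeholders:

from pyannote.audio import Pipeline

# Load the pretrained diarization pipeline (requires a Hugging Face token).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # placeholder
)

diarization = pipeline("audio.wav")  # placeholder input file

# Print "who spoke when" and save the result in RTTM format.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.2f}s - {turn.end:.2f}s")

with open("audio.rttm", "w") as f:
    diarization.write_rttm(f)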

RTTM (Rich Transcription Time Marked) is a commonly used annotation format in speech processing for recording speaker‑turn information. The columns of an RTTM file mean the following:

| Column Name | Description |
| --- | --- |
| Type | Segment type; should always be SPEAKER |
| File ID | File name; the base name of the recording (without extension), e.g., rec1_a |
| Channel ID | Channel ID (starting from 1); should always be 1 |
| Turn Onset | Turn start time (seconds from the beginning of the recording) |
| Turn Duration | Turn duration (seconds) |
| Orthography Field | Should always be <NA> |
| Speaker Type | Should always be <NA> |
| Speaker Name | Speaker identifier; should be unique within each file |
| Confidence Score | System confidence score (probability); should always be <NA> |
| Signal Lookahead Time | Should always be <NA> |
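
Because RTTM files are plain whitespace‑delimited text, they are straightforward to parse. A minimal reader that keeps only the meaningful fields might look like this:

def read_rttm(path: str) -> list[dict]:
    """Parse an RTTM file into a list of speaker turns."""
    turns = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            # Skip empty lines and any non-SPEAKER records.
            if not fields or fields[0] != "SPEAKER":
                continue
            turns.append({
                "file_id": fields[1],
                "channel": int(fields[2]),
                "onset": float(fields[3]),
                "duration": float(fields[4]),
                "speaker": fields[7],
            })
    return turns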

Efficiency in Practice

In real‑world production environments, using GPUs can dramatically increase processing efficiency. Tests show that a single A800 GPU can process roughly 3,000 hours of audio data per day in batch mode.

Open‑source Chinese Speech Datasets

The following Chinese open‑source speech datasets are suitable for training speech synthesis models.

| Dataset Name | Duration (hours) | Number of Speakers | Quality |
| --- | --- | --- | --- |
| aidatatang_200zh | 200 | 600 | Medium |
| aishell1 | 180 | 400 | Medium |
| aishell3 | 85 | 218 | Medium |
| primewords | 99 | 296 | Medium |
| thchs30 | 34 | 40 | Medium |
| magicdata | 755 | 1080 | Medium |
| Emilia | 200,000+ | N/A | Low |
| WenetSpeech4TTS | 12,800 | N/A | Low |
| CommonVoice | N/A | N/A | Low |

Multilingual Open‑source Speech Datasets

The following English and multilingual open‑source speech datasets are suitable for training speech synthesis models.

| Dataset Name | Duration (hours) | Number of Speakers | Quality / Sample Rate |
| --- | --- | --- | --- |
| LibriTTS‑R | 585 | 2456 | High |
| Hi‑Fi TTS | 291 | 10 | Very High |
| LibriHeavy | 60,000+ | 7,000+ | 16 kHz |
| MLS English | 44,500 | 5,490 | 16 kHz |
| MLS German | 1,966 | 176 | 16 kHz |
| MLS Dutch | 1,554 | 40 | 16 kHz |
| MLS French | 1,076 | 142 | 16 kHz |
| MLS Spanish | 917 | 86 | 16 kHz |
| MLS Italian | 247 | 65 | 16 kHz |
| MLS Portuguese | 160 | 42 | 16 kHz |
| MLS Polish | 103 | 11 | 16 kHz |

Processing Speech Data with lhotse

lhotse is a data‑management framework designed specifically for speech processing, providing a complete workflow for handling audio data. Its core concept is manifest‑based data representation:

Data Representation

  1. Audio Data Representation: Stored in RecordingSet/Recording, containing metadata such as sources, sampling_rate, num_samples, duration, and channel_ids.

  2. Annotation Data Representation: Stored in SupervisionSet/SupervisionSegment, containing information such as start, duration, transcript, language, speaker, and gender.
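
For example, a recording and its annotation can be created as follows (a sketch; the audio path and transcript are placeholders):

from lhotse import Recording, SupervisionSegment

# Build a Recording from an audio file; lhotse reads the metadata
# (sampling rate, number of samples, duration) from the file itself.
recording = Recording.from_file("audio.wav")

# Annotate a 5-second utterance starting at 0.0 s in that recording.
supervision = SupervisionSegment(
    id="audio-seg-0",
    recording_id=recording.id,
    start=0.0,
    duration=5.0,
    channel=0,
    text="example transcript",
    language="English",
    speaker="spk-001",
)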

Data Processing Workflow

lhotse uses the concept of a Cut as a view into (or pointer to) an audio segment; the main cut types are MonoCut, MixedCut, and PaddingCut, which are collected into a CutSet. The processing workflow is as follows:

  • Load manifests as a CutSet, enabling uniform slicing, multi‑threaded feature extraction, padding, and generating Sampler and DataLoader for PyTorch
  • Feature extraction supports various extractors such as PyTorch fbank & MFCC, torchaudio, librosa, etc.
  • Feature normalization supports mean‑variance normalization (CMVN), global normalization, per‑sample normalization, and sliding‑window normalization
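
A typical workflow builds a CutSet from the two manifests and then trims or windows it. This is a sketch assuming previously saved manifest files (the paths are placeholders):

from lhotse import CutSet, RecordingSet, SupervisionSet

# Load previously written manifests (placeholder paths).
recordings = RecordingSet.from_file("recordings.jsonl.gz")
supervisions = SupervisionSet.from_file("supervisions.jsonl.gz")

# Combine the manifests into cuts.
cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)

# One cut per annotated segment...
cuts = cuts.trim_to_supervisions()

# ...or uniform fixed-length windows over long recordings.
windows = cuts.cut_into_windows(duration=10.0)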

Parallel Processing

lhotse supports multi‑process parallel processing; for example:

from concurrent.futures import ProcessPoolExecutor
from lhotse import CutSet, Fbank, LilcomChunkyWriter

# Load a previously prepared cut manifest (placeholder path).
cuts = CutSet.from_file('cuts.jsonl.gz')

num_jobs = 8
with ProcessPoolExecutor(num_jobs) as ex:
    # Extract fbank features in parallel and store them compressed on disk.
    cuts = cuts.compute_and_store_features(
        extractor=Fbank(),
        storage_path='feats',
        storage_type=LilcomChunkyWriter,
        executor=ex,
    )

PyTorch Integration

lhotse integrates seamlessly with PyTorch:

  • CutSet can be used directly as a Dataset, supporting noise padding, acoustic context padding, and dynamic batch sizes
  • Provides various samplers such as SimpleCutSampler, BucketingSampler, and CutPairsSampler, supporting checkpointing and dynamic batch‑size generation based on total speech duration
  • Batch I/O supports pre‑compute mode (for slow I/O) and on‑the‑fly feature extraction mode (for data augmentation)
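
For instance, a duration‑based sampler can drive a standard PyTorch DataLoader. The sketch below uses lhotse’s built‑in K2SpeechRecognitionDataset purely for illustration (it expects precomputed features); any lhotse‑style dataset class plugs in the same way, and the manifest path is a placeholder:

from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset import K2SpeechRecognitionDataset, SimpleCutSampler

cuts = CutSet.from_file("cuts.jsonl.gz")  # placeholder manifest path

# Batch by total speech duration instead of a fixed number of examples.
sampler = SimpleCutSampler(cuts, max_duration=200.0, shuffle=True)

# The sampler yields CutSet mini-batches; the dataset collates them into tensors.
dataset = K2SpeechRecognitionDataset()
dloader = DataLoader(dataset, sampler=sampler, batch_size=None)

for batch in dloader:
    pass  # each batch is a dict of tensors ready for training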

Command‑line Tools

lhotse’s command‑line utilities are also very practical: they include combine, copy, and copy-feats, plus many cut operations (append, decompose, describe, and more), simplifying the data‑processing pipeline:

lhotse combine
lhotse copy
lhotse copy-feats
lhotse cut append
lhotse cut decompose
lhotse cut describe
lhotse cut export-to-webdataset
lhotse cut mix-by-recording-id
lhotse cut mix-sequential
lhotse cut pad
lhotse cut simple

lhotse provides prepare functions for many open‑source datasets, making it easy to download and process these standard speech corpora.
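
For example, LibriSpeech manifests can be built directly from Python (a sketch; the corpus and output directories are placeholders, and the corpus is assumed to be already downloaded):

from lhotse.recipes import prepare_librispeech

# Build RecordingSet/SupervisionSet manifests for a local LibriSpeech copy.
manifests = prepare_librispeech(
    corpus_dir="data/LibriSpeech",
    output_dir="data/manifests",
)

# Each dataset part maps to its recordings and supervisions manifests.
recordings = manifests["train-clean-100"]["recordings"]
supervisions = manifests["train-clean-100"]["supervisions"]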

Conclusion

TTS data preparation is a multi‑step, complex process that spans audio processing, speaker diarization, and speech recognition, among other technical areas. With tools like Emilia‑Pipe and a well‑designed workflow, raw audio can be transformed into high‑quality TTS training datasets, laying a solid foundation for building natural and fluent speech synthesis systems.

For teams aiming to develop TTS systems, it is advisable to allocate sufficient resources to the data‑preparation stage, as data quality directly determines the final model performance.