TTS Data Preparation Process
Data preparation is a crucial first step in developing a high-quality text-to-speech (TTS) system. This article summarizes the process of converting raw audio into a final training dataset and lists some open-source speech databases suitable for training speech synthesis models.
Overview of Data Preparation
Training a TTS system requires a large amount of high-quality, structured speech data. Obtaining such a dataset calls for a complete processing pipeline, including steps such as audio normalization, speaker separation, segmentation, and transcription.
Emilia-Pipe Process
Emilia-Pipe is a processing pipeline designed for TTS data preparation. It consists of the following key steps:
| Step | Description |
|---|---|
| Normalization | Normalizes audio to ensure consistent volume and quality |
| Source separation | Processes long audio into pure speech with background music (BGM) removed |
| Speaker separation | Extracts medium-length, single-speaker speech data |
| VAD-based fine segmentation | Slices speech into 3-30 second single-speaker segments |
| ASR | Generates text transcriptions of the speech segments |
| Filtering | Applies quality control to obtain the final processed dataset |
The source code for the Emilia preprocessing tool is available on GitHub: Amphion/preprocessors/Emilia
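The stages above can be sketched as a simple chain of transformations. The sketch below is purely illustrative and is not the actual Emilia-Pipe code; every function name in it is hypothetical.

```python
# Illustrative outline of the Emilia-Pipe stages; all names are hypothetical.
def normalize(audio):
    audio["loudness_normalized"] = True  # consistent volume/quality
    return audio

def separate_sources(audio):
    audio["bgm_removed"] = True  # pure speech without background music
    return audio

def diarize(audio):
    # Speaker separation: split into single-speaker regions ("who spoke when").
    return [{"speaker": "spk0", **audio}]

def vad_segment(regions):
    # VAD-based fine segmentation into 3-30 s single-speaker segments.
    return [dict(r, start=0.0, duration=10.0) for r in regions]

def transcribe(segments):
    # ASR: attach a text transcription to each segment.
    for seg in segments:
        seg["text"] = "<asr transcript>"
    return segments

def filter_segments(segments):
    # Quality control: keep only segments within the target length range.
    return [s for s in segments if 3.0 <= s["duration"] <= 30.0]

def emilia_pipe(audio):
    return filter_segments(
        transcribe(vad_segment(diarize(separate_sources(normalize(audio)))))
    )

result = emilia_pipe({"path": "podcast.wav"})
```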
Speaker Separation
Speaker Diarization is a key step in TTS data preparation that identifies “who is talking when”. This technique is essential for extracting a single speaker’s voice segment from audio such as a multi-person conversation or podcast.
More details on speaker separation techniques can be found in Speaker Diarization 3.1.
RTTM (Rich Transcription Time Marked) is an annotation format commonly used in speech processing to record speaker-turn information. The meanings of the columns of an RTTM file are listed below:
| Column Name | Description |
|---|---|
| Type | Type of segment; should always be SPEAKER |
| File ID | File name; base name of the recording (without extension), e.g. rec1_a |
| Channel ID | Channel ID (indexed from 1); should always be 1 |
| Turn Onset | Turn onset time (number of seconds from the start of the recording) |
| Turn Duration | Turn duration (in seconds) |
| Orthography Field | Should always be `<NA>` |
| Speaker Type | Should always be `<NA>` |
| Speaker Name | Speaker name; should be unique within each file |
| Confidence Score | System confidence score (probability); should always be `<NA>` |
| Signal Lookahead Time | Should always be `<NA>` |
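Since a SPEAKER record is a single whitespace-delimited line in the column order above, it can be parsed in a few lines of Python. This is a minimal sketch; real RTTM files may contain other record types that a robust parser would need to handle.

```python
def parse_rttm_line(line):
    """Parse one SPEAKER record of an RTTM file into a dict."""
    (rec_type, file_id, channel, onset, duration,
     _ortho, _spk_type, speaker, _conf, _lookahead) = line.split()
    assert rec_type == "SPEAKER"
    return {
        "file_id": file_id,
        "channel": int(channel),
        "onset": float(onset),
        "duration": float(duration),
        "speaker": speaker,
    }

line = "SPEAKER rec1_a 1 12.35 4.20 <NA> <NA> spk01 <NA> <NA>"
seg = parse_rttm_line(line)
# seg["onset"] → 12.35, seg["speaker"] → "spk01"
```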
Efficiency in practice
In a real production environment, GPUs can dramatically improve processing efficiency: in testing, batch processing on a single A800 GPU handled about 3,000 hours of audio data per day.
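Taking the reported figure at face value, that throughput corresponds to roughly 125 times faster than real time:

```python
# Back-of-the-envelope check of the reported throughput.
hours_processed = 3000        # audio hours processed per day (reported figure)
wall_clock_hours = 24         # one day of wall-clock time
rtf = hours_processed / wall_clock_hours  # speed-up over real time
# rtf == 125.0, i.e. ~125x real time on one A800 GPU
```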
Chinese Open Source Speech Data
The following Mandarin open-source speech datasets are suitable for training speech synthesis models.
| Data name | Number of hours | Number of speakers | Quality |
|---|---|---|---|
| aidatatang_200zh | 200 | 600 | Medium |
| aishell1 | 180 | 400 | Medium |
| aishell3 | 85 | 218 | Medium |
| primewords | 99 | 296 | Medium |
| thchs30 | 34 | 40 | Medium |
| magicdata | 755 | 1080 | Medium |
| Emilia | 200,000+ | N/A | Low |
| WenetSpeech4TTS | 12,800 | N/A | Low |
| CommonVoice | N/A | N/A | Low |
Multilingual Open Source Speech Data
The following English and multilingual open-source speech datasets are suitable for training speech synthesis models.
| Data name | Number of hours | Number of speakers | Quality / Sample rate |
|---|---|---|---|
| LibriTTS-R | 585 | 2456 | High |
| Hi-Fi TTS | 291 | 10 | Very High |
| LibriHeavy | 60000+ | 7000+ | 16kHz |
| MLS English | 44500 | 5490 | 16kHz |
| MLS German | 1966 | 176 | 16kHz |
| MLS Dutch | 1554 | 40 | 16kHz |
| MLS French | 1076 | 142 | 16kHz |
| MLS Spanish | 917 | 86 | 16kHz |
| MLS Italian | 247 | 65 | 16kHz |
| MLS Portuguese | 160 | 42 | 16kHz |
| MLS Polish | 103 | 11 | 16kHz |
Speech data processing with lhotse
lhotse is a data management framework designed specifically for speech processing, providing a complete workflow for handling audio data. Its core concept is a manifest-based data representation:
Data Representation
Audio data representation: RecordingSet/Recording stores audio metadata, including the audio sources (sources), sampling rate (sampling_rate), number of samples (num_samples), duration (duration), and channel IDs (channel_ids).
Annotation data representation: SupervisionSet/SupervisionSegment stores annotation information, including start, duration, transcript (text), language, speaker, and gender.
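The two manifest types can be pictured as plain records. The sketch below mirrors the fields named above using ordinary dictionaries; lhotse itself represents them with Recording and SupervisionSegment objects, and all concrete values here are made up for illustration.

```python
# Illustrative manifest records mirroring lhotse's field names (values are made up).
recording = {
    "id": "rec1_a",
    "sources": [{"type": "file", "channels": [0], "source": "audio/rec1_a.wav"}],
    "sampling_rate": 16000,
    "num_samples": 160000,
    "duration": 10.0,
    "channel_ids": [0],
}

supervision = {
    "id": "rec1_a-seg0",
    "recording_id": "rec1_a",   # links the annotation back to its recording
    "start": 0.5,
    "duration": 4.2,
    "text": "hello world",
    "language": "English",
    "speaker": "spk01",
    "gender": "female",
}

# Consistency check: num_samples should equal sampling_rate * duration.
assert recording["num_samples"] == int(recording["sampling_rate"] * recording["duration"])
```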
Data Processing Flow
lhotse uses the concept of a Cut as a view of (or pointer to) an audio clip; the main types are MonoCut, MixedCut, PaddingCut, and CutSet. The processing flow is as follows:
- Load the manifests as a CutSet, which supports equal-length cutting, multi-threaded feature extraction, and padding, and can generate a Sampler and DataLoader for PyTorch.
- Feature extraction supports multiple extractors, such as lhotse's PyTorch-based fbank and MFCC, torchaudio, librosa, and so on.
- Feature normalization supports cepstral mean and variance normalization (CMVN), global normalization, per-sample normalization, and sliding-window normalization.
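To make the normalization step concrete, here is a minimal NumPy sketch of CMVN applied over the time axis of a feature matrix (not lhotse's implementation, just the underlying idea):

```python
import numpy as np

def cmvn(feats: np.ndarray) -> np.ndarray:
    """Cepstral mean and variance normalization over the time axis.

    feats: (num_frames, num_bins) feature matrix, e.g. fbank features.
    """
    mean = feats.mean(axis=0)
    std = feats.std(axis=0) + 1e-10   # avoid division by zero
    return (feats - mean) / std

# Synthetic features with non-zero mean and non-unit variance.
feats = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(100, 80))
norm = cmvn(feats)
# Each feature dimension now has ~zero mean and ~unit variance.
```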
Parallel Processing
lhotse supports multi-process parallel processing; sample code is as follows:

```python
from concurrent.futures import ProcessPoolExecutor

from lhotse import CutSet, Fbank, LilcomChunkyWriter

# `cuts` is an existing CutSet, e.g. previously loaded from a manifest file.
num_jobs = 8
with ProcessPoolExecutor(num_jobs) as ex:
    cuts: CutSet = cuts.compute_and_store_features(
        extractor=Fbank(),
        storage=LilcomChunkyWriter('feats'),
        executor=ex,
    )
```

PyTorch Integration
lhotse integrates seamlessly with PyTorch:
- A CutSet can be used directly as a Dataset, supporting noise padding, acoustic context padding, and dynamic batch sizes.
- Multiple samplers are provided, such as SimpleCutSampler, BucketingSampler, and CutPairsSampler, with support for state recovery and dynamic batch sizes based on total speech duration.
- Batch I/O supports a pre-computed mode (for slow I/O) and an on-the-fly feature extraction mode (for data augmentation).
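The idea behind duration-based dynamic batching is simple: instead of a fixed number of items per batch, segments are accumulated until their total duration reaches a budget. The sketch below is a simplified illustration of that idea, not lhotse's sampler implementation:

```python
def batch_by_duration(durations, max_duration=30.0):
    """Group segment durations into batches whose total length stays
    within max_duration seconds (simplified duration-based batching)."""
    batches, current, total = [], [], 0.0
    for d in durations:
        if current and total + d > max_duration:
            batches.append(current)       # budget exceeded: start a new batch
            current, total = [], 0.0
        current.append(d)
        total += d
    if current:
        batches.append(current)
    return batches

batches = batch_by_duration([12.0, 10.0, 9.0, 5.0, 25.0], max_duration=30.0)
# → [[12.0, 10.0], [9.0, 5.0], [25.0]]
```

Short segments are packed together while a long segment gets a batch of its own, keeping memory use per batch roughly constant.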
Command Line Tools
lhotse’s command line tools are quite useful, including combine, copy, copy-feats and multiple cut operations such as append, decompose, describe, etc., which simplify the data processing process:
```shell
lhotse combine
lhotse copy
lhotse copy-feats
lhotse cut append
lhotse cut decompose
lhotse cut describe
lhotse cut export-to-webdataset
lhotse cut mix-by-recording-id
lhotse cut mix-sequential
lhotse cut pad
lhotse cut simple
```

lhotse also provides prepare functions for many open-source datasets, which make it easy to download and process these standard speech corpora.
To summarize
TTS data preparation is a multi-step and complex process that involves multiple technical areas such as audio processing, speaker separation and speech recognition. With tools such as Emilia-Pipe and a well-established processing flow, we can transform raw audio into high-quality TTS training datasets, laying the foundation for building a natural and smooth speech synthesis system.
For teams wishing to develop a TTS system, it is recommended that sufficient resources be invested in the data preparation phase, as the quality of the data directly determines the performance of the final model.