TTS Data Preparation Process

Data preparation is a crucial first step in developing a high-quality text-to-speech (TTS) system. This article briefly summarizes the process of converting raw audio into a final training dataset, and surveys some open-source speech databases suitable for training speech synthesis models.

Overview of Data Preparation

Training a TTS system requires a large amount of high-quality, structured speech data. Producing such a dataset calls for a complete processing pipeline, including steps such as audio normalization, source separation, speaker diarization, segmentation, and transcription.

Emilia-Pipe Process

Emilia-Pipe is a processing pipeline designed for TTS data preparation. It consists of the following key steps:

| Step | Description |
|---|---|
| Normalization | Normalize audio to ensure consistent volume and quality |
| Source separation | Process long audio into clean speech without background music (BGM) |
| Speaker diarization | Extract medium-length, single-speaker speech data |
| VAD-based fine segmentation | Slice speech into 3-30 second single-speaker segments |
| ASR | Obtain text transcriptions of the speech segments |
| Filtering | Apply quality control to obtain the final processed dataset |

The source code for the Emilia preprocessing tool is available on GitHub: Amphion/preprocessors/Emilia
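As a toy illustration of the first stage, peak normalization can be sketched in a few lines of Python. This is a simplified stand-in for the normalization Emilia-Pipe performs (real pipelines typically normalize integrated loudness, e.g. to a target LUFS); the function name is illustrative, not the actual Amphion API.

```python
def peak_normalize(samples, target_peak=0.95):
    """Scale a waveform (floats in [-1, 1]) so its peak amplitude hits target_peak.

    Simplified illustration; production pipelines normalize loudness, not just peak.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silent clip: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

# Example: a quiet clip is brought up to a consistent level.
quiet = [0.1, -0.2, 0.05]
loud = peak_normalize(quiet)  # peak is now 0.95
```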

Speaker Separation

Speaker diarization is a key step in TTS data preparation that identifies “who spoke when”. This technique is essential for extracting a single speaker’s voice segments from audio such as multi-person conversations or podcasts.

More details on speaker separation techniques can be found in Speaker Diarization 3.1.

RTTM (Rich Transcription Time Marked) is an annotation format commonly used in speech processing to record speaker-turn information. The columns of an RTTM file are listed below:

| Column Name | Description |
|---|---|
| Type | Segment type; should always be `SPEAKER` |
| File ID | File name; base name of the recording (without extension), e.g. `rec1_a` |
| Channel ID | Channel ID (indexed from 1); should always be `1` |
| Turn Onset | Turn onset time (seconds from the start of the recording) |
| Turn Duration | Turn duration (in seconds) |
| Orthography Field | Should always be `<NA>` |
| Speaker Type | Should always be `<NA>` |
| Speaker Name | Speaker name; should be unique within each file |
| Confidence Score | System confidence score (probability); should always be `<NA>` |
| Signal Lookahead Time | Should always be `<NA>` |
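The columns above can be read with a small parser. A minimal sketch (the example line and field names follow the table; only the fields useful for segmentation are kept):

```python
def parse_rttm_line(line):
    """Parse one RTTM SPEAKER line into a dict of its useful fields."""
    (seg_type, file_id, channel, onset, duration,
     _ortho, _spk_type, speaker, _conf, _lookahead) = line.split()
    return {
        "type": seg_type,          # always "SPEAKER"
        "file_id": file_id,        # recording base name
        "channel": int(channel),   # always 1
        "onset": float(onset),     # seconds from recording start
        "duration": float(duration),
        "speaker": speaker,
    }

line = "SPEAKER rec1_a 1 5.44 2.31 <NA> <NA> spk01 <NA> <NA>"
seg = parse_rttm_line(line)
# seg describes a turn by spk01 spanning 5.44 s to 7.75 s in rec1_a
```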

Efficiency in practice

In a real production environment, GPUs can dramatically improve processing efficiency. In one test, a single A800 GPU running batch processing handled about 3,000 hours of audio per day.
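That figure makes capacity planning a matter of simple arithmetic: 3,000 audio-hours per 24-hour day is roughly 125x real time. A back-of-envelope estimator, assuming throughput scales linearly with GPU count (an idealization; I/O and scheduling overhead will reduce it in practice):

```python
def days_needed(total_audio_hours, hours_per_gpu_day=3000, num_gpus=1):
    """Estimate wall-clock days to process a corpus, assuming linear scaling."""
    return total_audio_hours / (hours_per_gpu_day * num_gpus)

# 3000 h/day on one GPU = 3000 / 24 = 125x real time.
# A 100,000-hour corpus on 8 GPUs takes roughly 4.2 days:
estimate = days_needed(100_000, num_gpus=8)
```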

Chinese Open Source Speech Data

Mandarin open-source speech datasets suitable for training speech synthesis models:

| Dataset | Hours | Speakers | Quality |
|---|---|---|---|
| aidatatang_200zh | 200 | 600 | Medium |
| aishell1 | 180 | 400 | Medium |
| aishell3 | 85 | 218 | Medium |
| primewords | 99 | 296 | Medium |
| thchs30 | 34 | 40 | Medium |
| magicdata | 755 | 1080 | Medium |
| Emilia | 200,000+ | N/A | Low |
| WenetSpeech4TTS | 12,800 | N/A | Low |
| CommonVoice | N/A | N/A | Low |

Multilingual Open Source Speech Data

English and multilingual open-source speech datasets suitable for training speech synthesis models:

| Dataset | Hours | Speakers | Quality |
|---|---|---|---|
| LibriTTS-R | 585 | 2456 | High |
| Hi-Fi TTS | 291 | 10 | Very High |
| LibriHeavy | 60,000+ | 7000+ | 16kHz |
| MLS English | 44,500 | 5490 | 16kHz |
| MLS German | 1966 | 176 | 16kHz |
| MLS Dutch | 1554 | 40 | 16kHz |
| MLS French | 1076 | 142 | 16kHz |
| MLS Spanish | 917 | 86 | 16kHz |
| MLS Italian | 247 | 65 | 16kHz |
| MLS Portuguese | 160 | 42 | 16kHz |
| MLS Polish | 103 | 11 | 16kHz |

Speech data processing with lhotse

lhotse is a data management framework designed specifically for speech processing, providing a complete workflow for handling audio data. Its core concept is a manifest-based data representation:

Data Representation

  1. Audio data representation: RecordingSet/Recording store audio metadata, including the audio source (sources), sampling rate (sampling_rate), number of samples (num_samples), duration (duration), and channel IDs (channel_ids).

  2. Annotation data representation: SupervisionSet/SupervisionSegment store the annotation information, including start, duration, transcript, language, speaker, and gender.
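A schematic of the two manifest types, written as plain dicts rather than the real lhotse classes (field names mirror the serialized manifests described above; the values are made up for illustration):

```python
# One Recording entry: audio metadata only, no samples.
recording = {
    "id": "rec1_a",
    "sources": [{"type": "file", "channels": [0], "source": "audio/rec1_a.wav"}],
    "sampling_rate": 16000,
    "num_samples": 160000,
    "duration": 10.0,
}

# One SupervisionSegment entry: annotation for a span of that recording.
supervision = {
    "id": "rec1_a-seg0",
    "recording_id": "rec1_a",   # links the annotation to the recording
    "start": 0.5,
    "duration": 4.2,
    "text": "hello world",
    "language": "English",
    "speaker": "spk01",
    "gender": "female",
}

# Internal consistency: duration = num_samples / sampling_rate.
assert recording["num_samples"] / recording["sampling_rate"] == recording["duration"]
```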

Data Processing Flow

lhotse uses the concept of a Cut as a view or pointer into an audio clip; the main types are MonoCut, MixedCut, and PaddingCut, with CutSet as their container. The processing flow is as follows:

  • Load the manifests as a CutSet, which supports equal-length cutting, multi-threaded feature extraction, and padding, and can generate a Sampler and DataLoader for PyTorch

  • Feature extraction supports multiple extractors, such as PyTorch fbank & MFCC, torchaudio, librosa and so on.

  • Feature normalization supports mean-variance normalization (CMVN), global normalization, sample-by-sample normalization and sliding window normalization.
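To make the last bullet concrete, per-utterance mean-variance normalization (CMVN) can be sketched in pure Python. This is a didactic stand-in, not lhotse's implementation, which operates on NumPy/PyTorch feature arrays:

```python
def cmvn(frames):
    """Per-utterance cepstral mean and variance normalization.

    frames: list of feature vectors (lists of floats), one per time step.
    Returns frames normalized to zero mean and unit variance per dimension.
    """
    dims, n = len(frames[0]), len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [
        (sum((f[d] - means[d]) ** 2 for f in frames) / n) ** 0.5 or 1.0  # guard against constant dims
        for d in range(dims)
    ]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]

# Two 2-dimensional frames: each dimension ends up zero-mean, unit-variance.
normalized = cmvn([[1.0, 2.0], [3.0, 4.0]])
```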

Parallel Processing

lhotse supports multi-process parallel processing; sample code is as follows:

```python
from concurrent.futures import ProcessPoolExecutor

from lhotse import CutSet, Fbank, LilcomChunkyWriter

# `cuts` is assumed to be an already-loaded CutSet manifest,
# e.g. cuts = CutSet.from_file("cuts.jsonl.gz")
num_jobs = 8
with ProcessPoolExecutor(num_jobs) as ex:
    cuts: CutSet = cuts.compute_and_store_features(
        extractor=Fbank(),
        storage=LilcomChunkyWriter('feats'),
        executor=ex)
```

PyTorch Integration

lhotse integrates seamlessly with PyTorch:

  • CutSet can be used directly as a Dataset, supporting noise padding, acoustic context padding and dynamic batch size

  • Provides multiple samplers, such as SimpleCutSampler, BucketingSampler, and CutPairsSampler, with support for state recovery and dynamic batch sizes based on total speech duration

  • Batch I/O supports a precomputed-features mode (for slow I/O) and an on-the-fly feature extraction mode (for data augmentation).

Command Line Tools

lhotse’s command line tools are quite useful, including combine, copy, and copy-feats, plus multiple cut operations such as append, decompose, and describe, which simplify data processing:

```bash
lhotse combine
lhotse copy
lhotse copy-feats
lhotse cut append
lhotse cut decompose
lhotse cut describe
lhotse cut export-to-webdataset
lhotse cut mix-by-recording-id
lhotse cut mix-sequential
lhotse cut pad
lhotse cut simple
```

lhotse also provides prepare recipes for many open-source datasets, which makes downloading and processing these standard speech corpora straightforward.

Summary

TTS data preparation is a complex, multi-step process that spans several technical areas, including audio processing, speaker diarization, and speech recognition. With tools such as Emilia-Pipe and a well-established processing flow, raw audio can be transformed into high-quality TTS training datasets, laying the foundation for building natural, fluent speech synthesis systems.

For teams wishing to develop a TTS system, it is recommended that sufficient resources be invested in the data preparation phase, as the quality of the data directly determines the performance of the final model.
