Speech Synthesis Data Preparation Process

Data preparation is the crucial first step in developing high‑quality text‑to‑speech (TTS) systems. This article briefly summarizes the pipeline that converts raw audio into a final training dataset and surveys some open‑source speech corpora suitable for training speech synthesis models.

Data Preparation Overview

Training a TTS system requires a large amount of high‑quality, structured speech data. To obtain such a dataset, we need a complete data‑processing pipeline that includes audio normalization, speaker separation, segmentation, and transcription, among other steps.

Emilia‑Pipe Processing Pipeline

Emilia‑Pipe is a processing pipeline specifically designed for TTS data preparation, comprising the following key steps:

| Step | Description |
| --- | --- |
| Normalization | Normalize audio to ensure consistent volume and quality |
| Source Separation | Convert long recordings into pure speech without background music (BGM) |
| Speaker Separation | Extract medium‑length single‑speaker speech segments |
| Fine Segmentation Based on VAD | Split speech into 3‑30 s single‑speaker fragments |
| ASR | Obtain text transcriptions for speech segments |
| Filtering | Quality control to produce the final processed dataset |

The source code of the Emilia preprocessing tools is available on GitHub: Amphion/preprocessors/Emilia
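Emilia‑Pipe's own implementation lives in the repository above; purely as an illustration of the VAD‑based fine‑segmentation step, here is a minimal sketch using the open‑source Silero VAD model (the input file name and duration thresholds are hypothetical, and this is not necessarily the exact VAD configuration Emilia‑Pipe uses):

```python
import torch

# Load the Silero VAD model and its helper functions via torch.hub.
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, _, _ = utils

# 'speech.wav' is a hypothetical input file, resampled to 16 kHz.
wav = read_audio('speech.wav', sampling_rate=16000)

# Detect speech regions; thresholds mirror the 3-30 s fragment target above.
speech_timestamps = get_speech_timestamps(
    wav, model,
    sampling_rate=16000,
    min_speech_duration_ms=3000,  # drop fragments shorter than 3 s
    max_speech_duration_s=30,     # split anything longer than 30 s
)
print(speech_timestamps)  # list of {'start': ..., 'end': ...} in samples
```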

Speaker Diarization

Speaker diarization is a key step in TTS data preparation that identifies “who spoke when.” This technology is essential for extracting single‑speaker speech segments from multi‑speaker dialogues, podcasts, and other audio sources.

More detailed information about speaker diarization can be found at Speaker Diarization 3.1
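A minimal sketch using the pyannote speaker‑diarization‑3.1 pipeline referenced above (the audio file name is hypothetical, and "HF_TOKEN" stands in for your Hugging Face access token):

```python
from pyannote.audio import Pipeline

# Load the pretrained diarization pipeline from the Hugging Face Hub.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",
)

diarization = pipeline("audio.wav")

# Iterate over speaker turns: "who spoke when".
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.2f}s - {turn.end:.2f}s: {speaker}")

# Write the result in RTTM format (described below).
with open("audio.rttm", "w") as f:
    diarization.write_rttm(f)
```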

RTTM (Rich Transcription Time Marked) is a commonly used annotation format in speech processing for recording speaker‑turn information. The columns of an RTTM file mean the following:

| Column Name | Description |
| --- | --- |
| Type | Segment type; should always be SPEAKER |
| File ID | File name; the base name of the recording (without extension), e.g., rec1_a |
| Channel ID | Channel ID (starting from 1); should always be 1 |
| Turn Onset | Start time of the turn (seconds from the beginning of the recording) |
| Turn Duration | Duration of the turn (seconds) |
| Orthography Field | Should always be `<NA>` |
| Speaker Type | Should always be `<NA>` |
| Speaker Name | Speaker identifier; must be unique within each file |
| Confidence Score | System confidence score (probability); should always be `<NA>` |
| Signal Lookahead Time | Should always be `<NA>` |
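For example, a single RTTM line describing a 5.67 s turn by speaker spk01, starting 12.34 s into recording rec1_a, looks like this (fields are space‑separated):

```
SPEAKER rec1_a 1 12.34 5.67 <NA> <NA> spk01 <NA> <NA>
```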

Efficiency in Practice

In production environments, using a GPU can dramatically increase processing efficiency. Tests show that a single A800 GPU can batch‑process roughly 3,000 hours of audio data per day.

Open‑Source Chinese Speech Data

The following Chinese open‑source speech corpora are suitable for training speech synthesis models:

| Dataset Name | Duration (hours) | Number of Speakers | Quality |
| --- | --- | --- | --- |
| aidatatang_200zh | 200 | 600 | Medium |
| aishell1 | 180 | 400 | Medium |
| aishell3 | 85 | 218 | Medium |
| primewords | 99 | 296 | Medium |
| thchs30 | 34 | 40 | Medium |
| magicdata | 755 | 1080 | Medium |
| Emilia | 200,000+ | N/A | Low |
| WenetSpeech4TTS | 12,800 | N/A | Low |
| CommonVoice | N/A | N/A | Low |

Open‑Source Multilingual Speech Data

The following English and multilingual open‑source speech corpora are suitable for training speech synthesis models:

| Dataset Name | Duration (hours) | Number of Speakers | Quality / Sample Rate |
| --- | --- | --- | --- |
| LibriTTS‑R | 585 | 2,456 | High |
| Hi‑Fi TTS | 291 | 10 | Very High |
| LibriHeavy | 60,000+ | 7,000+ | 16 kHz |
| MLS English | 44,500 | 5,490 | 16 kHz |
| MLS German | 1,966 | 176 | 16 kHz |
| MLS Dutch | 1,554 | 40 | 16 kHz |
| MLS French | 1,076 | 142 | 16 kHz |
| MLS Spanish | 917 | 86 | 16 kHz |
| MLS Italian | 247 | 65 | 16 kHz |
| MLS Portuguese | 160 | 42 | 16 kHz |
| MLS Polish | 103 | 11 | 16 kHz |

Using lhotse to Process Speech Data

lhotse is a data‑management framework designed specifically for speech processing, offering a complete workflow for handling audio data. Its core concept is manifest‑based data representation:

Data Representation

  1. Audio data representation: Stored in RecordingSet/Recording, containing metadata such as sources, sampling_rate, num_samples, duration, and channel_ids.

  2. Annotation data representation: Stored in SupervisionSet/SupervisionSegment, containing fields like start, duration, transcript, language, speaker, and gender.
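As a minimal sketch (with a hypothetical audio path and labels), the two representations can be constructed like this:

```python
from lhotse import Recording, RecordingSet, SupervisionSegment, SupervisionSet

# Recording.from_file reads sampling_rate, num_samples, duration, etc.
# from the audio header; 'audio/rec1_a.wav' is a hypothetical path.
recording = Recording.from_file('audio/rec1_a.wav')
recordings = RecordingSet.from_recordings([recording])

# One annotated segment within the recording; labels are hypothetical.
segment = SupervisionSegment(
    id='rec1_a-seg0',
    recording_id=recording.id,
    start=0.5,           # seconds from the start of the recording
    duration=3.2,        # seconds
    text='hello world',  # transcript
    language='English',
    speaker='spk01',
)
supervisions = SupervisionSet.from_segments([segment])
```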

Data Processing Workflow

lhotse uses the concept of a Cut as a view or pointer to an audio segment; the concrete types are MonoCut, MixedCut, and PaddingCut, which are collected into a CutSet. A typical workflow builds a CutSet from the recording and supervision manifests, trims the cuts to the annotated regions, and then computes and stores features.
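A minimal sketch of that workflow, assuming the manifests above were saved under hypothetical file names:

```python
from lhotse import CutSet, RecordingSet, SupervisionSet

# Hypothetical manifest paths produced earlier in the pipeline.
recordings = RecordingSet.from_file('recordings.jsonl.gz')
supervisions = SupervisionSet.from_file('supervisions.jsonl.gz')

# Each resulting MonoCut is a view into one recording.
cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)

# Keep only the annotated regions: one cut per supervision segment.
cuts = cuts.trim_to_supervisions()
cuts.to_file('cuts.jsonl.gz')
```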

Parallel Processing

lhotse supports multi‑process parallelism; an example is shown below:

```python
from concurrent.futures import ProcessPoolExecutor

from lhotse import CutSet, Fbank, LilcomChunkyWriter

# Reuse the cuts manifest from the previous step (hypothetical path).
cuts = CutSet.from_file('cuts.jsonl.gz')

num_jobs = 8
with ProcessPoolExecutor(num_jobs) as ex:
    cuts = cuts.compute_and_store_features(
        extractor=Fbank(),                # log-Mel filterbank features
        storage_path='feats',             # where the feature archive is written
        storage_type=LilcomChunkyWriter,  # compressed feature storage
        executor=ex,                      # distribute extraction across processes
    )
```

PyTorch Integration

lhotse integrates with PyTorch through dataset and sampler classes in lhotse.dataset: the sampler groups cuts into mini‑batches capped by total duration, and the dataset class turns each batch of cuts into tensors.
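A minimal sketch, assuming precomputed features and the built‑in K2SpeechRecognitionDataset and SimpleCutSampler classes (the manifest path is hypothetical; other dataset/sampler combinations follow the same pattern):

```python
from torch.utils.data import DataLoader

from lhotse import CutSet
from lhotse.dataset import K2SpeechRecognitionDataset, SimpleCutSampler

cuts = CutSet.from_file('cuts.jsonl.gz')  # hypothetical manifest path

dataset = K2SpeechRecognitionDataset()
# The sampler yields whole CutSet mini-batches, capped by total duration (s).
sampler = SimpleCutSampler(cuts, max_duration=200.0)

# batch_size=None because the sampler already forms the batches.
loader = DataLoader(dataset, sampler=sampler, batch_size=None)

for batch in loader:
    features = batch['inputs']  # (batch, time, feature_dim) tensor
    break
```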

Command‑Line Tools

lhotse also ships handy command‑line utilities, including combine, copy, copy-feats, and a family of cut subcommands (append, decompose, describe, etc.) that simplify the data‑processing pipeline:

```
lhotse combine
lhotse copy
lhotse copy-feats
lhotse cut append
lhotse cut decompose
lhotse cut describe
lhotse cut export-to-webdataset
lhotse cut mix-by-recording-id
lhotse cut mix-sequential
lhotse cut pad
lhotse cut simple
```

lhotse provides prepare functions for many open‑source datasets, making it easy to download and process these standard speech corpora.
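For instance, a minimal sketch using the LibriSpeech recipe (directory names are hypothetical; other corpora follow the same download_*/prepare_* pattern):

```python
from lhotse.recipes import download_librispeech, prepare_librispeech

# Download the small 'mini_librispeech' subset into ./corpora.
download_librispeech('corpora', dataset_parts='mini_librispeech')

# Build RecordingSet/SupervisionSet manifests per split under ./manifests.
manifests = prepare_librispeech('corpora/LibriSpeech', output_dir='manifests')
```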

Summary

TTS data preparation is a multi‑step, complex process that spans audio processing, speaker diarization, and speech recognition. With tools like Emilia‑Pipe and a well‑designed workflow, raw audio can be transformed into high‑quality TTS training datasets, laying a solid foundation for building natural and fluent speech synthesis systems.

For teams aiming to develop TTS systems, it is advisable to allocate sufficient resources to the data‑preparation stage, as data quality directly determines the final model performance.

#potofhoney #speech-synthesis #data-engineering #lhotse