TTS Data Preparation Process
Data preparation is a crucial first step in developing a high-quality text-to-speech (TTS) system. This article summarizes the process of converting raw audio into a final training dataset and lists some open-source speech databases suitable for training speech synthesis models.
Overview of Data Preparation
Training a TTS system requires a large amount of high-quality, structured speech data. Obtaining such a dataset calls for a complete processing pipeline, including steps such as audio normalization, speaker separation, segmentation, and transcription.
Emilia-Pipe Process
Emilia-Pipe is a processing pipeline designed for TTS data preparation. It consists of the following key steps:
| Step | Description |
|---|---|
| Normalization | Normalizes audio to ensure consistent volume and quality |
| Source separation | Processes long audio into pure speech with background music (BGM) removed |
| Speaker separation | Extracts medium-length, single-speaker speech data |
| VAD-based fine segmentation | Slices speech into 3-30 second single-speaker segments |
| ASR | Generates text transcriptions of the speech segments |
| Filtering | Applies quality control to obtain the final processed dataset |
The source code for the Emilia preprocessing tool is available on GitHub: Amphion/preprocessors/Emilia
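The stages above can be sketched as a simple chain of transformations. The sketch below is purely illustrative and is not the actual Emilia-Pipe code; every function name in it is hypothetical.

```python
# Illustrative outline of the Emilia-Pipe stages; all names are hypothetical.
def normalize(audio):
    audio["loudness_normalized"] = True  # consistent volume/quality
    return audio

def separate_sources(audio):
    audio["bgm_removed"] = True  # pure speech without background music
    return audio

def diarize(audio):
    # Speaker separation: split into single-speaker regions ("who spoke when").
    return [{"speaker": "spk0", **audio}]

def vad_segment(regions):
    # VAD-based fine segmentation into 3-30 s single-speaker segments.
    return [dict(r, start=0.0, duration=10.0) for r in regions]

def transcribe(segments):
    # ASR: attach a text transcription to each segment.
    for seg in segments:
        seg["text"] = "<asr transcript>"
    return segments

def filter_segments(segments):
    # Quality control: keep only segments within the target length range.
    return [s for s in segments if 3.0 <= s["duration"] <= 30.0]

def emilia_pipe(audio):
    return filter_segments(
        transcribe(vad_segment(diarize(separate_sources(normalize(audio)))))
    )

result = emilia_pipe({"path": "podcast.wav"})
```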
Speaker Separation
Speaker Diarization is a key step in TTS data preparation that identifies “who is talking when”. This technique is essential for extracting a single speaker’s voice segment from audio such as a multi-person conversation or podcast.
More details on speaker separation techniques can be found in Speaker Diarization 3.1.
RTTM (Rich Transcription Time Marked) is an annotation format commonly used in speech processing to record speaker-turn information. The meanings of the columns of an RTTM file are listed below:
| Column Name | Description |
|---|---|
| Type | Type of segment; should always be SPEAKER |
| File ID | File name; base name of the recording (without extension), e.g. rec1_a |
| Channel ID | Channel ID (indexed from 1); should always be 1 |
| Turn Onset | Turn onset time (number of seconds from the start of the recording) |
| Turn Duration | Turn duration (in seconds) |
| Orthography Field | Should always be `<NA>` |
| Speaker Type | Should always be `<NA>` |
| Speaker Name | Speaker name; should be unique within each file |
| Confidence Score | System confidence score (probability); should always be `<NA>` |
| Signal Lookahead Time | Should always be `<NA>` |
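Since a SPEAKER record is a single whitespace-delimited line in the column order above, it can be parsed in a few lines of Python. This is a minimal sketch; real RTTM files may contain other record types that a robust parser would need to handle.

```python
def parse_rttm_line(line):
    """Parse one SPEAKER record of an RTTM file into a dict."""
    (rec_type, file_id, channel, onset, duration,
     _ortho, _spk_type, speaker, _conf, _lookahead) = line.split()
    assert rec_type == "SPEAKER"
    return {
        "file_id": file_id,
        "channel": int(channel),
        "onset": float(onset),
        "duration": float(duration),
        "speaker": speaker,
    }

line = "SPEAKER rec1_a 1 12.35 4.20 <NA> <NA> spk01 <NA> <NA>"
seg = parse_rttm_line(line)
# seg["onset"] → 12.35, seg["speaker"] → "spk01"
```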
Efficiency in practice
In a real production environment, GPUs can dramatically improve processing efficiency: in testing, batch processing on a single A800 GPU handled about 3,000 hours of audio data per day.
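Taking the reported figure at face value, that throughput corresponds to roughly 125 times faster than real time:

```python
# Back-of-the-envelope check of the reported throughput.
hours_processed = 3000        # audio hours processed per day (reported figure)
wall_clock_hours = 24         # one day of wall-clock time
rtf = hours_processed / wall_clock_hours  # speed-up over real time
# rtf == 125.0, i.e. ~125x real time on one A800 GPU
```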
Chinese Open Source Speech Data
The following Mandarin open-source speech datasets are suitable for training speech synthesis models.
| Data name | Number of hours | Number of speakers | Quality |
|---|---|---|---|
| aidatatang_200zh | 200 | 600 | Medium |
| aishell1 | 180 | 400 | Medium |
| aishell3 | 85 | 218 | Medium |
| primewords | 99 | 296 | Medium |
| thchs30 | 34 | 40 | Medium |
| magicdata | 755 | 1080 | Medium |
| Emilia | 200,000+ | N/A | Low |
| WenetSpeech4TTS | 12,800 | N/A | Low |
| CommonVoice | N/A | N/A | Low |
Multilingual Open Source Speech Data
The following English and multilingual open-source speech datasets are suitable for training speech synthesis models.
| Data name | Number of hours | Number of speakers | Quality / Sample rate |
|---|---|---|---|
| LibriTTS-R | 585 | 2456 | High |
| Hi-Fi TTS | 291 | 10 | Very High |
| LibriHeavy | 60000+ | 7000+ | 16kHz |
| MLS English | 44500 | 5490 | 16kHz |
| MLS German | 1966 | 176 | 16kHz |
| MLS Dutch | 1554 | 40 | 16kHz |
| MLS French | 1076 | 142 | 16kHz |
| MLS Spanish | 917 | 86 | 16kHz |
| MLS Italian | 247 | 65 | 16kHz |
| MLS Portuguese | 160 | 42 | 16kHz |
| MLS Polish | 103 | 11 | 16kHz |
Speech data processing with lhotse
lhotse is a data management framework designed specifically for speech processing, providing a complete workflow for handling audio data. Its core concept is a manifest-based data representation:
Data Representation
Audio data representation: RecordingSet/Recording stores audio metadata, including the audio sources (sources), sampling rate (sampling_rate), number of samples (num_samples), duration (duration), and channel IDs (channel_ids).
Annotation data representation: SupervisionSet/SupervisionSegment stores annotation information, including start, duration, transcript (text), language, speaker, and gender.
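The two manifest types can be pictured as plain records. The sketch below mirrors the fields named above using ordinary dictionaries; lhotse itself represents them with Recording and SupervisionSegment objects, and all concrete values here are made up for illustration.

```python
# Illustrative manifest records mirroring lhotse's field names (values are made up).
recording = {
    "id": "rec1_a",
    "sources": [{"type": "file", "channels": [0], "source": "audio/rec1_a.wav"}],
    "sampling_rate": 16000,
    "num_samples": 160000,
    "duration": 10.0,
    "channel_ids": [0],
}

supervision = {
    "id": "rec1_a-seg0",
    "recording_id": "rec1_a",   # links the annotation back to its recording
    "start": 0.5,
    "duration": 4.2,
    "text": "hello world",
    "language": "English",
    "speaker": "spk01",
    "gender": "female",
}

# Consistency check: num_samples should equal sampling_rate * duration.
assert recording["num_samples"] == int(recording["sampling_rate"] * recording["duration"])
```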
Data Processing Flow
lhotse uses the concept of a Cut as a view of (or pointer to) an audio clip; the main types are MonoCut, MixedCut, PaddingCut, and CutSet. The processing flow is as follows:
- Load the manifests as a CutSet, which supports equal-length cutting, multi-threaded feature extraction, and padding, and can generate a Sampler and DataLoader for PyTorch.
- Feature extraction supports multiple extractors, such as lhotse's PyTorch-based fbank and MFCC, torchaudio, librosa, and so on.
- Feature normalization supports cepstral mean and variance normalization (CMVN), global normalization, per-sample normalization, and sliding-window normalization.
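To make the normalization step concrete, here is a minimal NumPy sketch of CMVN applied over the time axis of a feature matrix (not lhotse's implementation, just the underlying idea):

```python
import numpy as np

def cmvn(feats: np.ndarray) -> np.ndarray:
    """Cepstral mean and variance normalization over the time axis.

    feats: (num_frames, num_bins) feature matrix, e.g. fbank features.
    """
    mean = feats.mean(axis=0)
    std = feats.std(axis=0) + 1e-10   # avoid division by zero
    return (feats - mean) / std

# Synthetic features with non-zero mean and non-unit variance.
feats = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(100, 80))
norm = cmvn(feats)
# Each feature dimension now has ~zero mean and ~unit variance.
```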
Parallel Processing
lhotse supports multi-process parallel processing; sample code is as follows:

```python
from concurrent.futures import ProcessPoolExecutor

from lhotse import CutSet, Fbank, LilcomChunkyWriter

# `cuts` is an existing CutSet, e.g. previously loaded from a manifest file.
num_jobs = 8
with ProcessPoolExecutor(num_jobs) as ex:
    cuts: CutSet = cuts.compute_and_store_features(
        extractor=Fbank(),
        storage=LilcomChunkyWriter('feats'),
        executor=ex,
    )
```

PyTorch Integration
lhotse integrates seamlessly with PyTorch:
- A CutSet can be used directly as a Dataset, supporting noise padding, acoustic context padding, and dynamic batch sizes.
- Multiple samplers are provided, such as SimpleCutSampler, BucketingSampler, and CutPairsSampler, with support for state recovery and dynamic batch sizes based on total speech duration.
- Batch I/O supports a pre-computed mode (for slow I/O) and an on-the-fly feature extraction mode (for data augmentation).
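The idea behind duration-based dynamic batching is simple: instead of a fixed number of items per batch, segments are accumulated until their total duration reaches a budget. The sketch below is a simplified illustration of that idea, not lhotse's sampler implementation:

```python
def batch_by_duration(durations, max_duration=30.0):
    """Group segment durations into batches whose total length stays
    within max_duration seconds (simplified duration-based batching)."""
    batches, current, total = [], [], 0.0
    for d in durations:
        if current and total + d > max_duration:
            batches.append(current)       # budget exceeded: start a new batch
            current, total = [], 0.0
        current.append(d)
        total += d
    if current:
        batches.append(current)
    return batches

batches = batch_by_duration([12.0, 10.0, 9.0, 5.0, 25.0], max_duration=30.0)
# → [[12.0, 10.0], [9.0, 5.0], [25.0]]
```

Short segments are packed together while a long segment gets a batch of its own, keeping memory use per batch roughly constant.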
Command Line Tools
lhotse’s command line tools are quite useful, including combine, copy, copy-feats and multiple cut operations such as append, decompose, describe, etc., which simplify the data processing process:
```shell
lhotse combine
lhotse copy
lhotse copy-feats
lhotse cut append
lhotse cut decompose
lhotse cut describe
lhotse cut export-to-webdataset
lhotse cut mix-by-recording-id
lhotse cut mix-sequential
lhotse cut pad
lhotse cut simple
```

lhotse also provides prepare functions for many open-source datasets, which make it easy to download and process these standard speech corpora.
To summarize
TTS data preparation is a multi-step and complex process that involves multiple technical areas such as audio processing, speaker separation and speech recognition. With tools such as Emilia-Pipe and a well-established processing flow, we can transform raw audio into high-quality TTS training datasets, laying the foundation for building a natural and smooth speech synthesis system.
For teams wishing to develop a TTS system, it is recommended that sufficient resources be invested in the data preparation phase, as the quality of the data directly determines the performance of the final model.