AI Voice Cloning Infrastructure (live)

Voice Pipeline

Voice Pipeline is an open-source, end-to-end system for building custom F5-TTS voice models from rights-cleared source audio. It acquires media, normalizes audio, transcribes with word-level timestamps, extracts speaker-isolated clips into a structured dataset, and finetunes a local model — all without requiring a GPU. Designed as the foundation for owned-voice deployments in client apps and AI assistants.

Last updated: January 5, 2026

Highlights

  • Stage-isolated pipeline (acquire → normalize → transcribe → extract → train) so failures in one step don't poison the dataset or block recovery.
  • Idempotent, resumable runs driven by JSON state files; the runner itself stays stateless.
  • Manual review gate: `--skip-training` builds the dataset and stops, so operators can prune false-positive clips before `--train-only` commits hours of finetuning.
  • Word-level timestamps from `faster-whisper` produce clean speaker-isolated cuts without manual splicing.
  • Validated end-to-end on Apple Silicon CPU — no GPU dependency for either inference or finetuning.
  • Architected to slot into a LiveKit Agents custom TTS node for real-time conversational deployments.

Technology Stack

  • F5-TTS: voice cloning
  • faster-whisper: transcription
  • yt-dlp: media acquisition
  • FFmpeg: audio normalization
  • Python 3.11: pipeline runtime
  • PyTorch: local inference
  • JSON: state-driven orchestration
  • Apple Silicon: CPU compatible
  • LiveKit Agents: integration-ready

Overview

Voice Pipeline is FCT Technologies’ infrastructure for building custom AI voice models. It takes rights-cleared source audio — a CEO’s recorded sessions, a brand voice actor’s catalog, a podcaster’s archive, a client’s existing narration — and produces a finetuned text-to-speech model that the client owns outright. No cloud TTS subscription. No per-character billing. No vendor lock-in.

The pipeline is open source under MIT. The audio you feed in and the model you train are yours.

What this case study proves

Most teams reaching for a custom AI voice end up tied to a cloud provider, paying per generated character indefinitely and shipping their voice data through a third-party API. Voice Pipeline shows the other path: a local-first system that does the same job (acquisition, transcription, dataset curation, model finetuning) without any of that lock-in.

Beyond cost, the architectural choices here matter for production deployments. The pipeline is built around three principles that make it trustworthy for real work: stage isolation so partial failures don’t corrupt the dataset, idempotent state so any run can resume from where the last one stopped, and a deliberate manual review gate before training commits compute. Those are the same principles FCT applies to revenue-touching systems, not just experimental ML pipelines.

Who this is a fit for

This kind of infrastructure is a strong fit for:

  • Product teams shipping branded AI assistants that need a recognizable, owned voice.
  • Client-facing services (IVR, audio guides, in-app narration) where per-call cloud TTS fees compound into real money.
  • Content operations producing consistent voiceover at volume without re-recording or licensing fresh ElevenLabs voices for each project.
  • Compliance-sensitive deployments where audio data cannot leave the customer’s infrastructure.

It is particularly useful when the client already owns the source audio (a recorded spokesperson, a hired voice actor’s catalog, internal recordings) and wants that voice operationalized as a model they control.

Architecture

Voice Pipeline is built as a sequence of strongly isolated stages, each with explicit success and failure semantics:

  1. Acquire: sources are declared in a JSON registry. yt-dlp handles URL-based media; local files drop into a watch folder. Each acquisition writes to a per-run work directory; nothing touches the dataset until later stages succeed.
  2. Normalize: ffmpeg converts to 24 kHz mono WAV with EBU R128 loudness normalization. Consistent levels across the dataset matter more than people expect for finetune quality.
  3. Transcribe: faster-whisper produces segment- and word-level timestamps. The default model is base.en for clean studio audio; swap to large-v3 for noisier sources.
  4. Extract: eligible segments are cut into individual WAV clips with paired text sidecars in F5-TTS’s expected metadata.csv format. Per-source provenance manifests preserve the link back to the originating audio.
  5. Train: F5-TTS finetunes the F5TTS_v1_Base checkpoint against the curated dataset. Training is CPU-compatible on Apple Silicon and CUDA-accelerated when available.
  6. Notify: an optional Telegram digest fires at terminal state with a full pipeline summary.
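
As an illustration of steps 2 and 4, here is a minimal Python sketch. It builds an ffmpeg command for 24 kHz mono output with the `loudnorm` (EBU R128) filter, and turns a run of word timestamps into padded cut points. The function names, the loudness targets (`I=-16`, `TP=-1.5`, `LRA=11`), the 50 ms padding, and the word dictionary shape are assumptions for illustration, not values taken from the repo.

```python
from pathlib import Path


def normalize_cmd(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg command: 24 kHz mono WAV with EBU R128 loudness normalization."""
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-ac", "1",        # downmix to mono
        "-ar", "24000",    # resample to 24 kHz
        "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",  # assumed loudness targets
        str(dst),
    ]


def clip_bounds(words: list[dict], pad: float = 0.05) -> tuple[float, float]:
    """Return (start, end) in seconds spanning the given words, with a little padding.

    Each word is assumed to be a dict with "start" and "end" keys in seconds.
    """
    start = max(0.0, words[0]["start"] - pad)  # never cut before the file starts
    end = words[-1]["end"] + pad
    return round(start, 3), round(end, 3)


if __name__ == "__main__":
    print(" ".join(normalize_cmd(Path("raw/talk.m4a"), Path("work/talk.wav"))))
    words = [
        {"word": "hello", "start": 1.20, "end": 1.55},
        {"word": "world", "start": 1.60, "end": 2.10},
    ]
    print(clip_bounds(words))
```

Returning the command as a list keeps it safe to pass to `subprocess.run` without shell quoting, and keeping `clip_bounds` pure makes the cut logic easy to test without audio files.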

The runner is stateless. All progress lives in source-material.json, processed.json, and run-log.json. Any run can be killed, resumed, or rerun without corrupting state.
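
One way to get this kill-and-resume behavior is sketched below in Python. The schema is hypothetical (a `sources` list with `id` fields in the registry, and a processed map keyed by source id with a `status` field); the repo's actual file layout may differ. State is written to a temporary file and swapped in with `os.replace`, so an interrupted run leaves either the old file or the new one, never a half-written one.

```python
import json
import os
import tempfile
from pathlib import Path


def load_json(path: Path, default):
    """Read a JSON state file, falling back to a default if it does not exist yet."""
    if not path.exists():
        return default
    return json.loads(path.read_text(encoding="utf-8"))


def save_json(path: Path, data) -> None:
    """Write atomically: dump to a temp file in the same directory, then rename."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2)
    os.replace(tmp, path)


def pending_sources(registry: dict, processed: dict) -> list[dict]:
    """Sources not yet marked done; a rerun after a crash picks up from here."""
    return [
        s for s in registry.get("sources", [])
        if processed.get(s["id"], {}).get("status") != "done"
    ]


def mark_done(state_path: Path, source_id: str) -> None:
    """Record a finished source. Calling this twice leaves the same state."""
    processed = load_json(state_path, {})
    processed[source_id] = {"status": "done"}
    save_json(state_path, processed)
```

Because the runner recomputes `pending_sources` from disk on every start, it needs no memory of earlier runs, which is what keeps it stateless.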

The manual review gate

F5-TTS finetune quality is dominated by dataset purity. A single multi-speaker clip or off-target voice in the dataset measurably degrades the resulting model. Voice Pipeline acknowledges this by offering a deliberate two-phase workflow:

  1. --skip-training builds the dataset only. The operator reviews the extracted clips in Finder (Quick Look + spacebar audition), deletes false positives, and inspects the surviving set.
  2. --train-only then runs F5-TTS finetuning against the cleaned dataset, skipping acquisition entirely.
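
The two flags could be wired up roughly as follows. This is a sketch, not the repo's runner: the stage names come from the architecture list, while making the flags mutually exclusive and having `--train-only` skip every dataset-building stage (not only acquisition) are assumptions.

```python
import argparse

# Dataset-building stages, in order.
BUILD_STAGES = ["acquire", "normalize", "transcribe", "extract"]


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Voice pipeline runner (illustrative)")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--skip-training", action="store_true",
                      help="build the dataset, then stop for manual review")
    mode.add_argument("--train-only", action="store_true",
                      help="finetune on the already-reviewed dataset")
    return parser.parse_args(argv)


def plan_stages(args: argparse.Namespace) -> list[str]:
    """Decide which stages a run should execute."""
    if args.train_only:
        return ["train"]
    if args.skip_training:
        return list(BUILD_STAGES)
    return BUILD_STAGES + ["train"]


if __name__ == "__main__":
    print(plan_stages(parse_args()))
```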

This gate is the difference between a shippable voice model and a noisy one. It costs roughly an hour of human review and can save a wasted multi-hour training run.
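
If clips are deleted by hand during review, any metadata rows that pointed at them may need pruning before training. Whether the repo handles this automatically is not stated here, so treat the following as a hypothetical helper. It assumes a pipe-delimited `metadata.csv` with a header row and audio paths relative to the dataset directory; check the F5-TTS dataset preparation docs for the exact layout.

```python
import csv
from pathlib import Path


def prune_metadata(dataset_dir: Path) -> int:
    """Drop metadata rows whose audio file no longer exists. Returns rows removed."""
    meta = dataset_dir / "metadata.csv"
    with meta.open(newline="", encoding="utf-8") as fh:
        rows = list(csv.reader(fh, delimiter="|"))
    if not rows:
        return 0
    header, body = rows[0], rows[1:]
    # Keep only rows whose first column still points at a file on disk.
    kept = [r for r in body if r and (dataset_dir / r[0]).exists()]
    with meta.open("w", newline="", encoding="utf-8") as fh:
        csv.writer(fh, delimiter="|").writerows([header, *kept])
    return len(body) - len(kept)
```

Running it a second time removes nothing, so it is safe to call at the start of every training run.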

Why local-first matters commercially

Cloud TTS providers charge per character generated and route your audio through their infrastructure on every call. For a high-volume product — an IVR handling thousands of customer interactions a day, an in-app assistant speaking on every screen, audio guides serving every visitor — those per-character fees compound into a real line item. And the voice itself is a perpetual rental.

Voice Pipeline produces a model the client owns. Once trained, inference runs on the client’s own server. No usage fees. No vendor switching costs. No data leaves the perimeter. For commercial deployments, that’s the difference between a recurring cost center and a one-time capability investment.

What’s in the repo

The full pipeline (runner script, F5-TTS integration, state schemas, setup documentation, and reference skill docs for each underlying tool) is published at github.com/fcttechnologies/VoicePipeline under the MIT license. The repo includes no training data and no trained checkpoints; those are your responsibility under whatever rights apply to your source audio.

Operational notes

  • Two virtual environments are kept isolated because the pipeline runner and F5-TTS have incompatible dependency trees. Setup documentation handles the split.
  • JSON state files are intentionally human-editable so operators can adjust priorities, retry failures, or annotate processed sources without touching code.
  • No automatic speaker diarization. The pipeline assumes the source material is mostly the target speaker. Multi-speaker sources require either pre-filtered compilations or the manual review pass — by design.
  • Apple Silicon CPU is the validated target. The MPS backend currently produces silent output due to a PyTorch + F5-TTS interaction; CPU works reliably and is the default.
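
The device policy in the last note can be expressed as a small helper. This is a sketch with hypothetical function names, not code from the repo: it prefers CUDA when present and otherwise falls back to CPU, deliberately ignoring MPS even when PyTorch reports it as available.

```python
def pick_device(cuda_available: bool, mps_available: bool = False) -> str:
    """Prefer CUDA, otherwise CPU. MPS is deliberately never chosen."""
    if cuda_available:
        return "cuda"
    # Covers Apple Silicon too, even when mps_available is True.
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch if it is installed; fall back to CPU when it is not."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())


if __name__ == "__main__":
    print(detect_device())
```

Keeping the policy in `pick_device` separate from the PyTorch probe means the rule can be tested on machines without PyTorch installed.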

What comes next

Voice Pipeline is the foundation layer. The next stage of FCT’s voice work plugs trained models into a LiveKit Agents runtime so the voice can speak in real time over WebRTC — voice messages, voice agents, full conversational deployments. The pipeline is architected so any model it produces can drop into that runtime without rework.