Documentation
Overview ¶
Package coqui provides a TTS provider backed by a locally running Coqui server. It connects to either a standard Coqui TTS server or a Coqui XTTS v2 server via the server's REST API, and it implements the tts.Provider interface.
Two API modes are supported:
APIModeStandard (default): targets the standard Coqui TTS server (ghcr.io/coqui-ai/tts-cpu). Synthesis is performed via GET /api/tts with URL query parameters; voice catalogue is retrieved from GET /details.
APIModeXTTS: targets the Coqui XTTS v2 API server. Synthesis is performed via POST /tts_to_audio/ with a JSON body; voice catalogue is retrieved from GET /studio_speakers; voice cloning is available via POST /clone_speaker.
Because both servers operate in batch mode (one HTTP call per utterance rather than a streaming socket), SynthesizeStream accumulates incoming text fragments into complete sentences and then dispatches concurrent HTTP requests with a small lookahead buffer to minimise perceived latency.
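The accumulation step can be sketched as follows, assuming the splitting rule documented for SynthesizeStream (a '.', '!' or '?' followed by whitespace or end of input). The function splitSentences is a hypothetical helper written for this illustration; the package's actual splitter is unexported and may differ in detail.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitSentences extracts complete sentences from buf. A sentence ends at
// '.', '!' or '?' when the next rune is whitespace. A terminator at the very
// end of buf only counts when final is true, because more text may follow.
// It returns the complete sentences and the unconsumed remainder.
func splitSentences(buf string, final bool) (sentences []string, rest string) {
	runes := []rune(buf)
	start := 0
	for i, r := range runes {
		if r != '.' && r != '!' && r != '?' {
			continue
		}
		atEnd := i+1 == len(runes)
		if atEnd && !final {
			continue // wait for the next fragment before deciding
		}
		if !atEnd && !unicode.IsSpace(runes[i+1]) {
			continue // e.g. the '.' in "3.14"
		}
		if s := strings.TrimSpace(string(runes[start : i+1])); s != "" {
			sentences = append(sentences, s)
		}
		start = i + 1
	}
	rest = string(runes[start:])
	if final {
		// End of input: flush whatever is left as the last sentence.
		if s := strings.TrimSpace(rest); s != "" {
			sentences = append(sentences, s)
		}
		rest = ""
	}
	return sentences, rest
}

func main() {
	fragments := []string{"Hello wor", "ld. How are", " you? Fine"}
	buf := ""
	for _, f := range fragments {
		var done []string
		done, buf = splitSentences(buf+f, false)
		for _, s := range done {
			fmt.Printf("%q\n", s)
		}
	}
	last, _ := splitSentences(buf, true)
	for _, s := range last {
		fmt.Printf("%q\n", s)
	}
}
```

Holding back a terminator that sits at the end of the buffer avoids splitting "3." from "14" when a number arrives across two fragments.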
Typical usage (standard server):
	p := coqui.New("http://localhost:5002",
		coqui.WithLanguage("en"),
		coqui.WithTimeout(15*time.Second),
		// APIModeStandard is the default; this line is optional:
		coqui.WithAPIMode(coqui.APIModeStandard),
	)
	audio, err := p.SynthesizeStream(ctx, textCh, voiceProfile)
Typical usage (XTTS v2 server):
	p := coqui.New("http://localhost:8002",
		coqui.WithLanguage("en"),
		coqui.WithAPIMode(coqui.APIModeXTTS),
	)
	audio, err := p.SynthesizeStream(ctx, textCh, voiceProfile)
Types ¶
type APIMode ¶
type APIMode string
APIMode selects which Coqui server API the provider will target.
const (
	// APIModeXTTS targets the Coqui XTTS v2 API server (/tts_to_audio/).
	// It supports voice cloning via /clone_speaker and voice listing via
	// /studio_speakers.
	APIModeXTTS APIMode = "xtts"

	// APIModeStandard targets the standard Coqui TTS server (/api/tts).
	// This is the default mode. Voice listing is performed via /details.
	// Voice cloning is not supported in this mode.
	APIModeStandard APIMode = "standard"
)
type Option ¶
type Option func(*Provider)
Option is a functional option for configuring a Coqui Provider.
func WithAPIMode ¶
WithAPIMode sets the server API mode. Use APIModeStandard (default) for the standard Coqui TTS Docker image (ghcr.io/coqui-ai/tts-cpu) or APIModeXTTS for the XTTS v2 API server.
func WithLanguage ¶
WithLanguage sets the BCP-47 language code sent to the TTS server (e.g., "en", "de", "fr"). Defaults to "en" if not set.
func WithTimeout ¶
WithTimeout sets the per-request HTTP timeout for calls to the TTS server. Defaults to 30 s if not set.
type Provider ¶
type Provider struct {
// contains filtered or unexported fields
}
Provider implements tts.Provider backed by a locally running Coqui TTS server. It is safe for concurrent use; multiple SynthesizeStream calls may run in parallel.
func New ¶
New creates a new Coqui Provider that targets the TTS server at serverURL (e.g., "http://localhost:5002"). serverURL must be non-empty. Functional options may override the language, per-request timeout, and API mode. The default API mode is APIModeStandard.
func (*Provider) CloneVoice ¶
CloneVoice creates a new speaker voice by uploading WAV audio samples to the XTTS server via POST /clone_speaker. Each element of samples must be a valid WAV-encoded audio file.
Voice cloning is only supported in APIModeXTTS. In APIModeStandard, this method always returns an error.
Returns a VoiceProfile for the cloned voice or an error if the request fails. A nil or empty samples slice returns an error rather than sending an empty request.
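The preconditions described above (XTTS mode only, at least one sample, each sample WAV-encoded) could be checked before any upload along these lines. This is a sketch under assumptions: validateSamples and looksLikeWAV are hypothetical helpers, the order of the checks is a guess, and the WAV test only inspects the RIFF/WAVE magic bytes rather than fully parsing the file.

```go
package main

import (
	"errors"
	"fmt"
)

// looksLikeWAV reports whether b starts with a RIFF container header whose
// form type is WAVE. It does not validate the rest of the file.
func looksLikeWAV(b []byte) bool {
	return len(b) >= 12 && string(b[0:4]) == "RIFF" && string(b[8:12]) == "WAVE"
}

// validateSamples is a hypothetical pre-flight check for a cloning request.
// mode is the API mode string ("xtts" or "standard").
func validateSamples(mode string, samples [][]byte) error {
	if mode != "xtts" {
		return errors.New("voice cloning is only supported in XTTS mode")
	}
	if len(samples) == 0 {
		return errors.New("at least one WAV sample is required")
	}
	for i, s := range samples {
		if !looksLikeWAV(s) {
			return fmt.Errorf("sample %d is not a RIFF/WAVE file", i)
		}
	}
	return nil
}

func main() {
	wav := append([]byte("RIFF\x00\x00\x00\x00WAVE"), 0, 0)
	fmt.Println(validateSamples("xtts", [][]byte{wav}))
	fmt.Println(validateSamples("xtts", nil))
	fmt.Println(validateSamples("standard", [][]byte{wav}))
	fmt.Println(validateSamples("xtts", [][]byte{[]byte("not audio")}))
}
```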
func (*Provider) ListVoices ¶
ListVoices retrieves the list of available voices from the Coqui server.
In APIModeXTTS, it calls GET /studio_speakers and maps each entry to a VoiceProfile. In APIModeStandard, it calls GET /details and returns one VoiceProfile per speaker for multi-speaker models, or a single VoiceProfile (identified by model name) for single-speaker models.
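The standard-mode mapping rule can be illustrated with a short sketch. Both types below are hypothetical: voiceProfile stands in for tts.VoiceProfile, whose fields are not shown in this documentation, and modelDetails stands in for an already decoded GET /details response, whose JSON layout is not specified here.

```go
package main

import "fmt"

// voiceProfile is a hypothetical stand-in for tts.VoiceProfile.
type voiceProfile struct {
	ID   string
	Name string
}

// modelDetails is a hypothetical decoded form of the /details response.
type modelDetails struct {
	ModelName string
	Speakers  []string
}

// profilesFromDetails returns one profile per speaker for multi-speaker
// models, or a single profile identified by the model name otherwise.
func profilesFromDetails(d modelDetails) []voiceProfile {
	if len(d.Speakers) == 0 {
		return []voiceProfile{{ID: d.ModelName, Name: d.ModelName}}
	}
	profiles := make([]voiceProfile, 0, len(d.Speakers))
	for _, s := range d.Speakers {
		profiles = append(profiles, voiceProfile{ID: s, Name: s})
	}
	return profiles
}

func main() {
	multi := modelDetails{ModelName: "multi-speaker-model", Speakers: []string{"alice", "bob"}}
	single := modelDetails{ModelName: "single-speaker-model"}
	fmt.Println(profilesFromDetails(multi))
	fmt.Println(profilesFromDetails(single))
}
```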
func (*Provider) SynthesizeStream ¶
func (p *Provider) SynthesizeStream(ctx context.Context, text <-chan string, voice tts.VoiceProfile) (<-chan []byte, error)
SynthesizeStream consumes text fragments from the text channel, accumulates them into complete sentences (split on '.', '!', '?' followed by whitespace or EOF), and for each sentence issues an HTTP synthesis request to the Coqui server. WAV responses are stripped of their file headers and the raw PCM is emitted on the returned channel in the original sentence order.
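Stripping the WAV header amounts to locating the RIFF "data" chunk and returning its payload. The sketch below shows one way to do that; pcmFromWAV and buildWAV are hypothetical helpers written for this example, and the package's own implementation may handle edge cases differently. The format values used by buildWAV (mono, 16-bit, 22050 Hz) are arbitrary demo values.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// pcmFromWAV walks the RIFF chunk list and returns the payload of the first
// "data" chunk, which holds the raw PCM samples.
func pcmFromWAV(wav []byte) ([]byte, error) {
	if len(wav) < 12 || string(wav[0:4]) != "RIFF" || string(wav[8:12]) != "WAVE" {
		return nil, errors.New("not a RIFF/WAVE file")
	}
	n := int64(len(wav))
	off := int64(12)
	for off+8 <= n {
		id := string(wav[off : off+4])
		size := int64(binary.LittleEndian.Uint32(wav[off+4 : off+8]))
		body := off + 8
		if id == "data" {
			end := body + size
			if end > n {
				end = n // tolerate a declared size larger than the payload
			}
			return wav[body:end], nil
		}
		off = body + size + size%2 // chunks are padded to an even length
	}
	return nil, errors.New("no data chunk found")
}

// buildWAV wraps pcm in a minimal 44-byte PCM WAV header for the demo.
func buildWAV(pcm []byte) []byte {
	le := binary.LittleEndian
	out := []byte("RIFF")
	out = le.AppendUint32(out, uint32(36+len(pcm)))
	out = append(out, "WAVEfmt "...)
	out = le.AppendUint32(out, 16)    // fmt chunk size
	out = le.AppendUint16(out, 1)     // audio format: PCM
	out = le.AppendUint16(out, 1)     // channels
	out = le.AppendUint32(out, 22050) // sample rate (demo value)
	out = le.AppendUint32(out, 44100) // byte rate
	out = le.AppendUint16(out, 2)     // block align
	out = le.AppendUint16(out, 16)    // bits per sample
	out = append(out, "data"...)
	out = le.AppendUint32(out, uint32(len(pcm)))
	return append(out, pcm...)
}

func main() {
	wav := buildWAV([]byte{1, 2, 3, 4})
	pcm, err := pcmFromWAV(wav)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(wav), len(pcm), pcm)
}
```

Walking the chunk list rather than skipping a fixed 44 bytes keeps the sketch correct when a server inserts extra chunks (such as LIST) before the audio data.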
Up to sentenceLookaheadBuf HTTP requests may be in-flight concurrently to hide network/server latency while preserving output ordering.
The returned channel is closed when all text has been synthesised or when ctx is cancelled. The caller must drain the channel to prevent goroutine leaks.