10 changes: 1 addition & 9 deletions fern/customization/speech-configuration.mdx
@@ -39,15 +39,7 @@ This plan defines the parameters for when the assistant begins speaking after th
- **End-of-turn prediction** - predicting when the current speaker is likely to finish their turn.
- **Backchannel prediction** - detecting moments where a listener may provide short verbal acknowledgments like "uh-huh", "yeah", etc. to show engagement, without intending to take over the speaking turn. This is better handled by the assistant's stopSpeakingPlan.

We offer different providers that can be audio-based, text-based, or audio-text based:

**Audio-based providers:**

- **Krisp**: Audio-based model that analyzes prosodic and acoustic features such as changes in intonation, pitch, and rhythm to detect when users finish speaking. Since it's audio-based, it always notifies when the user is done speaking, even for brief acknowledgments. Vapi offers configurable acknowledgement words and a well-configured stop speaking plan to handle this properly.

Configure Krisp with a threshold between 0 and 1 (default 0.5), where 1 means the user definitely stopped speaking and 0 means they're still speaking. Use lower values for snappier conversations and higher values for more conservative detection.

When interacting with an AI agent, users may genuinely want to interrupt to ask a question or shift the conversation, or they might simply be using backchannel cues like "right" or "okay" to signal they're actively listening. The core challenge lies in distinguishing meaningful interruptions from casual acknowledgments. Since the audio-based model signals end-of-turn after each word, configure the stop speaking plan with the right number of words to interrupt, interruption settings, and acknowledgement phrases to handle backchanneling properly.
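The pairing described above can be sketched as a single assistant configuration. The field shapes follow the `startSpeakingPlan`/`stopSpeakingPlan` examples elsewhere in these docs; the specific values and phrase list are illustrative assumptions, not recommendations, and this assumes `numWords` sets the minimum word count required to register an interruption:

```json
{
  "startSpeakingPlan": {
    "smartEndpointingPlan": {
      "provider": "krisp",
      "threshold": 0.5
    }
  },
  "stopSpeakingPlan": {
    "numWords": 2,
    "acknowledgementPhrases": ["uh-huh", "yeah", "right", "okay"]
  }
}
```

Under that assumption, single-word backchannels never reach the interruption threshold, and phrases in `acknowledgementPhrases` are treated as listening cues rather than turn-taking attempts.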
We offer different providers that can be text-based or audio-text based:

**Audio-text based providers:**

42 changes: 8 additions & 34 deletions fern/customization/voice-pipeline-configuration.mdx
@@ -185,9 +185,6 @@ Uses AI models to analyze speech patterns, context, and audio cues to predict wh
- **livekit**: Advanced model trained on conversation data (English only)
- **vapi**: VAPI-trained model (non-English conversations or LiveKit alternative)

**Audio-based providers:**
- **krisp**: Audio-based model analyzing prosodic features (intonation, pitch, rhythm)

**Audio-text based providers:**
- **deepgram-flux**: Deepgram's latest transcriber model with built-in conversational speech recognition. (English only)
- **assembly**: Transcriber with built-in end-of-turn detection (English only)
@@ -201,7 +198,6 @@ Uses AI models to analyze speech patterns, context, and audio cues to predict wh
- **Assembly**: Best used when Assembly is already your transcriber provider for English conversations with integrated end-of-turn detection
- **LiveKit**: English conversations where Deepgram is not the transcriber of choice.
- **Vapi**: Non-English conversations with default stop speaking plan settings
- **Krisp**: Non-English conversations with a robustly configured stop speaking plan

### Deepgram Flux configuration

Expand Down Expand Up @@ -305,32 +301,6 @@ The system continuously analyzes the latest user message and applies the first m
- Scenarios requiring predictable, rule-based endpointing behavior
- Fallback option when other smart endpointing providers aren't suitable

### Krisp threshold configuration

Krisp's audio-based model returns a probability between 0 and 1, where 1 means the user definitely stopped speaking and 0 means they're still speaking.

**Threshold settings:**

- **0.0-0.3:** Very aggressive detection - responds quickly but may interrupt users mid-sentence
- **0.4-0.6:** Balanced detection (default: 0.5) - trades responsiveness against accuracy
- **0.7-1.0:** Conservative detection - waits longer to ensure users have finished speaking

**Configuration example:**

```json
{
"startSpeakingPlan": {
"smartEndpointingPlan": {
"provider": "krisp",
"threshold": 0.5
}
}
}
```

**Important considerations:**
Since Krisp is audio-based, it always notifies when the user is done speaking, even for brief acknowledgments. Configure the stop speaking plan with appropriate `acknowledgementPhrases` and `numWords` settings to handle backchanneling properly.
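A minimal `stopSpeakingPlan` sketch for this (the values and phrase list here are illustrative assumptions, not recommendations):

```json
{
  "stopSpeakingPlan": {
    "numWords": 2,
    "acknowledgementPhrases": ["uh-huh", "yeah", "right", "okay"]
  }
}
```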

### Assembly turn detection

AssemblyAI's turn detection model uses a neural network to detect when someone has finished speaking. The model understands the meaning and flow of speech to make better decisions about when a turn has ended.
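Selecting Assembly's detector is a one-line change in the smart endpointing plan. This is a minimal sketch; any Assembly-specific tuning parameters beyond the provider name are omitted here:

```json
{
  "startSpeakingPlan": {
    "smartEndpointingPlan": {
      "provider": "assembly"
    }
  }
}
```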
@@ -613,15 +583,19 @@ User Interrupts → Assistant Audio Stopped → backoffSeconds Blocks All Output

**Optimized for:** Text-based endpointing with longer timeouts for different speech patterns and international support.

### Audio-based endpointing (Krisp example)
### Non-English smart endpointing (Vapi example)

```json
{
"startSpeakingPlan": {
"waitSeconds": 0.4,
"smartEndpointingPlan": {
"provider": "krisp",
"threshold": 0.5
"provider": "vapi"
},
"transcriptionEndpointingPlan": {
"onPunctuationSeconds": 0.1,
"onNoPunctuationSeconds": 1.5,
"onNumberSeconds": 0.5
}
},
"stopSpeakingPlan": {
@@ -640,7 +614,7 @@
}
```

**Optimized for:** Non-English conversations with robust backchanneling configuration to handle audio-based detection limitations.
**Optimized for:** Non-English conversations with Vapi's heuristic endpointing and robust backchanneling configuration.

### Audio-text based endpointing (Assembly example)
