From 12a1092e27176c4e8288e64740632dc8dc470fda Mon Sep 17 00:00:00 2001
From: Cursor Agent
Date: Tue, 7 Apr 2026 20:37:35 +0000
Subject: [PATCH] fix: remove krisp as a smart endpointing provider from docs

Krisp is not a valid smartEndpointingPlan provider. The API only supports
'vapi', 'livekit', and 'custom-endpointing-model'. The docs incorrectly
listed krisp as an audio-based smart endpointing provider, causing users
to get API validation errors when trying to use it.

Changes:
- Remove krisp from the smart endpointing providers list
- Remove the 'Krisp threshold configuration' section
- Replace the 'Audio-based endpointing (Krisp example)' config example
  with a 'Non-English smart endpointing (Vapi example)'
- Remove krisp from the speech-configuration.mdx providers list

Note: Krisp references for background speech denoising (a separate
feature) are intentionally left unchanged.

Co-authored-by: Sahil Suman
---
 fern/customization/speech-configuration.mdx | 10 +----
 .../voice-pipeline-configuration.mdx        | 42 ++++---------------
 2 files changed, 9 insertions(+), 43 deletions(-)

diff --git a/fern/customization/speech-configuration.mdx b/fern/customization/speech-configuration.mdx
index 53a9bb27f..11a676383 100644
--- a/fern/customization/speech-configuration.mdx
+++ b/fern/customization/speech-configuration.mdx
@@ -39,15 +39,7 @@ This plan defines the parameters for when the assistant begins speaking after th
 - **End-of-turn prediction** - predicting when the current speaker is likely to finish their turn.
 - **Backchannel prediction** - detecting moments where a listener may provide short verbal acknowledgments like "uh-huh", "yeah", etc. to show engagement, without intending to take over the speaking turn. This is better handled by the assistant's stopSpeakingPlan.
-We offer different providers that can be audio-based, text-based, or audio-text based:
-
-**Audio-based providers:**
-
-- **Krisp**: Audio-based model that analyzes prosodic and acoustic features such as changes in intonation, pitch, and rhythm to detect when users finish speaking. Since it's audio-based, it always notifies when the user is done speaking, even for brief acknowledgments. Vapi offers configurable acknowledgement words and a well-configured stop speaking plan to handle this properly.
-
-  Configure Krisp with a threshold between 0 and 1 (default 0.5), where 1 means the user definitely stopped speaking and 0 means they're still speaking. Use lower values for snappier conversations and higher values for more conservative detection.
-
-  When interacting with an AI agent, users may genuinely want to interrupt to ask a question or shift the conversation, or they might simply be using backchannel cues like "right" or "okay" to signal they're actively listening. The core challenge lies in distinguishing meaningful interruptions from casual acknowledgments. Since the audio-based model signals end-of-turn after each word, configure the stop speaking plan with the right number of words to interrupt, interruption settings, and acknowledgement phrases to handle backchanneling properly.
+We offer different providers that can be text-based or audio-text based:
 
 **Audio-text based providers:**

diff --git a/fern/customization/voice-pipeline-configuration.mdx b/fern/customization/voice-pipeline-configuration.mdx
index c2234c7ac..2bddd87e9 100644
--- a/fern/customization/voice-pipeline-configuration.mdx
+++ b/fern/customization/voice-pipeline-configuration.mdx
@@ -185,9 +185,6 @@ Uses AI models to analyze speech patterns, context, and audio cues to predict wh
 - **livekit**: Advanced model trained on conversation data (English only)
 - **vapi**: VAPI-trained model (non-English conversations or LiveKit alternative)
-
-**Audio-based providers:**
-- **krisp**: Audio-based model analyzing prosodic features (intonation, pitch, rhythm)
 
 **Audio-text based providers:**
 - **deepgram-flux**: Deepgram's latest transcriber model with built-in conversational speech recognition. (English only)
 - **assembly**: Transcriber with built-in end-of-turn detection (English only)
@@ -201,7 +198,6 @@ Uses AI models to analyze speech patterns, context, and audio cues to predict wh
 - **Assembly**: Best used when Assembly is already your transcriber provider for English conversations with integrated end-of-turn detection
 - **LiveKit**: English conversations where Deepgram is not the transcriber of choice.
 - **Vapi**: Non-English conversations with default stop speaking plan settings
-- **Krisp**: Non-English conversations with a robustly configured stop speaking plan
 
 ### Deepgram Flux configuration
 
@@ -305,32 +301,6 @@ The system continuously analyzes the latest user message and applies the first m
 - Scenarios requiring predictable, rule-based endpointing behavior
 - Fallback option when other smart endpointing providers aren't suitable
 
-### Krisp threshold configuration
-
-Krisp's audio-base model returns a probability between 0 and 1, where 1 means the user definitely stopped speaking and 0 means they're still speaking.
-
-**Threshold settings:**
-
-- **0.0-0.3:** Very aggressive detection - responds quickly but may interrupt users mid-sentence
-- **0.4-0.6:** Balanced detection (default: 0.5) - good balance between responsiveness and accuracy
-- **0.7-1.0:** Conservative detection - waits longer to ensure users have finished speaking
-
-**Configuration example:**
-
-```json
-{
-  "startSpeakingPlan": {
-    "smartEndpointingPlan": {
-      "provider": "krisp",
-      "threshold": 0.5
-    }
-  }
-}
-```
-
-**Important considerations:**
-Since Krisp is audio-based, it always notifies when the user is done speaking, even for brief acknowledgments. Configure the stop speaking plan with appropriate `acknowledgementPhrases` and `numWords` settings to handle backchanneling properly.
-
 ### Assembly turn detection
 
 AssemblyAI's turn detection model uses a neural network to detect when someone has finished speaking. The model understands the meaning and flow of speech to make better decisions about when a turn has ended.
@@ -613,15 +583,19 @@ User Interrupts → Assistant Audio Stopped → backoffSeconds Blocks All Output
 
 **Optimized for:** Text-based endpointing with longer timeouts for different speech patterns and international support.
 
-### Audio-based endpointing (Krisp example)
+### Non-English smart endpointing (Vapi example)
 
 ```json
 {
   "startSpeakingPlan": {
     "waitSeconds": 0.4,
     "smartEndpointingPlan": {
-      "provider": "krisp",
-      "threshold": 0.5
+      "provider": "vapi"
+    },
+    "transcriptionEndpointingPlan": {
+      "onPunctuationSeconds": 0.1,
+      "onNoPunctuationSeconds": 1.5,
+      "onNumberSeconds": 0.5
     }
   },
   "stopSpeakingPlan": {
@@ -640,7 +614,7 @@ User Interrupts → Assistant Audio Stopped → backoffSeconds Blocks All Output
 }
 ```
 
-**Optimized for:** Non-English conversations with robust backchanneling configuration to handle audio-based detection limitations.
+**Optimized for:** Non-English conversations with Vapi's heuristic endpointing and robust backchanneling configuration.
 ### Audio-text based endpointing (Assembly example)
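
For anyone migrating an assistant off the removed `krisp` provider, a minimal valid configuration using one of the supported providers named in the commit message might look like the sketch below. The `livekit` value is illustrative; per this patch, `vapi` or `custom-endpointing-model` are the other accepted values, and `krisp` will be rejected by API validation.

```json
{
  "startSpeakingPlan": {
    "smartEndpointingPlan": {
      "provider": "livekit"
    }
  }
}
```

Note that `threshold` was a krisp-specific field and has no equivalent here; backchannel handling is instead tuned via the stopSpeakingPlan, as the updated docs describe.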