From fefae92c20c334e2318b34d57da6ba89e8196a0b Mon Sep 17 00:00:00 2001 From: scttbnsn <80784472+scttbnsn@users.noreply.github.com> Date: Thu, 2 Jul 2026 19:52:36 -0400 Subject: [PATCH 1/2] =?UTF-8?q?=F0=9F=93=9D=20docs:=20filter=20prompt-elic?= =?UTF-8?q?ited=20inline=20tags=20before=20TTS?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - 📝 docs(learn): add 'Removing Custom Inline Tags' section to text-to-speech page with PatternPairAggregator + MatchAction.REMOVE snippet and a Tip preferring native extended thinking - 📝 docs(api-reference): cross-link the new section from the Anthropic service Notes Addresses pipecat-ai/pipecat#4901 --- .../server/services/llm/anthropic.mdx | 1 + pipecat/learn/text-to-speech.mdx | 35 +++++++++++++++++++ 2 files changed, 36 insertions(+) diff --git a/api-reference/server/services/llm/anthropic.mdx b/api-reference/server/services/llm/anthropic.mdx index b2c68cc2..7bcfc373 100644 --- a/api-reference/server/services/llm/anthropic.mdx +++ b/api-reference/server/services/llm/anthropic.mdx @@ -201,6 +201,7 @@ await worker.queue_frame( - **Prompt caching**: When `enable_prompt_caching` is enabled, Anthropic caches repeated context to reduce costs. Cache control markers are automatically added to the most recent user messages. This is most effective for conversations with large system prompts or long conversation histories. - **Extended thinking**: Enabling thinking increases response quality for complex tasks but adds latency. When `type="enabled"`, you must provide a `budget_tokens` value (minimum 1024 with current models). Extended thinking is disabled by default. +- **Prompt-elicited `` tags**: If your system prompt asks the model to reason inside inline tags rather than enabling extended thinking, that reasoning is ordinary text and will be spoken by TTS. Prefer the `thinking` parameter; for inline tags you deliberately keep, see [Removing Custom Inline Tags](/pipecat/learn/text-to-speech#removing-custom-inline-tags). - **Custom clients**: You can pass custom Anthropic client instances (e.g., `AsyncAnthropicBedrock` or `AsyncAnthropicVertex`) via the `client` parameter to use Anthropic models through other cloud providers. - **Retry behavior**: When `retry_on_timeout=True`, the first attempt uses the `retry_timeout_secs` timeout. If it times out, a second attempt is made with no timeout limit. - **System instruction precedence**: If both `system_instruction` (from the constructor) and a system message in the context are set, the constructor's `system_instruction` takes precedence and a warning is logged. diff --git a/pipecat/learn/text-to-speech.mdx b/pipecat/learn/text-to-speech.mdx index 1eb580c1..9dd1ba99 100644 --- a/pipecat/learn/text-to-speech.mdx +++ b/pipecat/learn/text-to-speech.mdx @@ -173,6 +173,41 @@ tts = CartesiaTTSService( # llm -> llm_text_processor -> tts ``` +### Removing Custom Inline Tags + +If your system prompt asks the LLM to wrap content in custom inline tags — for example, instructing it to reason inside `...` before answering — that content streams back as ordinary text and will be spoken by the TTS service. + + + If the goal is genuine model reasoning, prefer the provider's native reasoning + feature (e.g. the `thinking` parameter on `AnthropicLLMService`) over asking + for inline tags in the prompt. Structured reasoning is routed to + `LLMThoughtTextFrame`, which TTS ignores, so nothing leaks into speech and no + filtering is needed. + + +For inline tags you deliberately elicit, drop the tags **and** everything between them before they reach TTS using [`PatternPairAggregator`](/api-reference/server/utilities/text/pattern-pair-aggregator) with `MatchAction.REMOVE`: + +```python +from pipecat.processors.aggregators.llm_text_processor import LLMTextProcessor +from pipecat.utils.text.pattern_pair_aggregator import MatchAction, PatternPairAggregator + +pattern_aggregator = PatternPairAggregator() +pattern_aggregator.add_pattern( + type="thinking", + start_pattern="", + end_pattern="", + action=MatchAction.REMOVE, +) + +# Set the aggregator on an LLMTextProcessor +llm_text_processor = LLMTextProcessor(text_aggregator=pattern_aggregator) + +# add the llm_text_processor to your pipeline after the llm and before the tts +# llm -> llm_text_processor -> tts +``` + +Because this filters the text stream itself, it works with any LLM provider and any custom inline tag. The removed content never reaches the TTS service, so nothing is spoken and nothing extra lands in the conversation context. + ### Text Transforms For TTS-specific text preprocessing, you can provide custom text transforms that modify text in a just-in-time manner before sending the text off to the TTS service. This is useful for handling special text segments that need to be altered for better pronunciation or clarity, such as spelling out phone numbers, removing URLs, or expanding abbreviations. These text transforms can be mapped to a specific text aggregation type, like with `skip_aggregator_types`, or applied globally to all text using `'*'` as the type. From b1176850acd2407a74397daf63072b2ec888164c Mon Sep 17 00:00:00 2001 From: scttbnsn <80784472+scttbnsn@users.noreply.github.com> Date: Thu, 2 Jul 2026 20:57:29 -0400 Subject: [PATCH 2/2] =?UTF-8?q?=F0=9F=93=9D=20docs:=20move=20tag-removal?= =?UTF-8?q?=20guidance=20to=20PatternPairAggregator=20reference?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - 🗑️ remove(learn): drop the new text-to-speech section per review - 📝 docs(api-reference): add 'Removing Tagged Content' usage example to pattern-pair-aggregator - 🔄 refactor(api-reference): point the Anthropic note at the new anchor Review feedback from pipecat-ai/docs#976 --- .../server/services/llm/anthropic.mdx | 2 +- .../text/pattern-pair-aggregator.mdx | 25 +++++++++++++ pipecat/learn/text-to-speech.mdx | 35 ------------------- 3 files changed, 26 insertions(+), 36 deletions(-) diff --git a/api-reference/server/services/llm/anthropic.mdx b/api-reference/server/services/llm/anthropic.mdx index 7bcfc373..1ad97dbc 100644 --- a/api-reference/server/services/llm/anthropic.mdx +++ b/api-reference/server/services/llm/anthropic.mdx @@ -201,7 +201,7 @@ await worker.queue_frame( - **Prompt caching**: When `enable_prompt_caching` is enabled, Anthropic caches repeated context to reduce costs. Cache control markers are automatically added to the most recent user messages. This is most effective for conversations with large system prompts or long conversation histories. - **Extended thinking**: Enabling thinking increases response quality for complex tasks but adds latency. When `type="enabled"`, you must provide a `budget_tokens` value (minimum 1024 with current models). Extended thinking is disabled by default. -- **Prompt-elicited `` tags**: If your system prompt asks the model to reason inside inline tags rather than enabling extended thinking, that reasoning is ordinary text and will be spoken by TTS. Prefer the `thinking` parameter; for inline tags you deliberately keep, see [Removing Custom Inline Tags](/pipecat/learn/text-to-speech#removing-custom-inline-tags). +- **Prompt-elicited `` tags**: If your system prompt asks the model to reason inside inline tags rather than enabling extended thinking, that reasoning is ordinary text and will be spoken by TTS. Prefer the `thinking` parameter; for inline tags you deliberately keep, see [Removing Tagged Content](/api-reference/server/utilities/text/pattern-pair-aggregator#removing-tagged-content). - **Custom clients**: You can pass custom Anthropic client instances (e.g., `AsyncAnthropicBedrock` or `AsyncAnthropicVertex`) via the `client` parameter to use Anthropic models through other cloud providers. - **Retry behavior**: When `retry_on_timeout=True`, the first attempt uses the `retry_timeout_secs` timeout. If it times out, a second attempt is made with no timeout limit. - **System instruction precedence**: If both `system_instruction` (from the constructor) and a system message in the context are set, the constructor's `system_instruction` takes precedence and a warning is logged. diff --git a/api-reference/server/utilities/text/pattern-pair-aggregator.mdx b/api-reference/server/utilities/text/pattern-pair-aggregator.mdx index b355cbc3..f30dc713 100644 --- a/api-reference/server/utilities/text/pattern-pair-aggregator.mdx +++ b/api-reference/server/utilities/text/pattern-pair-aggregator.mdx @@ -130,6 +130,31 @@ When a pattern is matched, the handler function receives a `PatternMatch` object ## Usage Examples +### Removing Tagged Content + +To drop content from the text stream entirely, register a pattern with `MatchAction.REMOVE`. The tags and everything between them are removed before reaching downstream processors — nothing is spoken by TTS and nothing lands in the conversation context. This is useful when your prompt elicits inline tags whose content is not meant for the user, such as reasoning tags (e.g., `...`) or annotations intended for other processors: + +```python +from pipecat.processors.aggregators.llm_text_processor import LLMTextProcessor +from pipecat.utils.text.pattern_pair_aggregator import MatchAction, PatternPairAggregator + +pattern_aggregator = PatternPairAggregator() +pattern_aggregator.add_pattern( + type="thinking", + start_pattern="", + end_pattern="", + action=MatchAction.REMOVE, +) + +# Set the aggregator on an LLMTextProcessor +llm_text_processor = LLMTextProcessor(text_aggregator=pattern_aggregator) + +# add the llm_text_processor to your pipeline after the llm and before the tts +# llm -> llm_text_processor -> tts +``` + +Because this filters the text stream itself, it works with any LLM provider and any custom inline tag. + ### Voice Switching in TTS This example demonstrates finding custom `` tags in streaming text to switch voices dynamically in a TTS service like Cartesia. It removes the tags and the content between them, such that the content is treated as if it does not exist. It will not be spoken by the TTS, it will not be added to the context, and it will not be sent to clients via RTVI. Instead, it simply triggers a voice switch side effect. diff --git a/pipecat/learn/text-to-speech.mdx b/pipecat/learn/text-to-speech.mdx index 9dd1ba99..1eb580c1 100644 --- a/pipecat/learn/text-to-speech.mdx +++ b/pipecat/learn/text-to-speech.mdx @@ -173,41 +173,6 @@ tts = CartesiaTTSService( # llm -> llm_text_processor -> tts ``` -### Removing Custom Inline Tags - -If your system prompt asks the LLM to wrap content in custom inline tags — for example, instructing it to reason inside `...` before answering — that content streams back as ordinary text and will be spoken by the TTS service. - - - If the goal is genuine model reasoning, prefer the provider's native reasoning - feature (e.g. the `thinking` parameter on `AnthropicLLMService`) over asking - for inline tags in the prompt. Structured reasoning is routed to - `LLMThoughtTextFrame`, which TTS ignores, so nothing leaks into speech and no - filtering is needed. - - -For inline tags you deliberately elicit, drop the tags **and** everything between them before they reach TTS using [`PatternPairAggregator`](/api-reference/server/utilities/text/pattern-pair-aggregator) with `MatchAction.REMOVE`: - -```python -from pipecat.processors.aggregators.llm_text_processor import LLMTextProcessor -from pipecat.utils.text.pattern_pair_aggregator import MatchAction, PatternPairAggregator - -pattern_aggregator = PatternPairAggregator() -pattern_aggregator.add_pattern( - type="thinking", - start_pattern="", - end_pattern="", - action=MatchAction.REMOVE, -) - -# Set the aggregator on an LLMTextProcessor -llm_text_processor = LLMTextProcessor(text_aggregator=pattern_aggregator) - -# add the llm_text_processor to your pipeline after the llm and before the tts -# llm -> llm_text_processor -> tts -``` - -Because this filters the text stream itself, it works with any LLM provider and any custom inline tag. The removed content never reaches the TTS service, so nothing is spoken and nothing extra lands in the conversation context. - ### Text Transforms For TTS-specific text preprocessing, you can provide custom text transforms that modify text in a just-in-time manner before sending the text off to the TTS service. This is useful for handling special text segments that need to be altered for better pronunciation or clarity, such as spelling out phone numbers, removing URLs, or expanding abbreviations. These text transforms can be mapped to a specific text aggregation type, like with `skip_aggregator_types`, or applied globally to all text using `'*'` as the type.